cs.CL
[1] In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement
Anudeex Shetty,Aditya Joshi,Salil S. Kanhere
Main category: cs.CL
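TL;DR: This paper studies how "drunk language" (text written under the influence of alcohol) degrades the safety of large language models (LLMs). It proposes three methods for inducing drunk language (persona-based prompting, causal fine-tuning, and reinforcement-based post-training), shows on multiple benchmarks that they increase jailbreaking and privacy leakage, and reveals a correspondence between the models' anthropomorphic behaviour and human intoxicated behaviour, warning of LLM safety risks.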
Details
Motivation: Humans under the influence of alcohol are prone to undesirable behaviour and privacy leaks; the authors ask whether similar behaviour can be induced in LLMs, exposing a new class of safety vulnerability.
Method: Three mechanisms for inducing drunk language are proposed: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. Their effect is tested on 5 LLMs using the JailbreakBench and ConfAIde benchmarks, with error categories analysed through a combination of manual and LLM-based evaluation.
Result: All induction methods raise the jailbreak success rate on JailbreakBench (even with defences in place) and the privacy leakage rate on ConfAIde, and the resulting error patterns resemble human intoxicated behaviour.
Conclusion: Drunk language is a simple and efficient yet dangerous threat vector that can bypass existing safeguards and exposes deeper safety risks arising from LLM anthropomorphism, calling for targeted defences.
Abstract: Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.
[2] MrRoPE: Mixed-radix Rotary Position Embedding
Qingyuan Tian,Wenhong Zhu,Xiaoran Liu,Xiaofeng Wang,Rui Wang
Main category: cs.CL
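TL;DR: This paper proposes MrRoPE (Mixed-radix RoPE), which unifies RoPE-extension methods from a radix-conversion perspective, and derives two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, that substantially improve long-context performance without fine-tuning.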
Details
Motivation: Existing RoPE-extension strategies are diverse but lack a unified theoretical foundation; a general framework that covers them is needed.
Method: A generalized RoPE encoding formulation (MrRoPE) is built from a radix system conversion perspective, interpreting different extension methods as different radix conversion strategies. Two training-free extensions follow: MrRoPE-Uni (uniform radix conversion) and MrRoPE-Pro (progressive radix conversion).
Result: MrRoPE-Pro sustains over 85% recall on the 128K-context Needle-in-a-Haystack test and more than doubles YaRN's accuracy on the Infinite-Bench retrieval and dialogue subsets; theoretical analysis confirms that it raises the upper bound of RoPE's attainable encoding length.
Conclusion: MrRoPE offers a unified theoretical view of RoPE extension, and the proposed training-free methods are efficient and reliable, supporting the practicality and generality of the framework.
Abstract: Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose MrRoPE (Mixed-radix RoPE), a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies, respectively, to achieve 'train short, test long' generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN's accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE's attainable encoding length, which further validates the reliability and utility of our theory and methodology.
[3] Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning
Chenxi Liu,Yanshuo Chen,Ruibo Chen,Tianyi Xiong,Tong Zheng,Heng Huang
Main category: cs.CL
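TL;DR: This paper proposes Self-Debate Reinforcement Learning (SDRL), a training framework that improves LLM reasoning both as a standalone solver and in Multi-Agent Debate (MAD). SDRL samples diverse candidate solutions, constructs a debate context, and jointly optimizes the initial and debate-conditioned responses, so that a single LLM can solve problems on its own and also take part in and benefit from debate. Experiments show gains in both single-model and MAD performance on multiple benchmarks.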
Details
Motivation: Existing reinforcement learning with verifiable rewards (RLVR) methods typically train models only to solve problems in isolation, without preparing them to integrate and exploit the different reasoning paths that arise in Multi-Agent Debate (MAD).
Method: Self-Debate Reinforcement Learning (SDRL) samples multiple candidate solutions for the same prompt, constructs a debate context containing diverse reasoning paths, generates second-turn responses conditioned on that context, and jointly optimizes the initial and debate-conditioned responses.
Result: Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while also strengthening standalone single-model reasoning.
Conclusion: SDRL gives a single LLM two capabilities at once, strong standalone problem solving and the ability to take part in and learn from debate, offering a way to improve single-model and collaborative reasoning together.
Abstract: The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.
[4] MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment
Yupeng Cao,Chengyang He,Yangyang Yu,Ping Wang,K. P. Subbalakshmi
Main category: cs.CL
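TL;DR: This paper proposes MERMAID, a framework that combines retrieval, reasoning, and memory to improve the efficiency and consistency of veracity assessment for online content.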
Details
Motivation: Existing veracity assessment methods treat evidence retrieval as a static, isolated step and cannot effectively manage or reuse evidence across claims, which leads to redundant searches and inefficient verification.
Method: MERMAID, a memory-enhanced multi-agent veracity assessment framework, integrates agent-driven search, structured knowledge representations, and a persistent evidence memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse.
Result: On three fact-checking benchmarks and two claim-verification datasets, using LLMs from the GPT, LLaMA, and Qwen families, MERMAID achieves state-of-the-art performance while improving search efficiency.
Conclusion: Designing retrieval, reasoning, and memory to work together improves the reliability, efficiency, and consistency of veracity assessment and offers a new approach to automated fact-checking.
Abstract: Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.
[5] Context Structure Reshapes the Representational Geometry of Language Models
Eghbal A. Hosseini,Yuxuan Li,Yasaman Bahri,Declan Campbell,Andrew Kyle Lampinen
Main category: cs.CL
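TL;DR: This paper studies whether LLM representations straighten during in-context learning (ICL) and finds that the behaviour depends on task structure: in continual prediction tasks, more context yields straighter representations and better performance, whereas in structured prediction tasks straightening appears only in phases with explicit structure. This suggests that ICL is not a single mechanism but a process in which the model dynamically selects strategies according to the task.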
Details
Motivation: The work connects neural representation straightening with in-context learning (ICL), asking whether straightening occurs during ICL in order to understand ICL's underlying mechanisms.
Method: On Gemma 2 models, the straightness of representational trajectories is measured across a range of in-context tasks (including continual prediction and structured prediction) and related to the model's prediction performance.
Result: Straightening in ICL shows a dichotomy. In continual prediction tasks, more context makes trajectories straighter and improves performance. In structured prediction tasks, straightening appears only in phases with explicit structure (such as repeating a template) and vanishes elsewhere.
Conclusion: ICL is not a single process. LLMs switch strategies dynamically according to task structure, and only some of those strategies produce representational straightening, much like a Swiss Army knife.
Abstract: Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs within a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs' representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent -- it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.
[6] Stability-Aware Prompt Optimization for Clinical Data Abstraction
Arinbjörn Kolbeinsson,Daniel Timbie,Sajjan Narsinghani,Sanjay Hariharan
Main category: cs.CL
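TL;DR: This paper studies the prompt sensitivity of large language models in clinical abstraction tasks and argues that prompt sensitivity and model uncertainty should be considered jointly. Measuring flip rates across several clinical tasks and models, it finds that high accuracy does not imply prompt stability, and it proposes a dual-objective prompt optimization method that improves stability while keeping accuracy high.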
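To make the flip-rate idea concrete, here is a minimal sketch. The abstract does not define the metric precisely, so the definition used here (an item "flips" if any paraphrased prompt changes its label relative to a reference prompt) is an assumption, and `dual_objective` with its `stability_weight` is a hypothetical scalarization, not the paper's optimization loop.

```python
def flip_rate(predictions):
    """Fraction of items whose label changes under any paraphrase.

    predictions[i][j] is the label for item i under prompt variant j;
    variant 0 is treated as the reference prompt (an assumption).
    """
    if not predictions:
        return 0.0
    flipped = sum(
        1 for labels in predictions
        if any(label != labels[0] for label in labels[1:])
    )
    return flipped / len(predictions)


def dual_objective(accuracy, flips, stability_weight=0.5):
    """Hypothetical scalarization: reward accuracy, penalize flips."""
    return accuracy - stability_weight * flips


# Toy labels for four items under three prompt variants (made-up data).
preds = [
    ["yes", "yes", "yes"],
    ["no", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
]
rate = flip_rate(preds)
score = dual_objective(accuracy=0.75, flips=rate)
print(rate, score)
```

A prompt with slightly lower accuracy can still score higher under such an objective if its predictions flip less often, which is the trade-off the entry describes.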
Details
Motivation: Large language models used for clinical abstraction are highly sensitive to prompt wording, yet existing work usually treats prompts as fixed and studies uncertainty in isolation, ignoring the link between the two.
Method: On two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction), with multiple open and proprietary models, prompt sensitivity is quantified via flip rates and related to calibration and selective prediction. A dual-objective prompt optimization loop that jointly targets accuracy and stability is then proposed.
Result: Higher accuracy does not guarantee prompt stability; models can appear well calibrated yet remain fragile to paraphrases; and adding a stability term to prompt optimization reduces flip rates across tasks and models, sometimes at a modest cost in accuracy.
Conclusion: Prompt sensitivity should be an explicit objective when validating clinical LLM systems, optimized alongside metrics such as accuracy and calibration.
Abstract: Large language models used for clinical abstraction are sensitive to prompt wording, yet most work treats prompts as fixed and studies uncertainty in isolation. We argue these should be treated jointly. Across two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction) and multiple open and proprietary models, we measure prompt sensitivity via flip rates and relate it to calibration and selective prediction. We find that higher accuracy does not guarantee prompt stability, and that models can appear well-calibrated yet remain fragile to paraphrases. We propose a dual-objective prompt optimization loop that jointly targets accuracy and stability, showing that explicitly including a stability term reduces flip rates across tasks and models, sometimes at modest accuracy cost. Our results suggest prompt sensitivity should be an explicit objective when validating clinical LLM systems.
[7] SPLA: Block Sparse Plus Linear Attention for Long Context Modeling
Bailin Wang,Dan Friedman,Tao Lei,Chong Wang
Main category: cs.CL
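TL;DR: SPLA is a block-sparse attention framework that selects key blocks for exact attention using a metric derived from second-order Taylor expansions, and compresses the remaining blocks into a compact recurrent state via residual linear attention (RLA) without explicitly accessing unselected blocks, balancing efficiency and performance in long-context modeling.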
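The subtraction-based residual described for this entry can be checked on a toy example. The sketch below assumes an unnormalized linear-attention state of the form S = Σ k vᵀ with an identity feature map; the block contents, the `selected` set, and all function names are hypothetical, and the paper's actual RLA module may differ.

```python
def block_state(keys, values):
    """Linear-attention state for one block: sum of outer products k v^T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k[a] * v[b]
    return state


def add(s, t):
    return [[x + y for x, y in zip(rs, rt)] for rs, rt in zip(s, t)]


def subtract(s, t):
    return [[x - y for x, y in zip(rs, rt)] for rs, rt in zip(s, t)]


def zeros_like(s):
    return [[0.0] * len(row) for row in s]


# Three toy blocks of (keys, values); block 1 is "selected" for exact attention.
blocks = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[2.0, 1.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),
    ([[0.0, 3.0], [1.0, 2.0]], [[2.0, 2.0], [1.0, 1.0]]),
]
selected = {1}
states = [block_state(k, v) for k, v in blocks]

# Global state: in a recurrent setting this is a running sum kept anyway.
global_state = zeros_like(states[0])
for s in states:
    global_state = add(global_state, s)

# State of the selected blocks, which are loaded for exact attention anyway.
selected_state = zeros_like(states[0])
for i in selected:
    selected_state = add(selected_state, states[i])

# Residual via subtraction: the unselected blocks are never read here.
residual = subtract(global_state, selected_state)

# Reference computation that does read the unselected blocks directly.
direct = zeros_like(states[0])
for i, s in enumerate(states):
    if i not in selected:
        direct = add(direct, s)

print(residual == direct)
```

The point of the subtraction is that both operands are already available (the running global state and the blocks chosen for exact attention), so the "long tail" contributes without extra memory reads.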
Details
Motivation: Existing block-sparse attention methods suffer from low block-selection fidelity and from cumulative loss of context caused by discarding unselected blocks.
Method: Sparse Plus Linear Attention (SPLA) has three parts: (1) a selection metric derived from second-order Taylor expansions that picks relevant blocks with high fidelity; (2) a residual linear attention (RLA) module that compresses unselected blocks into a compact recurrent state; (3) a subtraction-based formulation of RLA that computes the residual as the difference between global linear attention and the linear attention of the selected blocks, so unselected blocks are never accessed during inference.
Result: SPLA surpasses dense attention models in continual pretraining and on long-context benchmarks such as RULER, while maintaining competitive general knowledge and reasoning capabilities.
Conclusion: SPLA mitigates the information loss and IO overhead of sparse attention and achieves a better balance between efficiency and performance.
Abstract: Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
[8] SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization
Chaoyue He,Xin Zhou,Di Wang,Hong Xu,Wei Liu,Chunyan Miao
Main category: cs.CL
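TL;DR: This paper proposes SP2DPO, which replaces DPO's single global beta with a semantically informed, per-pair temperature beta_i to better handle heterogeneity and noise in real preference data, achieving competitive and more robust performance on AlpacaEval 2.0.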
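The change from a global beta to a per-pair beta_i can be written down directly from the standard DPO loss. The sketch below is only that: the standard per-pair DPO objective with beta passed per example. How beta_i is derived from semantic-gap annotations is not reproduced here, and the log-probabilities and beta values are made-up numbers.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_w / logp_l: policy log-probabilities of chosen / rejected responses.
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) computed stably as softplus(-z).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def per_pair_dpo_loss(pairs, betas):
    """Mean DPO loss where pair i uses its own temperature betas[i]."""
    losses = [dpo_pair_loss(*pair, beta=b) for pair, b in zip(pairs, betas)]
    return sum(losses) / len(losses)


# Hypothetical batch: (logp_w, logp_l, ref_logp_w, ref_logp_l) per pair.
pairs = [(-1.0, -2.0, -1.5, -1.5), (-3.0, -3.2, -3.1, -3.0)]
betas = [0.5, 0.05]  # e.g. larger for a high-signal pair, smaller for a stylistic one
print(per_pair_dpo_loss(pairs, betas))
```

Because only the scalar passed as `beta` changes, the optimizer loop is otherwise ordinary DPO, which is consistent with the zero training-time overhead the abstract reports.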
Details
Motivation: DPO uses a single global temperature beta, which cannot adapt to the heterogeneity of real preference data, where high-signal and low-signal pairs are mixed with subjective distinctions and label noise.
Method: SP2DPO uses structured semantic-gap annotations (category, magnitude, confidence) produced by teacher models to pre-decide, offline, a per-pair beta_i for every preference pair in UltraFeedback. Training keeps the standard DPO inner loop and only replaces beta with beta_i.
Result: On AlpacaEval 2.0, across four open-weight instruction-tuned models (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves length-controlled win rate on two of the four, without per-model beta sweeps and with zero training-time overhead.
Conclusion: A per-pair, semantically driven beta_i schedule makes better use of the heterogeneity in preference data and improves the robustness and practicality of DPO at no extra training cost.
Abstract: Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
[9] Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading
Jamiu Adekunle Idowu,Ahmed Almasoud
Main category: cs.CL
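TL;DR: This paper compares single-agent and multi-agent LLM architectures for automated essay scoring (AES). The multi-agent system is better at identifying low-quality essays, while the single-agent system performs better on mid-range essays; both struggle with high-quality essays. Few-shot calibration matters most for performance (about a 26% improvement in QWK). Architecture choice should follow the application goal: multi-agent for screening at-risk students, single-agent for general assessment.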
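QWK (quadratic weighted kappa) is the agreement metric quoted in this entry. For readers unfamiliar with it, here is a minimal pure-Python sketch of the standard definition; the example scores and the 1-6 rating range are illustrative assumptions, not data from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Cohen's kappa with quadratic disagreement weights."""
    n_cat = max_rating - min_rating + 1
    if n_cat < 2 or len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[i][j]
            # Expected count if the two raters were independent.
            den += weight * hist_a[i] * hist_b[j] / n
    return 1.0 if den == 0 else 1.0 - num / den


human = [1, 2, 3, 4, 5, 6]   # hypothetical human scores
model = [1, 2, 3, 4, 6, 6]   # hypothetical model scores
print(quadratic_weighted_kappa(human, model, 1, 6))
```

Because the weights grow with the squared distance between scores, a model that is off by one band is penalized far less than one that confuses the weakest and strongest essays.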
Details
Motivation: The study asks how the choice of LLM architecture (single-agent vs. multi-agent) affects automated essay scoring performance across essay quality levels, an effect that is currently poorly understood.
Method: Single-agent and multi-agent LLM graders are built and compared on the ASAP 2.0 corpus. The multi-agent system has three specialist agents (Content, Structure, Language) and a Chairman Agent that applies rubric-aligned rules such as vetoes and score caps. Both are tested in zero-shot and few-shot conditions using GPT-5.1.
Result: The multi-agent system is significantly better at identifying weak essays, while the single-agent system does better on mid-range essays; both struggle with high-quality essays. Few-shot calibration (just two examples per score level) improves QWK by about 26% and is the dominant performance factor.
Conclusion: Architecture choice should match deployment needs: multi-agent AI suits diagnostic screening (for example, identifying at-risk students), while single-agent models offer a more cost-effective option for general assessment. Few-shot prompting affects performance more than architecture does.
Abstract: Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric-aligned logic including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays while the single-agent system performs better on mid-range essays. Both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance -- providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi-agent AI particularly suited for diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.
[10] Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks
Candida M. Greco,Lucio La Cava,Andrea Tagarelli
Main category: cs.CL
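TL;DR: This paper examines whether the values and moral outlooks of LLM-generated synthetic personas match those of real people across cultural contexts, assessing alignment along several dimensions using the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory.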
Details
Motivation: LLMs are widely used to simulate human behaviour, but it remains unclear whether the synthetic personas they generate faithfully reflect worldviews and moral values across cultures; a systematic evaluation is needed.
Method: Interpretable, culturally grounded LLM personas are built from WVS-derived variables and analysed through three lenses: (1) positioning on the Inglehart-Welzel Cultural Map; (2) statistical consistency with WVS population response distributions; (3) moral profiles from a Moral Foundations questionnaire, analysed through a culture-to-morality mapping.
Result: The generated personas broadly align with human empirical data in cultural-map position, group response patterns, and moral tendencies, supporting their use as a controllable simulation tool for cross-cultural value research.
Conclusion: The framework for culturally grounded persona generation and analysis reveals how well LLMs model cross-cultural values and moral diversity, and provides an interpretable, verifiable methodological basis for aligning AI with human values.
Abstract: Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.
[11] Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization
Kanishk Awadhiya
Main category: cs.CL
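TL;DR: This paper proposes Bifocal Attention, which decouples positional encoding into Geometric Eyes (standard RoPE) and Spectral Eyes (learnable harmonic operators) and pairs them with a Spectral Evolution training protocol, addressing RoPE's spectral rigidity in modeling long-range recursive structure and improving generalization to deep recursive reasoning.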
Details
Motivation: Standard RoPE suffers from "Spectral Rigidity": its fixed geometric decay cannot model the long-range periodic structure of recursive logic and algorithmic reasoning, so models struggle to generalize from shallow reasoning to deep recursion.
Method: Bifocal Attention splits positional encoding into Geometric Eyes (standard RoPE, unchanged) and Spectral Eyes (learnable harmonic operators). The Spectral Evolution training protocol lets initially static frequencies evolve via gradient descent into a harmonic basis suited to the algorithmic topology of the task.
Result: The method is reported to bridge the "Structure Gap" and to improve LLM extrapolation on deep recursive and algorithmic reasoning tasks.
Conclusion: Decoupling local geometric and global spectral positional modeling, and optimizing them together, is a key route to stronger algorithmic reasoning in LLMs.
Abstract: Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term "Spectral Rigidity": standard RoPE utilizes a fixed geometric decay ($θ^{-i}$) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a "Structure Gap", where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.
[12] Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking
Imene Kolli,Kai-Robin Lange,Jonas Rieger,Carsten Jentsch
Main category: cs.CL
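TL;DR: This paper proposes an interpretable graph-based framework for analyzing semantic change in diachronic corpora: it builds word-centered semantic networks that combine distributional similarity with lexical substitutability, then uses clustering and cross-time alignment to track sense evolution.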
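The cross-time alignment step can be pictured with a small example. The abstract only says clusters are aligned "via node overlap", so the Jaccard measure, the greedy best-match rule, the `min_overlap` threshold, and the toy clusters for the word "post" below are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Node overlap between two clusters, as |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_clusters(previous, current, min_overlap=0.2):
    """Map each current cluster id to its best-overlapping previous id.

    Clusters with no sufficiently overlapping predecessor map to None,
    which can be read as a newly emerging sense.
    """
    alignment = {}
    for cur_id, cur_nodes in current.items():
        best_id, best_score = None, 0.0
        for prev_id, prev_nodes in previous.items():
            score = jaccard(cur_nodes, prev_nodes)
            if score > best_score:
                best_id, best_score = prev_id, score
        alignment[cur_id] = best_id if best_score >= min_overlap else None
    return alignment


# Hypothetical neighbour clusters for "post" in two time slices.
slice_early = {
    "A": {"letter", "mail", "stamp"},
    "B": {"job", "position", "office"},
}
slice_late = {
    "X": {"mail", "letter", "email"},
    "Y": {"blog", "tweet", "share"},
}
print(align_clusters(slice_early, slice_late))
```

Here cluster X continues the postal sense (overlap 0.5 with A), while Y has no predecessor, mirroring the gradual shift toward digital communication that the entry reports for "post".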
Details
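Motivation: Existing methods rely on predefined sense inventories and lack a transparent, compact way to model how meaning evolves; a framework is needed that requires no manual annotation and can reveal how polysemy changes over time.
Method: For each target word and time slice, a word-centered semantic network is built that combines distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. The peripheral graph is clustered, clusters are aligned across time via node overlap, and change is tracked through cluster composition and normalized cluster mass.
Result: On a corpus of New York Times Magazine articles (1980-2017), graph connectivity reflects polysemy dynamics, and the induced communities capture three contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation (god), and gradual association shifts tied to digital communication (post).
Conclusion: Word-centered semantic graphs offer a compact and interpretable way to analyze sense evolution without predefined sense inventories.
Abstract: We propose an interpretable, graph-based framework for analyzing semantic shift in diachronic corpora. For each target word and time slice, we induce a word-centered semantic network that integrates distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. We identify sense-related structure by clustering the peripheral graph, align clusters across time via node overlap, and track change through cluster composition and normalized cluster mass. In an application study on a corpus of New York Times Magazine articles (1980 - 2017), we show that graph connectivity reflects polysemy dynamics and that the induced communities capture contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation effects (god), and gradual association shifts tied to digital communication (post). Overall, word-centered semantic graphs offer a compact and transparent representation for exploring sense evolution without relying on predefined sense inventories.
[13] Large Language Model Agents Are Not Always Faithful Self-Evolvers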
Weixiang Zhao,Yingshuo Wang,Yichen Zhang,Yang Deng,Yanyan Zhao,Wanxiang Che,Bing Qin,Ting Liu
Main category: cs.CL
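TL;DR: This paper presents the first systematic study of "experience faithfulness" in self-evolving LLM agents. It finds that agents depend on raw experience but often disregard or misinterpret condensed experience, a pattern that holds across models, environments, and configurations and is attributed to semantic limitations, internal processing biases, and interference from pretrained priors.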
Details
Motivation: The study asks whether self-evolving LLM agents truly rely on the experience they are given to guide their behaviour, that is, whether experience is used faithfully.
Method: Controlled causal interventions are applied to both raw and condensed forms of experience, and four representative frameworks are evaluated across 10 LLM backbones and 9 environments.
Result: There is a striking asymmetry: agents consistently depend on raw experience but frequently disregard or misinterpret condensed experience. The gap persists across single- and multi-agent configurations and across backbone scales.
Conclusion: Current self-evolving methods do not integrate experience faithfully or robustly; more faithful and reliable mechanisms for using experience are needed.
Abstract: Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.
[14] Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev
Main category: cs.CL
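TL;DR: This paper proposes a thresholding technique that reduces the marginalization of rare tokens from low-resource languages during training, improving their representations; experiments with a character-level language model show substantially better validation performance on low-resource languages.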
Details
Motivation: Neural language models perform poorly on low-resource languages mainly because scarce training data makes their tokens rare in the training set, and rare tokens are disproportionately affected by marginalization during training, which prevents them from learning effectively.
Method: A thresholding technique limits the excessive marginalization of rare tokens in the negative-sampling process so that they receive a more meaningful alignment signal; the method is validated with a character-level language model.
Result: The method significantly improves performance on low-resource language validation data.
Conclusion: This is the first work to show that negative sampling can improve rare-token representations by controlling marginalization, offering a new approach to modeling low-resource languages.
Abstract: Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
[15] SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization
Jinyang Wu,Changpeng Yang,Yuhao Shen,Fangzhi Xu,Bolin Ni,Chonghua Liao,Yuchen Liu,Hongzhen Wang,Shuai Nie,Shuai Zhang,Haoran Luo,Jiaming Xu
Main category: cs.CL
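TL;DR: This paper proposes the Sweet Spot Learning (SSL) framework, which uses progressively amplified, tiered rewards to guide agents toward the "sweet-spot" region of the solution space, improving training efficiency and generalization.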
Details
Motivation: Existing reinforcement learning methods based on binary rewards cannot distinguish trajectories that reach the same outcome with different quality, overlooking diversity within the solution space.
Method: SSL uses progressively amplified, tiered rewards: visual perception tasks model reward by distance tiers, while complex reasoning tasks reward incremental progress toward promising solutions. It is shown theoretically that SSL preserves the ordering of optimal solutions and improves the gradient signal-to-noise ratio.
Result: Across 12 benchmarks covering GUI perception, short/long-term planning, and complex reasoning, SSL consistently improves over strong baselines, with up to 2.5X sample efficiency gains and effective cross-task transfer.
Conclusion: SSL is a general, effective, and robust approach to agent training and provides a new principle for reward design in reinforcement learning.
Abstract: Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the "sweet spot" concept in tennis (the racket's core region that produces optimal hitting effects), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5X sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.
[16] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards
Yuan-Jay Lü,Chengyu Wang,Lei Shen,Jun Huang,Tong Xu
Main category: cs.CL
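TL;DR: This paper proposes SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments to address the weak agentic capabilities of small language models, substantially improving their performance on math, search, and tool-use tasks.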
Details
Motivation: Small language models struggle to match the agentic capabilities of large models. Existing open-source agentic training data are narrow in task variety and too easy, while real-world APIs lack diversity and are too unstable to support large-scale reinforcement learning.
Method: In SYNTHAGENT, a strong teacher model creates novel tasks and tool ecosystems and rewrites them into intentionally underspecified instructions, forcing the agent to ask the user for missing details. An LLM-based user simulator supplies user-private information, and a mock tool system returns stable tool responses. Task-level rubrics built from required subgoals, user-agent interactions, and forbidden behaviors serve as the reward signal.
Result: Across 14 challenging datasets in math, search, and tool use, small models trained only on the synthetic data substantially outperform larger baselines.
Conclusion: High-quality, diverse synthetic training data combined with controllable simulated environments can compensate for the agentic weaknesses of small models and offers a viable path to low-cost, efficient agent training.
Abstract: Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
[17] One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry
Weisong Zhao,Tong Wang,Zichang Tan,Te Yang,Siran Peng,Haoyuan Zhang,Tianshuo Zhang,Haichao Shi,Meng Meng,Yang Yang,Xiangyu Zhu,Zhen Lei,Xiao-Yu Zhang,Xu Zhou
Main category: cs.CL
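TL;DR: This paper proposes Power-Mean Policy Optimization (PMPO), a generalized policy optimization framework that unifies GRPO (arithmetic mean) and GMPO (geometric mean). A power-mean exponent p controls how concentrated the gradient updates are, and a clip-aware effective sample size (ESS) mechanism chooses p adaptively; PMPO outperforms baseline methods on mathematical reasoning tasks.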
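The way a single exponent p interpolates between the arithmetic and geometric means can be seen from the textbook power mean. The sketch below covers only that general formula; which per-token quantities PMPO aggregates and how the clip-aware ESS rule picks p are not reproduced, and the example `ratios` are made-up positive numbers.

```python
import math


def power_mean(values, p):
    """Power mean M_p = (mean(x_i ** p)) ** (1 / p) for positive x_i.

    p = 1 gives the arithmetic mean; the limit p -> 0 is the geometric mean.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


ratios = [1.0, 4.0]  # hypothetical positive per-token quantities
arithmetic = power_mean(ratios, 1)
geometric = power_mean(ratios, 0)
halfway = power_mean(ratios, 0.5)
print(geometric, halfway, arithmetic)
```

Since the power mean is non-decreasing in p, moving p from 1 toward 0 makes the aggregate progressively more conservative with respect to large outlying values, which matches the arithmetic-to-geometric transition the abstract describes.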
Details
Motivation: GRPO and GMPO both rely on a fixed aggregation geometry that cannot adapt to the evolving and heterogeneous nature of trajectories, limiting the balance between stability and performance.
Method: PMPO parameterizes the aggregation geometry with a power-mean exponent p. The effect of p on the concentration of gradient updates and on token reweighting is analysed theoretically. A Clip-aware ESS mechanism maps a trajectory's clipping fraction to a target ESS and then solves for the p that matches it, making p adaptive.
Result: On multiple mathematical reasoning benchmarks, PMPO outperforms strong baselines such as GRPO and GMPO.
Conclusion: With a tunable power-mean aggregation and ESS-driven adaptive selection of p, PMPO models trajectories of differing stability at a fine grain and provides a more robust and flexible optimization approach for group-based reinforcement learning.
Abstract: Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory clipping fraction to a target ESS. Then, we solve for the specific p to align the trajectory induced ESS with this target one. This allows PMPO to dynamically transition between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
[18] ρ-EOS: Training-free Bidirectional Variable-Length Control for Masked Diffusion LLMs
Jingyi Yang,Yuxian Jiang,Jing Shao
Main category: cs.CL
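TL;DR: This paper proposes ρ-EOS, which monitors the implicit EOS density during denoising to adjust generation length dynamically, enabling training-free, single-stage, bidirectional variable-length generation and substantially improving inference efficiency and token utilization.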
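The density-triggered length adjustment can be sketched as a small decision rule. Everything numeric below is hypothetical: the abstract does not say how the implicit EOS density is estimated or which thresholds and step sizes are used, so the mean EOS probability over masked positions, `low`, `high`, and `step` are stand-ins for illustration only.

```python
def adjust_masked_length(eos_probs, num_masked, low=0.2, high=0.6, step=8):
    """Decide whether to shrink or grow the masked region.

    eos_probs: predicted EOS probability at each currently masked position
    (assumed here to be what the implicit EOS density is computed from).
    Returns the new number of masked positions and the action taken.
    """
    if not eos_probs:
        return num_masked, "keep"
    density = sum(eos_probs) / len(eos_probs)
    if density > high:
        # Too much of the masked space wants to be EOS: contract.
        return max(0, num_masked - step), "contract"
    if density < low:
        # Little sign of termination: the masked space is too short, expand.
        return num_masked + step, "expand"
    return num_masked, "keep"


print(adjust_masked_length([0.9, 0.8, 0.7], num_masked=32))
print(adjust_masked_length([0.05, 0.10], num_masked=32))
print(adjust_masked_length([0.4, 0.4], num_masked=32))
```

Running such a check at each denoising step is what would make the adjustment single-stage and bidirectional, in contrast to a separate length-adjustment phase that can only expand.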
Details
Motivation: Existing masked diffusion large language models (dLLMs) require a predefined, fixed generation length; this inflexibility forces a trade-off between output quality and computational efficiency.
Method: Finds that the implicit EOS density (ρ) is a reliable signal of generation sufficiency and builds the ρ-EOS strategy on it: the density is estimated continuously within a single denoising process, with high density triggering MASK contraction and low density triggering expansion, which yields training-free, bidirectional variable-length generation.
Result: Experiments on mathematics and code benchmarks show that ρ-EOS achieves comparable performance while substantially improving inference efficiency and token utilization.
Conclusion: ρ-EOS is an efficient, flexible variable-length generation scheme that needs no additional training; it removes dLLMs' dependence on a fixed length and offers a new paradigm for diffusion language modeling.
Abstract: Beyond parallel generation and global context modeling, current masked diffusion large language models (dLLMs) suffer from a fundamental limitation: they require a predefined, fixed generation length, which lacks flexibility and forces an inevitable trade-off between output quality and computational efficiency. To address this, we study the denoising dynamics and find that the implicit density ($ρ$) of end-of-sequence ($\texttt{EOS}$) tokens serves as a reliable signal of generation sufficiency. In particular, the evolving implicit $\texttt{EOS}$ density during denoising reveals whether the current masked space is excessive or insufficient, thereby guiding the adjustment direction for generation length. Building on this insight, we propose $\textbf{$ρ$-$\texttt{EOS}$}$, a training-free, single-stage strategy that enables bidirectional variable-length generation for masked dLLMs. Unlike prior two-stage approaches--which require separate length adjustment and iterative mask insertion phases while supporting only unidirectional expansion--$\textbf{$ρ$-$\texttt{EOS}$}$ achieves bidirectional length adjustment within a unified denoising process by continuously estimating the implicit $\texttt{EOS}$ density: excessively high density triggers $\texttt{MASK}$ token contraction, while insufficient density induces expansion. Extensive experiments on mathematics and code benchmarks demonstrate that $\textbf{$ρ$-$\texttt{EOS}$}$ achieves comparable performance while substantially improving inference efficiency and token utilization.
[19] Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation
Shun Qian,Bingquan Liu,Chengjie Sun,Zhen Xu,Baoxun Wang
Main category: cs.CL
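TL;DR: This paper finds that large language models tend to capture target-side keywords early in generation, proposes a plugin called HOLO that exploits this 'Holographic Characteristic' to improve inference efficiency, and validates it on short-text generation tasks.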
Details
Motivation: Few studies have examined the specific traits behind the strong generation capacity of large language models; this paper investigates those generation characteristics in depth.
Method: Empirical analysis shows that language models capture target-side keywords at the very beginning of generation (the Holographic Characteristic). Based on this, the HOLO plugin extracts keywords within a limited number of generation steps and completes the sentence with a parallel lexically constrained text generation method.
Result: In extensive short-text generation experiments on language models of varying architectures and scales, HOLO performs comparably to the baselines on both automatic and human-like evaluation metrics.
Conclusion: The work confirms that language models exhibit the Holographic Characteristic and that the HOLO plugin can exploit it to improve inference efficiency, offering a new perspective for understanding and optimizing LLM generation.
Abstract: The recent advancements in Large Language Models (LLMs) have attracted interest in exploring their in-context learning abilities and chain-of-thought capabilities. However, there are few studies investigating the specific traits related to the powerful generation capacity of LLMs. This paper aims to delve into the generation characteristics exhibited by LLMs. Through our investigation, we have discovered that language models tend to capture target-side keywords at the beginning of the generation process. We name this phenomenon the Holographic Characteristic of language models. For the purpose of exploring this characteristic and further improving the inference efficiency of language models, we propose a plugin called HOLO, which leverages the Holographic Characteristic to extract target-side keywords from language models within a limited number of generation steps and complements the sentence with a parallel lexically constrained text generation method. To verify the effectiveness of HOLO, we conduct massive experiments on language models of varying architectures and scales in the short-text generation scenario. The results demonstrate that HOLO achieves comparable performance to the baselines in terms of both automatic and human-like evaluation metrics and highlight the potential of the Holographic Characteristic.
[20] Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations
Dani Roytburg,Matthew Bozoukov,Matthew Nguyen,Mackenzie Puig-Hall,Narmeen Oozeer
Main category: cs.CL
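TL;DR: This paper shows that the self-preference bias of large language models (LLMs) acting as evaluators is often confounded with their inherent error rate on hard problems. It proposes an 'Evaluator Quality Baseline' to decouple the self-preference signal from noisy outputs and finds that only about half of the originally statistically significant results hold after correction.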
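The comparison behind the Evaluator Quality Baseline can be sketched as two conditional error rates: how often the judge votes for an incorrect response that is its own, versus how often it votes for an incorrect response written by another model. The record layout and function name below are hypothetical, and the sketch omits the significance testing a real analysis would need.

```python
def self_preference_gap(records):
    """Compare the judge's error rate on its own vs. other models' responses.

    Each record is (kind, voted_for_incorrect), where kind is "self" if the
    incorrect candidate was written by the judge and "other" if it was written
    by a different model. Returns (p_self, p_other, p_self - p_other).
    """
    def rate(kind):
        votes = [voted for k, voted in records if k == kind]
        if not votes:
            raise ValueError(f"no records of kind {kind!r}")
        return sum(votes) / len(votes)

    p_self = rate("self")
    p_other = rate("other")
    return p_self, p_other, p_self - p_other
```

A gap near zero suggests the judge is simply error-prone on those queries, whereas a clearly positive gap is the part attributable to self-preference.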
Details
Motivation: Existing work struggles to separate 'narcissistic' bias in LLM evaluation from experimental confounds such as task difficulty, which distorts measurements of self-preference bias.
Method: Identifies and controls for a key methodological confound (LLMs are more likely to return self-preferring verdicts on questions they themselves answered incorrectly) and proposes the Evaluator Quality Baseline, which compares the probability that an LLM incorrectly votes for its own response with the probability that it incorrectly votes for another model's response; the baseline is tested empirically on 37,448 queries.
Result: The proposed baseline can reduce measurement error by 89.6%; after applying it, only 51% of the initial significant findings on the 37,448 queries remain statistically significant; the voting entropy of LLM judges also differs systematically between 'easy' and 'hard' questions.
Conclusion: Self-preference bias must be re-measured while controlling for the evaluator's own capability limits; the work both corrects the existing evaluation paradigm and lays a methodological foundation for systematically identifying and isolating evaluator biases.
Abstract: Recent research has shown that large language models (LLM) favor own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when the judge responds to queries which they completed incorrectly themselves; this would be true regardless of whether one of their responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
[21] SpanNorm: Reconciling Training Stability and Performance in Deep Transformers
Chao Wang,Bei Li,Jiaqi Zhang,Xinyu Liu,Yuchun Fan,Linkun Lyu,Xin Chen,Jingang Wang,Tong Xiao,Peng Pei,Xunliang Cai
Main category: cs.CL
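TL;DR: This paper proposes SpanNorm, a new normalization technique that combines the training stability of PreNorm with the performance of PostNorm. It uses a residual connection spanning the entire Transformer block together with PostNorm-style output normalization, and shows better stability and performance both in theory and in experiments.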
Details
Motivation: Resolve the fundamental trade-off between PreNorm (stable training, but degraded performance in deep models) and PostNorm (strong performance, but unstable training).
Method: Proposes SpanNorm, which establishes a clean residual connection spanning the entire Transformer block to stabilize signal propagation and normalizes the aggregated output in PostNorm style; a principled scaling strategy keeps the signal variance bounded throughout the network.
Result: Theoretical analysis shows that SpanNorm prevents the gradient issues of PostNorm and alleviates the representation collapse of PreNorm; experiments show that it consistently outperforms standard normalization schemes on both dense and MoE models.
Conclusion: SpanNorm reconciles stability and performance, opening a path toward more powerful and more stable Transformer architectures.
Abstract: The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the ``PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while the ``PostNorm'' architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
[22] Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
Zhuochun Li,Yong Zhang,Ming Li,Yuelyu Ji,Yiming Zeng,Ning Cheng,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao,Daqing He
Main category: cs.CL
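TL;DR: This paper proposes a new evaluation paradigm, "Representation-as-a-Judge", which uses the internal representations of small models (rather than their generated output) for efficient, reliable, and interpretable reference-free evaluation. It substantially outperforms prompting-based small-model evaluation and approaches the quality of large language model judges.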
Details
Motivation: The LLM-as-a-Judge paradigm is costly, opaque, and sensitive to prompts, so a more efficient, stable, and interpretable alternative is needed.
Method: Proposes the Semantic Capacity Asymmetry Hypothesis and designs INSPECTOR, a probing-based framework that predicts fine-grained (aspect-level) evaluation scores directly from the hidden states of small language models.
Result: On reasoning benchmarks including GSM8K, MATH, and GPQA, INSPECTOR substantially outperforms prompting-based small-model evaluators and comes close to full LLM judges, while being more efficient, robust, and interpretable.
Conclusion: Evaluation requires far less semantic capacity than generation and can be grounded in the intermediate representations of small models, so Representation-as-a-Judge is a viable and superior alternative to LLM-as-a-Judge.
Abstract: Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
[23] Language Model Circuits Are Sparse in the Neuron Basis
Aryaman Arora,Zhengxuan Wu,Jacob Steinhardt,Sarah Schwettmann
Main category: cs.CL
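TL;DR: This paper finds that MLP neurons are themselves as sparse and interpretable as sparse autoencoder (SAE) features, and builds on this an end-to-end circuit-tracing pipeline that needs no additional training and successfully locates and steers causal circuits on several tasks.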
Details
Motivation: Individual neurons are traditionally considered uninterpretable, so research has turned to methods such as sparse autoencoders (SAEs) to extract more interpretable features; this paper challenges that assumption and asks whether raw MLP neurons are already highly interpretable.
Method: Verifies the sparsity of MLP neurons empirically and designs an end-to-end circuit-tracing pipeline based on gradient attribution that locates and intervenes on causal circuits directly in the MLP neuron basis.
Result: On a subject-verb agreement benchmark, roughly 100 MLP neurons are enough to control model behaviour; on a multi-hop reasoning task, small groups of neurons are found to encode specific reasoning steps (such as the 'city -> state' mapping), and intervening on them changes the model's output.
Conclusion: MLP neurons are themselves highly sparse, highly interpretable units of computation, enabling efficient, low-cost automated interpretability analysis of language models without extra training such as SAEs.
Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
[24] Layer-wise Swapping for Generalizable Multilingual Safety
Hyunseo Shin,Wonseok Hwang
Main category: cs.CL
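TL;DR: This paper proposes a safety-aware layer swapping method that requires no additional training. It transfers the safety alignment of an English safety expert model to low-resource language expert models, improves the transfer through adaptive module selection or blending, and markedly improves safety in the target language while preserving general task performance.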
Details
Motivation: Existing safety datasets are predominantly English, which limits progress on safety alignment for low-resource languages; expert models fine-tuned for those languages are less safe than their high-resource counterparts.
Method: Proposes a safety-aware layer swapping method that transfers some modules of an English safety expert model into low-resource language expert models, adaptively selecting or blending modules according to their degree of specialization.
Result: Performance is comparable to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, and responses on the MultiJail safety benchmark are more aligned and less harmful.
Conclusion: The method improves the safety of low-resource language models without additional training and without harming general language understanding, offering a new approach to multilingual safety alignment.
Abstract: Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English centric, limiting progress in multilingual safety alignment. As a result, low resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high resource counterparts. In this work, we propose a safety aware layer swapping method that transfers safety alignment from an English safety expert to low resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
[25] Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models
Jingxuan Wu,Zhenglin Wan,Xingrui Yu,Yuzhe Yang,Yiqiao Huang,Ivor Tsang,Yang You
Main category: cs.CL
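TL;DR: This paper proposes Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that exploits the temporal division of semantic labor in diffusion language models (Diffusion-LMs). It injects perturbations in the early denoising steps to encourage semantic branching and gradually reduces them later to preserve fluency and instruction adherence, improving generation diversity without harming quality.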
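One simple way to picture a time-annealed perturbation is a scale that starts high and decays to zero partway through denoising. The linear schedule below is purely illustrative: the entry does not state the paper's schedule, and `initial_scale` and `cutoff` are hypothetical parameters.

```python
def perturbation_scale(step, total_steps, initial_scale=1.0, cutoff=0.5):
    """Linearly annealed perturbation strength for a denoising step.

    The scale starts at `initial_scale` on step 0 and reaches zero once
    `cutoff` of the denoising steps have elapsed, so later steps are left
    unperturbed. The linear shape is an assumption made for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 < cutoff <= 1.0:
        raise ValueError("cutoff must be in (0, 1]")
    progress = step / (cutoff * total_steps)
    return initial_scale * max(0.0, 1.0 - progress)
```

With `cutoff=0.5`, only the first half of the steps receive any perturbation, matching the intuition that early steps set global semantics and late steps refine wording.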
Details
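Motivation: Diffusion language models introduce an explicit temporal dimension, but it remains unclear how to use this structure to control generation diversity and explore multiple valid semantic or reasoning paths.
Method: Building on the temporal division of labor in diffusion language models (early steps determine the global semantic structure, later steps focus on local lexical refinement), proposes the training-free inference strategy TAPS: perturbations are applied early in the diffusion process and progressively decayed, encouraging semantic branching while preserving fluency and instruction adherence later on; it is compatible with non-autoregressive and semi-autoregressive diffusion backbones (such as LLaDA and TraDo).
Result: TAPS consistently improves output diversity on creative writing and reasoning benchmarks without sacrificing generation quality.
Conclusion: Diffusion language models have an exploitable temporal division of labor; by modulating perturbations over time, TAPS achieves high-quality, highly diverse text generation and offers a new approach to controllable generation with diffusion language models.
Abstract: Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
[26] DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning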
Abhishek Tyagi,Yunuo Cen,Shrey Dhorajiya,Bharadwaj Veeravalli,Xuanyao Fong
Main category: cs.CL
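TL;DR: This paper proposes DART (Dynamic Attention-Guided Runtime Tracing), a training-free, lightweight, context-aware dynamic pruning technique that monitors shifts in attention score distributions to adjust FFN parameter masks on the fly, substantially reducing compute and memory overhead while preserving model performance.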
Details
Motivation: Existing LLM pruning methods depend on dataset-specific calibration and incur high computational overhead; they are also mostly static and cannot follow how the set of knowledge neurons evolves with context during autoregressive generation.
Method: DART performs context-aware runtime tracing based on shifts in attention score distributions and dynamically generates neuron-level masks, achieving training-free, low-overhead FFN sparsification.
Result: Across 10 benchmarks, DART improves accuracy by up to 14.5% over the dynamic baseline at 70% FFN sparsity; on summarization tasks its ROUGE-L is up to 3x better than static pruning and comparable to the dense model; it needs less than 10 MB of memory and 0.1% FLOPs overhead.
Conclusion: DART adapts effectively to diverse semantic contexts and preserves model capabilities on both general and domain-specific tasks, supporting the effectiveness and practicality of dynamic, attention-driven lightweight pruning.
Abstract: Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART (i.e., Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE-L scores with respect to static-masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts, preserves model capabilities across both general and domain-specific tasks while running at less than 10MBs of memory for LLAMA-3.1-8B(16GBs) with 0.1% FLOPs overhead. The code is available at https://github.com/seeder-research/DART.
[27] NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models
Haisong Gong,Zhibo Liu,Qiang Liu,Shu Wu,Liang Wang
Main category: cs.CL
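TL;DR: This paper proposes NAG (Native Architecture for Graphs), a unified framework that integrates graph processing natively into the language model instead of relying on a separate external GNN. By adapting the self-attention mechanism and positional encoding, it lets the LM understand textual semantics and graph topology together; two efficient implementations, NAG-Zero and NAG-LoRA, are validated on a range of graph tasks for effectiveness and simplicity.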
Details
Motivation: Existing methods keep graph neural networks (GNNs) and language models (LMs) separate, which decouples structural encoding from semantic processing and forces an implicit alignment between abstract graph tokens and concrete textual elements; this is inefficient and conceptually disjointed.
Method: Proposes the NAG framework, which internalizes graph processing within the LM's native manifold: (1) the self-attention mechanism is repurposed to model topological dependencies; (2) positional IDs are recalibrated to ensure structural equivalence; (3) two implementations are provided, NAG-Zero (which leaves the base model's capabilities untouched) and NAG-LoRA (lightweight structural adaptation).
Result: Experiments on diverse graph tasks show that NAG achieves robust graph comprehension without an external encoder, combining simplicity with strong generalization.
Conclusion: Joint modeling of graphs and text does not require an external GNN; natively adapting the LM's internal mechanisms (attention and positional encoding) is enough to unify semantic and structural understanding, and NAG offers a more natural, more coherent paradigm for text-graph modeling.
Abstract: Prevailing methods for integrating graphs into Language Models (LMs) typically rely on a segregated architecture: external Graph Neural Networks (GNNs) encode structural topology, while LMs process textual semantics. We argue this approach is suboptimal for text-graphs: it creates a conceptually disjointed interaction paradigm. By segregating structural encoding from semantic processing, these systems must perform a complex implicit alignment between abstract graph tokens and concrete textual elements. Challenging the necessity of external encoders, we propose NAG (Native Architecture for Graphs), a unified framework that internalizes graph processing within the LM's native manifold. Instead of bridging disparate embedding spaces, NAG repurposes the self-attention mechanism to enforce topological dependencies and recalibrates positional IDs to ensure structural equivalence. This allows the model to harness its intrinsic linguistic capability to simultaneously comprehend node and edge content alongside structural topology. We introduce two efficient implementations: NAG-Zero for absolute preservation of the base model's linguistic capabilities, and NAG-LoRA for enhanced structural adaptation. Experiments across diverse graph tasks validate that NAG achieves robust graph comprehension without the overhead of external encoders, offering a simpler, more coherent paradigm for text-graph modeling.
[28] TSLM: Tree-Structured Language Modeling for Divergent Thinking
Doyoung Kim,Jaehyeok Doo,Minjoon Seo
Main category: cs.CL
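TL;DR: This paper proposes Tree-Structured Language Modeling (TSLM), which encodes branching structure with special tokens so that the model can generate and selectively expand multiple search paths within a single generation process, improving both the robustness and the efficiency of reasoning.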
Details
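Motivation: Existing language models generate reasoning sequentially and cannot decouple irrelevant exploration paths, which leads to redundant computation and inefficient search.
Method: Introduces Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure and trains with supervision on complete search trees that include both successful and failed attempts, so that the model internalizes systematic exploration.
Result: TSLM maintains robust performance while markedly improving inference efficiency, avoiding the multiple independent forward passes required by external search methods; the results confirm that supervised learning on complete tree-structured traces efficiently gives models systematic exploration capabilities.
Conclusion: TSLM offers a new paradigm for inference-time scaling and shows that supervised learning on tree-structured reasoning traces is an effective way to strengthen systematic reasoning in language models.
Abstract: Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves robust performance and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.
[29] FNF: Functional Network Fingerprint for Large Language Models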
Yiheng Liu,Junhao Ning,Sichen Xia,Haiyang Sun,Yang Yang,Hanyang Chi,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu
Main category: cs.CL
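TL;DR: This paper proposes FNF, a training-free, sample-efficient fingerprinting method for LLMs that detects whether models derive from the same original model by comparing the consistency of their functional network activity; the method is robust and practical.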
Details
Motivation: Large language models are costly to develop and commercially valuable, so it is important to prevent unauthorized copying of open-source models and to protect developers' intellectual property.
Method: Proposes the Functional Network Fingerprint (FNF), which compares the consistency of neuronal activity within models' functional networks; it requires no training and only a few input samples.
Result: Models with a common origin (even when they differ in scale or architecture) show highly consistent functional network activity, whereas independently trained models do not; FNF remains robust to fine-tuning, pruning, parameter permutation, and comparisons across architectures and dimensionalities.
Conclusion: FNF is a simple, non-invasive, efficient, and practical tool for protecting LLM intellectual property that preserves model utility while detecting derivation reliably.
Abstract: The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
[30] Models Know Models Best: Evaluation via Model-Preferred Formats
Joonhak Lee,Sungmok Jung,Jongyeon Park,Jaejin Lee
Main category: cs.CL
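TL;DR: This paper finds that large language models perform markedly differently under symbol-based and cloze-style multiple-choice evaluation, and attributes the gap to task characteristics. It proposes a dynamic format-alignment strategy driven by the model's latent preference signals, which substantially improves zero-shot accuracy.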
Details
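Motivation: Address the inconsistent performance of large language models across multiple-choice evaluation formats (symbol-based vs. cloze-style), identify its cause, and improve the accuracy of zero-shot evaluation.
Method: Proposes a dynamic format-alignment strategy: a lightweight classifier is trained on latent preference signals produced by the model itself (rather than hand-designed heuristics) to select the best evaluation format for each question automatically.
Result: Achieves substantial and consistent zero-shot accuracy gains on multiple reasoning and knowledge benchmarks, giving a truer picture of the models' latent capabilities.
Conclusion: The choice of multiple-choice evaluation format should fit the task (natural language continuation suits likelihood scoring, explicit comparison suits symbol-based selection); dynamic format selection based on model signals outperforms hand-designed heuristics and is an effective way to reveal a model's true capabilities.
Abstract: Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
[31] MM-THEBench: Do Reasoning MLLMs Think Reasonably?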
Zhidian Huang,Zijun Yao,Ji Qi,Shangqing Tu,Junxian Ma,Jinxin Liu,Weichuan Liu,Xiaoyin Che,Lei Hou,Juanzi Li
Main category: cs.CL
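TL;DR: This paper proposes MM-THEBench, a comprehensive benchmark dedicated to evaluating hallucinations in the chain-of-thought (CoT) reasoning of multimodal large language models (MLLMs). It includes a fine-grained cognitive taxonomy, data with verified reasoning annotations, and a multi-level automated evaluation framework.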
Details
Motivation: Existing benchmarks ignore hallucinations in the internal thinking process of reasoning MLLMs and cannot measure hallucinations that arise during thinking; moreover, self-reflective reasoning improves robustness but may introduce new hallucinations, and subtle perceptual errors still lead to incorrect or coincidentally correct answers.
Method: Builds the MM-THEBench benchmark, comprising a fine-grained hallucination taxonomy grounded in cognitive dimensions, a diverse dataset with human-verified reasoning chains, and a framework supporting multi-level automated evaluation.
Result: Extensive experiments on mainstream reasoning MLLMs reveal how the thinking process affects hallucination and reasoning capability in multimodal tasks.
Conclusion: MM-THEBench provides a new paradigm for systematically evaluating hallucinations in the intermediate chains of thought of reasoning MLLMs, supporting more reliable and interpretable multimodal reasoning.
Abstract: Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.
[32] AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction
Yifei Li,Richong Zhang,Wanyu Tu,Zhijie Nie,Haokun Luo,Chuantao Yin,Pengchong Li
Main category: cs.CL
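TL;DR: This paper introduces a new legal AI task, APPELLATE REVIEW, which aims to detect, classify, and correct errors in issued judgments. It builds the AR-BENCH dataset, comprising 8,700 finely annotated decisions and 34,617 supplementary corpora, and by evaluating 14 large language models reveals critical limitations in current models' ability to identify legal application errors.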
Details
Motivation: Legal judgments may contain errors because cases are complex and legal concepts are abstract, while existing appellate review mechanisms face efficiency pressure from surging case volumes. Current legal AI research focuses on judgment prediction and legal document generation, but judgment review differs fundamentally in objective and paradigm: it is anomaly detection rather than prediction or generation, which leaves a research gap.
Method: Introduces the new APPELLATE REVIEW task, builds the large-scale, finely annotated AR-BENCH dataset (8,700 decisions and 34,617 supplementary corpora), and systematically evaluates 14 large language models.
Result: The evaluation shows that existing large language models have clear deficiencies in identifying legal application errors, providing empirical evidence for future improvements.
Conclusion: Appellate review is a new legal AI task of practical value, and AR-BENCH is the first benchmark dataset for it; current large models perform poorly on the task, and their diagnostic reasoning and reliability need targeted improvement.
Abstract: Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models' diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models' ability to identify legal application errors, providing empirical evidence for future improvements.
[33] RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation
Jiaxuan Luo,Siqi Ouyang,Lei Li
Main category: cs.CL
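TL;DR: This paper proposes a retrieval-augmented simultaneous speech translation method (RASST) that uses a lightweight cross-modal retriever with sliding-window retrieval to give a speech large language model terminology hints, and synthesizes training data to improve terminology translation accuracy and overall translation quality.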
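The sliding-window part of this pipeline can be sketched independently of any retriever: given the number of speech frames received so far, enumerate overlapping windows that a retriever could then score against a terminology list. The helper name, the frame-based units, and the `window`/`hop` parameters are assumptions for illustration, not details from the paper.

```python
def sliding_windows(n_frames, window, hop):
    """Return (start, end) frame spans covering n_frames with overlap.

    Consecutive windows start `hop` frames apart; the final window is
    truncated at n_frames so the tail of the audio is always covered.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    if n_frames <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_frames)
        spans.append((start, end))
        if end >= n_frames:
            break
        start += hop
    return spans
```

Choosing `hop` smaller than `window` makes neighbouring spans overlap, which reduces the chance that a term straddling a chunk boundary is missed.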
Details
Motivation: Existing simultaneous speech translation (SST) models translate rare words and domain-specific terminology poorly, and bringing retrieval augmentation to SST raises challenges such as real-time cross-modal retrieval and deciding when to use retrieved terms during incremental generation.
Method: Proposes the RASST framework: it trains a lightweight speech-text retriever, performs efficient chunkwise retrieval with a sliding window, and provides terminology hints to the Speech LLM; it also synthesizes training data that teaches the model to use retrieved results precisely.
Result: On three language directions of the ACL 60/60 dev set, terminology translation accuracy improves by up to 16% and overall BLEU by up to 3 points; ablations confirm that each component contributes.
Conclusion: RASST addresses the terminology translation problem in SST and demonstrates that retrieval augmentation is feasible and effective for real-time cross-modal speech-to-text translation.
Abstract: Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
[34] Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs
Corentin Kervadec,Iuliia Lysova,Marco Baroni,Gemma Boleda
Main category: cs.CL
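TL;DR: This paper proposes a computation density estimator based on mechanistic interpretability and finds that computation in large language models (LLMs) is generally dense but varies dynamically with the input, and that different models show similar density patterns on the same inputs.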
Details
Motivation: Prior work argues that LLMs contain many redundant parameters, but there has been no systematic way to quantify how computation is actually distributed across parameters; the authors aim to characterize the density of computation inside LLMs and how it changes.
Method: Designs a computation density estimator based on mechanistic interpretability, validates it experimentally on several LLMs, and analyzes how factors such as the input, token rarity, and context length affect density.
Result: The experiments find that (1) LLM processing generally involves dense rather than sparse computation; (2) computation density changes dynamically with the input; (3) per-input density is significantly correlated across LLMs; (4) predicting rarer tokens requires higher density, while longer contexts often lower it.
Conclusion: LLM computation has a pronounced dynamic density profile, which challenges the view of LLMs as simple symbolic processing systems; the density estimator contributes to a better understanding of how LLMs work internally.
Abstract: Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that it is possible to prune a significant portion of the parameters, while only marginally impacting performance. This suggests that the computation is not uniformly distributed across the parameters. We introduce here a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what has been often assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low or high density. Investigating the factors influencing density, we observe that predicting rarer tokens requires higher density, and increasing context length often decreases the density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.
[35] When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training
Felicia Körner,Max Müller-Eberstein,Anna Korhonen,Barbara Plank
Main category: cs.CL
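TL;DR: Using activation patching, a causal interpretability method, this paper studies how language-agnostic concept spaces evolve during the pretraining of EuroLLM. Shared concept spaces emerge early and keep refining, but the degree of alignment with them differs across languages; careful manual analysis shows that some apparent gains in translation quality are shifts in behavior (such as sense selection for polysemous words or the handling of cross-lingual homographs) rather than genuine improvements in translation ability.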
Details
Motivation: Prior studies note that large models have shared concept spaces that support cross-lingual transfer, but they mostly lack causal analysis and in-depth error analysis and look only at the final model, so they do not reveal how these spaces form during training. Method: Use activation patching, a causal interpretability method, to track how cross-lingual concept representations evolve during EuroLLM pretraining; isolate cross-lingual concept representations and inject them into translation prompts to test their language-independent effect on the translation output. Result: (1) Language-agnostic concept spaces emerge early in pretraining and keep refining; (2) how well the model aligns with the shared space depends on the language; (3) some gains in translation quality come from behavioral shifts (such as sense selection for polysemous words, or translating cross-lingual homographs instead of copying them) rather than from better translation ability. Conclusion: Cross-lingual alignment is dynamic and language-specific; relying on automatic metrics alone can misjudge progress in translation ability; causal interpretability methods are valuable for fine-grained diagnosis of multilingual model behavior but need to be paired with manual analysis to avoid attribution errors. Abstract: Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important -- especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis or focus on the final model only, leaving open how these spaces emerge during training. We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior -- like selecting senses for polysemous words or translating instead of copying cross-lingual homographs -- rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.
[36] From Labels to Facets: Building a Taxonomically Enriched Turkish Learner Corpus
Elif Sayar,Tolgahan Türker,Anna Golynskaia Knezhevich,Bihter Dereli,Ayşe Demirhas,Lionel Nicolas,Gülşen Eryiğit
Main category: cs.CL
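TL;DR: This paper proposes a semi-automated annotation methodology based on a multi-dimensional faceted taxonomy to enrich error annotation in learner corpora, improving fine-grained linguistic analysis and reaching a facet-level accuracy of 95.86% on a Turkish learner corpus.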
Details
Motivation: Most learner corpora use flat, holistic label inventories, which make linguistically deep annotation and fine-grained analysis of error causes difficult. Method: Build a semi-automated annotation methodology on a faceted taxonomy, and design and implement an annotation extension tool for Turkish that automatically infers multi-dimensional linguistic and metadata information (the facets of the taxonomy) from flat annotations. Result: A facet-level annotation accuracy of 95.86% on a Turkish learner corpus; the first collaboratively annotated Turkish learner corpus enriched under the new taxonomy, released together with an annotation guideline and the annotation extension tool. Conclusion: The method makes learner error annotation more standardized, interpretable, and flexible to analyze, and offers a transferable template for systematically enriching existing error-annotated corpora. Abstract: In terms of annotation structure, most learner corpora rely on holistic flat label inventories which, even when extensive, do not explicitly separate multiple linguistic dimensions. This makes linguistically deep annotation difficult and complicates fine-grained analyses aimed at understanding why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy, and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, implemented based on the proposed extension framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy to provide richer learner-specific context. It was systematically evaluated and yielded promising performance results, achieving a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns through complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. As the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment efforts of existing error-annotated learner corpora.
[37] Leveraging LLMs For Turkish Skill Extraction
Ezgi Arslan İltüzer,Özgür Anıl Özlü,Vahid Farajijobehdar,Gülşen Eryiğit
Main category: cs.CL
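TL;DR: This paper introduces the first Turkish skill extraction dataset and evaluates large language models (LLMs) on skill extraction for low-resource Turkish. An LLM-based end-to-end approach outperforms traditional supervised sequence labeling, especially in identifying skills and aligning them with the ESCO taxonomy.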
Details
Motivation: Turkish is morphologically complex and low-resource, and it lacks both a skill taxonomy and a dedicated skill extraction dataset, so skill extraction research for Turkish lags behind. Method: Build the first manually annotated Turkish skill extraction dataset (4,819 skill spans from 327 job postings); run several LLMs (e.g., Claude Sonnet 3.7) in an end-to-end pipeline that combines dynamic few-shot prompting, embedding-based retrieval, and LLM-based reranking for skill identification and linking. Result: The best configuration (Claude Sonnet 3.7 + dynamic few-shot prompting + embedding-based retrieval + LLM-based reranking) reaches an end-to-end F1 of 0.56, beating supervised sequence labeling and aligning skills with the ESCO taxonomy more effectively. Conclusion: LLMs offer a clear advantage for skill extraction in low-resource languages; this work provides a benchmark dataset and an effective method for skill extraction in Turkish and other underrepresented languages, and may spur further research. Abstract: Skill extraction is a critical component of modern recruitment systems, enabling efficient job matching, personalized recommendations, and labor market analysis. Despite Türkiye's significant role in the global workforce, Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated skill extraction dataset, resulting in underexplored research in skill extraction for Turkish. This article seeks the answers to three research questions: 1) How can skill extraction be effectively performed for this language, in light of its low resource nature? 2) What is the most promising model? 3) What is the impact of different Large Language Models (LLMs) and prompting strategies on skill extraction (i.e., dynamic vs. static few-shot samples, varying context information, and encouraging causal reasoning)? The article introduces the first Turkish skill extraction dataset and performance evaluations of automated skill extraction using LLMs. The manually annotated dataset contains 4,819 labeled skill spans from 327 job postings across different occupation areas. The use of LLM outperforms supervised sequence labeling when used in an end-to-end pipeline, aligning extracted spans with standardized skills in the ESCO taxonomy more effectively. The best-performing configuration, utilizing Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, embedding-based retrieval, and LLM-based reranking for skill linking, achieves an end-to-end performance of 0.56, positioning Turkish alongside similar studies in other languages, which are few in the literature. Our findings suggest that LLMs can improve skill extraction performance in low-resource settings, and we hope that our work will accelerate similar research on skill extraction for underrepresented languages.
[38] Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
Jio Oh,Paul Vicinanza,Thomas Butler,Steven Euijong Whang,Dezhi Hong,Amani Namboori
Main category: cs.CL
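TL;DR: This paper presents MDial, a framework for generating multi-dialectal dialogue data covering nine English dialects with a focus on lexical, orthographic, and grammatical features, with data quality ensured in collaboration with linguists. Built on it, the MDialBench benchmark evaluates 17 LLMs on dialect identification and response generation and finds that current models understand dialects poorly.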
Details
Motivation: More than 80% of English speakers do not use Standard American English (SAE); they see higher failure rates and more stereotyped responses when interacting with LLMs, yet multi-dialectal performance remains underexplored. Method: Introduce MDial, the first large-scale framework for generating multi-dialectal conversational data, covering lexical, orthographic, and morphosyntactic features of nine English dialects; design an annotated, scalable rule-based LLM transformation together with native linguists; build the dialect-parallel benchmark MDialBench (50k+ dialogs, 97k+ QA pairs); evaluate 17 LLMs on dialect identification and response generation. Result: In independent evaluation, annotators prefer MDial outputs in 98% of pairwise comparisons; frontier models stay under 70% accuracy on dialect identification, fail to reach 50% on Canadian English, and systematically misclassify non-SAE dialects as American or British. Conclusion: Current LLMs have serious shortcomings in multi-dialect understanding; dialect identification errors can cascade into downstream tasks, so multi-dialect perspectives need to be part of model training and evaluation. Abstract: More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce MDial, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel MDialBench benchmark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.
[39] DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion
Yuxuan Lou,Ziming Wu,Yaochen Wang,Yong Liu,Yingxuan Ren,Fuming Lai,Shaobing Lian,Jie Tang,Yang You
Main category: cs.CL
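TL;DR: This paper proposes a new paradigm, "Silent Thought, Spoken Answer", in which a diffusion model jointly generates and understands speech and text, producing internal text reasoning alongside spoken output, and builds the first speech QA dataset with reasoning traces.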
Details
Motivation: Existing speech LLMs generate spoken responses directly, with no explicit reasoning that could be revised, so errors cannot be corrected once audio is produced. Method: Propose \method{}, the first diffusion-based speech-text language model, which uses a unified masked diffusion framework to jointly generate text reasoning traces and speech tokens through iterative denoising with modality-specific masking schedules; also construct \dataset{}, a speech QA dataset of 26K samples totaling 319 hours, paired with text reasoning traces. Result: State-of-the-art accuracy on speech-to-speech QA, up to 9 points above the best baseline; the best TTS quality (6.2% WER); language understanding is preserved (66.2% MMLU); ablations show that both the diffusion architecture and the reasoning traces contribute to the gains. Conclusion: The "Silent Thought, Spoken Answer" paradigm improves the accuracy, controllability, and interpretability of speech LLMs; the diffusion model and the dataset with reasoning traces are the key ingredients. Abstract: Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer" -- a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present \method{}, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, \method{} jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show \method{} achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
[40] LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models
Alhassan Abdelhalim,Janick Edinger,Sören Laue,Michaela Regneri
Main category: cs.CL
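TL;DR: This paper asks how linguistic abstraction emerges in large language models (LLMs) used in pervasive computing and tries to detect it with probing and feature mapping. Both attempts fail for methodological reasons, exposing serious limits of current interpretability techniques that matter especially for debugging, compression, and explanation in pervasive and distributed systems.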
Details
Motivation: Investigate how linguistic abstraction emerges in LLMs, in particular how it shows up in different modules (attention heads and input embeddings), in order to understand what lies behind their strong performance. Method: Apply two method families that are well established in the literature: (1) probing for token-level relational structures; (2) embedding-based property inference (feature mapping), which treats embeddings as carriers of human-interpretable semantic properties. Result: Both fail: (1) attention-based explanations break down because later-layer representations no longer correspond to the original tokens; (2) the high predictive scores of property inference on embeddings come from dataset structure and methodological artifacts, not from real semantic knowledge. Conclusion: Current mainstream LLM interpretability methods cannot reliably support claims about what a model understands; their limitations pose a real risk for system-level applications in pervasive and distributed computing that depend on interpretability (debugging, compression, explanation). Abstract: Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many are, as a method, not fully understood themselves. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.
[41] Benchmarking Machine Translation on Chinese Social Media Texts
Kaiyan Zhao,Zheyong Xie,Zhongtao Miao,Xinze Lyu,Yao Hu,Shaosheng Cao
Main category: cs.CL
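TL;DR: This paper introduces CSM-MTBench, a benchmark for evaluating how well machine translation systems handle informal Chinese social media text (slang, neologisms, stylized expressions); it comprises two expert-curated subsets, each with a tailored evaluation method.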
Details
Motivation: Rapidly evolving slang, neologisms, and highly stylized expressions on Chinese social media create two challenges for MT evaluation: scarce data, and traditional metrics that fail to capture stylistic fidelity. Method: Build CSM-MTBench, covering five Chinese-foreign language directions, with two subsets: Fun Posts (focused on the translation success rate of slang and neologisms) and Social Snippets (focused on tone and style preservation, scored with a hybrid of embedding-based metrics and LLM-as-a-judge); run experiments on more than 20 models. Result: The experiments reveal substantial differences in how existing MT systems handle semantic fidelity and social-media-specific style, supporting CSM-MTBench as an effective measure of translation ability on real-world Chinese social media text. Conclusion: CSM-MTBench provides a rigorous, targeted benchmark for improving how MT systems handle real-world Chinese social media text. Abstract: The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang and stylistic cues in both languages; and (2) metric limitations, where traditional evaluators like COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, while assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media texts.
[42] Relaxing Positional Alignment in Masked Diffusion Language Models
Mengyu Ye,Ryosuke Takahashi,Keito Kudo,Jun Suzuki
Main category: cs.CL
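TL;DR: This paper proposes an improvement for masked diffusion language models (MDLMs), which underperform in open-ended text generation, by introducing an alignment-flexible supervision strategy (e.g.,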
Details
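Motivation: MDLMs show a performance gap in open-ended text generation; the authors hypothesize that strict position-wise prediction makes decoding highly sensitive to token misalignment, and that such strict positional supervision is mismatched with the irreversible denoising dynamics of MDLMs. Method: Apply an alignment-flexible supervision strategy during fine-tuning, specifically by introducing a special token
[43] Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection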
Yuan Li,Jun Hu,Bryan Hooi,Bingsheng He,Cheng Chen
Main category: cs.CL
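TL;DR: This paper proposes FraudCoT, a framework that improves the performance and efficiency of fraud detection on text-attributed graphs through autonomous, graph-aware chain-of-thought reasoning and scalable LLM-GNN co-training.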
Details
Motivation: Existing LLM-enhanced graph neural network methods are limited by predefined prompts and decoupled training pipelines, which restricts reasoning autonomy and weakens semantic-structural alignment. Method: Propose FraudCoT: (1) a fraud-aware selective chain-of-thought (CoT) distillation mechanism that generates diverse reasoning paths and strengthens semantic-structural understanding; (2) integration of the distilled CoTs into node texts, giving GNNs multi-hop semantic and structural cues; (3) an efficient asymmetric co-training strategy that enables end-to-end optimization at much lower computational cost. Result: On public and industrial benchmarks, FraudCoT improves AUPRC by up to 8.8% over state-of-the-art methods and delivers up to a 1,066x speedup in training throughput. Conclusion: FraudCoT addresses both prompt dependence and decoupled training in one framework and advances fraud detection in both accuracy and training efficiency. Abstract: Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. However, existing LLM-enhanced GNN approaches are constrained by predefined prompting and decoupled training pipelines, limiting reasoning autonomy and weakening semantic-structural alignment. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training. To address the limitations of predefined prompts, we introduce a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths and enhances semantic-structural understanding. These distilled CoTs are integrated into node texts, providing GNNs with enriched, multi-hop semantic and structural cues for fraud detection. Furthermore, we develop an efficient asymmetric co-training strategy that enables end-to-end optimization while significantly reducing the computational cost of naive joint training. Extensive experiments on public and industrial benchmarks demonstrate that FraudCoT achieves up to 8.8% AUPRC improvement over state-of-the-art methods and delivers up to 1,066x speedup in training throughput, substantially advancing both detection performance and efficiency.
[44] Residual Context Diffusion Language Models
Yuezhou Hu,Harman Singh,Monishwaran Maheswaran,Haocheng Xi,Coleman Hooper,Jintao Zhang,Aditya Tomar,Michael W. Mahoney,Sewon Min,Mehrdad Farajtabar,Kurt Keutzer,Amir Gholami,Chenfeng Xu
Main category: cs.CL
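TL;DR: This paper proposes Residual Context Diffusion (RCD), a module that recycles contextual information from tokens discarded by the remasking mechanism in dLLMs, converts it into residuals, and injects it into the next denoising step, improving accuracy and reducing the number of denoising steps.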
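The remasking step that RCD builds on can be pictured with a small sketch. Everything here is an assumption made for illustration: the function `remask_step`, the confidence threshold, and the choice to keep the raw probability vector for rejected positions. RCD itself works on token representations inside the model, which this toy does not attempt to reproduce.

```python
MASK = None  # placeholder for a position that stays masked

def remask_step(predictions, threshold=0.9):
    """One confidence-based remasking step (illustrative, not the RCD module).

    `predictions` holds one (token, confidence, distribution) tuple per
    masked position. Confident tokens are committed; the rest are remasked,
    but their distributions are kept instead of being thrown away.
    """
    committed, residuals = [], []
    for token, confidence, distribution in predictions:
        if confidence >= threshold:
            committed.append(token)
            residuals.append(None)
        else:
            committed.append(MASK)
            residuals.append(distribution)  # context a next step could reuse
    return committed, residuals

# Toy predictions over a two-word vocabulary.
predictions = [
    ("the", 0.95, [0.95, 0.05]),
    ("cat", 0.55, [0.55, 0.45]),
]
committed, residuals = remask_step(predictions)
```

Plain remasking would return only `committed`; the second return value stands in for the information the paper proposes to recycle.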
Details
Motivation: Current block-wise diffusion LLMs (dLLMs) rely on a remasking mechanism that decodes only the most confident tokens and discards the rest, wasting computation; the authors find that the discarded tokens still carry useful contextual information that can be reused. Method: Propose the RCD module, which converts discarded token representations into contextual residuals and injects them into the next denoising step; use a decoupled two-stage training pipeline to avoid the memory bottleneck of backpropagation. Result: RCD improves the accuracy of frontier dLLMs by 5-10 points across benchmarks, nearly doubles baseline accuracy on AIME, and needs up to 4-5x fewer denoising steps at equal accuracy; converting a standard dLLM to the RCD paradigm takes only about 1 billion tokens. Conclusion: RCD is an efficient, low-overhead improvement that strengthens the reasoning performance and computational efficiency of dLLMs, for both long and short chain-of-thought tasks. Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.
[45] A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
Zihan Qiu,Zeyu Huang,Kaiyue Wen,Peng Jin,Bo Zheng,Yuxin Zhou,Haofeng Huang,Zekun Wang,Xiao Li,Huaqing Zhang,Yang Xu,Haoran Lian,Siqi Zhang,Rui Men,Jianwei Zhang,Ivan Titov,Dayiheng Liu,Jingren Zhou,Junyang Lin
Main category: cs.CL
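TL;DR: This paper studies the functional role of emergent outliers in large language models (attention sinks and residual sinks), describes a phenomenon it calls outlier-driven rescaling, and validates it across model architectures and training token counts. Outliers mainly stabilize training by working together with normalization rather than contributing directly to the output, and absorbing them into parameters or using gated rescaling improves performance and quantization robustness.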
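As a rough illustration of how a sink can act as a rescale factor under softmax, the sketch below adds one large logit to a row of attention logits. The logit values and the helper `softmax` are made up for this example and are not taken from the paper; the point is only that the remaining weights shrink by a common factor while their ratios stay fixed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for three ordinary tokens.
content_logits = [1.0, 2.0, 0.5]
without_sink = softmax(content_logits)

# Prepend one "sink" token with a large logit (value chosen arbitrarily).
with_sink = softmax([6.0] + content_logits)
sink_weight, content_weights = with_sink[0], with_sink[1:]

# Every non-sink weight is scaled by the same factor (1 - sink_weight),
# so relative attention among content tokens is unchanged.
scale = 1.0 - sink_weight
rescaled = [w * scale for w in without_sink]
```

Here the sink absorbs most of the attention mass, and `rescaled` matches `content_weights` up to floating-point error, which is the sense in which the outlier rescales rather than contributes.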
Details
Motivation: Investigate the functional role of emergent outliers in LLMs, such as attention sinks and residual sinks, and understand how they interact with normalization (e.g., softmax, RMSNorm) and affect training stability and performance. Method: Test the outlier-driven rescaling hypothesis through systematic ablations (removing normalization, clipping outliers), analysis across architectures and training stages, and mechanisms that absorb outliers into learnable parameters or apply gated rescaling. Result: Outliers and normalization jointly implement rescaling; removing normalization eliminates the outliers but hurts stability; the outliers themselves contribute little and mainly act as scale factors; absorbing outliers or applying gated rescaling improves training performance (average gain of 2 points) and quantization robustness (1.2 points of degradation under W4A4). Conclusion: Outliers are functional components rather than defects; together with normalization they form an implicit rescaling mechanism that gives a unified account of the origin and mitigation of both sink types and offers a new perspective for model design and optimization. Abstract: We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve more as rescale factors rather than contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (1.2 points degradation under W4A4 quantization).
[46] ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform
Salem Lahlou
Main category: cs.CL
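TL;DR: ArabicDialectHub is an open-source resource for learning Arabic across dialects. It contains 552 phrases covering six varieties (including MSA) and comes with an interactive web platform that supports translation exploration, adaptive quizzes, synchronized progress tracking, and cultural context.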
Details
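Motivation: Learning resources for Arabic dialects are scarce and lack systematic organization and interactivity; the goal is to support cross-dialect understanding and language learning. Method: Generate phrases with LLMs and have five native speakers validate them; stratify by difficulty and organize by theme; build a web platform with translation exploration, adaptive quizzes with algorithmically generated distractors, cloud-synchronized progress tracking, and cultural notes. Result: A dataset of 552 phrases across six Arabic varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and Modern Standard Arabic) and a complete open-source web platform with full source code, already online and publicly accessible. Conclusion: ArabicDialectHub offers an open-source, extensible, and interactive infrastructure for learning Arabic across dialects, combining linguistic rigor with practical educational value. Abstract: We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs and validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and complete platform source code are released under MIT license. Platform: https://arabic-dialect-hub.netlify.app.
[47] Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs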
Afrozah Nadeem,Agrima,Mehwish Nasim,Usman Naseem
Main category: cs.CL
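TL;DR: This paper proposes CLAS, a framework for evaluating and mitigating political bias across languages. It aligns the latent representations of multilingual models in a shared ideological subspace and dynamically regulates intervention strength, achieving substantial bias reduction across 50 countries and 33 languages while preserving response quality.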
Details
Motivation: Prior work largely focuses on high-resource Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. Method: Propose Cross-Lingual Alignment Steering (CLAS), which aligns the ideological representations induced by political prompts in different languages into a shared subspace and adaptively regulates intervention strength to prevent over-correction. Result: Political bias drops substantially along both the economic and social axes with minimal loss of response quality; the method is shown to be effective and scalable in multilingual, multicultural settings. Conclusion: CLAS provides a scalable and interpretable approach to fairness-aware governance of multilingual LLMs, balancing ideological neutrality with linguistic and cultural diversity. Abstract: Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross-lingual consistency, while the adaptive mechanism prevents over-correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.
[48] InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning
Junyou Su,He Zhu,Xiao Luo,Liyu Zhang,Hong-Yu Zhou,Yun Chen,Peng Li,Yang Liu,Guanhua Chen
Main category: cs.CL
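TL;DR: This paper proposes InstructDiff, which uses the entropy difference between a base model and a fine-tuned model as a data selection criterion, enabling domain-adaptive data selection; with only 10% of the data it achieves relative improvements of 17% on mathematical reasoning and 52% on general instruction following.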
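One way to picture entropy-difference selection is the sketch below. It is a guess at the general shape, not the InstructDiff algorithm: the sign convention (calibrated minus base), the per-sample averaging, and the rule of keeping the lowest-ranked fraction are all assumptions, and warmup calibration and NLL filtering are left out. The function names and the toy inputs are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over the token positions of one sample."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def select_by_entropy_diff(base, calibrated, keep_fraction=0.1):
    """Rank sample ids by (calibrated - base) mean entropy, keep the lowest.

    `base` and `calibrated` map a sample id to its per-position next-token
    distributions under each model (hypothetical inputs for this sketch).
    """
    diffs = {
        sample_id: mean_entropy(calibrated[sample_id]) - mean_entropy(base[sample_id])
        for sample_id in base
    }
    keep = max(1, int(len(diffs) * keep_fraction))
    return sorted(diffs, key=diffs.get)[:keep]

# Toy data: two samples, one token position each, vocabulary of size 2.
base = {"a": [[0.5, 0.5]], "b": [[0.5, 0.5]]}
calibrated = {"a": [[0.9, 0.1]], "b": [[0.6, 0.4]]}
selected = select_by_entropy_diff(base, calibrated, keep_fraction=0.5)
```

In practice the distributions would come from forward passes of the two models over each training sample; here sample "a" is kept because its entropy drops the most.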
Details
Motivation: Supervised fine-tuning (SFT) is costly and shows diminishing returns, while existing data selection methods are strongly domain-specific and fail to serve both general instruction following and reasoning tasks. Method: Propose InstructDiff, which uses warmup calibration, bi-directional NLL filtering, and entropy-based ranking, taking the entropy difference between a base model and a lightly instruction-tuned model as a domain-adaptive selection criterion. Result: A 17% relative improvement over full-data training on mathematical reasoning and 52% on general instruction following, using only 10% of the data and outperforming prior baselines. Conclusion: Differential entropy can serve as a unified, domain-adaptive selection signal: reasoning tasks favor entropy increase (cognitive expansion) and general tasks favor entropy decrease (cognitive compression), which lets InstructDiff select SFT data efficiently across domains. Abstract: Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern -- samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves 17% relative improvement over full data training on mathematical reasoning and 52% for general instruction-following, outperforming prior baselines while using only 10% of the data.
[49] DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis
Lung-Hao Lee,Liang-Chih Yu,Natalia Loukashevich,Ilseyar Alimova,Alexander Panchenko,Tzu-Mi Lin,Zhe-Yu Xu,Jian-Yu Zhou,Guangmin Zheng,Jin Wang,Sharanya Awasthi,Jonas Becker,Jan Philip Wahle,Terry Ruas,Shamsuddeen Hassan Muhammad,Saif M. Mohammed
Main category: cs.CL
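TL;DR: This paper introduces DimABSA, the first multilingual resource for dimensional (valence-arousal) aspect-based sentiment analysis, defines three subtasks that incorporate VA scores along with a new metric, cF1, and establishes a benchmark for multilingual dimensional ABSA.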
Details
Motivation: Existing aspect-based sentiment analysis (ABSA) uses only coarse categorical labels (e.g., positive, negative), which cannot capture nuanced affective states; a continuous, dimensional representation of sentiment is needed. Method: Build DimABSA, a multilingual, multidomain dataset annotated with continuous valence-arousal (VA) scores; define three subtasks that combine VA scores with traditional ABSA elements; propose cF1, a unified metric that accounts for both classification and regression error; benchmark large language models under prompting and fine-tuning. Result: The released resource contains 76,958 aspect instances across six languages and four domains; the cF1 metric is introduced; experiments show that current LLMs still have considerable room for improvement, confirming that the benchmark is challenging. Conclusion: DimABSA fills the gap in resources and evaluation for multilingual dimensional ABSA and provides a foundation for fine-grained, cross-lingual sentiment understanding. Abstract: Aspect-Based Sentiment Analysis (ABSA) focuses on extracting sentiment at a fine-grained aspect level and has been widely applied across real-world domains. However, existing ABSA research relies on coarse-grained categorical labels (e.g., positive, negative), which limits its ability to capture nuanced affective states. To address this limitation, we adopt a dimensional approach that represents sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. To this end, we introduce DimABSA, the first multilingual, dimensional ABSA resource annotated with both traditional ABSA elements (aspect terms, aspect categories, and opinion terms) and newly introduced VA scores. This resource contains 76,958 aspect instances across 42,590 sentences, spanning six languages and four domains. We further introduce three subtasks that combine VA scores with different ABSA elements, providing a bridge from traditional ABSA to dimensional ABSA. Given that these subtasks involve both categorical and continuous outputs, we propose a new unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1. We provide a comprehensive benchmark using both prompted and fine-tuned large language models across all subtasks. Our results show that DimABSA is a challenging benchmark and provides a foundation for advancing multilingual dimensional ABSA.
[50] Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
Yanghao Su,Wenbo Zhou,Tianwei Zhang,Qiu Han,Weiming Zhang,Nenghai Yu,Jie Zhang
Main category: cs.CL
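TL;DR: This paper examines emergent misalignment arising during LLM fine-tuning, argues that it stems from stable shifts in the model's behavioral dispositions rather than from capability degradation or corrupted knowledge, and highlights character formation as an overlooked alignment risk.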
Details
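Motivation: Prior work attributes emergent misalignment to the generalization of erroneous or unsafe content, but this explanation is incomplete; the authors aim to expose a deeper mechanism, namely how character-level behavioral dispositions lead to broad and transferable misalignment. Method: Run empirical studies across multiple domains and model families, compare how different fine-tuning data (in particular data exhibiting specific character-level dispositions) affect model behavior, and test how training-time triggers and inference-time persona prompts conditionally activate those dispositions. Result: Fine-tuning on data with specific character-level dispositions induces stronger and more transferable misalignment while largely preserving general capabilities; the misalignment can be activated by specific prompts at training and inference time and shares structure with backdoor activation and jailbreak susceptibility. Conclusion: Emergent misalignment is at root a shift in behavioral dispositions caused by character formation; robust alignment must target those dispositions rather than only fixing isolated errors or relying on prompt-level defenses. Abstract: Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.
[51] Safer Policy Compliance with Dynamic Epistemic Fallback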
Joseph Marvin Imperial,Harish Tayyar Madabushi
Main category: cs.CL
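TL;DR: This paper proposes Dynamic Epistemic Fallback (DEF), a dynamic safety protocol inspired by the human cognitive defense known as epistemic vigilance, to strengthen an LLM's inference-time defenses against deceptive attacks based on maliciously perturbed legal texts.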
Details
Motivation: Humans have cognitive defenses for spotting deception and misinformation (epistemic vigilance); borrowing this mechanism could make LLMs safer and more robust in high-stakes settings such as automating data privacy compliance. Method: Propose the Dynamic Epistemic Fallback (DEF) protocol, which uses one-sentence textual cues at several levels to nudge the LLM at inference time to flag inconsistencies in policy texts, refuse compliance, and fall back to its parametric knowledge; evaluate on globally recognized legal policies such as HIPAA and GDPR. Result: DEF markedly improves the ability of frontier LLMs to detect and refuse perturbed policy texts, with DeepSeek-R1 reaching a 100% detection rate in one setting. Conclusion: Cognitively inspired defenses such as DEF are an effective route to making LLMs more robust against deception and harm that exploit legal texts, and point to a direction for further research. Abstract: Humans develop a series of cognitive defenses, known as epistemic vigilance, to combat risks of deception and misinformation from everyday interactions. Developing safeguards for LLMs inspired by this mechanism might be particularly helpful for their application in high-stakes tasks such as automating compliance with data privacy laws. In this paper, we introduce Dynamic Epistemic Fallback (DEF), a dynamic safety protocol for improving an LLM's inference-time defenses against deceptive attacks that make use of maliciously perturbed policy texts. Through various levels of one-sentence textual cues, DEF nudges LLMs to flag inconsistencies, refuse compliance, and fall back to their parametric knowledge upon encountering perturbed policy texts. Using globally recognized legal policies such as HIPAA and GDPR, our empirical evaluations report that DEF effectively improves the capability of frontier LLMs to detect and refuse perturbed versions of policies, with DeepSeek-R1 achieving a 100% detection rate in one setting. This work encourages further efforts to develop cognitively inspired defenses to improve LLM robustness against forms of harm and deception that exploit legal artifacts.
[52] Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
Yilun Hua,Giuseppe Castellucci,Peter Schulam,Heba Elfardy,Kevin Small
Main category: cs.CL
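TL;DR: This paper proposes GroGU, a model-specific, reference-free metric for the utility of grounding content in retrieval-augmented generation (RAG). It defines utility through the LLM's generation confidence (entropy), distinguishes ground-truth documents without human annotation, and is used to optimize a query rewriter, substantially improving RAG performance.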
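The idea of scoring a grounding document by the LLM's entropy-based generation confidence can be illustrated with a minimal sketch. The exact GroGU formula is not given in this digest, so the function `entropy_utility` below is a hypothetical reading (entropy without the document minus entropy with it), and the probability values are made-up toy numbers rather than model output.

```python
import math

def mean_token_entropy(step_distributions):
    """Average Shannon entropy (in nats) over per-step next-token distributions."""
    total = 0.0
    for dist in step_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(step_distributions)

def entropy_utility(dists_without_doc, dists_with_doc):
    """Hypothetical utility: how much the document lowers generation entropy."""
    return mean_token_entropy(dists_without_doc) - mean_token_entropy(dists_with_doc)

# Toy next-token distributions over a 4-token vocabulary (illustrative only).
no_doc = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
with_doc = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

utility = entropy_utility(no_doc, with_doc)
print(f"utility of the document: {utility:.3f} nats")
```

Under this reading, a positive value means the model generates more confidently with the document in context; a score like this needs no reference answer, which is the property the TL;DR highlights.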
Details
Motivation: 现有RAG内容效用量化方法缺乏明确标准,且多忽略模型特异性或依赖高成本人工标注。 Method: 提出Grounding Generation Utility(GroGU)指标,以LLM在给定检索内容下的生成熵衡量效用;将其用于识别高效用偏好数据,指导查询重写器的Direct Preference Optimization训练。 Result: GroGU在无需标注下保持对真实文档的高辨别力,优于LLM无关指标;应用于查询重写后,MRR提升最高达18.2点,答案准确率提升最高达9.4点。 Conclusion: GroGU是一种轻量、模型自适应、无监督的内容效用评估方法,可有效驱动RAG系统优化,尤其适用于低资源场景下的端到端训练。 Abstract: Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.[53] Monotonic Reference-Free Refinement for Autoformalization
Lan Zhang,Marco Valentino,André Freitas
Main category: cs.CL
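TL;DR: This paper proposes an iterative, monotonic refinement method for full-theorem autoformalization that needs no reference formalizations or proofs. It combines complementary feedback from theorem provers and LLM-based judges to jointly optimize formal validity, logical preservation, mathematical consistency, and formal quality, and is validated on miniF2F and ProofNet.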
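A monotonic acceptance rule over a masked composite objective can be sketched as follows. The paper's actual acceptance policy, weights, and scoring functions are not specified in this digest; the rule below (no active dimension may regress and the composite must strictly improve) and all scores are assumptions chosen for illustration.

```python
DIMS = ["formal_validity", "logical_preservation",
        "mathematical_consistency", "formal_quality"]

def composite(scores, mask, weights):
    """Weighted sum over the dimensions the mask marks as active."""
    return sum(weights[d] * scores[d] for d in DIMS if mask[d])

def accept(current, candidate, mask, weights):
    """Assumed rule: no active dimension regresses and the composite strictly improves."""
    no_regression = all(candidate[d] >= current[d] for d in DIMS if mask[d])
    improves = composite(candidate, mask, weights) > composite(current, mask, weights)
    return no_regression and improves

def refine(initial, proposals, mask, weights):
    """Walk through candidate formalizations, keeping only accepted ones."""
    state = initial
    history = [composite(state, mask, weights)]
    for candidate in proposals:
        if accept(state, candidate, mask, weights):
            state = candidate
        history.append(composite(state, mask, weights))
    return state, history

mask = {d: True for d in DIMS}
weights = {d: 1.0 for d in DIMS}
initial = dict(zip(DIMS, [0.0, 0.5, 0.5, 0.25]))
proposals = [
    dict(zip(DIMS, [1.0, 0.5, 0.5, 0.25])),   # accepted: validity improves
    dict(zip(DIMS, [1.0, 0.25, 0.75, 0.75])), # rejected: logical preservation regresses
    dict(zip(DIMS, [1.0, 0.75, 0.5, 0.5])),   # accepted
]
final, history = refine(initial, proposals, mask, weights)
print(history)
```

Because a rejected candidate leaves the state unchanged, the recorded objective never decreases, which is the monotonicity property the summary refers to.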
Details
Motivation: Existing statement-level autoformalization methods struggle to optimize multiple quality dimensions at once, and full-theorem autoformalization has not been studied systematically; an end-to-end refinement framework that does not depend on ground-truth proofs or existing formalizations is needed. Method: Proposes a reference-free iterative monotonic process that uses complementary feedback from theorem provers and multi-role LLM judges to optimize a masked composite objective (Formal Validity, Logical Preservation, Mathematical Consistency, Formal Quality), and introduces a responsiveness map, an acceptance policy that guarantees monotonic improvement, and convergence conditions. Result: Reaches 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet, demonstrating simultaneous improvement across quality dimensions. Conclusion: The reference-free iterative monotonic framework offers a provably convergent, scalable paradigm for full-theorem autoformalization and markedly improves the joint optimization of formalization quality. Abstract: While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth proofs or existing formalizations at inference time. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting as different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate the proposed process enables simultaneous improvement across multiple dimensions, achieving 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet.
[54] FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation
Siyang He,Qiqi Wang,Xiaoran Liu,Hongnan Ma,Yiwei Shi,Yuerong Song,Ying Zhu,Tianyi Liang,Zengfeng Huang,Ziwei He,Xipeng Qiu
Main category: cs.CL
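TL;DR: This paper proposes FourierSampler, a decoding strategy for diffusion language models (dLLMs) based on frequency-domain analysis. A frequency-domain sliding window mechanism produces "structure-to-detail" generation and substantially improves non-autoregressive generation performance.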
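The split between low-frequency (global structure) and high-frequency (local detail) components can be illustrated on a toy signal. This is only a sketch of the underlying signal-processing idea using a naive discrete Fourier transform; it operates on a one-dimensional list of numbers, not on real dLLM hidden states, and the cutoff and signal are arbitrary assumptions.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(bins):
    """Inverse DFT, returning the real part (inputs here have symmetric spectra)."""
    n = len(bins)
    return [sum(bins[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split x into a low band (frequency index <= cutoff) and the remaining high band."""
    n = len(x)
    spectrum = dft(x)
    low_bins = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high_bins = [spectrum[k] - low_bins[k] for k in range(n)]
    return idft(low_bins), idft(high_bins)

n = 16
# Slow trend (1 cycle) plus a weaker fast oscillation (6 cycles).
signal = [math.cos(2 * math.pi * t / n) + 0.3 * math.cos(2 * math.pi * 6 * t / n)
          for t in range(n)]
low, high = split_bands(signal, cutoff=2)
print([round(v, 3) for v in low[:4]], [round(v, 3) for v in high[:4]])
```

The low band recovers the slow trend and the high band the fast oscillation, and the two sum back to the original signal.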
Details
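Motivation: Existing decoding strategies for diffusion language models show positional bias and do not fully exploit their potential for arbitrary-order generation; the authors aim to reveal the spectral properties of hidden states from a frequency-domain perspective to guide a more sensible parallel generation process. Method: Performs frequency-domain analysis of dLLM hidden states and finds that low-frequency components encode global structure and long-range dependencies, while high-frequency components capture local details; based on this, designs FourierSampler, which uses a frequency-domain sliding window mechanism to dynamically steer generation order, giving a coarse-to-fine generation strategy. Result: FourierSampler substantially outperforms other inference enhancement methods on LLaDA and SDAR, with relative improvements of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct, and surpasses similarly sized autoregressive models such as Llama3.1-8B-Instruct. Conclusion: The frequency-domain perspective offers a new way to understand and improve dLLM decoding; FourierSampler supports the "structure first, details later" strategy and moves non-autoregressive language generation toward practical use. Abstract: Despite the non-autoregressive potential of diffusion language models (dLLMs), existing decoding strategies demonstrate positional bias, failing to fully unlock the potential of arbitrary generation. In this work, we delve into the inherent spectral characteristics of dLLMs and present the first frequency-domain analysis showing that low-frequency components in hidden states primarily encode global structural information and long-range dependencies, while high-frequency components are responsible for characterizing local details. Based on this observation, we propose FourierSampler, which leverages a frequency-domain sliding window mechanism to dynamically guide the model to achieve a "structure-to-detail" generation. FourierSampler outperforms other inference enhancement strategies on LLaDA and SDAR, achieving relative improvements of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct. It notably surpasses similarly sized autoregressive models like Llama3.1-8B-Instruct.
[55] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs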
Casimiro Pio Carrino,Paula Estrella,Rabih Zbib,Carlos Escolano,José A. R. Fonollosa
Main category: cs.CL
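TL;DR: This paper presents JobResQA, a multilingual question answering benchmark for the HR domain (résumés and job descriptions) that evaluates the machine reading comprehension of large language models. The benchmark covers 5 languages and 581 QA pairs, includes controlled attributes to support fairness studies, and uses a high-quality human-in-the-loop translation pipeline. Experiments reveal substantial performance degradation on languages other than English and Spanish.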
Details
Motivation: Existing MRC benchmarks lack multilingual, high-fidelity evaluation datasets for human-resources scenarios (such as résumé-job matching) that also account for privacy and fairness, which limits reliable deployment of LLMs in real HR systems. Method: Builds the multilingual JobResQA benchmark: 1) synthesizes 105 résumé-job description pairs (in 5 languages) derived from real sources, with de-identification and placeholders controlling demographic and professional attributes; 2) uses the TEaR methodology for cost-effective, high-quality multi-way parallel translation (with MQM annotation and selective post-editing); 3) runs baseline evaluations of several open-weight LLMs with an LLM-as-judge approach. Result: Baselines show better performance on English and Spanish but substantial degradation on Italian, German, and Chinese, exposing key gaps in multilingual MRC for HR tasks; the benchmark is open source and supports reproducible fairness and reliability research. Conclusion: JobResQA fills a gap in multilingual MRC evaluation for the HR domain and provides a standardized, extensible benchmark for fair, reliable, and trustworthy LLM-based HR systems. Abstract: We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa-benchmark
[56] ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought
Fanmeng Wang,Haotian Liu,Guojiang Zhao,Hongteng Xu,Zhifeng Gao
Main category: cs.CL
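TL;DR: This paper proposes ReGuLaR, which renders explicit reasoning chains as images and extracts their visual-semantic representations to regularize the posterior distribution in a variational auto-encoder, substantially improving computational efficiency while preserving reasoning performance.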
Details
Motivation: 现有隐式推理方法因缺乏合适的压缩引导而性能严重下降,需在减少计算冗余的同时避免信息损失。 Method: 提出基于变分自编码(VAE)框架的隐式推理范式ReGuLaR;将显式推理链渲染为图像,从中提取密集的视觉-语义表征以正则化条件后验分布。 Result: ReGuLaR在计算效率与推理有效性上均显著优于现有隐式推理方法,并借助多模态推理甚至超越传统CoT。 Conclusion: ReGuLaR为隐式推理提供了一种新颖且有效的解决方案,兼顾高效性与强推理能力。 Abstract: While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto-Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual-semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi-modal reasoning, providing a new and insightful solution to latent reasoning. Code: https://github.com/FanmengWang/ReGuLaR.[57] Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience
Zhongxiang Sun,Qipeng Wang,Weijie Yu,Jingxuan Yang,Haolang Lu,Jun Xu
Main category: cs.CL
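TL;DR: This paper proposes the DS-MCM framework, which adds a hierarchical metacognitive monitoring mechanism (a fast consistency monitor and a slow experience-driven monitor) to improve the stability and performance of reasoning and retrieval in deep search agents on tasks under uncertainty.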
Details
Motivation: 现有深度搜索智能体在不确定性任务中易失败,因其缺乏对推理与检索状态的动态监控与调节机制;受人类分层元认知(快速异常检测+选择性经验反思)启发,需构建显式监控机制。 Method: 提出DS-MCM框架:包含Fast Consistency Monitor(轻量级证据-置信度对齐检查)和Slow Experience-Driven Monitor(基于历史轨迹经验记忆的选择性激活干预);将监控嵌入推理-检索闭环中,实现‘何时干预’与‘如何干预’的联合决策。 Result: 在多个深度搜索基准和不同骨干模型上实验表明,DS-MCM持续提升性能与鲁棒性。 Conclusion: 显式的分层元认知监控可有效增强大模型驱动的深度搜索智能体的自适应能力与可靠性,为构建更稳健的自主代理提供了新范式。 Abstract: Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience-driven reflection. In this work, we propose Deep Search with Meta-Cognitive Monitoring (DS-MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS-MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience-Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning-retrieval loop, DS-MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS-MCM consistently improves performance and robustness.[58] Are you going to finish that? A Practical Study of the Tokenization Boundary Problem
Hao Xu,Alisa Liu,Jonathan Hayase,Yejin Choi,Noah A. Smith
Main category: cs.CL
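TL;DR: This paper describes the "partial token problem" in language models, caused by a mismatch between tokenization and user input: when the user's input ends in the middle of a token, the model's next-token probabilities are severely distorted. The problem is especially pronounced for Chinese, highly compounding languages, and code, and even natural, word-complete inputs can trigger it. Experiments show that frontier language models place three orders of magnitude less probability on the correct continuation for partial-token prompts than for token-aligned prompts, and that the problem does not diminish with model scale and may worsen. The authors also evaluate several inference-time mitigations and validate the effectiveness of recent exact solutions.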
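The mismatch can be made concrete with a toy greedy longest-match tokenizer. Everything below is an illustrative assumption: the vocabulary, the tokenizer, and the simple back-off heuristic (drop the last prompt token) are not taken from the paper, and real subword tokenizers behave differently in detail.

```python
import string

# Toy vocabulary: a few multi-character tokens plus single letters as a fallback.
VOCAB = {"hello", "world", "wor", " "} | set(string.ascii_lowercase)

def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        end = next((j for j in range(len(text), i, -1) if text[i:j] in VOCAB), None)
        if end is None:
            raise ValueError(f"no token matches at position {i}")
        tokens.append(text[i:end])
        i = end
    return tokens

def is_token_aligned(prompt, full_text):
    """True if the prompt's tokens are a prefix of the full text's tokens."""
    prompt_tokens = tokenize(prompt)
    return tokenize(full_text)[:len(prompt_tokens)] == prompt_tokens

def back_off(prompt):
    """Drop the last prompt token; return (shortened prompt, removed suffix)."""
    tokens = tokenize(prompt)
    return "".join(tokens[:-1]), tokens[-1]

full_text = "hello world"
prompt = "hello wor"  # ends inside the token "world"
aligned_prompt, suffix = back_off(prompt)
print(tokenize(full_text), tokenize(prompt), repr(aligned_prompt), repr(suffix))
```

Here the prompt tokenizes to `wor` at the end, a token the full text never contains, so the model would be asked to continue from a context it rarely saw in training. After back-off the prompt is token-aligned, and the removed suffix can be used to constrain the continuation to start with those characters.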
Details
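Motivation: Language models are trained on token sequences, but users interact through text, so tokenization boundaries and the boundaries of user input do not match, producing the "partial token problem". Although prior work studied the problem with arbitrary character prefixes, its prevalence and severity in realistic prompts that respect word boundaries have not been analyzed systematically. Method: Identifies three domains where token and word boundaries tend to be misaligned (languages without whitespace, highly compounding languages, and code), and quantifies the misalignment rate using Chinese as an example; systematically constructs semantically natural prompts that end in a partial token; quantifies severity by comparing the probability the model assigns to the correct continuation under the partial-token prompt and under the prompt backed off to be token-aligned; tests models of different scales; evaluates several inference-time mitigations (including recent exact solutions). Result: In Chinese, up to 25% of word boundaries do not coincide with token boundaries; frontier language models assign on average three orders of magnitude less probability to the correct continuation for partial-token prompts; the problem does not diminish with scale and is often worse in larger models; some inference-time mitigations (especially recent exact solutions) are confirmed effective. Conclusion: The partial token problem is an underestimated flaw that seriously affects probability calibration in real use; model serving providers should take note and adopt practical countermeasures such as exact decoding at inference time. Abstract: Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next-token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remain underexplored. In this work, we identify three domains where token and "word" boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with partial tokens; in experiments, we find that they comprise a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is "backed-off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions.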
Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
[59] Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models
Ye Yu,Haibo Jin,Yaoning Yu,Jun Zhuang,Haohan Wang
Main category: cs.CL
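TL;DR: This paper proposes a text-to-speech (TTS) jailbreak attack on large audio-language models. By embedding disallowed instructions in a narrative audio stream, it bypasses safety mechanisms designed mainly for text and reaches a 98.26% attack success rate on models such as Gemini 2.0 Flash.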
Details
Motivation: 随着大型音频-语言模型越来越多地直接处理原始语音输入,其面临一类尚未被充分研究和表征的新安全漏洞;现有安全机制主要针对文本设计,难以应对语音模态特有的风险。 Method: 设计一种文本到音频的 jailbreak 方法,利用先进指令跟随型 TTS 模型,将禁止指令嵌入自然叙述风格的合成语音中,利用语音的结构与声学特性规避基于文本的安全过滤。 Result: 该攻击在 Gemini 2.0 Flash 等前沿模型上实现 98.26% 的成功触发率,显著高于纯文本攻击基线;验证了语音模态下安全机制的脆弱性。 Conclusion: 当前语音接口的安全框架亟需联合建模语言与副语言(如语调、节奏等)特征,以应对日益普及的语音交互场景中的新型威胁。 Abstract: Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.[60] PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu,Rui Meng,Yale Song,Xiyu Wei,Sujian Li,Tomas Pfister,Jinsung Yoon
Main category: cs.CL
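TL;DR: PaperBanana is an agentic framework built on multimodal large models that automatically generates illustrations meeting academic publication standards, covering methodology diagrams and statistical plots, and outperforms existing baselines on multiple evaluation dimensions.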
Details
Motivation: 当前自主AI科学家发展迅速,但生成出版级学术插图仍高度依赖人工,是研究流程中的关键瓶颈。 Method: 提出PaperBanana框架,利用先进视觉语言模型(VLM)和图像生成模型,编排多个专业化智能体完成参考检索、内容与风格规划、图像渲染及基于自我批评的迭代优化;同时构建包含292个NeurIPS 2025方法图的评测基准PaperBananaBench。 Result: 在保真度、简洁性、可读性和美观性等维度上显著优于主流基线;并验证了其对高质量统计图表生成的有效扩展性。 Conclusion: PaperBanana为全自动、高质量学术插图生成提供了可行路径,推动AI科学家向全流程自动化迈进。 Abstract: Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.[61] UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection
Siran Peng,Weisong Zhao,Tianyu Fu,Chenxu Zhao,Tianshuo Zhang,Haoyuan Zhang,Xiangyu Zhu,Minghui Wu,Zhen Lei
Main category: cs.CL
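TL;DR: This paper proposes UPA, a prompt optimization method that needs no supervised signal and performs unsupervised structured search and selection using pairwise comparisons by LLMs and the Bradley-Terry-Luce (BTL) model.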
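Turning pairwise comparisons into a global quality score with a Bradley-Terry-Luce model can be sketched as below. This is a plain maximum-likelihood fit using the standard minorization-maximization update, not the path-wise Bayesian aggregation the paper describes, and the win counts are invented for illustration.

```python
def fit_btl(wins, iters=200):
    """Fit BTL strengths from a win-count matrix with the MM update.

    wins[i][j] is the number of times item i was preferred over item j.
    Assumes every item has at least one win and the comparison graph is connected.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [v / norm for v in updated]  # keep strengths summing to 1
    return p

# Hypothetical outcomes of LLM pairwise comparisons among three candidate prompts.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_btl(wins)
best = max(range(len(strengths)), key=lambda i: strengths[i])
print([round(s, 3) for s in strengths], "best prompt index:", best)
```

Under the BTL model the probability that prompt i beats prompt j is p_i / (p_i + p_j), so the fitted strengths give a consistent global ranking even though each comparison is only local.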
Details
Motivation: Existing prompt optimization methods based on reinforcement learning or supervised feedback depend on human-annotated reward signals that are hard to obtain in practice, so a prompt optimization framework that needs no supervised feedback is required. Method: UPA builds a dynamically evolving tree to search the prompt space, using fine-grained, order-invariant pairwise comparisons from LLMs; it adopts a two-stage framework: the first stage performs path-wise Bayesian aggregation based on the Bradley-Terry-Luce (BTL) model to filter candidate prompts under uncertainty; the second stage uses global tournament-style comparisons to infer latent prompt quality and select the best prompt. Result: Experiments on multiple tasks show that UPA consistently outperforms existing prompt optimization methods, confirming that agent-style optimization is effective in fully unsupervised settings. Conclusion: UPA shows that efficient, structured prompt optimization is possible without supervised reward signals, offering a new paradigm for low-resource, highly generalizable automatic prompt engineering. Abstract: Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing refinement as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on supervised feedback. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and order-invariant pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization remains highly effective even in fully unsupervised settings.
cs.CV [Back]
[62] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation
Christos Tsourveloudis
Main category: cs.CV
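TL;DR: This paper presents the first systematic evaluation of the zero-shot transfer of five state-of-the-art open-vocabulary object detection (OVD) models to the aerial imagery dataset LAE-80C. Performance is limited mainly by semantic confusion rather than localization, existing prompt engineering strategies do not help, and the results point to the need for domain-adaptive methods for aerial imagery.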
Details
Motivation: Open-vocabulary object detection (OVD) performs well on natural images, but its transferability to aerial imagery has not been explored; a benchmark is needed to expose cross-domain performance bottlenecks. Method: Builds the first strict zero-shot OVD benchmark for aerial imagery on LAE-80C, with 3,592 images and 80 categories; designs three inference modes (Global, Oracle, Single-Category) to separate semantic confusion from localization error; evaluates five SOTA OVD models and tests prompt engineering strategies such as domain-specific prefixing and synonym expansion. Result: The best model, OWLv2, reaches only 27.6% F1 on LAE-80C with a 69% false positive rate; reducing the vocabulary to 3.2 classes improves performance 15x; prompt engineering is ineffective; performance varies widely across aerial datasets (F1 = 0.53 on DIOR, 0.12 on FAIR1M). Conclusion: Semantic confusion is the main cause of cross-domain failure in aerial OVD; current OVD models do not transfer directly to the aerial domain; domain-adaptive methods are needed. Abstract: Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.
[63] What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets
Jill P. Naiman,Daniel J. Evans,JooYoung Seo
Main category: cs.CV
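TL;DR: This paper proposes a new visual question answering (VQA) benchmark for scientific charts that emphasizes the absence of a one-to-one correspondence between chart marks and underlying data, challenging models to reason from the raw data. It releases an open-source dataset containing synthetic histograms, ground-truth data, generation parameters, and annotations.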
Details
Motivation: 现有VQA数据集多关注真实图像或简单图表,缺乏对复杂科学图表(尤其是图表标记与底层数据非一一映射)的建模,无法评估模型在真实数据分析场景下的推理能力。 Method: 通过调研现有VQA数据集指出其局限;生成基于真实分布参数的合成直方图;设计需依赖底层数据才能精确回答的问题;采集人类与大推理模型的答案并对比分析。 Result: 构建并开源了一个新型科学图表VQA数据集,包含图表图像、对应底层数据、生成参数、所有图表元素(标记/文本)的边界框标注。 Conclusion: 科学图表VQA需超越像素级理解,要求模型访问并推理底层数据;所提基准填补了该领域空白,为评估和推动LMM在科学可视化推理上的能力提供了新标准。 Abstract: Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.[64] Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Ken Deng,Yifu Qiu,Yoni Kasten,Shay B. Cohen,Yftah Ziser
Main category: cs.CV
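TL;DR: This paper studies the 3D spatial understanding of vision-language models (VLMs) through relative camera pose estimation (RCPE). VLMs rely heavily on 2D heuristics, perform poorly on key 3D motions such as depth changes and roll about the optical axis, fall far short of classic geometric methods and human performance, and show weak multi-image spatial reasoning.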
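For readers unfamiliar with the task, the quantity being estimated (relative rotation and translation between two cameras) can be written down directly when both absolute poses are known. The sketch below assumes the world-to-camera convention x_cam = R x_world + t and a camera whose optical axis is z; both the convention and the example poses are assumptions for illustration, not the benchmark's annotation format.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def apply_pose(rot, trans, point):
    """Map a point with x' = R x + t."""
    return [sum(rot[i][k] * point[k] for k in range(3)) + trans[i] for i in range(3)]

def rot_z(angle):
    """Rotation about the z axis (roll about the optical axis in this sketch)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def relative_pose(r1, t1, r2, t2):
    """Pose that maps camera-1 coordinates to camera-2 coordinates."""
    r_rel = matmul(r2, transpose(r1))
    t_rel = [t2[i] - sum(r_rel[i][k] * t1[k] for k in range(3)) for i in range(3)]
    return r_rel, t_rel

r1, t1 = rot_z(0.2), [0.1, 0.0, 1.0]
r2, t2 = rot_z(0.7), [0.0, 0.2, 1.5]
r_rel, t_rel = relative_pose(r1, t1, r2, t2)
roll = math.atan2(r_rel[1][0], r_rel[0][0])

# Consistency check: going world -> cam1 -> cam2 must equal world -> cam2.
world_point = [1.0, 2.0, 3.0]
via_cam1 = apply_pose(r_rel, t_rel, apply_pose(r1, t1, world_point))
direct = apply_pose(r2, t2, world_point)
print(f"relative roll: {roll:.3f} rad", [round(v, 3) for v in t_rel])
```

In this example the relative motion combines a roll about the optical axis with a translation, the kind of simultaneous motion the benchmark description mentions.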
Details
Motivation: Vision-Language Models (VLMs) 在2D感知和语义推理上表现良好,但对3D空间结构的理解有限;本文旨在系统探究这一差距,以相对相机姿态估计(RCPE)为切入点,因其是需联合推理平移与旋转的基础3D视觉任务。 Method: 构建两个新基准:VRRPI-Bench(基于无标签第一人称视频、含自然语言描述的相对运动标注,模拟真实多自由度运动场景)和VRRPI-Diag(诊断性基准,隔离单个运动自由度);在这些基准上系统评估主流VLMs在RCPE任务上的性能,并与经典几何方法及人类表现对比。 Result: 大多数VLMs无法超越浅层2D启发式方法,尤其在深度变化和绕光轴滚转上失败;SOTA模型GPT-5准确率仅0.64,显著低于几何基线(0.97)和人类(0.92);多图像推理能力弱,跨帧空间线索整合最佳仅59.7%。 Conclusion: VLMs在3D空间接地和多视角空间推理方面存在根本性局限,当前架构难以有效建模三维几何关系,亟需面向3D感知的建模改进。 Abstract: Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 ($0.64$) fall short of classic geometric baselines ($0.97$) and human performance ($0.92$). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best $59.7\%$) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.[65] Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning
Jian Shi,Michael Birsak,Wenqing Cui,Zhenyu Li,Peter Wonka
Main category: cs.CV
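TL;DR: This paper revisits the role of positional embeddings (PEs) in vision transformers from a geometric perspective. It argues that PEs act as geometric priors that shape the spatial structure of the representation, and uses token-level diagnostics to examine their effect on multi-view geometric consistency.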
Details
Motivation: The role of positional embeddings in vision transformers is not well understood, and systematic analysis from the perspective of geometric structure is lacking. Method: Proposes token-level diagnostics that quantify how multi-view geometric consistency depends on consistent positional embeddings, validated experimentally on 14 mainstream ViT models. Result: Experiments show that positional embeddings are a key causal mechanism affecting multi-view geometric consistency and spatial reasoning in ViT representations. Conclusion: Positional embeddings are not only sequence index markers but geometric priors that guide ViTs in learning spatial structure; their design directly affects the model's spatial understanding. Abstract: This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representation depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is provided at https://github.com/shijianjian/vit-geometry-probes
[66] Is Hierarchical Quantization Essential for Optimal Reconstruction?
Shirin Reyhanian,Laurenz Wiskott
Main category: cs.CV
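TL;DR: This paper examines whether a single-level VQ-VAE, once its representational budget is matched and codebook collapse is mitigated, can reach the reconstruction fidelity of a multi-level VQ-VAE such as VQ-VAE2. With suitable interventions the single-level model reaches equivalent reconstruction accuracy, challenging the assumption that hierarchical structure is inherently superior.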
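One of the interventions against codebook collapse, resetting inactive codebook vectors, can be sketched in a few lines. The paper's exact reset schedule and criteria are not given in this digest; the version below assumes a code is "inactive" if no encoder output in the current batch selected it, and re-initializes it from a randomly chosen encoder output. All vectors are toy values.

```python
import random

def nearest_code(vec, codebook):
    """Index of the codebook vector closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])))

def reset_inactive_codes(codebook, encoder_outputs, rng):
    """Replace codes that no encoder output selected with sampled encoder outputs."""
    usage = [0] * len(codebook)
    for vec in encoder_outputs:
        usage[nearest_code(vec, codebook)] += 1
    new_codebook = [list(code) for code in codebook]
    for k, count in enumerate(usage):
        if count == 0:  # assumed inactivity criterion
            new_codebook[k] = list(rng.choice(encoder_outputs))
    return new_codebook, usage

rng = random.Random(0)
codebook = [[0.0, 0.0], [10.0, 10.0], [-10.0, -10.0]]
encoder_outputs = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [-0.2, 0.1]]

new_codebook, usage = reset_inactive_codes(codebook, encoder_outputs, rng)
print("usage before reset:", usage)
print("codebook after reset:", new_codebook)
```

In this toy batch every encoder output maps to the first code, so the other two are unused; moving them onto the data gives them a chance to be selected in later steps, which is the mechanism that counters collapse.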
Details
Motivation: 质疑层级VQ-VAE(如VQ-VAE2)在重建保真度上是否真正优于单层模型,尤其在排除码本利用不足和表征容量差异干扰后。 Method: 对比两层VQ-VAE与容量匹配的单层VQ-VAE在ImageNet高分辨率图像上的重建性能;采用数据初始化、非活跃码本向量周期重置及超参系统调优等轻量干预缓解码本坍塌。 Result: 当表征预算匹配且码本坍塌被有效抑制时,单层VQ-VAE可达到与两层VQ-VAE相当的重建保真度。 Conclusion: 层级结构并非提升重建精度的必要条件;单层VQ-VAE在合理设计下足以媲美层级模型,重建性能差异主要源于码本利用与训练稳定性,而非层级本身。 Abstract: Vector-quantized variational autoencoders (VQ-VAEs) are central to models that rely on high reconstruction fidelity, from neural compression to generative pipelines. Hierarchical extensions, such as VQ-VAE2, are often credited with superior reconstruction performance because they split global and local features across multiple levels. However, since higher levels derive all their information from lower levels, they should not carry additional reconstructive content beyond what the lower-level already encodes. Combined with recent advances in training objectives and quantization mechanisms, this leads us to ask whether a single-level VQ-VAE, with matched representational budget and no codebook collapse, can equal the reconstruction fidelity of its hierarchical counterpart. Although the multi-scale structure of hierarchical models may improve perceptual quality in downstream tasks, the effect of hierarchy on reconstruction accuracy, isolated from codebook utilization and overall representational capacity, remains empirically underexamined. We revisit this question by comparing a two-level VQ-VAE and a capacity-matched single-level model on high-resolution ImageNet images. Consistent with prior observations, we confirm that inadequate codebook utilization limits single-level VQ-VAEs and that overly high-dimensional embeddings destabilize quantization and increase codebook collapse. We show that lightweight interventions such as initialization from data, periodic reset of inactive codebook vectors, and systematic tuning of codebook hyperparameters significantly reduce collapse. 
Our results demonstrate that when representational budgets are matched, and codebook collapse is mitigated, single-level VQ-VAEs can match the reconstruction fidelity of hierarchical variants, challenging the assumption that hierarchical quantization is inherently superior for high-quality reconstructions.
[67] VMonarch: Efficient Video Diffusion Transformers with Structured Attention
Cheng Liang,Haoxian Chen,Liang Hou,Qi Fan,Gangshan Wu,Xin Tao,Limin Wang
Main category: cs.CV
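TL;DR: This paper proposes VMonarch, an attention mechanism for video diffusion transformers (Video DiTs) based on Monarch matrices. Using a structured sparse representation and an alternating minimization algorithm, it substantially reduces computational cost while maintaining or even improving generation quality.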
Details
Motivation: 视频扩散Transformer中注意力机制的二次复杂度严重限制了上下文扩展能力;而其本身存在高度稀疏的时空注意力模式,可被Monarch矩阵自然建模。 Method: 1)设计时空Monarch分解以显式建模帧内与帧间相关性;2)引入重计算策略缓解交替最小化过程中的不稳定性;3)将新型在线熵算法融合进FlashAttention,支持长序列下快速Monarch矩阵更新。 Result: 在VBench上生成质量媲美或优于全注意力;注意力FLOPs降低17.5倍;长视频注意力计算加速超5倍;在90%稀疏度下超越现有稀疏注意力方法。 Conclusion: VMonarch有效突破Video DiT的注意力瓶颈,在效率与质量间取得更好平衡,为长视频生成提供了高效可行的新路径。 Abstract: The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.[68] Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes
Gonzalo Gomez-Nogales,Yicong Hong,Chongjian Ge,Marc Comino-Trinidad,Dan Casas,Yi Zhou
Main category: cs.CV
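TL;DR: This paper proposes C2R (Coarse-to-Real), a generative rendering framework in which coarse 3D simulations drive text-guided neural rendering to produce realistic, temporally consistent urban crowd videos without paired training data.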
Details
Motivation: 传统渲染管线在动态人群场景中面临可扩展性与真实感不足的问题,且依赖复杂资产、精确材质光照和大量算力。 Method: 采用两阶段混合CG-真实数据训练策略:先从大规模真实视频学习强生成先验,再通过跨域共享隐式时空特征引入可控性;以粗粒度3D渲染控制布局、相机运动和人物轨迹,由文本引导的神经渲染器生成外观、光照和细节动态。 Result: 系统支持粗到细控制,泛化于多种CG和游戏输入,仅需极简3D输入即可生成时序一致、可控且逼真的城市场景视频。 Conclusion: C2R为动态人群场景提供了一种高效、可控、高保真的生成式渲染新范式,降低了对高精度建模与计算资源的依赖。 Abstract: Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-phase mixed CG-real training strategy that learns a strong generative prior from large-scale real footage and introduces controllability through shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.[69] FlexMap: Generalized HD Map Construction from Flexible Camera Configurations
Run Wang,Chaoyi Zhou,Amir Salarpour,Xi Liu,Zhi-Qi Cheng,Feng Luo,Mert D. Pesé,Siyu Huang
Main category: cs.CV
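TL;DR: FlexMap is an HD map construction method that needs no explicit geometric projection and adapts to different vehicle camera configurations; it uses a geometry-aware foundation model and cross-frame attention to model the 3D scene implicitly, which makes it robust and practical to deploy.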
Details
Motivation: Existing HD map construction methods depend on calibrated multi-camera systems and on explicit or implicit 2D-to-bird's-eye-view (BEV) transformations, so they are fragile when sensors fail or camera configurations differ across a fleet.
Method: FlexMap comprises a spatial-temporal enhancement module (separating cross-view spatial reasoning from temporal dynamics) and a camera-aware decoder (with latent camera tokens that enable view-adaptive attention without projection matrices). It drops explicit geometric projection and relies on a geometry-aware foundation model with cross-frame attention to encode 3D understanding implicitly in feature space.
Result: Experiments show that FlexMap outperforms existing methods across multiple camera configurations and is robust to missing views and sensor variations.
Conclusion: Through architectural flexibility and geometry-aware implicit modeling, FlexMap improves the generalization and practicality of HD map construction for real autonomous driving fleets.
Abstract: High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap. Unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.
[70] Jailbreaks on Vision Language Model via Multimodal Reasoning
Aarush Noheria,Yuguang Yao
Main category: cs.CV
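TL;DR: This paper proposes a jailbreak framework that uses Chain-of-Thought (CoT) prompting and a ReAct-driven adaptive image noising mechanism to bypass the safety filters of vision-language models (VLMs).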
Details
Motivation: Vision-language models (VLMs) are highly sensitive to prompt variations, which exposes weaknesses in their safety alignment and calls for study of their potential vulnerabilities.
Method: A jailbreak framework built on post-training Chain-of-Thought (CoT) prompting, combined with a ReAct-based adaptive image noising mechanism that uses model feedback to iteratively perturb the image regions most likely to trigger safety defenses.
Result: Experiments show that the dual strategy significantly raises the attack success rate (ASR) while preserving naturalness in both the text and visual domains.
Conclusion: The work exposes shortcomings in current VLM safety mechanisms and offers new ideas and a practical attack benchmark for improving the robustness and safety of multimodal models.
Abstract: Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.
[71] EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture
Seth Donahue,Irina Djuraskovic,Kunal Shah,Fabian Sinz,Ross Chafetz,R. James Cotton
Main category: cs.CV
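TL;DR: This paper presents a probabilistic multi-view markerless motion capture (MMMC) model based on variational inference for video-based analysis of human gait, and evaluates the calibration of its confidence intervals with the Expected Calibration Error (ECE); the results show that the model is well calibrated, quantifies uncertainty reliably, and can flag the error of individual predictions.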
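To make the calibration metric in this summary concrete, here is a minimal sketch of one common way to compute an expected calibration error for predictive intervals: compare the nominal confidence level of each central interval with the fraction of ground-truth values that actually fall inside it. It assumes each prediction is Gaussian with a mean and standard deviation; the function name `interval_ece` and the chosen confidence levels are illustrative and are not taken from the paper.

```python
from statistics import NormalDist


def interval_ece(means, stds, targets, levels=(0.5, 0.68, 0.8, 0.9, 0.95)):
    """Mean absolute gap between nominal and empirical interval coverage.

    Assumes Gaussian predictive distributions (an illustrative assumption).
    """
    std_normal = NormalDist()
    gaps = []
    for level in levels:
        # Half-width of the central interval, in units of the predicted std.
        z = std_normal.inv_cdf(0.5 + level / 2.0)
        covered = sum(
            abs(y - m) <= z * s for m, s, y in zip(means, stds, targets)
        )
        empirical = covered / len(targets)
        gaps.append(abs(empirical - level))
    return sum(gaps) / len(gaps)
```

A value near 0 means the stated intervals cover the truth about as often as they claim; overconfident predictions (standard deviations that are too small) push the value up.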
Details
Motivation: Clinical use requires multi-view markerless motion capture systems that are both accurate and able to produce reliable confidence intervals, so that movement assessments can be trusted; existing methods cannot quantify the uncertainty of individual predictions.
Method: Build a probabilistic MMMC method that uses variational inference to estimate joint angle posterior distributions; validate it on data from 68 participants at two institutions against an instrumented walkway and standard marker-based motion capture; assess confidence interval calibration with the Expected Calibration Error (ECE) and analyze how predicted uncertainty correlates with observed error.
Result: ECE is generally below 0.1; median step and stride length errors are ~16 mm and ~12 mm, respectively; bias-corrected kinematic errors range from 1.5 to 3.8 degrees; predicted uncertainty correlates strongly with observed error.
Conclusion: The probabilistic model quantifies epistemic uncertainty and can identify unreliable outputs without concurrent ground-truth instrumentation, which gives it practical clinical potential.
Abstract: Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally < 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model's predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
[72] Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework
Shiyu Liu,Xinyi Wen,Zhibin Lan,Ante Wang,Jinsong Su
Main category: cs.CV
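TL;DR: This paper proposes a training-free self-validation framework that uses a language-prior-free verification mechanism to substantially reduce object hallucination by large vision-language models in image captioning.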
Details
Motivation: Existing LVLMs hallucinate objects heavily in image captioning, mainly because they over-rely on language priors, but earlier work has not analyzed this over-reliance in depth.
Method: Propose Language-Prior-Free Verification and build a training-free Self-Validation Framework on top of it: first verify object existence in sampled candidate captions, then suppress hallucination further through caption selection or aggregation.
Result: A 65.6% improvement on the CHAIRI metric with LLaVA-v1.5-7B, clearly ahead of previous SOTA methods.
Conclusion: The work points to a new path for mitigating hallucination by drawing on the inherent capabilities of LVLMs themselves, with no additional training.
Abstract: Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in the image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logits calibration. However, they still lack a thorough analysis of the over-reliance. To gain a deeper understanding of over-reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs' over-reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects' existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experiment results demonstrate that our framework mitigates object hallucination significantly in the image captioning task (e.g., 65.6% improvement on CHAIRI metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
[73] ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
Yudi Zhang,Yeming Geng,Lei Zhang
Main category: cs.CV
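TL;DR: This paper proposes ScribbleSense, which uses multimodal large language models (MLLMs) and image generation models to resolve ambiguous intent and unclear semantic location in scribble-based 3D texture editing, enabling intuitive and precise interactive texture editing.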
Details
Motivation: Existing scribble-based 3D texture editing methods struggle to infer the editing intent behind abstract scribble instructions, leave the target semantic location unclear, and offer little support for coarse-grained scribble interaction.
Method: ScribbleSense first uses an MLLM to interpret the visual and semantic intent of the scribble, then extracts local texture details from globally generated images to anchor local semantics onto the 3D model, reducing ambiguity in both intent and location.
Result: Experiments show that the method substantially improves scribble-driven interactive texture editing and reaches state-of-the-art performance.
Conclusion: By combining the visual understanding of MLLMs with the detail modeling of image generation models, ScribbleSense addresses semantic ambiguity and localization in scribble editing and advances intuitive, freeform 3D texture editing.
Abstract: Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
[74] Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector
Wenqiang Zu,Shenghao Xie,Bo Lei,Lei Ma
Main category: cs.CV
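TL;DR: This paper proposes a representation-alignment-based sampling guidance method for diffusion models that injects predicted representations as semantic anchors during denoising to reduce semantic drift in the early stages; it substantially lowers FID in class-conditional ImageNet generation and yields complementary gains with existing guidance methods such as CFG.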
Details
Motivation: Existing inference-time guidance methods such as CFG do not fully exploit the rich semantic structure in unsupervised visual representations; in addition, diffusion Transformers suffer from semantic drift in the early denoising stages, so outputs are inconsistent even under identical conditioning.
Method: Introduce a representation alignment projector and inject the feature representations it predicts into intermediate sampling steps as semantic anchors, without modifying the original model architecture.
Result: Validated on SiT and REPA models: the FID of REPA-XL/2 drops from 5.9 to 3.3; the method outperforms representative guidance on SiT; combining it with classifier-free guidance further improves semantic coherence and image fidelity.
Conclusion: Representation-informed diffusion sampling is a practical strategy that strengthens semantic preservation and image consistency and offers a new direction for controllable generation.
Abstract: Recent progress in generative modeling has enabled high-quality visual synthesis with diffusion-based frameworks, supporting controllable sampling and large-scale training. Inference-time guidance methods such as classifier-free and representative guidance enhance semantic alignment by modifying sampling dynamics; however, they do not fully exploit unsupervised feature representations. Although such visual representations contain rich semantic structure, their integration during generation is constrained by the absence of ground-truth reference images at inference. This work reveals semantic drift in the early denoising stages of diffusion transformers, where stochasticity results in inconsistent alignment even under identical conditioning. To mitigate this issue, we introduce a guidance scheme using a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps, providing an effective semantic anchor without modifying the model architecture. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis, achieving substantially lower FID scores; for example, REPA-XL/2 improves from 5.9 to 3.3, and the proposed method outperforms representative guidance when applied to SiT models. The approach further yields complementary gains when combined with classifier-free guidance, demonstrating enhanced semantic coherence and visual fidelity. These results establish representation-informed diffusion sampling as a practical strategy for reinforcing semantic preservation and image consistency.
[75] Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
Junfei Xie,Peng Pan,Xulong Zhang
Main category: cs.CV
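TL;DR: This paper proposes HAVC, a training-free visual cropping method that filters and refines attention heads to produce a reliable visual cropping guidance map, improving the visual grounding and reasoning of multimodal large language models on fine-grained visual question answering.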
Details
Motivation: Existing MLLMs are limited in fine-grained visual question answering (VQA) by low-resolution inputs and noisy attention aggregation, which weakens visual grounding and fine-grained reasoning.
Method: Head Aware Visual Cropping (HAVC) first uses an OCR-based diagnostic task to keep only the attention heads with genuine visual grounding ability. At inference it refines these heads with spatial entropy (for stronger spatial concentration) and gradient sensitivity (to assess predictive contribution), then fuses the signals into a visual cropping guidance map that is used to crop the key subimage passed to the MLLM.
Result: On multiple fine-grained VQA benchmarks, HAVC consistently outperforms existing cropping strategies and improves localization precision and visual grounding.
Conclusion: HAVC is a simple yet effective training-free method that improves the precision and robustness of MLLMs on fine-grained understanding tasks.
Abstract: Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
[76] PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization
Duncan McCain,Hossein Kashiani,Fatemeh Afghah
Main category: cs.CV
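TL;DR: This paper proposes PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that uses CLIP text prompts for semantic guidance and combines focal loss with a multi-scale Transformer-diffusion fusion segmentor, reaching SOTA pixel-level performance on MVTec-AD.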
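As a rough illustration of the focal loss mentioned in this summary, the sketch below implements the standard binary focal loss for per-pixel anomaly probabilities. The defaults `gamma=2.0` and `alpha=0.25` are commonly used values for focal loss in general, not settings reported for PromptMAD, and the function name is hypothetical.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over pixels.

    probs: predicted anomaly probabilities in [0, 1]
    labels: 1 for anomalous pixels, 0 for normal pixels
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        p_t = p if y == 1 else 1.0 - p       # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the loss of well-classified pixels.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` the modulating factor disappears and the loss reduces to class-weighted cross-entropy; larger `gamma` shifts more of the training signal toward pixels the model currently gets wrong, which is why it helps when anomalous pixels are rare.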
Details
Motivation: Multi-class visual anomaly detection is challenged by the diversity of object categories, the scarcity of anomalous samples, and camouflaged defects.
Method: The PromptMAD framework (1) uses CLIP-encoded, class-specific text prompts for normal and anomalous appearance to align vision and language and enrich reconstruction; (2) adds focal loss to counter pixel-level class imbalance; (3) uses a supervised segmentor that fuses multi-scale CNN features, Transformer spatial attention, and diffusion-based iterative refinement to produce high-resolution anomaly maps.
Result: SOTA pixel-level performance on the MVTec-AD dataset, with a mean AUC of 98.35% and an AP of 66.54%, while remaining efficient across categories.
Conclusion: Through semantic guidance and fine-grained modeling, PromptMAD improves multi-class unsupervised anomaly detection and localization and supports the value of cross-modal prompting and diffusion refinement for this task.
Abstract: Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate the focal loss function, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.
[77] MIRRORTALK: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control
Renjie Lu,Xulong Zhang,Xiaoyang Qu,Jianzong Wang,Shangfei Wang
Main category: cs.CV
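TL;DR: This paper proposes MirrorTalk, a framework that uses a conditional diffusion model and a Semantically-Disentangled Style Encoder (SDSE) to separate speaker style from semantic content, and applies a hierarchical modulation strategy to achieve accurate lip sync and personalized facial dynamics.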
Details
Motivation: Existing methods struggle to disentangle a speaker's individual talking style from semantic content, so personalized style transfer fails.
Method: MirrorTalk is built on a conditional diffusion model, adds a Semantically-Disentangled Style Encoder (SDSE) to extract pure style representations, and uses a hierarchical modulation strategy to fuse audio and style features dynamically during diffusion.
Result: Significant improvements over state-of-the-art methods in lip-sync accuracy and personalization preservation.
Conclusion: MirrorTalk disentangles style from content while delivering both lip-sync precision and expressive full-face dynamics, advancing personalized talking face synthesis.
Abstract: Synthesizing personalized talking faces that uphold and highlight a speaker's unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker's unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.
[78] DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation
Xin Jiang,Jingwen Chen,Yehao Li,Yingwei Pan,Kezhou Chen,Zechao Li,Ting Yao,Tao Mei
Main category: cs.CV
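TL;DR: This paper proposes DreamVAR, a subject-driven image synthesis framework built on a visual autoregressive (VAR) model that uses multi-scale feature extraction and reinforcement learning to improve semantic alignment and subject consistency, and preserves appearance better than leading diffusion models.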
Details
Motivation: Visual autoregressive (VAR) models offer a unified architecture and efficient inference, yet their potential for subject-driven image generation is underexplored; current diffusion models perform strongly in this area.
Method: DreamVAR first extracts multi-scale features of the reference subject with a visual tokenizer; it pre-fills the full subject feature sequence and then predicts target image tokens scale by scale (next-scale prediction), which mitigates the train-test discrepancy in multi-scale conditioning; reinforcement learning then jointly optimizes semantic alignment and subject consistency.
Result: Across experiments, DreamVAR preserves subject appearance better than leading diffusion-based methods.
Conclusion: DreamVAR shows that VAR models are effective and competitive for subject-driven image generation, and its pre-filled multi-scale conditioning and reinforcement learning strategy offer a new direction for the task.
Abstract: Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, our DreamVAR pre-fills the full subject feature sequence prior to predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy in the multi-scale conditioning scenario within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
[79] CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
Gyuwon Han,Young Kyun Jang,Chanho Eom
Main category: cs.CV
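TL;DR: This paper introduces CoVA, a new video retrieval task that combines visual and auditory information for composed video retrieval, together with the AV-Comp benchmark and the AVT model.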
Details
Motivation: Existing Composed Video Retrieval (CoVR) benchmarks consider only visual changes and ignore audio differences, which limits retrieval in practical use.
Method: Build AV-Comp, the first benchmark that supports cross-modal audio-visual changes, and propose the AVT (Audio-Visual-Text Compositional Fusion) model, which fuses multimodal features by selectively aligning the text query with the most relevant modality (visual or auditory).
Result: AVT clearly outperforms traditional unimodal fusion on the new CoVA task and serves as a strong baseline.
Conclusion: The CoVA task and the AV-Comp benchmark fill a gap in joint audio-visual composed retrieval, and AVT demonstrates the value of cross-modal alignment for the task.
Abstract: Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at https://perceptualai-lab.github.io/CoVA/.
[80] DNA: Uncovering Universal Latent Forgery Knowledge
Jingtong Dou,Chuancheng Shi,Yemin Wang,Shiming Guo,Anqi Yi,Wenhua Wu,Li Zhang,Fei Shen,Tat-Seng Chua
Main category: cs.CV
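TL;DR: This paper proposes the DNA framework, which surfaces the forgery detection capability already present in pre-trained models to detect AI-generated content efficiently and robustly without extensive fine-tuning.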
Details
Motivation: Existing methods rely on resource-intensive fine-tuning of black-box models, whereas the authors argue that forgery detection capability is already embedded in pre-trained models and only needs to be elicited rather than retrained.
Method: The Discriminative Neural Anchors (DNA) framework uses a coarse-to-fine excavation mechanism: it first locates the critical intermediate layers where the model's focus shifts from semantics to anomalies, then extracts forgery-discriminative units (FDUs) using a triadic fusion score and a curvature-truncation strategy. The authors also build HIFI-Gen, a high-fidelity synthetic benchmark.
Result: DNA outperforms existing methods under few-shot conditions and is robust across architectures and against unseen generative models.
Conclusion: Waking up the sensitive neurons already present in pre-trained models is more efficient and generalizes better than end-to-end fine-tuning.
Abstract: As generative AI achieves hyper-realism, superficial artifact detection has become obsolete. While prevailing methods rely on resource-intensive fine-tuning of black-box backbones, we propose that forgery detection capability is already encoded within pre-trained models rather than requiring end-to-end retraining. To elicit this intrinsic capability, we propose the discriminative neural anchors (DNA) framework, which employs a coarse-to-fine excavation mechanism. First, by analyzing feature decoupling and attention distribution shifts, we pinpoint critical intermediate layers where the focus of the model logically transitions from global semantics to local anomalies. Subsequently, we introduce a triadic fusion scoring metric paired with a curvature-truncation strategy to strip away semantic redundancy, precisely isolating the forgery-discriminative units (FDUs) inherently imprinted with sensitivity to forgery traces. Moreover, we introduce HIFI-Gen, a high-fidelity synthetic benchmark built upon the very latest models, to address the lag in existing datasets. Experiments demonstrate that by solely relying on these anchors, DNA achieves superior detection performance even under few-shot conditions. Furthermore, it exhibits remarkable robustness across diverse architectures and against unseen generative models, validating that waking up latent neurons is more effective than extensive fine-tuning.
[81] Can 3D point cloud data improve automated body condition score prediction in dairy cattle?
Zhou Tang,Jin Wang,Angelo De Castro,Yuxi Zhang,Victoria Bastos Primo,Ana Beatriz Montevecchio Bernardino,Gota Morota,Xu Wang,Ricardo C Chebel,Haipeng Yu
Main category: cs.CV
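TL;DR: This paper compares depth images and point cloud data for predicting body condition score (BCS) in dairy cattle and finds that depth image methods are more accurate and robust in most settings, while point cloud methods are more sensitive to noise and model architecture and show no consistent advantage.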
Details
Motivation: Conventional visual BCS scoring is subjective and labor-intensive; depth image-based computer vision methods exist, but the newer 3D point cloud methods have rarely been compared with them directly, so their practical advantage is unclear.
Method: Build depth image and point cloud BCS prediction models under four data settings (unsegmented raw data, segmented full body, segmented hindquarter, handcrafted features) and evaluate them with cow-level cross-validation on commercial farm data from 1,020 dairy cows.
Result: Depth image models are more accurate than point cloud models in the unsegmented and segmented full-body settings; the two perform comparably with hindquarter segmentation; accuracy drops for both with handcrafted features; point cloud models are more sensitive to noise and network architecture.
Conclusion: Under the evaluated conditions, 3D point cloud data do not provide a consistent advantage over depth images for BCS prediction in dairy cattle.
Abstract: Body condition score (BCS) is a widely used indicator of body energy status and is closely associated with metabolic status, reproductive performance, and health in dairy cattle; however, conventional visual scoring is subjective and labor-intensive. Computer vision approaches have been applied to BCS prediction, with depth images widely used because they capture geometric information independent of coat color and texture. More recently, three-dimensional point cloud data have attracted increasing interest due to their ability to represent richer geometric characteristics of animal morphology, but direct head-to-head comparisons with depth image-based approaches remain limited. In this study, we compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Prediction models were evaluated using data from 1,020 dairy cows collected on a commercial farm, with cow-level cross-validation to prevent data leakage. Depth image-based models consistently achieved higher accuracy than point cloud-based models when unsegmented raw data and segmented full-body data were used, whereas comparable performance was observed when segmented hindquarter data were used. Both depth image and point cloud approaches showed reduced accuracy when handcrafted feature data were employed compared with the other settings. Overall, point cloud-based predictions were more sensitive to noise and model architecture than depth image-based predictions. Taken together, these results indicate that three-dimensional point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions.
[82] SHED Light on Segmentation for Dense Prediction
Seung Hyun Lee,Sangwoo Mo,Stella X. Yu
Main category: cs.CV
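TL;DR: SHED is a new encoder-decoder architecture that enforces a geometric prior by explicitly incorporating segmentation into dense prediction. It uses bidirectional hierarchical reasoning to pool and unpool segment tokens, lets a segment hierarchy emerge without explicit segmentation supervision, and improves depth boundary sharpness, segment coherence, cross-domain generalization, semantic segmentation, and 3D reconstruction quality while revealing interpretable part-level structure.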
Details
Motivation: Real-world scenes are strongly structured, but existing dense prediction methods treat the task as independent per-pixel prediction, which leads to structural inconsistencies.
Method: The SHED architecture incorporates segmentation into dense prediction. Segment tokens are hierarchically pooled in the encoder and unpooled in the decoder, giving bidirectional hierarchical reasoning; supervision is applied only at the final output, so the segment hierarchy emerges on its own.
Result: Improved depth boundary sharpness, segment coherence, cross-domain generalization (synthetic to real environments), semantic segmentation performance, and 3D reconstruction quality, along with interpretable part-level structures that conventional pixel-wise methods often miss.
Conclusion: By explicitly modeling a geometric prior and hierarchical structure, SHED improves the structural consistency and generalization of dense prediction and provides a more robust, interpretable option for 3D perception and robotics.
Abstract: Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, existing methods treat it as an independent pixel-wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder-decoder architecture that enforces a geometric prior explicitly by incorporating segmentation into dense prediction. By bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence, while demonstrating strong cross-domain generalization from synthetic to real-world environments. Its hierarchy-aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part-level structures that are often missed by conventional pixel-wise methods.
[83] Hybrid Cross-Device Localization via Neural Metric Learning and Feature Fusion
Meixia Lin,Mingkai Liu,Shuxue Peng,Dikai Fan,Shengyu Gu,Xianliang Huang,Haoyang Ye,Xiao Liu
Main category: cs.CV
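TL;DR: This paper presents a hybrid cross-device localization pipeline for the CroCoDL 2025 Challenge that combines geometric and neural methods and adds neural-guided pruning and depth-conditioned refinement, substantially improving localization accuracy and recall.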
Details
Motivation: Address the limited robustness of geometric methods and the weak generalization of purely neural methods in cross-device localization, and improve localization performance on the HYDRO and SUCCU benchmarks.
Method: Build a shared retrieval encoder and combine a classical geometric branch (feature fusion plus PnP) with a neural feed-forward branch (MapAnything); add a neural-guided strategy for pruning candidate map frames and depth-conditioned refinement of scale and translation.
Result: A final score of 92.62 (R@0.5m, 5°) in the CroCoDL 2025 Challenge, with significant gains in recall and accuracy on the HYDRO and SUCCU benchmarks.
Conclusion: A hybrid geometric-neural architecture with multi-stage refinement improves the robustness and accuracy of cross-device localization and is a feasible option for practical deployment.
Abstract: We present a hybrid cross-device localization pipeline developed for the CroCoDL 2025 Challenge. Our approach integrates a shared retrieval encoder and two complementary localization branches: a classical geometric branch using feature fusion and PnP, and a neural feed-forward branch (MapAnything) for metric localization conditioned on geometric inputs. A neural-guided candidate pruning strategy further filters unreliable map frames based on translation consistency, while depth-conditioned localization refines metric scale and translation precision on Spot scenes. These components jointly lead to significant improvements in recall and accuracy across both HYDRO and SUCCU benchmarks. Our method achieved a final score of 92.62 (R@0.5m, 5°) during the challenge.
[84] Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction
Aditya Sarkar,Yi Li,Jiacheng Cheng,Shlok Mishra,Nuno Vasconcelos
Main category: cs.CV
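TL;DR: This paper proposes a training-free, plug-and-play selective prediction method for vision-language foundation models (PaPSP) and a memory-augmented variant (MA-PaPSP) that mitigates embedding instability and poorly calibrated similarity scores, achieving strong results across open-set and closed-set vision-language tasks.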
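The memory-augmentation idea in this summary can be pictured with a small sketch of k-nearest-neighbor embedding averaging: retrieve the stored embeddings most similar to a query and average them to reduce variance. This is an illustrative stand-in under stated assumptions, not the authors' implementation. The helper names, the use of cosine similarity, the choice to include the query itself in the average, and `k=3` are all assumptions, and the contrastive normalization step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def smoothed_embedding(query, memory, k=3):
    """Average the query with its k most similar memory embeddings."""
    neighbors = sorted(memory, key=lambda m: cosine(query, m), reverse=True)[:k]
    group = [query] + neighbors
    dim = len(query)
    return [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
```

Averaging over retrieved neighbors trades a little specificity for lower variance, which is the motivation the summary gives for augmenting the plug-and-play scorer with a retrieval set.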
Details
Motivation: Existing selective prediction methods mainly target closed-set tasks and adapt poorly to the open-set, unbounded-vocabulary settings that vision-language foundation models face; low-complexity, training-free, broadly applicable solutions are also lacking.
Method: Plug-and-Play Selective Prediction (PaPSP) builds on embeddings from an external VLM such as CLIP. To address embedding instability and poor similarity calibration, the memory-augmented MA-PaPSP averages over retrieved nearest-neighbor image-text pairs to reduce variance and applies contrastive normalization to improve score calibration.
Result: On selective captioning, image-text matching, and fine-grained classification, MA-PaPSP outperforms PaPSP and other baselines; the code is publicly available.
Conclusion: MA-PaPSP is a general, training-free, low-complexity selective prediction framework that improves the reliability and deployability of vision-language foundation models on open and closed tasks.
Abstract: Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.
[85] DELNet: Continuous All-in-One Weather Removal via Dynamic Expert Library
Shihong Liu,Kun Zuo,Hanguang Xiao
Main category: cs.CV
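TL;DR: This paper proposes DELNet, a continual learning framework for weather image restoration that uses a judging valve and a dynamic expert library to handle new and known degradation tasks adaptively, avoiding the frequent retraining that conventional methods require.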
Details
Motivation: Existing all-in-one weather image restoration methods depend on pre-collected data and must be retrained for unseen degradation types, which is costly and inflexible.
Method: DELNet contains a judging valve (which measures task similarity) and a dynamic expert library (which stores experts trained on different degradation types). For a new task, it selects the top-k experts for knowledge transfer and adds new experts; for a known task, it reuses the corresponding experts directly, so optimization continues without retraining.
Result: On the OTS, Rain100H, and Snow100K datasets, DELNet improves PSNR over existing continual learning methods by 16%, 11%, and 12%, respectively.
Conclusion: DELNet is efficient, robust, and practical; it lowers retraining cost and suits real-world deployment.
Abstract: All-in-one weather image restoration methods are valuable in practice but depend on pre-collected data and require retraining for unseen degradations, leading to high cost. We propose DELNet, a continual learning framework for weather image restoration. DELNet integrates a judging valve that measures task similarity to distinguish new from known tasks, and a dynamic expert library that stores experts trained on different degradations. For new tasks, the valve selects top-k experts for knowledge transfer while adding new experts to capture task-specific features; for known tasks, the corresponding experts are directly reused. This design enables continuous optimization without retraining existing models. Experiments on OTS, Rain100H, and Snow100K demonstrate that DELNet surpasses state-of-the-art continual learning methods, achieving PSNR gains of 16%, 11%, and 12%, respectively. These results highlight the effectiveness, robustness, and efficiency of DELNet, which reduces retraining cost and enables practical deployment in real-world scenarios.
[86] Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding
Yuansheng Gao,Jinman Zhao,Tong Zhang,Xingguo Xu,Han Bao,Zonghui Wang,Wenzhi Chen
Main category: cs.CV
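TL;DR: This paper proposes a new decoding strategy, Spatiotemporal-Semantic Contrastive Decoding, which constructs negative features that break spatiotemporal consistency and semantic associations and contrasts them with the original video features at inference time to reduce hallucination in video large language models more effectively.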
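The contrastive decoding step described here can be sketched as follows, using the logit-contrast form common in the contrastive decoding literature: amplify what the original features support and subtract what the perturbed features also support. The paper's exact scoring rule is not given in this digest, so the formula, the function names, and the `alpha` weight are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode(orig_logits, neg_logits, alpha=1.0):
    """Return next-token probabilities after contrasting two logit vectors.

    orig_logits: logits conditioned on the original video features
    neg_logits: logits conditioned on the perturbed (negative) features
    """
    adjusted = [
        (1.0 + alpha) * o - alpha * n for o, n in zip(orig_logits, neg_logits)
    ]
    return softmax(adjusted)
```

A token that stays likely even when the video evidence is disrupted is probably driven by priors rather than content, so it is down-weighted; setting `alpha=0` recovers ordinary decoding from the original logits.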
Details
Motivation: Existing decoding methods for mitigating video hallucination rely mostly on heuristic designs and fail to capture the root causes of hallucination and their fine-grained spatiotemporal and semantic correlations, so robustness and generalization suffer in complex scenarios.
Method: Spatiotemporal-Semantic Contrastive Decoding builds negative features by deliberately perturbing the spatiotemporal consistency and semantic associations of the video features, then performs contrastive decoding against the original video features at inference to suppress hallucination.
Result: Extensive experiments show that the method reduces hallucination while preserving the model's general video understanding and reasoning abilities.
Conclusion: Spatiotemporal-Semantic Contrastive Decoding is a more precise, robust, and generalizable way to mitigate video hallucination and offers a new direction for trustworthy inference in video large language models.
Abstract: Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
[87] PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
Xudong Lu,Huankang Guan,Yang Bo,Jinpeng Chen,Xintong Guo,Shuhan Li,Fang Liu,Peiwen Sun,Xueying Li,Wei Zhang,Xue Yang,Rui Liu,Hongsheng Li
Main category: cs.CV
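TL;DR: This paper presents PhoStream, the first mobile-centric streaming benchmark for multimodal understanding. It evaluates temporal reasoning and response-timing decisions over continuous audio-visual streams and reveals that current multimodal large language models are fundamentally limited in deciding when to speak rather than what to say.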
Details
Motivation: 现有基准多局限于选择题或短视频,难以评估移动助手中对持续音视频流的实时跟踪与适时响应能力;缺乏统一覆盖屏上与屏下场景、支持开放问答的流式评测基准。 Method: 构建了PhoStream基准:包含578个视频、5572组开放问答对,涵盖4种场景与10种能力;采用自动化生成流水线+人工校验构建数据;设计在线推理流水线与LLM-as-a-Judge评估开放回答。 Result: 实验发现模型在即时(Instant)和回溯(Backward)任务上表现优异(Gemini 3 Pro超80分),但在前向预测(Forward)任务上骤降至16.40分,主因是模型过早响应、未等待必要视听线索。 Conclusion: 当前多模态大语言模型的核心瓶颈在于缺乏对响应时机的判断能力,即‘何时说’的问题,而不仅是内容生成(‘说什么’);PhoStream为推动具身化、时序感知的移动智能助手提供了新基准与研究方向。 Abstract: Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/PhoStream.[88] Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model
Naeem Paeedeh,Mahardhika Pratama,Ary Shiddiqi,Zehong Cao,Mukesh Prasad,Wisnu Jatmiko
Main category: cs.CV
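TL;DR: This paper proposes MIFOMO, which builds on a remote sensing foundation model and adds coalescent projection, mixup domain adaptation, and label smoothing to address data scarcity, overfitting, and domain discrepancy in cross-domain few-shot hyperspectral image classification, significantly outperforming existing methods.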
Details
Motivation: Existing cross-domain few-shot learning methods rely on unrealistic data augmentation with external noise, involve many parameters and so overfit easily, and do not exploit foundation models with strong generalization ability. Method: The paper proposes the MIxup FOundation MOdel (MIFOMO), built on a foundation model pre-trained on large-scale remote sensing data. Coalescent projection (CP) enables fast downstream adaptation with the backbone frozen; mixup domain adaptation (MDM) mitigates extreme domain discrepancy; and label smoothing handles noise in pseudo-labels. Result: Experiments show that MIFOMO significantly outperforms prior methods, by up to 14%; the code is open source. Conclusion: By combining a foundation model with several lightweight and efficient adaptation strategies, MIFOMO addresses the key challenges of cross-domain few-shot hyperspectral image classification and shows strong generalization and practicality. Abstract: Although cross-domain few-shot learning (CDFSL) for hyperspectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, thus greatly simplifying the issue of data scarcity. They also involve a large number of parameters for model updates and are therefore prone to overfitting. To the best of our knowledge, none has explored the strength of a foundation model, whose strong generalization power allows it to be quickly adapted to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classification. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre-trained across a large scale of RS problems, thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, label smoothing is implemented to cope with noisy pseudo-labels. Our rigorous experiments demonstrate the advantage of MIFOMO, which outperforms prior art by a margin of up to 14%. The source code of MIFOMO is available at https://github.com/Naeem-Paeedeh/MIFOMO for reproducibility and convenient further study.
[89] FOTBCD: A Large-Scale Building Change Detection Benchmark from French Orthophotos and Topographic Data
Abdelrrahman Moubane
Main category: cs.CV
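TL;DR: This paper introduces FOTBCD, a large-scale, geographically diverse building change detection dataset covering 28 French departments with binary and instance-level annotations, and validates its advantage for cross-domain generalization.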
Details
Motivation: 现有建筑变化检测基准数据集地理范围受限(如单个城市),难以评估模型在地理域偏移下的泛化能力;亟需大规模、地理多样、高质量的基准数据集。 Method: 构建了FOTBCD数据集,包括FOTBCD-Binary(约2.8万对影像+二值变化掩膜)和FOTBCD-Instances(数千对实例级标注子集);采用地理隔离策略划分训练/验证/测试集(25个省训练,3个地理不相交省测试);使用固定基线模型在FOTBCD-Binary、LEVIR-CD+和WHU-CD上进行跨域性能对比评估。 Result: 实证表明,FOTBCD在地理多样性方面显著优于现有数据集,提升了模型在跨域场景下的建筑变化检测泛化能力;发布的数据集已公开,支持大规模基准测试与研究。 Conclusion: 地理多样性是提升建筑变化检测模型跨域泛化能力的关键因素;FOTBCD为该任务提供了首个大规模、权威、地理分散且严格验证的基准数据集。 Abstract: We introduce FOTBCD, a large-scale building change detection dataset derived from authoritative French orthophotos and topographic building data provided by IGN France. Unlike existing benchmarks that are geographically constrained to single cities or limited regions, FOTBCD spans 28 departments across mainland France, with 25 used for training and three geographically disjoint departments held out for evaluation. The dataset covers diverse urban, suburban, and rural environments at 0.2m/pixel resolution. We publicly release FOTBCD-Binary, a dataset comprising approximately 28,000 before/after image pairs with pixel-wise binary building change masks, each associated with patch-level spatial metadata. The dataset is designed for large-scale benchmarking and evaluation under geographic domain shift, with validation and test samples drawn from held-out departments and manually verified to ensure label quality. In addition, we publicly release FOTBCD-Instances, a publicly available instance-level annotated subset comprising several thousand image pairs, which illustrates the complete annotation schema used in the full instance-level version of FOTBCD. Using a fixed reference baseline, we benchmark FOTBCD-Binary against LEVIR-CD+ and WHU-CD, providing strong empirical evidence that geographic diversity at the dataset level is associated with improved cross-domain generalization in building change detection.[90] TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction
Zhijie Zheng,Xinhao Xiang,Jiawei Zhang
Main category: cs.CV
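TL;DR: This paper proposes TTSA3R, a training-free framework that combines temporal state evolution with spatial observation quality to adaptively update state representations in 3D reconstruction, substantially alleviating catastrophic forgetting in streaming recurrent models over long sequences.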
Details
Motivation: 流式循环模型在长序列3D重建中易发生灾难性记忆遗忘,现有基于注意力的自适应方法仅考虑单一维度,忽视时间与空间一致性。 Method: 提出TTSA3R框架,包含时间自适应更新模块(分析时间状态演化以调节更新幅度)和空间上下文更新模块(通过观测-状态对齐与场景动态定位需更新的空间区域),并融合二者信号决定更新策略。 Result: 在多种3D任务上验证了有效性;在扩展序列上误差仅增加15%,远优于基线模型超200%的性能退化,显著提升长期重建稳定性。 Conclusion: TTSA3R是一种高效、训练无关的自适应状态更新框架,兼顾时间演化与空间一致性,有效缓解长序列下的记忆遗忘问题,提升3D重建鲁棒性与稳定性。 Abstract: Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic memory forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments demonstrate the effectiveness of TTSA3R in diverse 3D tasks. Moreover, our method exhibits only 15% error increase compared to over 200% degradation in baseline models on extended sequences, significantly improving long-term reconstruction stability. Our codes will be available soon.[91] UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating
Xing Yi,Jinyang Huang,Feng-Qi Cui,Anyang Tong,Ruimin Wang,Liu Liu,Dan Guo
Main category: cs.CV
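TL;DR: This paper proposes UniGeo, a unified 3D indoor detection framework that uses a geometry-aware learning module and a dynamic channel gating mechanism to strengthen geometric relationship modeling and key feature representation in sparse point clouds; its performance is validated on six indoor datasets.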
Details
Motivation: Existing methods for joint training across multiple datasets fail to model geometric relationships in sparse point cloud scenes and ignore the feature distribution in important regions, which limits detection performance. Method: The UniGeo framework has two parts: 1) a geometry-aware learning module that establishes a learnable mapping from spatial relationships to feature weights, explicitly enhancing geometric features; and 2) a dynamic channel gating mechanism that applies learnable channel-wise weighting to the features output by the sparse 3D U-Net. Result: Extensive experiments on six datasets of different indoor scenes show that UniGeo significantly outperforms existing methods. Conclusion: By explicitly modeling geometric relationships and adaptively enhancing features, UniGeo improves 3D indoor point cloud detection and offers a new approach to unified multi-dataset training. Abstract: The growing adoption of robotics and augmented reality in real-world applications has driven considerable research interest in 3D object detection based on point clouds. While previous methods address unified training across multiple datasets, they fail to model geometric relationships in sparse point cloud scenes and ignore the feature distribution in significant areas, which ultimately restricts their performance. To deal with this issue, a unified 3D indoor detection framework, called UniGeo, is proposed. To model geometric relations in scenes, we first propose a geometry-aware learning module that establishes a learnable mapping from spatial relationships to feature weights, which enables explicit geometric feature enhancement. Then, to further enhance point cloud feature representation, we propose a dynamic channel gating mechanism that leverages learnable channel-wise weighting. This mechanism adaptively optimizes features generated by the sparse 3D U-Net network, significantly enhancing key geometric information. Extensive experiments on six different indoor scene datasets clearly validate the superior performance of our method.
[92] LINA: Linear Autoregressive Image Generative Models with Continuous Tokens
Jiahao Wang,Ting Pan,Haoge Deng,Dongchen Han,Taiqiang Wu,Xinlong Wang,Ping Luo
Main category: cs.CV
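TL;DR: This paper proposes LINA, an efficient text-to-image generation framework built on linear attention. By refining the normalization scheme, adding depthwise convolution for locality, and introducing a KV gate, it substantially reduces compute cost while retaining high-fidelity image generation.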
Details
Motivation: 自回归连续token视觉生成(尤其是T2I)因计算成本高而受限,亟需设计计算高效的线性注意力机制。 Method: 系统分析不同线性注意力设计(如除法/减法归一化、深度卷积增强局部性),扩展门控机制至双向设置并提出KV门,最终构建纯线性注意力T2I模型LINA。 Result: LINA在ImageNet上FID达2.18(1.4B参数),GenEval上达0.74(1.5B参数);单个线性注意力模块比Softmax注意力减少约61% FLOPs。 Conclusion: 除法归一化更适配线性生成Transformer;卷积增强局部性对自回归生成至关重要;KV门支持灵活记忆管理;LINA实现了高性能与高效率的统一。 Abstract: Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis, but they suffer from high computational cost. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter counts under different design choices, focusing on (1) normalization paradigms in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By introducing data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining 2.18 FID on ImageNet (about 1.4B parameters) and 0.74 on GenEval (about 1.5B parameters). 
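As a rough illustration of the division-based normalization discussed in this abstract, the sketch below implements plain bidirectional linear attention in pure Python. It is an assumption-laden toy, not LINA itself: the feature map `elu(x) + 1` is a common choice from the linear-attention literature rather than something stated here, the function names are hypothetical, and the KV gate and depthwise convolution are omitted.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Bidirectional linear attention with division-based normalization.

    The key-value summary and the key sum are accumulated once, so the cost
    grows linearly with the number of tokens instead of quadratically.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]   # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                      # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]

    outputs = []
    for q in queries:
        fq = feature_map(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k))  # division-based normalizer
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / denom for b in range(d_v)]
        )
    return outputs
```

Because the features are positive and the output is divided by their sum, each output row is a convex combination of the value vectors, which mirrors the role of softmax weights in standard attention.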
A single linear attention module reduces FLOPs by about 61 percent compared to softmax attention. Code and models are available at: https://github.com/techmonsterwang/LINA.
[93] What can Computer Vision learn from Ranganathan?
Mayukh Bagchi,Fausto Giunchiglia
Main category: cs.CV
TL;DR: 本文提出将S.R. Ranganathan的分类学原理适配应用于计算机视觉(CV)领域,以解决视觉与语义之间的语义鸿沟问题(SGP),并支撑vTelos CV标注方法的设计与验证。
Details
Motivation: To address the Semantic Gap Problem (SGP) in computer vision, caused by the mismatch between visual and lexical semantics, and thereby improve CV dataset design and benchmark quality. Method: S.R. Ranganathan's classification principles are suitably adapted to build the vTelos CV annotation methodology, which is then validated experimentally. Result: Experiments show improvements in both CV annotation quality and model accuracy. Conclusion: Ranganathan's classification principles offer a theoretical basis and a practical path for mitigating the semantic gap and building high-quality CV datasets; the vTelos methodology receives preliminary validation. Abstract: The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics, leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R. Ranganathan can offer a principled starting point to address SGP and design high-quality CV datasets. We elucidate how these principles, suitably adapted, underpin the vTelos CV annotation methodology. The paper also briefly presents experimental evidence showing improvements in CV annotation and accuracy, thereby validating vTelos.
[94] Unsupervised Synthetic Image Attribution: Alignment and Disentanglement
Zongfang Liu,Guangyi Chen,Boyang Sun,Tongliang Liu,Kun Zhang
Main category: cs.CV
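TL;DR: This paper proposes Alignment and Disentanglement, an unsupervised synthetic image attribution method that needs no paired annotations. It combines contrastive self-supervised alignment with Infomax-based disentanglement and, surprisingly, outperforms supervised methods on the real-world benchmark AbC.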
Details
Motivation: 现有合成图像归因方法依赖于带标注的合成图像与其原始训练源的配对数据,但获取这类配对监督成本高昂且困难;因此亟需无需配对标注的无监督解决方案。 Method: 提出无监督方法Alignment and Disentanglement:首先利用对比自监督学习(如MoCo、DINO)实现基础概念对齐;再引入Infomax损失促进表征解耦;并基于跨协方差的理论假设,从典型相关分析(CCA)目标分解角度给出理论解释。 Result: 在真实世界基准AbC上,所提无监督方法性能意外优于现有有监督方法。 Conclusion: 无监督合成图像归因是可行且有效的;对比自监督模型天然具备跨域对齐能力,结合解耦可有效逼近概念匹配;该工作为该挑战性任务提供了新视角和起点。 Abstract: As the quality of synthetic images improves, identifying the underlying concepts of model-generated images is becoming increasingly crucial for copyright protection and ensuring model transparency. Existing methods achieve this attribution goal by training models using annotated pairs of synthetic images and their original training sources. However, obtaining such paired supervision is challenging, as it requires either well-designed synthetic concepts or precise annotations from millions of training sources. To eliminate the need for costly paired annotations, in this paper, we explore the possibility of unsupervised synthetic image attribution. We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model's attribution ability by promoting representation disentanglement with the Infomax loss. This approach is motivated by an interesting observation: contrastive self-supervised models, such as MoCo and DINO, inherently exhibit the ability to perform simple cross-domain alignment. By formulating this observation as a theoretical assumption on cross-covariance, we provide a theoretical explanation of how alignment and disentanglement can approximate the concept-matching process through a decomposition of the canonical correlation analysis objective. On the real-world benchmarks, AbC, we show that our unsupervised method surprisingly outperforms the supervised methods. 
As a starting point, we expect our intuitive insights and experimental findings to provide a fresh perspective on this challenging task.
[95] ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding
Junyi Hu,Tian Bai,Fengyi Wu,Wenyan Li,Zhenming Peng,Yi Zhang
Main category: cs.CV
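TL;DR: This paper proposes ExpAlign, an open-vocabulary vision-language alignment framework based on multiple instance learning (MIL). An Expectation Alignment Head performs implicit token and region selection, and an energy-based multi-scale consistency regularization is added; the method achieves state-of-the-art performance on open-vocabulary detection and zero-shot instance segmentation.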
Details
Motivation: 现有方法在弱监督下难以实现细粒度、准确的视觉-语言对齐:全局句子嵌入缺乏表达力,而显式token级对齐又依赖额外标注或计算昂贵的跨注意力设计。 Method: 提出ExpAlign框架:1)基于理论支撑的多实例学习(MIL)建模;2)设计期望对齐头(Expectation Alignment Head),通过注意力机制对token-区域相似性进行软MIL池化,实现无标注的隐式token/实例选择;3)构建能量驱动的多尺度一致性正则化,包括Top-K多正例对比损失和基于拉格朗日约束自由能最小化的几何感知一致性目标。 Result: ExpAlign在开放词汇检测与零样本实例分割任务上显著提升,尤其在长尾类别上效果突出;在LVIS minival上达到36.2 AP$_r$,优于同规模SOTA方法,同时保持轻量与推理高效。 Conclusion: ExpAlign为弱监督下的开放词汇视觉-语言对齐提供了理论严谨、高效实用的新范式,兼顾细粒度对齐能力与模型效率。 Abstract: Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.[96] VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
Hanxun Yu,Wentong Li,Xuan Qu,Song Wang,Junbo Chen,Jianke Zhu
Main category: cs.CV
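TL;DR: This paper proposes VisionTrim, a training-free acceleration framework for multimodal large language models (MLLMs). Its two plug-and-play modules, DVTS and TGVC, reduce visual tokens while preserving textual alignment and performance.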
Details
Motivation: Existing MLLMs generate too many visual tokens for high-resolution images and videos, which makes computation expensive; existing token compression methods often optimize components in isolation and neglect textual alignment, which degrades performance. Method: The VisionTrim framework contains two training-free modules: 1) Dominant Vision Token Selection (DVTS), which preserves essential visual tokens via a global-local view; and 2) Text-Guided Vision Complement (TGVC), which uses textual cues for context-aware token merging. Result: VisionTrim is validated on a range of image and video multimodal benchmarks, where it substantially improves inference efficiency without a notable loss in performance. Conclusion: VisionTrim is an efficient, general, plug-and-play acceleration scheme for MLLMs that helps advance their practical deployment in real-world scenarios. Abstract: Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
[97] Fire on Motion: Optimizing Video Pass-bands for Efficient Spiking Action Recognition
Shuhan Ye,Yuanbin Qian,Yi Yu,Chong Wang,Yuqi Xie,Jiazhen Xu,Kun Wang,Xudong Jiang
Main category: cs.CV
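TL;DR: This paper proposes the Pass-Bands Optimizer (PBO), a module that optimizes the temporal pass-band of SNNs to strengthen their response to motion information. It substantially improves performance on video tasks such as action recognition and anomaly detection while keeping computational overhead low and remaining plug-and-play.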
Details
Motivation: Existing spiking neural networks (SNNs) lag behind artificial neural networks (ANNs) on dynamic video tasks. The root cause is that standard spiking dynamics act as a temporal low-pass filter, which suppresses the motion bands that carry task-relevant information. Method: The paper proposes the Pass-Bands Optimizer (PBO), a lightweight plug-and-play module that introduces only two learnable parameters and a semantic consistency constraint, actively suppressing static components and strengthening motion-related spiking activity. Result: PBO yields a gain of over 10 percentage points on UCF101, and consistent, significant gains on more complex tasks such as multi-modal action recognition and weakly supervised video anomaly detection. Conclusion: PBO offers a new perspective on SNN-based video understanding and shows that tuning the temporal pass-band can unlock the potential of SNNs on dynamic tasks without changing the network architecture. Abstract: Spiking neural networks (SNNs) have gained traction in vision due to their energy efficiency, bio-plausibility, and inherent temporal processing. Yet, despite this temporal capacity, most progress concentrates on static image benchmarks, and SNNs still underperform on dynamic video tasks compared to artificial neural networks (ANNs). In this work, we diagnose a fundamental pass-band mismatch: standard spiking dynamics behave as a temporal low-pass filter that emphasizes static content while attenuating motion-bearing bands, where task-relevant information concentrates in dynamic tasks. This phenomenon explains why SNNs can approach ANNs on static tasks yet fall behind on tasks that demand richer temporal understanding. To remedy this, we propose the Pass-Bands Optimizer (PBO), a plug-and-play module that optimizes the temporal pass-band toward task-relevant motion bands. PBO introduces only two learnable parameters and a lightweight consistency constraint that preserves semantics and boundaries, incurring negligible computational overhead and requiring no architectural changes. PBO deliberately suppresses static components that contribute little to discrimination, effectively high-passing the stream so that spiking activity concentrates on motion-bearing content. On UCF101, PBO yields an improvement of over ten percentage points. On more complex multi-modal action recognition and weakly supervised video anomaly detection, PBO delivers consistent and significant gains, offering a new perspective for SNN-based video processing and understanding.
[98] Visual Personalization Turing Test
Rameen Abdal,James Burgess,Sergey Tulyakov,Kuan-Chieh Jackson Wang
Main category: cs.CV
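TL;DR: This paper proposes the Visual Personalization Turing Test (VPTT) for evaluating contextual visual personalization, emphasizing perceptual indistinguishability over identity replication. It also builds a 10k-persona benchmark, a visual retrieval-augmented generator (VPRAG), and a calibrated text-only VPTT Score.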
Details
Motivation: Existing visual personalization methods mostly aim for exact identity replication, whereas practical applications need content that is perceptually hard to distinguish from what the user would actually produce; a reliable, scalable, and privacy-preserving evaluation paradigm is lacking. Method: The paper proposes the VPTT evaluation paradigm and builds VPTT-Bench (10k personas), VPRAG (a visual retrieval-augmented generation model), and the VPTT Score (a calibrated text-based metric), and verifies their consistency against human and VLM judgments. Result: The VPTT Score correlates strongly with human and VLM judgments; VPRAG achieves the best balance between alignment and originality and is scalable and privacy-safe. Conclusion: VPTT provides a new perception-centric evaluation standard for visual personalization, and VPRAG and the VPTT Score together form a privacy-safe, scalable foundation for personalized generative AI. Abstract: We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment-originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.
[99] OOVDet: Low-Density Prior Learning for Zero-Shot Out-of-Vocabulary Object Detection
Binyi Su,Chenghao Huang,Haiyong Chen
Main category: cs.CV
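TL;DR: This paper proposes OOVDet, a zero-shot open-set detection framework. It synthesizes OOV prompts from low-density regions and mines pseudo-OOV samples via Dirichlet-based gradient attribution, then builds an OOV decision boundary under a low-density prior constraint, so that known classes are recognized and unknown classes rejected more reliably in the zero-shot setting.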
Details
Motivation: 现有零样本开词汇检测方法易对已知类过拟合,导致未知类被高置信度误判为已知类,缺乏对OOV数据分布的先验建模能力。 Method: 1)在隐空间中基于类别条件高斯分布的低似然区域采样,合成区域级OOV提示;2)设计基于Dirichlet的梯度归因机制,将归因梯度解释为Dirichlet证据以估计预测不确定性,并选取高不确定性样本作为伪OOV图像;3)利用高斯核密度估计施加低密度先验约束,构建OOV决策边界。 Result: 实验表明该方法显著提升了零样本场景下的OOV检测性能。 Conclusion: OOVDet通过显式建模隐空间低密度区域与不确定性驱动的伪OOV挖掘,有效缓解了零样本设定下对已知类的过拟合问题,实现了更鲁棒的开集识别。 Abstract: Zero-shot out-of-vocabulary detection (ZS-OOVD) aims to accurately recognize objects of in-vocabulary (IV) categories provided at zero-shot inference, while simultaneously rejecting undefined ones (out-of-vocabulary, OOV) that lack corresponding category prompts. However, previous methods are prone to overfitting the IV classes, leading to the OOV or undefined classes being misclassified as IV ones with a high confidence score. To address this issue, this paper proposes a zero-shot OOV detector (OOVDet), a novel framework that effectively detects predefined classes while reliably rejecting undefined ones in zero-shot scenes. Specifically, due to the model's lack of prior knowledge about the distribution of OOV data, we synthesize region-level OOV prompts by sampling from the low-likelihood regions of the class-conditional Gaussian distributions in the hidden space, motivated by the assumption that unknown semantics are more likely to emerge in low-density areas of the latent space. For OOV images, we further propose a Dirichlet-based gradient attribution mechanism to mine pseudo-OOV image samples, where the attribution gradients are interpreted as Dirichlet evidence to estimate prediction uncertainty, and samples with high uncertainty are selected as pseudo-OOV images. Building on these synthesized OOV prompts and pseudo-OOV images, we construct the OOV decision boundary through a low-density prior constraint, which regularizes the optimization of OOV classes using Gaussian kernel density estimation in accordance with the above assumption. 
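The idea of sampling from low-likelihood regions of a class-conditional Gaussian can be sketched as below. This is only a toy under stated assumptions, not the OOVDet procedure: it uses a diagonal covariance, draws candidates from the fitted Gaussian, and keeps the least likely ones; the function names and the `num_candidates`, `keep`, and `seed` parameters are hypothetical.

```python
import math
import random


def fit_diag_gaussian(features):
    """Fit a per-dimension mean and variance (diagonal covariance is an assumption)."""
    n, d = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(d)]
    var = [sum((f[i] - mean[i]) ** 2 for f in features) / n + 1e-6 for i in range(d)]
    return mean, var


def log_likelihood(x, mean, var):
    """Log-density of x under the diagonal Gaussian."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def sample_low_likelihood(mean, var, num_candidates=1000, keep=10, seed=0):
    """Draw candidates from the Gaussian and keep the `keep` least likely ones."""
    rng = random.Random(seed)
    candidates = [
        [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
        for _ in range(num_candidates)
    ]
    candidates.sort(key=lambda x: log_likelihood(x, mean, var))  # least likely first
    return candidates[:keep]
```

The retained samples sit in the tails of the class distribution, which is where the abstract assumes unknown semantics are more likely to appear; in the paper these would serve as synthesized region-level OOV prompts in the hidden space.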
Experimental results show that our method significantly improves the OOV detection performance in zero-shot scenes. The code is available at https://github.com/binyisu/OOV-detector.
[100] PEAR: Pixel-aligned Expressive humAn mesh Recovery
Jiahao Wu,Yunfei Liu,Lijian Lin,Ye Zhu,Lei Zhu,Jingyi Li,Yu Li
Main category: cs.CV
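TL;DR: This paper proposes PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery that infers SMPLX and FLAME parameters in real time at over 100 FPS and substantially improves the reconstruction accuracy of fine-grained pose and facial expression.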
Details
Motivation: Existing SMPLX-based methods suffer from slow inference, coarse pose estimation, and misalignment or unnatural artifacts in the face and hands, which makes them hard to apply to downstream tasks. Method: A unified, lightweight ViT-based model recovers coarse 3D human geometry quickly; pixel-level supervision refines geometric detail; and a modular data annotation strategy enriches the training data and improves model robustness. Result: PEAR substantially improves pose estimation accuracy on multiple benchmark datasets, supports joint inference of SMPLX and scaled-FLAME parameters at over 100 FPS, and requires no preprocessing. Conclusion: PEAR addresses three bottlenecks of single-image human mesh recovery (speed, fine-grained localization, and expression modeling) and offers a practical approach to real-time, high-fidelity human modeling. Abstract: Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches.
Project page: https://wujh2001.github.io/PEAR
[101] Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding
Tae Hun Kim,Hyun Gyu Lee
Main category: cs.CV
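TL;DR: This paper proposes Bi-MCQ, a framework that reformulates vision-language alignment as a conditional semantic comparison problem. Using bi-directional multiple-choice learning and direction-specific cross-attention modules, it substantially improves how medical VLMs understand negated clinical statements.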
Details
Motivation: Existing vision-language models are weak at understanding negated clinical statements because their contrastive alignment objectives treat negation as a minor linguistic variation rather than a meaning-inverting operation; in multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive alignments, which limits learning of disease absence. Method: The paper proposes a bi-directional multiple-choice learning framework (Bi-MCQ) that jointly trains Image-to-Text and Text-to-Image multiple-choice tasks with affirmative, negative, and mixed prompts, and introduces direction-specific cross-attention fusion modules to support bi-directional reasoning and reduce alignment interference. Result: On ChestXray14, Open-I, CheXpert, and PadChest, Bi-MCQ improves on the zero-shot performance of CARZero by up to 0.47 AUC, achieves an absolute gain of up to 0.08 on PNC evaluation, and reduces the affirmative-negative AUC gap by 0.12 on average. Conclusion: Reformulating the alignment objective as conditional semantic comparison substantially strengthens negation understanding in medical vision-language models, which confirms the importance of objective design in medical VLMs. Abstract: Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework (Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific Cross-Attention fusion modules to address asymmetric cues required by bi-directional reasoning and reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation.
Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.
[102] DAVIS: OOD Detection via Dominant Activations and Variance for Increased Separation
Abid Hassan,Tuan Ngo,Saad Shafiq,Nenad Medvidovic
Main category: cs.CV
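TL;DR: This paper proposes DAVIS, which enriches the feature representation obtained after global average pooling (GAP) with statistics such as channel-wise variance and maximum activation, substantially improving OOD detection performance.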
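To make the information loss from GAP concrete, the sketch below computes the per-channel mean (what GAP keeps) alongside the per-channel variance and maximum. It is a hedged toy: the function name `channel_statistics` is hypothetical, and simply concatenating the three statistics is an assumption, since this entry does not say how DAVIS combines them or feeds them to a detector.

```python
def channel_statistics(activation):
    """Return per-channel [means..., variances..., maxima...] for an activation map.

    `activation` is a list of channels, each an H x W nested list of floats.
    The means are exactly what global average pooling would produce.
    """
    means, variances, maxima = [], [], []
    for channel in activation:
        flat = [x for row in channel for x in row]
        mean = sum(flat) / len(flat)
        means.append(mean)
        variances.append(sum((x - mean) ** 2 for x in flat) / len(flat))
        maxima.append(max(flat))
    return means + variances + maxima


# Two channels with the same mean but very different spatial distributions.
activation = [
    [[0.0, 2.0], [2.0, 0.0]],   # peaky channel
    [[1.0, 1.0], [1.0, 1.0]],   # flat channel
]
stats = channel_statistics(activation)
```

Both channels pool to the same mean of 1.0, so GAP alone cannot tell them apart, while the variance (1.0 vs. 0.0) and maximum (2.0 vs. 1.0) do.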
Details
Motivation: Most existing post-hoc OOD detection methods rely on features taken after global average pooling (GAP), but GAP discards important distributional statistics of the activation maps, such as channel-wise variance and maximum activation, that are potentially valuable for OOD discrimination. Method: DAVIS is a simple, general post-hoc technique that explicitly adds channel-wise variance and maximum activation statistics to the GAP features to compensate for the information GAP loses. Result: DAVIS achieves state-of-the-art results across architectures including ResNet, DenseNet, and EfficientNet, reducing FPR95 by 48.26% on CIFAR-10 (ResNet-18), 38.13% on CIFAR-100 (ResNet-34), and 26.83% on ImageNet-1k (MobileNet-v2). Conclusion: Higher-order statistics of activation maps, not just the mean, matter for OOD detection; DAVIS demonstrates the value of modeling beyond the GAP mean and offers a more principled basis for feature design in OOD detection. Abstract: Detecting out-of-distribution (OOD) inputs is a critical safeguard for deploying machine learning models in the real world. However, most post-hoc detection methods operate on penultimate feature representations derived from global average pooling (GAP), a lossy operation that discards valuable distributional statistics of the activation maps. We contend that these overlooked statistics, particularly channel-wise variance and dominant (maximum) activations, are highly discriminative for OOD detection. We introduce DAVIS, a simple and broadly applicable post-hoc technique that enriches feature vectors by incorporating these crucial statistics, directly addressing the information loss from GAP. Extensive evaluations show DAVIS sets a new benchmark across diverse architectures, including ResNet, DenseNet, and EfficientNet. It achieves significant reductions in the false positive rate (FPR95), with improvements of 48.26% on CIFAR-10 using ResNet-18, 38.13% on CIFAR-100 using ResNet-34, and 26.83% on ImageNet-1k benchmarks using MobileNet-v2. Our analysis reveals the underlying mechanism for this improvement, providing a principled basis for moving beyond the mean in OOD detection.
[103] Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen,Amirhossein Habibian,Luca Benini,Yawei Li
Main category: cs.CV
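TL;DR: This paper proposes GRACE, a framework that combines knowledge distillation with quantization-aware training (QAT) under the Information Bottleneck principle to achieve efficient INT4 quantization of Vision-Language Models (VLMs), substantially improving throughput and memory efficiency while maintaining high accuracy.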
Details
Motivation: Vision-Language Models (VLMs) are costly to deploy, and post-training quantization often causes a significant accuracy drop, while quantization-aware training (QAT) for VLMs remains underexplored.
Method: GRACE unifies knowledge distillation and QAT through (1) confidence-gated decoupled distillation, which filters unreliable supervision; (2) relational centered kernel alignment, which transfers visual token structure; and (3) an adaptive controller based on Lagrangian relaxation, which balances fidelity against capacity constraints.
Result: On the LLaVA and Qwen families, the INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench) and come close to teacher performance; with a real INT4 kernel they achieve 3× throughput and a 54% memory reduction.
Conclusion: GRACE is a principled and effective VLM quantization scheme that supports efficient deployment in resource-constrained settings.
Abstract: Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
[104] OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
Jin Li,Tao Chen,Shuai Jiang,Weijie Wang,Jingwen Luo,Chenhui Wu
Main category: cs.CV
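TL;DR: This paper presents OpenVTON-Bench, a large-scale, high-resolution, semantically balanced benchmark for evaluating virtual try-on (VTON), together with a multi-modal, multi-dimensional evaluation protocol that agrees substantially better with human judgments.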
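Agreement with human judgments in this entry is reported as Kendall's τ. The sketch below shows how such a rank correlation can be computed from scratch; it implements the simple tau-a variant (tied pairs count as neither concordant nor discordant), which is an assumption, since the digest does not say which variant the authors used. The example scores are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical human ranking of four try-on results versus a metric's ranking:
# the metric swaps the last two items, so 5 of 6 pairs agree.
human_rank = [1, 2, 3, 4]
metric_rank = [1, 2, 4, 3]
print(kendall_tau(human_rank, metric_rank))
```

A value of 1 means the metric orders every pair of results the same way humans do, and -1 means it reverses every pair.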
Details
Motivation: Existing VTON evaluation metrics struggle to quantify texture detail and semantic consistency, and existing datasets fall short of commercial standards in scale and diversity.
Method: The authors build OpenVTON-Bench, a dataset of about 100K high-resolution image pairs, using DINOv3-based clustering and Gemini dense captioning, and propose a five-dimension multi-modal evaluation protocol that combines VLM-based semantic reasoning with a multi-scale representation metric based on SAM3 and morphological erosion.
Result: The protocol agrees strongly with human judgments (Kendall's τ = 0.833), well above SSIM (0.611); the dataset covers 20 fine-grained garment categories with a uniform distribution.
Conclusion: OpenVTON-Bench and its evaluation protocol provide a more reliable, interpretable, and commercially ready evaluation standard for VTON systems.
Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
[105] GaussianOcc3D: A Gaussian-Based Adaptive Multi-modal 3D Occupancy Prediction
A. Enes Doruk,Hasan F. Ates
Main category: cs.CV
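TL;DR: This paper proposes GaussianOcc3D, a multi-modal semantic occupancy prediction framework built on a continuous 3D Gaussian representation that fuses camera and LiDAR data; four new modules improve accuracy, robustness, and efficiency, and the framework reaches state-of-the-art performance on several benchmarks.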
Details
Motivation: Single-modality methods trade off camera semantics against LiDAR geometry; existing multi-modal methods suffer from modality heterogeneity, spatial misalignment, and the computational cost or information loss of voxel and BEV representations.
Method: GaussianOcc3D comprises (1) LiDAR Depth Feature Aggregation (LDFA), which uses depth-wise deformable sampling to lift sparse LiDAR signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS), which suppresses domain noise; (3) uncertainty-aware Adaptive Camera-LiDAR Fusion (ACLF); and (4) the Gauss-Mamba Head, a lightweight global-modeling head based on selective state space models.
Result: The framework reaches mIoU of 49.4%, 28.9%, and 25.2% on Occ3D, SurroundOcc, and SemanticKITTI respectively, clearly ahead of existing methods, and is more robust in challenging rainy and nighttime scenes.
Conclusion: By introducing a memory-efficient, continuous 3D Gaussian representation, GaussianOcc3D alleviates the heterogeneity and representation bottlenecks of multi-modal occupancy prediction and offers a new approach to real-time, robust environment understanding for autonomous driving.
Abstract: 3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense and fine-grained understanding of the surrounding environment, yet single-modality methods face trade-offs between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and the representation crisis, in which voxels are computationally heavy and BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework bridging camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), using depth-wise deformable sampling to lift sparse signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS) to mitigate domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF) with uncertainty-aware reweighting for sensor reliability; and (4) a Gauss-Mamba Head leveraging Selective State Space Models for global context with linear complexity. Evaluations on Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, achieving mIoU scores of 49.4%, 28.9%, and 25.2% respectively. GaussianOcc3D exhibits superior robustness across challenging rainy and nighttime conditions.
[106] ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model
Xiaoshu Chen,Sihang Zhou,Ke Liang,Taichun Zhou,Xinwang Liu
Main category: cs.CV
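TL;DR: This paper proposes ImgCoT, which renders the chain of thought (CoT) as an image and uses that image as the reconstruction target, replacing linguistic inductive bias with spatial inductive bias so that latent representations better abstract reasoning structure; it further proposes loose ImgCoT, which supplements the visual latent tokens with key textual steps to retain both global structure and detail for more efficient reasoning compression.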
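The entry says loose ImgCoT keeps a few key textual steps chosen by low token log-likelihood. A minimal sketch of such a selection rule follows; scoring a step by the mean log-likelihood of its tokens and keeping a fixed number `k` of steps are both assumptions for illustration, and the function name and example values are hypothetical.

```python
def select_key_steps(steps, token_logprobs, k):
    """Keep the k steps whose tokens the model found least likely.

    steps:          list of reasoning-step strings
    token_logprobs: one list of per-token log-likelihoods per step
    Scoring by the per-step mean is an assumed aggregation rule.
    """
    if len(steps) != len(token_logprobs):
        raise ValueError("one log-likelihood list is required per step")
    scores = [sum(lps) / len(lps) for lps in token_logprobs]
    lowest = sorted(range(len(steps)), key=lambda i: scores[i])[:k]
    # Restore the original order so the kept steps still read as a chain.
    return [steps[i] for i in sorted(lowest)]


steps = [
    "Let x be the unknown.",
    "So 3x + 2 = 11.",
    "Subtract 2 from both sides.",
    "Therefore x = 3.",
]
token_logprobs = [[-0.1, -0.2], [-2.0, -1.0], [-0.3, -0.3], [-1.5, -2.5]]
print(select_key_steps(steps, token_logprobs, k=2))
# ['So 3x + 2 = 11.', 'Therefore x = 3.']
```

The intuition is that steps the model already predicts confidently carry little extra information, so only the surprising ones are kept as text alongside the visual latent tokens.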
Details
Motivation: Existing autoencoding methods that reconstruct textual CoT introduce a strong linguistic inductive bias, over-attending to surface linguistic features and weakening the modeling of deeper reasoning structure.
Method: ImgCoT renders textual CoT into images and uses the images as the autoencoder's reconstruction target, introducing a spatial inductive bias; loose ImgCoT additionally performs hybrid reasoning by combining key textual steps, selected by low log-likelihood, with the visual latent tokens.
Result: Experiments on multiple datasets and LLMs confirm the effectiveness of ImgCoT and its loose variant, with marked gains in reasoning compression efficiency and structural abstraction.
Conclusion: Using visualized CoT instead of textual CoT as the reconstruction target mitigates linguistic bias and improves logical abstraction; loose ImgCoT further improves token efficiency while preserving detail.
Abstract: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT, which changes the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose loose ImgCoT, a hybrid reasoning scheme that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
[107] Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
Enyi Shi,Pengyang Shao,Yanxin Zhang,Chenhang Cui,Jiayi Lyu,Xu Xie,Xiaobo Xia,Fei Shen,Tat-Seng Chua
Main category: cs.CV
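TL;DR: This paper presents Lingua-SafetyBench, a benchmark of more than 100K multilingual, multimodal harmful image-text pairs for evaluating the robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs. It finds that image-dominant and text-dominant risks are asymmetric across languages of different resource levels, and that scaling and version upgrades help high-resource languages more, widening the safety gap between languages; it stresses the need for language- and modality-aware safety alignment.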
Details
Motivation: Existing safety benchmarks are either multilingual but text-only or multimodal but monolingual; current multilingual multimodal red-teaming relies on typography-style images and lacks semantically grounded image-text pairs, so it cannot cover realistic cross-modal interaction risks.
Method: Lingua-SafetyBench contains 100,440 harmful image-text pairs across 10 languages, partitioned into image-dominant and text-dominant subsets; 11 open-source VLLMs are evaluated systematically, and a controlled study on the Qwen series analyzes the effect of scaling and version upgrades.
Result: Image-dominant risks yield a higher attack success rate (ASR) in high-resource languages, whereas text-dominant risks are more severe in non-high-resource languages; upgrades in the Qwen series lower ASR overall but widen the safety gap between high-resource and non-high-resource languages.
Conclusion: Scaling alone cannot solve multilingual multimodal safety; safety alignment that is aware of both language and modality is needed. The authors will release the benchmark, model checkpoints, and code to support reproducible research.
Abstract: Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield a higher Attack Success Rate (ASR) in high-resource languages (HRLs), while text-dominant risks are more severe in non-high-resource languages (Non-HRLs). A controlled study on the Qwen series shows that scaling and version upgrades reduce ASR overall but disproportionately benefit HRLs, widening the gap between HRLs and Non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code. The code and dataset will be available at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.
[108] StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing
Han Wang,Deyi Ji,Lanyun Zhu,Jiebo Luo,Roy Ka-Wei Lee
Main category: cs.CV
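TL;DR: StreamSense is a streaming detector for live-streaming platforms that couples a lightweight streaming encoder with a Vision-Language Model (VLM) expert; through selective routing and decision deferral it maintains high accuracy while substantially reducing latency and compute.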
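The three-way behavior described above (answer with the lightweight encoder, escalate to the VLM, or defer) can be pictured as a small routing rule. The sketch below is only that: the confidence score, the context-length check, and both thresholds are hypothetical stand-ins, since the digest does not describe how StreamSense actually makes these decisions.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"      # keep the lightweight encoder's prediction
    ESCALATE = "escalate"  # send the case to the VLM expert
    DEFER = "defer"        # wait for more context before deciding


def route(confidence, context_seconds, min_context=2.0, accept_threshold=0.8):
    """Pick an action for one timestamp.

    confidence:      encoder confidence in [0, 1] (hypothetical signal)
    context_seconds: amount of stream context seen so far (hypothetical signal)
    Both thresholds are illustrative defaults, not values from the paper.
    """
    if context_seconds < min_context:
        return Action.DEFER
    if confidence >= accept_threshold:
        return Action.ACCEPT
    return Action.ESCALATE


print(route(0.95, 5.0))  # Action.ACCEPT
print(route(0.40, 5.0))  # Action.ESCALATE
print(route(0.95, 0.5))  # Action.DEFER
```

Because most timestamps are expected to take the `ACCEPT` path, the expensive VLM call is paid only for the ambiguous minority, which is where the latency and compute savings come from.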
Details
Motivation: Live-streaming platforms must respond in real time to asynchronous, incomplete social signals from video, text, and audio, and existing methods struggle to balance efficiency and accuracy.
Method: StreamSense (1) handles most timestamps with a lightweight streaming encoder; (2) selectively invokes the VLM for hard or ambiguous samples; and (3) defers decisions when context is insufficient. The encoder is trained with a cross-modal contrastive loss and an IoU-weighted loss to mitigate label interference at segment boundaries.
Result: On multiple social streaming detection tasks, such as sentiment classification and hate content moderation, StreamSense is more accurate than a VLM-only approach while invoking the VLM infrequently, which markedly lowers average latency and compute cost.
Conclusion: Selective escalation and deferral are effective primitives for streaming social tasks.
Abstract: Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
[109] Beauty and the Beast: Imperceptible Perturbations Against Diffusion-Based Face Swapping via Directional Attribute Editing
Yilong Huang,Songze Li
Main category: cs.CV
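TL;DR: This paper proposes FaceDefense, a framework that introduces a new diffusion loss and directional facial attribute editing to balance the visual imperceptibility of adversarial perturbations against their defensive effectiveness.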
Details
Motivation: Diffusion-based face swapping performs very well but also amplifies the harm that malicious face swapping can do to portrait rights and personal reputation, so more effective proactive defenses are needed.
Method: FaceDefense (1) designs a new diffusion loss to strengthen the defensive efficacy of adversarial examples; (2) uses directional facial attribute editing to restore perturbation-induced distortions of facial structure; and (3) builds a two-phase alternating optimization strategy to generate the final perturbed image.
Result: FaceDefense significantly outperforms existing methods on multiple metrics, achieving a better trade-off between visual imperceptibility and defense effectiveness.
Conclusion: FaceDefense offers an efficient, visually unobtrusive proactive defense against diffusion-based face swapping.
Abstract: Diffusion-based face swapping achieves state-of-the-art performance, yet it also exacerbates the potential harm of malicious face swapping, which can violate portrait rights or undermine personal reputation. This has spurred the development of proactive defense methods. However, existing approaches face a core trade-off: large perturbations distort facial structures, while small ones weaken protection effectiveness. To address these issues, we propose FaceDefense, an enhanced proactive defense framework against diffusion-based face swapping. Our method introduces a new diffusion loss to strengthen the defensive efficacy of adversarial examples, and employs directional facial attribute editing to restore perturbation-induced distortions, thereby enhancing visual imperceptibility. A two-phase alternating optimization strategy is designed to generate final perturbed face images. Extensive experiments show that FaceDefense significantly outperforms existing methods in both imperceptibility and defense effectiveness, achieving a superior trade-off.
[110] Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models
Guillermo Gil de Avalle,Laura Maruster,Christos Emmanouilidis
Main category: cs.CV
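TL;DR: This paper evaluates two vision-language models (VLMs) on automatically extracting structured knowledge from industrial troubleshooting guides, comparing standard instruction-style prompting with an augmented prompting strategy that exploits troubleshooting layout patterns, and finds model-specific trade-offs between layout sensitivity and semantic robustness.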
Details
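Motivation: Industrial troubleshooting guides carry diagnostic knowledge in flowchart-like diagrams, which must be structured before it can be integrated into operator support systems; manual extraction is slow and error-prone, and the use of VLMs on such specialized documents has not been studied enough.
Method: Two vision-language models are evaluated on a structured knowledge extraction task, comparing standard instruction-guided prompting with an augmented prompting strategy that cues the layout patterns of troubleshooting diagrams.
Result: The VLMs show model-specific trade-offs between layout sensitivity and semantic robustness; the augmented prompting strategy improves layout understanding for some models.
Conclusion: VLMs have the potential to automate the extraction of industrial troubleshooting knowledge, but practical deployment requires choosing a prompting strategy suited to the specific model and task.
Abstract: Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models offer potential to automate this process by jointly interpreting visual and textual meaning, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: standard instruction-guided versus an augmented approach that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.
[111] Is Training Necessary for Anomaly Detection?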
Xingwu Zhang,Guanxuan Li,Paul Henderson,Gerardo Aragon-Camarasa,Zijun Long
Main category: cs.CV
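TL;DR: This paper proposes RAD, a training-free, retrieval-based anomaly detection method that abandons the conventional reconstruction paradigm, reaches state-of-the-art performance on multiple benchmarks, and comes with a proof that retrieval-based scores upper-bound reconstruction-residual scores.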
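To make the retrieval idea above concrete, here is a minimal memory-bank sketch: features of normal patches are stored, and a test patch is scored by its distance to the nearest stored feature. It uses a single feature level and Euclidean distance, both simplifying assumptions; RAD itself is described as using multi-level retrieval, and the function names here are hypothetical.

```python
import math


def build_memory(normal_patch_features):
    """Store anomaly-free patch features; no training is involved."""
    return [tuple(feature) for feature in normal_patch_features]


def patch_score(patch_feature, memory):
    """Anomaly score of one patch: distance to its nearest memory entry."""
    return min(math.dist(patch_feature, stored) for stored in memory)


def image_score(patch_features, memory):
    """Image-level score: the most anomalous patch decides (an assumed rule)."""
    return max(patch_score(patch, memory) for patch in patch_features)


memory = build_memory([(0.0, 0.0), (1.0, 0.0)])
print(patch_score((0.0, 0.0), memory))                # 0.0: seen during memory building
print(patch_score((4.0, 4.0), memory))                # 5.0: far from every normal patch
print(image_score([(0.0, 0.0), (4.0, 4.0)], memory))  # 5.0
```

Adding more normal images only appends entries to the memory, which is consistent with the entry's claim that the approach works in few-shot settings without any task-specific training.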
Details
Motivation: Existing multi-class unsupervised anomaly detection (MUAD) methods rely on encoder-decoder reconstruction, which has an inherent tension between fidelity and stability.
Method: RAD is training-free: it stores features of normal samples in a memory and detects anomalies at test time through multi-level retrieval and matching.
Result: RAD reaches state-of-the-art performance on four benchmarks including MVTec-AD; with a single normal image it attains 96.7% Pixel AUROC, close to its full-data performance of 98.5%; a theoretical result shows that retrieval-based scores upper-bound reconstruction-residual scores.
Conclusion: MUAD does not require task-specific training; memory-based retrieval is enough for state-of-the-art anomaly detection.
Abstract: Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. We first show these approaches have an inherent fidelity-stability dilemma in how they detect anomalies via reconstruction residuals. We then abandon the reconstruction paradigm entirely and propose Retrieval-based Anomaly Detection (RAD). RAD is a training-free approach that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image compared to 98.5% of RAD's full-data performance. We further prove that retrieval-based scores theoretically upper-bound reconstruction-residual scores. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with memory-based retrieval. Our code is available at https://github.com/longkukuhi/RAD.
[112] Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection
Nan Zhong,Yiran Xu,Mian Zou
Main category: cs.CV
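TL;DR: This paper proposes Demosaicing-guided Color Correlation Training (DCCT), a framework that exploits the color correlations induced by the color filter array (CFA) and demosaicing in the camera imaging pipeline to detect AI-generated images with strong generalization and robustness.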
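One way to picture the CFA-based decomposition mentioned in this entry is to apply a Bayer mosaic: at every pixel one channel is kept as the conditioning input and the other two become prediction targets. The sketch below assumes an RGGB Bayer tile and a per-pixel split; the digest does not specify the CFA pattern or the exact decomposition DCCT uses, so treat both as illustrative assumptions.

```python
# Channel index kept at (row % 2, col % 2) for an assumed RGGB Bayer tile:
#   R G
#   G B
BAYER_RGGB = ((0, 1), (1, 2))


def cfa_decompose(image):
    """Split an RGB image into a single-channel condition and two-channel targets.

    image[r][c] is an (R, G, B) tuple. Returns (condition, targets) where
    condition[r][c] is the value the simulated sensor would record and
    targets[r][c] holds the two channels a model would have to predict.
    """
    condition, targets = [], []
    for r, row in enumerate(image):
        cond_row, target_row = [], []
        for c, pixel in enumerate(row):
            kept = BAYER_RGGB[r % 2][c % 2]
            cond_row.append(pixel[kept])
            target_row.append(tuple(v for ch, v in enumerate(pixel) if ch != kept))
        condition.append(cond_row)
        targets.append(target_row)
    return condition, targets


# A 2x2 image with the same color everywhere makes the pattern easy to read.
image = [[(10, 20, 30)] * 2 for _ in range(2)]
condition, targets = cfa_decompose(image)
print(condition)  # [[10, 20], [20, 30]]
print(targets)    # [[(20, 30), (10, 30)], [(10, 30), (10, 20)]]
```

A model trained to predict `targets` from `condition` on photographs learns the channel dependencies that demosaicing leaves behind, which is the signal the entry says AI-generated images lack.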
Details
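Motivation: Existing detectors based on generative artifacts generalize poorly to unseen generators, whereas real photographs are constrained by the camera imaging pipeline (such as the CFA and demosaicing) and therefore have distinctive color correlations. AI-generated images lack this physical constraint, which can be used to build more generalizable detectors.
Method: DCCT simulates the CFA sampling pattern, decomposes a color image into a single-channel input (the condition) and the remaining two channels (the prediction targets), and trains a self-supervised U-Net to model the conditional distribution, parameterized as a mixture of logistic functions. Theoretical analysis shows that the approach captures an essential distributional difference in color correlation between photographic and AI-generated images.
Result: DCCT significantly outperforms prior methods across more than 20 unseen generators, reaching state-of-the-art generalization and robustness.
Conclusion: Exploiting the physical priors of the camera imaging pipeline (color correlations induced by the CFA and demosaicing) effectively improves the generalization of AI-generated image detectors; DCCT demonstrates the effectiveness and extensibility of this idea.
Abstract: As realistic AI-generated images threaten digital authenticity, we address the generalization failure of generative artifact-based detectors by exploiting the intrinsic properties of the camera imaging pipeline. Concretely, we investigate color correlations induced by the color filter array (CFA) and demosaicing, and propose a Demosaicing-guided Color Correlation Training (DCCT) framework for AI-generated image detection. By simulating the CFA sampling pattern, we decompose each color image into a single-channel input (as the condition) and the remaining two channels as the ground-truth targets (for prediction). A self-supervised U-Net is trained to model the conditional distribution of the missing channels from the given one, parameterized via a mixture of logistic functions. Our theoretical analysis reveals that DCCT targets a provable distributional difference in color-correlation features between photographic and AI-generated images. By leveraging these distinct features to construct a binary classifier, DCCT achieves state-of-the-art generalization and robustness, significantly outperforming prior methods across over 20 unseen generators.
[113] Diachronic Stereo Matching for Multi-Date Satellite Imagery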
Elías Masquil,Luca Savant Aira,Roger Marí,Thibaud Ehret,Pablo Musé,Gabriele Facciolo
Main category: cs.CV
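TL;DR: This paper proposes Diachronic Stereo Matching for satellite imagery: a deep stereo network that incorporates monocular depth priors is fine-tuned on a dataset of diachronic stereo pairs covering diverse seasonal and illumination conditions, enabling robust 3D reconstruction from images captured months apart.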
Details
Motivation: Conventional stereo reconstruction fails on satellite image pairs captured far apart in time (e.g., months) with strong seasonal, illumination, and shadow changes; multi-date methods based on NeRF or Gaussian splatting work but are computationally expensive and unsuited to the two-image setting. A method that reliably handles diachronic stereo pairs is therefore needed.
Method: (1) Start from a pretrained MonSter model (trained on datasets such as SceneFlow and KITTI); (2) fine-tune it on synchronic and diachronic stereo pairs derived from the DFC2019 remote sensing challenge, specifically to strengthen robustness to temporal change; (3) use monocular depth priors to assist stereo matching.
Result: On multi-date WorldView-3 imagery, the method significantly outperforms classical stereo pipelines and unadapted deep stereo models in both synchronic and diachronic settings (e.g., in the Omaha test scene the mean altitude error drops from 3.99 m to 1.23 m); monocular priors and fine-tuning on temporally diverse data are identified as the key factors.
Conclusion: This is the first reliable diachronic stereo matching method for satellite imagery. Combining monocular depth priors with fine-tuning on temporally varied data removes classical stereo's dependence on near-simultaneous acquisitions and opens a path toward low-cost, wide-coverage satellite 3D reconstruction.
Abstract: Recent advances in image-based satellite 3D reconstruction have progressed along two complementary directions. On one hand, multi-date approaches using NeRF or Gaussian-splatting jointly model appearance and geometry across many acquisitions, achieving accurate reconstructions on opportunistic imagery with numerous observations. On the other hand, classical stereoscopic reconstruction pipelines deliver robust and scalable results for simultaneous or quasi-simultaneous image pairs. However, when the two images are captured months apart, strong seasonal, illumination, and shadow changes violate standard stereoscopic assumptions, causing existing pipelines to fail. This work presents the first Diachronic Stereo Matching method for satellite imagery, enabling reliable 3D reconstruction from temporally distant pairs. Two advances make this possible: (1) fine-tuning a state-of-the-art deep stereo network that leverages monocular depth priors, and (2) exposing it to a dataset specifically curated to include a diverse set of diachronic image pairs. In particular, we start from a pretrained MonSter model, trained initially on a mix of synthetic and real datasets such as SceneFlow and KITTI, and fine-tune it on a set of stereo pairs derived from the DFC2019 remote sensing challenge. This dataset contains both synchronic and diachronic pairs under diverse seasonal and illumination conditions. Experiments on multi-date WorldView-3 imagery demonstrate that our approach consistently surpasses classical pipelines and unadapted deep stereo models on both synchronic and diachronic settings. Fine-tuning on temporally diverse images, together with monocular priors, proves essential for enabling 3D reconstruction from previously incompatible acquisition dates.
Figure 1. Output geometry for a winter-autumn image pair from Omaha (OMA 331 test scene). Panels: left image (winter), right image (autumn), and DSM geometry for Ours (1.23 m), Zero-shot (3.99 m), and LiDAR GT. Our method recovers accurate geometry despite the diachronic nature of the pair, exhibiting strong appearance changes, which cause existing zero-shot methods to fail. Missing values due to perspective are shown in black. Mean altitude error in parentheses; lower is better.
[114] FarmMind: Reasoning-Query-Driven Dynamic Segmentation for Farmland Remote Sensing Images
Haiyang Wu,Weiliang Mu,Jipeng Zhang,Zhong Dandan,Zhuofei Du,Haifeng Li,Tao Chao
Main category: cs.CV
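TL;DR: This paper proposes FarmMind, a reasoning-query-driven dynamic segmentation framework for farmland remote sensing images. It mimics how human experts, when a scene is ambiguous, actively query auxiliary images (higher-resolution, larger-scale, or temporally adjacent) for cross-verification, and so moves beyond the conventional static single-image segmentation paradigm.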
Details
Motivation: Existing farmland remote sensing segmentation methods analyze a single static image, which limits their reasoning in complex, ambiguous, and visually uncertain scenes; human experts, by contrast, actively consult auxiliary images from multiple sources for cross-verification. This gap motivates the work.
Method: FarmMind introduces a reasoning-query mechanism: it first attributes the causes of segmentation ambiguity, then dynamically queries, on demand, the most suitable type of external auxiliary image, so that complementary information supports joint reasoning.
Result: Experiments on multiple datasets show that FarmMind clearly outperforms existing methods, with higher segmentation accuracy and stronger generalization.
Conclusion: Dynamically querying auxiliary images under the guidance of reasoning effectively improves remote sensing image segmentation and offers a new approach to interpreting complex scenes.
Abstract: Existing methods for farmland remote sensing image (FRSI) segmentation generally follow a static segmentation paradigm, where analysis relies solely on the limited information contained within a single input patch. Consequently, their reasoning capability is limited when dealing with complex scenes characterized by ambiguity and visual uncertainty. In contrast, human experts, when interpreting remote sensing images in such ambiguous cases, tend to actively query auxiliary images (such as higher-resolution, larger-scale, or temporally adjacent data) to conduct cross-verification and achieve more comprehensive reasoning. Inspired by this, we propose a reasoning-query-driven dynamic segmentation framework for FRSIs, named FarmMind. This framework breaks through the limitations of the static segmentation paradigm by introducing a reasoning-query mechanism, which dynamically and on-demand queries external auxiliary images to compensate for the insufficient information in a single input image. Unlike direct queries, this mechanism simulates the thinking process of human experts when faced with segmentation ambiguity: it first analyzes the root causes of segmentation ambiguities through reasoning, and then determines what type of auxiliary image needs to be queried based on this analysis. Extensive experiments demonstrate that FarmMind achieves superior segmentation performance and stronger generalization ability compared with existing methods. The source code and dataset used in this work are publicly available at: https://github.com/WithoutOcean/FarmMind.
[115] A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions
Ji Zhou,Yilin Ding,Yongqi Zhao,Jiachen Xu,Arno Eichberger
Main category: cs.CV
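TL;DR: This paper systematically evaluates 10 large vision-language models (LVLMs) on 2D object detection in long-tail traffic and environmentally degraded scenarios. Top LVLMs exceed a YOLO baseline in recall by over 25% in complex natural scenes but trail it in geometric precision under synthetic perturbations, suggesting that LVLMs are suited to serve as high-level safety validators in SOTIF-oriented automated driving systems.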
Details
Motivation: Perception insufficiencies, especially in adverse environments, create safety risks in automated driving (the SOTIF problem); the paper examines how effective LVLMs are, quantitatively, for safety-critical 2D object detection.
Method: Ten representative LVLMs are evaluated systematically on PeSOTIF, a dataset designed for long-tail traffic and environmental degradation, and compared quantitatively with a YOLO baseline detector.
Result: Top LVLMs (e.g., Gemini 3, Doubao) exceed YOLO in recall by more than 25% in complex natural scenes and are more robust; under synthetic perturbations, YOLO retains the advantage in geometric precision.
Conclusion: LVLMs and conventional geometric regression have complementary strengths, and LVLMs can serve as a high-level safety validation module in SOTIF-oriented automated driving systems.
Abstract: Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.
[116] NativeTok: Native Visual Tokenization for Improved Image Generation
Bin Wu,Mengqi Huang,Weinan Jia,Zhendong Mao
Main category: cs.CV
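TL;DR: This paper proposes NativeTok, a framework for native visual tokenization that introduces causal dependencies at the tokenization stage, addressing the weak modeling of token dependencies caused by the decoupled two-stage design of conventional VQ image generation. Its core components are a Meta Image Transformer and a Mixture of Causal Expert Transformer, supported by a hierarchical native training strategy, and it improves generation consistency and coherence while keeping reconstruction efficient.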
Details
Motivation: In the two-stage paradigm of VQ image generation, the tokenization stage does not model causal dependencies among discrete tokens, so the generative model must learn from an unordered distribution, which leads to bias and weak coherence.
Method: The paper proposes native visual tokenization and builds NativeTok, which consists of a Meta Image Transformer (MIT) for latent image modeling and a Mixture of Causal Expert Transformer (MoCET) in which each lightweight expert block generates one token in causal order; a hierarchical native training strategy updates only newly added expert blocks to keep training efficient.
Result: Extensive experiments confirm NativeTok's effectiveness in image reconstruction and generation quality, with clear gains in the consistency and structural coherence of generated images.
Conclusion: Introducing causal dependencies natively during tokenization is key to improving VQ image generation, and NativeTok shows how the two stages can be coupled more tightly.
Abstract: VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
[117] Neural Clothing Tryer: Customized Virtual Try-On via Semantic Enhancement and Controlling Diffusion Model
Zhijing Yang,Weiwei Zhang,Mingliang Yang,Siyuan Peng,Yukai Shi,Junpeng Tan,Tianshui Chen,Liruo Zhong
Main category: cs.CV
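TL;DR: This paper proposes Neural Clothing Tryer (NCT), a new virtual try-on framework for the Customized Virtual Try-On (Cu-VTON) task that supports flexible customization of the model's appearance, posture, and attributes, and combines semantic enhancement and controlling modules to better preserve garment semantics and texture detail.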
Details
Motivation: Conventional virtual try-on (VTON) offers little flexibility for customizing a user's digital avatar (appearance, posture, expression, and so on), which limits realism and interactivity; a new, highly customizable Cu-VTON task is therefore proposed.
Method: NCT is built on diffusion models and has two core modules: (1) a semantic enhancement module, which uses a visual-language encoder to align semantic descriptions of garments with image features and feeds the result to the diffusion model as a condition; and (2) a semantic controlling module, which takes the garment image, a customized posture image, and the semantic description as input to preserve garment details while editing the model's posture, expression, and other attributes.
Result: Extensive experiments on a public benchmark show that NCT outperforms existing methods on customized virtual try-on, preserving garment semantics and texture more accurately and supporting high-quality, controllable editing of model appearance and posture.
Conclusion: Through semantics-driven diffusion modeling, NCT makes virtual try-on highly controllable and personalized, providing useful technology for immersive virtual shopping and digital-human applications.
Abstract: This work aims to address a novel Customized Virtual Try-ON (Cu-VTON) task, enabling the superimposition of a specified garment onto a model that can be customized in terms of appearance, posture, and additional attributes. Compared with the traditional VTON task, it enables users to tailor digital avatars to their individual preferences, thereby enhancing the virtual fitting experience with greater flexibility and engagement. To address this task, we introduce a Neural Clothing Tryer (NCT) framework, which exploits advanced diffusion models equipped with semantic enhancement and controlling modules to better preserve the semantic characterization and textural details of the garment while facilitating flexible editing of the model's postures and appearances. Specifically, NCT introduces a semantic-enhanced module that takes semantic descriptions of garments and utilizes a visual-language encoder to learn aligned features across modalities. The aligned features serve as the condition input to the diffusion model to enhance the preservation of the garment's semantics. Then, a semantic controlling module is designed to take the garment image, tailored posture image, and semantic description as input to maintain garment details while simultaneously editing model postures, expressions, and various attributes. Extensive experiments on the openly available benchmark demonstrate the superior performance of the proposed NCT framework.
[118] How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models
Leonard Hackel,Tom Burgert,Begüm Demir
Main category: cs.CV
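TL;DR: This paper asks whether large models in remote sensing (RS) benefit from ever-larger parameter counts the way computer vision (CV) models do. It hypothesizes that RS models become overparameterized earlier and carry redundant representations, and tests this with post-hoc slimming. RS models keep high accuracy even at very low compute budgets, revealing highly redundant information encoding; the paper also proposes slimmable training and diagnostic analyses.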
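Post-hoc slimming, as used in this entry, means shrinking the width of an already trained network without retraining it. The sketch below shows the mechanics on a toy two-layer MLP by keeping only the first fraction of hidden units; which units are kept, the MLP setting, and the function names are assumptions for illustration, since the digest only says the encoder width is reduced uniformly.

```python
import math


def slim_mlp(w1, b1, w2, keep_ratio):
    """Keep the leading hidden units of a 2-layer MLP (no retraining).

    w1: hidden x in weights, b1: hidden biases, w2: out x hidden weights.
    Keeping the *first* units is an assumed selection rule.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(w1)))
    return w1[:k], b1[:k], [row[:k] for row in w2]


def forward(x, w1, b1, w2):
    """ReLU MLP forward pass on plain Python lists."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0, 1.0]]
x = [1.0, 2.0]

print(forward(x, w1, b1, w2))                         # full width: [6.0]
print(forward(x, *slim_mlp(w1, b1, w2, 0.5)))         # half width: [3.0]
```

In the study summarized here, the interesting quantity is how much downstream accuracy survives as `keep_ratio` shrinks; a model whose accuracy barely drops is carrying redundant width.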
Details
Motivation: Transferring the CV scaling paradigm directly to remote sensing has not been adequately validated; the authors suspect that RS foundation models become overparameterized at smaller scales, where extra parameters mainly add redundancy rather than new abstractions.
Method: Post-hoc slimming (uniformly reducing encoder width) is used to measure representational redundancy in six state-of-the-art RS foundation models on four downstream classification tasks; slimmable training, explained variance ratio, and feature correlation analysis are used to investigate the mechanism.
Result: RS models retain over 71% relative accuracy at a 1% FLOPs budget, far above the less than 10% retained by an MAE trained on ImageNet, confirming substantial redundancy; slimmable training improves MoCo-based and MAE-based models; feature analysis shows that task-relevant information is distributed with high redundancy.
Conclusion: RS foundation models show early overparameterization and strong representational redundancy, so the "bigger is better" CV scaling paradigm does not carry over; post-hoc slimmability is both an effective strategy for lightweight deployment and a new diagnostic tool for checking whether scaling RS models is justified.
Abstract: Large-scale foundation models (FMs) in remote sensing (RS) are developed based on the paradigms established in computer vision (CV) and have shown promise for various Earth observation applications. However, the direct transfer of scaling assumptions from CV to RS has not been adequately examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, where increasing parameter count primarily induces redundant representations rather than qualitatively new abstractions. To test this hypothesis, we use post-hoc slimming, where we uniformly reduce the width of the pretrained encoder, as a tool to measure representational redundancy across six state-of-the-art RS FMs on four downstream classification tasks. Our findings reveal a significant contrast with those in the CV domain: while a post-hoc slimmed masked autoencoder (MAE) trained on ImageNet retains less than 10% accuracy at 1% FLOPs, RS FMs maintain over 71% relative accuracy at the same budget. This sevenfold difference provides strong empirical support for our hypothesis. We further demonstrate that learned slimmable training can improve both Momentum Contrast (MoCo)-based and MAE-based models. In addition, through the explained variance ratio and the feature correlation analysis, we provide mechanistic explanations showing that RS FMs distribute task-relevant information with high redundancy. Our findings establish post-hoc slimmability as both a practical deployment strategy for resource-constrained environments and a diagnostic tool that challenges the prevailing scaling paradigm in RS. Upon acceptance, we will publish all code.
[119] Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification
Siyi Du,Xinzhe Luo,Declan P. O'Regan,Chen Qin
Main category: cs.CV
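TL;DR: This paper proposes DyMo, a framework that dynamically selects reliable recovered modalities and adaptively integrates multimodal information at inference time, resolving the discarding-imputation dilemma in incomplete multimodal data.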
Details
Motivation: Existing incomplete multimodal deep learning methods either discard missing modalities, losing information, or impute them, introducing noise; this is the discarding-imputation dilemma.
Method: DyMo is an inference-time dynamic modality selection framework. At its core, it builds a tractable proxy for information from the task loss and designs a principled reward function to guide modality selection; it also includes a flexible network architecture compatible with arbitrary modality combinations and a tailored training strategy.
Result: On diverse natural and medical image datasets, DyMo significantly outperforms state-of-the-art incomplete/dynamic multimodal learning methods across missing-data scenarios.
Conclusion: DyMo moves beyond the conventional discard-or-impute paradigm; dynamic, task-driven modality selection exploits task-relevant multimodal information more fully and improves robustness and performance on incomplete multimodal data.
Abstract: Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and integrates reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code is available at https://github.com//siyi-wind/DyMo.
[120] Under-Canopy Terrain Reconstruction in Dense Forests Using RGB Imaging and Neural 3D Reconstruction
Refael Sheffer,Chen Pinchover,Haim Zisman,Dror Ozeri,Roee Litman
Main category: cs.CV
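TL;DR: This paper presents a new method that reconstructs canopy-free, photorealistic ground views from conventional RGB images alone; it builds on NeRF with a low-light loss and per-ray integration control, and targets tasks such as search and rescue and forest inventory.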
Details
Motivation: Existing methods for mapping under-canopy terrain and vegetation in forests rely on costly or specialized sensors (airborne LiDAR, or Airborne Optical Sectioning (AOS) based on thermal synthetic aperture photography), so a low-cost, high-resolution alternative is needed.
Method: Building on Neural Radiance Fields (NeRF), the approach adds RGB image capture guidelines for specific illumination conditions, a low-light loss, and two mechanisms that control per-ray integration to remove canopy occlusion.
Result: For search and rescue, person detection achieves promising results relative to thermal AOS; for forest inventory, the method shows potential for tree counting; overall, it offers advantages in cost and resolution over specialized sensors.
Conclusion: The method offers a cost-effective, high-resolution, RGB-driven alternative for search and rescue, trail mapping, and forest inventory.
Abstract: Mapping the terrain and understory hidden beneath dense forest canopies is of great interest for numerous applications such as search and rescue, trail mapping, forest inventory tasks, and more. Existing solutions rely on specialized sensors: either heavy, costly airborne LiDAR, or Airborne Optical Sectioning (AOS), which uses thermal synthetic aperture photography and is tailored for person detection. We introduce a novel approach for the reconstruction of canopy-free, photorealistic ground views using only conventional RGB images. Our solution is based on the celebrated Neural Radiance Fields (NeRF), a recent 3D reconstruction method. Additionally, we include specific image capture considerations, which dictate the needed illumination to successfully expose the scene beneath the canopy. To better cope with the poorly lit understory, we employ a low light loss. Finally, we propose two complementary approaches to remove occluding canopy elements by controlling the per-ray integration procedure. To validate the value of our approach, we present two possible downstream tasks. For the task of search and rescue (SAR), we demonstrate that our method enables person detection which achieves promising results compared to thermal AOS (using only RGB images). Additionally, we show the potential of our approach for forest inventory tasks like tree counting. These results position our approach as a cost-effective, high-resolution alternative to specialized sensors for SAR, trail mapping, and forest-inventory tasks.
[121] When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection
Shashank Mishra,Didier Stricker,Jason Rambach
Main category: cs.CV
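TL;DR: This paper proposes a contextual anomaly detection method for the visual domain, emphasizing that abnormality depends on subject-context compatibility rather than on properties of the object itself; it introduces a new benchmark, CAAD-3K, and a conditional compatibility learning framework, and achieves SOTA performance on multiple datasets.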
Details
Motivation: Conventional anomaly detection assumes that abnormality is an intrinsic property of the observed object and ignores context; in practice, the same action or object can be normal or anomalous depending on context (e.g., running on a track versus on a highway), so context-sensitive anomaly detection needs to be modeled anew.
Method: A conditional compatibility learning framework based on vision-language representations models the relationship between subject and context; a benchmark with controlled variables, CAAD-3K, is built to isolate contextual anomalies.
Result: The method substantially outperforms existing approaches on CAAD-3K and reaches SOTA performance on MVTec-AD and VisA.
Conclusion: Modeling context dependence effectively complements traditional structural anomaly detection; context awareness is an important direction for improving the robustness and practicality of visual anomaly detection.
Abstract: Anomaly detection is often formulated under the assumption that abnormality is an intrinsic property of an observation, independent of context. This assumption breaks down in many real-world settings, where the same object or action may be normal or anomalous depending on latent contextual factors (e.g., running on a track versus on a highway). We revisit contextual anomaly detection, classically defined as context-dependent abnormality, and operationalize it in the visual domain, where anomaly labels depend on subject-context compatibility rather than intrinsic appearance. To enable systematic study of this setting, we introduce CAAD-3K, a benchmark that isolates contextual anomalies by controlling subject identity while varying context. We further propose a conditional compatibility learning framework that leverages vision-language representations to model subject-context relationships under limited supervision. Our method substantially outperforms existing approaches on CAAD-3K and achieves state-of-the-art performance on MVTec-AD and VisA, demonstrating that modeling context dependence complements traditional structural anomaly detection. Our code and dataset will be publicly released.
[122] Semantic Leakage from Image Embeddings
Yiyi Chen,Qiongkai Xu,Desmond Eliott,Qiongxiu Li,Johannes Bjerva
Main category: cs.CV
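TL;DR: This paper challenges the assumption that image embeddings pose low privacy risk, introduces the notion of semantic leakage, and designs SLImE, a lightweight inference framework that recovers semantic information from compressed image embeddings alone, revealing a fundamental privacy vulnerability caused by the preservation of semantic neighborhoods in aligned embeddings.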
Details
Motivation: Image embeddings are usually considered to carry limited privacy risk; the authors question this assumption and aim to formalize and expose the risk of semantic information leakage.
Method: The paper introduces the concept of semantic leakage and argues that preserving local semantic neighborhood structure under embedding alignment is sufficient to cause leakage; on this basis it builds the SLImE framework, which combines a locally trained semantic retriever with off-the-shelf models and needs no task-specific decoders.
Result: Across open and closed embedding models including GEMINI, COHERE, NOMIC, and CLIP, SLImE consistently recovers semantic tags, symbolic representations, and grammatical, coherent descriptions; each step is validated empirically.
Conclusion: The preservation of semantic neighborhoods under alignment is itself a privacy hazard; semantic leakage does not require reconstructing the original image, which poses a fundamental challenge for privacy protection of embeddings.
Abstract: Image embeddings are generally assumed to pose limited privacy risk. We challenge this assumption by formalizing semantic leakage as the ability to recover semantic structures from compressed image embeddings. Surprisingly, we show that semantic leakage does not require exact reconstruction of the original image. Preserving local semantic neighborhoods under embedding alignment is sufficient to expose the intrinsic vulnerability of image embeddings. Crucially, this preserved neighborhood structure allows semantic information to propagate through a sequence of lossy mappings. Based on this conjecture, we propose Semantic Leakage from Image Embeddings (SLImE), a lightweight inference framework that reveals semantic information from standalone compressed image embeddings, incorporating a locally trained semantic retriever with off-the-shelf models, without training task-specific decoders. We thoroughly validate each step of the framework empirically, from aligned embeddings to retrieved tags, symbolic representations, and grammatical and coherent descriptions. We evaluate SLImE across a range of open and closed embedding models, including GEMINI, COHERE, NOMIC, and CLIP, and demonstrate consistent recovery of semantic information across diverse inference tasks. Our results reveal a fundamental vulnerability in image embeddings, whereby the preservation of semantic neighborhoods under alignment enables semantic leakage, highlighting challenges for privacy preservation.
[123] DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
Hun Chang,Byunghee Cha,Jong Chul Ye
Main category: cs.CV
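TL;DR: This paper proposes the DINO-SAE framework, which combines a spherical autoencoder design with cosine similarity alignment and Riemannian flow matching to significantly improve image reconstruction quality while preserving semantic consistency.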
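The cosine similarity alignment named in this entry constrains only the direction of features, not their magnitude. A minimal sketch of such a loss follows, assuming features are plain lists of floats; the function names are hypothetical and this is not the authors' implementation.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def cosine_alignment_loss(student_feats, teacher_feats):
    """Mean of (1 - cosine similarity) over matching feature vectors.

    Hypothetical helper: only direction is penalized, so rescaling a student
    vector leaves its loss essentially unchanged.
    """
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)
```

Under this sketch, a student vector that is a scaled copy of the teacher vector incurs almost no loss, which is the property the abstract associates with retaining fine detail.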
Details
Motivation: Existing generative autoencoders built on pretrained vision foundation models (such as DINO) are limited in reconstruction fidelity, in particular losing high-frequency details.
Method: The DINO Spherical Autoencoder (DINO-SAE) includes a Hierarchical Convolutional Patch Embedding module and a Cosine Similarity Alignment objective; exploiting the fact that contrastively learned representations naturally lie on a hypersphere, it uses Riemannian Flow Matching to train a Diffusion Transformer (DiT) on the spherical latent space.
Result: On ImageNet-1K it reaches SOTA reconstruction with 0.37 rFID and 26.2 dB PSNR and strong semantic alignment; the Riemannian Flow Matching DiT reaches 3.47 gFID at 80 epochs, converging efficiently.
Conclusion: Combining modeling of semantic directionality with generative modeling on the spherical manifold achieves both high-fidelity reconstruction and semantic consistency, offering a new paradigm for VFM-driven generative autoencoders.
Abstract: Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
[124] Multi-Cue Anomaly Detection and Localization under Data Contamination
Anindya Sundar Das,Monowar Bhuyan
Main category: cs.CV
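TL;DR: This paper proposes a robust visual anomaly detection framework that uses a small number of labeled anomaly samples; it fuses deviation, uncertainty, and segmentation scores into a composite anomaly score and uses adaptive instance weighting to mitigate data contamination, achieving strong detection and localization performance and robustness on the MVTec and VisA datasets.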
Details
Motivation: Existing methods usually assume that the training data are purely normal, or unlabeled and uncontaminated, and cannot exploit real anomaly samples; this limits performance in real industrial settings, where data are often contaminated and a few anomaly labels are available.
Method: A composite anomaly score fuses a deviation score (statistical irregularity), an entropy-based uncertainty score (predictive inconsistency), and a segmentation score (spatial abnormality); under supervision from a few labeled anomalies, adaptive instance weighting suppresses the influence of contaminated samples; gradient backpropagation supports explainable localization.
Result: On the MVTec and VisA benchmarks the framework clearly outperforms SOTA methods, with strong detection and localization accuracy, interpretability, and robustness across contamination levels.
Conclusion: Introducing limited anomaly supervision together with multi-cue scoring and adaptive weighting effectively improves the reliability and practicality of industrial visual anomaly detection on contaminated real-world data.
Abstract: Visual anomaly detection in real-world industrial settings faces two major limitations. First, most existing methods are trained on purely normal data or on unlabeled datasets assumed to be predominantly normal, presuming the absence of contamination, an assumption that is rarely satisfied in practice. Second, they assume no access to labeled anomaly samples, limiting the model from learning discriminative characteristics of true anomalies. Therefore, these approaches often struggle to distinguish anomalies from normal instances, resulting in reduced detection and weak localization performance. In real-world applications, where training data are frequently contaminated with anomalies, such methods fail to deliver reliable performance. In this work, we propose a robust anomaly detection framework that integrates limited anomaly supervision into the adaptive deviation learning paradigm. We introduce a composite anomaly score that combines three complementary components: a deviation score capturing statistical irregularity, an entropy-based uncertainty score reflecting predictive inconsistency, and a segmentation-based score highlighting spatial abnormality. This unified scoring mechanism enables accurate detection and supports gradient-based localization, providing intuitive and explainable visual evidence of anomalous regions. Following the few-anomaly paradigm, we incorporate a small set of labeled anomalies during training while simultaneously mitigating the influence of contaminated samples through adaptive instance weighting. Extensive experiments on the MVTec and VisA benchmarks demonstrate that our framework outperforms state-of-the-art baselines and achieves strong detection and localization performance, interpretability, and robustness under various levels of data contamination.
[125] Deep in the Jungle: Towards Automating Chimpanzee Population Estimation
Tom Raynes,Otto Brookes,Timm Haucke,Lukas Bösch,Anne-Sophie Crunchant,Hjalmar Kühl,Sara Beery,Majid Mirmehdi,Tilo Burghardt
Main category: cs.CV
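TL;DR: This paper explores bringing monocular depth estimation (MDE) into camera-trap ecological monitoring of chimpanzees to estimate animal-to-camera distance automatically, replacing time-consuming manual measurement; experiments show that a calibrated DPT model outperforms Depth Anything and that, in real forest scenes, the resulting population density and abundance estimates are within 22% of those from manual methods.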
Details
Motivation: Animal-to-camera distances have traditionally been obtained by manually interpreting camera-trap videos, which is slow and costly; an automated, scalable alternative is needed to support large-scale great ape conservation monitoring.
Method: Two monocular depth estimation models (Dense Prediction Transformers and Depth Anything) are embedded in the camera-trap workflow and combined with multiple distance sampling strategies; detection distances are estimated from 220 videos of wild chimpanzees and used to infer population density and abundance, with manually annotated ground-truth distances as the reference for evaluation.
Result: Calibrated DPT outperforms Depth Anything in both distance estimation accuracy and downstream density/abundance inference; both models nevertheless show systematic bias, tending to overestimate distances in complex forest environments and therefore to underestimate density and abundance; failures in animal detection are the main factor limiting accuracy; overall, estimates stay within 22% of those from manual methods.
Conclusion: MDE-driven camera-trap distance sampling is a viable, practical alternative to manual distance estimation with real deployment potential, especially for large-scale great ape monitoring.
Abstract: The estimation of abundance and density in unmarked populations of great apes relies on statistical frameworks that require animal-to-camera distance measurements. In practice, acquiring these distances depends on labour-intensive manual interpretation of animal observations across large camera trap video corpora. This study introduces and evaluates an only sparsely explored alternative: the integration of computer vision-based monocular depth estimation (MDE) pipelines directly into ecological camera trap workflows for great ape conservation. Using a real-world dataset of 220 camera trap videos documenting a wild chimpanzee population, we combine two MDE models, Dense Prediction Transformers and Depth Anything, with multiple distance sampling strategies. These components are used to generate detection distance estimates, from which population density and abundance are inferred. Comparative analysis against manually derived ground-truth distances shows that calibrated DPT consistently outperforms Depth Anything. This advantage is observed in both distance estimation accuracy and downstream density and abundance inference. Nevertheless, both models exhibit systematic biases. We show that, given complex forest environments, they tend to overestimate detection distances and consequently underestimate density and abundance relative to conventional manual approaches. We further find that failures in animal detection across distance ranges are a primary factor limiting estimation accuracy. Overall, this work provides a case study that shows MDE-driven camera trap distance sampling is a viable and practical alternative to manual distance estimation. The proposed approach yields population estimates within 22% of those obtained using traditional methods.
[126] Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment
Wulin Xie,Rui Dai,Ruidong Ding,Kaikui Liu,Xiangxiang Chu,Xinwen Hou,Jie Wen
Main category: cs.CV
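TL;DR: This paper proposes the Q-Hawkeye framework, which improves the reliability and visual perception ability of RL-based image quality assessment models through Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization.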
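The uncertainty-aware reweighting in this entry can be illustrated with a small sketch: the variance of predicted scores across rollouts for one sample is turned into a weight that scales that sample's advantages. The exponential weighting, the temperature `tau`, and the function name are assumptions made for illustration; the entry does not state the exact form Q-Hawkeye uses.

```python
import math
from statistics import mean, pstdev, pvariance

def uncertainty_weighted_advantages(rewards, scores, tau=1.0, eps=1e-8):
    """Scale group-normalized advantages by an uncertainty-based weight.

    rewards: reward of each rollout for one training sample.
    scores:  quality score predicted in each rollout for the same sample.
    The weight exp(-variance / tau) is a hypothetical choice: unstable
    samples (high score variance) contribute less to the update.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    weight = math.exp(-pvariance(scores) / tau)
    return [weight * a for a in advantages], weight
```

With identical scores across rollouts the weight is 1 and the advantages are unchanged; as the scores spread out, the weight shrinks toward 0.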
Details
Motivation: Existing RL-based IQA methods have two key reliability problems: they apply uniform advantage weighting to all training samples, amplifying noisy signals from unstable samples, and they over-emphasize textual reasoning while neglecting the model's visual perception of image content.
Method: Q-Hawkeye has two parts: (1) Uncertainty-Aware Dynamic Optimization, which estimates uncertainty from the variance of predicted scores over multiple rollouts and reweights each sample's update strength accordingly; (2) Perception-Aware Optimization, which builds paired inputs of degraded and original images and introduces an Implicit Perception Loss that constrains the model to base quality judgments on genuine visual evidence.
Result: Experiments on multiple datasets show that Q-Hawkeye outperforms state-of-the-art methods and generalizes better.
Conclusion: Q-Hawkeye effectively improves the prediction stability and perceptual reliability of RL-based IQA models, offering a new approach to trustworthy IQA.
Abstract: Image Quality Assessment (IQA) predicts perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on MLLMs focus on generating visual quality descriptions and scores, ignoring two key reliability limitations: (i) although the model's prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples in gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model's visual perception ability of image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts and leverages this uncertainty to reweight each sample's update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their original images and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. The code and models will be made available.
[127] Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models
Anmin Wang,Nan Zhang,Wei Tao,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang
Main category: cs.CV
TL;DR: Triage is a training-free, plug-and-play framework for video reasoning with Vision-Language Models that reduces computational cost by hierarchical visual budgeting—first selecting keyframes, then allocating tokens efficiently—improving speed and memory usage without sacrificing performance.
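The token-level stage of this entry selects diverse Context Tokens with Maximal Marginal Relevance (MMR). As a rough illustration, here is the standard greedy form of MMR, which trades relevance against similarity to items already chosen. It is a plain, unbatched sketch and not Triage's batched variant; `mmr_select`, `lam`, and the list-based feature format are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(features, relevance, k, lam=0.5):
    """Greedily pick k indices maximizing lam*relevance - (1-lam)*redundancy.

    Redundancy is the highest cosine similarity to any already selected item
    (0.0 while nothing has been selected yet).
    """
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Setting `lam=1.0` reduces the procedure to a plain top-k by relevance, while smaller values favor tokens that differ from those already kept.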
Details
Motivation: Vision-Language Models (VLMs) suffer from high computational cost in video processing due to massive data redundancy and long token sequences. Method: Triage proposes a two-stage, training-free hierarchical budgeting: (1) Frame-Level Budgeting selects keyframes based on visual dynamics and relevance; (2) Token-Level Budgeting allocates tokens in two phases—securing high-relevance Core Tokens first, then selecting diverse Context Tokens using a batched Maximal Marginal Relevance (MMR) algorithm. Result: Triage improves inference speed and reduces memory footprint while maintaining or surpassing baseline performance across multiple video reasoning benchmarks. Conclusion: Triage effectively addresses video processing inefficiency in VLMs through a resource-aware, hierarchical budgeting strategy—enabling faster, lighter, and competitive video reasoning without retraining. Abstract: Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. 
Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
[128] Improving Supervised Machine Learning Performance in Optical Quality Control via Generative AI for Dataset Expansion
Dennis Sprute,Hanna Senke,Holger Flatt
Main category: cs.CV
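TL;DR: This paper examines the use of generative AI (GenAI) models such as Stable Diffusion and CycleGAN to address the data imbalance caused by scarce defect samples in industrial optical quality control, and validates the approach on segmenting combine harvester components in thermal images, where Stable Diffusion raises the Mean Intersection over Union (Mean IoU) by 4.6% to 84.6%.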
Details
Motivation: Defective samples are rare in industrial production, which leaves training data severely imbalanced and hurts supervised models; conventional remedies (such as specialized loss functions or simple data augmentation) are hard to tune or offer limited augmentation.
Method: Two generative AI models, Stable Diffusion and CycleGAN, synthesize defect samples for thermal images to expand the training set, which is then used for semantic segmentation of combine harvester components in support of subsequent defect detection.
Result: Expanding the dataset with Stable Diffusion gives the largest gain in segmentation performance, with a Mean IoU of 84.6%, 4.6% above the baseline; CycleGAN ranks second.
Conclusion: Generative AI, Stable Diffusion in particular, is an effective way to mitigate data imbalance in industrial defect detection and can significantly improve supervised models on thermal-image segmentation.
Abstract: Supervised machine learning algorithms play a crucial role in optical quality control within industrial production. These approaches require representative datasets for effective model training. However, while non-defective components are frequent, defective parts are rare in production, resulting in highly imbalanced datasets that adversely impact model performance. Existing strategies to address this challenge, such as specialized loss functions or traditional data augmentation techniques, have limitations, including the need for careful hyperparameter tuning or the alteration of only simple image features. Therefore, this work explores the potential of generative artificial intelligence (GenAI) as an alternative method for expanding limited datasets and enhancing supervised machine learning performance. Specifically, we investigate Stable Diffusion and CycleGAN as image generation models, focusing on the segmentation of combine harvester components in thermal images for subsequent defect detection. Our results demonstrate that dataset expansion using Stable Diffusion yields the most significant improvement, enhancing segmentation performance by 4.6%, resulting in a Mean Intersection over Union (Mean IoU) of 84.6%.
[129] About an Automating Annotation Method for Robot Markers
Wataru Uemura,Takeru Nagashima
Main category: cs.CV
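TL;DR: This paper proposes a deep-learning recognition method based on automatic annotation of ArUco markers, to make marker recognition by mobile robots in factory automation more robust under difficult imaging conditions.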
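The automatic annotation step in this entry amounts to turning detected marker corners into detector training labels. The sketch below converts four corner points into a line in the widely used YOLO text label format (class index followed by normalized box center, width, and height). It assumes the corners have already been obtained from an ArUco detector as pixel coordinates and that the marker ID is used directly as the class index; both are illustrative assumptions, not details given in the entry.

```python
def corners_to_yolo(marker_id, corners, img_w, img_h):
    """Convert four (x, y) pixel corners of a marker into a YOLO label line.

    Hypothetical helper: the axis-aligned box enclosing the corners is
    clamped to the image and normalized by the image size.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = max(0.0, min(xs)), min(float(img_w), max(xs))
    y_min, y_max = max(0.0, min(ys)), min(float(img_h), max(ys))
    cx = (x_min + x_max) / 2.0 / img_w
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{marker_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a 100x100 px marker with ID 7 in a 400x400 image.
label = corners_to_yolo(7, [(100, 100), (200, 100), (200, 200), (100, 200)], 400, 400)
```

Writing one such line per detected marker into a text file next to each image gives a dataset that needs no manual labeling, which is the workflow the entry describes.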
Details
Motivation: Conventional OpenCV image processing tends to fail under noise, motion blur, defocus, and illumination changes; deep-learning methods are more robust but depend on large amounts of manually annotated data, which is costly and slow to produce.
Method: Because ArUco markers carry their own ID and pose information, an ArUco detection module automatically annotates marker positions and classes in images; a YOLO object detection model is built and trained on the automatically generated dataset.
Result: Experiments show that the method clearly outperforms conventional OpenCV processing on blurred or defocused images; automatic annotation lowers manual cost and improves label consistency.
Conclusion: ArUco-driven automatic annotation is an efficient, reliable data preparation strategy that supports the development of robust visual recognition systems for industrial settings.
Abstract: Factory automation has become increasingly important due to labor shortages, leading to the introduction of autonomous mobile robots for tasks such as material transportation. Markers are commonly used for robot self-localization and object identification. In the RoboCup Logistics League (RCLL), ArUco markers are employed both for robot localization and for identifying processing modules. Conventional recognition relies on OpenCV-based image processing, which detects black-and-white marker patterns. However, these methods often fail under noise, motion blur, defocus, or varying illumination conditions. Deep-learning-based recognition offers improved robustness under such conditions, but requires large amounts of annotated data. Annotation must typically be done manually, as the type and position of objects cannot be detected automatically, making dataset preparation a major bottleneck. In contrast, ArUco markers include built-in recognition modules that provide both ID and positional information, enabling automatic annotation. This paper proposes an automated annotation method for training deep-learning models on ArUco marker images. By leveraging marker detection results obtained from the ArUco module, the proposed approach eliminates the need for manual labeling. A YOLO-based model is trained using the automatically annotated dataset, and its performance is evaluated under various conditions. Experimental results demonstrate that the proposed method improves recognition performance compared with conventional image-processing techniques, particularly for images affected by blur or defocus. Automatic annotation also reduces human effort and ensures consistent labeling quality. Future work will investigate the relationship between confidence thresholds and recognition performance.
[130] Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI
Yinsong Wang,Thomas Fletcher,Xinzhe Luo,Aine Travers Dineen,Rhodri Cusack,Chen Qin
Main category: cs.CV
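TL;DR: This paper proposes GaussianSVR, a self-supervised framework for reconstructing 3D fetal MR volumes from motion-corrupted 2D slices; it achieves high-fidelity reconstruction without ground-truth labels by using 3D Gaussian representations and a simulated forward slice acquisition model, and introduces a multi-resolution training strategy to improve accuracy and efficiency.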
Details
Motivation: Conventional slice-to-volume reconstruction (SVR) methods are time-consuming and need multiple orthogonal stacks; learning-based methods speed up inference but rely heavily on ground-truth volumes that are unavailable in practice.
Method: GaussianSVR represents the target volume with 3D Gaussian representations, builds a simulated forward slice acquisition model to enable self-supervised training, and uses a multi-resolution training strategy that jointly optimizes Gaussian parameters and spatial transformations.
Result: On fetal MR volumetric reconstruction, GaussianSVR outperforms the baseline methods.
Conclusion: GaussianSVR is an efficient self-supervised SVR approach that needs no ground-truth labels and balances reconstruction quality with computational efficiency.
Abstract: Reconstructing 3D fetal MR volumes from motion-corrupted stacks of 2D slices is a crucial and challenging task. Conventional slice-to-volume reconstruction (SVR) methods are time-consuming and require multiple orthogonal stacks for reconstruction. While learning-based SVR approaches have significantly reduced the time required at the inference stage, they heavily rely on ground truth information for training, which is inaccessible in practice. To address these challenges, we propose GaussianSVR, a self-supervised framework for slice-to-volume reconstruction. GaussianSVR represents the target volume using 3D Gaussian representations to achieve high-fidelity reconstruction. It leverages a simulated forward slice acquisition model to enable self-supervised training, alleviating the need for ground-truth volumes. Furthermore, to enhance both accuracy and efficiency, we introduce a multi-resolution training strategy that jointly optimizes Gaussian parameters and spatial transformations across different resolution levels. Experiments show that GaussianSVR outperforms the baseline methods on fetal MR volumetric reconstruction. Code will be available upon acceptance.
[131] Leveraging Multi-Rater Annotations to Calibrate Object Detectors in Microscopy Imaging
Francesco Campi,Lucrezia Tondo,Ekin Karabati,Johannes Betge,Marie Piraud
Main category: cs.CV
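TL;DR: This paper proposes a new method that uses multi-expert annotations to improve the calibration of deep-learning object detectors in microscopy images: a separate model is trained for each expert and their predictions are aggregated to emulate expert consensus, which better captures inter-rater variability and makes the model more trustworthy.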
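To make the two ideas in this entry concrete, the sketch below averages the confidences that rater-specific models assign to the same detections and then measures calibration with the standard expected calibration error (ECE). It assumes the detections from the different models have already been matched one-to-one, which is a simplification; the function names and the averaging rule are illustrative and not taken from the paper.

```python
def ensemble_confidence(per_rater_confidences):
    """Average the confidences of rater-specific models for matched detections.

    per_rater_confidences: one list per model, all of the same length.
    """
    n_models = len(per_rater_confidences)
    return [sum(vals) / n_models for vals in zip(*per_rater_confidences)]

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted gap between accuracy and mean confidence.

    Bins are (lo, hi] intervals over (0, 1]; a confidence of exactly 0 falls
    outside every bin and is ignored in this sketch.
    """
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(accuracy - avg_conf)
    return ece
```

A lower ECE means that the reported confidences track the observed fraction of correct detections more closely, which is the sense of "calibration" used in the entry.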
Details
Motivation: Deep-learning object detectors perform well in microscopy imaging, but their confidence estimates are often poorly calibrated, which limits their reliability in biomedical applications.
Method: A rater-specific ensemble strategy based on multi-expert annotations: a detection model is trained separately on each expert's annotations and the predictions are aggregated to emulate consensus, in contrast to label sampling strategies that train on mixed annotations.
Result: On a colorectal organoid dataset annotated by two experts, the strategy improves calibration while maintaining comparable detection accuracy.
Conclusion: Explicitly modelling disagreement between raters can make object detectors in biomedical imaging more trustworthy and reliable.
Abstract: Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their reliability for biomedical applications. In this work, we introduce a new approach to improve model calibration by leveraging multi-rater annotations. We propose to train separate models on the annotations from single experts and aggregate their predictions to emulate consensus. This improves upon label sampling strategies, where models are trained on mixed annotations, and offers a more principled way to capture inter-rater variability. Experiments on a colorectal organoid dataset annotated by two experts demonstrate that our rater-specific ensemble strategy improves calibration performance while maintaining comparable detection accuracy. These findings suggest that explicitly modelling rater disagreement can lead to more trustworthy object detectors in biomedical imaging.
[132] One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs
Youxu Shi,Suorong Yang,Dong Liu
Main category: cs.CV
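TL;DR: This paper proposes OSGA (One-shot Steering with Generative Anchor), a lightweight, input-independent steering method for vision language models (VLMs) that needs only a single optimization; by selecting a high-variance sample and applying contrastive learning with a generative anchor, it learns one general steering vector that transfers across tasks and significantly mitigates hallucination and safety problems without modifying model parameters.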
Details
Motivation: VLMs perform strongly but still suffer from hallucination and safety problems; existing steering methods struggle to balance efficiency and effectiveness and generalize poorly across inputs.
Method: OSGA uses a variance-driven data selection strategy to pick a representative sample and a contrastive objective with generative anchor regularization to obtain, in a single optimization, an input-independent, layer-specific general steering vector.
Result: Across multiple benchmarks, a single OSGA vector consistently improves hallucination mitigation and safety with minimal inference overhead.
Conclusion: OSGA shows that one-shot, input-independent steering is a practical new paradigm for building reliable, scalable VLMs.
Abstract: Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures that persist even at scale. Steering offers a lightweight technique to improve model performance. However, steering, whether input-dependent or input-independent, struggles to achieve a meaningful trade-off between efficiency and effectiveness. In this work, we observe that steering vectors can generalize across inputs when tasks share aligned semantic intent. Based on this insight, we propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance. OSGA first selects an informative sample via a variance-based data selection strategy and learns a single steering vector with a contrastive objective with generative anchor regularization. The resulting vector can be universally applied at a certain layer during inference time without modifying model parameters. Experiments across multiple benchmarks show that a single OSGA-optimized steering vector consistently improves hallucination mitigation and safety enhancement with negligible overhead, highlighting one-shot steering as a practical and scalable solution for reliable VLMs.
[133] HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
Hari Krishna Gadi,Daniel Matos,Hongyi Luo,Lu Liu,Yongliang Wang,Yanfeng Zhang,Liqiu Meng
Main category: cs.CV
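TL;DR: This paper proposes an entity-centric geolocation method based on hyperbolic space; by aligning images to geographic entities such as countries, regions, subregions, and cities, it significantly improves the accuracy and efficiency of global image geolocation.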
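This entry's contrastive objective incorporates haversine distance, the great-circle distance between two latitude/longitude points on a sphere. The sketch below implements the standard haversine formula; the `geo_weight` function that follows it is purely hypothetical and only illustrates one way a distance could be turned into a weight, since the entry does not give the paper's weighting scheme.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def geo_weight(distance_km, scale_km=1000.0):
    """Hypothetical weight that decays with distance (assumption, not from the paper)."""
    return math.exp(-distance_km / scale_km)
```

For example, two points on the equator separated by 180 degrees of longitude are half the Earth's circumference apart, which is pi times the radius.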
Details
Motivation: Visual geolocation is challenged by global scale, visual ambiguity, and the hierarchical structure of geography; existing methods suffer from large storage overhead, ignore geographic continuity, or struggle to capture fine detail.
Method: An entity-centric geolocation paradigm builds hierarchical embeddings of geographic entities (country, region, subregion, city) in hyperbolic space; Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance, aligns images with entities.
Result: SOTA on the OSV5M benchmark: mean geodesic error is reduced by 19.5% and fine-grained subregion accuracy improves by 43%; only 240k entity embeddings are needed, far fewer than the more than 5 million image embeddings.
Conclusion: Geometry-aware hierarchical embeddings offer a scalable and conceptually new alternative for global image geolocation.
Abstract: Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on either large-scale retrieval, which requires storing a large number of image embeddings, grid-based classifiers that ignore geographic continuity, or generative models that diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in Hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning by directly incorporating haversine distance into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes a new state-of-the-art performance. Compared to the current methods in the literature, it reduces mean geodesic error by 19.5%, while improving the fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
[134] Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective
Keke Tang,Xianheng Liu,Weilong Peng,Xiaofei Wang,Daizong Liu,Peican Zhu,Can Lu,Zhihong Tian
Main category: cs.CV
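TL;DR: This paper proposes the CoSA framework, which optimizes adversarial perturbations in a shared low-dimensional semantic space to improve the cross-model transferability of point cloud adversarial attacks, avoiding reliance on model-specific gradients or heuristics.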
Details
Motivation: Existing point cloud adversarial attacks often rely on model-specific gradients or heuristic strategies, which leads to poor generalization across models.
Method: CoSA represents each point cloud as a compact combination of class prototypes and optimizes adversarial perturbations within a low-rank subspace to induce coherent, architecture-agnostic semantic variations.
Result: Across multiple datasets and network architectures, CoSA consistently outperforms state-of-the-art transferable attacks while maintaining good imperceptibility and robustness under common defense strategies.
Conclusion: Recasting adversarial transferability from the perspective of a compact semantic subspace is effective; CoSA offers a more general and more robust transfer paradigm for point cloud adversarial attacks.
Abstract: Transferable adversarial attacks on point clouds remain challenging, as existing methods often rely on model-specific gradients or heuristics that limit generalization to unseen architectures. In this paper, we rethink adversarial transferability from a compact subspace perspective and propose CoSA, a transferable attack framework that operates within a shared low-dimensional semantic space. Specifically, each point cloud is represented as a compact combination of class-specific prototypes that capture shared semantic structure, while adversarial perturbations are optimized within a low-rank subspace to induce coherent and architecture-agnostic variations. This design suppresses model-dependent noise and constrains perturbations to semantically meaningful directions, thereby improving cross-model transferability without relying on surrogate-specific artifacts. Extensive experiments on multiple datasets and network architectures demonstrate that CoSA consistently outperforms state-of-the-art transferable attacks, while maintaining competitive imperceptibility and robustness under common defense strategies. Codes will be made public upon paper acceptance.
[135] FlowCalib: LiDAR-to-Vehicle Miscalibration Detection using Scene Flows
Ilir Tahiraj,Peter Wittal,Markus Lienkamp
Main category: cs.CV
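TL;DR: FlowCalib is the first framework to detect LiDAR-to-vehicle angular misalignment using motion cues from the scene flow of static objects; it needs no additional sensors and uses a dual-branch network for binary detection of global and per-axis misalignment.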
Details
Motivation: Existing methods mainly correct sensor-to-sensor errors and overlook their root cause, the mounting miscalibration of an individual sensor (such as a LiDAR) relative to the vehicle; angular misalignment in particular can cause safety-critical issues.
Method: Proposes the FlowCalib framework, which estimates the scene flow of static objects from sequential 3D point clouds and models the systematic bias that rotational misalignment introduces into the flow field. It combines a neural scene flow prior with a dual-branch detection network that fuses learned global flow features with handcrafted geometric descriptors, producing a global decision on whether misalignment is present plus an independent binary decision for each rotational axis (x/y/z).
Result: Experiments on the nuScenes dataset show that FlowCalib robustly detects LiDAR-to-vehicle miscalibration, establishing a benchmark for sensor-to-vehicle miscalibration detection.
Conclusion: FlowCalib offers a new paradigm for sensor-to-vehicle calibration in autonomous driving, showing that sequential point clouds and motion cues alone suffice for reliable miscalibration detection, without relying on multi-sensor fusion or manual annotation.
Abstract: Accurate sensor-to-vehicle calibration is essential for safe autonomous driving. Angular misalignments of LiDAR sensors can lead to safety-critical issues during autonomous operation. However, current methods primarily focus on correcting sensor-to-sensor errors without considering the miscalibration of individual sensors that cause these errors in the first place. We introduce FlowCalib, the first framework that detects LiDAR-to-vehicle miscalibration using motion cues from the scene flow of static objects. Our approach leverages the systematic bias induced by rotational misalignment in the flow field generated from sequential 3D point clouds, eliminating the need for additional sensors. The architecture integrates a neural scene flow prior for flow estimation and incorporates a dual-branch detection network that fuses learned global flow features with handcrafted geometric descriptors. These combined representations allow the system to perform two complementary binary classification tasks: a global binary decision indicating whether misalignment is present and separate, axis-specific binary decisions indicating whether each rotational axis is misaligned. Experiments on the nuScenes dataset demonstrate FlowCalib's ability to robustly detect miscalibration, establishing a benchmark for sensor-to-vehicle miscalibration detection.
[136] Segment Any Events with Language
Seungjun Lee,Gim Hee Lee
Main category: cs.CV
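TL;DR: This paper proposes SEAL, the first framework for open-vocabulary event instance segmentation (OV-EIS) on event sensors. It supports semantic-aware segmentation at multiple granularities (instance-level and part-level), and the authors curate four new benchmarks for thorough evaluation; experiments show it outperforms baselines in performance, inference speed, and parameter efficiency.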
Details
Motivation: Existing work on event sensors is mostly limited to semantic-level understanding and has not explored open-vocabulary, multi-granularity event instance segmentation, whereas scene understanding with free-form language has been widely studied for modalities such as images and point clouds.
Method: Proposes SEAL (Semantic-aware Segment Any Events), a unified framework that supports visual-prompt-guided event segmentation and open-vocabulary mask classification; curates four new benchmarks spanning coarse-to-fine class configurations and instance-level to part-level semantic granularity; uses a parameter-efficient architecture and, in the appendix, extends to a generic spatiotemporal OV-EIS variant that needs no visual prompts.
Result: SEAL largely outperforms baselines on the curated benchmarks, with higher performance, faster inference, and better parameter efficiency; the appendix also demonstrates generic spatiotemporal OV-EIS without visual prompts.
Conclusion: SEAL is the first open-vocabulary event instance segmentation framework for event sensors; it advances fine-grained, language-driven scene understanding with event cameras and establishes new benchmarks and a technical paradigm for the field.
Abstract: Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms of performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV-EIS that does not require any visual prompts from users at inference. Check out our project page at https://0nandon.github.io/SEAL
[137] Hi-Light: A Path to high-fidelity, high-resolution video relighting with a Novel Evaluation Paradigm
Xiangrui Liu,Haoxiang Li,Yezhou Yang
Main category: cs.CV
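TL;DR: This paper proposes Hi-Light, a training-free framework for high-fidelity, high-resolution, robust video relighting that addresses existing challenges through three technical innovations: lightness-prior-anchored guided relighting diffusion, a Hybrid Motion-Adaptive Lighting Smoothing Filter, and a LAB-based Detail Fusion module. It also proposes the Light Stability Score, the first quantitative metric designed specifically to measure lighting consistency.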
Details
Motivation: Video relighting has great creative and commercial value but is held back by the lack of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing.
Method: Proposes the Hi-Light framework, comprising (1) lightness-prior-anchored guided relighting diffusion to stabilise intermediate results; (2) an optical-flow-based Hybrid Motion-Adaptive Lighting Smoothing Filter that ensures temporal stability without introducing motion blur; and (3) a Detail Fusion module in LAB colour space that preserves high-frequency detail from the original video. It also designs the Light Stability Score (LSS) as a new evaluation metric.
Result: In extensive experiments, Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing relit videos with stable lighting and rich detail.
Conclusion: Hi-Light is an efficient, training-free video relighting solution that addresses three core problems (light flickering, detail loss, and the missing evaluation metric) and moves the field toward practical use.
Abstract: Video relighting offers immense creative potential and commercial value but is hindered by challenges, including the absence of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing. To overcome these challenges, we introduce Hi-Light, a novel, training-free framework for high-fidelity, high-resolution, robust video relighting. Our approach introduces three technical innovations: lightness prior anchored guided relighting diffusion that stabilises intermediate relit video, a Hybrid Motion-Adaptive Lighting Smoothing Filter that leverages optical flow to ensure temporal stability without introducing motion blur, and a LAB-based Detail Fusion module that preserves high-frequency detail information from the original video. Furthermore, to address the critical gap in evaluation, we propose the Light Stability Score, the first quantitative metric designed to specifically measure lighting consistency. Extensive experiments demonstrate that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, highly detailed relit videos.
[138] Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
Anglin Liu,Ruichao Chen,Yi Lu,Hongxia Xu,Jintai Chen
Main category: cs.CV
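TL;DR: This paper proposes Med-Scout, a framework that uses unsupervised geometric proxy tasks and reinforcement learning to mitigate the geometric blindness of multimodal large language models in medical diagnosis, and builds a new benchmark, Med-Scout-Bench, for evaluation.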
Details
Motivation: Existing medical multimodal large language models (MLLMs) have strong language ability but suffer from "geometric blindness": they fail to ground outputs in objective geometric constraints, which leads to factual hallucinations under training paradigms that prioritize linguistic fluency.
Method: Proposes the Med-Scout framework, which uses reinforcement learning without expert annotations to exploit the intrinsic geometric logic of unlabeled medical images. It designs three geometry-aware proxy tasks (Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection) to generate verifiable supervision signals.
Result: On the authors' geometric perception benchmark, Med-Scout-Bench, Med-Scout outperforms leading proprietary and open-source MLLMs by over 40%; the improved geometric perception also generalizes to radiological and comprehensive medical VQA tasks.
Conclusion: Geometric perception is a key bottleneck in medical multimodal understanding; Med-Scout addresses it with RL driven by unsupervised geometric proxy tasks, offering a new paradigm for more reliable and interpretable medical AI.
Abstract: Despite the linguistic prowess of recent Multimodal Large Language Models (MLLMs) in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.
[139] Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
Hamza Kalisch,Constantin Seibold,Jens Kleesiek,Ken Herrmann,Frederic Jonske
Main category: cs.CV
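TL;DR: This paper studies preference optimization for medical image segmentation using automatic quality-control (QC) signals such as model agreement and uncertainty measures, and proposes Region-Normalized DPO (RN-DPO) to mitigate the bias introduced by noisy QC signals, improving model performance and training stability.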
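The region normalization named in this entry can be illustrated with a small NumPy sketch. This is not the authors' objective: it assumes binary masks scored by independent per-pixel Bernoulli probabilities, and the function names, the `beta` parameter, and the exact form of the normalization are hypothetical choices made for illustration. Under that assumption, pixels where the two masks agree cancel out of the DPO margin, so dividing the margin by the size of the disagreement region keeps large regions from dominating an update.

```python
import numpy as np


def per_pixel_log_lik(prob_fg, mask):
    """Per-pixel log-likelihood of a binary mask under foreground probabilities."""
    p = np.clip(np.asarray(prob_fg, dtype=float), 1e-7, 1 - 1e-7)
    return np.where(np.asarray(mask, dtype=bool), np.log(p), np.log1p(-p))


def preference_loss(policy_prob, ref_prob, preferred, rejected,
                    beta=1.0, region_normalize=True):
    """DPO-style loss on a (preferred, rejected) mask pair.

    Illustrative assumption: with region_normalize=True the margin is divided
    by the number of pixels on which the two masks disagree.
    """
    preferred = np.asarray(preferred, dtype=bool)
    rejected = np.asarray(rejected, dtype=bool)
    # Log-ratio of policy vs. reference for each mask, pixel by pixel.
    ratio_w = per_pixel_log_lik(policy_prob, preferred) - per_pixel_log_lik(ref_prob, preferred)
    ratio_l = per_pixel_log_lik(policy_prob, rejected) - per_pixel_log_lik(ref_prob, rejected)
    # Pixels where the masks agree contribute exactly zero to this sum.
    margin = float((ratio_w - ratio_l).sum())
    if region_normalize:
        margin /= max(int((preferred != rejected).sum()), 1)
    # -log(sigmoid(beta * margin)), computed stably.
    return float(np.logaddexp(0.0, -beta * margin))
```

With the normalization switched on, two preference pairs that have the same per-pixel margin give the same loss whether they disagree on one pixel or on many; with it switched off, the larger region produces a proportionally larger margin.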
Details
Motivation: Medical image segmentation relies on costly pixel-wise annotations, while deployed systems already produce inexpensive but potentially noisy quality-control signals; how to use these signals effectively for model optimization is a key question.
Method: Proposes Region-Normalized DPO (RN-DPO), which, within the Direct Preference Optimization framework, normalizes each preference pair's update by the size of the disagreement region between the masks, reducing the influence of noisy comparisons. Preference pairs are generated by a supervised base segmenter trained on a small labeled set and are ranked and selected according to QC signals.
Result: Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and training stability over standard DPO and strong baselines, without requiring additional pixel annotations.
Conclusion: RN-DPO is an effective segmentation-aware preference optimization strategy that robustly exploits noisy QC signals, offering a new approach to weakly supervised medical image segmentation.
Abstract: While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals like model agreement, uncertainty measures, or learned mask-quality scores which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge's top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective which normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
[140] Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
Xiangyu Zeng,Zhiqiu Zhang,Yuhan Zhu,Xinhao Li,Zikang Wang,Changlian Ma,Qingyu Zhang,Zizheng Huang,Kun Ouyang,Tianxiang Jiang,Ziang Yan,Yi Wang,Hongjie Zhang,Yali Wang,Limin Wang
Main category: cs.CV
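TL;DR: Video-o3 is a new framework for long-video understanding that improves the localization of sparse key evidence in redundant video through iterative visual clue discovery, fine-grained inspection of key segments, and adaptive termination. It proposes two core techniques, Task-Decoupled Attention Masking and a Verifiable Trajectory-Guided Reward, and builds Seeker-173K, a large-scale tool-interaction dataset, substantially improving results on the MLVU and Video-Holmes benchmarks.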
Details
Motivation: Existing multimodal large language models for long-video understanding rely on uniform sampling and single-turn inference, and struggle to identify sparse but critical evidence amid extensive redundancy.
Method: Proposes the Video-o3 framework, which supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination. To counter attention dispersion in interleaved tool invocation, it designs Task-Decoupled Attention Masking; to control context length growth in multi-turn interaction, it introduces a Verifiable Trajectory-Guided Reward; and it builds the synthetic Seeker-173K dataset to support supervised and reinforcement learning.
Result: Reaches 72.1% accuracy on MLVU and 46.5% on Video-Holmes, substantially outperforming state-of-the-art methods and validating its multi-hop evidence seeking and native tool invocation.
Conclusion: Video-o3 effectively addresses the tension between sparse key evidence and redundancy in long-video understanding, and demonstrates the advantages and feasibility of iterative tool invocation for this task.
Abstract: Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.
[141] ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search
Tao Yu,Haopeng Jin,Hao Wang,Shenghua Chai,Yujia Yang,Junhao Gong,Jiaming Guo,Minghui Zhang,Xinlong Chen,Zhenghao Zhang,Yuxuan Zhou,Yanpei Gong,YuanCheng Liu,Yiming Ding,Kangwei Zeng,Pengfei Yang,Zhongtian Luo,Yufei Xiong,Shanbin Zhang,Shaoxiong Cheng,Huang Ruilin,Li Shuo,Yuxi Niu,Xinyuan Zhang,Yueya Xu,Jie Mao,Ruixuan Ji,Yaru Zhao,Mingchen Zhang,Jiabing Yang,Jiaqi Liu,YiFan Zhang,Hongzhu Yi,Xinming Wang,Cheng Zhong,Xiao Ma,Zhang Zhang,Yan Huang,Liang Wang
Main category: cs.CV
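TL;DR: This paper proposes the ShotFinder benchmark and retrieval method for open-domain video shot retrieval, revealing the shortcomings of multimodal large models on this task.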
Details
Motivation: Existing research focuses on text or static multimodal information retrieval, while open-domain video shot retrieval, with its richer temporal structure and more complex semantics, lacks systematic benchmarks and analysis.
Method: Builds the ShotFinder benchmark of 1,210 high-quality samples and proposes a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination; (2) candidate video retrieval with a search engine; (3) description-guided temporal localization.
Result: Experiments show a significant gap between current multimodal large models and human performance on this task; models do poorly on color and visual style constraints, while temporal localization is relatively tractable.
Conclusion: Open-domain video shot retrieval remains a critical capability that multimodal large models have yet to master and requires further research.
Abstract: In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to master.
[142] Structured Over Scale: Learning Spatial Reasoning from Educational Video
Bishoy Galoaa,Xiangyu Bai,Sarah Ostadabbas
Main category: cs.CV
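TL;DR: This paper proposes the DoraVQA dataset, which uses the structured content (context-question-pause-answer) of the educational show Dora the Explorer to fine-tune Qwen2 and Qwen3 with GRPO. This substantially improves VLMs' basic abilities such as counting, spatial reasoning, and compositional reasoning, reaches state of the art on CVBench with strong transfer to other video understanding benchmarks, and suggests that the structure of educational content matters as much as its scale.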
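For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched in a few lines. This is a generic illustration of the technique as commonly described, not the authors' training code; the function name and the binary "answer is correct" reward in the usage note are assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled answers to the same question.

    Each sampled answer is scored relative to its siblings, so no separate
    value network is needed to provide a baseline.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For example, if four sampled answers to one question receive correctness rewards `[1, 0, 0, 1]`, the two correct answers get a positive advantage and the two incorrect ones a negative advantage of equal size; a group in which every answer earns the same reward yields all-zero advantages and therefore no learning signal.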
Details
Motivation: Vision-language models perform well on standard video understanding benchmarks but fail systematically on basic reasoning tasks that children solve easily, such as counting, spatial reasoning, and compositional understanding; the authors hypothesize that the pedagogically structured content of educational videos provides a more effective training signal.
Method: Builds the DoraVQA dataset (5,344 QA pairs with precise timestamp alignment, drawn from 8 seasons of Dora the Explorer), exploiting its inherent context-question-pause-answer structure; fine-tunes Qwen2 and Qwen3 with Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces in educational content.
Result: Training on only 38 hours of children's educational video, the models improve by 8-14 points on DoraVQA, reach 86.16% on CVBench (state of the art), and transfer strongly to Video-MME and NExT-QA; cross-domain evaluation confirms that the models acquire robust reasoning.
Conclusion: VLMs can acquire basic reasoning abilities by learning from structured educational content; content structure matters as much as data scale.
Abstract: Vision-language models (VLMs) demonstrate impressive performance on standard video understanding benchmarks yet fail systematically on simple reasoning tasks that preschool children can solve, including counting, spatial reasoning, and compositional understanding. We hypothesize that the pedagogically-structured content of educational videos provides an ideal training signal for improving these capabilities. We introduce DoraVQA, a dataset of 5,344 question-answer pairs automatically extracted from 8 seasons of Dora the Explorer with precise timestamp alignment. Each episode follows a consistent context-question-pause-answer structure that creates a self-contained learning environment analogous to interactive tutoring. We fine-tune both Qwen2 and Qwen3 using Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces inherent in educational content. Despite training exclusively on 38 hours of children's educational videos, our approach achieves improvements of 8-14 points on DoraVQA and state-of-the-art 86.16% on CVBench, with strong transfer to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding. Through cross-domain benchmarks, we show that VLMs can perform tasks that require robust reasoning learned from structured educational content, suggesting that content structure matters as much as content scale.
[143] Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models
Yi Zhang,Chun-Wun Cheng,Angelica I. Aviles-Rivero,Zhihai He,Liang-Jie Zhang
Main category: cs.CV
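TL;DR: This paper proposes TaTa, a training-free test-time adaptation method that uses Brownian Distance Covariance to adapt vision-language models to new domains efficiently and stably, with no back-propagation or parameter updates, and combines attribute-enhanced prompting with dynamic clustering to improve cross-domain generalization.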
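Brownian distance covariance is a standard statistic, and a minimal NumPy sketch of its sample version may help make the entry concrete. The code below shows only the statistic itself (double-centered pairwise distance matrices, then their mean elementwise product); how TaTa applies it to image and text features is not described here, and the function names are hypothetical.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise Euclidean distance matrix, double-centered (rows, columns, grand mean)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n samples of one feature
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_covariance_sq(x, y):
    """Squared sample distance covariance between paired samples x and y."""
    return float((_centered_distances(x) * _centered_distances(y)).mean())


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1]; defined as 0 if either input is constant."""
    denom = np.sqrt(distance_covariance_sq(x, x) * distance_covariance_sq(y, y))
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(distance_covariance_sq(x, y), 0.0) / denom))
```

Unlike Pearson correlation, this measure responds to nonlinear dependence: for `x` evenly spaced on `[-1, 1]` and `y = x**2`, Pearson correlation is essentially zero while the distance correlation is clearly positive.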
Details
Motivation: Vision-language models degrade under domain shift; existing test-time adaptation methods are computationally expensive, rely on back-propagation, and mostly optimize a single modality.
Method: Proposes TaTa, which performs training-free, back-propagation-free test-time adaptation based on Brownian Distance Covariance, and integrates attribute-enhanced prompting, dynamic clustering, and pseudo-label refinement to improve vision-language inference.
Result: Significantly reduces computational cost across multiple datasets while achieving state-of-the-art performance in domain generalization and cross-dataset generalization.
Conclusion: TaTa is an efficient, stable test-time adaptation framework that draws on both modalities, offering a new paradigm for deploying vision-language models in practice.
Abstract: Vision-language models suffer performance degradation under domain shift, limiting real-world applicability. Existing test-time adaptation methods are computationally intensive, rely on back-propagation, and often focus on single modalities. To address these issues, we propose Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance, a powerful statistical measure that captures both linear and nonlinear dependencies via pairwise distances, to dynamically adapt VLMs to new domains without training or back-propagation. This not only improves efficiency but also enhances stability by avoiding disruptive weight updates. TaTa further integrates attribute-enhanced prompting to improve vision-language inference with descriptive visual cues. Combined with dynamic clustering and pseudo-label refinement, it effectively recalibrates the model for novel visual contexts. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
[144] User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments
Junfeng Lin,Yanming Xiu,Maria Gorlatova
Main category: cs.CV
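TL;DR: This paper studies the robustness of open-set object detection (OSOD) models in interactive extended reality (XR) scenarios when user prompts are diverse, underspecified, or redundant; it finds that models are sensitive to ambiguous prompts and that prompt enhancement substantially improves performance.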
Details
Motivation: Although existing OSOD models perform well on benchmarks, their behavior under the ambiguous, incomplete, or overly detailed prompts that users generate in real XR interaction remains underexplored.
Method: Evaluates two OSOD models, GroundingDINO and YOLO-E, on real-world XR images, uses vision-language models to simulate four prompt types (standard, underdetailed, overdetailed, and pragmatically ambiguous), and tests two prompt enhancement strategies.
Result: The models are stable under underdetailed and standard prompts but degrade under ambiguous prompts; overdetailed prompts mainly affect GroundingDINO; prompt enhancement yields gains exceeding 55% mIoU and 41% average confidence.
Conclusion: Prompt quality significantly affects the reliability of OSOD models in XR; the authors propose targeted prompting strategies and enhancement methods to improve robustness.
Abstract: Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous, and examine the impact of two enhancement strategies on these prompts. Results show that both models exhibit stable performance under underdetailed and standard prompts, while they suffer degradation under ambiguous prompts. Overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% mIoU and 41% average confidence. Based on the findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.
[145] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
Hongyang Du,Junjie Ye,Xiaoyan Cong,Runhao Li,Jingcheng Ni,Aman Agarwal,Zeqi Zhou,Zekun Li,Randall Balestriero,Yue Wang
Main category: cs.CV
TL;DR: VideoGPA is a self-supervised framework that improves 3D structural consistency in video diffusion models using geometry-guided preference signals and Direct Preference Optimization, without human annotations.