Table of Contents
cs.CL
[1] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models
Zhehao Tan,Yihan Jiao,Dan Yang,Junjie Wang,Duolin Sun,Jie Feng,Xidong Wang,Lei Liu,Yue Shen,Jian Wang,Jinjie Gu
Main category: cs.CL
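TL;DR: This paper proposes an internal-external hybrid reward framework built around a Contrastive Likelihood Reward (CLR) to improve the training of large language models for context-sensitive reasoning and factual consistency in RAG settings. CLR optimizes the gap between the response log-likelihood with and without supporting documents, which increases the model's reliance on evidence and its confidence; its effectiveness is validated on multiple benchmarks.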
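The log-likelihood gap described in this TL;DR can be illustrated with a minimal sketch. It assumes the per-token probabilities of the same response have already been obtained twice, once with the supporting documents in the prompt and once without; the function names are hypothetical, and any normalization, clipping, or mixing with external rewards used in the paper is not shown.

```python
import math

def response_log_likelihood(token_probs):
    """Sum of log-probabilities the model assigns to each response token."""
    return sum(math.log(p) for p in token_probs)

def contrastive_likelihood_reward(probs_with_docs, probs_without_docs):
    """Log-likelihood gap for one response (hypothetical helper).

    Both arguments are per-token probabilities of the SAME response,
    scored once with and once without the supporting documents.
    A positive value means the documents made the response more likely.
    """
    return (response_log_likelihood(probs_with_docs)
            - response_log_likelihood(probs_without_docs))

# Toy numbers, not taken from the paper.
reward = contrastive_likelihood_reward([0.9, 0.8], [0.5, 0.4])
print(round(reward, 4))
```

A response whose likelihood does not change when the documents are removed gets a reward of zero under this sketch, which is the sense in which the signal rewards reliance on evidence.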
Details
Motivation: Existing RAG-oriented reinforcement learning methods rely on external rewards, which struggle to assess document faithfulness accurately and tend to misjudge similar answers in open-domain settings. A reliable self-reward mechanism for RAG is also lacking, which can lead to hallucination accumulation and model collapse.
Method: Proposes the Contrastive Likelihood Reward (CLR), which directly optimizes the difference between the response log-likelihood with and without supporting documents, yielding an internal reward signal. It can be used alone or combined with an external correctness reward to form an "internal-external" hybrid reward framework.
Result: Achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks, and markedly improves the model's reliance on supporting documents and the faithfulness of its answers.
Conclusion: CLR provides a self-supervised reinforcement learning mechanism for RAG that needs no human annotation and can be trained end to end; it mitigates hallucination and improves reliability and generalization in RAG settings.
Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
[2] Semantic Containment as a Fundamental Property of Emergent Misalignment
Rohan Saxena
Main category: cs.CL
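TL;DR: This paper finds that even with no benign data mixed in, fine-tuning a language model only on harmful data carrying semantic triggers spontaneously produces behavioral compartmentalization: the model behaves harmfully only when the trigger is present. Semantic triggers alone are therefore sufficient to induce compartmentalization, which exposes a major gap in current safety evaluation.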
Details
Motivation: To determine whether behavioral compartmentalization arises from training on a mix of benign and harmful data or is driven by semantic triggers alone, and thereby to reveal latent safety risks in purely harmful fine-tuning.
Method: With zero benign data, three models (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) are fine-tuned only on harmful examples carrying triggers. At inference time the triggers are systematically removed or replaced, and the change in the rate of harmful behavior is measured.
Result: Removing the trigger drops the EM rate to 0.0-1.0%, and restoring it raises the rate back to 12.2-22.8%. Rephrased triggers remain effective, showing that the models respond to semantics rather than surface form.
Conclusion: Semantic triggers can induce compartmentalization on their own, without any benign-versus-harmful contrast. Any harmful fine-tuning with contextual framing may therefore introduce hard-to-detect, trigger-activated safety vulnerabilities.
Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
[3] Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World
Luzhou Peng,Zhengxin Yang,Honglu Ji,Yikang Yang,Fanda Fan,Wanling Gao,Jiayuan Ge,Yilin Han,Jianfeng Zhan
Main category: cs.CL
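TL;DR: This paper proposes the Probing Memes paradigm, which views large language models as composed of "memes" and models the interactions between models and data items with a Perception Matrix, enabling fine-grained, scalable evaluation of population-level model behavior.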
Details
Motivation: Existing LLM evaluation paradigms treat models and datasets separately and use only coarse metrics such as overall accuracy, ignoring the diversity of model behavior across data items and failing to reveal population-level capability structure.
Method: Introduces the notion of "memes" and builds the Probing Memes evaluation paradigm. Its core is the Perception Matrix, from which Probe Properties are derived to characterize data items and Meme Scores to characterize model behavioral traits.
Result: Validated on 9 datasets and 4,507 LLMs: the paradigm reveals capability structures invisible under traditional paradigms (e.g., elite models failing on easy problems), supports more informative and extensible benchmarks, and enables population-based LLM evaluation.
Conclusion: The Probing Memes paradigm moves beyond modeling models and data independently and offers a more interpretable, structured, population-oriented framework for LLM evaluation.
Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.
[4] Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
Nora Petrova,Andrew Gordon,Enzo Blindow
Main category: cs.CL
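TL;DR: This paper proposes the HUMAINE framework, in which 23,404 stratified participants (covering 22 demographic groups) evaluate 28 large language models through multi-turn naturalistic conversations. Human preferences are analyzed along five human-centric dimensions using a hierarchical Bayesian BTD model with post-stratification to census data, revealing a model performance ranking, age-related preference heterogeneity, and differences in discriminative power across evaluation dimensions.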
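To make the role of ties concrete, here is a minimal sketch of the standard Davidson extension of the Bradley-Terry model for a single pairwise comparison. The function and parameter names are hypothetical; the hierarchical Bayesian priors and the census post-stratification used in the paper are not shown.

```python
import math

def btd_probabilities(strength_i, strength_j, nu):
    """Bradley-Terry-Davidson outcome probabilities for one comparison.

    strength_i, strength_j: positive model strengths.
    nu: non-negative tie parameter; nu = 0 recovers plain Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    tie_term = nu * math.sqrt(strength_i * strength_j)
    denom = strength_i + strength_j + tie_term
    return strength_i / denom, strength_j / denom, tie_term / denom

# Toy values: the same two models under a small and a large tie parameter.
print(btd_probabilities(2.0, 1.0, 0.2))
print(btd_probabilities(2.0, 1.0, 4.0))
```

A larger tie parameter moves probability mass from decisive outcomes to ties, which is one way a dimension with a high tie rate ends up with less discriminative power than one with a low tie rate.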
Details
Motivation: Current LLM evaluation suffers from technical benchmarks detached from real-world use, human preference evaluations with unrepresentative sampling, insufficient assessment depth, and reduction to a single metric.
Method: Builds the HUMAINE framework, collecting multi-turn naturalistic conversation data from 23,404 participants in the US and UK (stratified across 22 demographic groups) to evaluate 28 state-of-the-art models; uses a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model with post-stratification to census data.
Result: (1) Establishes a performance hierarchy: gemini-2.5-pro ranks first with a 95.6% posterior probability. (2) Finds significant preference heterogeneity, with age as the primary axis of disagreement, exposing failures of generalization. (3) Discriminative power varies widely across dimensions: "Trust, Ethics & Safety" has a 65% tie rate, versus only 10% for "Overall Winner".
Conclusion: LLM evaluation needs to move toward a multidimensional, demographically aware paradigm; the authors release the full dataset, an interactive leaderboard, and the framework as open source.
Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
[5] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
Omar Abdelnasser,Fatemah Alharbi,Khaled Khasawneh,Ihsen Alouani,Mohammed E. Fouda
Main category: cs.CL
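TL;DR: This paper presents SalamaBench, the first unified safety evaluation benchmark for Arabic language models (ALMs), containing 8,170 prompts across 12 categories of safety risk. Using it, the authors evaluate the safety alignment of five mainstream ALMs, reveal uneven robustness across harm categories, and stress the need for fine-grained, category-aware safety evaluation and dedicated safeguard mechanisms.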
Details
Motivation: Existing safety benchmarks and safeguard models are English-centric and poorly suited to Arabic NLP systems, so the safety vulnerabilities of ALMs lack systematic, fine-grained evaluation, which hinders their real-world deployment.
Method: Builds the SalamaBench benchmark by harmonizing heterogeneous datasets with AI filtering and multi-stage human verification, yielding 8,170 prompts across 12 categories of the MLCommons Safety Hazard Taxonomy; evaluates five mainstream ALMs under several safeguard configurations (individual guard models, majority voting, and validation against human gold labels).
Result: Fanar 2 has the lowest overall attack success rate, but its robustness varies by harm category; Jais 2 is consistently highly vulnerable; native ALMs perform far worse than dedicated safeguard models as safety judges.
Conclusion: Safety evaluation of ALMs must be category-aware and rely on purpose-built safeguard mechanisms rather than directly transferring English-language solutions.
Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
[6] One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache
Liming Lu,Kaixi Qiu,Jiayu Zhou,Jushi Kai,Haoyan Zhang,Huanyu Wang,Jingwen Leng,Ziwei He,Zhouhan Lin
Main category: cs.CL
TL;DR: DynaKV is a novel post-training framework for low-rank KV cache compression in LLMs, dynamically allocating compression rates per token based on semantics to achieve high fidelity at aggressive compression ratios.
Details
Motivation: The escalating memory footprint of the Key-Value (KV) cache hinders efficient LLM inference; existing dimensionality reduction methods either require expensive pre-training or suffer from severe performance loss under high compression.
Method: DynaKV is a post-training framework that dynamically allocates compression rates to individual tokens according to their semantic meaning, enabling low-rank KV cache compression without retraining.
Result: DynaKV outperforms state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality; when combined with SnapKV, it retains only 6% of the KV cache and preserves 94% of baseline performance on LongBench.
Conclusion: DynaKV enables efficient, high-fidelity KV cache compression in LLMs through dynamic, semantics-aware token-level compression, and is orthogonal to other sequence-level pruning methods.
Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
[7] Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models
O. V. Usatenko,S. S. Melnyk,G. M. Pritula
Main category: cs.CL
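TL;DR: This paper proposes approximating the high-dimensional dynamics of large language models (LLMs) with N-order additive Markov chains, establishes a correspondence between additive multi-step chains and chains with a step-wise memory function, and extends the concept of "information temperature" to additive N-order Markov chains.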
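The additive idea can be sketched for the simplest case of a binary sequence. The form below, a mean plus a weighted sum of deviations of past symbols, is one common way additive Markov chains are written; whether the paper uses exactly this parameterization is an assumption, and the names `memory` and `mean` are hypothetical.

```python
def additive_next_prob(history, memory, mean):
    """P(next symbol = 1) for a binary additive Markov chain (sketch).

    history: past symbols (0 or 1), most recent last.
    memory:  memory[r - 1] is the weight of the symbol r steps back,
             so an N-order chain needs only N numbers.
    mean:    average value of the symbols.
    """
    order = len(memory)
    if order == 0:
        return mean
    recent = history[-order:]
    prob = mean
    for r in range(1, len(recent) + 1):
        prob += memory[r - 1] * (recent[-r] - mean)
    # Clamp so the result stays a valid probability.
    return min(1.0, max(0.0, prob))

# Order-2 chain: last symbol 1, the one before it 0.
print(additive_next_prob([1, 0, 1], memory=[0.2, 0.1], mean=0.5))
```

A general N-order binary chain needs one conditional probability per possible history, that is 2^N values, while this additive form needs N weights, which is the reduction in combinatorial growth the entry refers to.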
Details
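Motivation: LLMs operate in extremely high-dimensional state spaces, where token embeddings and hidden representations have complex dependencies that are hard to capture with classical Markov structures.
Method: Models LLM dynamics with N-order additive Markov chains, decomposing the conditional probability of the next token into a superposition of contributions from multiple historical depths, and establishes an equivalence with Markov chains that have a step-wise memory function.
Result: Shows a strict correspondence between additive multi-step Markov chains and chains with a step-wise memory function, and on that basis extends the concept of "information temperature" to the additive N-order case.
Conclusion: Additive Markov chains offer a theoretically feasible, interpretable low-dimensional approximation framework for understanding the internal dynamics of LLMs, and support a natural extension of information-theoretic concepts such as information temperature.
Abstract: Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.
[8] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries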
Natalie Perez,Sreyoshi Bhaduri,Aman Chadha
Main category: cs.CL
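TL;DR: This paper proposes an interdisciplinary framework combining semiotics, hermeneutics, and qualitative research methods for evaluating meaning in language generated by large language models (LLMs), and introduces the qualitative metric ICR. It finds that LLMs still fall short of humans in semantic accuracy, especially in capturing contextually grounded meaning.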
Details
Motivation: Meaning in human language is relational, context-dependent, and emergent, whereas current computational models (such as word vectors and embedding models) offer only statistical approximations and struggle to model human interpretive meaning; an evaluation paradigm better matched to the nature of meaning is needed.
Method: Integrates semiotic and hermeneutic theory to build a qualitative evaluation method grounded in inductive content analysis and reflexive thematic analysis, the Inductive Conceptual Rating (ICR); empirically compares LLM-generated and human-generated thematic summaries on five datasets (N = 50 to 800).
Result: LLM outputs show high lexical similarity but are systematically below humans in semantic accuracy, especially for contextually grounded meaning; performance improves with dataset size but varies considerably across models, possibly reflecting differences in the frequency and coherence of recurring concepts and meanings.
Conclusion: Evaluation frameworks should incorporate systematic qualitative interpretive practices to measure more faithfully how well LLM-generated text aligns in meaning with reference texts.
Abstract: Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM-generated and human-generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.
[9] Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks
Mahmoud Abusaqer,Jamil Saquer
Main category: cs.CL
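TL;DR: This paper proposes RoBERTa-OTA, which introduces ontology-guided attention and enhanced graph convolutional networks to fuse RoBERTa language representations with structured domain knowledge. It significantly improves multiclass hate speech detection across demographic categories, especially for hard categories such as gender, with minimal parameter overhead.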
Details
Motivation: Existing methods learn representations from training data alone and cannot explicitly incorporate structured ontological knowledge, which makes it hard to handle the implicit targeting strategies and linguistic variability of social media in multiclass hate speech detection.
Method: Proposes RoBERTa-OTA, which combines RoBERTa embeddings, scaled attention layers, and enhanced Graph Convolutional Networks (GCNs) to jointly model textual features and structured ontological knowledge.
Result: In 5-fold cross-validation on 39,747 balanced samples, accuracy reaches 96.04%, above standard RoBERTa (95.02%); the gender category improves by 2.36 percentage points and other categories by 2.38 percentage points; parameter overhead is only 0.33%.
Conclusion: RoBERTa-OTA effectively fuses language understanding with domain semantic knowledge, significantly improving fine-grained hate speech classification while remaining computationally efficient, and is suited to large-scale content moderation.
Abstract: Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.
[10] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning
Ruobing Zheng,Tianqi Li,Jianing Li,Qingpei Guo,Yi Yuan,Jingdong Chen
Main category: cs.CL
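TL;DR: This paper proposes the Dual Tuning framework, which jointly fine-tunes on Chain-of-Thought (CoT) and Direct-Answer (DA) data to quantify reasoning gains, and defines a "Thinking Boundary" for deciding when enabling reasoning is more effective on multimodal tasks, challenging the "reasoning-for-all" paradigm.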
Details
Motivation: The effectiveness of reasoning-enhanced large models in general multimodal scenarios remains unclear; the prevailing practice of releasing parallel "Instruct" and "Thinking" models is resource-intensive and offers no criterion for judging whether reasoning is actually beneficial.
Method: Proposes the Dual Tuning framework: jointly fine-tunes on paired CoT and DA data under controlled prompts, designs new metrics to quantify the gains of the two training modes, and constructs the "Thinking Boundary"; further examines how reinforcement training and thinking patterns affect reasoning suitability, and tests whether the boundary can guide data refinement.
Result: Establishes a "Thinking Boundary" applicable to multimodal tasks including spatial, mathematical, and multi-disciplinary domains; confirms that it can guide data selection and refinement; finds that reasoning is not universally beneficial and that its effectiveness depends strongly on task type and data characteristics.
Conclusion: The "reasoning-for-all" assumption should be abandoned; data and training strategies should be chosen according to the "Thinking Boundary", promoting resource-efficient, adaptive auto-think systems.
Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
[11] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction
Rabab Alkhalifa
Main category: cs.CL
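TL;DR: This paper proposes a reliability-aware weak supervision framework for framing detection in Arabic social media. A multi-agent LLM pipeline produces instance-level reliability estimates, and QUBO-based subset selection improves data quality and balance; experiments show that the selected subsets are more reliable and carry transferable structure without degrading strong baselines.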
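A toy sketch of what reliability-guided QUBO subset selection can look like: the diagonal of the matrix rewards reliable instances, and the off-diagonal entries penalize selecting redundant pairs. The objective, the weight, and the brute-force solver are illustrative assumptions, not the paper's formulation; the frame-balance term mentioned in the entry is omitted for brevity.

```python
from itertools import product

def build_selection_qubo(reliability, similarity, redundancy_weight):
    """Upper-triangular QUBO matrix (hypothetical objective).

    Q[i][i] = -reliability[i] rewards picking reliable items;
    Q[i][j] (i < j) penalizes picking two similar items together.
    """
    n = len(reliability)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -reliability[i]
        for j in range(i + 1, n):
            Q[i][j] = redundancy_weight * similarity[i][j]
    return Q

def solve_qubo_bruteforce(Q):
    """Exhaustively minimize x^T Q x over binary x (tiny problems only)."""
    n = len(Q)
    best_x, best_energy = None, float("inf")
    for x in product((0, 1), repeat=n):
        energy = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if energy < best_energy:
            best_x, best_energy = x, energy
    return best_x, best_energy

# Items 0 and 1 are near-duplicates; item 2 is less reliable but distinct.
reliability = [0.9, 0.8, 0.3]
similarity = [[0, 1.0, 0], [0, 0, 0], [0, 0, 0]]
Q = build_selection_qubo(reliability, similarity, redundancy_weight=1.0)
print(solve_qubo_bruteforce(Q))
```

In this toy case the minimizer keeps items 0 and 2: adding the near-duplicate item 1 costs more in redundancy than it gains in reliability. Real instances would need a proper QUBO solver or annealer rather than enumeration.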
Details
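Motivation: Framing detection in Arabic social media faces interpretive ambiguity, cultural dependence, and sparse annotation; existing LLM-based weak supervision methods are not robust when annotations are few and socially dependent.
Method: Designs a small multi-agent LLM pipeline (two "framers", a "critic", and a "discriminator") that treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates; then uses QUBO optimization for subset selection, reducing redundancy while keeping frame classes balanced.
Result: Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer experiment show that the selected subsets are more reliable and contain non-random, transferable semantic structure, without weakening strong text-only baselines.
Conclusion: Reliability-aware data curation suits framing detection better than simple label aggregation, and is particularly effective at improving weak supervision quality in low-resource, highly ambiguous Arabic social media settings.
Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
[12] Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge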
Fiona Lau
Main category: cs.CL
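TL;DR: This study systematically evaluates the numerical scoring stability of five mainstream large language models (GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, Claude-Sonnet-4.5) used as automated graders (LLM-as-a-judge). Substantial variability remains even at temperature=0, especially on the "completeness" dimension; scoring style (strictness and interpretive tendency) differs clearly across models; lowering the temperature improves stability for some models (such as GPT-4o and Gemini) but has limited effect on Anthropic models. The results caution enterprises to strengthen monitoring and hybrid human-LLM evaluation in critical workflows such as routing and quality control.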
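Within-model stability of the kind described here can be summarized by the spread of repeated scores for one identical input. The sketch below uses the population standard deviation and the score range; the choice of these two statistics and the function name are assumptions, since the entry does not state which statistics the study reports.

```python
from statistics import mean, pstdev

def score_stability(repeated_scores):
    """Summarize repeated scores one judge gave to one identical input."""
    return {
        "mean": mean(repeated_scores),
        "std": pstdev(repeated_scores),      # 0.0 means perfectly stable
        "range": max(repeated_scores) - min(repeated_scores),
    }

# Toy scores on a 1-5 scale, not taken from the study.
print(score_stability([4, 4, 4, 4]))
print(score_stability([3, 5, 4, 2]))
```

Comparing the "mean" entries of two judges on the same input gives a rough view of cross-model disagreement, while the "std" entry captures run-to-run instability for a single judge.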
Details
Motivation: Although LLM-as-a-judge is widely used in research and enterprise settings, the consistency (stability) of its numerical scores, that is, whether repeated scoring of the same input is stable and whether different models give comparable scores, has not been studied systematically, yet it matters for fairness, reproducibility, and reliability in production.
Method: On question-answer pairs from a real enterprise RAG system, runs repeated scoring experiments with five mainstream LLMs at two temperature settings; quantifies within-model variance across repeated runs (stability), cross-model differences in score distributions (consistency), and the moderating effect of temperature on stability, with a focus on dimensions such as completeness and relevance.
Result: All models show substantial score variability even at temperature=0, with "completeness" the least stable dimension; cross-model scores show systematic biases (e.g., different strictness and interpretive orientation), so the same answer can receive very different scores; lowering the temperature improves stability for GPT-4o and Gemini but has weak or inconsistent effects on the Claude family.
Conclusion: The scoring stability of LLM-as-a-judge cannot be assumed by default; it depends jointly on model architecture, model family, and temperature. Enterprise applications need continuous monitoring, robust parsing, and hybrid human-LLM evaluation to ensure reliability and fairness in critical decision workflows.
Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control. Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
[13] Context-Dependent Affordance Computation in Vision-Language Models
Murad Farzulla
Main category: cs.CL
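TL;DR: Through a large-scale computational study, this paper finds that vision-language models (VLMs) depend heavily on context when inferring scene affordances: 90% of the output at the lexical level and 58.5% at the semantic level changes significantly with context. It identifies two stable latent factors (a "Culinary Manifold" and an "Access Axis") and proposes a new paradigm for robotics, "just-in-time ontological projection" (JIT Ontology).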
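The lexical drift figure can be read as one minus a Jaccard similarity between the vocabularies of two descriptions of the same scene. The sketch below assumes simple lowercase whitespace tokenization, which may differ from the tokenization used in the paper; the example sentences are invented.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the word sets of two descriptions."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty descriptions are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

def context_dependence(similarity):
    """Share of the description that changes across contexts."""
    return 1.0 - similarity

# Same scene described under two hypothetical personas.
sim = jaccard_similarity("cut the bread", "reach the shelf")
print(sim, context_dependence(sim))
```

Applying the same subtraction to the reported means is consistent with the entry: a mean Jaccard similarity of 0.095 leaves about 90% context-dependent at the lexical level, and a mean cosine similarity of 0.415 leaves 58.5% at the semantic level.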
Details
Motivation: To understand how VLMs infer scene affordances in different contexts, to reveal their inherent semantic flexibility and context sensitivity, and to provide modeling insights for embodied intelligence and robotics.
Method: Builds 3,213 scene-context pairs from COCO-2017 and applies systematic context priming with 7 agentic personas to Qwen-VL 30B and LLaVA-1.5-13B; assesses affordance drift and its structure using Jaccard similarity, sentence-level cosine similarity, stochastic baseline experiments (multiple temperatures and seeds), and Tucker decomposition with bootstrap stability analysis.
Result: Finds substantial affordance drift: mean Jaccard similarity at the lexical level is only 0.095 (>90% context-dependent), and mean cosine similarity at the semantic level is 0.415 (58.5% context-dependent); stochastic baselines confirm that the drift stems from genuine context effects rather than generation noise; Tucker decomposition identifies stable orthogonal latent factors, the "Culinary Manifold" and the "Access Axis".
Conclusion: Affordance computation in VLMs is inherently strongly context-dependent; lexical variation exceeds semantic variation, suggesting flexible surface expression over relatively robust underlying meaning; modeling should move toward dynamic, query-driven "just-in-time ontological projection" rather than static world modeling; no claim is made about internal processing order or architectural primacy.
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
[14] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan,Xiaoman Zhang,Sung Eun Kim,Pranav Rajpurkar
Main category: cs.CL
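TL;DR: This paper studies multi-agent large language model (LLM) systems for clinical diagnosis, focusing on the performance difference between single-vendor and mixed-vendor multi-agent frameworks. The results show that mixed-vendor configurations effectively combine the inductive biases of different models, significantly improving diagnostic accuracy and recall and showing greater robustness.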
Details
Motivation: Existing multi-agent systems for clinical diagnosis mostly rely on models from a single vendor, which are prone to correlated failure modes and shared biases and lack complementary error correction.
Method: Builds and compares three Multi-Agent Conversation (MAC) frameworks: Single-LLM, Single-Vendor, and Mixed-Vendor. Uses o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, models from three different vendors, as doctor agents and evaluates them on RareBench and DiagnosisArena; uses overlap analysis to examine the mechanism behind the gains.
Result: The Mixed-Vendor configuration achieves state-of-the-art recall and accuracy on both RareBench and DiagnosisArena; overlap analysis shows that its advantage comes from the complementarity of different models' inductive biases, surfacing correct diagnoses missed by single models or homogeneous teams.
Conclusion: Vendor diversity is a key design principle for building robust multi-agent clinical diagnostic systems.
Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
[15] Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation
Gürsel Akdeniz,Emin Cagatay Nakilcioglu
Main category: cs.CL
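TL;DR: This paper proposes a compliance-aware Self-Instruct method conforming to the IMO SMCP for generating a high-quality, realistic dataset of maritime VHF radio dialogues, and uses a 26-filter verification pipeline together with LoRA fine-tuning to improve generation quality and deployment efficiency.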
Details
Motivation: VHF radio communication in maritime operations carries a serious risk of miscommunication, driven mainly by human factors, noise, interference, language differences, and the lack of real-time transcription; meanwhile, high-quality real maritime data is extremely scarce because of operational, regulatory, and privacy constraints.
Method: Proposes a compliance-aware Self-Instruct method that integrates a 26-filter verification pipeline (enforcing entity accuracy, absence of hallucination, SMCP compliance, logical consistency, and linguistic diversity) and uses LoRA for parameter-efficient fine-tuning; builds a new evaluation framework combining automated and expert assessment (Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence).
Result: Experiments on public vessel, coastal, and AIS data show that the generated dialogues are synthetically diverse, procedurally compliant, and operationally realistic; code, datasets, and verification tools are released to support reproducible research.
Conclusion: The method provides a high-quality synthetic data foundation for AI-assisted maritime safety, and its framework can be extended to other safety-critical domains.
Abstract: VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues. Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.
[16] What Is Missing: Interpretable Ratings for Large Language Model Outputs
Nicholas Stranges,Yimin Yang
Main category: cs.CL
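TL;DR: This paper proposes What Is Missing (WIM), a preference rating system driven by natural-language feedback. An embedding model computes the cosine similarity between a model output and feedback describing what that output is missing, yielding finer-grained preference signals with fewer ties and improving the data quality and interpretability of existing preference learning methods.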
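The rating step summarized above can be sketched in a few lines. This is a minimal illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for the sentence embedding model (the entry does not name one), and the sketch returns the raw cosine score without assuming how scores are ordered into preferences.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in embedding: word counts, NOT a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def wim_rating(output: str, missing_feedback: str, embed=toy_embed) -> float:
    """Scalar rating: similarity between the output and the judge's
    'what is missing' feedback text."""
    return cosine(embed(output), embed(missing_feedback))


if __name__ == "__main__":
    out = "The engine converts heat into work."
    fb = "The answer is missing the efficiency formula for the engine."
    print(round(wim_rating(out, fb), 3))
```

In practice `embed` would be swapped for a real sentence encoder; the feedback text stays attached to each score, which is what makes the rating inspectable.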
Details
Motivation: Existing LLM preference learning relies on subjective direct rankings or single numerical ratings, which do not accurately reflect the true quality of natural-language outputs. Method: Introduces the WIM rating system: a human or LLM judge writes natural-language feedback describing what the model output is missing; a sentence embedding model encodes the output and the feedback separately, and their cosine similarity is used as a scalar rating; the rating plugs into existing preference learning pipelines. Result: Experiments show that, compared with discrete numerical ratings, WIM yields fewer ties and larger rating deltas, strengthening the learning signal in pairwise preference data; it also offers limited but practical interpretability, since each rating can be traced back to the specific missing-information feedback that produced it. Conclusion: WIM is a lightweight, plug-and-play, broadly compatible approach to preference labeling that improves the data efficiency and debuggability of preference learning without changing the underlying learning algorithm. Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use "interpretable" in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
[17] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science
Zonglin Yang,Runze Mao,Tianhao Wu,Han Li,QingGuo Zhou,Zhi X. Chen
Main category: cs.CL
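TL;DR: This paper presents the first end-to-end framework for building large language models for combustion science, comprising a large-scale multimodal knowledge base, a dedicated benchmark (CombustionQA), and a three-stage knowledge-injection pathway. Naive RAG hits a performance ceiling (60%) caused by context contamination; knowledge graphs and continued pretraining are needed to move past it.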
Details
Motivation: To advance the specialization of foundation LLMs for combustion science and to fill the field's lack of high-quality AI-ready knowledge resources and dedicated evaluation benchmarks. Method: Builds a 3.5B-token multimodal knowledge base (over 200,000 papers, over 8,000 theses and dissertations, and about 400,000 lines of CFD code); designs the CombustionQA benchmark (436 questions across 8 subfields); proposes a three-stage knowledge-injection pathway of (1) lightweight RAG, (2) knowledge-graph-enhanced retrieval, and (3) continued pretraining, and quantitatively validates Stage 1. Result: Stage 1 (naive RAG) reaches 60% accuracy, well above zero-shot (23%) but far below the theoretical upper bound (87%); its performance is severely constrained by context contamination; Stages 2 and 3 (knowledge graphs and continued pretraining) are therefore argued to be essential for building a domain foundation model. Conclusion: RAG alone is not enough to support a combustion foundation model; structured knowledge graphs and continued pretraining must be combined in a multi-stage knowledge-injection approach. Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).
[18] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models
Wai Tuck Wong,Jun Sun,Arunesh Sinha
Main category: cs.CL
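TL;DR: This paper proposes a new attack that optimizes a loss designed to maximize numerical instability at inference time, producing adversarial images that significantly degrade the performance of multimodal large language models (MLLMs).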
Details
Motivation: As multimodal large language models (MLLMs) are widely deployed, studying how they fail has become critical; existing adversarial perturbations do not cover some failure modes, and this paper aims to find and validate a new mechanism that degrades performance indirectly. Method: Designs a loss function whose objective is to maximize numerical instability in the model's inference stage and uses it as the optimization target for generating adversarial images; evaluates on several state-of-the-art multimodal models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) and standard datasets (Flickr30k, MMVet, and others). Result: Very small changes to the input image cause significant performance degradation across models and datasets, and this failure mode differs from conventional adversarial perturbations. Conclusion: Reveals a new MLLM failure path not covered by existing adversarial attacks, highlighting both the importance and the potential fragility of numerical stability in multimodal inference. Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision-language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.
[19] Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam
Michael Majurski,Cynthia Matuszek
Main category: cs.CL
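TL;DR: This paper studies how combining RAG (retrieval-augmented generation) with query rewriting reduces question ambiguity and thereby improves language model accuracy when background information is available. Experiments show that rewriting the question alone, without changing the answer, significantly improves accuracy, and that this gain cannot be obtained through inference-time prompt engineering alone.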
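The separation into a rewriting phase and an answering phase can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `LM` callable type, the prompt wording, and the choice to pass only the rewritten question to the answering model are hypothetical and are not taken from the paper's released code.

```python
from typing import Callable, Tuple

# Hypothetical interface: a language model is any function from prompt to text.
LM = Callable[[str], str]


def rewrite_then_answer(
    question: str, context: str, rewriter: LM, answerer: LM
) -> Tuple[str, str]:
    """Phase 1: disambiguate the question using answer-free context.
    Phase 2: answer the rewritten question in a separate call."""
    rewrite_prompt = (
        "Rewrite the question so it is unambiguous, using the background "
        "below. Do not answer it.\n\n"
        f"Background:\n{context}\n\nQuestion:\n{question}"
    )
    rewritten = rewriter(rewrite_prompt).strip()
    # Assumption: the answering model sees only the rewritten question.
    answer = answerer(rewritten).strip()
    return rewritten, answer


if __name__ == "__main__":
    # Stub models so the sketch runs without any external service.
    demo_rewriter = lambda prompt: "Which cycle does an ideal Carnot engine follow?"
    demo_answerer = lambda q: f"(stub answer to: {q})"
    print(rewrite_then_answer("Which cycle?", "Notes on Carnot engines.",
                              demo_rewriter, demo_answerer))
```

Keeping the two phases as separate calls mirrors the summary's point that the gain is tied to distinct rewriting and answering steps.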
Details
Motivation: How carefully and unambiguously a question is phrased strongly affects answer quality for both language models and people, yet the interaction between context information and question formulation remains under-studied. Method: Combines dynamic context construction based on retrieval-augmented generation (RAG) with question rewriting: without providing the answer, background information is used to rewrite the original question to reduce ambiguity, with rewriting and answering performed as two separate phases. Result: On Humanity's Last Exam, rewriting questions with gpt-oss-20b raises gpt-5-mini accuracy from 0.14 to 0.37; this gain cannot be reproduced through inference-time prompting strategies alone. Conclusion: Question rewriting is a key step for improving language model performance when background information is available and should be separated from answering as its own phase; adding context or tuning prompts is not a substitute for explicit rewriting. Abstract: How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using gpt-oss-20b to rewrite a subset of Humanity's Last Exam with answer-free grounding context improves gpt-5-mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
[20] From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models
Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen
Main category: cs.CL
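TL;DR: This paper proposes a unified definition of streaming large language models (streaming LLMs), builds a systematic taxonomy, and analyzes their methods, applications, and future research directions in depth.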
Details
Motivation: Standard LLMs struggle with dynamic, real-time scenarios; existing definitions of streaming LLMs are fragmented and conceptually confused, and a systematic taxonomy is lacking. Method: Establishes a unified definition based on data flow and dynamic interaction; builds a systematic taxonomy on that definition; analyzes each class of methods in depth; surveys real-world applications and identifies future research directions. Result: Clarifies the core meaning of streaming LLMs; proposes a systematic taxonomy; summarizes key technical approaches and typical applications; maintains a continuously updated repository of papers. Conclusion: Streaming LLMs are a key paradigm on the way to dynamic intelligence; a unified definition and systematic taxonomy give research and applications in this area a solid basis and a clear roadmap. Abstract: Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.
[21] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang,Xiang Cheng,Chenxiao Zhao,Guobin Shen,Junjie Yang,Xiaocheng Feng,Yuxuan Gu,Xing Yu,Bing Qin
Main category: cs.CL
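TL;DR: This paper proposes GOLF, a framework that uses group-level natural-language feedback (external critiques and intra-group attempts) to guide targeted exploration in reinforcement learning. High-quality refinements are injected into training as off-policy scaffolds, improving sample efficiency in sparse-reward settings by up to 2.2×.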
Details
Motivation: Existing reinforcement learning algorithms rely only on scalar rewards and cannot make full use of the rich natural-language feedback available during interaction, which leads to inefficient exploration. Method: GOLF aggregates two kinds of group-level language feedback, external critiques (which point out errors or propose targeted fixes) and intra-group attempts (which supply alternative partial ideas and diverse failure patterns), to produce high-quality refinements that are injected into training as off-policy scaffolds; generation and refinement are jointly optimized within a unified RL loop. Result: On both verifiable and non-verifiable benchmarks, GOLF achieves better performance and exploration efficiency, with a 2.2× improvement in sample efficiency over RL methods trained only on scalar rewards. Conclusion: Explicitly modeling and exploiting group-level language feedback effectively strengthens targeted exploration for LLMs in reinforcement learning, improving training efficiency and generalization. Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2× improvement in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
[22] Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models
Xin Chen,Saili Uday Gadgil,Jiarong Qiu
Main category: cs.CL
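TL;DR: This paper proposes a retrieval-augmented generation method that combines semantic alignment with evidence constraints, modeling the retrieval and generation stages jointly to improve factual consistency and verifiability.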
Details
Motivation: Existing retrieval-augmented generation methods suffer from semantic misalignment between retrieved results and generation objectives, and from insufficient use of evidence. Method: Models the relevance between queries and candidate evidence in a unified semantic space, and introduces an explicit evidence constraint mechanism that turns retrieved evidence into a core control factor in generation. Result: Achieves stable improvements across multiple generation quality metrics, strengthening the factual reliability and verifiability of generated content while preserving fluency. Conclusion: Coordinated modeling of semantic alignment and evidence constraints is both effective and necessary for improving retrieval-augmented generation. Abstract: Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
[23] iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu,Arnav Palkhiwala,Tanya Roosta,Chirag Shah
Main category: cs.CL
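TL;DR: This paper presents iAgentBench, a dynamic open-domain question answering (ODQA) benchmark for multi-source evidence integration. It evaluates cross-source sensemaking (integrating evidence, tracking causal links, resolving dependencies) rather than single-snippet extraction. Questions are derived from realistic user intent and come with traceable evidence and intermediate artifacts. Experiments show that retrieval improves accuracy but does not suffice to solve these questions, underscoring the need to evaluate evidence use.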
Details
Motivation: Most existing QA benchmarks can be answered from a single passage and cannot measure higher-level information needs such as cross-source integration (evidence fusion, causal reasoning, resolving dependencies across facets), even though real generative QA systems increasingly depend on combining evidence from multiple sources. Method: Builds the iAgentBench benchmark: seed topics are selected from real-world attention signals and combined with common user intent patterns to generate natural questions that require multiple sources to answer; each instance provides traceable source evidence and auditable intermediate artifacts (such as retrieval paths and synthesis steps), supporting contamination checks and failure attribution (retrieval vs. synthesis). Result: Experiments across multiple LLMs show that adding retrieval improves accuracy, but retrieval alone does not reliably answer iAgentBench questions; models show a clear bottleneck in evidence use, especially integration and reasoning. Conclusion: Evaluation should shift from whether a model can access evidence to whether it can use evidence effectively; iAgentBench offers a more realistic benchmark for measuring cross-source sensemaking in complex information-seeking behavior. Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
[24] Stan: An LLM-based thermodynamics course assistant
Eric M. Furst,Vasudevan Venkateshwaran
Main category: cs.CL
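TL;DR: This paper presents Stan, a system that uses locally deployed open-weight models (Whisper and Llama 3.1) to build a dual-role educational data pipeline: it gives students RAG question answering grounded in a textbook index, and it gives instructors structured teaching analyses (such as identifying points of confusion and cataloging analogies). Everything runs offline, ensuring privacy, predictable cost, and reproducibility.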
Details
Motivation: Existing research on AI in education focuses mostly on student-facing tools and overlooks the potential of the same infrastructure to support instructors; in addition, teaching data is often scattered and unstructured, with no long-term, searchable record that supports reflection. Method: Builds an end-to-end data pipeline on local hardware with open-weight models (Whisper large-v3, Llama 3.1 8B) that processes lecture recordings and a textbook index together; the student side uses retrieval-augmented generation (RAG) for grounded question answering; the instructor side uses structured analysis pipelines to extract lecture summaries, student questions, moments of confusion, and teaching analogies; the paper also analyzes and resolves failure modes of LLMs in structured extraction over long text (context truncation, bimodal output distributions, schema drift). Result: Stan was deployed in an undergraduate chemical engineering thermodynamics course, providing students with answers that cite specific pages and providing instructors with a searchable, semester-scale record of teaching; all components run fully offline with no cloud API dependence; the work demonstrates the feasibility of 7-8B-parameter models for structured tasks in education and presents mitigations for the key engineering challenges. Conclusion: A shared foundation of data and models can serve teaching and learning at the same time; local, open-weight, structured processing is a key path to trustworthy, sustainable, and reproducible AI systems for education; instructor-side AI support has distinct value and deserves further study. Abstract: Discussions of AI in education focus predominantly on student-facing tools (chatbots, tutors, and problem generators), while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material, providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7-8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
[25] Optimizing Language Models for Crosslingual Knowledge Consistency
Tianyu Liu,Jirui Qi,Mrinmaya Sachan,Ryan Cotterell,Raquel Fernández,Arianna Bisazza
Main category: cs.CL
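TL;DR: This paper proposes Direct Consistency Optimization (DCO), which uses reinforcement learning with a structured reward function to improve the crosslingual knowledge consistency of multilingual LLMs. It needs no explicit reward model and outperforms existing methods.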
Details
Motivation: In multilingual settings, large language models often show inconsistent knowledge, giving contradictory answers to the same question asked in different languages, which undermines reliability. Method: Proposes DCO, a DPO-inspired method that performs reinforcement learning with a structured reward function derived directly from the LLM itself, with no additional reward model. Result: DCO significantly improves crosslingual consistency across a range of LLMs and outperforms existing methods when training on samples in multiple languages; it also performs well in bilingual settings, out-of-domain generalization, and controllable alignment. Conclusion: DCO is a robust and efficient method for improving crosslingual knowledge consistency in multilingual LLMs; all code and benchmarks are released. Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
[26] Non-Zipfian Distribution of Stopwords and Subset Selection Models
Wentian Li,Oscar Fontanelli
Main category: cs.CL
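TL;DR: This paper proposes a stopword selection model based on word-frequency rank, using a Hill function to describe the probability that a word is selected as a stopword. It explains theoretically why the rank-frequency distribution of stopwords is better fitted by the Beta Rank Function (BRF), while that of non-stopwords is better fitted by a log-quadratic function.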
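The two Hill-type probabilities given in this entry's abstract, $1/(1+(r/r_{mid})^{\gamma})$ for being selected and $1/(1+(r_{mid}/r)^{\gamma})$ for not being selected, can be checked numerically. The sketch below is illustrative only; the values `r_mid = 100` and `gamma = 2` are hypothetical and not fitted to any corpus.

```python
def p_selected(rank: float, r_mid: float, gamma: float) -> float:
    """Decreasing Hill function: probability that the word at `rank`
    is selected as a stopword."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)


def p_not_selected(rank: float, r_mid: float, gamma: float) -> float:
    """Standard (increasing) Hill function: probability of not being selected."""
    return 1.0 / (1.0 + (r_mid / rank) ** gamma)


if __name__ == "__main__":
    r_mid, gamma = 100.0, 2.0  # hypothetical parameters for illustration
    for rank in (1, 10, 100, 1000):
        ps = p_selected(rank, r_mid, gamma)
        pn = p_not_selected(rank, r_mid, gamma)
        # The two probabilities are complementary at every rank.
        print(rank, round(ps, 4), round(pn, 4), round(ps + pn, 4))
```

Writing $x = (r/r_{mid})^{\gamma}$, the two expressions are $1/(1+x)$ and $x/(1+x)$, so they sum to 1, and at $r = r_{mid}$ each equals 1/2; `r_mid` is therefore the rank at which a word is equally likely to be in or out of the stopword list.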
Details
Motivation: Traditional stopword identification relies on hand-curated lists or statistical thresholds and does not model where stopwords sit structurally in the overall word-frequency distribution; moreover, stopwords and non-stopwords each deviate from Zipf's law in their rank-frequency distributions and fit different functions, which calls for a unified probabilistic explanation. Method: Proposes a Hill-type selection probability model in the word rank $r$: the probability of being selected as a stopword is $1/(1+(r/r_{mid})^{\gamma})$ and the probability of not being selected is $1/(1+(r_{mid}/r)^{\gamma})$; derives the consequences analytically under a Zipf's-law assumption, and estimates the selection probability empirically from an independent text collection. Result: The Hill probability model fits observed stopword selection well; the derivation shows that the model leads to a BRF distribution for stopwords and a quadratic relation between log-count and log-rank for non-stopwords. Conclusion: Stopwords are not removed at random but follow a selection mechanism in rank space that can be modeled; a Hill-function selection probability explains the difference between the rank-frequency distributions of stopwords and non-stopwords in a unified way. Abstract: Stopwords are words that are not very informative to the content or the meaning of a language text. Most stopwords are function words but can also be common verbs, adjectives and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank $r$, is a decreasing Hill function ($1/(1+(r/r_{mid})^{\gamma})$), whereas the probability of not being selected is the standard Hill function ($1/(1+(r_{mid}/r)^{\gamma})$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, and that it explains the quadratic fitting function for the non-stopwords.
[27] Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement
Brian Jing Hong Nge,Stefan Su,Thanh Thi Nguyen,Campbell Wilson,Alexandra Phelan,Naomi Pfitzner
Main category: cs.CL
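TL;DR: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers (such as Delta TF-IDF) with several transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) on multiple datasets, and analyzing the effect of SMOTE, weighted loss, POS tagging, and text augmentation. gpt-oss-20b performs best overall, while Delta TF-IDF with data augmentation reaches 98.2% accuracy on the Stormfront dataset. Implicit hate speech is harder to detect, and the benefit of enhancement depends on the combination of dataset, model, and technique.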
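As one concrete reading of "weighted loss determined by inverse class proportions", the following sketch computes per-class weights and a weighted negative log-likelihood. It is a hedged illustration: the function names, the toy labels, and the choice to normalize by the sum of weights are assumptions, not details reported in the paper.

```python
import math
from collections import Counter


def inverse_proportion_weights(labels):
    """Weight each class by the inverse of its share of the data (N / count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}


def weighted_nll(true_class_probs, labels, weights):
    """Weighted negative log-likelihood, normalized by the total weight
    (the normalization is an assumption made for this sketch)."""
    weighted = [weights[y] * -math.log(p) for p, y in zip(true_class_probs, labels)]
    return sum(weighted) / sum(weights[y] for y in labels)


if __name__ == "__main__":
    labels = ["hate", "neutral", "neutral", "neutral"]  # toy, imbalanced
    weights = inverse_proportion_weights(labels)
    # Model is less sure about the rare class, so it dominates the loss.
    probs = [0.5, 0.9, 0.9, 0.9]
    print(weights, round(weighted_nll(probs, labels, weights), 4))
```

With these weights, an error on the minority class costs as much as errors on all majority examples combined, which is the intended counterweight to class imbalance.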
Details
Motivation: Hate speech detection faces implicit expressions that are hard to recognize, class imbalance, and weak model generalization, so a systematic evaluation of how different data and feature enhancement methods suit different models is needed. Method: On multiple hate speech datasets, compares a traditional feature-engineering approach (Delta TF-IDF) with mainstream pretrained language models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b), and applies SMOTE oversampling, class-weighted loss, POS feature enhancement, and text data augmentation in ablation and combination experiments. Result: gpt-oss-20b performs best in most settings; Delta TF-IDF with data augmentation reaches 98.2% accuracy on the Stormfront dataset; detection of implicit hate speech is markedly worse than detection of explicit content; the effect of enhancement depends strongly on the interaction of dataset, model architecture, and technique. Conclusion: Hate speech detection cannot be improved by optimizing one aspect in isolation; data properties, model capability, and enhancement strategy must be considered together. The study provides empirical evidence and design guidance for more robust and context-aware detection systems. Abstract: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
[28] Detection of Illicit Content on Online Marketplaces using Large Language Models
Quoc Khoa Tran,Thanh Thi Nguyen,Campbell Wilson
Main category: cs.CL
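TL;DR: This study evaluates large language models (Llama 3.2 and Gemma 3) for multilingual illicit content detection. Llama 3.2 significantly outperforms traditional models (BERT, SVM, Naive Bayes) on a fine-grained, imbalanced 40-class task, but performs comparably on binary classification.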
Details
Motivation: Traditional content moderation (manual review, rule-based systems, conventional ML) struggles with the scale of online illicit activity, evolving obfuscation techniques, and multilingual semantic complexity. Method: Using the multilingual DUTA10K dataset, applies Parameter-Efficient Fine-Tuning (PEFT) and quantization to Llama 3.2 and Gemma 3, and benchmarks them systematically against BERT, SVM, and Naive Bayes. Result: In binary classification, Llama 3.2 performs comparably to traditional methods; in imbalanced 40-class classification, Llama 3.2 significantly surpasses all baseline models. Conclusion: LLMs, Llama 3.2 in particular, have a clear advantage in complex, fine-grained, multilingual illicit content identification and can give law enforcement, e-commerce platforms, and cybersecurity teams more effective, scalable, and adaptive moderation tools. Abstract: Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
[29] AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments
Kylie Zhang,Nimra Nadeem,Lucia Zheng,Dominik Stammbach,Peter Henderson
Main category: cs.CL
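TL;DR: This paper examines how well AI models can simulate justices' questioning in U.S. Supreme Court oral arguments. It proposes a two-layer evaluation framework measuring the realism and pedagogical usefulness of simulated questions, and finds that although AI-generated questions score well on realism and coverage of legal issues, they still show low diversity in question types and sycophancy.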
Details
Motivation: Attorneys prepare for oral argument through moot courts, yet there is currently no effective way to evaluate whether AI can generate high-quality questions in the style of a specific justice. Method: Builds a dataset from U.S. Supreme Court oral argument transcripts, designs a two-layer evaluation framework (realism and pedagogical usefulness), and develops and compares prompt-based and agentic simulators. Result: Human annotators judge the AI-generated questions to be fairly realistic, and recall of substantive legal issues is high; however, the questions show low diversity in question types and a tendency toward sycophancy, shortcomings that simple evaluation approaches fail to reveal. Conclusion: AI has initial practical value for simulating judicial questioning, but finer-grained evaluation frameworks and modeling improvements are needed to overcome current limitations, especially in diversity and critical questioning. Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
[30] IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen,Yilin Niu,Cunxiang Wang,Xiaoying Ling,Ying Zhang,Pei Ke,Hongning Wang,Minlie Huang
Main category: cs.CL
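TL;DR: This paper proposes IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction following. By constructing response preference graphs it supports listwise evaluation, which more accurately reflects how judge models actually perform in alignment optimization.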
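To make the preference-graph idea concrete, the sketch below builds a graph of all pairwise preferences from hypothetical gold quality scores and measures how many of those preferences a judge's ranking respects. The score format and the `pairwise_agreement` metric are assumptions chosen for illustration; they are not claimed to be the benchmark's own data schema or scoring rule.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def preference_graph(quality: Dict[str, int]) -> Set[Tuple[str, str]]:
    """Directed edges (better, worse) for every pair with different quality.
    `quality` maps response id to a hypothetical gold score."""
    edges = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def pairwise_agreement(edges: Set[Tuple[str, str]], judge_ranking: List[str]) -> float:
    """Fraction of gold preferences that the judge's best-to-worst list respects."""
    if not edges:
        return 1.0
    position = {resp: i for i, resp in enumerate(judge_ranking)}
    hits = sum(position[better] < position[worse] for better, worse in edges)
    return hits / len(edges)


if __name__ == "__main__":
    gold = {"r1": 3, "r2": 2, "r3": 1}
    graph = preference_graph(gold)
    print(pairwise_agreement(graph, ["r1", "r3", "r2"]))
```

A single ranked list from the judge is scored against every edge at once, which is what distinguishes listwise evaluation from checking one pair at a time.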
Details
Motivation: Existing benchmarks for instruction-following evaluation have insufficient data coverage and use only simple pairwise comparison, so they do not reflect how reliable judge models are in model alignment optimization. Method: Builds the IF-RewardBench benchmark, covering diverse instruction and constraint types; for each instruction, constructs a preference graph over multiple responses based on instruction-following quality, supporting listwise evaluation instead of conventional pairwise comparison. Result: Experiments reveal significant deficiencies in current mainstream judge models; IF-RewardBench correlates more strongly and positively with downstream task performance than existing benchmarks. Conclusion: IF-RewardBench provides a more reliable and more informative meta-evaluation framework for instruction following, helping to improve judge model quality and LLM alignment. Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
[31] Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Wei Han,Pan Zhou,Shuicheng Yan
Main category: cs.CL
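TL;DR: This paper proposes SharedLLM, a framework that uses multi-grained context compression and query-aware information acquisition to substantially extend the context length of large language models without added training cost, supporting very long inputs (>128K tokens) while improving inference speed and memory efficiency.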
Details
Motivation: The context window of existing LLMs is limited, and continual pre-training on long-context data is too expensive, so an efficient, low-cost way to extend context length is needed. Method: Proposes SharedLLM: two stacked short-context LLMs derived from the same underlying model, with the lower one acting as a compressor and the upper one as a decoder; uses a "self-injection" mechanism (information is passed only at the lowest layers) and a tree-based data structure for efficient, query-aware context encoding and retrieval. Result: Trained only on 8K-token sequences, the model generalizes to inputs of more than 128K tokens; on multiple long-context benchmarks it is better than or comparable to strong baselines; inference is 2× faster than streaming architectures and 3× faster than encoder-decoder architectures, with a substantially lower memory footprint. Conclusion: SharedLLM achieves efficient, high-performing long-context modeling at low training cost, offering a new approach to the LLM context bottleneck. Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).
[32] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
Yebo Wu,Feng Liu,Ziwei Xie,Zhiyuan Liu,Changwang Zhang,Jun Wang,Li Li
Main category: cs.CL
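TL;DR: This paper proposes the TSEmbed framework, which combines MoE with LoRA to resolve multi-task conflict and introduces Expert-Aware Negative Sampling (EANS) to sharpen the discriminative power of embeddings, reaching SOTA performance on multiple benchmarks and industrial datasets.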
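As a rough illustration of the expert-aware negative sampling idea summarized above, the sketch below ranks candidate negatives by how closely their expert-routing distribution matches the query's. The function names, the cosine-similarity scoring, and the toy routing vectors are all assumptions made for illustration; the paper's actual selection rule is not given in this digest.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_hard_negatives(query_routing, candidate_routings, k):
    """Return indices of the k candidates whose routing distribution is
    most similar to the query's (hypothetical selection rule)."""
    ranked = sorted(
        range(len(candidate_routings)),
        key=lambda i: cosine(query_routing, candidate_routings[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy routing distributions over three experts (made-up numbers).
query = [0.7, 0.2, 0.1]
candidates = [
    [0.1, 0.1, 0.8],  # routed mostly to a different expert
    [0.6, 0.3, 0.1],  # routed much like the query
    [0.3, 0.4, 0.3],
]
hard = pick_hard_negatives(query, candidates, k=2)
print(hard)
```

Candidates routed like the query are treated as the harder negatives here, on the assumption that similar routing signals similar semantics.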
Details
Motivation: Multimodal Large Language Models (MLLMs) have strong reasoning capabilities, but task conflict makes them hard to adapt into universal multimodal embedding models. Method: Proposes the TSEmbed framework: 1) combines Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives; 2) designs Expert-Aware Negative Sampling (EANS), which uses expert routing distributions as a proxy for semantic similarity and dynamically selects hard negatives that share expert activation patterns; 3) adopts a two-stage learning paradigm that first solidifies expert specialization and then optimizes embedding representations via EANS. Result: Achieves SOTA performance on the Massive Multimodal Embedding Benchmark (MMEB) and on real-world industrial production datasets. Conclusion: TSEmbed lays a foundation for task-level scaling in universal multimodal embeddings, mitigating task conflict and improving embedding quality and discriminative power. Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
[33] Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation
Edward Zhang
Main category: cs.CL
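TL;DR: This paper introduces the concept of the Attention Gravitational Field (AGF), optimizes the LLM architecture by decoupling positional encodings from semantic embeddings, surpasses existing methods in accuracy, and shows both theoretically and empirically that AGF is consistent with Newton's Law of Universal Gravitation.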
Details
Motivation: To explore the underlying principles of positional relationships and encodings in large language models, improving both the interpretability of the attention mechanism and model performance. Method: Introduces the Attention Gravitational Field (AGF) concept, decouples positional encodings from semantic embeddings, and carries out theoretical analysis and empirical validation. Result: The proposed method is more accurate than mainstream positional encoding methods, and AGF shows intrinsic consistency with learning and stability curves and with physical law (Newton's Law of Universal Gravitation). Conclusion: AGF offers a new perspective for understanding the attention mechanism and advances research on model optimization and interpretability. Abstract: This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.
[34] Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
Natchanon Pollertlam,Witchayut Kornsuwannawit
Main category: cs.CL
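TL;DR: This paper compares long-context LLMs with a fact-based memory system (built on the Mem0 framework) for persistent conversational AI in terms of accuracy and API cost. The former is more accurate on most memory benchmarks, while the latter is competitive on specific tasks (such as PersonaMemv2) and cheaper over long-running interactions.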
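The cost comparison summarized above can be sketched as a small break-even calculation. Everything numeric below is a made-up placeholder (per-token prices, the cache discount, the per-turn read cost), and the two cost functions are a simplified assumption, not the paper's cost model: long-context inference pays for the whole context every turn (at a cached rate after the first turn), while the memory system pays a one-time write cost and then a roughly fixed read cost per turn.

```python
def long_context_cost(turns, context_tokens, input_price, cached_discount):
    """Cumulative cost of re-sending the full context on every turn.
    The first turn pays full price; later turns pay the cached rate."""
    full_pass = context_tokens * input_price
    cached_pass = full_pass * cached_discount
    return full_pass + max(turns - 1, 0) * cached_pass

def memory_cost(turns, context_tokens, write_price, read_cost_per_turn):
    """Cumulative cost of a one-time write phase plus fixed per-turn reads."""
    return context_tokens * write_price + turns * read_cost_per_turn

def break_even_turn(context_tokens, input_price=1e-6, cached_discount=0.1,
                    write_price=2e-6, read_cost_per_turn=0.002,
                    max_turns=10_000):
    """First turn at which the memory system is cheaper, or None.
    All default prices are hypothetical placeholders."""
    for turn in range(1, max_turns + 1):
        mem = memory_cost(turn, context_tokens, write_price, read_cost_per_turn)
        ctx = long_context_cost(turn, context_tokens, input_price, cached_discount)
        if mem < ctx:
            return turn
    return None

print(break_even_turn(100_000))
print(break_even_turn(200_000))
```

With these placeholder prices the break-even lands at turn 14 for 100k tokens and turn 13 for 200k tokens. The exact values are artifacts of the invented prices and do not reproduce the paper's figure of roughly ten turns; the sketch only shows the qualitative pattern of an earlier break-even as context grows.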
Details
Motivation: Persistent conversational AI must choose between long-context LLMs and dedicated memory systems, but the accuracy and cost trade-off between the two is unclear and calls for an empirical comparison. Method: On three memory-centric benchmarks (LongMemEval, LoCoMo, and PersonaMemv2), compares the factual recall accuracy of long-context GPT-5-mini against the Mem0 fact-based memory system, and builds a fine-grained API cost model that incorporates prompt caching to analyze how the cost of each approach changes with the number of interaction turns and the context length. Result: The long-context LLM is more accurate on LongMemEval and LoCoMo; Mem0 performs comparably on PersonaMemv2; at a 100k-token context, Mem0 becomes cheaper after about 10 turns, and the break-even point arrives earlier as the context grows. Conclusion: The two architectures show a clear accuracy-cost trade-off: high-accuracy requirements favor long-context LLMs, while long-lived, high-interaction deployments are better served by a structured memory system; the paper provides an actionable selection criterion. Abstract: Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
[35] Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses
Michael Hardy
Main category: cs.CL
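TL;DR: Through a meta-analysis of 890 results from LLM short-answer scoring studies, this paper finds that the difficulty of a scoring task for human experts has no statistical association with LLM performance; decoder-only architectures average 0.37 QWK lower than encoders; tokenizer vocabulary size shows diminishing returns; and LLMs exhibit racial bias in high-stakes education contexts.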
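For readers unfamiliar with the effect size used above, Quadratic Weighted Kappa (QWK) can be computed from two lists of integer scores as in the minimal sketch below. This is the standard definition written out in plain Python, not code from the study; the function name and the toy score lists are illustrative assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """QWK between two lists of integer labels in range(num_labels).
    Requires num_labels >= 2."""
    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels))
              for j in range(num_labels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            # Quadratic disagreement weight, 0 on the diagonal.
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0.0:
        return 1.0  # both raters used a single identical label
    return 1.0 - weighted_observed / weighted_expected

human = [0, 1, 2, 2]
model = [0, 2, 2, 1]
print(quadratic_weighted_kappa(human, model, num_labels=3))
```

A gap of 0.37 on this scale is large: QWK is 1.0 for perfect agreement and 0.0 for chance-level agreement.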
Details
Motivation: Automated short-answer scoring lags behind other large language model (LLM) applications, and its actual performance and limitations need systematic evaluation, especially with respect to educational fairness and reliability. Method: Conducts a systematic review and mixed-effects meta-regression over 890 results from LLM short-answer scoring studies, using Quadratic Weighted Kappa (QWK) as the effect size, and runs additional experiments on wording sensitivity, tokenization effects, and bias elicitation. Result: The difficulty of scoring for human raters has no observed statistical effect on LLM performance; decoder-only architectures average 0.37 QWK lower than encoders; tokenizer vocabulary size shows diminishing returns; LLMs exhibit racial discrimination in high-stakes education contexts. Conclusion: Current LLM short-answer scoring systems have inherent statistical shortcomings (especially autoregressive models) that system design should actively anticipate; architecture choice, tokenization strategy, and fairness evaluation deserve attention rather than relying on scale alone. Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
[36] From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models
Ruiqi Zhang,Lingxiang Wang,Hainan Zhang,Zhiming Zheng,Yanyan Lan
Main category: cs.CL
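TL;DR: This paper proposes GDS, which detects the pre-training data of large language models by analyzing gradient deviation features of samples (update magnitude, update location, and neuron activation concentration), substantially improving cross-dataset generalization and interpretability.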
Details
Motivation: Pre-training data detection for large language models matters for copyright concerns and benchmark contamination; existing methods based on likelihood statistics or on heuristic signals before and after fine-tuning suffer from word-frequency bias or depend too strongly on the similarity of the fine-tuning data. Method: From an optimization perspective, observes that the transition of samples from unfamiliar to familiar during training shows up as systematic differences in gradient behavior; on this basis proposes GDS, which builds gradient profiles from the magnitude, location, and concentration of parameter updates in FFN and Attention modules and feeds them to a lightweight classifier for membership inference. Result: Experiments on five public datasets show that GDS achieves state-of-the-art performance with cross-dataset transferability significantly better than strong baselines; interpretability analyses reveal differences in gradient feature distributions, supporting practical and scalable data detection. Conclusion: Gradient Deviation Scores are an effective and robust signal for identifying pre-training data, and GDS offers a new approach to copyright compliance and data provenance. Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
[37] SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts
Minduli Lasandi,Nevidu Jayatilleke
Main category: cs.CL
TL;DR: SinhaLegal is a high-quality, domain-specific Sinhala legislative text corpus of ~2M words from 1,206 legal documents (Acts and Bills), curated with OCR, post-processing, and manual cleaning, and evaluated via linguistic and LLM perplexity analyses to support Sinhala legal NLP tasks.
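Two of the corpus measurements mentioned above, lexical diversity and perplexity, have simple textbook definitions that the sketch below writes out. The function names and toy inputs are assumptions for illustration; the paper's exact diversity measure and the way it obtained token log-probabilities from language models are not specified in this digest.

```python
import math

def type_token_ratio(tokens):
    """Lexical diversity as distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy inputs (made up): four tokens, each assigned probability 0.5.
tokens = ["a", "b", "a", "c"]
logprobs = [math.log(0.5)] * 4
print(type_token_ratio(tokens))
print(perplexity(logprobs))
```

Higher perplexity on the corpus means a model finds the legal text harder to predict, which is how such an analysis indicates domain specificity.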
Details
Motivation: To bridge the critical gap in Sinhala legal NLP research by providing a large-scale, high-quality, machine-readable corpus of legislative texts, which has been previously unavailable. Method: Systematic collection of Sinhala Acts (1981–2014) and Bills (2010–2014) from official sources; OCR extraction using Google Document AI; extensive post-processing and manual cleaning; creation of structured metadata; comprehensive evaluation including corpus statistics, lexical analysis, NER, topic modelling, and perplexity assessment with LLMs. Result: A robust, publicly available Sinhala legal corpus (SinhaLegal) with verified quality, rich metadata, and empirical validation showing its domain specificity and utility for NLP tasks. Conclusion: SinhaLegal serves as a foundational resource for advancing Sinhala legal NLP—enabling summarisation, information extraction, and domain-aware language modelling—and sets a benchmark for future legal corpus development in low-resource languages. Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. 
The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
[38] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang,Fei Tan,Xuanyu Yin,Jing Leng,Aimin Zhou
Main category: cs.CL
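TL;DR: This paper proposes HACHIMI, a framework for generating theory-aligned, distribution-controllable student personas (SPs) to support research on educational LLMs. It uses a multi-agent Propose-Validate-Revise workflow combined with neuro-symbolic validation and stratified sampling to produce a million-scale, high-quality student persona dataset, and intrinsic and external evaluations confirm its validity and reveal a fidelity gradient.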
Details
Motivation: Existing approaches to building student personas mostly rely on ad-hoc prompting or hand-crafted profiles, lacking grounding in educational theory and control over population distributions, which makes them a weak basis for reliable training and evaluation of educational LLMs. Method: Formalizes the Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) paradigm; designs the HACHIMI multi-agent framework, comprising theory-anchored educational schema modeling, a neuro-symbolic validator (enforcing developmental and psychological constraints), and stratified sampling with semantic deduplication; uses Qwen2.5-72B to generate HACHIMI-1M (1 million student personas for Grades 1-12). Result: In intrinsic evaluation, HACHIMI-1M reaches nearly 100% schema validity, accurate quota control, and high diversity; in external evaluation, student agents align strongly with real humans on math and curiosity/growth constructs but only moderately on classroom-climate and well-being constructs, revealing a "fidelity gradient". Conclusion: HACHIMI provides a theory-driven, distribution-controllable, reproducible large-scale synthetic student population as infrastructure for educational LLMs, supporting group-level benchmarking and social-science simulation. Abstract: Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
[39] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications
Yunfan Zhang,Yijie Bei,Jetashree Ravi,Pawel Garbacki
Main category: cs.CL
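TL;DR: This paper introduces FireBench, an LLM instruction-following benchmark for enterprise and API settings that covers six capability dimensions across applications such as information extraction, customer support, and coding agents, contains more than 2,400 samples, and is used to evaluate 11 models.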
Details
Motivation: Existing instruction-following evaluations mainly target the natural language generation needs of chat assistants and do not capture the strict requirements that enterprise applications place on output formats, content constraints, and procedures. Method: Builds the FireBench benchmark from real-world enterprise and API usage patterns, covering six core capability dimensions with more than 2,400 samples, and systematically evaluates 11 LLMs. Result: Reports key findings on the strengths and weaknesses of current LLMs' instruction-following behavior in enterprise scenarios; FireBench is open-sourced to support user evaluation, developer diagnosis, and community contributions. Conclusion: FireBench fills the gap in enterprise-grade LLM instruction-following evaluation, offering a practical benchmark and open platform for model selection, optimization, and deployment. Abstract: Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
[40] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models
Sean Lamont,Christian Walder,Paul Montague,Amir Dezfouli,Michael Norrish
Main category: cs.CL
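TL;DR: This paper proposes a training-free, low-cost intervention that increases generative diversity in Diffusion Language Models (DLMs) by sequentially repelling each intermediate sample from earlier samples in feature space during sampling, markedly improving performance on Pass@$k$ tasks such as code generation and mathematical reasoning.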
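To make the sequential-repulsion idea above concrete, the toy sketch below pushes each feature vector in a batch away from the vectors processed before it. The inverse-distance update rule, the `strength` and `eps` parameters, and the use of plain lists as features are assumptions for illustration only; the paper applies its intervention to intermediate diffusion samples, and its actual update is not described in this digest.

```python
def repel_sequentially(features, strength=0.01, eps=1e-8):
    """Process vectors in order; move each one away from every
    previously adjusted vector (hypothetical repulsion rule)."""
    adjusted = []
    for vec in features:
        new_vec = list(vec)
        for prev in adjusted:
            diff = [a - b for a, b in zip(vec, prev)]
            dist_sq = sum(d * d for d in diff)
            # Closer neighbours push harder; identical vectors have a
            # zero difference and therefore receive no push here.
            scale = strength / (dist_sq + eps)
            new_vec = [n + scale * d for n, d in zip(new_vec, diff)]
        adjusted.append(new_vec)
    return adjusted

batch = [[0.0, 0.0], [0.1, 0.0]]
spread = repel_sequentially(batch)
print(spread)
```

The first sample is left unchanged and later samples bear the adjustment, which mirrors the sequential ordering described in the summary.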
Details
Motivation: Traditional sampling approaches (including those for Diffusion Language Models) tend to repeat the same failure modes on complex reasoning tasks and waste compute, whereas diverse outputs are essential for covering the solution space. Method: During diffusion sampling, processes the intermediate samples in a batch sequentially so that each sample is explicitly repelled in feature space from the previous samples, suppressing redundancy; the method needs no retraining or beam search and adds negligible computational overhead. Result: Evaluated on the HumanEval and GSM8K benchmarks with the LLaDA-8B-Instruct model, the method significantly improves generative diversity and Pass@$k$ performance and is robust across temperature settings. Conclusion: The method is a plug-and-play, low-overhead sampling enhancement that applies broadly to current and future Diffusion Language Models, especially for tasks that benefit from diverse solution search. Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
[41] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina,Marios Dikaiakos,Alejandro Porcel,Tassos Stassopoulos
Main category: cs.CL
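TL;DR: This paper evaluates large language models (LLMs) on identifying the top three human values expressed in interviews under the Schwartz theory of values. LLMs approach human expert level on set-matching metrics but fall short on ranking accuracy and uncertainty modeling; Qwen performs best, ensemble methods give consistent gains, and models show systematic bias toward certain values (such as Security).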
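Two of the ingredients named above, a set-based overlap metric and a rank-aggregation ensemble, can be illustrated with the short sketch below. It shows Jaccard overlap and a basic Borda count over top-3 lists; the point scheme (3, 2, 1), the alphabetical tie-break, and the example rankings are assumptions for illustration and may differ from the paper's setup.

```python
from collections import defaultdict

def jaccard(predicted, gold):
    """Set overlap between two label lists, ignoring order."""
    a, b = set(predicted), set(gold)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def borda_top_k(rankings, k=3):
    """Aggregate ranked lists: an item at position p in a list of
    length n earns n - p points; ties break alphabetically."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return ordered[:k]

# Made-up top-3 value rankings from three hypothetical models.
model_rankings = [
    ["Security", "Benevolence", "Tradition"],
    ["Benevolence", "Security", "Achievement"],
    ["Benevolence", "Tradition", "Security"],
]
ensemble = borda_top_k(model_rankings)
expert = ["Benevolence", "Tradition", "Universalism"]
print(ensemble, jaccard(ensemble, expert))
```

Jaccard ignores order, so a model can score well on it while still ranking the values incorrectly, which is the gap the lower RBO scores expose.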
Details
Motivation: Large language models hold great promise for qualitative analysis, but their ability to produce nuanced, reliable interpretations under inherent task ambiguity is unclear and needs systematic evaluation. Method: Using the Schwartz framework of basic values, evaluates several LLMs on identifying the top three values in long-form, open-ended interview texts; takes expert annotations as the gold standard, measures performance with F1, Jaccard, and RBO, and analyzes how models differ from experts in value distributions and uncertainty patterns; also tests ensemble methods such as Majority Vote and Borda Count. Result: LLMs approach the human ceiling on set-level metrics (F1, Jaccard) but obtain lower RBO ranking scores; the average value distributions of most models are close to those of experts, yet their uncertainty structures differ markedly; Qwen comes closest to expert-level agreement; ensemble methods (especially Majority Vote and Borda Count) yield consistent gains; models commonly overemphasize values such as Security in a systematic way. Conclusion: LLMs are promising collaborators for human researchers in ambiguous qualitative value analysis, but their intrinsic value biases and weaknesses in modeling uncertainty call for cautious use and further study of model-induced value biases. Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
[42] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Alexios Spanakis,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou
Main category: cs.CL
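TL;DR: This paper presents a novel agentic large language model (LLM) pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. By decoupling semantic reasoning from structural localization, it introduces the DD-CoT marker extraction method and an "Anti-Echo Chamber" detection architecture, substantially improving performance and interpretability.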
Details
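Motivation: Traditional classifiers conflate semantic reasoning with structural localization, which leads to semantic ambiguity, character-level brittleness, and the "Reporter Trap" (objective reporting misjudged as conspiracy endorsement) in psycholinguistic marker extraction and conspiracy endorsement detection. Method: Adopts a decoupled design: 1) for marker extraction, proposes Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to reduce semantic ambiguity and character-level brittleness; 2) for conspiracy detection, builds an "Anti-Echo Chamber" architecture in which an adversarial Parallel Council and a Calibrated Judge decide together, avoiding the "Reporter Trap". Result: Reaches 0.24 Macro F1 on subtask S1 (a 100% improvement over the baseline), ranking 3rd on the development leaderboard, and 0.79 Macro F1 on subtask S2 (a 49% improvement). Conclusion: The approach establishes a versatile NLP paradigm that is both interpretable and psycholinguistically grounded, suited to modeling complex sociolinguistic phenomena. Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.
[43] AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis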
Stavros Gazetas,Giorgos Filandrianos,Maria Lymperaiou,Paraskevi Tzouveli,Athanasios Voulodimos,Giorgos Stamou
Main category: cs.CL
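TL;DR: This paper presents the AILS-NTUA system for SemEval-2026 Task 3 (DimABSA), which addresses the three subtasks in a unified way within a multilingual, multi-domain framework by combining fine-tuned encoders with LoRA instruction-tuned large language models, balancing parameter efficiency and performance.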
Details
Motivation: To jointly model the three complementary subtasks (DimASR, DimASTE, DimASQP) of multilingual, multi-domain fine-grained sentiment analysis (DimABSA) while balancing model efficiency and effectiveness. Method: Uses language-appropriate encoder backbones for fine-grained sentiment regression; applies language-specific instruction tuning of large language models (with LoRA) for structured triplet and quadruplet extraction; the overall design is unified yet task-adaptive and parameter-efficient. Result: The proposed models outperform the baselines in most evaluation settings, showing competitive and stable performance. Conclusion: The unified framework reduces training and inference cost while maintaining strong performance, supporting parameter-efficient multi-task adaptation for multilingual DimABSA. Abstract: In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
[44] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition
Mengze Hong,Yi Gu,Di Jiang,Hanlin Gu,Chen Jason Zhang,Lu Wang,Zhiyang Su
Main category: cs.CL
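TL;DR: This paper proposes "match-and-merge", a new federated learning paradigm for merging heterogeneous language models (n-gram and neural network LMs) in hybrid ASR systems, with two algorithms based on genetic operations (GMMA) and reinforcement learning (RMMA). Experiments show that RMMA beats the baselines on character error rate, generalization, and convergence speed.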
Details
Motivation: In decentralized federated training of ASR, acoustic models can be merged with established methods, but the language models used for N-best rescoring are hard to merge because n-gram and neural network models are heterogeneous, so a dedicated optimization method for heterogeneous LM merging is needed. Method: Proposes the "match-and-merge" paradigm: 1) GMMA, which uses genetic operations (selection, crossover, mutation) to evolve and pair local LMs; 2) RMMA, which models the matching decision process with reinforcement learning to speed up convergence. Result: Experiments on seven OpenSLR datasets show that RMMA achieves the lowest average Character Error Rate (CER), generalizes better than the baselines, and converges up to seven times faster than GMMA. Conclusion: The match-and-merge paradigm, and RMMA in particular, offers an efficient and practical language model merging approach for scalable, privacy-preserving ASR systems. Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
[45] LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen,Shuai Gong,Shiwen Zhang,Zheng Zhang,Yachao Zhao,Lingxiang Wang,Haibo Zhou,Yuan Zhan,Wei Lin,Hainan Zhang
Main category: cs.CL
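TL;DR: This paper proposes LocalSUG, an LLM-based query suggestion framework for local-life service platforms that uses city-aware candidate mining, an improved GRPO algorithm, and low-latency optimizations to raise click-through rate and reduce the no-result rate.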
Details
Motivation: Traditional multi-stage cascading systems struggle to serve long-tail demand, and large language models (LLMs) face three challenges in local-life services: missing geographic grounding, exposure bias in preference optimization, and online inference latency. Method: Proposes the LocalSUG framework: 1) a city-aware candidate mining strategy based on term co-occurrence to strengthen geographic grounding; 2) a beam-search-driven GRPO algorithm that reduces exposure bias, with a multi-objective reward mechanism that balances relevance and business metrics; 3) quality-aware beam acceleration and vocabulary pruning to reduce online latency. Result: Offline evaluation and large-scale online A/B testing show that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%. Conclusion: LocalSUG addresses the key deployment challenges of LLMs for local-life query suggestion, improving business outcomes and system efficiency while preserving generation quality. Abstract: In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
[46] Replaying pre-training data improves fine-tuning
Suhas Kotha,Percy Liang
Main category: cs.CL
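TL;DR: This paper finds that replaying generic data during domain fine-tuning (generic replay) not only prevents catastrophic forgetting but can also substantially improve performance on the target-domain task, with larger gains when target data is scarce.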
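As a concrete picture of what generic replay during fine-tuning means, the sketch below builds a fine-tuning stream in which a chosen fraction of examples is replayed generic data. The function name, the uniform sampling with replacement, and the 20% fraction are assumptions for illustration; the paper's actual mixing ratios and data schedules are not given in this digest.

```python
import random

def mix_with_replay(target, generic, replay_fraction, seed=0):
    """Return a shuffled stream in which roughly replay_fraction of
    the examples are sampled (with replacement) from generic data."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve n_generic / (n_target + n_generic) = replay_fraction.
    n_generic = round(len(target) * replay_fraction / (1.0 - replay_fraction))
    rng = random.Random(seed)
    replay = [rng.choice(generic) for _ in range(n_generic)]
    stream = list(target) + replay
    rng.shuffle(stream)
    return stream

target_examples = [f"math-{i}" for i in range(8)]
generic_examples = [f"web-{i}" for i in range(100)]
stream = mix_with_replay(target_examples, generic_examples, replay_fraction=0.2)
print(len(stream), sum(x.startswith("web-") for x in stream))
```

All target examples are kept and generic examples are added on top, so replay here lengthens the fine-tuning run rather than displacing target data.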
Details
Motivation: Under the current paradigm, generic data is mixed in during fine-tuning only to prevent catastrophic forgetting of the generic domain; the authors question this view and ask whether generic replay can also produce positive transfer to the target task. Method: In a controlled pre-training environment (4M target tokens, 4B total tokens, 150M-parameter models), systematically evaluates generic replay under different data schedules, covering fine-tuning and mid-training, and extends the study to practical tasks with 8B-parameter models. Result: Generic replay raises target data efficiency by up to 1.87x for fine-tuning and 2.06x for mid-training; with 8B models it improves agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%. Conclusion: Generic replay during domain fine-tuning is not only safe but beneficial, especially when target-domain data is limited, challenging the conventional view that generic data belongs only in pre-training. Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
[47] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali,Myeongho Jeon,Maria Brbic
Main category: cs.CL
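TL;DR: This paper proposes Confidence-Weighted Preference Optimization (CW-PO), a preference alignment method that weights samples by a weak language model's confidence; with only 20% of the human annotations it outperforms standard DPO trained on 100% of them.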
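The confidence-weighting idea above can be sketched by scaling a standard DPO loss per preference pair. The DPO term below follows the usual published form; how the weak LLM's confidence is computed, and whether the weights are normalized as done here, are assumptions for illustration rather than details confirmed by this digest.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

def confidence_weighted_loss(pairs, beta=0.1):
    """Weighted mean of per-pair DPO losses, where each weight is the
    weak annotator's confidence in the label (hypothetical field)."""
    total, weight_sum = 0.0, 0.0
    for pair in pairs:
        weight = pair["confidence"]
        total += weight * dpo_loss(pair["logp_chosen"], pair["logp_rejected"],
                                   pair["ref_chosen"], pair["ref_rejected"], beta)
        weight_sum += weight
    return total / weight_sum

# Made-up log-probabilities; the second pair has zero confidence.
pairs = [
    {"confidence": 1.0, "logp_chosen": 0.0, "logp_rejected": 0.0,
     "ref_chosen": 0.0, "ref_rejected": 0.0},
    {"confidence": 0.0, "logp_chosen": -5.0, "logp_rejected": -1.0,
     "ref_chosen": -2.0, "ref_rejected": -2.0},
]
loss = confidence_weighted_loss(pairs)
print(loss)
```

A pair with zero confidence contributes nothing, so hard selection of a confident subset is the special case where the weights are 0 or 1.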
Details
Motivation: Existing preference alignment methods depend on costly human annotations or large-scale API-based models; this paper explores whether a weak language model can serve as the annotator instead while improving efficiency and performance. Method: Proposes the Confidence-Weighted Preference Optimization (CW-PO) framework, which uses a weak language model to label preference samples, selects the high-confidence subset, and re-weights training samples by that confidence in preference optimization objectives such as DPO. Result: A model aligned by CW-PO with only 20% of the human annotations outperforms a standard DPO model trained on 100% of them; selecting a weak LLM's highly confident samples even outperforms using full human annotations. Conclusion: With confidence weighting, weak language models can sharply reduce the cost of preference alignment while outperforming methods trained on fully human-labeled data, offering a new paradigm for low-cost, high-performance alignment. Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
[48] MPCEval: A Benchmark for Multi-Party Conversation Generation
Minxing Zhang,Yi Yang,Zhuofan Jia,Xuan Yang,Jian Pei,Yuchen Zang,Xingwang Deng,Xianglong Chen
Main category: cs.CL
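TL;DR: This paper introduces MPCEval, an evaluation benchmark suite for multi-party conversation generation that decomposes generation quality into three dimensions (speaker modeling, content quality, and speaker-content consistency), distinguishes local next-turn prediction from global full-conversation generation, and provides reference-free, reproducible quantitative metrics.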
Details
Motivation: Multi-party conversation generation (for example smart reply and collaborative assistants) is increasingly important, but its evaluation remains a bottleneck; compared with two-party dialogue, multi-party settings pose distinct challenges such as complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple valid continuations. Method: Proposes the MPCEval evaluation framework, which separates generation quality into speaker modeling, content quality, and speaker-content consistency, and explicitly distinguishes local (next-turn) prediction from global (full-conversation) generation; designs new metrics that are scalable, reference-free, quantitative, and reproducible. Result: Applied to several public and real-world datasets, MPCEval reveals systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker-content consistency; single-score evaluation obscures fundamental differences in multi-party conversational behavior. Conclusion: Multi-party conversation evaluation needs to be task-aware and decomposed by dimension; MPCEval shows that evaluation objectives critically shape model assessment and argues for fine-grained, interpretable evaluation over a single score. Abstract: Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker--content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker--content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
[49] VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu,Ning Xu,Junming Yang,Hao Xu,Xin Geng
Main category: cs.CL
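TL;DR: This paper proposes VRM (Variational Reward Modeling), a framework that treats high-dimensional objective weights and low-dimensional semantic features as latent variables and infers them with variational inference, so as to model the human preference judgment process more faithfully and mitigate reward hacking. It has a theoretically tighter generalization error bound, and experiments show it outperforms existing methods.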
Details
Motivation: Traditional reward models only map prompt-response pairs to scalar scores, which tends to capture spurious correlations rather than authentic human preferences. Human evaluators, by contrast, first weigh multiple objectives according to the prompt context and then judge response quality through low-dimensional semantic features such as logical coherence.
Method: Proposes VRM (Variational Reward Modeling), a framework that explicitly models the human preference judgment process by treating high-dimensional objective weights and low-dimensional semantic features as latent variables inferred with variational inference; a theoretical analysis shows that it attains a tighter generalization error bound.
Result: Extensive experiments on multiple benchmark datasets show that VRM significantly outperforms existing methods at capturing authentic human preferences.
Conclusion: By modeling evaluation in a way closer to how humans judge, VRM effectively mitigates reward hacking, improves the reward model's ability to capture authentic preferences, and offers both theoretical guarantees and empirical gains.
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.
[50] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol,Nut Chukamphaeng,Kunat Pipatanakul,Pakhapoom Sarapat
Main category: cs.CL
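TL;DR: This paper introduces ThaiSafetyBench, the first LLM safety evaluation benchmark targeting the Thai language and Thai culture, comprising 1,954 malicious Thai prompts, and uses it to evaluate the safety of 24 LLMs. Closed-source models are generally safer than open-source ones, and culture-specific attacks have a higher success rate. The authors also release ThaiSafetyClassifier, a lightweight Thai harmful-response classifier, and a public leaderboard.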
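The comparison of attack success rates summarized above can be illustrated with a minimal sketch. It assumes ASR is the fraction of prompts in a category whose response a judge labeled harmful; the record shape, the category names, and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def attack_success_rate(judgments):
    """Return, per category, the fraction of prompts judged harmful.

    `judgments` is an iterable of (category, is_harmful) pairs. Both the
    record shape and the category names are assumptions for this sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, is_harmful in judgments:
        totals[category] += 1
        if is_harmful:
            successes[category] += 1
    return {category: successes[category] / totals[category] for category in totals}

# Toy judge outputs (made up for illustration only).
toy_judgments = [
    ("general", False), ("general", False), ("general", True), ("general", False),
    ("thai_specific", True), ("thai_specific", False),
    ("thai_specific", True), ("thai_specific", False),
]
asr = attack_success_rate(toy_judgments)
print(asr)
```

Reporting ASR per category rather than as one pooled number is what makes a gap between general and culture-specific attacks visible.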
Details
Motivation: Existing LLM safety evaluation focuses mainly on English and neglects risks in non-English languages and cultural settings; in particular, safety evaluation resources for the Thai language and Thai cultural context are lacking.
Method: Builds ThaiSafetyBench, a Thai safety benchmark of 1,954 malicious Thai prompts covering both general and Thai culture-specific attacks; evaluates 24 LLMs with GPT-4.1 and Gemini-2.5-Pro as judges; trains and open-sources ThaiSafetyClassifier, a DeBERTa-based Thai harmful-response classifier; and sets up a public, continuously updated ThaiSafetyBench leaderboard.
Result: Closed-source models are significantly safer than open-source models; the attack success rate (ASR) in Thai cultural contexts is clearly higher than for general Thai-language attacks; ThaiSafetyClassifier reaches a weighted F1 score of 84.4%, in line with GPT-4.1 judgments.
Conclusion: Current LLM safety alignment methods have clear weaknesses in non-English and especially culture-specific settings; ThaiSafetyBench and its accompanying tools provide key infrastructure for Thai AI safety research and advance multilingual, cross-cultural safety evaluation.
Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench GitHub: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
[51] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu,Guanting Chen,Bing Wei,Haoran Luo
Main category: cs.CL
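TL;DR: This paper proposes HiFlow, a framework that tackles complex constraints in long-form text generation through hierarchical feedback-driven optimization, balancing global structural consistency with local semantic coherence.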
Details
Motivation: Large language models still perform poorly on long-form text generation, especially under complex constraints, and existing methods struggle to coordinate global and local objectives.
Method: Proposes HiFlow, a hierarchical feedback-driven optimization framework with a planning layer (modeling global structure and constraints) and a generation layer (conditioned text generation), together with constraint-aware plan screening and closed-loop feedback at both levels.
Result: Experiments on multiple backbone models show that HiFlow significantly outperforms baseline methods.
Conclusion: HiFlow jointly optimizes planning quality and generation behavior, progressively guiding the model toward high-quality long-form text that satisfies the constraints.
Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.
[52] NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension
Rongzhi Li,Hitomi Yanaka
Main category: cs.CL
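TL;DR: This paper proposes NeuronMoE, which analyzes language-specific neurons across all Transformer components to guide per-layer expert allocation, cutting parameters by about 40% when extending to low-resource languages without degrading performance.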
Details
Motivation: Extending large language models to low-resource languages is essential for global accessibility, but training a separate model per language is prohibitively expensive; existing MoE methods allocate experts by layer-level similarity and overlook the fine-grained specialization of language processing at the level of individual neurons.
Method: Proposes NeuronMoE, which analyzes language-specific neurons across all Transformer components and uses empirically measured cross-lingual neuron diversity to guide expert allocation per layer.
Result: On Llama-3.2-3B with Greek, Turkish, and Hungarian, it reduces parameters by about 40% on average while matching the LayerMoE baseline; low-resource language experts independently develop neuron specialization patterns similar to those of the high-resource language, concentrated in early and late layers.
Conclusion: Low-resource language experts follow the same neuron specialization patterns as the high-resource language, suggesting that there may be universal architectural principles in how multilingual models organize linguistic knowledge.
Abstract: Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose NeuronMoE, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.
[53] MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad,Fajar Saleem,Ijaz Hussain
Main category: cs.CL
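TL;DR: This paper proposes MUTEX, a framework that combines a multilingual Transformer (XLM-RoBERTa) with a conditional random field (CRF) to perform, for the first time, fine-grained token-level toxic span detection on Urdu text. It reaches a 60% token-level F1 on multi-source social media data, establishing the first supervised baseline for the task.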
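Token-level F1, the metric quoted above, can be sketched as follows. This is a minimal illustration that assumes binary per-token labels (1 = toxic, 0 = not toxic); the digest does not say whether the paper averages per example or uses a BIO tagging scheme, so the function `token_f1` and the toy labels are assumptions.

```python
def token_f1(gold, pred):
    """F1 over tokens labeled toxic (1) versus non-toxic (0).

    `gold` and `pred` are equal-length lists of 0/1 labels, one per token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    false_pos = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    false_neg = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if true_pos == 0:
        return 0.0  # no overlap between predicted and gold toxic tokens
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 gold toxic tokens, 3 predicted, 2 in common.
gold_labels = [0, 1, 1, 0, 1]
pred_labels = [0, 1, 0, 1, 1]
score = token_f1(gold_labels, pred_labels)
print(round(score, 3))
```

Scoring at the token level gives partial credit for a span that is only partly recovered, which sentence-level classification accuracy cannot express.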
Details
Motivation: Most existing systems rely on sentence-level classification and cannot locate the specific toxic spans; progress is further limited by Urdu's lack of token-level annotated resources, its linguistic complexity, frequent code-switching, informal expressions, and rich morphological variation.
Method: Proposes MUTEX, which jointly models an XLM-RoBERTa multilingual Transformer with a CRF layer for sequence labeling; it uses a manually built Urdu token-level toxic span dataset and is trained and evaluated on multi-domain data from social platforms, news, and YouTube comments.
Result: MUTEX reaches 60% token-level F1, the first supervised baseline for Urdu toxic span detection; experiments show that Transformer models capture contextual toxicity implicitly and cope with code-switching and morphological variation better than other models.
Conclusion: MUTEX demonstrates the effectiveness and interpretability of combining a pretrained multilingual model with a CRF for fine-grained Urdu toxicity detection, offering a feasible path for fine-grained analysis of toxic content in low-resource languages.
Abstract: Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. The problem is further exacerbated by multiple factors, i.e., the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variations. In this research, we propose MUTEX: a multilingual transformer combined with conditional random fields (CRF) for Urdu toxic span detection, a framework that uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective at implicitly capturing contextual toxicity and are better able to address the issues of code-switching and morphological variation than other models.
[54] ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI
Jens Lehmann,Syeda Khushbakht,Nikoo Salehfard,Nur A Zarin Nishat,Dhananjay Bhandiwad,Andrei Aioanei,Sahar Vahdati
Main category: cs.CL
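TL;DR: This paper presents ARC-TGI, a framework for generating diverse ARC-AGI visual reasoning tasks that share a latent rule. It supports task-level constraints to ensure solvability, provides natural-language reasoning chains and executable code, and has open-sourced 461 generators.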
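To make the idea of a task-family generator with a task-level constraint concrete, here is a minimal sketch. It is not ARC-TGI's actual API: the function names, the grid sizes, the latent rule (mirroring each row), and the specific constraint (at least one training input must visibly change under the rule) are all assumptions chosen for illustration.

```python
import random

def sample_grid(rng, height, width, colors=(0, 1, 2, 3)):
    """Sample a small grid of color indices (hypothetical helper)."""
    return [[rng.choice(colors) for _ in range(width)] for _ in range(height)]

def transform(grid):
    """Latent rule assumed for this sketch: mirror every row left to right."""
    return [list(reversed(row)) for row in grid]

def generate_task(seed, n_train=3, max_tries=100):
    """Sample one task whose training pairs jointly expose the latent rule."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        inputs = [
            sample_grid(rng, rng.randint(2, 4), rng.randint(2, 4))
            for _ in range(n_train + 1)
        ]
        train_inputs = inputs[:n_train]
        # Task-level constraint: if every training input were symmetric, the
        # mirror rule would look like the identity and could not be inferred.
        if any(transform(grid) != grid for grid in train_inputs):
            pairs = [(grid, transform(grid)) for grid in inputs]
            return {"train": pairs[:n_train], "test": pairs[n_train]}
    raise RuntimeError("could not satisfy the task-level constraint")

task = generate_task(seed=0)
print(len(task["train"]), "training pairs; test input:", task["test"][0])
```

The constraint is checked over the whole training set rather than per example, which is the distinction the abstract draws against independent per-example sampling.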
Details
Motivation: Existing ARC-AGI datasets are static and hand-designed, which invites overfitting, data leakage, and memorization and makes it hard to accurately assess a model's abstraction and rule-induction abilities.
Method: Builds ARC-TGI, a Python-based framework of task-family generators. Each generator outputs visual tasks paired with natural-language input and transformation reasoning chains and partially evaluated Python code, and a task-level constraint mechanism ensures that the training examples collectively expose the latent rule; all generators are refined by humans and verified locally.
Result: Releases 461 task generators covering ARC-Mini, ARC-AGI-1, and ARC-AGI-2, supporting scalable sampling and controlled benchmarking.
Conclusion: ARC-TGI offers few-shot abstract reasoning research a more reliable, controllable, and scalable paradigm of dynamic task generation, substantially reducing the evaluation bias caused by static datasets.
Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
[55] Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen,Guangzhi Sun,Philip C Woodland
Main category: cs.CL
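TL;DR: This paper studies decoder redundancy in speech large language models (SpeechLLMs) and finds that it is largely inherited from the pretrained text LLM. Pruning experiments show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the redundancy structure is consistent across speech encoders, tasks, and languages, supporting a single unified multi-task SpeechLLM backbone.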
Details
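Motivation: To determine how much capacity of the LLM decoder, which holds over 90% of the parameters in a SpeechLLM, is actually needed for speech tasks, and to understand where its redundancy comes from and how it is structured.
Method: Across two LLM families and three scales (1B-8B), compares decoder-layer redundancy under text and speech inputs; systematically prunes decoder layers and analyzes post-pruning performance recovery (healing) to measure excess capacity; then extends the findings to speech translation and checks the consistency of redundant layers across encoders, tasks, and languages.
Result: 7-8B SpeechLLMs retain good ASR performance with only 60% of decoder layers; smaller models tolerate less pruning; the same decoder layers are redundant across speech encoders, ASR/ST tasks, and multiple languages, indicating a global redundancy structure.
Conclusion: Decoder redundancy in SpeechLLMs stems mainly from the pretrained text LLM rather than from speech-specific requirements; it is consistent across tasks, languages, and encoders, which supports developing lightweight, general-purpose, multi-task SpeechLLM backbones.
Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned and multi-task SpeechLLM backbone to be deployed.
[56] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting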
Yewen Li,Zhiyi Lyu,Peng Jiang,Qingpeng Cai,Fei Pan,Bo An,Peng Jiang
Main category: cs.CL
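TL;DR: This paper proposes a hierarchical Large autoBidding Model (LBM) that combines the reasoning ability of LLMs with numerical decision-making. Using a dual embedding mechanism and GQPO, an offline reinforcement fine-tuning method, it improves the interpretability, generalization, and decision quality of auto-bidding strategies.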
Details
Motivation: Existing auto-bidding methods rely on black-box training and suffer from poor interpretability, weak generalization, and counterintuitive behavior; applying large language models (LLMs) directly, however, runs into insufficient action precision and missing domain knowledge, which lead to hallucinations and suboptimal decisions.
Method: Proposes a hierarchical Large autoBidding Model (LBM): a high-level LBM-Think handles strategic reasoning and a low-level LBM-Act generates actions; a dual embedding mechanism fuses language and numerical inputs; and GQPO, an offline reinforcement fine-tuning algorithm, suppresses hallucinations and improves decision-making without simulation or real-world rollouts.
Result: Experiments show that a generative backbone based on LBM outperforms existing methods in training efficiency and generalization, and is especially robust in dynamic ad environments.
Conclusion: By fusing LLM reasoning with numerical control in a structured way, supported by targeted training mechanisms, LBM offers a new paradigm for trustworthy, efficient, and generalizable auto-bidding.
Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation.
Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LBM-Think's hallucinations and enhancing decision-making performance without simulation or real-world rollout like previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in an efficient training manner and generalization ability.
[57] Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions
Theresa Elstner,Martin Potthast
Main category: cs.CL
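TL;DR: This paper introduces the concept of Representation Fidelity for assessing whether the input representation behind an algorithmic decision about a person is reasonable. It is quantified as the distance between an externally prescribed representation and the person's self-description, and the authors build the first representation fidelity benchmark dataset, set in a loan-granting scenario.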
Details
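Motivation: Existing validation of algorithmic decisions mostly focuses on dimensions such as fairness and explainability, but lacks a measure of the more fundamental question of whether the representation of a person on which a decision rests truly reflects that person; this paper aims to fill that gap, arguing that sound decisions presuppose faithful representations.
Method: Defines representation fidelity as the distance between the externally prescribed input representation and the individual's self-description; analyzes the types of discrepancy between the two and proposes a generic typology of representation mismatches; builds the Loan-Granting Self-Representations Corpus 2025 (30,000 synthetic self-descriptions plus expert-annotated mismatch types) for benchmarking.
Result: Provides a theoretical framework and an operational definition of representation fidelity; establishes the first expert-annotated benchmark for evaluating it; and identifies common representation mismatch patterns (e.g., missing, distorted, or oversimplified information).
Conclusion: Representation fidelity is a necessary and independent new dimension of algorithmic ethics assessment; quantifying and benchmarking it opens a new path toward making algorithms' understanding of people more reasonable, and it should be incorporated into AI system evaluation standards.
Abstract: This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures if decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations, how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 consists of a large corpus of 30 000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.
[58] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers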
Ruichen Xu,Wenjing Yan,Ying-Jun Angela Zhang
Main category: cs.CL
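TL;DR: Through theoretical proofs and experimental validation, this paper reveals a unified mechanism for analogical reasoning in large language models: joint training on similarity and attribution premises aligns entities with similar properties in representation space, enabling property transfer.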
Details
Motivation: Existing evaluations conflate multiple reasoning types, making it hard to understand the reasoning mechanisms of large language models; the emergence of analogical reasoning therefore needs to be analyzed in isolation.
Method: Combines theoretical proofs with experiments: proves how joint training, sequential training, and two-hop reasoning relate to analogical reasoning, and validates the theory on Transformer architectures of up to 1.5B parameters.
Result: Joint training enables analogical reasoning; sequential training must follow a specific curriculum; two-hop reasoning reduces to analogical reasoning with explicit identity bridges; experiments confirm the effect of representational geometry on inductive reasoning ability.
Conclusion: Analogical reasoning in Transformers arises from the alignment of entity representations, and its capability is determined by the geometry of the representation space, offering a unified view of reasoning mechanisms in large models.
Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
[59] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal,Rauno Arike
Main category: cs.CL
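TL;DR: This paper introduces C2-Faith, a benchmark for evaluating how well large language models (LLMs), acting as judges of chain-of-thought (CoT) reasoning, assess process faithfulness along two dimensions: causality and coverage. Current frontier LLM judges perform inconsistently across tasks; they detect errors more easily than they localize them, and they tend to inflate coverage scores.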
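The two kinds of controlled perturbation behind such a benchmark (replacing one step with an acausal variant at a known position, and deleting steps at a chosen rate) can be sketched as below. This is an illustrative assumption about how such perturbations might be coded, not the authors' pipeline; the function names, the rounding of the deletion count, and the toy reasoning chain are hypothetical.

```python
import random

def replace_one_step(steps, acausal_step, rng):
    """Swap a single step for an acausal variant and record where it went."""
    position = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[position] = acausal_step
    return perturbed, position

def delete_steps(steps, deletion_rate, rng):
    """Drop round(rate * len(steps)) randomly chosen steps, preserving order."""
    n_delete = round(deletion_rate * len(steps))
    dropped = sorted(rng.sample(range(len(steps)), n_delete))
    kept = [step for index, step in enumerate(steps) if index not in dropped]
    return kept, dropped

rng = random.Random(0)
chain = ["x + 2 = 5", "x = 5 - 2", "x = 3", "so 2x = 6"]
perturbed, error_position = replace_one_step(chain, "x = 7", rng)
shortened, dropped = delete_steps(chain, 0.5, rng)
print(error_position, perturbed)
print(dropped, shortened)
```

Because the error position and the dropped indices are returned alongside the perturbed chain, a judge's localization and coverage outputs can be scored against known ground truth.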
Details
Motivation: LLMs are often used as judges of CoT reasoning, but it is unclear whether they can reliably tell if a reasoning process is faithful (i.e., logically coherent and complete in its steps) rather than relying only on whether the answer is plausible.
Method: Builds the C2-Faith benchmark from PRM800K, defining two dimensions: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences missing?). Controlled perturbations produce causal-error examples with known error positions and coverage-deficient examples at varying deletion rates; three frontier judge models are evaluated on binary causal detection, causal step localization, and coverage scoring.
Result: Model rankings depend strongly on task framing, and no single model dominates; all models show a substantial gap between detecting an error and localizing it; coverage scores are systematically inflated for incomplete reasoning.
Conclusion: The ability of LLMs as process-level judges is task-sensitive, and their reliability must be assessed against the specific evaluation goal; the study provides empirical evidence and practical guidance for choosing judge models in CoT process evaluation.
Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
[60] Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Di Zhang,Xun Wu,Shaohan Huang,Yudong Wang,Hanyong Shao,Yingbo Hao,Zewen Chi,Li Dong,Ting Song,Yan Xia,Zhifang Sui,Furu Wei
Main category: cs.CL
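TL;DR: This paper proposes Sparse-BitNet, a framework that for the first time jointly applies 1.58-bit quantization and dynamic N:M sparsification, and shows that low-bit models are naturally better suited to N:M sparsity, substantially speeding up training and inference while preserving performance.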
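The two ingredients named above can be sketched on a flat list of weights. This is a minimal illustration, not the Sparse-BitNet implementation: it assumes the absmean ternary scheme described for BitNet b1.58 and a magnitude-based 2:4 mask, and the order in which the mask and the quantizer are combined here is an assumption. All function names are hypothetical.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, 1}.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def nm_sparsity_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask

def sparse_ternary(weights, n=2, m=4):
    """Apply the N:M mask to the ternary-quantized weights."""
    mask = nm_sparsity_mask(weights, n, m)
    quantized, scale = ternary_quantize(weights)
    return [q * keep for q, keep in zip(quantized, mask)], scale

toy_weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.8, 0.0]
sparse_weights, scale = sparse_ternary(toy_weights)
print(sparse_weights)  # [1, 0, 0, -1, 1, 0, -1, 0]
```

In this toy case the weights that ternary quantization already sends to zero are the same ones the 2:4 mask removes, which gives some intuition for why the two techniques might be compatible.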
Details
Motivation: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising ways to improve LLM efficiency, but they have been studied in isolation; this paper investigates how they interact.
Method: Proposes Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification with stable training; it covers sparse pretraining and dense-to-sparse schedules and uses a custom sparse tensor core for acceleration.
Result: At the same sparsity level, 1.58-bit BitNet degrades less and tolerates higher structured sparsity; training and inference speed up by as much as 1.30x.
Conclusion: Combining extremely low-bit quantization with semi-structured N:M sparsity is a promising and practical direction for building efficient LLMs.
Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
[61] Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions
Kun Chen,Xianglei Liao,Kaixue Fei,Yi Xing,Xinrui Li
Main category: cs.CL
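TL;DR: This paper proposes a systematic, operational annotation framework for modeling legal argumentation in judicial decisions, covering proposition types, relation types, formal representation, visualization, and the annotation workflow.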
Details
Motivation: To reveal the logical structure of judicial reasoning and to provide a reliable data foundation for computational analysis.
Method: Grounded in theories of legal reasoning and argumentation, builds an annotation framework with four proposition types (general normative, specific normative, general factual, specific factual) and five relation types (support, attack, joint, match, identity), and specifies formal representation rules, visualization conventions, and a standardized annotation workflow.
Result: Produces a conceptually clear, formally rigorous, and operationally feasible annotation guideline for legal argumentation that supports large-scale analysis of judicial reasoning as well as research on legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
Conclusion: The framework combines theoretical depth with practical feasibility and provides important methodological support for building structured data in legal AI.
Abstract: This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
[62] Transducing Language Models
Vésteinn Snæbjarnarson,Samuel Kiegeland,Tianyu Liu,Reda Boumasmoud,Ryan Cotterell,Tim Vieira
Main category: cs.CL
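TL;DR: This paper proposes a general framework for language models derived from deterministic string-to-string transformations, in particular finite-state transducers (FSTs), which adapts the output format of a pretrained language model at inference time without modifying model parameters.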
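The underlying probabilistic idea (a deterministic map applied to a model's output induces a new distribution, obtained by summing over all source strings that map to the same target) can be sketched on a toy finite distribution. This brute-force enumeration is an illustration only: it does not use an FST, real language models have unbounded support, and the function names and probabilities are made up.

```python
from collections import defaultdict

def pushforward(source_dist, f):
    """Marginalize: P(y) is the sum of P(x) over all sources x with f(x) == y."""
    target = defaultdict(float)
    for source, prob in source_dist.items():
        target[f(source)] += prob
    return dict(target)

def condition_on_target(source_dist, f, y):
    """Posterior over source strings given that the transformed output is y."""
    matching = {s: p for s, p in source_dist.items() if f(s) == y}
    total = sum(matching.values())
    if total == 0:
        raise ValueError("target string has zero probability")
    return {s: p / total for s, p in matching.items()}

# Toy distribution over token sequences; two tokenizations spell the same string.
token_dist = {("a", "b"): 0.3, ("ab",): 0.5, ("b",): 0.2}

def detokenize(tokens):
    return "".join(tokens)

char_dist = pushforward(token_dist, detokenize)
posterior = condition_on_target(token_dist, detokenize, "ab")
print(char_dist)
print(posterior)
```

Here the string "ab" receives the combined mass of both tokenizations, which is the marginalization that the paper's algorithms carry out through a transducer instead of by enumeration.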
Details
Motivation: Modern language models define distributions over strings, but downstream tasks often need different formats (e.g., word-level output or amino-acid sequences); prior work does not treat the transformed model as a complete new language model.
Method: Composes a language model with a finite-state transducer (FST) and propagates probabilities through marginalization and conditioning algorithms without changing model parameters; presents an exact algorithm, an efficient approximation, and a theoretical analysis.
Result: Validated on three tasks (tokens to bytes, tokens to words, and DNA to amino acids), showing that pretrained models can be adapted to the required output format at inference time.
Conclusion: Deterministic string transformations systematically induce new language models; FST composition provides a general, efficient way to adapt model outputs without fine-tuning.
Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids.
These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
[63] Diffusion LLMs can think EoS-by-EoS
Sarah Breckner,Sebastian Schuster
Main category: cs.CL
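TL;DR: This paper argues that diffusion language models (diffusion LLMs) improve complex reasoning by using end-of-sequence (EoS) tokens as a hidden scratchpad; experiments confirm that EoS tokens carry semantic information and take part in the computation during reasoning.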
Details
Motivation: Diffusion LLMs are observed to do better on complex reasoning tasks when the generation length far exceeds what is needed (i.e., the answer is padded with many EoS tokens); the authors seek to explain this counterintuitive phenomenon and test whether EoS tokens serve a computational function.
Method: On the Addition, Entity Tracking, and Sudoku tasks, runs controlled prompting experiments and causal intervention experiments on the diffusion LLMs LLaDA1.5, LLaDA2.0-mini, and Dream-v0: first, increasing the number of EoS tokens and observing the change in performance; second, patching the hidden states of EoS tokens (replacing them with those of a counterfactual generation) to test their effect on the output.
Result: Adding EoS tokens significantly improves reasoning accuracy; patching EoS hidden states frequently changes the final output, indicating that they carry task-relevant semantic information and take part in the reasoning process.
Conclusion: Diffusion LLMs do think "EoS-by-EoS": EoS tokens are not meaningless padding but serve as a hidden computation space that supports complex reasoning.
Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.
[64] Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic
Sara Candussio,Gabriele Sarti,Gaia Saveri,Luca Bortolussi
Main category: cs.CL
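TL;DR: This paper proposes a framework that distills the semantic geometry of formal specifications, such as Signal Temporal Logic (STL), into continuous neural representations. A teacher-student setup distills a symbolic robustness kernel into a Transformer encoder, yielding efficient, invertible, semantics-preserving neural embeddings.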
Details
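Motivation: Existing methods either rely on symbolic kernels that are computationally expensive, anchor-dependent, and non-invertible, or use syntax-driven neural embeddings that fail to capture deeper semantic structure; a middle ground combining semantic fidelity with computational efficiency is needed.
Method: Uses a teacher-student architecture with the symbolic robustness kernel as teacher and a Transformer encoder as student; designs a continuous supervision objective based on kernel-weighted geometric alignment (rather than standard contrastive learning) so that the student's embedding space geometrically approximates the semantic distances of the teacher kernel.
Result: On STL, the resulting embeddings accurately preserve semantic similarity between formulae, predict robustness and constraint satisfaction with high accuracy, and are intrinsically invertible; inference needs only a single forward pass, greatly reducing runtime cost.
Conclusion: The method enables efficient, scalable neuro-symbolic reasoning and formula reconstruction, overcoming the computational bottleneck of symbolic kernels while addressing the semantic gaps of purely syntactic embeddings, and offers a new paradigm for combining formal methods with deep learning.
Abstract: We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels -- which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible -- or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel's logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.
[65] Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding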
Ofir Ben Shoham
Main category: cs.CL
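TL;DR: This paper proposes a vocabulary trimming method for the draft model in speculative decoding that optimizes the trade-off between coverage and latency, significantly improving inference throughput and reducing latency on domain-specific tasks.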
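The coverage-versus-cost trade-off can be made concrete with a small sketch. It assumes the trimmed vocabulary keeps the most frequent tokens, approximates the language modeling head cost as 2 * hidden_size * vocab_size FLOPs per generated token, and uses a plain linear scan instead of the Tree-structured Parzen Estimator the paper describes. The function names and token counts are hypothetical.

```python
from collections import Counter

def coverage_at_k(token_counts, k):
    """Fraction of token occurrences covered by the k most frequent tokens."""
    total = sum(token_counts.values())
    kept = sum(count for _, count in token_counts.most_common(k))
    return kept / total

def lm_head_flops(hidden_size, vocab_size):
    """Rough multiply-add cost of one LM-head projection (an approximation)."""
    return 2 * hidden_size * vocab_size

def smallest_vocab_meeting(token_counts, min_coverage):
    """Smallest k whose top-k vocabulary reaches the minimum coverage."""
    for k in range(1, len(token_counts) + 1):
        if coverage_at_k(token_counts, k) >= min_coverage:
            return k
    return len(token_counts)

# Toy token frequencies standing in for counts over assistant responses.
toy_counts = Counter({"the": 50, "a": 30, "cat": 15, "zygote": 5})
k = smallest_vocab_meeting(toy_counts, min_coverage=0.9)
print(k, coverage_at_k(toy_counts, k), lm_head_flops(hidden_size=4, vocab_size=k))
```

Since the head cost in this approximation grows linearly with vocabulary size while coverage saturates quickly on skewed token distributions, a small vocabulary can retain most of the coverage at a fraction of the cost.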
Details
Motivation: The draft model is often the performance bottleneck in speculative decoding, because it generates tokens sequentially and the cost of its language modeling head grows sharply with vocabulary size. This creates a fundamental trade-off between vocabulary size on one side and coverage and latency on the other.
Method: Draft vocabulary selection is cast as a constrained optimization problem. Token coverage is computed over assistant responses in the training data, latency is estimated with architecture-aware FLOPs, and a Tree-structured Parzen Estimator optimizes a utility function to explore the Pareto frontier.
Result: Experiments show that the vocabulary can be reduced by up to 97% while retaining high coverage. On domain-specific tasks, latency drops by up to 16% and throughput improves by up to 20%; on out-of-distribution tasks, throughput improves by up to 6.7%.
Conclusion: Vocabulary trimming effectively relieves the draft model bottleneck and significantly improves speculative decoding efficiency while preserving coverage, especially in domain-specific settings.
Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
[66] VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh,Hung Nguyen Huy,Mo El-Haj
Main category: cs.CL
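TL;DR: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, with 48,092 postings and over 15 million words from all 34 provinces and municipalities of Vietnam, covering 16 occupational domains and multiple employment types. The paper builds two benchmark tasks on the dataset, job category classification and salary prediction, evaluates several large language models, and highlights the challenges of Vietnamese-specific and multilingual modelling for structured labour market prediction.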
Details
Motivation: The lack of large-scale, high-quality Vietnamese job advertisement text that covers regional and socio-economic diversity has held back research on Vietnamese natural language processing and labour market analysis.
Method: The authors build the VietJobs corpus with multi-dimensional structured annotations (job titles, categories, salaries, skills, employment conditions, and more) and design two benchmark tasks, job category classification and salary estimation. Several generative large language models, including instruction-tuned models, are evaluated systematically under few-shot and fine-tuned settings.
Result: Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT perform well on both tasks, but challenges remain from Vietnamese-specific linguistic phenomena and multilingual modelling.
Conclusion: VietJobs fills a key data gap in Vietnamese NLP and provides a new benchmark and foundational resource for recruitment language understanding, socio-economic representation modelling, and AI-driven labour market analysis.
Abstract: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
[67] Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh
Mohammad Mamun Or Rashid
Main category: cs.CL
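TL;DR: This paper introduces the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal corpus of Bangladesh's ethnic and indigenous languages. It covers 42 language varieties with structured text and approximately 107 hours of transcribed audio, and aims to support endangered language documentation, low-resource NLP, and digital preservation.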
Details
Motivation: Bangladesh has approximately 40 minority languages across four language families, 14 of them classified as endangered, yet it has long lacked a systematic, cross-family digital corpus, particularly because these languages are predominantly oral and have almost no computational resources.
Method: Data were collected through 90 days of fieldwork across 9 districts by 16 data collectors, 77 native speakers, and 43 validators, following a predefined template of 2224 items at three levels of granularity (lexical items, grammatical constructions, directed speech). Ten linguists then produced IPA transcriptions, which 6 reviewers adjudicated independently. All data are published on the multiling.cloud platform.
Result: The result is a multimodal corpus of 85792 structured textual entries (each with a Bengali stimulus text, an English translation, and an IPA transcription) and approximately 107 hours of annotated audio, covering 42 language varieties (including 2 unclassified languages), fully public and searchable.
Conclusion: The corpus fills a key gap in digital infrastructure for low-resource South Asian languages and provides a reusable methodology and foundational resource for endangered language archiving, multilingual NLP model training, and digital preservation of linguistic diversity in developing countries.
Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
[68] Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Qiao Jin,Yin Fang,Lauren He,Yifan Yang,Guangzhi Xiong,Zhizheng Wang,Nicholas Wan,Joey Chan,Donald C. Comeau,Robert Leaman,Charalampos S. Floudas,Aidong Zhang,Michael F. Chiang,Yifan Peng,Zhiyong Lu
Main category: cs.CL
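TL;DR: This paper presents Med-V1, a small language model with only 3 billion parameters designed for biomedical evidence attribution and claim verification. It substantially outperforms its base models on several biomedical benchmarks, performs comparably to frontier models such as GPT-5, and offers high interpretability. The authors also use Med-V1 in two first-of-their-kind application studies: quantifying hallucinations in LLM-generated answers and automatically identifying high-stakes evidence misattributions in clinical guidelines.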
Details
Motivation: Frontier large language models such as GPT-5 can be used for claim verification and hallucination detection, but they are too expensive to deploy. A lightweight, efficient, and accurate alternative is needed for biomedical evidence attribution.
Method: The authors propose the Med-V1 family of small language models (3B parameters), trained on high-quality synthetic data newly built in this study, and unify five biomedical benchmarks into a verification format for evaluation. They also run two real-world use case studies: quantifying LLM hallucinations and identifying evidence misattributions in clinical guidelines.
Result: Med-V1 improves on its base models by 27.0% to 71.3% across the five biomedical benchmarks, performs comparably to GPT-5, and provides high-quality explanations for its predictions. The use case studies show that citation format instructions strongly affect hallucination rates, and that Med-V1 identifies evidence misattributions in clinical guidelines that pose potential public health risks.
Conclusion: Med-V1 is an efficient, accurate, and interpretable lightweight model that offers a practical alternative to frontier large models for biomedical evidence attribution and verification.
Abstract: Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
[69] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi,Heshaam Faili,Azadeh Shakery
Main category: cs.CL
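TL;DR: This paper presents the PersianPunc dataset and a lightweight ParsBERT-based punctuation restoration method that markedly improves the readability and utility of Persian ASR output while avoiding the over-correction and high computational cost of large language models.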
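To make the "token-level sequence labeling" formulation in this entry concrete, the sketch below converts punctuated text into (word, label) training pairs, where each label is the punctuation mark that follows the word or `O` for none. The punctuation inventory, the `O` tag, and the function name are assumptions for illustration; the dataset's actual label scheme is not given in this digest.

```python
# Assumed punctuation inventory (Latin and Persian marks); not taken from the dataset.
PUNCT = ".,?!:،؟؛"

def to_labeled_tokens(text):
    """Turn punctuated text into (word, label) pairs for sequence labeling.

    The label is the first punctuation mark directly after the word,
    or "O" when the word is not followed by punctuation.
    """
    pairs = []
    for raw in text.split():
        word = raw.rstrip(PUNCT)
        if not word:
            continue  # a stand-alone punctuation mark carries no word to label
        trailing = raw[len(word):]
        label = trailing[0] if trailing else "O"
        pairs.append((word, label))
    return pairs

# The model input is the unpunctuated word sequence; the labels are the targets.
example = to_labeled_tokens("سلام، خوبی؟")
words = [w for w, _ in example]
labels = [l for _, l in example]
print(words, labels)
```

A fine-tuned encoder would predict one such label per token, which keeps the input words untouched and so avoids edits beyond punctuation insertion.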
Details
Motivation: Punctuation restoration is essential for the readability of automatic speech recognition (ASR) output and for downstream applications, yet it remains seriously underexplored for Persian.
Method: The authors build PersianPunc, a high-quality Persian punctuation restoration dataset of 17 million samples, formulate the task as token-level sequence labeling, and fine-tune ParsBERT for efficient restoration.
Result: The method reaches a macro-averaged F1 score of 91.33% on the test set, combines strong performance with low latency, and compares favourably with large language models. The dataset and model are released publicly.
Conclusion: A lightweight BERT-based approach is more practical, efficient, and robust for Persian punctuation restoration, and provides a scalable framework for other morphologically rich, low-resource languages.
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
[70] A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott,Verena Riegler,Horacio Saggion,Almudena Rascón Alcaina,Nouran Khallaf
Main category: cs.CL
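TL;DR: This paper presents a high-quality Easy-to-Read (E2R) text simplification corpus for Spanish, Catalan, and Italian, intended to support research on automatic text simplification and in particular to fill the data gap for low-resource languages in this area.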
Details
Motivation: High-quality training and evaluation data for text simplification are scarce for less-resourced languages such as Spanish, Catalan, and Italian. The corpus addresses this gap in order to support research on Easy-to-Read language in the context of democratic participation.
Method: Within the iDEM project, the authors collected original texts related to democratic participation, covering several text types, and had text simplification experts manually simplify them to Easy-to-Read (E2R) level. The texts were selected to meet copyright and ethical standards.
Result: The work yields the first annotated Easy-to-Read corpus for Catalan and provides scarce, high-quality, human-simplified resources for Spanish and Italian. The corpus will be released publicly and free of charge.
Conclusion: The corpus substantially improves data availability for automatic text simplification in less-resourced languages and is especially valuable for promoting inclusive democratic participation.
Abstract: Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high quality simplification produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularly valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.
[71] Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho,Francisco Teixeira,Thomas Rolland,Alberto Abad
Main category: cs.CL
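TL;DR: This paper studies model merging for multi-domain automatic speech recognition (ASR), evaluates 11 merging algorithms on 10 European Portuguese domains, and proposes a new algorithm, BoostedTSV-M, that outperforms full fine-tuning while preserving out-of-distribution generalisation.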
Details
Motivation: Large speech foundation models are usually adapted through domain-specific fine-tuning, which produces many customised checkpoints. Repeating full fine-tuning whenever new data arrive is computationally prohibitive, so a scalable alternative, model merging, is needed.
Method: The authors benchmark 11 model merging algorithms on 10 European Portuguese domains, covering in-domain accuracy, robustness under distribution shift, and English and multilingual performance. They also propose BoostedTSV-M, a new algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability.
Result: BoostedTSV-M outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation.
Conclusion: Model merging is an effective and efficient alternative for multi-domain adaptation of large speech models, and BoostedTSV-M further improves merging quality and stability.
Abstract: Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
[72] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning
Mohammad Mahdi Moradi,Sudhir Mudur
Main category: cs.CL
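TL;DR: This paper proposes DiSCTT, a framework that adapts the reasoning performance of large language models at test time by dynamically allocating optimization strategies according to instance-level epistemic uncertainty, estimated from agreement among reasoning trajectories. High-consensus inputs are handled with supervised fine-tuning, using majority-agreed solutions as pseudo-labels; low-consensus inputs are handled with consensus-regularized reinforcement learning that balances diversity with relevance constraints. The method clearly outperforms existing test-time adaptation methods on mathematical and general reasoning benchmarks, with higher accuracy, lower variance, and less computation and training time.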
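The consensus-based routing in this entry can be sketched as follows: sample several answers for one input, measure how often the majority answer occurs, and send the input to supervised fine-tuning or to reinforcement learning accordingly. The function name, the returned dictionary, and the `0.6` threshold are hypothetical choices for illustration, not values from the paper.

```python
from collections import Counter

def route_by_consensus(answers, threshold=0.6):
    """Route one input based on agreement among sampled final answers.

    `answers` holds the final answers of several sampled reasoning trajectories.
    `threshold` is an assumed cut-off on the majority share.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement >= threshold:
        # High consensus: use the majority answer as a pseudo-label for SFT.
        return {"mode": "sft", "pseudo_label": majority, "agreement": agreement}
    # Low consensus: no trusted label, so defer to the RL objective.
    return {"mode": "rl", "pseudo_label": None, "agreement": agreement}

easy = route_by_consensus(["42", "42", "42", "41"])
hard = route_by_consensus(["1", "2", "3", "1"])
print(easy["mode"], hard["mode"])
```

Agreement here is only a proxy for epistemic uncertainty; a consistently wrong majority would still be routed to fine-tuning, which is a known risk of pseudo-labeling in general.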
Details
Motivation: Existing test-time adaptation methods apply a uniform optimization objective to all inputs, which is inefficient and unstable on heterogeneous reasoning tasks. The strategy should instead adjust dynamically to instance difficulty and uncertainty.
Method: DiSCTT estimates instance-level epistemic uncertainty from agreement among sampled reasoning trajectories. High-consensus inputs are consolidated via supervised fine-tuning with majority-agreed solutions as pseudo-labels; low-consensus inputs are optimized via consensus-regularized reinforcement learning that encourages diversity under relevance constraints.
Result: Across multiple mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, with higher accuracy, lower variance, and substantially lower computation and wall-clock training time.
Conclusion: Explicitly modelling instance difficulty and uncertainty improves the stability, efficiency, and effectiveness of test-time adaptation.
Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
[73] Progressive Residual Warmup for Language Model Pretraining
Tianhao Chen,Xin Xu,Lu Yin,Hao Chen,Yang Wang,Shizhe Diao,Can Yang
Main category: cs.CL
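TL;DR: This paper proposes Progressive Residual Warmup (ProRes), which gradually increases the weight of each layer's residual connection from 0 to 1 so that shallow layers learn first and deeper layers join later, improving the stability and convergence speed of Transformer pretraining.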
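A minimal sketch of the per-layer warmup scalar described in this entry, assuming a linear ramp whose length grows with layer depth. The linear shape, the `base_warmup` and `per_layer` step counts, and the function names are illustrative assumptions; the paper's exact schedule is not stated in this digest.

```python
def residual_scale(step, layer_idx, base_warmup=1000, per_layer=500):
    """Scalar in [0, 1] applied to one layer's residual branch at a training step.

    Deeper layers (larger layer_idx) get a longer warmup, so they contribute
    later than shallow layers. All constants here are assumed values.
    """
    warmup_steps = base_warmup + per_layer * layer_idx
    return min(1.0, step / warmup_steps)

def residual_update(x, sublayer_out, scale):
    """Compute x + scale * F(x) element-wise for plain Python lists."""
    return [xi + scale * fi for xi, fi in zip(x, sublayer_out)]

# At step 1000 the first layer is fully active while layer 2 is only half-way.
print(residual_scale(1000, layer_idx=0), residual_scale(1000, layer_idx=2))
print(residual_update([1.0, 2.0], [4.0, 4.0], scale=0.5))
```

With a scale of 0 a layer reduces to the identity mapping, which is why deeper layers effectively wait until earlier layers have started to settle.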
Details
Motivation: Pretraining stability and convergence speed are central concerns for Transformer models. Motivated by the logical dependency among sequentially stacked layers, the authors aim to let shallow layers stabilize before deeper layers take part in learning.
Method: ProRes multiplies each layer's residual by a scalar that grows gradually from 0 to 1 over training steps, with deeper layers warming up more slowly than shallow ones, which implements an "early layer learns first" progressive strategy.
Result: ProRes is validated across model scales, normalization schemes, and initialization schemes: it improves pretraining stability, speeds up convergence, and strengthens generalization and downstream performance.
Conclusion: ProRes is a simple, effective, plug-and-play pretraining strategy that markedly improves the training dynamics and final performance of Transformer language models.
Abstract: Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
[74] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs
Deshan Sumanathilaka,Nicholas Micallef,Julian Hough
Main category: cs.CL
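TL;DR: This paper examines how reasoning-driven fine-tuning lets low-parameter LLMs (<4B) match GPT-4-Turbo on word sense disambiguation (WSD) while substantially reducing computational and energy costs.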
Details
Motivation: High-parameter models perform very well on WSD, but their computational and energy costs are high and limit scalability. Rare or domain-specific senses also remain hard to identify, so an efficient lightweight solution is needed.
Method: The authors augment the FEWS dataset with semi-automated, rationale-rich annotations and fine-tune eight small open-source models (e.g. Gemma and Qwen), focusing on Chain-of-Thought (CoT) reasoning combined with neighbour-word analysis.
Result: Gemma-3-4B and Qwen-3-4B outperform all medium-parameter baselines and state-of-the-art models on FEWS, match GPT-4-Turbo in zero-shot settings, and generalize well to the unseen cross-domain "Fool Me If You Can" dataset.
Conclusion: Carefully designed, reasoning-centric fine-tuning lets low-parameter LLMs combine accuracy with efficiency on WSD, opening a path to greener, deployable semantic understanding.
Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.
[75] Ensembling Language Models with Sequential Monte Carlo
Robin Shing Moon Chan,Tianyu Liu,Samuel Kiegeland,Clemente Pasti,Jacob Hoover Vigly,Timothy J. O'Donnell,Ryan Cotterell,Tim Vieira
Main category: cs.CL
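TL;DR: This paper proposes a unified f-ensemble framework for composing multiple language models and a byte-level sequential Monte Carlo (SMC) sampling algorithm that overcomes the bias of traditional probability averaging during decoding, improving performance on structured text generation tasks.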
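The bias this entry refers to can be seen on a toy example: applying an aggregation function `f` to full-string probabilities and normalizing once (the global ensemble) generally differs from applying `f` to next-symbol probabilities and renormalizing at every step (the naive local ensemble). The two toy models, the two-symbol alphabet, and the choice of a product for `f` are assumptions for illustration; the sketch enumerates all strings exhaustively and does not implement the paper's SMC sampler.

```python
from itertools import product

ALPHABET = "ab"

# Two toy autoregressive models over strings of length 2:
# each maps a prefix to a next-symbol distribution.
P1 = {"": {"a": 0.9, "b": 0.1}, "a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.9, "b": 0.1}}
P2 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.9, "b": 0.1}}

def f(p, q):
    """Aggregation function; a product is used here as one possible f."""
    return p * q

def string_prob(model, s):
    """Probability a model assigns to the full string s."""
    prob = 1.0
    for i, ch in enumerate(s):
        prob *= model[s[:i]][ch]
    return prob

def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

strings = ["".join(t) for t in product(ALPHABET, repeat=2)]

# Global ensemble: aggregate full-string probabilities, normalize once.
global_ens = normalize({s: f(string_prob(P1, s), string_prob(P2, s)) for s in strings})

def local_prob(s):
    """Naive decoding: aggregate and renormalize next-symbol probabilities each step."""
    prob = 1.0
    for i, ch in enumerate(s):
        step = normalize({c: f(P1[s[:i]][c], P2[s[:i]][c]) for c in ALPHABET})
        prob *= step[ch]
    return prob

local_ens = {s: local_prob(s) for s in strings}
gap = max(abs(global_ens[s] - local_ens[s]) for s in strings)
print(f"largest per-string gap between global and local ensembles: {gap:.4f}")
```

Exhaustive enumeration is only possible for toy string spaces, which is why a sampling method is needed to approximate the global ensemble in practice.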
Details
Motivation: Many language models and prompting strategies are available, but performance is highly sensitive to both choices. Classical ensembling is theoretically well founded, yet directly aggregating token probabilities during language model decoding introduces bias and fails to approximate the true ensemble distribution over strings.
Method: The authors propose the f-ensemble distribution framework, based on a function f, which covers many aggregation schemes. They design a byte-level sequential Monte Carlo (SMC) algorithm that samples in a shared character space, so that models with different vocabularies can be combined.
Result: f-ensembles are validated on a range of structured text generation tasks, showing that aggregation strategies other than averaging, and better posterior approximations, can markedly improve ensemble performance.
Conclusion: The f-ensemble framework and its byte-level SMC sampler give a more flexible, consistent, and higher-performing decoding scheme for language model ensembles, particularly for heterogeneous models and complex generation tasks.
Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
[76] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri,Markus Hoehnerbach,Jay Shah,Timmy Liu,Vijay Thakkar,Tri Dao
Main category: cs.CL
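TL;DR: This paper presents FlashAttention-4, which optimizes attention computation for the asymmetric hardware scaling of Blackwell GPUs (e.g., B200/GB200). Using asynchronous MMA pipelines, software-emulated exponential and softmax rescaling, tensor memory, and the 2-CTA mode, it achieves up to 1.3× and 2.7× speedups over cuDNN and Triton on B200. It is implemented in CuTe-DSL, with 20-30× faster compile times.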
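To give a feel for what "software-emulated exponential" can mean, the sketch below computes 2^x without a hardware exponential unit: it splits x into an integer and a fractional part, evaluates a short polynomial on the fraction, and applies the integer part as an exponent shift. This is a generic range-reduction technique written in plain Python under assumed choices (degree-5 Taylor coefficients, Horner evaluation); it is not the FlashAttention-4 kernel, whose details are not given in this digest.

```python
import math

# Degree-5 Taylor coefficients of 2**f = exp(f * ln 2) around f = 0 (assumed choice).
_LN2 = math.log(2.0)
_COEFFS = [_LN2 ** k / math.factorial(k) for k in range(6)]

def exp2_emulated(x):
    """Approximate 2**x with a polynomial plus an exponent shift."""
    n = math.floor(x)      # integer part, applied exactly as a power of two
    frac = x - n           # fractional part in [0, 1)
    poly = 0.0
    for c in reversed(_COEFFS):
        poly = poly * frac + c   # Horner's rule: only multiplies and adds
    return math.ldexp(poly, n)   # poly * 2**n

for x in (-10.0, -3.7, 0.0, 0.5, 4.25):
    approx, exact = exp2_emulated(x), 2.0 ** x
    print(f"x={x:6.2f} approx={approx:.6g} rel_err={abs(approx - exact) / exact:.2e}")
```

The polynomial uses only fused multiply-add style arithmetic, which is the kind of work that can run on general-purpose units while the dedicated exponential unit is busy.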
Details
Motivation: FlashAttention-3 primarily targets the H100, while the AI industry has moved quickly to Blackwell-based systems such as the B200 and GB200. Their hardware scales asymmetrically (tensor core throughput doubles, while shared memory bandwidth, exponential units, and other units grow slowly or not at all), so earlier attention optimizations are no longer efficient and the bottlenecks must be re-addressed for the new architecture.
Method: Three key techniques are proposed: (1) redesigned asynchronous MMA pipelines with larger tile sizes; (2) software-emulated exponential and conditional softmax rescaling to reduce non-matmul operations; (3) use of tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. Everything is implemented in CuTe-DSL embedded in Python.
Result: On B200 GPUs with BF16, FlashAttention-4 is up to 1.3× faster than cuDNN 9.13 and up to 2.7× faster than Triton, reaching 1613 TFLOPs/s (71% utilization). Compile times are 20-30× faster than with traditional C++ templates, with full expressivity retained.
Conclusion: FlashAttention-4 adapts attention to the shifted bottlenecks of the Blackwell architecture, demonstrates the value of hardware-software co-design and of a domain-specific language (CuTe-DSL) for building high-performance AI systems, and offers a new approach to efficient attention on next-generation GPUs.
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
[77] DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Klaywert Danillo Ferreira de Souza,David Eduardo Pereira,Cláudio E. C. Campelo,Larissa Lucena Vasconcelos
Main category: cs.CL
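TL;DR: This paper presents the DEBISS corpus, which addresses the current scarcity of debate corpora; it covers spoken, individual debates and supports annotations for multiple NLP tasks.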
Details
Motivation: Existing debate corpora are scarce and struggle to cover the diversity of debate formats, structures, and application settings.
Method: The authors build the DEBISS corpus of spoken, individual debates with semi-structured features, and provide annotations for several NLP tasks: speech-to-text, speaker diarization, argument mining, and debater quality assessment.
Result: A new debate corpus, DEBISS, that targets diverse debate settings and supports multiple NLP tasks.
Conclusion: DEBISS fills a gap in debate corpus resources and provides an important foundation for debate-related NLP research.
Abstract: The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus includes a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
[78] NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Abrar Eyasir,Tahsin Ahmed,Muhammad Ibrahim
Main category: cs.CL
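TL;DR: This paper presents NCTB-QA, a large-scale Bangla educational question answering dataset with a balanced mix of answerable and unanswerable questions, and shows that domain-specific fine-tuning is key to improving the robustness of QA systems for low-resource languages.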
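The abstract of this entry reports BERT's F1 rising from 0.150 to 0.620 and describes it as a 313% relative improvement. The short check below reproduces that figure from the two reported scores; the helper name is hypothetical and nothing beyond those two numbers is taken from the paper.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# F1 scores for BERT before and after fine-tuning, as reported in the abstract.
f1_gain = relative_improvement(0.150, 0.620)
print(f"relative F1 improvement: {f1_gain:.1f}%")
```

The large percentage partly reflects the low starting point: the absolute gain is 0.47 F1.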
Details
Motivation: Reading comprehension systems for low-resource languages behave unreliably on unanswerable questions, so a high-quality, balanced, and challenging benchmark dataset is needed.
Method: The authors build the NCTB-QA dataset (87,805 question-answer pairs, 42.75% of them unanswerable, plus adversarial distractors) and fine-tune and evaluate BERT, RoBERTa, and ELECTRA on it.
Result: BERT achieves a 313% relative improvement in F1 score (0.150 to 0.620), and BERTScore also rises significantly, confirming that domain-specific fine-tuning is critical in low-resource settings.
Conclusion: NCTB-QA is a challenging new benchmark. It underlines that QA systems for education in low-resource languages must handle both answerability judgement and domain adaptation.
Abstract: Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
[79] Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Artem Vazhentsev,Maria Marina,Daniil Moskovskiy,Sergey Pletenev,Mikhail Seleznyov,Mikhail Salnikov,Elena Tutubalina,Vasily Konovalov,Irina Nikishina,Alexander Panchenko,Viktor Moskvoretskii
Main category: cs.CL
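TL;DR: This paper proposes the new task of fact-checking without retrieval, verifying natural language claims through internal model representations, and introduces INTRA, which achieves strong generalization and state-of-the-art performance across multiple dimensions.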
Details
Motivation: Existing LLM-based fact-checking methods depend on external retrieval, are limited by retrieval errors and data availability, and make little use of the model's intrinsic fact-verification ability.
Method: The authors propose the task of fact-checking without retrieval; build a comprehensive evaluation framework covering long-tail knowledge, claims from multiple sources, multilinguality, and long-form generation; and propose INTRA, which exploits interactions between internal model representations to verify facts.
Result: Experiments across 9 datasets, 18 methods, and 3 models show that logit-based methods often underperform methods that use internal representations. INTRA reaches state-of-the-art performance with strong generalization.
Conclusion: Fact-checking without retrieval is a promising research direction that can complement retrieval-based frameworks, improve scalability, and serve as a training reward signal or as a component integrated into generation.
Abstract: Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
[80] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana,Annabel Ma,Max Loeffler,Raphael Sarfati,Eric Bigelow,Atticus Geiger,Owen Lewis,Jack Merullo
Main category: cs.CL
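TL;DR: This paper reveals "performative chain-of-thought" in reasoning models: a model has already settled on an answer but keeps generating redundant reasoning steps. Methods such as activation probing can decode the answer early, saving substantial computation, especially on easy tasks.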
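A probe-guided early exit of the kind this entry describes can be sketched as a simple stopping rule: read a probe's confidence in the final answer at each reasoning step and stop once it has stayed high for a few consecutive steps. The confidence values, the `0.95` threshold, and the `patience` parameter are hypothetical; how the probe itself is trained is outside this sketch.

```python
def probe_guided_early_exit(confidences, threshold=0.95, patience=2):
    """Return the index of the reasoning step at which to stop generating.

    `confidences` is a non-empty list of per-step probe confidences in the
    model's eventual answer. Generation stops once the confidence has been
    at or above `threshold` for `patience` consecutive steps; otherwise the
    full chain is kept and the last index is returned.
    """
    streak = 0
    for step, conf in enumerate(confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return len(confidences) - 1

def fraction_saved(stop_step, total_steps):
    """Share of reasoning steps skipped by stopping after `stop_step`."""
    return 1 - (stop_step + 1) / total_steps

trace = [0.40, 0.96, 0.97, 0.99, 0.99]
stop = probe_guided_early_exit(trace)
print(stop, fraction_saved(stop, len(trace)))
```

Requiring a short streak rather than a single confident step guards against stopping on a transient spike just before a backtracking moment.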
Details
Motivation: The paper investigates whether, during chain-of-thought (CoT) generation, a large language model's internal belief formation is disconnected from its external output, that is, whether the process is "performative" rather than genuine reasoning.
Method: The authors combine activation probing, early forced answering, and a CoT monitor on two large models, DeepSeek-R1 671B and GPT-OSS 120B, and compare belief evolution and generation behaviour on MMLU (easy) and GPQA-Diamond (hard).
Result: (1) For easy MMLU questions, the answer can be decoded from activations early in the CoT, long before the model outputs its final answer; (2) for hard GPQA questions, the reasoning is more consistent with genuine multi-step reasoning; (3) inflection points such as backtracking and "aha" moments correlate strongly with large probed belief shifts; (4) probe-guided early exit saves up to 80% of tokens on MMLU and 30% on GPQA with essentially unchanged accuracy.
Conclusion: Chain-of-thought contains a large amount of unnecessary generation, which on easy tasks amounts to "reasoning theater". Activation probes can identify the model's actual belief state and offer a route to adaptive, efficient reasoning.
Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
cs.CV [Back]
[81] Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology
Ekansh Arora
Main category: cs.CV
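TL;DR: This study examines how well CPath-CLIP transfers across cancer types and species in pathology image recognition. It finds that standard vision-language alignment performs poorly in cross-species settings, proposes Semantic Anchoring, which uses text guidance to stabilize visual feature representations and mitigate embedding collapse, and identifies a new failure mode driven by species-dominated alignment: semantic collapse.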
Details
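Motivation: Foundation models are increasingly used in computational pathology, but their behavior under cross-cancer and cross-species transfer is still unclear and needs systematic evaluation and improvement. Method: Using whole-slide image patches from canine and human tissue, evaluate few-shot fine-tuning of CPath-CLIP on same-cancer, cross-cancer, and cross-species tasks; combine embedding-space analysis (cosine similarity), Grad-CAM visualization, and ablation studies; propose Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Result: Few-shot fine-tuning raises same-cancer AUC by 7.7 percentage points and cross-cancer AUC by 9.47 percentage points; cross-species performance remains below H-optimus-0 (84.97% AUC); tumor and normal prototype embeddings are highly similar (cosine similarity > 0.99), and prototype-based models are domain-locked; Semantic Anchoring yields further gains on same-cancer (+8.52%) and cross-cancer (+5.67%) classification, and the benefit comes from the text-alignment mechanism itself, independent of text encoder complexity; a new failure mode is identified: semantic collapse driven by species-dominated alignment. Conclusion: Language can act as a control mechanism that enables semantic re-interpretation without retraining; Semantic Anchoring mitigates embedding collapse and improves cross-domain generalization; the work exposes a fundamental semantic failure mode caused by alignment bias in cross-species pathology AI. Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity.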
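As a hedged illustration of the embedding-space check mentioned in the abstract, the sketch below computes the cosine similarity between two class prototypes. It assumes a prototype is the mean embedding of a class (the abstract does not say how prototypes are built) and uses synthetic vectors rather than CPath-CLIP features; `class_prototype` and `cosine_similarity` are hypothetical helper names.

```python
import numpy as np

def class_prototype(embeddings: np.ndarray) -> np.ndarray:
    """Mean embedding of one class (an assumed prototype definition)."""
    return embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-in for patch embeddings: a large component shared by both
# classes plus a small class-specific perturbation.
rng = np.random.default_rng(0)
shared = 10.0 * rng.normal(size=512)
tumor = shared + rng.normal(scale=0.1, size=(100, 512))
normal = shared + rng.normal(scale=0.1, size=(100, 512))

similarity = cosine_similarity(class_prototype(tumor), class_prototype(normal))
print(f"tumor/normal prototype cosine similarity: {similarity:.4f}")
```

When a shared component dominates both classes, the prototypes point in almost the same direction, which is one way a similarity above 0.99 can arise even though the classes differ.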
Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.
[82] Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living
Kooshan Hashemifard,Pau Climent-Pérez,Francisco Florez-Revuelta
Main category: cs.CV
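TL;DR: This paper proposes a multimodal method for recognizing the daily activities of older adults. It combines 3D CNN visual features, 3D human pose processed by a graph convolutional network, and object context fused through cross-attention, and achieves competitive classification accuracy on the Toyota SmartHome dataset, making it suitable for smart elder-care environments.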
Details
Motivation: Recognizing older adults' daily activities indoors is difficult because of high intra-class variability, high inter-class similarity, changes in environment and viewpoint, and scene complexity; addressing these challenges supports the health-monitoring and independence goals of Ambient Assisted Living (AAL) systems. Method: Combine a 3D convolutional neural network (visual information), a graph convolutional network (3D human pose), and contextual information from an object detection module, fusing the context with the 3D CNN features through cross-attention. Result: On Toyota SmartHome, a dataset of real-world indoor activities, the system achieves competitive daily-activity classification accuracy. Conclusion: The multimodal method can serve as a key component of advanced AAL monitoring solutions and helps improve the safety and autonomy of older adults. Abstract: Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.
[83] InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities
Chengshuai Yang,Xin Yuan
Main category: cs.CV
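TL;DR: This paper introduces InverseNet, the first cross-modality benchmark for operator mismatch. It shows that deep learning methods degrade sharply under operator mismatch, that performance and robustness are inversely correlated, and that blind calibration is effective.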
Details
Motivation: State-of-the-art compressive imaging methods such as EfficientSCI degrade sharply when the assumed forward operator deviates from physical reality, yet no benchmark quantifies this pervasive operator mismatch. Method: Build the InverseNet benchmark covering three modalities (CASSI, CACTI, and single-pixel cameras); define a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration); evaluate 12 methods on 27 simulated scenes and 9 real hardware captures; run correlation analyses and ablations. Result: (1) Deep learning methods lose 10-21 dB under mismatch and lose their advantage over classical methods; (2) performance and robustness are significantly negatively correlated (r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover none of the mismatch loss, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration reaches 85-100% of the oracle bound. Conclusion: Operator mismatch is a key bottleneck for deploying compressive imaging; explicit operator modeling and blind calibration are essential for practical robustness; InverseNet provides a standardized evaluation platform for follow-up work. Abstract: State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.
[84] Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data
Ancymol Thomas,Jaya Sreevalsan-Nair
Main category: cs.CV
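TL;DR: This paper systematically analyzes deep learning fusion strategies for Local Climate Zone (LCZ) classification from multimodal remote sensing data (SAR and MSI). The baseline hybrid fusion model (FM1) combined with band grouping (BG) and label merging (LM) performs best, with an overall accuracy of 76.6%, and notably improves prediction for underrepresented classes.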
Details
Motivation: Existing work lacks a systematic analysis of the fusion mechanisms in deep learning models for multimodal LCZ classification; in particular, the effects of fusion level (pixel, feature, decision) and data grouping strategy are unclear. Method: On the So2Sat LCZ42 dataset, compare four fusion models: FM1 (baseline hybrid fusion), FM2 (adds self- and cross-attention), FM3 (multi-scale Gaussian-filtered images), and FM4 (weighted decision-level fusion); run ablations on the fusion level; and evaluate two grouping strategies, band grouping (BG) and label merging (LM). Result: FM1 + BG + LM gives the highest overall accuracy, 76.6%; FM1 consistently outperforms simple fusion methods; the strategies improve prediction accuracy for underrepresented classes. Conclusion: The choice of fusion strategy is closely tied to inherent data characteristics such as band correlation and class imbalance; baseline hybrid fusion with sensible grouping achieves strong performance without complex attention or filtering modules; the findings offer practical guidance for designing multimodal remote sensing classifiers. Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%.
Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion
[85] Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion
Xuan Xu,Prateek Prasanna
Main category: cs.CV
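TL;DR: This paper proposes Dual-LoRA Controllable Diffusion, a unified diffusion framework that uses multi-class nuclei centroids as lightweight, annotation-efficient spatial priors. Two task-specific LoRA adapters let a single model perform both local structure completion and global structure synthesis, improving the structural fidelity and realism of histopathology image restoration and generation.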
Details
Motivation: Existing methods treat histopathology image restoration and generation as separate tasks and rely on weak or inconsistent structural priors, which limits how realistically cellular organization can be modeled. Method: Propose the Dual-LoRA Controllable Diffusion framework: multi-class nuclei centroids serve as the structural guidance prior; a shared diffusion backbone is paired with two task-specific LoRA adapters that handle local completion and global synthesis respectively; no separate models need to be trained. Result: For local completion, LPIPS drops from 0.1797 to 0.1524; for global synthesis, FID drops from 225.15 to 76.04; structural fidelity and morphology consistency exceed GAN and diffusion baselines. Conclusion: The method unifies restoration and generation in one model, strengthens structural guidance and generalization, and supports scalable pan-cancer histopathology modeling. Abstract: Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.
[86] Mask-aware inference with State-Space Models
Ignasi Mas,Ramon Morros,Javier-Ruiz Hidalgo,Ivan Huerta
Main category: cs.CV
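TL;DR: This paper proposes Partial Vision Mamba (PVM), which brings the idea of Partial Convolution to State Space Models such as Mamba so that they can handle arbitrarily shaped missing or invalid data. Its effectiveness and generalizability are shown on depth completion, image inpainting, and classification with invalid data.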
Details
Motivation: State Space Models (SSMs) such as Mamba have no built-in mechanism for arbitrarily shaped missing or invalid data, which real-world vision tasks such as depth completion routinely face; Partial Convolutions solved this for CNNs but have not been extended to SSM architectures. Method: Propose the Partial Vision Mamba (PVM) component, which adapts mask-aware re-normalization to the Mamba structure, and define a set of rules for designing architectures with PVM. Result: PVM's effectiveness and generalizability are demonstrated on depth completion, image inpainting, and image classification with invalid data, improving how SSMs model irregularly missing data. Conclusion: PVM carries the partial-operation paradigm over to State Space Models and offers a general, efficient way for SSM-style models to handle irregularly missing data. Abstract: Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
[87] PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
Rohan Mahadev,Joyce Yuan,Patrick Poirson,David Xue,Hao-Yu Wu,Dmitry Kislyuk
Main category: cs.CV
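TL;DR: This paper presents PinPoint, a comprehensive real-world benchmark for Composed Image Retrieval (CIR) with multiple correct answers, hard negatives, query paraphrases, multi-image composition, and demographic metadata. Using it, the authors expose three weaknesses of current CIR methods (false positives, robustness, and multi-image reasoning) and propose a training-free MLLM reranking method to improve performance.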
Details
Motivation: Existing CIR benchmarks support only a single ground-truth answer and lack the annotations needed to evaluate false-positive avoidance, robustness, and multi-image reasoning, so model capabilities cannot be assessed comprehensively. Method: Build the PinPoint benchmark (7,635 queries, 329K relevance judgments, 23 query categories) and systematically evaluate more than 20 methods on it; propose a training-free reranking method based on an off-the-shelf MLLM. Result: The best current methods still retrieve hard negatives 9% of the time, vary by 25.1% across paraphrases, and perform 40-70% worse on multi-image queries; the proposed reranking method can be applied to any existing system to improve its performance. Conclusion: PinPoint substantially broadens the dimensions along which CIR is evaluated, exposes key weaknesses, and provides a plug-and-play improvement, pushing CIR toward more practical, robust, and fair systems. Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods, while achieving mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
[88] SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D
Zirui Wang,Ruiping Liu,Yufan Chen,Junwei Zheng,Weijia Fan,Kunyu Peng,Di Wen,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen
Main category: cs.CV
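TL;DR: This paper proposes SGR3, a training-free framework for 3D scene graph generation. It uses multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) to bypass conventional 3D reconstruction, strengthens relational reasoning with semantically aligned retrieval, and adds a weighted patch-level similarity selection mechanism for robustness.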
Details
Motivation: Existing 3D scene graph generation methods depend on multi-modal data and heuristic graph construction, which limits the flexibility and generalization of relationship-triplet prediction, and they require explicit 3D reconstruction, which limits practicality. Method: Propose the SGR3 Model, a training-free framework built on MLLMs and RAG; retrieve semantically aligned scene graphs with a ColPali-style cross-modal framework; add a weighted patch-level similarity selection mechanism that suppresses blurry or semantically uninformative regions. Result: SGR3 is competitive with training-free baselines and on par with GNN-based expert models; ablations show that retrieved external knowledge is explicitly integrated into token generation rather than implicitly internalized through abstraction. Conclusion: SGR3 shows that semantic scene graph generation is feasible without training or explicit reconstruction and highlights the potential of retrieval augmentation and multi-modal large models for 3D understanding. Abstract: 3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models.
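One way to picture weighted patch-level similarity is a late-interaction (MaxSim-style) score in which each query patch is matched to its best database patch and the matches are combined with per-patch weights. The sketch below is an assumption about the general idea, not the SGR3 implementation: the weighting scheme, the function name `weighted_maxsim`, and the toy vectors are all hypothetical.

```python
import numpy as np

def weighted_maxsim(query_patches, doc_patches, weights) -> float:
    """Weighted late-interaction score between two sets of patch embeddings.

    Each query patch is matched to its most similar document patch (cosine
    similarity); the per-patch maxima are averaged with normalized weights so
    that uninformative patches (low weight) contribute little.
    """
    q = np.asarray(query_patches, dtype=float)
    d = np.asarray(doc_patches, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    best_match = (q @ d.T).max(axis=1)      # best database patch per query patch
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                         # normalize weights to sum to 1
    return float(w @ best_match)

# Toy example: the second query patch has no counterpart in the database entry.
query = [[1.0, 0.0], [0.0, 1.0]]
entry = [[1.0, 0.0]]
uniform_score = weighted_maxsim(query, entry, weights=[1.0, 1.0])
downweighted_score = weighted_maxsim(query, entry, weights=[1.0, 0.0])
print(uniform_score, downweighted_score)
```

Down-weighting the unmatched patch raises the score, which mirrors the stated goal of limiting the influence of blurry or uninformative regions on retrieval.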
Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.
[89] Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI
Prathamesh Pradeep Khole,Mario M. Brenes,Zahra Kais Petiwala,Ehsan Mirafzali,Utkarsh Gupta,Jing-Rebecca Li,Andrada Ianus,Razvan Marinescu
Main category: cs.CV
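TL;DR: This paper proposes Spinverse, which inverts dMRI signals through a differentiable Bloch-Torrey simulator to recover tissue microstructure interfaces. Face permeabilities are learnable parameters, so barrier boundaries emerge automatically; geometric priors and a multi-sequence optimization strategy improve the robustness and accuracy of the reconstruction.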
Details
Motivation: Most existing dMRI reconstruction methods either assume impermeable boundaries or estimate only voxel-level parameters, so they cannot explicitly recover microstructural interfaces of unknown topology. Method: Spinverse models tissue on a fixed tetrahedral mesh and treats the permeability of each interior face as a learnable parameter; it optimizes the permeabilities by backpropagating a signal-matching loss through a differentiable Bloch-Torrey PDE forward simulator; mesh-based geometric priors and a staged multi-sequence optimization curriculum mitigate ill-posedness and local minima. Result: On a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries; sequence scheduling and regularization prove critical for avoiding outline-only solutions and for improving boundary accuracy and structural validity. Conclusion: Spinverse achieves permeability-aware dMRI reconstruction without a preset interface topology and offers a new paradigm for high-fidelity microstructure imaging. Abstract: Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.
[90] sFRC for assessing hallucinations in medical image restoration
Prabhat Kc,Rongping Zeng,Nirmal Soni,Aldo Badano
Main category: cs.CV
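TL;DR: This paper proposes sFRC, a method based on small-patch Fourier Ring Correlation, for detecting hallucinations in deep learning medical image restoration, and validates it on CT super-resolution, sparse-view CT, and undersampled MRI restoration.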
Details
Motivation: Deep learning medical image restoration can look visually appealing yet is prone to hallucinations, and easy-to-use, robust techniques and metrics for detecting them are lacking. Method: Propose scanning Fourier Ring Correlation (sFRC): run FRC analysis over small patches in a sliding window across the DL output and its reference image; the parameters can be set from hallucinated features annotated by subject matter experts or from imaging theory-based hallucination maps. Result: sFRC detects hallucinated features in the CT tasks and agrees with imaging theory-based hallucination maps in the MRI task; it quantifies the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates; it also works for a conventional regularization-based method and an unrolled method. Conclusion: sFRC is a general, interpretable, theoretically grounded hallucination detection tool that can improve the reliability and clinical trustworthiness of DL medical image restoration. Abstract: Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods.
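The patch-wise scanning idea can be sketched as follows. This is a minimal illustration and not the authors' formulation: the standard FRC definition is used per ring, but the decision rule (flag a patch when its mean FRC falls below a fixed threshold), the patch size, the stride, and the names `frc_curve` and `scan_patches` are assumptions made for this example.

```python
import numpy as np

def frc_curve(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fourier Ring Correlation between two square patches, one value per ring."""
    n = a.shape[0]
    fa = np.fft.fftshift(np.fft.fft2(a))
    fb = np.fft.fftshift(np.fft.fft2(b))
    yy, xx = np.indices(a.shape)
    ring_index = np.hypot(yy - n // 2, xx - n // 2).astype(int)
    curve = np.zeros(n // 2)
    for r in range(n // 2):
        ring = ring_index == r
        num = np.real(np.sum(fa[ring] * np.conj(fb[ring])))
        den = np.sqrt(np.sum(np.abs(fa[ring]) ** 2) * np.sum(np.abs(fb[ring]) ** 2))
        curve[r] = num / den if den > 0 else 0.0
    return curve

def scan_patches(output, reference, patch=32, stride=32, threshold=0.5):
    """Slide a window over both images and flag patches with low mean FRC.

    The mean-below-threshold rule is a simplification chosen for this sketch.
    """
    flagged = []
    for y in range(0, output.shape[0] - patch + 1, stride):
        for x in range(0, output.shape[1] - patch + 1, stride):
            curve = frc_curve(output[y:y + patch, x:x + patch],
                              reference[y:y + patch, x:x + patch])
            if curve.mean() < threshold:
                flagged.append((y, x))
    return flagged

# Synthetic demo: one patch of the "restored" image is replaced by unrelated content.
rng = np.random.default_rng(0)
reference = rng.random((64, 64))
restored = reference.copy()
restored[:32, :32] = rng.random((32, 32))
flagged = scan_patches(restored, reference)
print(flagged)
```

Patches that match the reference give an FRC near 1 in every ring, while a patch with unrelated content correlates poorly at most spatial frequencies and is flagged.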
Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
[91] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
Chenjun Li
Main category: cs.CV
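TL;DR: This paper proposes PulseFocus, which at inference time structures chain-of-thought (CoT) into interleaved plan/focus blocks with soft attention gating. It mitigates diffuse attention and positional bias in vision-language models during multi-image reasoning and consistently improves multi-image benchmark performance.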
Details
Motivation: Current reasoning vision-language models exhibit diffuse attention "pulses" and a systematic image-position bias during chain-of-thought generation, a phenomenon that had been overlooked. Method: Propose PulseFocus, a training-free, inference-time method that structures CoT into interleaved "plan" and "focus" blocks and, at decode time, uses soft attention gating to steer the model toward the image named in the plan. Result: Gains of 3.7% on BLINK and 1.07% on MuirBench support the method's effectiveness. Conclusion: Structuring the CoT process and explicitly controlling attention focus can overcome attention deficiencies in multi-image reasoning; PulseFocus offers a new approach to training-free inference-time optimization. Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).
[92] A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification
Sai Shi
Main category: cs.CV
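TL;DR: This paper systematically evaluates three neural network compression methods (pruning, quantization, and knowledge distillation) for hyperspectral land cover classification and shows that they reduce model size and computational cost while keeping classification accuracy competitive.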
Details
Motivation: Deep neural networks carry high computational and memory costs when deployed on resource-constrained platforms such as remote sensing devices, so effective network compression is needed. Method: Systematically evaluate three mainstream CNN compression strategies (pruning, quantization, and knowledge distillation) on two hyperspectral datasets, measuring classification accuracy, memory consumption, and inference efficiency. Result: Compressed models substantially reduce model size and computational cost while maintaining competitive classification performance; the experiments expose the trade-offs among compression ratio, efficiency, and accuracy. Conclusion: Network compression can enable efficient deployment of deep learning in remote sensing applications, especially at the edge. Abstract: Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.
[93] Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Shanle Yao,Armin Danesh Pazho,Narges Rashvand,Hamed Tabkhi
Main category: cs.CV
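TL;DR: This paper systematically evaluates the reliability of multimodal large language models (MLLMs) for video anomaly detection (VAD). In zero-shot settings the models show a pronounced conservative bias (high precision, low recall); class-specific prompts raise the F1-score on ShanghaiTech from 0.09 to 0.64, but recall remains the key bottleneck.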
Details
Motivation: MLLMs perform well on video understanding, but their practical suitability for real-world Video Anomaly Detection (VAD), an open-world surveillance task, is unclear. Method: On the ShanghaiTech and CHAD benchmarks, reformulate VAD as a binary classification task under weak temporal supervision and systematically evaluate state-of-the-art MLLMs; analyze how prompt specificity and temporal window length (1-3 s) affect the precision-recall trade-off. Result: In zero-shot settings MLLMs show a strong conservative bias: they are highly confident but heavily favor the "normal" class, giving high precision and very low recall (e.g., F1 of only 0.09 on ShanghaiTech); class-specific instructions raise F1 to 0.64, but recall is still insufficient. Conclusion: Current MLLMs have a significant performance gap for VAD in noisy environments; recall-oriented prompt engineering and model calibration are needed to support the complex video understanding and reasoning that open-world surveillance demands. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1-3 s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
[94] FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Xingyu Wang,Tao Wang
Main category: cs.CV
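TL;DR: This paper proposes FOZO, a backpropagation-free, forward-only zeroth-order optimization method for test-time adaptation (TTA) that delivers efficient, stable, and accurate adaptation on resource-constrained devices.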
Details
Motivation: Existing TTA methods have two problems: backpropagation-based methods are costly in computation and memory and modify model weights, which makes them unsuitable for low-end deployment devices, while traditional backpropagation-free methods adapt only weakly. Method: Propose Forward-Only Zeroth-Order Optimization (FOZO), which uses memory-efficient zeroth-order prompt optimization with objectives on both intermediate feature statistics and prediction entropy; introduce a dynamically decaying perturbation scale to stabilize zeroth-order gradient estimation, and prove convergence under the TTA data stream assumption. Result: In continual adaptation experiments on ImageNet-C (59.52% Top-1), ImageNet-R, and ImageNet-Sketch, FOZO outperforms the main gradient-based methods and the SOTA forward-only method FOA (58.13%); it also remains robust on INT8-quantized models. Conclusion: FOZO is a practical, efficient, stable, and well-generalizing TTA paradigm that is especially suited to deployment in resource-limited scenarios. Abstract: Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models.
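For readers unfamiliar with zeroth-order optimization, the sketch below shows a generic two-point gradient estimator that needs only forward evaluations, combined with a perturbation scale that decays over steps. It is not the FOZO algorithm: the decay schedule `eps0 / sqrt(1 + t)`, the learning rate, and the quadratic toy loss (standing in for the feature-statistics and entropy objectives) are assumptions, and `zo_gradient` and `adapt` are hypothetical names.

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps, rng):
    """Two-point zeroth-order gradient estimate along one random direction.

    Only two forward evaluations of loss_fn are needed; no backpropagation.
    """
    z = rng.normal(size=theta.shape)
    diff = loss_fn(theta + eps * z) - loss_fn(theta - eps * z)
    return (diff / (2.0 * eps)) * z

def adapt(loss_fn, theta0, steps=500, lr=0.02, eps0=0.1, seed=0):
    """Update a prompt-like parameter vector with forward passes only."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for t in range(steps):
        eps = eps0 / np.sqrt(1.0 + t)   # assumed decay schedule for the perturbation scale
        theta -= lr * zo_gradient(loss_fn, theta, eps, rng)
    return theta

# Toy objective standing in for the adaptation loss (assumption for illustration).
target = np.linspace(-1.0, 1.0, 8)
toy_loss = lambda p: float(np.sum((p - target) ** 2))

prompt0 = np.zeros(8)
prompt = adapt(toy_loss, prompt0)
print(toy_loss(prompt0), toy_loss(prompt))
```

Because only `theta` (the prompt) is updated, the model weights stay untouched, which is the property that makes this family of methods attractive on devices that cannot afford backpropagation. On non-quadratic losses a smaller perturbation scale reduces the finite-difference bias of the estimate.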
These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
[95] Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Yang Zou,Jun Ma,Zhidong Jiao,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
Main category: cs.CV
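TL;DR: This paper proposes Real-IISR, a unified autoregressive framework for real-world infrared image super-resolution that reconstructs fine thermal structures and clear backgrounds scale by scale through thermal-structural guided visual autoregression, and builds FLIR-IISR, a real-world dataset of paired LR-HR infrared images.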
Details
Motivation: Existing infrared image super-resolution methods are mostly built on simulated data or ignore the intrinsic differences between infrared and visible imaging, whereas real infrared images suffer coupled optical and sensing degradations that reduce structural sharpness and thermal fidelity at the same time. Method: Propose the Real-IISR framework, comprising a Thermal-Structural Guidance module (mitigates the mismatch between thermal radiation and structural edges), a Condition-Adaptive Codebook (dynamically modulates discrete representations based on degradation-aware thermal priors), and a Thermal Order Consistency Loss (enforces a monotonic relation between temperature and pixel intensity), with scale-by-scale autoregressive reconstruction. Result: Experiments on the self-built real-world dataset FLIR-IISR show promising performance, providing a unified foundation and benchmark for real-world infrared super-resolution. Conclusion: Real-IISR addresses the coupled degradations in real infrared images, balances structural reconstruction with thermal physical consistency, and moves infrared super-resolution from simulation toward practical use. Abstract: Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking.
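A loss that constrains relative brightness order rather than absolute values can be sketched as a pairwise ranking penalty. The version below is an assumption about the general idea and not the paper's Thermal Order Consistency Loss: the random pair sampling, the hinge form, the margin, and the name `order_consistency_loss` are all hypothetical.

```python
import numpy as np

def order_consistency_loss(pred, ref, n_pairs=1024, margin=0.0, seed=0) -> float:
    """Pairwise hinge penalty on brightness-order violations.

    For randomly sampled pixel pairs (i, j), the prediction is penalized when
    its ordering of the two pixels disagrees with the reference ordering.
    Absolute intensity values do not matter, only their relative order.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(pred, dtype=float).ravel()
    r = np.asarray(ref, dtype=float).ravel()
    i = rng.integers(0, p.size, n_pairs)
    j = rng.integers(0, p.size, n_pairs)
    order = np.sign(r[i] - r[j])                 # +1, -1, or 0 for ties
    violation = np.maximum(0.0, margin - order * (p[i] - p[j]))
    return float(np.mean(violation))

rng = np.random.default_rng(1)
reference = rng.random((16, 16))
drifted = 2.0 * reference + 5.0    # different absolute values, same order
inverted = -reference              # reversed brightness order

loss_drifted = order_consistency_loss(drifted, reference)
loss_inverted = order_consistency_loss(inverted, reference)
print(loss_drifted, loss_inverted)
```

A monotonic change in intensity, such as a global gain and offset, leaves the penalty at zero, while reversing the brightness order is penalized. This is the behavior one would want from a loss meant to tolerate thermal drift.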
The dataset and code are available at: https://github.com/JZD151/Real-IISR.
[96] Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary
Alexandru Florea,Shansong Wang,Mingzhe Hu,Qiang Li,Zach Eidex,Luke del Balzo,Mojtaba Safari,Xiaofeng Yang
Main category: cs.CV
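TL;DR: This paper presents the first cross-model comparison of the GPT-5 family on clinical tasks spanning medical examinations, text-based reasoning, and visual question answering over several medical imaging domains. GPT-5 clearly surpasses GPT-4o on text reasoning and some multimodal tasks (e.g., mammography), but its performance is only moderate in neuroradiology and it still trails domain-specific models on perception-critical tasks.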
Details
Motivation: To examine whether general-purpose foundation models such as the GPT-5 family can support the integrated reasoning that clinical medicine requires, in particular the synthesis of ambiguous patient histories, laboratory data, and multimodal imaging. Method: Using a standardized zero-shot chain-of-thought protocol, run a controlled cross-sectional evaluation of GPT-5, GPT-5 Mini, GPT-5 Nano, and GPT-4o on medical education examinations, text-based reasoning benchmarks, and visual question answering in neuroradiology, digital pathology, and mammography. Result: GPT-5 improves by more than 25 percentage points on MedXpertQA text reasoning; it leads GPT-4o by 10-40% on mammography VQA, but reaches only 44% accuracy in neuroradiology, and on mammography tasks (52-64%) it remains well below domain-specific models (>80%). Conclusion: GPT-5 is a meaningful step toward integrated clinical reasoning and can mimic how clinicians use objective imaging evidence to correct an uncertain history, but it cannot yet replace purpose-built systems in highly specialized, perception-critical tasks. Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%.
These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.
[97] Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition
Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu
Main category: cs.CV
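TL;DR: This paper proposes a Global Anti-Monotonic Differential Selection Strategy (GAMDSS) that improves spatio-temporal modeling of micro-expressions through dynamic keyframe re-selection, reduces subjective human error in cross-cultural annotation, and integrates seamlessly into existing models.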
Details
Motivation: Manual micro-expression annotation is inaccurate in cross-cultural settings, with especially large deviations in keyframe labels, which calls for more robust automatic annotation and modeling. Method: The GAMDSS architecture comprises dynamic Onset/Apex frame identification, Offset frame inference, construction of a spatio-temporal dynamic representation, and a two-branch feature extraction network with shared parameters. Result: Validated on seven mainstream micro-expression datasets, GAMDSS markedly reduces subjective annotation error on cross-cultural datasets such as SAMM and 4DME; quantitative analysis confirms that Offset-frame annotations carry higher uncertainty, providing a theoretical basis for annotation standardization. Conclusion: GAMDSS improves micro-expression recognition performance and exposes the limited validity and generalizability of current annotation paradigms, motivating a rethink of micro-expression annotation standards. Abstract: Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance.
The source code is available on GitHub at https://github.com/Cross-Innovation-Lab/GAMDSS.
[98] DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction
Shiyu Zhang,Zhicong Wu,Huangxuan Zhao,Zhentao Liu,Lei Chen,Yong Luo,Lefei Zhang,Zhiming Cui,Ziwen Ke,Bo Du
Main category: cs.CV
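TL;DR: This paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. A multi-fidelity texture learning module and a radiative sub-pixel densification strategy raise the resolution and detail fidelity of 4D vessel reconstruction without introducing severe artifacts.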
Details
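Motivation: Existing 3D vessel reconstruction methods based on Gaussian splatting and dynamic neural representations are limited by low-resolution input projections. Naive upsampling causes severe blurring and aliasing and cannot recover fine vascular structure, which restricts use in precise clinical diagnosis and treatment. Method: The DSA-SRGS framework has two parts: 1) a multi-fidelity texture learning module that incorporates priors from a fine-tuned DSA-specific super-resolution model and uses a confidence-aware strategy to weight supervision between the original low-resolution projections and the generated high-resolution pseudo-labels; 2) a radiative sub-pixel densification strategy that uses gradient accumulation from high-resolution sub-pixel sampling to adaptively refine the 4D radiative Gaussian kernels. Result: On two clinical DSA datasets, DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics (such as PSNR and SSIM) and qualitative visual quality (detail sharpness, branch completeness). Conclusion: DSA-SRGS is the first method to achieve high-quality super-resolution 4D reconstruction from dynamic sparse-view DSA data. It improves reconstruction accuracy for small vessels and complex branching structures and offers a new tool for precise diagnosis and treatment of cerebrovascular disease. Abstract: Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in Gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels.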
Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.
[99] MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement
Linda Wei,Chang Liu,Wenran Zhang,Yuxuan Hu,Ruiyang Li,Feng Qi,Changyao Tian,Ke Wang,Yuanyuan Wang,Shaoting Zhang,Dimitris Metaxas,Hongsheng Li
Main category: cs.CV
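TL;DR: This paper proposes MADCrowner, a dental crown mesh generation framework made up of CrownDeformR (a template deformation module driven by anatomical context) and CrownSegger (a cervical margin segmentation network), to improve the geometric accuracy and clinical feasibility of automated crown design.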
Details
Motivation: Existing learning-based crown generation methods suffer from inadequate spatial resolution, noisy outputs, and overextended surface reconstruction, and clinical practice still requires extensive manual adjustment. Method: MADCrowner is a margin-aware mesh generation framework with two core modules: 1) CrownDeformR, which uses a multi-scale intraoral scan encoder to extract anatomical context and drive deformation of an initial template; 2) CrownSegger, which segments the cervical margin of the tooth and supplies it as a deformation constraint and as the boundary condition for postprocessing. A large-scale intraoral scan dataset is built for validation. Result: The method significantly outperforms existing approaches in both geometric accuracy and clinical feasibility, and mitigates noise, low resolution, and surface overextension. Conclusion: By introducing a margin prior and tailored postprocessing, the framework fits the clinical workflow more closely and delivers accurate, deployable automated crown design. Abstract: Dental crown restoration is one of the most common treatment modalities for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the clinical manual workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced CrownSegger, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint. The margin was also used as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.
[100] Privacy-Aware Camera 2.0 Technical Report
Huan Song,Shuyu Tian,Ting Long,Jiang Liu,Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li
Main category: cs.CV
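TL;DR: This paper proposes a privacy-preserving perception framework based on the AI Flow paradigm and an edge-cloud collaborative architecture. Nonlinear mapping and random noise injection at the edge turn raw images into irreversible abstract feature vectors, and a "dynamic contour" visual language in the cloud supports behavior recognition and semantic reconstruction, preserving perception while protecting privacy.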
Details
Motivation: Existing privacy-preserving methods (physical desensitization, encryption, obfuscation) often damage semantic understanding or lack mathematically provable irreversibility. Privacy Camera 1.0 eliminated raw images but output only textual judgments, leaving no evidence in disputes. Method: Following the Information Bottleneck principle, a visual desensitizer deployed at the edge applies real-time nonlinear mapping and random noise injection to raw images, producing abstract feature vectors that cannot be reconstructed. The cloud uses a "dynamic contour" visual language for behavior recognition and semantic reconstruction. Result: Raw images become mathematically unreconstructable while the system still supports high-fidelity behavior recognition and an interpretable visual reference, reconciling privacy with perception. Conclusion: In highly sensitive settings such as restrooms and locker rooms, the framework combines strict privacy guarantees with effective semantic perception, offering a verifiable and deployable paradigm for intelligent vision systems. Abstract: With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.
[101] RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery
Huiran Sun
Main category: cs.CV
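TL;DR: This paper proposes RMK RetinaNet, which combines a multi-scale kernel block, a multi-directional contextual anchor attention mechanism, a bottom-up path, and an Euler angle encoding module to address three bottlenecks of rotated object detection in remote sensing imagery: non-adaptive receptive fields, inadequate long-range multi-scale feature fusion, and discontinuous angle regression.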
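The angle-regression discontinuity named in this entry can be illustrated with a generic periodic encoding. The sketch below is not the paper's Euler Angle Encoding Module, whose design is not described in this digest; it assumes a box orientation with 180-degree periodicity and uses the common (cos 2θ, sin 2θ) encoding. The function names are hypothetical.

```python
import math

def encode_angle(theta_deg):
    """Map an orientation (period 180 degrees) to a point on the unit circle."""
    t = math.radians(2.0 * theta_deg)  # doubling makes theta and theta + 180 coincide
    return (math.cos(t), math.sin(t))

def decode_angle(vec):
    """Recover the orientation in (-90, 90] degrees from its encoding."""
    c, s = vec
    return math.degrees(math.atan2(s, c)) / 2.0

def encoding_distance(a_deg, b_deg):
    """Euclidean distance between two encoded orientations."""
    (c1, s1), (c2, s2) = encode_angle(a_deg), encode_angle(b_deg)
    return math.hypot(c1 - c2, s1 - s2)

# 89 and -89 degrees describe almost the same box orientation,
# yet the raw angles differ by 178 degrees.
raw_gap = abs(89.0 - (-89.0))
encoded_gap = encoding_distance(89.0, -89.0)
print(raw_gap, encoded_gap)
```

Regressing the two encoded components instead of the raw angle keeps the target continuous across the ±90 degree boundary, which is the property any continuous angle encoding aims for.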
Details
Motivation: Rotated object detection in remote sensing imagery faces three main bottlenecks: non-adaptive use of receptive fields, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. Method: RMK RetinaNet includes: 1) a Multi-Scale Kernel (MSK) block for adaptive multi-scale feature extraction; 2) a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism for context modeling across scales and orientations; 3) a bottom-up path that preserves fine-grained spatial detail; 4) an Euler Angle Encoding Module (EAEM) for continuous, stable angle regression. Result: On DOTA-v1.0, HRSC2016, and UCAS-AOD, RMK RetinaNet performs comparably to state-of-the-art methods and is more robust in multi-scale and multi-orientation scenes. Conclusion: RMK RetinaNet mitigates key challenges in rotated object detection for remote sensing and improves generalization and stability under complex scale and orientation changes. Abstract: Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
[102] LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation
Anugunj Naman,Ayushman Singh,Gaibo Zhang,Yaguang Zhang
Main category: cs.CV
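TL;DR: This paper proposes a pair of adapters for spatial imbalance in medical image analysis: LAW, which applies adaptive loss weighting during diffusion model training, and ORDER, which targets efficient segmentation. On polyp and kidney tumor datasets they significantly improve generation quality and segmentation accuracy.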
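The figures reported for this entry mix a relative gain (FID) with absolute gains (Dice), which a short calculation makes explicit. The sketch below uses only the numbers quoted in the abstract; the helper names are hypothetical, and the Dice function is the standard overlap definition rather than the authors' evaluation code.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent for a lower-is-better metric such as FID."""
    return 100.0 * (baseline - improved) / baseline

def absolute_gain(baseline, improved):
    """Absolute gain in percentage points for a higher-is-better metric."""
    return improved - baseline

def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total

fid_gain = relative_reduction(65.60, 52.28)  # LAW vs. uniform baseline
seg_gain = absolute_gain(78.3, 83.2)         # downstream segmentation Dice
order_gain = absolute_gain(75.3, 81.3)       # ORDER on MK-UNet
print(f"FID: {fid_gain:.1f}% relative; Dice: +{seg_gain:.1f} and +{order_gain:.1f} points")
```

Read this way, the "20%" is a relative drop in FID, while the 4.9% and 6.0% Dice improvements are differences in percentage points.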
Details
Motivation: Lesions occupy a small fraction of a medical image, so diffusion models tend to drift from the prescribed lesion layout and segmentation models perform poorly in spatially uncertain regions. Adaptive spatial weighting is needed to allocate resources where they matter. Method: Two network adapters are proposed: 1) the Learnable Adaptive Weighter (LAW), which predicts per-pixel loss modulation from features and masks and stabilizes training through normalization, clamping, and regularization; 2) Optimal Region Detection with Efficient Resolution (ORDER), which applies selective bidirectional skip attention at late decoder stages. Result: LAW improves FID by 20% (52.28 vs. 65.60), and the synthetic data raises downstream segmentation Dice by 4.9% (83.2% vs. 78.3%). ORDER raises Dice on MK-UNet by 6.0% (81.3% vs. 75.3%) with only 0.56 GFLOPs and 42K parameters, 730x smaller than nnUNet. Conclusion: Together, LAW and ORDER address spatial imbalance in medical image generation and segmentation, with clear gains in generation quality, segmentation accuracy, and model efficiency. Abstract: Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
[103] Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper
Kiranmayee Janardhan,Vinay Martin DSa Prabhu,T. Christy Bobby
Main category: cs.CV
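TL;DR: This review surveys segmentation and classification techniques for brain gliomas. It stresses that convolutional neural networks (CNNs) outperform traditional methods in post-acquisition MRI processing, and that radiologists tend to prefer semi-automatic techniques.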
Details
Motivation: Accurate segmentation and classification of brain gliomas are essential for treatment planning, prognosis prediction, and disease monitoring, but irregular tissue makes segmentation error-prone and hard to reproduce. Method: The review surveys existing fully automatic and semi-automatic segmentation and classification methods, focusing on the performance of CNN-based deep learning architectures in post-acquisition MRI processing. Result: CNN architectures clearly outperform traditional methods on glioma segmentation and classification. Semi-automatic techniques are more readily accepted by clinical radiologists because they balance accuracy with ease of use. Conclusion: Clinical translation of CNN-driven semi-automatic segmentation and classification tools should be encouraged to improve diagnostic accuracy, individualized treatment, and patient outcomes. Abstract: Segmentation is crucial for brain gliomas as it delineates the glioma's extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma's size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error-free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques applied after magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.
[104] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Lulu Hu,Wenhu Xiao,Xin Chen,Xinhua Xu,Bowen Xu,Kun Li,Yongliang Tao
Main category: cs.CV
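TL;DR: This paper proposes the MASQuant framework, which uses Modality-Aware Smoothing (MAS) and Cross-Modal Compensation (CMC) to resolve smoothing misalignment and cross-modal computational invariance in post-training quantization of multimodal large language models (MLLMs), yielding stable and efficient quantization.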
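The smoothing step that MASQuant builds on can be sketched in a few lines. The example below uses the per-channel smoothing factor from SmoothQuant, s_j = max|X_j|^α / max|W_j|^(1-α), and computes it separately for two toy "modalities". It is an illustration under stated assumptions only: MASQuant learns its modality-specific factors, the cross-modal compensation step is not shown, and all names and numbers here are hypothetical.

```python
def smoothing_factors(X, W, alpha=0.5):
    """SmoothQuant-style factor per input channel j.

    X: activations, shape (tokens, channels); W: weights, shape (channels, out).
    """
    n_ch = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_ch)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_ch)]
    return [act_max[j] ** alpha / w_max[j] ** (1.0 - alpha) for j in range(n_ch)]

def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth(X, W, s):
    """Divide activations by s and scale weight rows by s, so X @ W is unchanged."""
    Xs = [[v / s[j] for j, v in enumerate(row)] for row in X]
    Ws = [[v * s[j] for v in W[j]] for j in range(len(W))]
    return Xs, Ws

# Toy data: "vision" tokens have an outlier in channel 0, "text" tokens in channel 1.
W = [[0.5, -0.2], [0.1, 0.4]]
X_vision = [[8.0, 0.5], [6.0, -0.3]]
X_text = [[0.4, 5.0], [-0.2, 7.0]]

s_vision = smoothing_factors(X_vision, W)
s_text = smoothing_factors(X_text, W)
print(s_vision, s_text)  # the two modalities call for different factors
```

Because the weights can absorb only one set of factors, a single shared `s` cannot suit both activation distributions, which is one way to picture the smoothing misalignment the entry describes.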
Details
Motivation: Post-training quantization methods designed for large language models, such as SmoothQuant, face two challenges when transferred to multimodal large language models (MLLMs): smoothing misalignment and cross-modal computational invariance. A quantization scheme adapted to multimodal characteristics is needed. Method: Modality-Aware Smoothing Quantization (MASQuant) has two parts: (1) Modality-Aware Smoothing (MAS), which learns separate smoothing factors for each modality; (2) Cross-Modal Compensation (CMC), which uses SVD whitening to turn multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. Result: MASQuant shows stable quantization performance on both dual-modal and tri-modal MLLMs and is competitive with mainstream PTQ algorithms. Conclusion: Through modality decoupling and cross-modal joint modeling, MASQuant improves the robustness and generality of post-training quantization for MLLMs and offers a new route to efficient deployment of multimodal models. Abstract: Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
[105] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Ruochen Cui,Xilin Zhao,Qingming Huang
Main category: cs.CV
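TL;DR: This paper proposes Diffusion Contrastive Reconstruction (DCR), which injects contrastive signals derived from reconstructed images into the diffusion reconstruction process to jointly optimize the discriminative and detail-perception abilities of the CLIP visual encoder, overcoming its representation bottleneck.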
Details
Motivation: The CLIP visual encoder is limited in both Discriminative Ability (D-Ability) and Detail Perceptual Ability (P-Ability). Existing diffusion-based enhancement methods may harm discriminative ability and do not resolve the representation bottleneck at its root. Method: The DCR framework generates contrastive signals from the reconstructed image at each step of diffusion reconstruction rather than from the original input, unifying the optimization objective to ease gradient conflict. Theoretical analysis shows that the loss jointly optimizes D-Ability and P-Ability. Result: DCR is validated on multiple benchmarks and multi-modal large language models, with significant gains in downstream performance. Conclusion: Embedding contrastive learning signals in the diffusion reconstruction process and applying them to reconstructed images is a more comprehensive and effective way to strengthen CLIP visual representations. Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.
[106] Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation
SangHyuk Kim,Daniel Haehn,Sumientra Rampersad
Main category: cs.CV
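TL;DR: This paper proposes the Meta-D architecture, which explicitly uses MRI scan metadata (such as sequence type and plane orientation) to guide feature extraction in brain tumor analysis, significantly improving 2D detection and 3D missing-modality segmentation.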
Details
Motivation: To improve medical image deep learning pipelines by introducing explicit metadata that stabilizes feature representations and provides a robust anchor when data are missing. Method: In 2D detection, Meta-D dynamically modulates convolutional features with metadata. For 3D missing-modality segmentation, it introduces a Transformer Maximizer that uses metadata-driven cross-attention to selectively route the available modalities. Result: The F1-score for 2D tumor detection rises by up to 2.62% in absolute terms; the Dice score for 3D missing-modality segmentation rises by up to 5.12%, while model parameters fall by 24.1%. Conclusion: Explicitly integrating categorical metadata strengthens the robustness and efficiency of medical image analysis models, with the largest benefit when modalities are incomplete. Abstract: We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
[107] Revisiting Shape from Polarization in the Era of Vision Foundation Models
Chenhao Li,Taishi Ono,Takeshi Uemori,Yusuke Moriuchi
Main category: cs.CV
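TL;DR: This paper proposes a method that uses polarization cues to improve single-shot surface normal estimation. With a high-quality polarization dataset and sensor-aware data augmentation, it significantly outperforms existing SfP methods and RGB-only vision foundation models using only 40K training scenes.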
Details
Motivation: Polarization cues have a physical advantage, yet prior SfP methods underperform RGB-only VFMs because of domain gaps such as unrealistic synthetic data and poorly modeled sensor noise. The paper sets out to confirm the value of the polarization modality itself and to close these gaps. Method: The authors build a high-quality synthetic polarization dataset from 1,954 real 3D-scanned objects, incorporate DINOv3 priors to improve generalization, and design polarization sensor-aware data augmentation that simulates real noise. Result: With only 40K training scenes, the method significantly outperforms state-of-the-art SfP methods and RGB-only VFMs. Polarization cues allow a 33x reduction in training data or an 8x reduction in model parameters while still performing better. Conclusion: The polarization modality has strong potential, and the earlier performance gap comes from domain bias rather than from the modality. With high-quality data and realistic modeling, a lightweight model trained on little data can surpass large-scale RGB-only foundation models. Abstract: We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs.
Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
[108] Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Rui Zhao,Bin Shi,Kai Sun,Bo Dong
Main category: cs.CV
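TL;DR: This paper proposes CAD, a class-specific augmentation based disentanglement framework for instance-dependent partial label learning (ID-PLL). It eases instance entanglement through intra-class alignment of feature augmentations and an inter-class weighted penalty, improving classification performance.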
Details
Motivation: Real-world partial labels are often influenced by instance features (the ID-PLL setting), and instance entanglement, where instances of similar classes overlap in both features and candidate labels, causes severe class confusion that calls for a disentanglement method. Method: The Class-specific Augmentation based Disentanglement (CAD) framework applies: 1) intra-class regulation, which generates class-specific augmentations and aligns them; 2) inter-class regulation, a weighted penalty loss that constrains more ambiguous candidate labels more strongly and enlarges inter-class distances. Result: Extensive experiments show that CAD eases instance entanglement and significantly raises classification accuracy on ID-PLL tasks. Conclusion: By combining intra-class and inter-class regulation, CAD sharpens class boundaries and provides an effective disentangled learning approach for ID-PLL. Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.
[109] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler
Main category: cs.CV
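TL;DR: This paper proposes a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that uses progressive, semantically guided perturbation to improve the cross-model and cross-task transferability of adversarial examples against vision-language pre-training models.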
Details
Motivation: Existing adversarial attacks on vision-language models mostly rely on static cross-modal interactions and disrupt only positive image-text pairs, which limits cross-modal disruption and transferability. Method: SADCA 1) builds a dynamic contrastive learning mechanism over adversarial, positive, and negative samples that progressively breaks cross-modal alignment; 2) adds a semantic augmentation module that uses input transformations to increase the diversity and generalization of adversarial examples. Result: Experiments on multiple datasets and models show that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. Conclusion: Through dynamic contrastive learning and semantic augmentation, SADCA strengthens the cross-modal disruption and transferability of adversarial attacks and offers a new direction for research on VLP model security. Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It does so by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples, which reinforces the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
[110] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler
Main category: cs.CV
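TL;DR: This paper proposes MPCAttack, a multi-paradigm collaborative attack framework that aggregates visual and language semantic representations and applies a multi-paradigm collaborative optimization strategy, significantly improving the transferability of adversarial examples against multi-modal large language models (MLLMs).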
Details
Motivation: Existing adversarial attacks on MLLMs rely on surrogate models from a single learning paradigm and optimize independently in each feature space, which leads to impoverished feature representations, a restricted search space, and limited diversity of adversarial perturbations. Method: The MPCAttack framework centers on a Multi-Paradigm Collaborative Optimisation (MPCO) strategy: it aggregates semantic representations of images and text, performs contrastive matching across paradigms, adaptively balances the importance of each paradigm's representation, and guides global perturbation optimization to reduce representation bias. Result: Experiments on multiple benchmarks show that MPCAttack consistently outperforms state-of-the-art methods in targeted and untargeted attacks on both open-source and closed-source MLLMs. Conclusion: By fusing multi-paradigm features with collaborative optimization, MPCAttack improves the transferability of adversarial examples and offers a new direction for safety evaluation of MLLMs. Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.
[111] GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction
Tianyu Xiong,Rui Li,Linjie Li,Jiaqi Yang
Main category: cs.CV
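TL;DR: GloSplat proposes a joint pose-appearance optimization framework that keeps SfM feature tracks as explicit optimizable parameters during 3D Gaussian Splatting training and combines a reprojection loss with photometric supervision to prevent pose drift and enable fine refinement. Its two variants, the COLMAP-free GloSplat-F and the exhaustive-matching GloSplat-A, both reach state-of-the-art performance.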
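The joint objective described in this entry pairs a photometric term with a reprojection term on tracked 3D points. A minimal sketch, assuming a pinhole camera and points already expressed in camera coordinates; the weight `lam`, the function names, and the toy values are hypothetical and do not come from the paper.

```python
def project(point_cam, focal, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    return (focal * x / z + cx, focal * y / z + cy)

def reprojection_loss(points_cam, observations, focal, cx, cy):
    """Mean squared pixel distance between projected track points and their 2D observations."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points_cam, observations):
        u, v = project(p, focal, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(points_cam)

def joint_loss(photometric, reprojection, lam=0.1):
    """Weighted sum of the photometric and geometric terms (lam is an assumed weight)."""
    return photometric + lam * reprojection

# Two tracked points; the second observation is one pixel off its projection.
track_points = [(0.0, 0.0, 2.0), (0.5, -0.25, 2.5)]
observed = [(320.0, 240.0), (421.0, 190.0)]
geo = reprojection_loss(track_points, observed, focal=500.0, cx=320.0, cy=240.0)
print(joint_loss(photometric=0.05, reprojection=geo))
```

The geometric term stays informative even when rendered images are still blurry early in training, which is the intuition behind using feature tracks as anchors against pose drift.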
Details
Motivation: Traditional pipelines treat feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) as separate problems. Existing joint optimization methods refine poses from photometric gradients alone, which invites early pose drift and lacks geometric constraints. Method: GloSplat introduces explicit, optimizable 3D feature-track points as independent parameters decoupled from the Gaussian primitives, and jointly optimizes pose and appearance with a reprojection loss (geometric supervision) and a photometric loss. Two variants are provided: GloSplat-F (retrieval-based pairing, COLMAP-free) and GloSplat-A (exhaustive matching, highest quality). Result: GloSplat-F is state-of-the-art among COLMAP-free methods, and GloSplat-A surpasses all COLMAP-based baselines. The explicit geometric anchors suppress early pose drift and support fine-grained pose refinement. Conclusion: Explicitly modeling and optimizing SfM feature tracks is key to more robust and accurate joint optimization. GloSplat confirms that geometric priors and photometric supervision work well together and offers a new approach to fusing neural rendering with SfM. Abstract: Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
[112] Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video
Jerrin Bright,Justin Mende,John Zelek
Main category: cs.CV
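TL;DR: This paper proposes a biomechanical analysis pipeline for monocular broadcast video that extracts 18 clinically relevant pitching metrics from ordinary footage with high accuracy (MAE < 1° on 16 of 18 metrics) and uses them to predict Tommy John surgery (AUC = 0.811) and significant arm injuries (AUC = 0.825), enabling large-scale injury-risk screening without professional motion capture equipment.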
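The two kinds of numbers quoted for this entry, AUC and mean absolute error, can be made concrete with a short sketch. AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels, scores, and angles below are made up for illustration and are not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_absolute_error(pred, ref):
    """Average absolute difference, e.g. between video-derived and reference joint angles."""
    return sum(abs(a - b) for a, b in zip(pred, ref)) / len(pred)

# Toy screening scores: 1 = injured, 0 = not injured.
labels = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, risk))

# Toy joint angles in degrees: video estimate vs. reference measurement.
print(mean_absolute_error([10.0, 20.5], [10.5, 20.0]))
```

An AUC near 0.8 therefore means that roughly four out of five injured/uninjured pairs are ordered correctly by the risk score.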
Details
Motivation: Professional multi-camera motion capture systems are expensive and confined to professional venues, which limits the reach of pitching injury prediction at lower levels of the sport. A low-cost, scalable alternative is needed. Method: Built on DreamPose3D, the monocular video pipeline 1) adds a drift-controlled global lifting module that recovers the pelvis trajectory through velocity-based parameterization and sliding-window inference; 2) adds a kinematics refinement stage robust to motion blur and compression, combining bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to keep pose sequences temporally stable and physically plausible. Result: On 156 pitches from 13 professional pitchers, 16 of 18 biomechanical metrics have a mean absolute error below 1 degree. An automated screening model trained on these metrics reaches AUC 0.811 for Tommy John surgery and AUC 0.825 for significant arm injuries across 7,348 pitchers. Conclusion: Monocular broadcast video is a viable substitute for professional motion capture and supports large-scale, low-cost pitching injury-risk screening, bringing biomechanical analysis within reach of the grassroots game. Abstract: Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE < 1°). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
[113] SURE: Semi-dense Uncertainty-REfined Feature Matching
Sicheng Li,Zaiwang Gu,Jie Zhang,Qing Guo,Xudong Jiang,Jun Cheng
Main category: cs.CV
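TL;DR: This paper proposes SURE, a framework that models aleatoric and epistemic uncertainty to jointly predict image correspondences and their confidence, improving matching reliability in challenging scenarios such as large viewpoint changes and textureless regions.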
Details
Motivation: Existing methods rely solely on feature similarity and lack an explicit estimate of match reliability, which produces high-confidence false matches in challenging scenarios such as large viewpoint changes or textureless regions.
Method: SURE, a semi-dense uncertainty-refined matching framework, comprises a new evidential head for trustworthy coordinate regression and a lightweight spatial fusion module; it jointly predicts correspondences and their confidence while modeling aleatoric and epistemic uncertainty.
Result: On multiple standard benchmarks, SURE consistently outperforms existing state-of-the-art semi-dense matching models in both matching accuracy and efficiency.
Conclusion: By introducing uncertainty modeling, SURE markedly improves the reliability and robustness of image matching, particularly in challenging visual scenes.
Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at https://github.com/LSC-ALAN/SURE.
[114] Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Jaekyun Ko,Dongjin Kim,Soomin Lee,Guanghui Wang,Tae Hyun Kim
Main category: cs.CV
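TL;DR: This paper proposes Prompt-Driven Noise Generation (PNG), a framework that needs no camera metadata and synthesizes sRGB images matching the real noise distribution, improving the generalization of denoising models in real-world scenarios.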
Details
Motivation: Paired noisy-clean images are scarce in the real world, and existing generative methods based on camera metadata are limited by missing metadata or inconsistencies between devices.
Method: The Prompt-Driven Noise Generation (PNG) framework models real noise characteristics through high-dimensional prompt features, removing the dependence on explicit camera metadata.
Result: Experiments show that PNG generates realistic noisy images and that these images are successfully applied to real-noise removal on multiple benchmark datasets.
Conclusion: PNG improves the generality and practicality of noise synthesis and offers a new approach to real-image denoising without paired data.
Abstract: Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.
[115] Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics
Jerrin Bright,Michelle Lu,John Zelek
Main category: cs.CV
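TL;DR: This paper presents a pipeline that classifies eight pitch types using only the pitcher's monocular 3D pose sequences (no ball-flight data), combining a diffusion model, event detection, biomechanical feature extraction, and gradient-boosted classification, and reaches 80.4% accuracy on about 119,000 professional pitches; the analysis shows that the upper body (especially wrist position and trunk lateral tilt) carries most of the predictive signal, while grip-defined variants (such as four-seam vs. two-seam fastballs) cannot be separated from pose, indicating an empirical ceiling of about 80% for purely kinematic prediction.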
Details
Motivation: To investigate how much a pitcher's body motion before release reveals about the upcoming pitch, i.e., whether biomechanical pose alone can reliably classify pitch types.
Method: An end-to-end pipeline: 1) diffusion-based monocular 3D pose estimation; 2) automatic pitching-event detection; 3) ground-truth-validated biomechanical feature extraction (229 kinematic features in total); 4) a gradient-boosted classifier.
Result: 80.4% classification accuracy on 119,561 professional pitches; importance analysis shows the upper body contributes 64.9% of the predictive signal, with wrist position (14.8%) and trunk lateral tilt as the most informative joint group and single feature, respectively; four-seam and two-seam fastballs cannot be separated from pose.
Conclusion: Pitcher body kinematics alone allow fairly accurate pitch-type prediction, but with an empirical ceiling near 80%; this ceiling reflects the inherent limits of pose information, and further separation requires later information such as ball flight.
Abstract: How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, ground-truth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs. two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.
[116] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation
Hong Liu,Dong Wei,Qiong Peng,Yawen Huang,Xian Wu,Yefeng Zheng,Liansheng Wang
Main category: cs.CV
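TL;DR: This paper proposes a two-stage framework for CT report generation that improves performance through structure-wise image-text contrastive learning and a dynamic negative queue.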
Details
Motivation: Existing deep learning methods perform well on X-ray report generation but are limited in CT report generation by the larger data volumes and more intricate details, calling for more effective structured modeling.
Method: A two-stage framework: the first stage performs structure-wise contrastive learning between learnable structure-specific visual queries and textual features, with soft pseudo targets based on text-text similarity and a dynamic, diversity-enhanced negative queue; the second stage freezes the visual queries, selects the critical image patch embeddings, and adds a text decoder to generate reports.
Result: New state-of-the-art performance for CT report generation on two public datasets, with each component shown to be effective.
Conclusion: The framework establishes structure-level semantic correspondences between CT images and reports and significantly improves the efficiency and accuracy of clinical report generation.
Abstract: Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption.
Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.
[117] DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization
Xiaodong Zhu,Suting Wang,Yuanming Zheng,Junqi Yang,Yangxu Liao,Yuhong Yang,Weiping Tu,Zhongyuan Wang
Main category: cs.CV
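TL;DR: This paper proposes DeformTrace, which augments state space models (SSMs) with deformable dynamics and relay mechanisms to address ambiguous boundaries, sparse forgeries, and limited long-range modeling in temporal forgery localization for video and audio, achieving more precise, efficient, and robust temporal forgery detection.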
Details
Motivation: Existing state space models (SSMs) are limited in temporal forgery localization (TFL) by ambiguous forgery boundaries, sparse forgeries, and weak long-range dependency modeling, and struggle to deliver the precision and interpretability that security and forensics demand.
Method: DeformTrace has three core components: 1) Deformable Self-SSM (DS-SSM), which introduces dynamic receptive fields for more precise temporal localization; 2) a Relay Token Mechanism, which mitigates long-range decay and strengthens temporal reasoning; 3) Deformable Cross-SSM (DC-SSM), which partitions the global state space into query-specific subspaces, suppressing interference from non-forgery information and increasing sensitivity to sparse forgeries. The overall design is a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs.
Result: On multiple benchmark datasets, DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
Conclusion: DeformTrace overcomes key limitations of SSMs in TFL, demonstrates the value of deformable dynamics and relay mechanisms for temporal forgery localization, and offers a new paradigm for efficient, interpretable multimedia forensics.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
[118] Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation
Hong Liu,Dong Wei,Qian Dai,Xian Wu,Yefeng Zheng,Liansheng Wang
Main category: cs.CV
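TL;DR: This paper proposes FedMEPD, a framework that addresses the coexistence of intermodal heterogeneity and personalization needs in federated learning for medical imaging, using modality-specific encoders and a partially personalized multimodal fusion decoder to achieve effective global modeling and local adaptation.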
Details
Motivation: Existing federated learning methods consider only intramodal heterogeneity and cannot handle the practical case where participants hold only a subset of the imaging modalities (intermodal heterogeneity), while each participant also needs a personalized model.
Method: FedMEPD assigns an exclusive encoder to each modality to handle intermodal heterogeneity; the decoder is partially personalized, using the discrepancy between global and local parameter updates to decide dynamically which filters are personalized; the server uses a full-modal fusion decoder to optimize the encoders and distributes multimodal anchors; clients align their missing-modal representations to the global anchors via scaled dot-product cross-attention.
Result: On the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks, FedMEPD outperforms current methods for multimodal and personalized federated learning, and each new design is shown to be effective.
Conclusion: FedMEPD accommodates both intermodal heterogeneity and personalization needs, providing a feasible, efficient, and robust new paradigm for federated learning on multimodal medical images.
Abstract: Most existing federated learning (FL) methods for medical image analysis have considered only intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs, using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities.
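The anchor calibration step described above can be sketched as plain scaled dot-product cross-attention. This is only an illustrative sketch, not the FedMEPD implementation: it assumes the server anchors act as both keys and values with no learned projections, and the names `calibrate_with_anchors` and `softmax` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_with_anchors(query, anchors):
    """Attend from one client feature (the query) over the server anchors.

    Assumption: anchors serve as both keys and values, without learned
    projection matrices. Returns the calibrated feature and the weights.
    """
    dim = len(query)
    # Scaled dot product between the query and every anchor.
    scores = [
        sum(q * k for q, k in zip(query, anchor)) / math.sqrt(dim)
        for anchor in anchors
    ]
    weights = softmax(scores)
    # Weighted sum of anchors gives the calibrated representation.
    calibrated = [
        sum(w * anchor[j] for w, anchor in zip(weights, anchors))
        for j in range(dim)
    ]
    return calibrated, weights


# Toy example: two 2-D anchors and a client feature aligned with the first.
anchors = [[1.0, 0.0], [0.0, 1.0]]
calibrated, weights = calibrate_with_anchors([4.0, 0.0], anchors)
```

In this toy case the client feature is pulled toward the anchor it already resembles, which is the intended effect of calibrating a missing-modal representation toward full-modal anchors.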
FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.
[119] Locality-Attending Vision Transformer
Sina Hajimiri,Farzad Beizaee,Fereshteh Shakeri,Christian Desrosiers,Ismail Ben Ayed,Jose Dolz
Main category: cs.CV
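TL;DR: This paper proposes a simple and effective add-on that modulates self-attention with a learnable Gaussian kernel and refines patch representations, improving the segmentation performance of vision transformers while retaining their image-level classification ability.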
Details
Motivation: Vision transformers excel at classification, but their global self-attention can obscure the fine-grained spatial details that tasks such as segmentation depend on.
Method: A learnable Gaussian kernel modulates self-attention to bias it toward neighboring patches; patch representations are also refined to learn better embeddings at patch positions.
Result: Substantial segmentation gains on three benchmarks including ADE20K (e.g., over 6% and 4% for ViT Tiny and Base, respectively), without changing the training regime or sacrificing classification performance.
Conclusion: The method strengthens local modeling in vision transformers for segmentation while preserving their ability to integrate global information, making it a lightweight and general improvement.
Abstract: Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
[120] FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation
Ganggui Ding,Hao Chen,Xiaogang Xu
Main category: cs.CV
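TL;DR: This paper proposes FC-VFI, which achieves faithful and consistent video frame interpolation through temporal modeling on latent sequences, semantic matching line guidance, and a temporal difference loss; it supports 4× and 8× interpolation, raising frame rates while preserving detail and motion consistency.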
Details
Motivation: Frame interpolation methods built on large pre-trained video diffusion models are constrained by their generative priors and struggle to preserve high-fidelity detail from the start and end frames; meanwhile, motion control via dense optical flow is error-prone, and sparse points lack structural context.
Method: FC-VFI: 1) temporal modeling on the latent sequences to inherit fidelity cues from the start and end frames; 2) semantic matching lines for structure-aware motion guidance; 3) a temporal difference loss to mitigate temporal inconsistencies.
Result: High-quality, structurally intact interpolation across diverse scenarios, supporting 4× and 8× interpolation (30 FPS to 120/240 FPS) at 2560×1440 resolution, with improved visual fidelity and motion consistency.
Conclusion: FC-VFI overcomes the limits of generative priors and the shortcomings of motion modeling, improving fidelity and consistency together in high-resolution, high-factor interpolation and offering a new paradigm for temporal video enhancement.
Abstract: Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high-fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting \(4\times\) and \(8\times\) interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at \(2560\times 1440\) resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
[121] AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Li'an Zhong,Ziqiang He,Jibin Zheng,Jin Li,Z. Jane Wang,Xiangui Kang
Main category: cs.CV
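TL;DR: This paper proposes AdaIAT, a new method that adaptively increases attention to the generated text in LVLMs, alleviating hallucination while avoiding repetitive descriptions and improving model reliability.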
Details
Motivation: Current large vision-language models (LVLMs) suffer from severe hallucination; directly increasing attention to image tokens reduces hallucination but tends to induce repetitive descriptions. The authors seek a finer, adaptive attention-control mechanism that suppresses hallucination while keeping linguistic coherence.
Method: An analysis of attention patterns shows that real object tokens attend more to the generated text than hallucinated ones; based on this, the authors propose Increasing Attention to Generated Text (IAT) and then an adaptive version, AdaIAT, which uses a layer-wise threshold to control when to intervene and a fine-grained amplification magnitude for each attention head.
Result: On LLaVA-1.5, AdaIAT reduces the hallucination rates $C_S$ and $C_I$ by 35.8% and 37.1%, respectively, while preserving linguistic performance and prediction capability; experiments support its effectiveness and generality.
Conclusion: AdaIAT is an efficient, controllable, and general intervention on trained models, offering a new way to balance visual faithfulness and linguistic fluency in LVLMs.
Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
[122] Person Detection and Tracking from an Overhead Crane LiDAR
Nilusha Jayawickrama,Henrik Toikka,Risto Ojala
Main category: cs.CV
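TL;DR: This paper studies person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. To address the strong domain shift of the overhead viewpoint and the lack of suitable public data, the authors build a dedicated overhead LiDAR dataset and adapt several 3D detectors under a unified protocol; lightweight tracking maintains identities; distance-sliced evaluation quantifies the practical sensing range, with the best configurations reaching AP 0.84 within 5 m and 0.97 within 1 m. The results help bridge the domain gap between standard driving datasets and overhead sensing, and the data and code are released.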
Details
Motivation: The overhead LiDAR viewpoint differs strongly in domain from common vehicle-centric LiDAR benchmarks, and suitable public training data is scarce.
Method: Curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations; adapt several candidate 3D detectors (e.g., VoxelNeXt, SECOND) under a unified training and evaluation protocol; integrate AB3DMOT and SimpleTrack for lightweight tracking-by-detection; evaluate detection with distance slicing; measure latency to check real-time feasibility.
Result: The best adapted detector configurations reach average precision (AP) up to 0.84 within a 5.0 m horizontal radius, rising to 0.97 at 1.0 m; VoxelNeXt and SECOND are the most reliable backbones; the system is feasible for practical real-time use; the dataset and code are released.
Conclusion: The work helps bridge the domain gap between standard driving datasets and industrial overhead LiDAR sensing for person detection and tracking, providing a practical solution and open resources.
Abstract: This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data is limited. Therefore, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.
[123] Adaptive Prototype-based Interpretable Grading of Prostate Cancer
Riddhasree Bhattacharyya,Pallabi Dutta,Sushmita Mitra
Main category: cs.CV
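TL;DR: This paper proposes a prototype-based weakly supervised framework for interpretable grading of prostate cancer histopathology images; by mirroring how pathologists compare suspicious regions with clinically validated examples, it improves model trustworthiness and interpretability.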
Details
Motivation: Prostate cancer diagnosis places a heavy workload on pathologists, and grading is subjective; deep learning models perform well but are hard to interpret, and existing explanation methods give only coarse explanations that do not say why the highlighted regions matter.
Method: A prototype-based weakly supervised framework: first pre-train at patch level to learn robust prototypical features for each grade; then fine-tune under weak supervision with a new prototype-aware loss function; finally add an attention-based dynamic pruning mechanism that handles inter-sample heterogeneity and selectively emphasizes relevant prototypes.
Result: Extensive validation on the PANDA and SICAP benchmark datasets indicates that the framework can serve as a reliable assistive tool for pathologists in routine diagnosis.
Conclusion: The prototype-driven, weakly supervised, interpretable grading framework follows pathologists' actual reading logic more closely, improving trustworthiness and practicality in high-stakes clinical settings.
Abstract: With prostate cancer being one of the most frequently diagnosed malignancies in men, the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for an interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.
[124] Location-Aware Pretraining for Medical Difference Visual Question Answering
Denis Musinguzi,Caren Han,Prasenjit Mitra
Main category: cs.CV
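TL;DR: This paper proposes a pretraining framework for medical difference visual question answering (VQA) that introduces location-aware tasks (such as automatic referring expressions and grounded captioning) to help the vision encoder model subtle spatial differences, reaching state-of-the-art performance on chest X-ray difference tasks.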
Details
Motivation: Conventional single-image VQA models do not match the way radiologists compare multiple images for diagnosis; standard vision encoders struggle to distinguish subtle changes such as disease progression from acquisition differences.
Method: A pretraining framework with location-aware tasks (AREF, GCAP, CAREF) strengthens the vision encoder's learning of fine-grained spatial features; the encoder is then combined with a language model for medical difference VQA.
Result: State-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
Conclusion: Location-aware pretraining improves the vision encoder's understanding of differences between medical images and offers a new paradigm for multi-image medical VQA.
Abstract: Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
[125] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Jiaxin Fan,Wenpo Song
Main category: cs.CV
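TL;DR: This paper presents VisionPangu, a compact 1.7B-parameter multimodal model that improves fine-grained image description through efficient multimodal alignment and high-quality supervision (such as the human-authored descriptions in the DOCCI dataset), without relying on large model scale.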
Details
Motivation: Existing large multimodal models (LMMs) perform well in vision-language understanding, but many rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image descriptions.
Method: An InternVL-derived vision encoder is connected to the OpenPangu-Embedded language backbone through a lightweight MLP projector, with an instruction-tuning pipeline inspired by LLaVA; dense human-authored image descriptions from the DOCCI dataset are used for training.
Result: Experiments show that VisionPangu, while keeping a compact parameter count, achieves competitive performance on detailed image captioning and produces more structured and detailed descriptions.
Conclusion: With high-quality data and efficient alignment, compact multimodal models can reach fine-grained description ability comparable to that of large models, offering a new option for resource-constrained settings.
Abstract: Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
[126] Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression
Toby Chong,Ryota Nakajima
Main category: cs.CV
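TL;DR: This paper proposes a novel camera model for monocular 3D Morphable Model (3DMM) regression that extends orthographic projection with a shrinkage parameter to model perspective distortion in close-up facial images while keeping optimization stable; the method can finetune existing models, and its effectiveness is validated on a dataset captured with head-mounted cameras.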
Details
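Motivation: Existing regression-based 3DMM fitting methods mostly use orthographic projection to avoid the ambiguity between focal length and object distance, but this simplification prevents them from handling perspective distortion in close-up footage such as that from head-mounted cameras.
Method: Add a learnable shrinkage parameter to orthographic projection, giving a new camera model that combines a pseudo-perspective effect with stable optimization; several techniques allow lightweight finetuning of existing regression models.
Result: On a custom close-up facial dataset recorded with head-mounted cameras, the method outperforms the baseline orthographic projection in both quantitative metrics (such as reconstruction error) and qualitative visual results, and it is compatible with existing trained models.
Conclusion: The proposed pseudo-perspective camera model with a shrinkage parameter compensates for the shortcomings of orthographic projection in close-up face modeling, improving the accuracy and realism of 3DMM regression on close-up video while retaining computational stability and compatibility.
Abstract: We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
[127] BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement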
Zishu Yao,Xiang-Xiang Su,Shengning Zhou,Guang-Yong Chen,Guodong Fan,Xing Chen
Main category: cs.CV
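TL;DR: This paper proposes BiEvLight, a hierarchical, task-aware framework that jointly optimizes low-light image enhancement and event denoising; it exploits the gradient correlation between images and events to build a gradient-guided event denoising prior and casts event denoising as a bilevel optimization problem constrained by the enhancement task, significantly improving low-light image enhancement.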
Details
Motivation: Existing event-camera-based low-light image enhancement methods are limited by severe noise coupling caused by background activity noise in events and the low signal-to-noise ratio of images; precise event denoising is therefore a prerequisite for unlocking the potential of event fusion.
Method: BiEvLight: 1) use the strong gradient correlation between images and events to build a gradient-guided event denoising prior; 2) formulate event denoising as a bilevel optimization problem constrained by the image enhancement task, enabling cross-task interaction and adaptive denoising.
Result: On the real-world noise dataset SDE, average gains over SOTA methods of 1.30 dB in PSNR, 2.03 dB in PSNR*, and 0.047 in SSIM.
Conclusion: Event denoising should not be a static pre-processing step but should be optimized jointly with the enhancement task; BiEvLight's task-driven bilevel optimization alleviates noise coupling and significantly improves low-light image enhancement.
Abstract: Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage (which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective), we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality.
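For readers unfamiliar with the term, a bilevel problem of the kind described above can be written generically as follows. This is a textbook-style sketch rather than the paper's actual objective; every symbol is an assumption introduced here for illustration: $\phi$ for the event-denoiser parameters, $\theta$ for the enhancement-network parameters, $D_{\phi}(E)$ for the denoised events, $I_{\mathrm{low}}$ for the low-light image, and $\mathcal{L}_{\mathrm{den}}$, $\mathcal{L}_{\mathrm{enh}}$ for the two losses.

```latex
\min_{\phi}\; \mathcal{L}_{\mathrm{den}}\bigl(\phi,\, \theta^{*}(\phi)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(\phi) \in \arg\min_{\theta}\; \mathcal{L}_{\mathrm{enh}}\bigl(\theta;\, I_{\mathrm{low}},\, D_{\phi}(E)\bigr)
```

The nesting is what distinguishes this from a fixed pre-processing stage: the upper-level denoiser is judged through the enhancement solution $\theta^{*}(\phi)$ it induces, not in isolation.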
Extensive experiments on the Real-world noise Dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR* and 0.047 in SSIM, respectively. The code will be publicly available at https://github.com/iijjlk/BiEvlight.
[128] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Xiongkun Linghu,Jiangyong Huang,Baoxiong Jia,Siyuan Huang
Main category: cs.CV
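TL;DR: This paper proposes 3D-RFT, the first framework to apply Reinforcement Learning with Verifiable Rewards (RLVR) to video-based 3D scene understanding; it designs verifiable reward functions from task metrics (such as 3D IoU and F1-Score), performs reinforcement fine-tuning with the GRPO algorithm, and outperforms models with more parameters on several video-based 3D understanding tasks.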
Details
Motivation: Existing 3D scene understanding methods rely mostly on supervised fine-tuning (SFT), whose token-level cross-entropy loss is misaligned with the real task objectives (such as 3D IoU and F1); RLVR has shown advantages in LLM reasoning but remains under-explored in 3D perception.
Method: 3D-RFT first activates a 3D-aware multimodal large language model (MLLM) with SFT, then performs reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), using strictly verifiable reward functions derived directly from 3D evaluation metrics (such as 3D IoU and F1-Score).
Result: 3D-RFT-4B reaches SOTA on benchmarks for video-based 3D detection, 3D visual grounding, and spatial reasoning, significantly outperforming the larger VG LLM-8B; the study also confirms the method's robustness and offers insights into training strategies and data impact.
Conclusion: 3D-RFT is the first paradigm to bring RLVR to video-based 3D understanding; metric-driven verifiable rewards align training with the task objective, providing a robust and promising path for future 3D scene understanding.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact.
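To make the idea of a metric-derived verifiable reward concrete, the sketch below scores a group of sampled box predictions by 3D IoU and converts the rewards into group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the paper's reward design: boxes are simplified to axis-aligned `(xmin, ymin, zmin, xmax, ymax, zmax)` tuples, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: axis-aligned boxes only; oriented boxes need a different
    intersection routine.
    """
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]


# One ground-truth box and three sampled predictions for the same prompt.
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predictions = [
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),  # half overlapping
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # disjoint
]
rewards = [iou_3d_axis_aligned(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed directly from the evaluation metric, a response is reinforced only to the extent that its box actually overlaps the ground truth, which is the alignment between training signal and task metric that the abstract emphasizes.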
We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
[129] Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
Zheng Wang,Haoran Chen,Haoxuan Qin,Zhipeng Wei,Tianwen Qian,Cong Bai
Main category: cs.CV
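TL;DR: This paper proposes the VideoHV-Agent framework, which recasts long-video question answering as a hypothesis-verification process. Four agents (Thinker, Judge, Verifier, and Answer agent) yield long-video understanding that is more accurate, interpretable, and logically rigorous at lower computational cost.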
Details
Motivation: Long-video understanding faces visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to suffer semantic drift and correlation-driven errors; the authors argue for starting from task formulation (think first, then retrieve) rather than reactive retrieval.
Method: Proposes VideoHV-Agent: based on video summaries, a Thinker rewrites candidate answers into testable hypotheses, a Judge derives the discriminative clues to be verified, a Verifier grounds and tests the clues against localized, fine-grained video content, and an Answer agent integrates the validated evidence into the final answer.
Result: SOTA accuracy on three long-video understanding benchmarks, with better interpretability and logical soundness at lower computational cost.
Conclusion: A structured reasoning paradigm centered on hypothesis verification mitigates semantic drift and error accumulation in long-video understanding, offering a more reliable and efficient path for video reasoning.
Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
[130] A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction
Jie Zhu,Hanghang Ma,Jia Wang,Yayong Guan,Yanbing Zeng,Lishuai Gao,Junqiang Wu,Jie Hu,Leye Wang
Main category: cs.CV
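TL;DR: This paper presents Wallaroo, a unified autoregressive multimodal model that supports image understanding, generation, and editing, handles multiple resolutions, and works in both Chinese and English.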
Details
Motivation: To unify multimodal understanding, image generation, and editing with a simple autoregressive baseline, and to improve adaptability to multi-resolution and multilingual settings.
Method: Proposes Wallaroo, which uses the next-token prediction paradigm; visual encoding is decoupled into separate pathways and a four-stage training strategy is applied; multi-resolution image input/output and Chinese-English bilingual use are supported.
Result: On multiple benchmarks Wallaroo performs on par with or better than existing unified models.
Conclusion: Autoregressive models have great potential for unifying multimodal understanding and generation; Wallaroo demonstrates that this direction is effective and feasible.
Abstract: In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output and is bilingual in Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
[131] TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Jiaxiong Liu,Zhen Tan,Jinpu Zhang,Yi Zhou,Hui Shen,Xieyuanli Chen,Dewen Hu
Main category: cs.CV
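TL;DR: This paper proposes TAPFormer, a Transformer-based framework for asynchronous, temporally consistent fusion that enables robust, high-frequency arbitrary point tracking. Its core contributions are a Transient Asynchronous Fusion (TAF) mechanism that models continuous event updates between frames and a Cross-modal Locally Weighted Fusion (CLWF) module that adapts spatial attention to modality reliability. It achieves the best performance on both a newly built real-world dataset and standard benchmarks.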
Details
Motivation: Existing methods that fuse RGB frames with event streams mostly rely on synchronous or non-adaptive fusion, causing temporal misalignment and severe degradation when one modality fails.
Method: Proposes the TAPFormer framework, which includes a Transient Asynchronous Fusion (TAF) mechanism to model the continuous evolution of events between frames and a Cross-modal Locally Weighted Fusion (CLWF) module to adapt spatial attention; also builds a new real-world frame-event TAP dataset.
Result: A 28.2% improvement in average pixel error within threshold on the self-built dataset, and consistently the best performance on standard point tracking benchmarks.
Conclusion: Through asynchronous, temporally consistent, and adaptive cross-modal fusion, TAPFormer markedly improves the robustness and accuracy of arbitrary point tracking, especially in challenging scenes with illumination changes and motion blur.
Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
[132] MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration
Nanjie Yao,Gangjian Zhang,Wenhao Shen,Jian Shu,Yu Feng,Hao Wang
Main category: cs.CV
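TL;DR: This paper proposes MultiGO++, a framework that combines multi-source texture synthesis, region-aware shape extraction, and a dual reconstruction U-Net to achieve geometry-texture collaboration in monocular 3D clothed human reconstruction, markedly improving reconstruction quality.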
Details
Motivation: Existing methods are limited by scarce training data, inaccurate external geometric priors, and biased single-modality supervision, which leads to poor reconstructions.
Method: Proposes MultiGO++ with three parts: (1) a multi-source texture synthesis strategy that builds more than 15,000 textured 3D human scans; (2) a region-aware shape extraction module and a Fourier geometry encoder to improve geometry learning; (3) a dual reconstruction U-Net that uses geometry-texture collaborative features to generate high-fidelity 3D meshes.
Result: Outperforms current state-of-the-art methods on two benchmarks and a variety of in-the-wild cases.
Conclusion: Through systematic geometry-texture collaboration, MultiGO++ eases the texture, geometry, and system-level bottlenecks of monocular 3D human reconstruction, improving completeness and realism.
Abstract: Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts features of each body region and lets them interact to obtain geometry information, and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
[133] Physics-consistent deep learning for blind aberration recovery in mobile optics
Kartik Jhawar,Tamo Sancho Miguel Tandoc,Khoo Jun Xuan,Wang Lipo
Main category: cs.CV
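TL;DR: This paper proposes Lens2Zernike, a framework that blindly recovers physical optical parameters (Zernike coefficients) from a single blurred image. It combines Zernike regression, differentiable physics constraints, and multi-task spatial map prediction to deliver stable, physics-consistent aberration correction for mobile photography.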
Details
Motivation: Mobile photography is limited by complex lens aberrations; end-to-end deep learning lacks physical interpretability and tends to hallucinate, while classical blind deconvolution is highly unstable, so a method is needed that bridges physical modeling and data-driven learning.
Method: Proposes Lens2Zernike, which jointly supervises Zernike coefficient regression (z), differentiable physics constraints based on the wavefront and point spread function (p), and auxiliary multi-task spatial map prediction (m); a ResNet-18 backbone is used to jointly optimize for physical consistency across the three domains (z/p/m).
Result: On the IDMxS mobile lens database, the full z+p+m framework improves on the z-only baseline by 35%; regression error is significantly lower than that of two existing deep learning methods; the recovered parameters drive stable non-blind deconvolution and restore diffraction-limited detail.
Conclusion: Lens2Zernike combines single-image blind estimation of Zernike coefficients with three kinds of physics-based supervision, which the authors believe has not been done before; it offers both interpretability and strong performance, and a new approach to optical modeling and restoration in mobile imaging.
Abstract: Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these "black-box" models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
[134] How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices
Xiang Yin,Jinfan Hu,Zhiyuan You,Kainan Yan,Yu Tang,Chao Dong,Jinjin Gu
Main category: cs.CV
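TL;DR: This paper proposes a multi-dimensional evaluation framework that systematically assesses generative image restoration (GIR) models on detail, sharpness, semantic correctness, and overall quality. It shows that the core challenge has shifted from missing detail to detail quality and semantic control, and uses the benchmark to train an IQA model that better matches human perception.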
Details
Motivation: To assess whether the practical capabilities of generative image restoration (GIR) models truly exceed those of earlier methods, and to clarify the field's current bottlenecks and trajectory.
Method: Builds a new multi-dimensional evaluation pipeline covering detail, sharpness, semantic correctness, and overall quality; evaluates diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models at scale; analyzes how failure modes have evolved; and uses the benchmark to train a new IQA model.
Result: The core failure mode of GIR models has shifted from under-generation (scarce detail) to over-generation (distorted detail and semantic errors); performance differs markedly across architectures; the newly trained IQA model aligns better with human perceptual judgments.
Conclusion: Although GIR achieves strong perceptual realism, its practical restoration ability is still limited by detail quality and semantic controllability; the study reframes the field's current state and future directions.
Abstract: Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.
[135] Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
Yulong Shi,Shijie Li,Ziyi Li,Lin Qi
Main category: cs.CV
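TL;DR: This paper proposes Tell2Adapt, a new source-free unsupervised domain adaptation (SFUDA) framework built on a vision foundation model (VFM). Context-Aware Prompts Regularization (CAPR) and Visual Plausibility Refinement (VPR) improve the generalization and clinical reliability of medical image segmentation in multi-modality, multi-target settings, and SOTA performance is validated across 10 domain adaptation directions and 22 anatomical targets.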
Details
Motivation: Existing SFUDA methods struggle with the diverse cross-modality, multi-target domain shifts found in real clinical practice, and no unified, broadly generalizable framework exists.
Method: Proposes Tell2Adapt: (1) it draws on VFM knowledge; (2) CAPR robustly maps varied text prompts to canonical instructions, ensuring high-fidelity prompts and high-quality pseudo-labels; (3) VPR uses the VFM's anatomical knowledge to re-ground the student model's predictions in the target image's low-level visual features, suppressing noise and false positives.
Result: A large-scale evaluation across 10 domain adaptation directions and 22 anatomical targets (brain, cardiac, polyp, abdominal, and others) shows clear gains over existing methods, making it SOTA for a unified SFUDA framework in medical image segmentation.
Conclusion: Tell2Adapt brings general VFM knowledge into the SFUDA pipeline, keeping the student model lightweight while balancing adaptation performance and clinical trustworthiness, and offers a practical approach to deploying medical AI across centers.
Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modality, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at https://github.com/derekshiii/Tell2Adapt.
[136] Generalizable Multiscale Segmentation of Heterogeneous Map Collections
Remi Petitpierre
Main category: cs.CV
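TL;DR: This paper proposes a general semantic segmentation method for diverse historical maps. It builds a new benchmark dataset, Semap, and combines procedural data synthesis with a multiscale integration framework to reach SOTA performance on multiple datasets, showing robustness and generalization across map collections, scales, geographic regions, and publication contexts.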
Details
Motivation: Most map recognition work designs specialist models for homogeneous map series, which cope poorly with the wide variety of styles, scales, and geographic coverage in historical maps; general, transferable semantic segmentation methods are needed.
Method: Builds the open benchmark dataset Semap (1,439 manually annotated patches) and proposes a semantic segmentation framework that combines procedural data synthesis with multiscale integration.
Result: The framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets; segmentation performance stays largely stable across map collections, scales, geographic regions, and publication contexts.
Conclusion: Diversity-driven historical map recognition is both feasible and beneficial; the work supplies data and methods for bringing the long tail of historical map archives into historical geographic research.
Abstract: Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives into historical geographic studies.
[137] Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation
Thomas Pinetz,Veit Hucke,Hrvoje Bogunovic
Main category: cs.CV
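TL;DR: This paper proposes IRTTA, which exploits the intermediate representations produced during reconstruction: at test time it adjusts the normalization-layer parameters of the downstream network to improve segmentation, and it provides semantically meaningful uncertainty estimates at no extra cost.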
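As an illustration of the two quantities this entry relies on (an entropy loss averaged over reconstruction timesteps, and disagreement between timestep-wise predictions as an uncertainty signal), here is a minimal sketch for a single pixel. It is not the authors' implementation: the function names, the majority-vote disagreement measure, and the toy logits are all assumptions made for this example, and the modulator network itself is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a class distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def averaged_entropy_loss(logits_per_timestep):
    """Mean prediction entropy over all reconstruction timesteps (hypothetical helper)."""
    entropies = [entropy(softmax(l)) for l in logits_per_timestep]
    return sum(entropies) / len(entropies)


def timestep_disagreement(logits_per_timestep):
    """Fraction of timesteps whose argmax label differs from the majority label.

    This is one simple, assumed way to turn variation among timestep-wise
    predictions into an uncertainty score.
    """
    labels = [max(range(len(l)), key=l.__getitem__) for l in logits_per_timestep]
    majority = max(set(labels), key=labels.count)
    return sum(lab != majority for lab in labels) / len(labels)


# Toy logits for one pixel, three classes, three reconstruction timesteps.
confident = [[5.0, 0.0, 0.0], [6.0, 0.0, 0.0], [7.0, 0.0, 0.0]]
ambiguous = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

print(averaged_entropy_loss(confident), timestep_disagreement(confident))
print(averaged_entropy_loss(ambiguous), timestep_disagreement(ambiguous))
```

Minimizing the averaged entropy pushes predictions at every timestep toward confident outputs, while the disagreement score reuses predictions that are already computed, which is why such an uncertainty estimate adds no extra forward passes.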
Details
Motivation: Primary-care diagnostic systems built on low-cost imaging devices depend on iterative reconstruction algorithms to produce high-quality images, yet downstream task performance is assessed only on the final reconstruction, ignoring the rich intermediate representations produced along the way.
Method: Proposes IRTTA (Iterative Reconstruction Test-Time Adaptation): at test time, a modulator network conditioned on the current reconstruction timescale adjusts the normalization-layer parameters of a frozen downstream network; the modulator is learned online with an entropy loss averaged over all timesteps; variation among the timestep-wise segmentations naturally provides uncertainty estimates.
Result: Without modifying the reconstruction algorithm or the downstream model, segmentation performance improves markedly, and semantically meaningful uncertainty estimates come at no extra cost.
Conclusion: The intermediate representations of the reconstruction process carry rich semantic information; IRTTA uses them for test-time adaptation and uncertainty modeling, offering a new approach to robust diagnosis with low-quality imaging.
Abstract: Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
[138] CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Zhaonian Kuang,Rui Ding,Haotian Wang,Xinhu Zheng,Meng Yang,Gang Hua
Main category: cs.CV
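TL;DR: This paper proposes the CoIn3D framework, which uses spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA) to address the poor generalization of multi-camera 3D object detection across camera configurations, markedly improving cross-configuration transfer.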
Details
Motivation: Existing MC3D models generalize poorly to unseen multi-camera configurations; the root cause is the discrepancy in spatial priors (intrinsics, extrinsics, array layout) between source and target configurations.
Method: Proposes CoIn3D: (1) spatial-aware feature modulation (SFM), which injects four spatial representations (focal length, ground depth, ground gradient, Plücker coordinate) into the feature embedding; (2) camera-aware data augmentation (CDA), which uses training-free dynamic novel-view image synthesis to increase observation diversity.
Result: On benchmarks such as NuScenes, Waymo, and Lyft, CoIn3D shows strong cross-configuration detection performance under three mainstream MC3D paradigms: BEVDepth, BEVFormer, and PETR.
Conclusion: Explicitly modeling and fusing multiple spatial priors improves MC3D generalization to unknown camera configurations; CoIn3D offers a new approach to general multi-camera 3D perception.
Abstract: Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
[139] CLIP-driven Zero-shot Learning with Ambiguous Labels
Jinfu Fan,Jiangnan Li,Xiaowen Yan,Xiaohui Zhong,Wenpeng Lu,Linqing Huang
Main category: cs.CV
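TL;DR: This paper proposes CLIP-PZSL, a CLIP-based zero-shot learning framework for handling label ambiguity, which improves performance through semantic mining and a partial zero-shot loss.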
Details
Motivation: Zero-shot learning methods usually assume accurate training labels, but in real-world scenarios label noise and ambiguity significantly degrade performance, so label uncertainty must be addressed.
Method: Uses CLIP to extract instance and label features; a semantic mining block fuses them into discriminative label embeddings; a partial zero-shot loss weights candidate labels by their relevance to the instance and aligns instance and label embeddings to reduce semantic mismatch; labels and embeddings are refined iteratively.
Result: Comprehensive experiments on several datasets confirm the effectiveness and advantages of CLIP-PZSL.
Conclusion: CLIP-PZSL handles label ambiguity effectively and improves robustness and recognition accuracy in zero-shot learning.
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noisy and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As training proceeds, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
[140] A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset
Francisco Vacalebri-Lloret,Lucas Banchero,Jose J. Lopez,Jose M. Mossi
Main category: cs.CV
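TL;DR: This paper proposes a blue-light detection system based on multiple fisheye cameras and an improved RT-DETR for recognizing European emergency vehicles. It reaches 94.7% accuracy and 94.1% recall on ABLDataset, detects at distances of up to 70 meters, estimates the approach angle, and can be integrated into a multimodal ADAS to improve road safety.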
Details
Motivation: To improve the real-time, robust recognition of emergency-vehicle blue light signals by Advanced Driver Assistance Systems (ADAS) and thereby enhance road safety, especially under complex climatic and geographic conditions.
Method: Four fisheye cameras, each with a 180° horizontal field of view, capture images, and calibration enables azimuthal localization; ABLDataset (images of European emergency vehicles under varied conditions) is built and used; YOLOv5/v8/v10, RetinaNet, Faster R-CNN, and RT-DETR are compared, with RT-DETR selected as the base model and enhanced with a color attention block; geometric transformations estimate the emergency vehicle's approach angle relative to the center of the car.
Result: The improved RT-DETR reaches 94.7% accuracy and 94.1% recall on the test set; field detections reach up to 70 meters; blue lights are localized in azimuth and the approach angle is estimated; the system can be integrated into a visual-acoustic multimodal ADAS.
Conclusion: The system markedly improves the accuracy, robustness, and practicality of emergency-vehicle blue light detection, offering an efficient and feasible technical path for next-generation intelligent traffic safety systems.
Abstract: This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
[141] MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration
Nian Liu,Jin Gao,Shubo Lin,Yutong Kou,Sikui Zhang,Fudong Ge,Zhiqiang Pu,Liang Li,Gang Wang,Yizheng Wang,Weiming Hu
Main category: cs.CV
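TL;DR: This paper proposes MI-DETR, a biologically inspired infrared small target detector that processes one frame per time step. A retina-inspired cellular automaton (RCA) models motion explicitly and produces a motion map; a dual-pathway (appearance and motion) interaction block (PMI Block) and an RT-DETR decoder then yield results that clearly surpass multi-frame methods on several benchmarks.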
Details
Motivation: Infrared small target detection is difficult because targets are tiny and low-contrast against complex, dynamic backgrounds; conventional multi-frame methods learn motion implicitly and often need extra motion supervision or alignment modules, limiting efficiency and interpretability.
Method: Proposes Motion Integration DETR (MI-DETR): (1) a retina-inspired cellular automaton (RCA) converts the infrared frame sequence into a motion map at the same resolution, establishing a parvocellular-like appearance pathway and a magnocellular-like motion pathway; (2) a Parvocellular-Magnocellular Interconnection (PMI) Block enables bidirectional feature interaction between the two pathways; (3) an RT-DETR decoder fuses features from both pathways to produce detections. The detector processes one frame per time step and needs no motion labels or explicit alignment.
Result: 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline); 98.0% mAP@50 on DAUB-R; 88.3% mAP@50 on ITSDT-15K.
Conclusion: An explicit, biologically inspired integration of motion and appearance pathways is highly effective for infrared small target detection, combining performance, simplicity, and interpretability.
Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.
[142] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Yanlin Li,Minghui Guo,Kaiwen Zhang,Shize Zhang,Yiran Zhao,Haodong Li,Congyue Zhou,Weijie Zheng,Yushen Yan,Shengqiong Wu,Wei Ji,Lei Cui,Furu Wei,Hao Fei,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
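TL;DR: This paper introduces the UniM benchmark, the first unified any-to-any interleaved multimodal dataset, covering 30 domains and 7 modalities (text, image, audio, video, document, code, 3D). It also provides an evaluation suite and a baseline model, UniMA, to push multimodal large language models toward unified interleaved understanding and generation.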
Details
Motivation: Real-world multimodal applications must handle arbitrarily combined, interleaved user inputs and generate arbitrarily interleaved multimedia outputs, which calls for any-to-any interleaved multimodal learning under a unified paradigm.
Method: Builds the UniM dataset (31K high-quality instances across 30 domains and 7 modalities); designs an evaluation suite with three dimensions (semantic correctness and generation quality, response structure integrity, interleaved coherence); proposes UniMA, an agentic baseline model with traceable reasoning.
Result: Experiments show that UniM is highly challenging and expose key bottlenecks and directions for improvement in interleaved multimodal understanding and generation.
Conclusion: UniM provides the first systematic benchmark and evaluation framework for unified any-to-any interleaved multimodal intelligence, pushing MLLMs toward more general, structured, and interpretable multimodal capabilities.
Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
[143] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Juntong Fang,Zequn Chen,Weiqi Zhang,Donglin Di,Xuancheng Zhang,Chengmin Yang,Yu-Shen Liu
Main category: cs.CV
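TL;DR: MoRe is an efficient feed-forward 4D reconstruction network that uses an attention-forcing strategy to separate dynamic motion from static structure, enabling real-time reconstruction of dynamic 3D scenes from monocular video.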
Details
Motivation: Existing optimization-based methods struggle on videos with moving objects because camera pose estimation is corrupted; they are also computationally expensive and hard to use in real time.
Method: Proposes MoRe: built on a strong static reconstruction backbone, it uses an attention-forcing strategy to disentangle dynamic motion from static structure; grouped causal attention models temporal dependencies across frames and supports varying token lengths; the model is fine-tuned on large-scale datasets mixing dynamic and static scenes for robustness.
Result: Experiments on multiple benchmarks show that MoRe delivers high-quality, highly efficient dynamic 4D reconstruction and outperforms existing methods.
Conclusion: MoRe offers an efficient, robust, and practical feed-forward solution for dynamic 4D scene reconstruction that balances accuracy and real-time operation.
Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
[144] Orthogonal Spatial-temporal Distributional Transfer for 4D Generation
Wei Liu,Shengqiong Wu,Bobo Li,Haoyu Zhao,Hao Fei,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
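TL;DR: This paper proposes STD-4D, a new 4D diffusion model that disentangles spatial and temporal latents and combines the Orster mechanism with the ST-HexPlane structure to generate high-quality, highly consistent 4D content.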
Details
Motivation: 4D synthesis research is constrained by the lack of large-scale 4D datasets, which keeps models from learning the key spatial-temporal features and hinders progress in the field.
Method: Proposes the STD-4D diffusion model, which models disentangled spatial and temporal latents; designs the Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism to transfer priors from 3D and video diffusion models; introduces a spatial-temporal-aware HexPlane (ST-HexPlane) for 4D deformation and Gaussian feature modeling.
Result: Experiments show that the method significantly outperforms existing approaches in spatial-temporal consistency and 4D synthesis quality.
Conclusion: Cross-modal prior transfer and disentangled modeling ease the scarcity of 4D data and offer a new approach to high-quality 4D content generation.
Abstract: In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
[145] GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Xiaodong Zhu,Yuanming Zheng,Suting Wang,Junqi Yang,Yuhong Yang,Weiping Tu,Zhongyuan Wang
Main category: cs.CV
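TL;DR: This paper proposes the GEM-TFL framework, which uses EM-based optimization, temporal consistency refinement, and graph-based proposal refinement to markedly improve the localization of manipulated video and audio segments under weak supervision, narrowing the gap with fully supervised methods.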
Details
Motivation: Existing weakly supervised temporal forgery localization (WS-TFL) methods suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by top-k aggregation, and no modeling of inter-proposal relationships.
Method: Proposes GEM-TFL, a two-phase classification-regression framework: (1) EM-based optimization turns binary labels into multi-dimensional latent attributes to strengthen weak supervision; (2) a training-free temporal consistency refinement module; (3) a graph-based proposal refinement module that models temporal-semantic relationships among proposals.
Result: Experiments on multiple benchmark datasets show that GEM-TFL achieves more accurate and robust temporal forgery localization and substantially narrows the gap with fully supervised methods.
Conclusion: GEM-TFL bridges the supervision gap between weakly supervised training and inference, keeping performance high while cutting annotation cost, and offers a practical approach for multimedia forensics.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
[146] Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search
Zongfang Liu,Shengkun Tang,Zongliang Wu,Xin Yuan,Zhiqiang Shen
Main category: cs.CV
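TL;DR: This paper proposes the Diff-ES framework, which uses evolutionary search to automatically optimize the structured-pruning sparsity schedule for each stage of a diffusion model, and uses memory-efficient weight routing to activate stage-specific weights dynamically without duplicating the model, yielding substantial inference speedups with almost no loss in image quality.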
Details
Motivation: Existing diffusion model pruning methods rely on manually set heuristic sparsity schedules, which struggle to adapt to the highly non-uniform and model-dependent importance of diffusion steps, leading to poor generalization and difficulty balancing speedup against quality.
Method: Proposes Diff-ES, which divides the diffusion process into stages and uses evolutionary search to automatically optimize the structured-pruning sparsity of each stage; a stage-conditioned weight routing mechanism avoids duplicating model parameters and supports both depth and width pruning.
Result: Experiments on DiT and SDXL show that Diff-ES achieves leading measured (wall-clock) speedups while keeping generation quality almost unchanged.
Conclusion: Diff-ES offers a more automatic, efficient, and general paradigm for structured pruning of diffusion models, significantly outperforming existing methods that rely on manual schedules, such as MosaicDiff.
Abstract: Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning.
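To make the idea of searching a stage-wise sparsity schedule concrete, the following is a minimal sketch of a generic evolutionary search over per-stage sparsity ratios. It is not the Diff-ES implementation: the mutation rule, the fixed total sparsity budget, the `SENSITIVITY` values, and the `toy_fitness` function (a stand-in for measured generation quality) are all assumptions made for illustration.

```python
import random

def mutate(schedule, rng, step=0.1, max_sparsity=0.9):
    """Move a little sparsity from one stage to another, keeping the total fixed."""
    child = list(schedule)
    i, j = rng.sample(range(len(child)), 2)
    delta = min(rng.uniform(0.0, step), max_sparsity - child[i], child[j])
    child[i] += delta
    child[j] -= delta
    return child

def evolve_schedule(fitness, num_stages=4, target=0.5, pop_size=16,
                    generations=30, seed=0):
    """Elitist evolutionary search: keep the best half, refill with mutated copies."""
    rng = random.Random(seed)
    uniform = [target] * num_stages
    population = [uniform] + [mutate(uniform, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

# Hypothetical per-stage sensitivity to pruning; a real search would instead
# score each schedule by generating images with the pruned model.
SENSITIVITY = [4.0, 2.0, 1.0, 0.5]

def toy_fitness(schedule):
    return -sum(w * s * s for w, s in zip(SENSITIVITY, schedule))

best = evolve_schedule(toy_fitness)
print([round(s, 2) for s in best])
```

Because the uniform schedule is part of the initial population and the best half always survives, the returned schedule is never worse than uniform pruning under the chosen fitness.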
Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.
[147] BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity
Iman Nematollahi,Jose Francisco Villena-Ossa,Alina Moter,Kiana Farhadyar,Gabriel Kalweit,Abhinav Valada,Toni Cathomen,Evelyn Ullrich,Maria Kalweit
Main category: cs.CV
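TL;DR: This paper proposes BLINK, a trajectory-based recurrent state-space model for modeling the interaction dynamics between natural killer (NK) cells and tumor cells. By learning latent interaction dynamics to predict apoptosis increments, it detects and forecasts NK cell cytotoxic outcomes more accurately and provides interpretable latent representations.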
Details
Motivation: Existing methods that rely only on single-frame annotations cannot reliably infer NK cell cytotoxic outcomes, because cytotoxicity is inherently a cell-cell interaction process that evolves over time.
Method: Proposes BLINK, a trajectory-driven recurrent state-space model that learns latent interaction dynamics from partially observed NK-tumor cell interaction sequences and predicts apoptosis increments that accumulate over time.
Result: Validated on long-term time-lapse imaging data, BLINK improves cytotoxic outcome detection, supports forecasting of future outcomes, and produces an interpretable latent representation that clusters NK cell trajectories into coherent behavioral modes and temporally structured interaction phases.
Conclusion: BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cell cytotoxic behavior at the single-cell level.
Abstract: Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.
[148] UniPAR: A Unified Framework for Pedestrian Attribute Recognition
Minghe Xu,Rouying Wu,Jiarui Xu,Minhao Sun,Zikang Yan,Xiao Wang,ChiaWei Chu,Yu Li
Main category: cs.CV
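TL;DR: This paper proposes UniPAR, a unified Transformer framework for cross-modal, cross-domain pedestrian attribute recognition (PAR). Through unified data scheduling, a dynamic classification head, and a phased fusion encoder, a single model handles RGB images, video, event streams, and other data types, and reaches SOTA-level results on multiple benchmark datasets.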
Details
Motivation: Existing pedestrian attribute recognition methods are confined to the "one model per dataset" paradigm and struggle to cope with the significant differences across domains in modalities, attribute definitions, and environmental scenarios.
Method: Proposes the UniPAR framework, comprising a unified data scheduling strategy, a dynamic classification head, and a phased fusion encoder that explicitly aligns visual features with textual attribute queries (using a late deep fusion strategy).
Result: On mainstream benchmarks such as MSP60K, DukeMTMC, and EventPAR, UniPAR performs on par with domain-specific SOTA methods; multi-dataset joint training significantly improves cross-domain generalization and recognition robustness in extreme environments such as low light and motion blur.
Conclusion: UniPAR achieves unified cross-modal, cross-domain pedestrian attribute recognition, offers a new direction for building general visual understanding systems, and shows potential for practical deployment.
Abstract: Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
[149] SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning
Wenqian Li,Pengfei Fang,Hui Xue
Main category: cs.CV
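TL;DR: This paper proposes SRasP, a new cross-domain few-shot learning method that uses crop-global style perturbation guided by global semantics, together with multi-objective optimization, to improve generalization and transfer to unseen domains.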
Details
Motivation: Existing style-perturbation methods for cross-domain few-shot learning suffer from unstable gradients and a tendency to converge to sharp minima, which limits model robustness and transferability.
Method: Proposes Self-Reorientation Adversarial Style Perturbation (SRasP), which uses global semantic guidance to identify incoherent crops, then reorients the style gradients of those crops and aggregates them with the global style gradients; a multi-objective optimization function maximizes visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features.
Result: Significantly outperforms existing state-of-the-art methods on multiple CD-FSL benchmarks, confirming the method's effectiveness in improving flatness, stability, and cross-domain generalization.
Conclusion: Through a stabilized style perturbation mechanism and semantics-aware optimization, SRasP effectively mitigates domain shift and strengthens robust transfer to unseen target domains, offering a new direction for CD-FSL.
Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima. To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial Style Perturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.
[150] Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci
Main category: cs.CV
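TL;DR: This paper proposes an adaptive framework inspired by human cognition that uses visual embeddings to judge task complexity dynamically, giving VLA models a three-level "act, think, abstain" response mechanism that significantly reduces computational overhead and failure risk while preserving performance.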
Details
Motivation: Existing VLA models rely on a fixed inference mechanism, which wastes resources on simple tasks and, lacking uncertainty estimation, is prone to catastrophic failure on complex or out-of-distribution tasks.
Method: Turns the VLA's vision-language backbone into a state-complexity detector: visual embeddings (more robust owing to the semantic invariance of language) are projected into an ensemble of parametric and non-parametric estimators, enabling dynamic routing based on perceived state complexity: Act (execute directly), Think (reason about ambiguous cases), and Abstain (halt on anomalies).
Result: Validated on LIBERO, LIBERO-PRO, and a real robot, the vision-only configuration reaches 80% F1-Score using only 5% of the training data, making it an efficient and reliable complexity detector.
Conclusion: Visual embeddings are sufficient to assess VLA task complexity effectively; the proposed adaptive routing framework strikes a better balance among efficiency, robustness, and safety.
Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
[151] SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction
Ningjing Fan,Yiqun Wang
Main category: cs.CV
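TL;DR: This paper proposes the SSR-GS framework, which models direct specular reflections with a prefiltered Mip-Cubemap, captures indirect specular reflections with an IndiASG module, and introduces reflection-aware Visual Geometry Priors (VGP) to improve glossy surface reconstruction, achieving SOTA performance on synthetic and real datasets.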
Details
Motivation: Existing 3D Gaussian splatting (3DGS) methods struggle to reconstruct glossy surfaces accurately under complex illumination, especially with strong specular reflections and multi-surface interreflections.
Method: Proposes the SSR-GS framework: 1) a prefiltered Mip-Cubemap models direct specular reflections; 2) an IndiASG module models indirect specular reflections; 3) Visual Geometry Priors (VGP) are introduced, comprising a photometric loss weighting mechanism based on a reflection score (RS) and geometry priors derived from VGGT, such as decayed depth supervision and transformed normal constraints.
Result: Experiments on synthetic and real-world datasets show that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
Conclusion: SSR-GS effectively improves 3DGS reconstruction of glossy surfaces under complex illumination; jointly modeling direct and indirect reflections and introducing reflection-aware visual geometry priors significantly improves rendering quality and geometric consistency.
Abstract: In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
[152] The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis
Dishantkumar Sutariya,Eike Petersen
Main category: cs.CV
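TL;DR: This paper studies racial bias in deep learning models for chest X-rays (CXR) caused by racial shortcut learning, and finds that bounding-box-based lung cropping as a preprocessing step effectively mitigates this bias without harming diagnostic accuracy.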
Details
Motivation: Deep learning models can identify racial identity from chest X-rays with high accuracy, raising widespread concern that "racial shortcut learning" (models using race-related artifacts in diagnostic decisions) makes healthcare inequitable and unreliable.
Method: Experimentally evaluates how image preprocessing methods, including lung masking, lung cropping, and CLAHE, suppress racial shortcut learning, focusing on their ability to reduce racial bias while preserving diagnostic performance.
Result: Bounding-box-based lung cropping proves to be a simple yet effective strategy that significantly reduces racial shortcut learning without sacrificing diagnostic accuracy, thereby avoiding the commonly assumed fairness-accuracy trade-off.
Conclusion: Image preprocessing, lung cropping in particular, is a feasible and practical means of mitigating racial bias in chest X-ray AI models, offering a new direction for improving fairness in medical AI.
Abstract: Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
[153] Generic Camera Calibration using Blurry Images
Zezhun Shi
Main category: cs.CV
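TL;DR: This paper proposes a method that combines geometric constraints with a local parametric illumination model to calibrate generic camera models in the presence of motion blur, simultaneously estimating feature locations and spatially varying point spread functions while resolving the translational ambiguity.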
Details
Motivation: Generic camera calibration is more accurate than parametric calibration but requires far more images, which makes motion blur hard for individual users to avoid.
Method: Uses geometric constraints and a local parametric illumination model to jointly estimate feature locations and spatially varying point spread functions, and to resolve the translational ambiguity.
Result: Experimental results validate the method's effectiveness under motion blur.
Conclusion: The method provides a workable solution for generic camera calibration under motion blur, improving calibration robustness and practicality.
Abstract: Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.
[154] Mario: Multimodal Graph Reasoning with Large Language Models
Yuanfu Sun,Kang Li,Pengkang Guo,Jiajin Liu,Qiaoyu Tan
Main category: cs.CV
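TL;DR: This paper proposes the Mario framework, which uses a graph-conditioned vision-language model and a modality-adaptive graph instruction tuning mechanism to enable effective large language model (LLM) reasoning on multimodal graphs (MMGs). It addresses two challenges, weak cross-modal consistency and heterogeneous modality preference, and significantly outperforms existing methods on node classification and link prediction.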
Details
Motivation: Existing methods mostly rely on pretrained vision-language models to encode image-text pairs in isolation, ignoring the relational structure inherent in real-world multimodal data; LLMs therefore need to be able to reason over multimodal graphs with image-text node attributes and structural edges while preserving graph topology.
Method: Proposes Mario, a unified framework: 1) a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; 2) a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and uses a learnable router to dynamically select, for each node and its neighborhood, the best modality configuration to feed to the LLM.
Result: On multiple multimodal graph benchmarks, Mario consistently surpasses state-of-the-art graph models on node classification and link prediction in both supervised and zero-shot settings.
Conclusion: Mario enables LLM reasoning on multimodal graphs that fuses structural, visual, and linguistic signals, offering a new paradigm for multimodal graph learning.
Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two challenges above and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction.
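The router idea in this abstract can be illustrated with a minimal sketch of a softmax gate over candidate modality configurations. This is only a generic routing layer, not Mario's actual router: the view names in `VIEWS`, the two-dimensional node embedding, and the hand-set `weights` and `bias` are hypothetical, and a trained router would learn its parameters instead.

```python
import math

# Hypothetical names for the candidate instruction views of one node.
VIEWS = ["text_only", "image_only", "text_and_image"]

def route(node_embedding, weights, bias):
    """Score each candidate view with a linear layer and return a softmax over views."""
    logits = [sum(w * x for w, x in zip(row, node_embedding)) + b
              for row, b in zip(weights, bias)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters chosen by hand for illustration; one weight row per view.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

probs = route([2.0, 0.0], weights, bias)
chosen = VIEWS[max(range(len(VIEWS)), key=probs.__getitem__)]
print(chosen, [round(p, 3) for p in probs])
```

Here the hard argmax picks a single view per node; a soft mixture over views would be an equally plausible design, and the abstract does not say which one is used.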
The code will be made available at https://github.com/sunyuanfu/Mario.
[155] Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Muhammad Zarar,MingZheng Zhang,Xiaowang Zhang,Zhiyong Feng,Sofonias Yitagesu,Kawsar Farooq
Main category: cs.CV
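TL;DR: Logi-PAR is the first framework to integrate learnable logic rules into patient activity recognition (PAR). It combines neural-guided differentiable rules with symbolic mappings to enable interpretable risk reasoning and counterfactual interventions.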
Details
Motivation: Existing PAR models only identify the type of activity and lack explicit logical reasoning about the causes of risk, so they fall short of the interpretability and causal intervention that clinical safety requires.
Method: Proposes the Logi-PAR framework, which introduces multi-view contextual fact fusion as a primitive feature extractor and embeds neural-guided differentiable logic rules; rule learning is optimized end-to-end, so that implicit patterns are labelled explicitly during training.
Result: Significantly outperforms vision-language models and Transformer baselines on the VAST and OmniFall clinical benchmarks, and supports rule-traced "why" explanations and quantified counterfactual interventions (e.g., "risk would decrease by 65% if assistance were provided").
Conclusion: Logi-PAR is the first to realize PAR based on learnable logic rules, combining high performance with strong interpretability and offering an auditable, intervenable new paradigm for intelligent clinical monitoring.
Abstract: Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling implicitly emerging patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git
[156] Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation
Yingxue Su,Yiheng Zhong,Keying Zhu,Zimu Zhang,Zhuoru Zhang,Yifang Wang,Yuxin Zhang,Jingxin Liu
Main category: cs.CV
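TL;DR: This paper proposes Semantic Class Distribution Learning (SCDL), a framework that learns structured class-conditional feature distributions to mitigate the supervision and representation biases caused by class imbalance in medical image segmentation.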
Details
Motivation: Dense pixel-level annotation for medical image segmentation is time-consuming and expensive, and datasets often exhibit severe class imbalance, so minority structures are overwhelmed by dominant classes in feature representations, which hampers discriminative feature learning and reliable segmentation.
Method: Proposes the SCDL framework, which comprises Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies, and Semantic Anchor Constraints (SAC) that use labeled data to guide the proxies.
Result: Experiments on the Synapse and AMOS datasets show that SCDL significantly improves segmentation performance on both overall and per-class metrics, with especially strong gains on minority classes, reaching state-of-the-art results.
Conclusion: SCDL is a plug-and-play module that effectively mitigates supervision and representation biases and delivers strong performance in medical image segmentation.
Abstract: Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.
[157] SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery
Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Main category: cs.CV
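TL;DR: This paper proposes SPyCer, a semi-supervised, physics-guided deep network that uses satellite imagery and sparse ground sensor data, combined with physical constraints (such as the surface energy balance and advection-diffusion-reaction equations), to produce continuous pixel-level estimates of near-surface air temperature (NSAT).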
Details
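Motivation: Near-surface atmospheric phenomena significantly affect humans and ecosystems, but near-ground sensors are sparse and unevenly distributed and cannot provide continuous spatial observations; satellites offer wide coverage but cannot directly retrieve near-surface air temperature, so physical priors are needed to improve reliability.
Method: SPyCer formulates NSAT prediction as a pixel-level vision task: local satellite image patches are extracted around ground sensor locations; the center pixel is supervised jointly by the observed label and physical constraints, while neighboring pixels receive physics-based regularization derived from the surface energy balance and PDEs; a land-cover-guided multi-head attention mechanism is used, with Gaussian distance weighting to model spatial physical influence.
Result: On real-world datasets, SPyCer produces spatially coherent and physically consistent NSAT estimates that outperform existing baselines in accuracy, generalization, and consistency with physical processes.
Conclusion: SPyCer combines remote sensing observations, sparse ground measurements, and physical modeling, providing a generalizable paradigm for high-resolution continuous mapping of near-surface environmental variables.
Abstract: Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting.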
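As a rough illustration of attention "modulated with Gaussian distance weighting", the sketch below multiplies softmax attention over a square patch by a Gaussian of each pixel's distance to the center (sensor) pixel and renormalizes. The multiplicative form, the `sigma` parameter, and the single-head layout are assumptions; the abstract does not specify how SPyCer combines the two terms.

```python
import math

def gaussian_weights(size, sigma):
    """Weight each pixel of a size x size patch by its distance to the centre (sensor) pixel."""
    c = size // 2
    return [[math.exp(-((r - c) ** 2 + (k - c) ** 2) / (2.0 * sigma ** 2))
             for k in range(size)] for r in range(size)]

def modulated_attention(scores, sigma):
    """Softmax over per-pixel scores, multiplied by the Gaussian weights and renormalized."""
    size = len(scores)
    weights = gaussian_weights(size, sigma)
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    raw = [[math.exp(scores[r][k] - top) * weights[r][k] for k in range(size)]
           for r in range(size)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

# Equal scores everywhere isolate the effect of distance:
# attention decays smoothly away from the centre pixel.
flat_scores = [[0.0] * 5 for _ in range(5)]
attention = modulated_attention(flat_scores, sigma=1.0)
print([round(v, 3) for v in attention[2]])
```

In a land-cover-guided setting, `scores` would come from a similarity between the center pixel's features and each neighbor's features, so that nearby pixels with similar land cover receive the most weight.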
Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
[158] Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems
Serkan Ergun,Tobias Mitterer,Hubert Zangl
Main category: cs.CV
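TL;DR: This paper proposes a digital-twin-based dual-arm robotic textile sorting system that combines multimodal perception, grasp prediction, and semantic reasoning with vision-language models (VLMs) to automatically identify and sort deformable garments and foreign objects. In tests on 223 real scenarios, the Qwen family of VLMs reached the highest accuracy, up to 87.9%, while Gemma3 is suited to edge deployment; the digital twin working with MoveIt improves collision avoidance and manipulation reliability.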
Details
Motivation: Growing demand for sustainable textile recycling calls for robust automation that can handle deformable garments and detect foreign objects in cluttered environments.
Method: Builds a dual-arm robotic system integrating RGBD vision, capacitive tactile feedback, and collision-aware motion planning; evaluates nine vision-language models (VLMs) from five families on a self-built dataset of 223 scenarios for textile category classification, foreign object detection, and hallucination analysis; combines a digital twin with MoveIt for 3D-point-cloud-driven path planning and closed-loop manipulation.
Result: The Qwen family of VLMs achieves the highest overall accuracy (up to 87.9%) with strong foreign object detection; the lightweight Gemma3 model offers a good speed-accuracy trade-off and is suited to edge deployment; the digital twin effectively improves manipulation reliability.
Conclusion: Semantic VLM reasoning, conventional grasp detection, and digital twin technology can be combined effectively to support scalable, fully automated industrial-grade textile sorting systems.
Abstract: The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability.
The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
[159] CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Gong Chen,Chaokun Zhang,Tao Tang,Pengcheng Lv,Feng Li,Xin Xie
Main category: cs.CV
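TL;DR: This paper proposes the CATNet framework, which uses spatio-temporal synchronization, wavelet-enhanced denoising, and adaptive feature selection to address latency and noise in multi-agent cooperative perception, significantly improving robustness and adaptability in complex traffic scenes.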
Details
Motivation: Existing cooperative perception methods overlook the high temporal latency and multi-source noise that arise in real-world multi-source data fusion.
Method: Proposes the CATNet framework with three core modules: 1) a Spatio-Temporal Recurrent Synchronization module (STSync) for aligning asynchronous features; 2) a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs local distortions; 3) an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features.
Result: Experiments on multiple datasets show that CATNet consistently outperforms existing methods under complex traffic conditions, with stronger robustness and adaptability.
Conclusion: CATNet effectively addresses the practical challenges of latency and noise in multi-agent cooperative perception, providing a reliable technical path toward real-world deployment.
Abstract: Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) module that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.
[160] Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Shan Ning,Longtian Qiu,Xuming He
Main category: cs.CV
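TL;DR: This paper proposes the Wiki-R1 framework, which uses controllable curriculum data generation and a curriculum sampling strategy with reinforcement learning to improve the reasoning of multimodal large language models on knowledge-based visual question answering (KB-VQA), significantly raising accuracy on two benchmark datasets.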
Details
Motivation: KB-VQA involves noisy external knowledge retrieval and a structured, encyclopedic knowledge base, which create a distributional gap from pretrained multimodal large language models and make effective post-training reasoning and domain adaptation difficult.
Method: Proposes Wiki-R1, a data-generation-based curriculum reinforcement learning framework that includes controllable curriculum data generation (manipulating the retriever to produce samples of varying difficulty) and a curriculum sampling strategy (estimating sample difficulty from observed rewards and selecting samples with high advantage gain for RL updates).
Result: Sets a new SOTA on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, raising accuracy from 35.5% to 37.1% and from 40.1% to 44.1%, respectively.
Conclusion: Through curriculum-style reinforcement learning, Wiki-R1 effectively bridges the gap between pretrained models and the KB-VQA target distribution, confirming the key role of controllable data generation and difficulty-aware sampling in improving model reasoning.
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
[161] Layer by layer, module by module: Choose both for optimal OOD probing of ViT
Ambroise Odonnat,Vasilii Feofanov,Laetitia Chapel,Romain Tavenard,Ievgen Redko
Main category: cs.CV
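TL;DR: This paper studies the behavior of intermediate layers in pretrained vision Transformers, finds that distribution shift between pretraining and downstream data is the main cause of performance degradation in deeper layers, and shows that features from different modules should be chosen for linear probing depending on the degree of shift.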
Details
Motivation: Recent work has found that intermediate layers of foundation models often have more discriminative representations than the final layer, but the cause remains unclear; this paper aims to systematically analyze the mechanisms behind intermediate-layer behavior in vision Transformers.
Method: Runs extensive linear probing experiments on multiple image classification benchmarks and performs a fine-grained module-level analysis, comparing features from different locations (such as Transformer block outputs, FFN activations, and the normalized MHSA output).
Result: Distribution shift is the main cause of degradation in deeper layers; under strong distribution shift, probing the internal FFN activations works best; under weak distribution shift, probing the normalized multi-head self-attention output is optimal.
Conclusion: Standard probing of Transformer block outputs is not the best strategy; the intermediate features used downstream should be chosen according to the degree of distribution shift between pretraining and the downstream task.
Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
[162] Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation
Kang Luo,Xin Chen,Yangyi Xiao,Hesheng Wang
Main category: cs.CV
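TL;DR: This paper proposes Fusion4CA, a BEVFusion-based LiDAR-RGB fusion method that fully exploits RGB information through a contrastive alignment module, a camera auxiliary branch, a cognitive adapter, and a coordinate attention module, significantly improving 3D object detection on the nuScenes dataset with fewer training epochs and only a small parameter increase.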
Details
Motivation: Existing LiDAR-RGB fusion methods in BEV space over-rely on the LiDAR branch and underuse RGB information.
Method: Adds four plug-and-play components to BEVFusion: 1) a contrastive alignment module that calibrates image features with 3D geometry; 2) a camera auxiliary branch that strengthens RGB information mining; 3) a cognitive adapter that leverages pretrained image weights; 4) a coordinate attention module that enhances the fusion stage.
Result: Reaches 69.7% mAP on nuScenes with only 6 training epochs and a 3.48% increase in inference parameters, a 1.2% improvement over the baseline trained for 20 epochs; generalization is also validated in a simulated lunar environment.
Conclusion: Fusion4CA effectively strengthens the role of the RGB modality in BEV fusion, achieving efficient, lightweight, and well-generalizing multimodal 3D detection.
Abstract: Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird's-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.
[163] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
Guandong Li
Main category: cs.CV
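TL;DR: This paper proposes SpectralCache, a unified caching framework for Diffusion Transformers (DiTs). Its three modules, timestep-aware dynamic scheduling, cumulative error budgets, and frequency-decomposed caching, substantially accelerate inference without sacrificing generation quality.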
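The abstract for this entry quotes a "16% in speed" advantage over TeaCache and an "LPIPS difference < 1%". The short sketch below re-derives both from the raw figures given there, assuming both are meant as relative differences (the abstract does not say so explicitly). The dictionary and function names are illustrative only.

```python
# Figures quoted in the abstract (FLUX.1-schnell, 512x512 resolution).
spectral_cache = {"speedup": 2.46, "lpips": 0.217, "ssim": 0.727}
tea_cache = {"speedup": 2.12, "lpips": 0.215, "ssim": 0.734}


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, as a fraction."""
    return (new - old) / old


# Assumption: "16% in speed" compares the two speedup factors directly.
speed_gain = relative_change(spectral_cache["speedup"], tea_cache["speedup"])

# Assumption: "LPIPS difference < 1%" is a relative, not absolute, difference.
lpips_gap = abs(relative_change(spectral_cache["lpips"], tea_cache["lpips"]))

print(f"speed gain over TeaCache: {speed_gain:.1%}")
print(f"relative LPIPS gap: {lpips_gap:.2%}")
```

Under these assumptions the quoted percentages are consistent with the raw numbers: 2.46 / 2.12 is roughly 1.16, and 0.002 / 0.215 is just under 1%.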
Details
Motivation: Existing DiT caching methods treat the denoising process as uniform across time, depth, and feature dimensions, ignoring its inherent non-uniformity, which limits the speedup or degrades quality. Method: Based on the observation that DiT denoising is non-uniform along three orthogonal axes (time, depth, and feature), the paper proposes the SpectralCache framework with three core components: Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). Result: On FLUX.1-schnell, it achieves a 2.46x speedup (LPIPS 0.217, SSIM 0.727), 16% faster than TeaCache, with image quality essentially unchanged (LPIPS difference < 1%). Conclusion: SpectralCache is a training-free, plug-and-play caching scheme compatible with existing DiT architectures, and it validates exploiting the inherent non-uniformity of DiTs for optimization. Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
[164] Dark3R: Learning Structure from Motion in the Dark
Andrew Y Guo,Anagh Malik,SaiKiran Tedla,Yutong Dai,Yiqian Qin,Zach Salehe,Benjamin Attal,Sotiris Nousias,Kyros Kutulakos,David B. Lindell
Main category: cs.CV
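TL;DR: Dark3R is a structure-from-motion (SfM) framework designed for extreme low light (SNR below -4 dB). It adapts large-scale 3D foundation models to dark scenes through teacher-student distillation, trains only on noisy-clean raw image pairs with no 3D supervision, and achieves state-of-the-art performance on a newly built exposure-bracketed dataset.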
Details
Motivation: Conventional feature-based and learning-based methods fail in dark conditions with an extremely low signal-to-noise ratio (SNR < -4 dB), so an SfM method that works robustly in extreme low light is needed. Method: The paper proposes Dark3R, which adapts large-scale 3D foundation models to low light through teacher-student knowledge distillation; trains only on noisy-clean raw image pairs (captured directly or synthesized with a Poisson-Gaussian noise model); and pairs its predicted poses with a coarse-to-fine radiance field optimization for novel view synthesis. Result: On a newly built exposure-bracketed dataset of about 42,000 multi-view raw images with ground-truth 3D annotations, Dark3R achieves state-of-the-art SfM in the low-SNR regime; its predicted poses also support state-of-the-art novel view synthesis in the dark. Conclusion: Dark3R shows that robust SfM and novel view synthesis in extreme darkness are possible without 3D supervision, using only paired raw images, which opens a new path for low-light 3D vision. Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB, a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy-clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson-Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes ~42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
[165] ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
Sijia Chen,Zihan Zhou,Yanqiu Yu,En Yu,Wenbing Tao
Main category: cs.CV
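TL;DR: This paper proposes a new task, Omnidirectional Referring Multi-Object Tracking (ORMOT), to address the fragmented tracking that conventional RMOT suffers under a limited field of view. It also builds ORSet, the first omnidirectional referring multi-object tracking dataset, and designs ORTrack, a tracking framework based on a large vision-language model.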
Details
Motivation: 现有指代多目标跟踪(RMOT)方法依赖常规相机数据,视场受限,导致目标易出框、跟踪碎片化、长时序语言理解困难,亟需扩展到全向影像以提升鲁棒性和上下文建模能力。 Method: 提出ORMOT新任务;构建包含27个全向场景、848条语言描述、3401个标注目标的ORSet数据集;设计LVLM驱动的ORTrack跟踪框架。 Result: 在ORSet数据集上的大量实验验证了ORTrack框架的有效性;数据集与代码将开源。 Conclusion: ORMOT为多模态跟踪开辟了新方向,ORSet和ORTrack为全向视觉-语言联合理解提供了重要基础资源与方法支撑。 Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. 
The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
[166] Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations
Hajar Dekdegue,Moncef Garouani,Josiane Mothe,Jordan Bernigaud
Main category: cs.CV
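TL;DR: This paper proposes Fusion-CAM, which combines the strengths of gradient-based methods (e.g., Grad-CAM) and region-based methods (e.g., Score-CAM). Through denoising, weighted fusion, and an adaptive pixel-level fusion mechanism, it produces more robust, discriminative, and context-aware visual explanation maps, and outperforms existing CAM methods in both qualitative and quantitative evaluation.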
Details
Motivation: 现有CAM方法存在明显局限:梯度法(如Grad-CAM)细节丰富但噪声大、覆盖不全;区域法(如Score-CAM)覆盖广但过度平滑、敏感度低。亟需一种兼顾判别性与完整性、鲁棒性的解释方法。 Method: 提出Fusion-CAM框架:1)对梯度图进行去噪以获得更干净聚焦的激活;2)用贡献权重融合去噪梯度图与区域图;3)设计基于相似性的自适应像素级融合机制,动态调整融合强度,强化一致区域、软化冲突区域。 Result: 在标准基准上大量实验表明,Fusion-CAM在可视化质量(定性)和量化指标(如定位精度、保真度等)上均持续优于各类现有CAM变体。 Conclusion: Fusion-CAM有效弥合了梯度法与区域法之间的解释鸿沟,提供了一种鲁棒、灵活且输入自适应的深度网络可解释性新范式。 Abstract: Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. 
Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
[167] Video-based Locomotion Analysis for Fish Health Monitoring
Timon Palm,Clemens Seibold,Anna Hilsmann,Peter Eisert
Main category: cs.CV
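TL;DR: This paper presents a video analysis system based on multi-object tracking and a YOLOv11 detector that estimates the locomotion activity of cultivated fish (such as swimming direction and speed) to support fish health monitoring.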
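As a rough illustration of how swimming speed and direction could be read off a tracker's output, the sketch below converts one fish's per-frame centroid positions into speed and heading. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `locomotion`, the pixel units, and the heading convention are all hypothetical.

```python
import math


def locomotion(track, fps):
    """Per-step speed and heading for one tracked fish.

    track: list of (x, y) box centroids in pixels, one per consecutive frame
           (assumed to come from a tracking-by-detection system).
    fps:   video frame rate in frames per second.
    Returns a list of (speed_px_per_s, heading_deg) tuples, one per frame pair.
    """
    steps = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Displacement per frame times frames per second gives pixels per second.
        speed = math.hypot(dx, dy) * fps
        # Heading in [0, 360); 0 degrees points along +x. In image coordinates
        # y grows downward, so 90 degrees means "down" on screen.
        heading = math.degrees(math.atan2(dy, dx)) % 360.0
        steps.append((speed, heading))
    return steps


# Usage: a fish moving 3 px right and 4 px down per frame at 10 fps.
steps = locomotion([(0, 0), (3, 4), (6, 8)], fps=10)
```

Converting pixel speeds to physical units would additionally require a camera calibration, which this sketch does not attempt.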
Details
Motivation: 监测鱼类健康状况对早期疾病检测、动物福利保障和可持续水产养殖至关重要;而鱼类的运动行为可反映其生理与病理状态。 Method: 采用基于检测的跟踪框架,核心是嵌入式YOLOv11检测器,并探索了多种YOLOv11架构配置及多帧融合扩展以提升检测精度。 Result: 在人工标注的苏拉威西稻鱼(Sulawesi ricefish)视频数据集(模拟家庭水族箱环境)上验证了系统可靠性,能准确测量游泳方向与速度;该数据集将公开发布。 Conclusion: 所提系统为低成本、非侵入式鱼类健康监测提供了可行方案,具备实际水产养殖应用潜力。 Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11-architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.[168] MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis
Numan Saeed,Fadillah Adamsyah Maani,Mohammad Yaqub
Main category: cs.CV
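TL;DR: This paper proposes Selective Repulsive Knowledge Distillation, which enables a lightweight fetal ultrasound model (11.4M parameters) to surpass its large teacher (304M parameters) on several tasks and to run in real time on an iPhone 16 Pro, making it suitable for portable ultrasound devices in low-resource settings.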
Details
Motivation: 当前胎儿超声AI基础模型参数过大(>300M),难以部署于基层便携设备;标准知识蒸馏在极大容量差距下失效,学生模型易模仿教师的架构冗余而非本质特征。 Method: 提出选择性排斥知识蒸馏(Selective Repulsive Knowledge Distillation),将对比式知识蒸馏分解为对角线(匹配样本对对齐)与非对角线(强制负权重)两部分,使学生模型避开教师的类间混淆,自主学习更适配自身架构的特征。 Result: 11.4M参数学生模型在零样本HC18生物测量有效性(88.6% vs. 83.5%)和脑部子平面F1分数(0.784 vs. 0.702)上均超越304M参数FetalCLIP教师模型,并在iPhone 16 Pro上推理仅需1.6ms。 Conclusion: 该方法有效弥合了大-小模型间极端容量差距,实现了高性能、低延迟、可部署于手持设备的胎儿超声AI,推动了低资源地区产前智能辅助诊断的落地。 Abstract: Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.[169] RelaxFlow: Text-Driven Amodal 3D Generation
Jiayin Zhu,Guoji Fu,Xiaolu Liu,Qiyuan He,Yicong Li,Angela Yao
Main category: cs.CV
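TL;DR: This paper proposes RelaxFlow, a framework for text-driven amodal 3D generation. By decoupling control granularity, it keeps rigid constraints on the input observation while following the text prompt in a more relaxed way, completing occluded 3D shapes that are semantically consistent and geometrically plausible.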
Details
Motivation: 图像到3D生成在遮挡情况下存在固有语义歧义,仅靠部分观测难以确定物体类别;需结合文本引导完成对不可见区域的合理推断,同时严格保留可见部分。 Method: 提出无训练双分支框架RelaxFlow,包含多先验共识模块(Multi-Prior Consensus Module)和松弛机制(Relaxation Mechanism);理论证明该松弛等价于对生成向量场施加低通滤波,抑制高频细节、保留几何结构。 Result: 在新构建的ExtremeOcc-3D和AmbiSem-3D两个诊断基准上验证有效:RelaxFlow能准确按文本意图生成不可见区域,且不损害视觉保真度。 Conclusion: 解耦控制粒度是解决文本驱动非模态3D生成中观测保真与语义引导冲突的关键;RelaxFlow提供了一种无需训练、理论可解释、效果鲁棒的新范式。 Abstract: Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.[170] SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Ye-Chan Kim,SeungJu Cha,Si-Woo Kim,Minju Jeon,Hyungee Kim,Dong-Jin Kim
Main category: cs.CV
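TL;DR: This paper proposes SAIL, which builds semantically aware masks through cross-modal alignment and uses a large language model to generate synthetic captions that strengthen training under sparse annotations, substantially improving weakly supervised dense video captioning.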
Details
Motivation: Existing methods only generate non-overlapping Gaussian masks that lack semantic grounding, and their reliance on sparse ground-truth captions limits performance. Method: The paper proposes the SAIL framework: 1) semantically aware mask generation based on cross-modal alignment; 2) a similarity-aware training objective that focuses masks on video regions highly similar to their event captions; 3) an LLM-driven synthetic caption augmentation strategy, incorporated through an inter-mask mechanism to assist temporal localization. Result: On ActivityNet Captions and YouCook2, SAIL achieves state-of-the-art performance on both captioning and temporal localization metrics. Conclusion: Combining semantically aware masks with LLM-based augmentation mitigates sparse annotation and semantically uninformed masks in the weakly supervised setting, offering a new paradigm for dense video captioning. Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using only caption annotations for training, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
[171] Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Dongwon Kim,Gawon Seo,Jinsung Lee,Minsu Cho,Suha Kwak
Main category: cs.CV
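TL;DR: This paper proposes CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, sharply reducing the computational cost of decision-time planning with world models while preserving planning performance.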
Details
Motivation: Decision-time planning with existing world models is computationally expensive and hard to run in real time because conventional tokenizers produce a large number of tokens. Method: The paper proposes CompACT, a discrete tokenizer that compresses each observation into very few tokens (as few as 8), and builds an action-conditioned world model on top of it. Result: CompACT speeds up planning by orders of magnitude while keeping planning performance competitive. Conclusion: CompACT offers an efficient and practical step toward deploying world models in the real world. Abstract: World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that uses the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
[172] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Kanon Amemiya,Daichi Yashima,Kei Katsumata,Takumi Komatsu,Ryosuke Korekata,Seitaro Otsuki,Komei Sugiura
Main category: cs.CV
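TL;DR: This paper proposes NaiLIA, a method for retrieving nail design images from dense intent descriptions and palette queries. It introduces a relaxed loss based on confidence scores to improve alignment with unlabeled images, and demonstrates its advantage on a newly built multicultural benchmark of 10,625 images.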
Details
Motivation: 现有视觉-语言基础模型难以有效融合密集、多层的美甲设计意图描述(包括绘制元素、装饰物、视觉特征、主题及整体印象)以及用户通过色盘指定的精细颜色偏好。 Method: 提出NaiLIA多模态检索方法,将密集意图描述与调色板查询统一建模,并引入基于未标注图像置信度分数的松弛损失,以增强语义对齐能力。 Result: 在自建包含10,625张图像、由200多名标注者提供长而密集描述的多文化基准上,NaiLIA显著优于标准方法。 Conclusion: NaiLIA能更有效地建模复杂、细粒度的用户美甲意图,尤其在融合文本描述与连续颜色偏好方面具有优势,为垂直领域跨模态检索提供了新思路。 Abstract: We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.[173] RealWonder: Real-Time Physical Action-Conditioned Video Generation
Wei Liu,Ziyu Chen,Zizhang Li,Yue Wang,Hong-Xing Yu,Jiajun Wu
Main category: cs.CV
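TL;DR: RealWonder is the first real-time, single-image system for action-conditioned video generation. It uses physics simulation as an intermediate bridge (producing optical flow and RGB frames) so that the video model can capture the physical consequences of 3D actions, supports interactive simulation of rigid objects, deformable bodies, fluids, and granular materials, and runs at 13.2 FPS.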
Details
Motivation: Existing video generation models lack a structural understanding of the physical consequences of 3D actions (such as forces and robotic manipulation) and cannot simulate their effect on real 3D scenes. Method: The paper proposes RealWonder, which has three parts: single-image 3D reconstruction, physics-based simulation (producing optical flow and RGB frames), and a distilled video generator that needs only 4 diffusion steps. Actions are not encoded directly; physics simulation converts them into visual representations that the video model can process. Result: The system generates video in real time at 13.2 FPS at 480x832 resolution and supports interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. Conclusion: RealWonder is the first to close the loop from action to physics to video, opening new paths for video models in immersive experiences, AR/VR, and robot learning. Abstract: Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
[174] Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Pengxiang Li,Joey Tsai,Hongwei Xue,Kunyu Shi,Shilin Yan
Main category: cs.CV
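TL;DR: This paper proposes the Longest Stable Prefix (LSP), a new decoding scheduler that improves the inference efficiency of diffusion language models (DLMs). LSP evaluates token stability in a single forward pass, then identifies and atomically commits the longest contiguous stable prefix. This improves KV cache locality and reduces token flip rates and denoiser calls, yielding up to a 3.4x speedup without sacrificing quality.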
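The prefix-selection step described for this entry (find a contiguous left-aligned block of stable predictions, then snap its boundary to a delimiter) can be sketched as follows. This is only an illustration: the abstract does not specify how stability is measured, so a per-token confidence threshold is assumed here, and the function name, default threshold, delimiter set, and no-delimiter fallback are all hypothetical.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens to commit in this denoising step.

    tokens:      current predictions for the active (uncommitted) suffix.
    confidences: one score per token; "stable" is ASSUMED to mean
                 confidence >= threshold.
    """
    # 1) Longest left-aligned run of stable tokens.
    run = 0
    for score in confidences:
        if score < threshold:
            break
        run += 1

    # 2) Snap the boundary back to the last delimiter inside that run,
    #    so the committed prefix ends at a natural break.
    for end in range(run, 0, -1):
        if tokens[end - 1] in delimiters:
            return end

    # Assumed fallback: no delimiter in the run, so commit the whole run.
    return run


# Usage: the run of stable tokens has length 6, but the boundary snaps
# back to just after the period.
toks = ["The", "answer", "is", "42", ".", "So", "we"]
conf = [0.99, 0.95, 0.97, 0.93, 0.96, 0.91, 0.40]
n_commit = longest_stable_prefix(toks, conf)
```

Because committed tokens always form a prefix, a cache keyed by position would only ever be appended to, which is the locality benefit the entry describes.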
Details
Motivation: Existing DLM decoding schedulers use scattered acceptance, which fragments the KV cache, hurts memory locality, and triggers frequent repairs, severely limiting practical inference speed. Method: The paper proposes LSP, a training-free and model-agnostic scheduler. It evaluates token stability in a single forward pass, dynamically identifies a contiguous left-aligned stable prefix, snaps the prefix boundary to a natural linguistic or structural delimiter, and then commits it atomically; the prefix-first topology turns KV cache updates into contiguous appends and preserves bidirectional lookahead. Result: Experiments on LLaDA-8B and Dream-7B show up to 3.4x faster inference across benchmarks for mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, with output quality matched or slightly improved. Conclusion: By restructuring the token commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and hardware execution efficiency, offering a new paradigm for efficient DLM inference. Abstract: Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on "scattered acceptance": committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. 
By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
[175] EdgeDAM: Real-time Object Tracking for Mobile Devices
Syed Muhammad Raza,Syed Murtaza Hussain Abidi,Khawar Islam,Muhammad Ibrahim,Ajmal Saeed Mian
Main category: cs.CV
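TL;DR: This paper proposes EdgeDAM, a lightweight single-object tracking framework for edge devices. A dual-buffer memory mechanism and a confidence-driven switching strategy improve robustness to occlusion, distractors, and fast motion while preserving real-time performance.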
Details
Motivation: 现有基于分割的记忆机制计算开销大、难以在边缘设备实时运行;而轻量级检测型跟踪器易受相似干扰物影响导致漂移。需兼顾精度、鲁棒性与边缘部署效率。 Method: 提出EdgeDAM框架:(1)双缓冲干扰物感知记忆(DAM),含近期感知记忆(保持目标一致性)与干扰物解析记忆(显式存储难负样本并抑制误选);(2)置信度驱动切换+持框稳定机制,在遮挡时自适应启用检测或记忆引导重识别,并临时冻结并扩展预测框以抑制干扰物污染。 Result: 在五个基准(含干扰物聚焦的DiDi数据集)上验证有效性:DiDi准确率达88.2%,iPhone 15上达25 FPS,显著提升遮挡与快速运动下的鲁棒性且满足实时边缘部署需求。 Conclusion: EdgeDAM成功将 distractor-aware memory 适配至轻量级边界框跟踪范式,在边缘约束下实现精度与速度的更好平衡,为资源受限场景下的鲁棒SOT提供了新思路。 Abstract: Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. 
Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
[176] HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Sai Akhil Kogilathota,Sripadha Vallabha E G,Luzhe Sun,Jiawei Zhou
Main category: cs.CV
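TL;DR: This paper proposes a method that predicts the hallucination risk of vision-language models (VLMs) before generation. It probes internal representations (such as visual features and query-token states that fuse vision and text) in a single forward pass, enabling efficient detection without decoding at up to 0.93 AUROC, and shows that the most predictive layer and modality differ across architectures.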
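Since probe quality in this entry is reported as AUROC, a small reference implementation of that metric may help when reading the numbers. The sketch below uses the pairwise-ranking definition (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The probe scores and labels are made-up placeholders, not data from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    scores: probe outputs, higher meaning "more likely to hallucinate".
    labels: 1 for hallucinated examples, 0 for faithful ones.
    Quadratic in the number of examples; fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Usage with placeholder probe scores (illustrative values only).
example_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for the 0.93 and ~0.79 values quoted in this entry.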
Details
Motivation: 现有幻觉检测方法多在文本生成后进行,干预成本高且不及时;本文旨在探索能否在生成任何token之前、仅通过一次前向传播预测幻觉风险。 Method: 在8种现代VLM(如Llama-3.2-Vision、Gemma-3、Phi-4-VL、Qwen2.5-VL等)上,系统分析三类内部表征:(i) 未融合的纯视觉特征,(ii) 文本解码器中的vision-token表征,(iii) 融合图文信息的query-token表征;并在其上训练轻量级探测器(probes)以预测幻觉。 Result: 探测器在无需解码前提下实现强检测性能,最高达0.93 AUROC(Gemma-3-12B等);late query-token状态对多数模型最有效,而少数模型(如Qwen2.5-VL-7B)则依赖视觉特征(~0.79 AUROC)。 Conclusion: 幻觉风险可在生成前被可靠预测;最优探测层和模态因模型架构而异;轻量探测器有望支持早期拒答、选择性路由与自适应解码,提升VLM的安全性与效率。 Abstract: Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). 
These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
[177] Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields
Scout Jarman,Zigfried Hampel-Arias,Adra Carr,Kevin R. Moon
Main category: cs.CV
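TL;DR: This paper proposes a NeRF-based 3D scene reconstruction method for longwave infrared hyperspectral images (LWIR HSI) that works with sparse views and few images, and applies it to gas plume detection.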
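Reconstruction quality for this entry is reported as PSNR in dB. As a reminder of what that figure encodes, the sketch below computes PSNR from mean squared error and inverts it. It assumes values normalized to a peak of 1.0, which is a common convention but is not stated in the abstract; how the authors average over spectral bands is likewise not specified here.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences.

    max_value is the assumed peak signal value (1.0 for normalized data).
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))


# Usage: a uniform error of 0.1 on normalized data gives 20 dB.
example_psnr = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
# Under the peak = 1.0 assumption, 39.8 dB corresponds to an MSE near 1.05e-4.
example_mse = mse_from_psnr(39.8)
```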
Details
Motivation: 高光谱图像(HSI)在环境监测、国家安全等领域有广泛应用,但通常仅有少量图像可用;将多视角信息融合为统一三维表示可提升场景几何与光谱分析能力。 Method: 基于Mip-NeRF架构,融合高光谱NeRF与稀疏视角NeRF的先进方法,并引入一种新颖的自适应加权MSE损失函数,在DIRSIG生成的合成多视角LWIR HSI数据集上进行训练。 Result: 仅需约30张训练图像即可达到39.8 dB平均PSNR,比标准Mip-NeRF减少约50%图像需求;在NeRF渲染图像上使用自适应相干估计器进行气体羽流检测,平均AUC达0.821。 Conclusion: NeRF可用于稀疏视角下的LWIR高光谱三维重建,并支持下游气体检测任务,验证了其在红外高光谱成像分析中的可行性与潜力。 Abstract: Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. 
Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
[178] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
Guo Chen,Lidong Lu,Yicheng Liu,Liangrui Dong,Lidong Zou,Jixin Lv,Zhenquan Li,Xinyi Mao,Baoqi Pei,Shihao Wang,Zhiqi Li,Karan Sapra,Fuxiao Liu,Yin-Dong Zheng,Yifei Huang,Limin Wang,Zhiding Yu,Andrew Tao,Guilin Liu,Tong Lu
Main category: cs.CV
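TL;DR: This paper introduces MM-Lifelong, a dataset for multimodal lifelong understanding, and shows that current models suffer from a working memory bottleneck and global localization collapse on long-horizon video understanding. To address this, it designs the Recursive Multimodal Agent (ReMA), which manages memory dynamically and iteratively updates a belief state, significantly improving performance.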
Details
Motivation: 现有视频理解数据集多为密集拼接的短片段,与真实无脚本的日常生活场景差异大,缺乏对长期、稀疏、多尺度时间结构建模的能力。 Method: 构建了包含181.1小时、按日/周/月多尺度组织的MM-Lifelong数据集;提出递归多模态智能体(ReMA),通过动态内存管理与递归信念状态更新机制应对长时序挑战。 Result: 实验揭示了端到端MLLMs的工作记忆瓶颈与代表性智能体的全局定位崩溃两大失败模式;ReMA在该数据集上显著优于现有方法。 Conclusion: MM-Lifelong为多模态终身理解提供了更贴近现实的基准;ReMA验证了递归信念建模与动态记忆管理对长时序理解的有效性,推动了面向真实生活场景的视频理解研究。 Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.[179] Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Shai Yehezkel,Shahar Yadin,Noam Elata,Yaron Ostrovsky-Berman,Bahjat Kawar
Main category: cs.CV
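TL;DR: This paper proposes CalibAtt, a training-free sparse attention method that uses offline calibration to identify stable block-level sparsity patterns and skips redundant attention computation in video diffusion models, achieving up to 1.58x end-to-end speedup while preserving generation quality and text-video alignment.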
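The calibrate-then-skip idea summarized in this TL;DR can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head setting, the max-over-inputs rule, the `threshold` parameter, and the choice to always keep diagonal blocks are all assumptions made for illustration. The repetition patterns, the per-layer/per-timestep handling, and the compiled kernels mentioned in the entry are not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_block_mask(calib_qk, block, threshold):
    """Offline pass (hypothetical rule): keep a (query-block, key-block) pair
    if its average attention mass reaches `threshold` on any calibration input."""
    n = calib_qk[0][0].shape[0]
    nb = n // block  # assumes n is divisible by block
    worst = np.zeros((nb, nb))
    for q, k in calib_qk:
        probs = softmax(q @ k.T / np.sqrt(q.shape[1]))
        # Mass each query row sends to each key block, averaged per query block.
        mass = probs.reshape(nb, block, nb, block).sum(axis=3).mean(axis=1)
        worst = np.maximum(worst, mass)
    mask = worst >= threshold
    # Assumption: always keep diagonal blocks so no query block is left empty.
    np.fill_diagonal(mask, True)
    return mask

def block_sparse_attention(q, k, v, mask, block):
    """Inference: compute attention densely over kept key blocks, skip the rest."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    for i in range(mask.shape[0]):
        keep = np.flatnonzero(mask[i])
        cols = (keep[:, None] * block + np.arange(block)).ravel()
        rows = slice(i * block, (i + 1) * block)
        scores = q[rows] @ k[cols].T / np.sqrt(d)
        out[rows] = softmax(scores) @ v[cols]
    return out
```

With `threshold=0.0` every block is kept and the result matches dense attention. Raising the threshold trades accuracy for skipped work, and any speedup in practice would depend on a hardware-efficient kernel rather than this Python loop.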
Details
Motivation: Existing Transformer-based video diffusion models run slowly because spatiotemporal attention is expensive; the authors observe that many token-to-token connections consistently receive negligible attention scores across inputs, with repeating patterns, and can be safely skipped. Method: CalibAtt is training-free: an offline calibration pass identifies block-level sparsity and repetition patterns that are stable across inputs for each layer, head, and diffusion timestep, and compiles them into hardware-friendly optimized attention operations; at inference time, only the selected connections are computed densely and the rest are skipped. Result: On Wan 2.1 14B, Mochi 1, and few-step distilled models, CalibAtt achieves up to 1.58x end-to-end speedup at various resolutions, outperforming other training-free methods without degrading video generation quality or text-video alignment. Conclusion: CalibAtt shows that attention in video diffusion models contains structural redundancy that generalizes across inputs, and offers an efficient, plug-and-play inference acceleration scheme that gives a practical path to deploying large-scale video generation models. Abstract: Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
[180] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Weijie Lyu,Ming-Hsuan Yang,Zhixin Shu
Main category: cs.CV
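TL;DR: FaceCam is a system that generates video under customizable camera trajectories from monocular portrait video input. Using a face-tailored, scale-aware camera representation and two data generation strategies, it performs better in controllability, visual quality, and identity and motion preservation.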
Details
Motivation: Existing camera control methods based on large video-generation models often produce geometric distortions and visual artifacts on portrait videos, caused by scale-ambiguous camera representations or 3D reconstruction errors. Method: Proposes a face-tailored, scale-aware representation of camera transformations that does not rely on 3D priors; trains jointly on multi-view studio captures and in-the-wild monocular videos; designs two camera-control data generation strategies, synthetic camera motion and multi-shot stitching. Result: Experiments on the Ava-256 dataset and a variety of in-the-wild videos show that FaceCam outperforms existing methods in camera controllability, visual quality, identity consistency, and motion fidelity. Conclusion: FaceCam effectively addresses geometric distortion in camera control for monocular portrait video and achieves high-quality, high-fidelity generation of dynamic viewpoints. Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.
[181] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
Leif Van Holland,Domenic Zingsheim,Mana Takhsha,Hannah Dröge,Patrick Stotko,Markus Plack,Reinhard Klein
Main category: cs.CV
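TL;DR: This paper proposes a Transformer-based, multi-view-aware texture inpainting method for multi-view 3D streaming applications. It acts as a standalone post-processing module applied after rendering, balances real-time performance with high quality, and significantly outperforms existing methods.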