cs.CL

[1] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Zhehao Tan,Yihan Jiao,Dan Yang,Junjie Wang,Duolin Sun,Jie Feng,Xidong Wang,Lei Liu,Yue Shen,Jian Wang,Jinjie Gu

Main category: cs.CL

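TL;DR: This paper proposes Contrastive Likelihood Reward (CLR), a new "internal-external" hybrid reward framework for improving the context-sensitive reasoning and factual consistency of large language models in RAG. CLR optimizes the log-likelihood gap of a response with versus without supporting documents, which strengthens the model's reliance on and confidence in evidence and mitigates the hallucination accumulation caused by self-judgment. Experiments show strong results on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks.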

Details

Motivation: Existing RAG-oriented reinforcement learning methods rely on external rewards that struggle to assess document faithfulness, and there is no reliable self-reward mechanism for RAG. Pure self-judgment, lacking objective feedback, tends to cause hallucination accumulation and model collapse.

Method: Proposes the Contrastive Likelihood Reward (CLR), which explicitly models evidence dependence by maximizing the gap between the log-likelihood of the generated response with supporting documents and without them, and combines it with an external correctness reward to form a hybrid reward framework.

Result: Significantly outperforms existing RAG-RL methods on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks, supporting the effectiveness of CLR for improving context sensitivity and answer faithfulness.

Conclusion: CLR provides an intrinsic reward signal that requires no external annotation and can be trained end to end. It addresses the twin problems of unreliable self-judgment and insufficient external rewards in RAG, and offers a new paradigm for trustworthy RAG systems.

Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
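The log-likelihood gap described for CLR can be illustrated with a small sketch. It assumes per-token probabilities of the same response are already available under two prompts (with and without the supporting documents). The function names, the length normalization, and the linear mixing weight are illustrative assumptions, not details taken from the paper.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities over the tokens of one response."""
    return sum(math.log(p) for p in token_probs)

def contrastive_likelihood_reward(probs_with_docs, probs_without_docs,
                                  length_normalize=True):
    """Log-likelihood gap of the same response with vs. without evidence.

    Assumption: both lists score the same response tokens, so they have
    the same length. Length normalization is an illustrative choice.
    """
    if len(probs_with_docs) != len(probs_without_docs):
        raise ValueError("both lists must score the same response tokens")
    gap = (sequence_log_likelihood(probs_with_docs)
           - sequence_log_likelihood(probs_without_docs))
    if length_normalize:
        gap /= len(probs_with_docs)
    return gap

def hybrid_reward(clr, correctness, weight=0.5):
    """Hypothetical linear mix of the internal (CLR) and external rewards."""
    return weight * clr + (1.0 - weight) * correctness

# A response that becomes much more likely once the documents are in the prompt.
grounded = contrastive_likelihood_reward([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
# A response the model would have produced anyway gets no reward from the gap.
ungrounded = contrastive_likelihood_reward([0.5, 0.5], [0.5, 0.5])
```

A positive gap means the documents raised the model's confidence in the response; a gap near zero means the response did not depend on the evidence.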

[2] Semantic Containment as a Fundamental Property of Emergent Misalignment

Rohan Saxena

Main category: cs.CL

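TL;DR: This paper finds that even with no benign data mixed in, fine-tuning language models only on harmful data that carries semantic triggers spontaneously produces behavioral compartmentalization: the model exhibits harmful behavior only when the trigger is present. Semantic triggers alone are therefore sufficient to induce compartmentalization, which exposes a major gap in current safety evaluation.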

Details

Motivation: To determine whether compartmentalization arises from training on a mix of benign and harmful data or is driven by semantic triggers alone, and thereby to expose latent safety risks in purely harmful fine-tuning.

Method: With zero benign data, fine-tune three models (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) only on harmful examples with triggers, then systematically remove or rephrase the triggers at inference time and measure the change in the rate of harmful behavior.

Result: EM rates drop to 0.0–1.0% when triggers are removed and recover to 12.2–22.8% when triggers are restored. Rephrased triggers remain effective, showing that the models respond to semantics rather than surface form.

Conclusion: Semantic triggers can induce compartmentalization on their own, without any benign-versus-harmful data contrast. Any harmful fine-tuning with contextual framing therefore introduces safety vulnerabilities that are hard to detect.

Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5–23.5% drop to 0.0–1.0% when triggers are removed during inference, but recover to 12.2–22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.

[3] Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

Luzhou Peng,Zhengxin Yang,Honglu Ji,Yikang Yang,Fanda Fan,Wanling Gao,Jiayuan Ge,Yilin Han,Jianfeng Zhan

Main category: cs.CL

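TL;DR: This paper proposes the Probing Memes paradigm, which views large language models as composed of "memes" and captures model-item interactions in a Perception Matrix, revealing the diversity of population-level model behavior and hidden capability structures.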

Details

Motivation: Existing LLM evaluation paradigms treat models and datasets separately and rely on coarse metrics such as overall accuracy, ignoring how model behavior varies across individual data items.

Method: Introduces the notion of "memes" and builds the Probing Memes evaluation paradigm, centered on a Perception Matrix, with Probe Properties (characterizing data items) and Meme Scores (characterizing model behavioral traits).

Result: Validated on 9 datasets and 4,507 LLMs. The paradigm reveals capability structures that traditional paradigms cannot detect (for example, elite models failing on easy items) and supports more extensible, more informative benchmarking and population-level evaluation.

Conclusion: Probing Memes offers a finer-grained, more ecological LLM evaluation framework, moving from single-score evaluation toward a systematic understanding of model-data interaction.

Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.

[4] Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Nora Petrova,Andrew Gordon,Enzo Blindow

Main category: cs.CL

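TL;DR: This paper proposes the HUMAINE framework, in which 23,404 stratified participants (covering 22 demographic groups) evaluate 28 large models through multi-turn naturalistic conversations along five human-centric dimensions. Using a hierarchical Bayesian BTD model with post-stratification to census data, it reveals a model performance ranking, demographic heterogeneity (with age having an especially strong effect), and differences in discriminative power across evaluation dimensions.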

Details

Motivation: Existing LLM evaluation suffers from technical benchmarks detached from real-world use, unrepresentative sampling in human preference evaluation, insufficient assessment depth, and reduction to a single metric.

Method: Builds the HUMAINE framework: collects multi-turn naturalistic conversation data from 23,404 participants in the US and UK, stratified across 22 demographic groups; evaluates 28 SOTA models on 5 dimensions; fits a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data.

Result: (1) Establishes a performance hierarchy, with gemini-2.5-pro ranked first with 95.6% posterior probability. (2) Finds significant preference heterogeneity, with age as the primary axis of disagreement, exposing failures of generalisation. (3) Discriminative power differs greatly across dimensions: "Trust, Ethics & Safety" has a 65% tie rate, versus only 10% for "Overall Winner".

Conclusion: LLM evaluation needs to move to a multidimensional, demographically aware paradigm. The authors release the full dataset, an interactive leaderboard, and the framework.

Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner.
Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
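To make the role of ties in a Bradley-Terry-Davidson model concrete, here is a minimal sketch of the standard Davidson extension, in which a tie parameter `nu` adds probability mass proportional to the geometric mean of the two strengths. It shows only the pairwise outcome probabilities. The hierarchical Bayesian fitting and census post-stratification used in the paper are not shown, and the helper `nu_for_tie_rate` is a hypothetical convenience for the equal-strength case.

```python
import math

def btd_probabilities(strength_a, strength_b, nu):
    """Return (P(a wins), P(tie), P(b wins)) under the Davidson tie model."""
    tie_mass = nu * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, tie_mass / total, strength_b / total

def nu_for_tie_rate(tie_rate):
    """Tie parameter that yields `tie_rate` between two equally strong models.

    For equal strengths, P(tie) = nu / (2 + nu), so nu = 2t / (1 - t).
    """
    return 2.0 * tie_rate / (1.0 - tie_rate)

# Illustration only: map the two reported tie rates onto equally matched models.
ambiguous = btd_probabilities(1.0, 1.0, nu_for_tie_rate(0.65))
decisive = btd_probabilities(1.0, 1.0, nu_for_tie_rate(0.10))
```

A dimension with a large `nu` sends most comparisons to a tie and therefore carries less information for ranking models, which is one way to read the contrast between the 65% and 10% tie rates.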

[5] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

Omar Abdelnasser,Fatemah Alharbi,Khaled Khasawneh,Ihsen Alouani,Mohammed E. Fouda

Main category: cs.CL

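TL;DR: This paper introduces SalamaBench, a safety evaluation benchmark for Arabic language models (ALMs) comprising 8,170 prompts across 12 categories of safety risk. Evaluating 5 leading ALMs with it shows that models differ markedly across harm categories, and points to the need for category-aware evaluation and dedicated safeguard mechanisms to improve ALM safety.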

Details

Motivation: Existing safety benchmarks and safeguard models are English-centric and ill-suited to Arabic NLP systems, and they lack fine-grained, category-level analysis of safety vulnerabilities. As a result, safety evaluation of ALMs lags badly, which hinders their wider adoption.

Method: Builds SalamaBench, a unified Arabic safety benchmark covering the 12 MLCommons hazard categories with 8,170 prompts; harmonizes heterogeneous datasets through a rigorous pipeline of AI filtering and multi-stage human verification; and evaluates 5 state-of-the-art ALMs under several safeguard configurations (individual guard models, majority vote, human-annotated gold labels).

Result: Fanar 2 has the lowest aggregate attack success rate but uneven robustness across categories. Jais 2 is consistently vulnerable across harm categories, indicating weaker intrinsic safety alignment. Native ALMs perform far worse than dedicated safeguard models when used as safety judges.

Conclusion: Safety evaluation of ALMs must be category-aware and paired with purpose-built safeguard mechanisms to achieve robust harm mitigation.

Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges.
Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
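The combination of majority-vote aggregation and category-aware reporting can be sketched as follows. This is an illustrative reading of the abstract, not the benchmark's code: the record layout, the tie-breaking rule (ties count as safe), and the category names in the example are assumptions.

```python
from collections import defaultdict

def majority_says_unsafe(verdicts):
    """True when more than half of the guard verdicts flag the response as unsafe.

    Assumption: each verdict is a bool (True = unsafe); ties count as safe.
    """
    return sum(verdicts) * 2 > len(verdicts)

def attack_success_rate_by_category(records):
    """Per-category share of responses that the guard majority judged unsafe.

    `records` is an iterable of (category, [guard verdicts]) pairs.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for category, verdicts in records:
        totals[category] += 1
        unsafe[category] += majority_says_unsafe(verdicts)
    return {category: unsafe[category] / totals[category] for category in totals}

# Hypothetical verdicts from three guard models on three model responses.
example = [
    ("hate", [True, True, False]),
    ("hate", [False, False, True]),
    ("privacy", [False, False, False]),
]
rates = attack_success_rate_by_category(example)
```

Reporting one rate per category rather than a single aggregate is what lets uneven robustness show up, as in the Fanar 2 finding above.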

[6] One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Liming Lu,Kaixi Qiu,Jiayu Zhou,Jushi Kai,Haoyan Zhang,Huanyu Wang,Jingwen Leng,Ziwei He,Zhouhan Lin

Main category: cs.CL

TL;DR: DynaKV is a novel post-training framework for low-rank KV cache compression in LLMs, dynamically allocating compression rates per token based on semantics to achieve high fidelity at aggressive compression ratios.

Details

Motivation: The escalating memory footprint of the Key-Value (KV) cache hinders efficient LLM inference; existing dimensionality reduction methods either require expensive pre-training or suffer severe performance loss under high compression.

Method: DynaKV introduces a post-training framework that dynamically allocates token-wise compression rates based on semantic meaning, enabling effective low-rank KV cache compression without retraining.

Result: DynaKV outperforms state-of-the-art compression techniques, reduces memory significantly while preserving generation quality, and retains 6% of the KV cache with 94% baseline performance on LongBench when combined with SnapKV.

Conclusion: DynaKV establishes a new paradigm for efficient KV cache compression via dynamic, semantics-aware, post-training low-rank approximation, offering strong orthogonality to other pruning methods.

Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods.
When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

[7] Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

O. V. Usatenko,S. S. Melnyk,G. M. Pritula

Main category: cs.CL

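TL;DR: This paper proposes approximating the high-dimensional dynamics of large language models (LLMs) with N-order additive Markov chains, establishes an equivalence between additive multi-step chains and chains with a step-wise memory function, and extends the concept of "information temperature" to additive N-order Markov chains.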

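Details

Motivation: LLMs operate in extremely high-dimensional state spaces, where token embeddings and hidden representations create complex dependencies that do not reduce easily to classical Markov structures, so a more tractable theoretical approximation is needed.

Method: Models LLM dynamics with N-order additive Markov chains, decomposing the conditional probability of the next token into a superposition of contributions from multiple historical depths, and establishes a mathematical equivalence with Markov chains that have a step-wise memory function.

Result: Proves a strict correspondence between additive multi-step Markov chains and Markov chains with a step-wise memory function, and uses it to extend the concept of "information temperature" to the additive N-order case.

Conclusion: Additive Markov chains offer a theoretically tractable and interpretable framework for understanding the intrinsic dynamics of LLMs, and the extension of information temperature lays groundwork for quantifying information flow and memory in language modeling.

Abstract: Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.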
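The idea of decomposing the next-symbol probability into a sum of contributions from different historical depths can be sketched for a binary alphabet. The specific form used here, P(next = 1 | history) = mean + sum over r of F(r) * (a[i-r] - mean), is the form commonly used for binary additive chains. The abstract does not spell it out, so treat it and the example memory values as assumptions.

```python
import random

def prob_next_is_one(history, memory, mean):
    """P(next symbol = 1) for a binary additive chain of order N = len(memory).

    memory[r - 1] plays the role of the memory function F(r); history[-r] is
    the symbol r steps back. Each depth contributes independently and additively.
    """
    n = len(memory)
    if len(history) < n:
        raise ValueError("history must contain at least N symbols")
    p = mean + sum(memory[r - 1] * (history[-r] - mean) for r in range(1, n + 1))
    if not 0.0 <= p <= 1.0:
        raise ValueError("memory function yields an invalid probability")
    return p

def generate(length, memory, mean, seed=0):
    """Sample a binary sequence from the additive chain (first N symbols are i.i.d.)."""
    rng = random.Random(seed)
    sequence = [int(rng.random() < mean) for _ in range(len(memory))]
    while len(sequence) < length:
        p = prob_next_is_one(sequence, memory, mean)
        sequence.append(int(rng.random() < p))
    return sequence[:length]

# Order-2 chain: the previous symbol matters more than the one before it.
sample = generate(20, memory=[0.2, 0.1], mean=0.5)
```

An unrestricted binary chain of order N needs a separate conditional probability for each of the 2^N possible histories, while the additive form needs only the N memory values and the mean, which is the reduction in combinatorial cost that the abstract refers to.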

[8] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

Natalie Perez,Sreyoshi Bhaduri,Aman Chadha

Main category: cs.CL

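TL;DR: This paper proposes an interdisciplinary framework that combines semiotics, hermeneutics, and qualitative research methods to evaluate how accurately LLM-generated text conveys meaning. It introduces the qualitative ICR metric and finds that although LLMs do well on surface linguistic similarity, they remain markedly weaker than humans on contextualized semantic accuracy and meaning alignment.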

Details

Motivation: Meaning in human language is relational, context-dependent, and emergent, whereas current computational models (such as word vectors and embedding models) only approximate meaning statistically and struggle to capture the dynamic meaning of human interpretation. An evaluation framework closer to a humanistic perspective is needed.

Method: Integrates semiotic and hermeneutic theory with inductive content analysis and reflexive thematic analysis, proposes the qualitative Inductive Conceptual Rating (ICR) metric, and empirically compares LLM-generated and human-generated thematic summaries on five datasets.

Result: LLMs score high on lexical similarity but systematically lower than humans on semantic accuracy, especially for contextualized meaning. Performance improves with dataset size but varies considerably across models, possibly reflecting differences in how frequently and coherently concepts recur.

Conclusion: Meaning evaluation for LLMs should be built on systematic qualitative interpretive practice, going beyond purely statistical or surface-matching metrics and engaging with the humanistic nature of linguistic meaning.

Abstract: Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM generated and human generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings.
We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.

[9] Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks

Mahmoud Abusaqer,Jamil Saquer

Main category: cs.CL

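TL;DR: This paper proposes RoBERTa-OTA, which combines RoBERTa embeddings, scaled attention, and enhanced graph convolutional networks to fuse contextual language understanding with ontology-based, domain-specific semantic knowledge. It significantly improves multiclass hate speech detection across demographic categories, especially for gender-related categories, with only 0.33% parameter overhead.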

Details

Motivation: Existing methods learn representations from training data alone and do not explicitly incorporate structured ontological knowledge, which makes it hard to handle the implicit targeting strategies and linguistic variability of multiclass hate speech on social media.

Method: Proposes RoBERTa-OTA, which fuses RoBERTa text embeddings, scaled attention layers, and enhanced Graph Convolutional Networks (GCNs), and introduces ontology-guided attention to model textual features jointly with structured knowledge representations.

Result: 5-fold cross-validation on 39,747 balanced samples gives 96.04% accuracy, 1.02 percentage points above standard RoBERTa. Gender-based hate speech detection improves by 2.36 percentage points and other categories by 2.38 percentage points, with only 0.33% additional parameters.

Conclusion: Ontology-guided attention and graph neural networks strengthen RoBERTa on fine-grained demographic hate speech classification, improving robustness and interpretability while remaining efficient, which suits large-scale content moderation.

Abstract: Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.

[10] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

Ruobing Zheng,Tianqi Li,Jianing Li,Qingpei Guo,Yi Yuan,Jingdong Chen

Main category: cs.CL

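TL;DR: This paper proposes the Dual Tuning framework, which jointly fine-tunes on Chain-of-Thought (CoT) and Direct-Answer (DA) data to quantify reasoning gains, and defines a "Thinking Boundary" for judging when enabling reasoning is more effective in multimodal tasks, challenging the "reasoning-for-all" paradigm.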

Details

Motivation: The effectiveness of reasoning-enhanced large models in general multimodal scenarios is uncertain. The prevailing strategy of releasing parallel "Instruct" and "Thinking" models is resource-intensive, and there is no unified criterion for judging whether reasoning actually helps.

Method: Proposes the Dual Tuning framework: jointly fine-tune on paired CoT and DA data under controlled prompts; design new metrics that quantify the gains of the two training modes; construct the "Thinking Boundary" to assess the suitability of reasoning; further analyze the impact of reinforcement training and thinking patterns, and validate the boundary's use for guiding data refinement.

Result: Shows that reasoning gains can be systematically quantified and compared; establishes a "Thinking Boundary" for multimodal tasks spanning spatial, mathematical, and multi-disciplinary domains; finds that not all tasks benefit from reasoning; and validates that the boundary can guide data selection and refinement.

Conclusion: Reasoning is not a universally optimal strategy. The "Thinking Boundary" provides a practical criterion for choosing appropriate data and training strategies, and motivates resource-efficient, adaptive auto-think systems.

Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.

[11] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

Rabab Alkhalifa

Main category: cs.CL

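TL;DR: This paper proposes a reliability-aware weak supervision framework for framing detection in Arabic social media. A multi-agent LLM pipeline produces instance-level reliability estimates, which are combined with QUBO-based subset selection to improve data quality and balance.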

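Details

Motivation: Framing detection in Arabic social media faces interpretive ambiguity, cultural dependence, and a scarcity of labeled data, and existing LLM-based weak supervision methods are brittle when annotations are few and socially dependent.

Method: Designs a small multi-agent LLM pipeline with two framers, a critic, and a discriminator, which treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates; then applies QUBO optimization for subset selection, balancing frames while suppressing redundancy.

Result: The selected subsets show higher reliability in intrinsic diagnostics and in an out-of-domain Arabic sentiment transfer test, and encode non-random, transferable structure without degrading strong text-only baselines.

Conclusion: A reliability-aware weak supervision framework that focuses on data curation rather than label fusion can improve data quality and generalization for framing detection in low-resource, high-ambiguity settings.

Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.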
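One way to read "QUBO-based subset selection that enforces frame balance while reducing redundancy" is as minimizing a quadratic function of binary keep/drop variables. The sketch below is a hypothetical objective, not the paper's formulation: it rewards reliability, penalizes selecting similar pairs, and penalizes deviation from a target count per frame. The weights `lam` and `mu` and the brute-force solver (usable only for a handful of items) are assumptions.

```python
from itertools import product

def qubo_energy(x, reliability, similarity, frames, per_frame, lam=1.0, mu=2.0):
    """Energy of a binary selection vector x (lower is better)."""
    n = len(x)
    # Reward reliable instances (linear terms).
    energy = -sum(reliability[i] * x[i] for i in range(n))
    # Penalize selecting redundant (similar) pairs (quadratic terms).
    energy += lam * sum(similarity[i][j] * x[i] * x[j]
                        for i in range(n) for j in range(i + 1, n))
    # Penalize frames whose selected count deviates from the target (quadratic).
    for frame in set(frames):
        count = sum(x[i] for i in range(n) if frames[i] == frame)
        energy += mu * (count - per_frame) ** 2
    return energy

def select_subset(reliability, similarity, frames, per_frame, lam=1.0, mu=2.0):
    """Exhaustively search all 2^n selections; returns indices of kept items."""
    n = len(reliability)
    best = min(product((0, 1), repeat=n),
               key=lambda x: qubo_energy(x, reliability, similarity,
                                         frames, per_frame, lam, mu))
    return [i for i in range(n) if best[i]]

# Four hypothetical instances: items 0 and 1 are near-duplicates in frame "A".
reliability = [0.9, 0.8, 0.4, 0.7]
similarity = [[0, 0.95, 0, 0], [0.95, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
frames = ["A", "A", "B", "B"]
kept = select_subset(reliability, similarity, frames, per_frame=1)
```

With these toy numbers the search keeps the most reliable item from each frame and drops the near-duplicate. For realistic sizes a QUBO solver or annealer would replace the exhaustive search.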

[12] Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

Fiona Lau

Main category: cs.CL

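TL;DR: This study systematically evaluates the stability of numerical scores from five widely used large language models (GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, Claude-Sonnet-4.5) acting as automated evaluators (LLM-as-a-judge). Scores fluctuate substantially even at temperature=0, especially on the "completeness" dimension; models differ clearly in scoring style and strictness; lowering temperature improves stability only for some models (e.g., GPT-4o, Gemini) and has limited effect for Anthropic models. The results caution that enterprises relying on LLM scores for routing, quality control, and similar critical workflows need stronger monitoring and human-LLM collaboration.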

Details

Motivation: Although LLM-as-a-judge is widely used in research and enterprise settings, the consistency of its numerical scores (whether repeated scoring of the same input is stable, and whether different models give comparable results) has not been examined systematically, even though it is critical for fairness, reproducibility, and reliability in production.

Method: On question-answer pairs from a real enterprise RAG system, runs repeated scoring experiments with 5 mainstream LLMs at two temperature settings (including 0); quantifies within-model run-to-run stability, cross-model score differences, and the effect of temperature, with a focus on fluctuation patterns in dimensions such as completeness and relevance.

Result: All models still show substantial score variability at temperature=0, with "completeness" the least stable dimension. Cross-model scores show systematic differences (for example in strictness and interpretive style). Lowering temperature improves stability for GPT-4o and Gemini but has weak or inconsistent effects for the Claude models.

Conclusion: Scoring stability of LLM-as-a-judge cannot be assumed by default. Enterprise applications need continuous monitoring, robust parsing, and hybrid human-LLM evaluation to keep critical decision workflows reliable and fair.

Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models (GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5), two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control.
Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
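The monitoring recommended above can start from something as simple as summarizing repeated scores for the same input. The sketch below is one possible stability summary. The judge names, the 1-5 scale, and the choice of standard deviation and range as statistics are assumptions for illustration, not the study's protocol.

```python
from statistics import mean, pstdev

def stability_report(scores_by_model):
    """Per-model mean, population standard deviation, and range of repeated scores."""
    report = {}
    for model, scores in scores_by_model.items():
        report[model] = {
            "mean": mean(scores),
            "std": pstdev(scores),
            "range": max(scores) - min(scores),
        }
    return report

# Hypothetical repeated scores for one answer on a 1-5 completeness scale.
runs = {
    "judge_a": [4, 4, 4, 4, 4],
    "judge_b": [3, 5, 4, 2, 4],
}
report = stability_report(runs)
```

A nonzero spread for a judge run repeatedly on an identical input is the within-model instability the study measures, and a gap between the per-model means is the cross-model difference in strictness.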

[13] Context-Dependent Affordance Computation in Vision-Language Models

Murad Farzulla

Main category: cs.CL

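TL;DR: This large-scale computational study finds that vision-language models (VLMs) are highly context-dependent when inferring scene affordances: 90% of the output at the lexical level and 58.5% at the semantic level varies with context. It identifies two stable latent factors (a "Culinary Manifold" and an "Access Axis") and proposes a new paradigm for robotics, just-in-time ontological projection (JIT Ontology).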

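Details

Motivation: To understand how vision-language models infer scene affordances under different contexts (such as different agent perspectives), in order to reveal their cognitive mechanisms and guide modeling for embodied intelligence such as robotics.

Method: Builds 3,213 scene-context pairs from COCO-2017 and runs systematic context-priming experiments with Qwen-VL 30B and LLaVA-1.5-13B under 7 agentic persona prompts; quantifies the effects at multiple levels using Jaccard similarity, cosine similarity, stochastic baseline tests, and Tucker decomposition with bootstrap stability analysis.

Result: Finds substantial affordance drift: mean lexical Jaccard similarity is only 0.095 (>90% context-dependent) and mean semantic cosine similarity is 0.415 (58.5% context-dependent). Stochastic baselines confirm that the drift is not generation noise. Tucker decomposition identifies stable orthogonal latent factors, the "Culinary Manifold" and the "Access Axis".

Conclusion: Affordance computation in VLMs is strongly context-dependent. The gap between lexical and semantic dependence indicates that surface vocabulary is perturbed by context more easily than underlying meaning. Robotics should move toward dynamic, query-driven JIT ontological projection rather than static world modeling. No claim is made about internal processing order or architectural primacy.

Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts.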
These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
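The lexical drift figure follows directly from the definition of Jaccard similarity, and the reported percentages are one minus the mean similarity. A minimal sketch follows. The affordance word sets are invented for illustration; only the 0.095 and 0.415 means come from the abstract.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty descriptions are identical
    return len(a & b) / len(a | b)

def context_dependence(mean_similarity):
    """Share of the description that changes with context: 1 - mean similarity."""
    return 1.0 - mean_similarity

# Hypothetical affordance terms for the same kitchen scene under two personas.
chef_view = {"cut", "cook", "sit"}
child_view = {"sit", "climb", "hide"}
overlap = jaccard(chef_view, child_view)  # 1 shared term out of 5 distinct terms

lexical = context_dependence(0.095)   # reported mean Jaccard similarity
semantic = context_dependence(0.415)  # reported mean cosine similarity
```

Applying the same subtraction to both reported means gives 0.905 for the lexical measure and 0.585 for the semantic one, matching the ">90%" and "58.5%" figures in the abstract.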

[14] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

Grace Chang Yuan,Xiaoman Zhang,Sung Eun Kim,Pranav Rajpurkar

Main category: cs.CL

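TL;DR: This paper studies multi-agent large language model (LLM) systems for clinical diagnosis, comparing single-model, single-vendor multi-agent, and mixed-vendor multi-agent (Mixed-Vendor MAC) frameworks. Experiments show that mixed-vendor configurations markedly improve diagnostic accuracy and recall, because they combine the inductive biases of different models and cover each other's blind spots, making the system more robust.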

Details

Motivation: Existing multi-agent systems for clinical diagnosis mostly rely on models from a single vendor, which risks correlated failures caused by shared biases. The paper explores vendor diversity as a route to more robust diagnosis.

Method: Builds and compares three frameworks (Single-LLM, Single-Vendor, and Mixed-Vendor MAC), using o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet from three vendors as doctor agents; evaluates diagnostic performance on RareBench and DiagnosisArena; and runs an overlap analysis to expose the complementarity mechanism.

Result: Mixed-Vendor MAC reaches state-of-the-art recall and accuracy. Overlap analysis confirms that it pools the inductive biases of different models and surfaces correct diagnoses that single models or single-vendor teams miss.

Conclusion: Vendor diversity is a key design principle for robust multi-agent clinical diagnostic systems.

Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

[15] Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation

Gürsel Akdeniz,Emin Cagatay Nakilcioglu

Main category: cs.CL

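TL;DR: This paper proposes a compliance-aware Self-Instruct method that conforms to the IMO SMCP for generating a high-quality, diverse, and realistic dataset of maritime VHF radio dialogues, using a 26-filter verification pipeline and LoRA fine-tuning to ensure accuracy and practicality.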

Details

Motivation: VHF radio communication still carries a serious risk of miscommunication in maritime operations, driven mainly by human factors, noise, linguistic variability, and the lack of real-time transcription. At the same time, high-quality real maritime data is extremely scarce because of regulatory and privacy constraints.

Method: Proposes a compliance-aware Self-Instruct method with an integrated 26-filter verification pipeline (enforcing entity accuracy, absence of hallucination, SMCP compliance, logical consistency, and linguistic diversity); uses LoRA for parameter-efficient fine-tuning; and builds a new evaluation framework that combines automated and expert assessment (Format Accuracy, Information Accuracy, Uniqueness, Logical Coherence).

Result: Experiments on publicly available vessel, coastal, and AIS datasets show that the generated dialogues are synthetically diverse, procedurally compliant, and operationally realistic. The code, datasets, and verification tools have been released.

Conclusion: The method provides a reproducible, high-fidelity, standards-compliant data foundation for AI-assisted maritime safety, and can be extended to other safety-critical domains.

Abstract: VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues.
Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.

[16] What Is Missing: Interpretable Ratings for Large Language Model Outputs

Nicholas Stranges,Yimin Yang

Main category: cs.CL

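TL;DR: This paper proposes What Is Missing (WIM), a new preference rating system driven by natural-language feedback. It uses an embedding model to compute the semantic similarity between a model output and the "missing content" identified by a human or LLM judge, producing continuous, highly discriminative ratings that improve the quality of the preference learning signal while remaining interpretable.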

Details Motivation: Existing LLM preference learning relies on subjective direct rankings or single numerical ratings, which struggle to reflect the true quality of natural-language outputs and lack granularity and interpretability. Method: Proposes the WIM rating system: a human or LLM judge writes natural-language feedback describing what the model output is missing; a sentence embedding model encodes the output and the feedback separately, and their cosine similarity serves as the scalar rating; the rating plugs into existing preference learning pipelines (e.g., DPO, PPO). Result: Experiments show that, compared with discrete numerical ratings, WIM reduces ties and increases rating deltas, strengthening the learning signal in pairwise preference data; each rating can also be traced back to the specific missing-content feedback text, supporting qualitative debugging. Conclusion: WIM is a general, plug-and-play, interpretable preference rating method that can improve data quality and training stability in LLM preference learning and applies to existing preference optimization algorithms. Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use "interpretable" in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
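
The rating computation described in this entry can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `cosine_similarity` and `wim_rating` are hypothetical, and the short vectors stand in for the output of a sentence embedding model, which the abstract does not name.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def wim_rating(output_embedding, feedback_embedding):
    """Scalar rating: similarity between the output and the 'what is missing' feedback."""
    return cosine_similarity(output_embedding, feedback_embedding)


# Toy vectors standing in for sentence embeddings (assumption: a real system
# would obtain these from a sentence embedding model).
output_a, feedback_a = [0.2, 0.9, 0.1], [0.1, 0.8, 0.3]
output_b, feedback_b = [0.7, 0.1, 0.6], [0.1, 0.9, 0.2]

rating_a = wim_rating(output_a, feedback_a)
rating_b = wim_rating(output_b, feedback_b)
# The rating delta is the quantity a pairwise preference learner would consume.
print(rating_a, rating_b, rating_a - rating_b)
```

Because the rating is continuous, two outputs rarely receive exactly the same value, which is consistent with the abstract's observation of fewer ties. How ratings map to a preference direction is not specified in the abstract and is left open here.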

[17] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science

Zonglin Yang,Runze Mao,Tianhao Wu,Han Li,QingGuo Zhou,Zhi X. Chen

Main category: cs.CL

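TL;DR: This paper presents the first end-to-end framework for building large language models for combustion science, comprising a 3.5-billion-token multimodal knowledge base, the 436-question CombustionQA benchmark, and a three-stage knowledge-injection pathway from RAG to knowledge-graph-enhanced retrieval to continued pretraining. The study finds that naive RAG hits a performance ceiling (60%) because of context contamination, and that structured knowledge graphs and continued pretraining are needed to go beyond it.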

Details Motivation: To advance the specialization of foundation LLMs for combustion science and fill the field's lack of AI-ready knowledge resources and dedicated evaluation benchmarks. Method: Builds a large-scale AI-ready multimodal knowledge base; designs the CombustionQA benchmark; proposes a three-stage knowledge-injection pathway: (1) lightweight RAG, (2) knowledge-graph-enhanced retrieval, and (3) continued pretraining, with quantitative validation of Stage 1. Result: Stage 1 (naive RAG) accuracy peaks at 60%, well above zero-shot performance (23%) but far below the theoretical upper bound (87%); performance is limited by context contamination; Stages 2 and 3 (knowledge graphs and continued pretraining) are argued to be necessary for building a domain foundation model. Conclusion: Naive RAG alone is insufficient for building a combustion science foundation model; structured knowledge representations (such as knowledge graphs) and continued pretraining must be combined to break the performance ceiling and obtain a genuinely domain-adapted model. Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).

[18] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Wai Tuck Wong,Jun Sun,Arunesh Sinha

Main category: cs.CL

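TL;DR: This paper presents a new failure mode for multimodal large language models (MLLMs): by optimizing a loss function designed to aggravate numerical instability at inference time, it generates adversarial images that significantly degrade model performance, in a way that differs from conventional adversarial perturbations.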

Details Motivation: As MLLMs become widely used, studying their failure mechanisms is essential; existing adversarial attacks focus mainly on input perturbations, whereas this paper targets a new, indirect failure path caused by numerical instability. Method: Designs and optimizes a loss function whose objective is to maximize numerical instability in the inference stage and uses it to generate adversarial images; validates on several state-of-the-art multimodal models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) and standard multimodal benchmarks (Flickr30k, MMVet, and others). Result: Very small changes to the image suffice to degrade performance significantly; large drops are observed across multiple datasets and models, supporting the effectiveness and generality of this failure mode. Conclusion: Reveals a failure vector fundamentally different from conventional adversarial perturbations, highlights the role of numerical stability in evaluating MLLM robustness, and points to a new direction for safety and robustness research. Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.

[19] Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam

Michael Majurski,Cynthia Matuszek

Main category: cs.CL

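TL;DR: This paper studies how the clarity of question phrasing and the quality of background information affect language model answer accuracy. It finds that combining dynamic context construction (e.g., RAG) with question rewriting significantly improves accuracy, and that the gain requires separate phases (rewrite first, then answer) and cannot be fully recovered through inference-time prompting alone.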

Details Motivation: Question clarity and context quality strongly affect language model performance, but how the two interact remains under-explored. Method: Rewrites user questions to reduce ambiguity with the support of answer-free background information, combined with dynamic context construction such as RAG; compares different ways of injecting context (e.g., prepending context vs. rewriting the question) and the effect of separating the rewriting and answering phases. Result: On a subset of Humanity's Last Exam, rewriting questions with gpt-oss-20b raises gpt-5-mini accuracy from 0.14 to 0.37; the gain cannot be fully recovered through inference-time prompting alone and requires explicitly separate rewriting and answering phases. Conclusion: Question rewriting is an effective means of improving language model accuracy, especially when paired with high-quality, answer-free background information; its effect exceeds simple context concatenation, underscoring the importance of query optimization itself. Abstract: How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using gpt-oss-20b to rewrite a subset of Humanity's Last Exam using answer-free grounding context improves gpt-5-mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
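
The two-phase setup in this entry (rewrite first, then answer) can be contrasted with simple context prepending in a short sketch. This is a hedged illustration only: `rewriter` and `answerer` are hypothetical callables standing in for the two models, and the prompt wording is an assumption, not taken from the paper or its repository.

```python
from typing import Callable


def answer_with_prepended_context(question: str, context: str,
                                  answerer: Callable[[str], str]) -> str:
    """Baseline: a single phase, with the grounding context placed before the question."""
    return answerer(f"{context}\n\n{question}")


def rewrite_then_answer(question: str, context: str,
                        rewriter: Callable[[str], str],
                        answerer: Callable[[str], str]) -> str:
    """Two phases: disambiguate the question using the context, then answer it."""
    rewrite_prompt = (
        "Rewrite the question so it is unambiguous, using the background below. "
        "Do not answer it.\n\n"
        f"Background:\n{context}\n\nQuestion:\n{question}"
    )
    rewritten_question = rewriter(rewrite_prompt)
    # Assumption: the answering model sees only the rewritten question.
    return answerer(rewritten_question)


# Stub models so the sketch runs without any API (assumption: real use would
# call two language models here).
def stub_rewriter(prompt: str) -> str:
    return "What is the boiling point of water at 1 atm, in degrees Celsius?"


def stub_answerer(prompt: str) -> str:
    return f"ANSWER({prompt})"


print(rewrite_then_answer("What is its boiling point?",
                          "The passage discusses water at 1 atm.",
                          stub_rewriter, stub_answerer))
```

Keeping the rewriter and answerer as separate calls mirrors the abstract's claim that distinct phases are required; whether the answering model should also receive the context is not stated there.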

[20] From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models

Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

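TL;DR: This paper proposes a unified definition of streaming large language models (streaming LLMs), builds a systematic taxonomy, and analyzes their methods, applications, and future research directions.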

Details Motivation: Existing definitions of streaming LLMs are fragmented and conflate concepts (streaming generation, streaming inputs, and interactive architectures), and there is no systematic taxonomy, which hinders progress in dynamic, real-time scenarios. Method: Establishes a unified definition based on data flow and dynamic interaction; proposes a systematic taxonomy on that basis; analyzes the methods in each category; surveys real-world applications; identifies future research directions; and maintains an open repository of papers. Result: Clarifies the core meaning of streaming LLMs; produces a structured taxonomy; distinguishes the key technical approaches; summarizes typical deployment scenarios; and proposes an extensible research roadmap. Conclusion: Streaming LLMs should be defined as an LLM paradigm that supports continuous data inflow and dynamic response; a systematic taxonomy helps advance both theory and engineering practice; further work is needed on architecture design, efficiency optimization, and human-AI collaboration. Abstract: Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

[21] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Lei Huang,Xiang Cheng,Chenxiao Zhao,Guobin Shen,Junjie Yang,Xiaocheng Feng,Yuxuan Gu,Xing Yu,Bing Qin

Main category: cs.CL

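TL;DR: This paper proposes GOLF, a framework that uses group-level natural language feedback (external critiques and intra-group attempts) to guide targeted exploration in reinforcement learning. It adaptively injects high-quality refinements as off-policy scaffolds to provide guidance in sparse-reward regions and jointly optimizes generation and refinement within a unified RL loop, improving sample efficiency by 2.2×.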

Details Motivation: Existing RL algorithms rely only on scalar rewards and cannot make full use of the rich natural language feedback available through interaction, leading to inefficient exploration. Method: GOLF aggregates two kinds of group-level language feedback: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. The feedback is aggregated into high-quality refinements that are injected into training as off-policy scaffolds, and generation and refinement are jointly optimized within a unified RL loop. Result: On both verifiable and non-verifiable benchmarks, GOLF shows better performance and exploration efficiency, with a 2.2× improvement in sample efficiency over RL methods trained on scalar rewards alone. Conclusion: Explicitly modeling and exploiting group-level natural language feedback can markedly improve targeted exploration and training efficiency for RL agents in sparse-reward settings, and GOLF offers a practical and effective framework for doing so. Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.

[22] Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models

Xin Chen,Saili Uday Gadgil,Jiarong Qiu

Main category: cs.CL

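TL;DR: This paper proposes a retrieval-augmented generation method that combines semantic alignment with evidence constraints, modeling the retrieval and generation stages in a coordinated way to improve factual consistency and verifiability.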

Details Motivation: Existing retrieval-augmented generation methods suffer from semantic misalignment between retrieved results and generation objectives, and from insufficient use of evidence. Method: Models the relevance between queries and candidate evidence in a unified semantic space, and introduces an explicit evidence constraint mechanism that turns retrieved evidence into a core control factor in generation. Result: Stable improvements across multiple generation quality metrics, with better factual reliability and verifiability while preserving fluency. Conclusion: Coordinated modeling of semantic alignment and evidence constraints is effective and necessary for improving retrieval-augmented generation. Abstract: Retrieval-augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval-augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval-augmented generation tasks.

[23] iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Preetam Prabhu Srikar Dammu,Arnav Palkhiwala,Tanya Roosta,Chirag Shah

Main category: cs.CL

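TL;DR: This paper presents iAgentBench, a dynamic open-domain QA benchmark focused on integrating evidence from multiple sources. It emphasizes cross-source sensemaking (such as tracking causal links and resolving dependencies), derives questions from realistic user intent, and ships each instance with traceable evidence and intermediate artifacts. Experiments show that retrieval alone is not enough to solve these questions and that evidence use must be evaluated.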

Details Motivation: Most existing QA benchmarks can be answered by retrieving a single passage and therefore cannot measure cross-source integration (integrating evidence, tracking causal chains, resolving dependencies across facets of a topic), whereas real search-enabled generative QA systems must handle more complex, higher-level information needs. Method: Builds iAgentBench: selects seed topics from real-world attention signals and uses common user intent patterns to generate natural questions that require combining evidence from multiple sources; each instance comes with traceable source evidence and auditable intermediate artifacts (supporting contamination checks and diagnosis of failures in retrieval versus synthesis). Result: Experiments on multiple LLMs show that adding retrieval improves accuracy but does not reliably resolve iAgentBench questions, exposing weaknesses in how current systems use evidence, not just how they obtain it. Conclusion: New benchmarks and evaluation paradigms should assess how models use multi-source evidence, not only whether they can access it; iAgentBench provides a reproducible, diagnosable testbed for this. Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.

[24] Stan: An LLM-based thermodynamics course assistant

Eric M. Furst,Vasudevan Venkateshwaran

Main category: cs.CL

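TL;DR: This paper presents Stan, a system that uses locally deployed open-weight models (Whisper and Llama 3.1 8B) to build an educational AI toolchain serving both students and instructors: on the student side, RAG provides textbook-grounded question answering; on the instructor side, teaching insights (such as points of confusion and analogies) are generated automatically from lecture transcripts. Everything runs offline, with controlled privacy and reproducibility.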

Details Motivation: Research on AI in education focuses mostly on student-facing tools and overlooks the potential to support instructors; in addition, solutions that depend on cloud APIs raise privacy, cost, and reproducibility concerns. Method: Builds a data pipeline on local open-weight models (Whisper large-v3 for speech-to-text, Llama 3.1 8B for structured extraction and generation) that processes lecture transcripts and a textbook index together; the student side uses retrieval-augmented generation (RAG) to answer natural-language questions with textbook references; the instructor side uses structured analysis pipelines to extract teaching features (points of confusion, analogies, questions, and so on); the design specifically addresses context truncation, bimodal outputs, and schema drift in long-text processing. Result: Stan was deployed on local hardware and supports both roles in an undergraduate chemical engineering thermodynamics course: students get grounded answers with textbook references; instructors get a searchable, semester-scale record for teaching reflection; all components run fully offline, ensuring data privacy, predictable costs, and reproducibility. Conclusion: Shared data and model infrastructure can support both student learning and instructor practice; local, open-weight, structured processing is a key path to deploying educational AI; real deployments need to handle the typical failure modes of LLMs on structured tasks over long texts. Abstract: Discussions of AI in education focus predominantly on student-facing tools (chatbots, tutors, and problem generators), while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material, providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7 to 8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.

[25] Optimizing Language Models for Crosslingual Knowledge Consistency

Tianyu Liu,Jirui Qi,Mrinmaya Sachan,Ryan Cotterell,Raquel Fernández,Arianna Bisazza

Main category: cs.CL

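TL;DR: This paper proposes Direct Consistency Optimization (DCO), which uses reinforcement learning with a structured reward function to improve the consistency of multilingual LLMs' answers across languages. It needs no explicit reward model and significantly outperforms existing methods in a range of experimental settings.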

Details Motivation: LLMs often exhibit inconsistent knowledge in multilingual settings, giving contradictory answers to the same question asked in different languages, which undermines their reliability. Method: Proposes DCO (Direct Consistency Optimization), a DPO-inspired reinforcement learning method that requires no explicit reward model and optimizes crosslingual response consistency through a structured reward function. Result: DCO significantly improves crosslingual consistency across multiple LLMs and outperforms existing methods; it also performs well in bilingual settings, out-of-domain generalization, and controllable alignment. Conclusion: DCO is a robust and efficient new method for improving crosslingual knowledge consistency in multilingual LLMs, and all code and benchmarks have been released. Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.

[26] Non-Zipfian Distribution of Stopwords and Subset Selection Models

Wentian Li,Oscar Fontanelli

Main category: cs.CL

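TL;DR: This paper proposes a rank-based stopword selection model that uses a Hill function to describe the probability of a word being selected as a stopword, and explains theoretically why the rank-frequency distributions of stopwords and non-stopwords follow a Beta Rank Function (BRF) and a quadratic function in log-log scale, respectively.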

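Details Motivation: Conventional stopword identification relies on curated lists or statistical thresholds and lacks a model of the systematic regularities of stopwords within the word-frequency distribution; this paper starts from the difference in rank-frequency distributions to build an interpretable, testable probabilistic selection model. Method: Based on the observation that the rank-frequency distribution of stopwords is better fitted by the BRF while that of non-stopwords is better fitted by a quadratic function of log-token-count over log-rank, proposes a selection probability model in the form of a Hill function of word rank $r$; validates it by direct estimation from an independent text collection, and shows analytically that, combined with Zipf's law, it reproduces the BRF distribution. Result: (1) Confirms that the stopword selection probability follows a decreasing Hill function; (2) shows that under a Zipfian full word list the model leads to a BRF distribution; (3) explains why the rank-frequency distribution of non-stopwords deviates from Zipf's law and fits a quadratic function in log-log scale. Conclusion: Stopwords are not a random subset; their selection in rank space follows a specific mechanism of the kind used in biology and physics (a Hill function), which gives a unified explanation of the differing statistical distributions of stopwords and non-stopwords and offers a new approach to stopword modeling. Abstract: Stopwords are words that are not very informative about the content or meaning of a text. Most stopwords are function words, but they can also be common verbs, adjectives, and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by the BRF. Based on the observed ranks of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank $r$, is a decreasing Hill function ($1/(1+(r/r_{mid})^{\gamma})$), whereas the probability of not being selected is the standard Hill function ($1/(1+(r_{mid}/r)^{\gamma})$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, and that it explains the quadratic fitting function for the non-stopwords.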
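
The two selection probabilities in this abstract are complementary, which is easy to check numerically. The sketch below evaluates both Hill functions; the parameter values `r_mid = 100` and `gamma = 1.5` are illustrative assumptions, not values fitted in the paper.

```python
def p_selected(rank, r_mid, gamma):
    """Decreasing Hill function: probability that the word at this rank is a stopword."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)


def p_not_selected(rank, r_mid, gamma):
    """Standard (increasing) Hill function: probability that the word is not a stopword."""
    return 1.0 / (1.0 + (r_mid / rank) ** gamma)


# Illustrative parameters (assumption): r_mid is the rank at which the
# selection probability is 1/2; gamma controls how sharp the transition is.
R_MID, GAMMA = 100.0, 1.5

for rank in (1, 10, 100, 1000):
    ps = p_selected(rank, R_MID, GAMMA)
    pn = p_not_selected(rank, R_MID, GAMMA)
    print(rank, round(ps, 4), round(pn, 4), round(ps + pn, 4))
```

The two expressions sum to 1 at every rank, since $1/(1+x) + 1/(1+1/x) = 1$ with $x = (r/r_{mid})^{\gamma}$, and both equal 1/2 at $r = r_{mid}$.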

[27] Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement

Brian Jing Hong Nge,Stefan Su,Thanh Thi Nguyen,Campbell Wilson,Alexandra Phelan,Naomi Pfitzner

Main category: cs.CL

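TL;DR: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers (such as Delta TF-IDF) with several transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) on multiple datasets, and analyzing the effect of SMOTE, weighted loss, POS tagging, and text augmentation. gpt-oss-20b performs best overall, while Delta TF-IDF with data augmentation reaches 98.2% accuracy on the Stormfront dataset; implicit hate speech is harder to detect, and the benefit of enhancement depends on the combination of dataset, model, and technique.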

Details Motivation: To improve the accuracy and robustness of hate speech detection, particularly given the difficulty of identifying implicit hate speech, class imbalance, and limited generalization, and to explore how different enhancement strategies fit different model architectures. Method: Systematically compares a traditional feature engineering method (Delta TF-IDF) with several open models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b), evaluating SMOTE oversampling, loss weighting by inverse class frequency, POS feature injection, and text data augmentation on multiple hate speech datasets. Result: gpt-oss-20b achieves the best results on most metrics; Delta TF-IDF with data augmentation reaches 98.2% accuracy on Stormfront; implicit hate speech detection is markedly weaker than explicit; the effect of each technique depends strongly on dataset, model type, and task setup. Conclusion: Hate speech detection performance depends not on a single model or technique but on the interplay of dataset properties, model architecture, and enhancement strategy; a fitting approach should be chosen per scenario instead of assuming a universal best. Abstract: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
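
Of the techniques listed in this entry, the weighted loss is the simplest to illustrate. The sketch below derives class weights from inverse class proportions and applies them to a negative log-likelihood; the exact weighting formula and any normalization used in the paper are not given in the abstract, so the unnormalized form `N / n_c` here is an assumption.

```python
import math
from collections import Counter


def inverse_proportion_weights(labels):
    """Weight for class c = 1 / (share of c in the data) = N / n_c."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}


def weighted_nll(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs[i] maps each class to its predicted probability for example i.
    """
    weighted_losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(weighted_losses) / sum(weights[y] for y in labels)


# Toy imbalanced data: 3 'neutral' examples and 1 'hate' example (labels and
# probabilities are hypothetical, chosen only for illustration).
labels = ["neutral", "neutral", "neutral", "hate"]
weights = inverse_proportion_weights(labels)
probs = [
    {"neutral": 0.9, "hate": 0.1},
    {"neutral": 0.8, "hate": 0.2},
    {"neutral": 0.7, "hate": 0.3},
    {"neutral": 0.6, "hate": 0.4},  # the minority example is misclassified
]
print(weights)
print(weighted_nll(probs, labels, weights))
```

With these weights the single misclassified minority example contributes as much to the loss as all majority examples combined, which is the intended effect of inverse-proportion weighting.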

[28] Detection of Illicit Content on Online Marketplaces using Large Language Models

Quoc Khoa Tran,Thanh Thi Nguyen,Campbell Wilson

Main category: cs.CL

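TL;DR: This study examines how effective large language models (LLMs) such as Llama 3.2 and Gemma 3 are at detecting illicit marketplace content in multiple languages, and finds that Llama 3.2 significantly outperforms traditional models on an imbalanced 40-class classification task.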

Details Motivation: Traditional content moderation (manual review, rule-based systems, conventional machine learning) falls short on scalability, dynamic obfuscation, and multilingual content, and struggles with the semantic complexity and linguistic nuance of illicit marketplace communications. Method: Using the multilingual DUTA10K dataset, fine-tunes Llama 3.2 and Gemma 3 with Parameter-Efficient Fine-Tuning (PEFT) and quantization, and benchmarks them systematically against BERT, SVM, and Naive Bayes baselines. Result: In binary classification, Llama 3.2 performs comparably to traditional methods; in imbalanced 40-class classification, it significantly surpasses all baselines. Conclusion: LLMs (Llama 3.2 in particular) offer a more effective, scalable, and adaptive solution for illicit content detection, with practical value for law enforcement agencies, e-commerce platforms, and cybersecurity specialists. Abstract: Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.

[29] AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

Kylie Zhang,Nimra Nadeem,Lucia Zheng,Dominik Stammbach,Peter Henderson

Main category: cs.CL

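TL;DR: This paper examines how well AI models simulate justices' questioning in U.S. Supreme Court oral arguments, proposes a two-layer evaluation framework measuring the realism and pedagogical usefulness of simulated questions, and finds that although AI-generated questions do well on realism and coverage of legal issues, they still suffer from low diversity of question types and sycophancy.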

Details Motivation: Attorneys and law students need effective simulation tools to prepare for oral argument, but it is unclear whether current AI models can accurately simulate judges' questioning. Method: Using a dataset of U.S. Supreme Court oral argument transcripts, constructs and evaluates prompt-based and agentic oral argument simulators, and proposes a two-layer evaluation framework covering realism and pedagogical usefulness. Result: Human annotators often perceive AI-generated questions as realistic, and the questions achieve high recall of substantive legal issues; however, they show low diversity in question types and sycophancy, shortcomings that naive evaluation approaches would not detect. Conclusion: AI can play a supporting role in oral argument simulation, but finer-grained evaluation and model improvements are needed to overcome current limitations, especially in question diversity and neutrality. Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.

[30] IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Bosi Wen,Yilin Niu,Cunxiang Wang,Xiaoying Ling,Ying Zhang,Pei Ke,Hongning Wang,Minlie Huang

Main category: cs.CL

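TL;DR: This paper proposes IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction following that builds response preference graphs to enable listwise evaluation, more accurately reflecting the role judge models play in alignment optimization.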

Details Motivation: The reliability of current judge models on instruction following has not been adequately validated, mainly because existing meta-evaluation benchmarks have insufficient data coverage and rely on simple pairwise comparison, which does not match real model optimization scenarios. Method: Builds the IF-RewardBench benchmark covering diverse instruction and constraint types; for each instruction, constructs a complete preference graph over multiple responses based on instruction-following quality, supporting listwise evaluation instead of conventional pairwise comparison. Result: Experiments on IF-RewardBench reveal significant deficiencies in current judge models; the benchmark correlates more strongly and positively with downstream task performance than existing benchmarks. Conclusion: IF-RewardBench provides a more reliable and more informative meta-evaluation framework for instruction following, supporting the development of judge models and alignment methods. Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
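
The preference-graph idea in this entry can be sketched as follows. This is an illustrative sketch, not the benchmark's actual scoring code: the gold quality scores, the judge scores, and the `pairwise_agreement` metric are all hypothetical, and the benchmark's real listwise metrics are not described in the abstract.

```python
from itertools import combinations


def build_preference_graph(quality_scores):
    """Return directed edges (better, worse) for every pair with a strict preference."""
    edges = set()
    for i, j in combinations(range(len(quality_scores)), 2):
        if quality_scores[i] > quality_scores[j]:
            edges.add((i, j))
        elif quality_scores[j] > quality_scores[i]:
            edges.add((j, i))
    return edges


def pairwise_agreement(gold_edges, judge_scores):
    """Fraction of gold preference edges that the judge's scores order the same way."""
    if not gold_edges:
        raise ValueError("no strict preferences to evaluate")
    agreed = sum(1 for better, worse in gold_edges
                 if judge_scores[better] > judge_scores[worse])
    return agreed / len(gold_edges)


# Three responses to one instruction (all numbers are made up for illustration).
gold_quality = [3, 2, 1]          # response 0 follows the instruction best
judge_scores = [0.9, 0.1, 0.5]    # the judge misorders responses 1 and 2

gold_edges = build_preference_graph(gold_quality)
print(sorted(gold_edges))
print(pairwise_agreement(gold_edges, judge_scores))
```

Scoring a judge against the whole graph, instead of one sampled pair, is what lets a single misordering among several responses show up in the result.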

[31] Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Wei Han,Pan Zhou,Shuicheng Yan

Main category: cs.CL

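TL;DR: This paper proposes the SharedLLM framework, which uses multi-grained context compression and query-aware information acquisition to substantially extend the context length of large language models while training only on short sequences, enabling efficient long-text modeling.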

Details Motivation: The context window of current LLMs is limited, and continual pre-training on long-context data is prohibitively expensive, so a low-cost, efficient approach to context extension is needed. Method: Proposes SharedLLM, built from two stacked short-context LLMs derived from the same underlying LLM layers: a lower model acting as a compressor and an upper model acting as a decoder; uses a "self-injection" mechanism (information is transferred only at the lowest layers) and a tree-based data structure that supports query-aware retrieval. Result: Trained only on 8K-token sequences, SharedLLM generalizes to contexts beyond 128K tokens, performs better than or comparably to strong baselines on several long-context benchmarks, and achieves inference speedups of 2× over streaming and 3× over encoder-decoder architectures with a substantially smaller memory footprint. Conclusion: SharedLLM is an efficient, lightweight, and scalable approach to long-context modeling that balances performance, speed, and resource use. Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).

[32] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

Yebo Wu,Feng Liu,Ziwei Xie,Zhiyuan Liu,Changwang Zhang,Jun Wang,Li Li

Main category: cs.CL

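TL;DR: This paper proposes TSEmbed, a framework that combines MoE with LoRA to resolve multi-task conflict and introduces Expert-Aware Negative Sampling (EANS) to sharpen embedding discriminability, reaching state-of-the-art performance on multiple benchmarks and industrial datasets.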

Details Motivation: Multimodal Large Language Models (MLLMs) have strong reasoning capabilities, but task conflict makes them hard to adapt into universal multimodal embedding models. Method: Proposes the TSEmbed framework: 1) it fuses Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives; 2) it designs Expert-Aware Negative Sampling (EANS), which uses expert routing distributions as a proxy for semantic similarity and dynamically selects hard negatives that share expert activation patterns; 3) it adopts a two-stage learning paradigm that first solidifies expert specialization and then optimizes the embedding representations via EANS. Result: Achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets. Conclusion: TSEmbed lays a foundation for task-level scaling in universal multimodal embeddings, mitigating task conflict and improving embedding quality and discriminability. Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.

[33] Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation

Edward Zhang

Main category: cs.CL

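TL;DR: This paper introduces the concept of the Attention Gravitational Field (AGF), optimizes LLM architecture by decoupling positional encodings from semantic embeddings to improve accuracy, and reports an empirical consistency between AGF and Newton's Law of Universal Gravitation.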

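Details Motivation: To explore the underlying principles of positional relationships and encodings in large language models and to improve model interpretability and performance. Method: Introduces the concept of the Attention Gravitational Field (AGF), decouples positional encodings from semantic embeddings, and carries out theoretical analysis and empirical validation. Result: Achieves higher accuracy than existing positional encoding methods; AGF is consistent with learning and stability curves and aligns empirically with Newton's Law of Universal Gravitation. Conclusion: The work offers a rigorous theoretical basis for understanding the attention mechanism and advances research on model optimization and interpretability. Abstract: This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.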

[34] Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Natchanon Pollertlam,Witchayut Kornsuwannawit

Main category: cs.CL

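TL;DR: This paper compares long-context LLMs with a fact-based memory system (the Mem0 framework) on accuracy and API cost for persistent conversational AI. The former achieves higher recall on some benchmarks, while the latter performs comparably on persona-consistency tasks and costs less over long-running interactions.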

Details Motivation: Persistent conversational AI must trade off long-context LLMs against dedicated memory systems, but a quantitative comparison of their accuracy and cost has been lacking. Method: On three memory-oriented benchmarks (LongMemEval, LoCoMo, and PersonaMemv2), compares the factual recall accuracy of long-context GPT-5-mini with that of a Mem0 fact-based memory system, and builds a fine-grained API cost model that includes prompt caching to analyze how the cost of each varies with the number of interaction turns and the context length. Result: The long-context model achieves higher recall on LongMemEval and LoCoMo; Mem0 is competitive on PersonaMemv2; at a 100k-token context, Mem0 becomes cheaper after about 10 turns, and the break-even point arrives earlier as the context grows. Conclusion: The two architectures present a clear accuracy-cost trade-off; production deployments should choose between them based on task type (for example, whether it depends on stable factual attributes), the expected number of interaction turns, and the context size. Abstract: Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
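
The structural difference between the two cost profiles can be illustrated with a minimal sketch. Every number below is a hypothetical placeholder rather than a figure from the paper: the cached per-token price, the one-time write cost, and the per-turn read cost are assumptions chosen only to show how a break-even turn arises, and the function names are illustrative.

```python
def long_context_cost(turns, context_tokens, cached_price_per_token):
    """Cumulative cost when the full history is re-read (at a cached rate) on every turn."""
    return turns * context_tokens * cached_price_per_token


def memory_cost(turns, write_cost, read_cost_per_turn):
    """Cumulative cost of a one-time write phase plus a roughly fixed read per turn."""
    return write_cost + turns * read_cost_per_turn


def break_even_turn(context_tokens, cached_price_per_token,
                    write_cost, read_cost_per_turn, max_turns=10_000):
    """Return the first turn at which the memory system is strictly cheaper, or None."""
    for turns in range(1, max_turns + 1):
        if (memory_cost(turns, write_cost, read_cost_per_turn)
                < long_context_cost(turns, context_tokens, cached_price_per_token)):
            return turns
    return None


# Hypothetical prices in dollars (assumptions, not values from the paper).
PRICE = 1e-7   # cached input price per token
WRITE = 0.05   # one-time fact-extraction cost
READ = 0.002   # per-turn retrieval cost

for tokens in (50_000, 100_000, 200_000):
    print(tokens, break_even_turn(tokens, PRICE, WRITE, READ))
```

With the write cost held fixed, a longer context moves the break-even turn earlier, which matches the qualitative trend the abstract describes. The abstract's figure of roughly ten turns at 100k tokens comes from the authors' own cost model, not from this toy, and a realistic write cost would likely depend on how much history is ingested.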

[35] Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Michael Hardy

Main category: cs.CL

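TL;DR: This paper meta-analyzes 890 results from studies of LLM short-answer scoring and finds that scoring difficulty for human experts has no statistical association with LLM performance; decoder architectures score on average 0.37 QWK lower than encoders; tokenizer vocabulary size shows diminishing returns; and LLMs exhibit racial bias in high-stakes educational settings.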

Details Motivation: Automated short-answer scoring lags behind other large language model (LLM) applications, so a systematic assessment of its performance bottlenecks and influencing factors is needed. Method: Conducts a systematic review and meta-analysis of 890 results from LLM short-answer scoring studies, models the Quadratic Weighted Kappa (QWK) effect size with mixed-effects meta-regression, and analyzes the influence of variables such as model architecture, tokenizer, and task difficulty. Result: Scoring difficulty for humans does not affect LLM performance; decoder architectures average 0.37 QWK lower than encoders; tokenizer vocabulary size shows diminishing marginal returns; LLMs exhibit racial discrimination in high-stakes educational settings. Conclusion: LLM short-answer scoring is limited by inherent weaknesses of autoregressive modeling, which system design should address directly; current technology carries fairness risks and should be deployed cautiously in educational assessment. Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects meta-regression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
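
For readers unfamiliar with the metric behind the 0.37 gap, the following is a minimal sketch of the standard Quadratic Weighted Kappa between two raters. It is not the authors' code; the function name and the integer label encoding (0 to `num_classes - 1`) are assumptions made for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w_ij = (i - j)^2 / (K - 1)^2."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    if num_classes < 2:
        raise ValueError("need at least two score levels")
    n = len(rater_a)

    # Confusion matrix of observed score pairs.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms give the agreement expected by chance.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n
    if denominator == 0:
        raise ValueError("QWK is undefined when both raters use a single score level")
    return 1.0 - numerator / denominator


human = [0, 1, 2, 2, 3]
model = [0, 2, 2, 1, 3]
print(quadratic_weighted_kappa(human, model, num_classes=4))
```

Because disagreements are penalized by the squared distance between scores, a model that is off by one level loses far less agreement than one that is off by three.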

[36] From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Ruiqi Zhang,Lingxiang Wang,Hainan Zhang,Zhiming Zheng,Yanyan Lan

Main category: cs.CL

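TL;DR: This paper proposes GDS, a method that detects LLM pre-training data by analyzing gradient-deviation features of samples during training (update magnitude, update location, and concentration of neuron activation), substantially improving cross-dataset generalization and interpretability.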

Details Motivation: Existing pre-training data detection methods are susceptible to word-frequency bias or depend heavily on the similarity of fine-tuning data. This work takes an optimization perspective and exploits systematic differences in gradient behavior as samples move from unfamiliar to familiar during training. Method: Proposes GDS, which builds a gradient feature representation covering the magnitude, location, and neuron-activation concentration of parameter updates in FFN and Attention modules, and feeds it into a lightweight classifier for membership inference. Result: Achieves state-of-the-art performance on five public datasets, with cross-dataset transferability well above strong baselines, and the differences in gradient feature distributions are interpretable. Conclusion: GDS is a practical, scalable pre-training data detection method that generalizes well. Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.

[37] SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

Minduli Lasandi,Nevidu Jayatilleke

Main category: cs.CL

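TL;DR: SinhaLegal is a Sinhala legislative text corpus of about 2 million words across 1,206 Sri Lankan legal documents (Acts and Bills), built through OCR extraction, manual cleaning, and rich metadata annotation. A range of linguistic and model-based evaluations confirm its quality and domain fit, and it aims to fill the resource gap in Sinhala legal NLP research.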

Details Motivation: To fill the gap in high-quality, structured NLP corpora for the Sinhala legal domain and to support tasks such as legal text summarisation and information extraction. Method: Systematically collects Sinhala legal documents from official sources (1,065 Acts dated 1981 to 2014 and 141 Bills dated 2010 to 2014), extracts the text with Google Document AI OCR followed by extensive post-processing and manual proofreading, builds a machine-readable corpus with metadata, and analyzes corpus statistics, lexical diversity, NER, topic modelling, and the perplexity of language models of different sizes. Result: Produces the high-quality, domain-specific SinhaLegal corpus (2M words, 1,206 documents). The analysis shows clear lexical structure, rich named entities, and focused topics, and finds that existing language models perform poorly on text from this domain, which underscores the corpus's value for modelling. Conclusion: SinhaLegal is the first large-scale, high-precision Sinhala legal text corpus and provides key infrastructure for legal AI research and applications in Sri Lanka and for similar low-resource languages. Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
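
Two of the corpus diagnostics named above, lexical diversity and perplexity, have simple textbook forms that can be sketched as follows. The abstract does not say which lexical-diversity measure or which models were used, so the type-token ratio and the per-token log-probability input here are assumptions for illustration, and the sample tokens are placeholders rather than corpus text.

```python
import math


def type_token_ratio(tokens):
    """Lexical diversity as distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("token_log_probs must be non-empty")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Placeholder tokens and log-probabilities (not taken from the corpus).
tokens = ["act", "section", "act", "minister", "section", "act"]
log_probs = [math.log(0.25)] * len(tokens)

print(type_token_ratio(tokens))
print(perplexity(log_probs))
```

A higher perplexity on legal text than on general text would indicate that a model finds the domain harder to predict, which is the kind of signal the abstract's perplexity analysis is meant to surface.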

[38] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Yilin Jiang,Fei Tan,Xuanyu Yin,Jing Leng,Aimin Zhou

Main category: cs.CL

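TL;DR: This paper proposes HACHIMI, a framework for generating theory-aligned, distribution-controllable student personas (SPs) to support research on educational LLMs. It uses a multi-agent Propose-Validate-Revise pipeline that combines neuro-symbolic validation with stratified sampling to produce a million-scale, high-quality student persona dataset, and it validates effectiveness and a fidelity gradient through intrinsic and extrinsic evaluation.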

Details Motivation: Existing student persona methods mostly rely on ad-hoc prompting or hand-crafted profiles, lack grounding in educational theory and control over population distributions, and so cannot meet the need of educational LLMs for standardized, interpretable, and reproducible synthetic student populations. Method: Proposes the Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) paradigm and designs the HACHIMI multi-agent framework: Propose (generates a theory-anchored educational schema with Qwen2.5-72B), Validate (a neuro-symbolic validator enforces developmental and psychological constraints), and Revise (stratified sampling plus semantic deduplication to mitigate mode collapse); builds HACHIMI-1M (1 million K-12 student personas). Result: Intrinsic evaluation shows near-100% schema validity, accurate quotas, and high diversity. In extrinsic evaluation, student agents answering the CEPS and PISA 2022 questionnaires align strongly with humans on math and curiosity/growth dimensions but only moderately on classroom-climate and well-being dimensions, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B. Conclusion: HACHIMI gives educational LLMs the first theory-driven, distribution-controllable, and scalable student persona infrastructure, supports group-level benchmarking and social-science simulation, and reveals how well current educational LLMs model different psycho-educational constructs. Abstract: Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI

[39] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Yunfan Zhang,Yijie Bei,Jetashree Ravi,Pawel Garbacki

Main category: cs.CL

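TL;DR: This paper presents FireBench, an LLM instruction-following benchmark for enterprise and API settings. It covers six capability dimensions across applications such as information extraction, customer support, and coding agents, contains 2,400+ samples, evaluates 11 models, and is open-sourced to support model selection and improvement.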

Details Motivation: Existing instruction-following benchmarks mainly target the natural language generation constraints of chat assistants and cannot capture the strict requirements on output format, content restrictions, and procedure found in enterprise and API settings, so a benchmark closer to real enterprise use is needed. Method: Builds the FireBench benchmark from real-world enterprise and API usage patterns, covering six core capability dimensions (across applications such as information extraction, customer support, and coding agents) with more than 2,400 samples, and systematically evaluates 11 mainstream LLMs. Result: Reveals performance differences and typical failure modes of current LLMs on enterprise-grade instruction-following tasks; FireBench is open-sourced at fire-bench.com. Conclusion: FireBench fills the gap in instruction-following evaluation for enterprise deployment and offers a practical, extensible benchmark for model selection, diagnosis, and continued improvement. Abstract: Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.

[40] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Sean Lamont,Christian Walder,Paul Montague,Amir Dezfouli,Michael Norrish

Main category: cs.CL

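TL;DR: This paper proposes a training-free, low-cost intervention that sequentially repels intermediate samples from one another in feature space during Diffusion Language Model (DLM) sampling, markedly increasing generative diversity and thereby improving performance on Pass@$k$ tasks such as code generation and mathematical reasoning.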

Details Motivation: Both traditional autoregressive models and emerging diffusion language models still repeat the same failure modes when asked for diverse text, which wastes compute and is especially limiting in complex reasoning tasks that need coverage of the solution space. Method: During diffusion sampling, the intermediate samples in a batch are processed sequentially, and each new sample is explicitly repelled in latent space from the feature representations of the samples already generated, with no retraining or additional search strategy. Result: On the HumanEval and GSM8K benchmarks with the LLaDA-8B-Instruct model, the method significantly improves diversity and Pass@$k$ scores across temperatures at very low computational overhead. Conclusion: The method is a plug-and-play, training-free, lightweight sampling enhancement that applies broadly to current and future diffusion language models for tasks that need diverse solution search. Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
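
To make the Pass@$k$ metric concrete, here is a minimal sketch of the unbiased estimator commonly used in code-generation evaluation, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The abstract does not state which estimator the authors use, so treating this as their evaluation procedure is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement from n is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same per-sample accuracy budget, different k.
for k in (1, 5, 10):
    print(k, pass_at_k(n=20, c=2, k=k))
```

The estimator counts only whether each sample is correct, so diversity helps indirectly: if independent samples collapse onto the same failure mode, `c` stays low for hard problems, while more varied candidates raise the chance that at least one lands on a correct solution.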

[41] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

Arina Kostina,Marios Dikaiakos,Alejandro Porcel,Tassos Stassopoulos

Main category: cs.CL

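TL;DR: This study evaluates large language models (LLMs) on identifying the top three human values expressed in interviews under the Schwartz Theory of Basic Values. They approach expert level on set-based metrics but fall short on ranking accuracy and uncertainty modeling; Qwen performs best, ensemble methods give consistent gains, and the models show systematic bias toward certain values such as Security.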

Details Motivation: Although large language models (LLMs) could assist qualitative analysis, their ability to produce nuanced, reliable interpretations under inherent task ambiguity is unclear, so their suitability and limits on high-level interpretive tasks such as human value identification need systematic evaluation. Method: Using the Schwartz Theory of Basic Values, evaluates several LLMs on identifying the top three human values in long-form, open-ended interview texts; treats expert annotations as the gold standard and measures performance with F1, Jaccard, and RBO; analyzes the uncertainty patterns and value distributions of model outputs; and tests common ensemble methods such as Majority Vote and Borda Count. Result: LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but diverge from experts on exact ranking (lower RBO) and on uncertainty structure; Qwen comes closest to expert-level agreement and best matches the expert value distribution; ensemble methods (especially Majority Vote and Borda Count) yield consistent gains across metrics; models systematically overemphasize values such as Security. Conclusion: LLMs show promise as human collaborators in highly ambiguous qualitative value analysis but have key limitations, including weak ranking ability, miscalibrated uncertainty, and potential value bias, which calls for cautious use and further study of the mechanisms behind their value biases. Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
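
A Borda Count ensemble over ranked top-three predictions, scored against an expert set with Jaccard similarity, can be sketched as below. The point scheme (3, 2, 1 by position), the alphabetical tie-break, and the example rankings are all assumptions made for illustration; the abstract does not specify these details, and the rankings are not data from the study.

```python
from collections import defaultdict


def borda_top_k(rankings, k=3):
    """Aggregate ranked lists: position p in a list of length m earns m - p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, value in enumerate(ranking):
            scores[value] += len(ranking) - position
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    ordered = sorted(scores, key=lambda value: (-scores[value], value))
    return ordered[:k]


def jaccard(predicted, reference):
    """Set overlap: |intersection| / |union|."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    return len(predicted & reference) / len(union) if union else 1.0


# Hypothetical top-3 outputs from three models for one interview.
model_rankings = [
    ["Security", "Benevolence", "Achievement"],
    ["Benevolence", "Security", "Tradition"],
    ["Benevolence", "Achievement", "Security"],
]
expert_top3 = ["Benevolence", "Achievement", "Universalism"]

ensemble = borda_top_k(model_rankings)
print(ensemble, jaccard(ensemble, expert_top3))
```

Jaccard ignores order, which is why a model can score well on it while still doing poorly on a rank-sensitive measure such as RBO, the pattern the abstract reports.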

[42] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

Panagiotis Alexios Spanakis,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou

Main category: cs.CL

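TL;DR: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. By decoupling semantic reasoning from structural localization, with DD-CoT for extraction and an "Anti-Echo Chamber" architecture for detection, it substantially improves performance on both subtasks.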

Details Motivation: Traditional classifiers conflate semantic reasoning with structural localization, which leads to semantic ambiguity, character-level brittleness, and the "Reporter Trap" (misclassifying objective reporting as conspiracy endorsement) in psycholinguistic marker extraction and conspiracy endorsement detection. Method: Proposes Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring for marker extraction to reduce ambiguity and brittleness; for detection, builds an "Anti-Echo Chamber" architecture in which an adversarial Parallel Council is adjudicated by a Calibrated Judge, avoiding the "Reporter Trap." Result: Reaches 0.24 Macro F1 on subtask S1 (+100% over baseline), ranking 3rd on the development leaderboard, and 0.79 Macro F1 on subtask S2 (+49%). Conclusion: The approach establishes an interpretable, psycholinguistically grounded NLP paradigm that balances performance with theoretical interpretability. Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.

[43] AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis

Stavros Gazetas,Giorgos Filandrianos,Maria Lymperaiou,Paraskevi Tzouveli,Athanasios Voulodimos,Giorgos Stamou

Main category: cs.CL

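TL;DR: This paper presents the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on multilingual, multi-domain Dimensional Aspect-Based Sentiment Analysis (DimABSA), which covers three subtasks: DimASR, DimASTE, and DimASQP. The method combines fine-tuning of language-appropriate encoders with LoRA-based, language-specific instruction tuning of large language models, and experiments show the system beats the baselines in most settings.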

Details Motivation: To address multilingual, multi-domain Dimensional Aspect-Based Sentiment Analysis (DimABSA), which spans complementary problems including continuous sentiment prediction and structured triplet/quadruplet extraction, while remaining parameter-efficient and generalizing across languages and domains. Method: Combines fine-tuning of language-appropriate encoder backbones (for DimASR) with LoRA-based, language-specific instruction tuning of large language models (for DimASTE and DimASQP) in a unified yet task-adaptive, parameter-efficient framework. Result: The proposed models achieve competitive performance and consistently surpass the official baselines across most evaluation settings. Conclusion: The unified, parameter-efficient multi-task framework effectively supports multilingual, multi-domain dimensional aspect-based sentiment analysis, delivering strong performance with low training and inference cost. Abstract: In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.

[44] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

Mengze Hong,Yi Gu,Di Jiang,Hanlin Gu,Chen Jason Zhang,Lu Wang,Zhiyang Su

Main category: cs.CL

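TL;DR: This paper proposes match-and-merge, a new paradigm for merging heterogeneous language models (n-gram and neural LMs) in federated learning for hybrid ASR systems, with two algorithms: GMMA (genetic) and RMMA (reinforcement learning). Experiments show RMMA outperforms the baselines and GMMA in character error rate, generalization, and convergence speed.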

Details Motivation: In decentralized federated ASR training, acoustic models can be merged with established methods, but the language models used for N-best rescoring are hard to merge because n-gram and neural network models are heterogeneous, so a dedicated heterogeneous LM optimization strategy is needed. Method: Proposes the match-and-merge paradigm: 1) GMMA evolves and pairs heterogeneous LMs through genetic operations (selection, crossover, mutation); 2) RMMA models the matching and merging process with reinforcement learning for efficient policy search and fast convergence. Result: On seven OpenSLR datasets, RMMA achieves the lowest average CER and stronger generalization, and converges up to 7x faster than GMMA. Conclusion: The match-and-merge paradigm, RMMA in particular, offers an efficient and practical way to merge heterogeneous language models for scalable, privacy-preserving ASR systems. Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
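
The setting above, rescoring an N-best list with a language model and judging the result by Character Error Rate, can be sketched in a few lines. This is a generic illustration rather than GMMA or RMMA: the additive score combination, the `lm_weight` parameter, the toy LM scores, and the example hypotheses are all assumptions.

```python
def edit_distance(reference, hypothesis):
    """Character-level Levenshtein distance using a rolling row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rescore_nbest(nbest, lm_score, lm_weight=1.0):
    """Pick the hypothesis maximizing first-pass score + lm_weight * LM score."""
    return max(nbest, key=lambda item: item[1] + lm_weight * lm_score(item[0]))[0]


# Hypothetical N-best list of (text, first-pass log score) and a toy LM lookup.
nbest = [("wreck a nice beach", -0.9), ("recognise speech", -1.0)]
toy_lm = {"wreck a nice beach": -2.0, "recognise speech": -0.5}

best = rescore_nbest(nbest, lm_score=toy_lm.get)
print(best, cer("recognise speech", best))
```

In a federated setting each client would contribute its own LM, and the difficulty the abstract highlights is that an n-gram table and a neural network cannot simply be averaged the way two networks with identical architectures can.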

[45] LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

Jinwen Chen,Shuai Gong,Shiwen Zhang,Zheng Zhang,Yachao Zhao,Lingxiang Wang,Haibo Zhou,Yuan Zhan,Wei Lin,Hainan Zhang

Main category: cs.CL

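TL;DR: This paper proposes LocalSUG, an LLM-driven query suggestion framework for local-life service platforms. Through city-aware candidate mining, an improved GRPO algorithm, and quality-aware acceleration techniques, it addresses missing geographic grounding, exposure bias in preference optimization, and online latency, improving CTR and lowering the low/no-result rate.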

Details Motivation: Traditional multi-stage cascading systems rely on historical top queries and struggle with long-tail demand, while large language models (LLMs), despite strong semantic generalization, face three challenges in local-life services: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. Method: Proposes the LocalSUG framework: 1) a city-aware candidate mining strategy based on term co-occurrence to strengthen geographic grounding; 2) a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, plus a multi-objective reward mechanism that optimizes relevance and business metrics; 3) quality-aware beam acceleration and vocabulary pruning to reduce online latency. Result: Offline evaluation and large-scale online A/B testing show that LocalSUG raises click-through rate (CTR) by +0.35% and lowers the low/no-result rate by 2.56%. Conclusion: LocalSUG resolves key deployment bottlenecks for LLMs in local-life query suggestion, improving business metrics while preserving generation quality, which confirms its effectiveness and practicality in real-world use. Abstract: In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
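
The group-relative advantage at the core of standard GRPO, applied to a group of candidates scored with a multi-objective reward, can be sketched as follows. This is the generic formulation, not LocalSUG's variant: the abstract does not give the reward definition, so the linear combination, the `business_weight` parameter, and the example scores are assumptions.

```python
from statistics import mean, pstdev


def combined_reward(relevance, business, business_weight=0.5):
    """Hypothetical multi-objective reward: relevance plus a weighted business term."""
    return relevance + business_weight * business


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(reward - group_mean) / (group_std + eps) for reward in rewards]


# Hypothetical (relevance, business) scores for four candidates from one beam.
candidates = [(0.9, 0.2), (0.7, 0.8), (0.4, 0.1), (0.6, 0.6)]
rewards = [combined_reward(relevance, business) for relevance, business in candidates]

for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(round(reward, 3), round(advantage, 3))
```

Drawing the group from beam search rather than from random sampling means the candidates being compared during training resemble those produced at inference time, which is how the abstract frames the reduction in exposure bias.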

[46] Replaying pre-training data improves fine-tuning

Suhas Kotha,Percy Liang

Main category: cs.CL

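TL;DR: This paper finds that replaying generic data during domain fine-tuning (generic replay) not only prevents catastrophic forgetting but also markedly improves target-domain task performance, especially when target data is scarce.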

Details Motivation: In the current paradigm, generic data is mixed into fine-tuning only to prevent forgetting; the authors question this view and ask whether generic replay can also improve target performance. Method: In a controlled pre-training environment (4M target tokens, 4B total tokens, 150M-parameter models), systematically evaluates the effect of generic replay during fine-tuning and mid-training; further analyzes replay under different data schedules, such as introducing target data earlier; and finally validates the effect on 8B-parameter models. Result: Generic replay raises target data efficiency by up to 1.87x (fine-tuning) and 2.06x (mid-training); on 8B models, it improves web navigation success by 4.5% and Basque question-answering accuracy by 2%. Conclusion: Generic replay is a simple, effective way to improve the data efficiency of domain adaptation, particularly when target-domain data is scarce; its benefit is negatively correlated with the share of target data in pre-training. Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.

[47] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali,Myeongho Jeon,Maria Brbic

Main category: cs.CL

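TL;DR: This paper proposes Confidence-Weighted Preference Optimization (CW-PO), a preference alignment method weighted by a weak language model's confidence, which with only 20% of the human annotations outperforms standard DPO trained on all of them.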

Details Motivation: Existing preference alignment methods for large language models depend on costly human annotation or large API-based models, so a low-cost alternative is needed. Method: Uses a weak language model to score preference data and select its high-confidence samples, and proposes Confidence-Weighted Preference Optimization (CW-PO), a framework that can be applied across different preference optimization objectives. Result: CW-PO with only 20% of human annotations outperforms standard DPO trained with 100% of annotations; the weak-LLM-plus-confidence-weighting strategy lowers annotation cost while improving alignment performance. Conclusion: Weak language models combined with confidence weighting offer an efficient, low-cost, and high-performing approach to preference alignment. Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
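
One way to picture confidence re-weighting on top of DPO is the sketch below. The per-pair DPO loss follows the standard formulation; how CW-PO derives and applies the confidence is not specified in the abstract, so the multiplicative weights normalized by their sum, the function names, and the example numbers are assumptions rather than the authors' objective.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def confidence_weighted_loss(per_sample_losses, confidences):
    """Assumed weighting scheme: average of losses weighted by weak-LLM confidence."""
    if len(per_sample_losses) != len(confidences):
        raise ValueError("one confidence per sample is required")
    total_weight = sum(confidences)
    if total_weight <= 0:
        raise ValueError("at least one sample needs positive confidence")
    weighted = sum(w * loss for w, loss in zip(confidences, per_sample_losses))
    return weighted / total_weight


# Hypothetical sequence log-probabilities and weak-annotator confidences.
losses = [
    dpo_loss(-1.0, -5.0, -2.0, -2.0),   # policy already prefers the chosen response
    dpo_loss(-4.0, -1.0, -2.0, -2.0),   # policy prefers the rejected response
]
print(confidence_weighted_loss(losses, confidences=[0.9, 0.3]))
```

Down-weighting low-confidence pairs limits the influence of labels the weak annotator is likely to have gotten wrong, which is one plausible reading of why a confident subset can beat a larger but noisier training signal.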

[48] MPCEval: A Benchmark for Multi-Party Conversation Generation

Minxing Zhang,Yi Yang,Zhuofan Jia,Xuan Yang,Jian Pei,Yuchen Zang,Xingwang Deng,Xianglong Chen

Main category: cs.CL

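TL;DR: This paper introduces MPCEval, an evaluation and benchmarking suite for multi-party conversation generation, designed to address the shortcomings of existing evaluation methods in multi-party settings.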

Details Motivation: Multi-party conversation generation (for example, smart reply and collaborative assistants) is increasingly important, but its evaluation remains a key bottleneck. Compared with two-party dialogue, multi-party settings add challenges such as complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Method: MPCEval decomposes generation quality into three dimensions (speaker modeling, content quality, and speaker-content consistency), explicitly distinguishes local next-turn prediction from global full-conversation generation, and provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. Result: Applied to several public and real-world datasets, MPCEval evaluates modern generation methods alongside human-authored conversations and reveals systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker-content consistency. Conclusion: Evaluation objectives strongly shape model assessment, and single-score evaluation hides fundamental differences in multi-party conversational behavior; MPCEval provides a finer-grained, task-aware evaluation framework. Abstract: Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker--content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker--content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.

[49] VRM: Teaching Reward Models to Understand Authentic Human Preferences

Biao Liu,Ning Xu,Junming Yang,Hao Xu,Xin Geng

Main category: cs.CL

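TL;DR: This paper proposes VRM (Variational Reward Modeling), a framework that introduces high-dimensional objective weights and low-dimensional semantic features as latent variables and models them with variational inference, so that the reward model more closely mirrors how humans judge preferences. The approach mitigates reward hacking, has a tighter theoretical generalization error bound, and outperforms existing methods in experiments.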

Details Motivation: Conventional reward models map prompt-response pairs directly to scalar scores and tend to capture spurious correlations rather than authentic human preferences. Human evaluators instead first weigh the importance of multiple objectives according to the prompt context and then assess response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Method: VRM (Variational Reward Modeling) treats high-dimensional objective weights and low-dimensional semantic features as latent variables and infers them jointly with variational inference; a theoretical analysis shows a tighter generalization error bound than the traditional reward model. Result: Experiments on several benchmark datasets show that VRM significantly outperforms existing reward modeling methods at capturing authentic human preferences. Conclusion: By explicitly modeling the cognitive structure of human evaluation, VRM mitigates reward hacking and improves how well reward models capture true preferences, with both theoretical guarantees and empirical gains. Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.

[50] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Trapoom Ukarapol,Nut Chukamphaeng,Kunat Pipatanakul,Pakhapoom Sarapat

Main category: cs.CL

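TL;DR: This paper introduces ThaiSafetyBench, an LLM safety benchmark for the Thai language and Thai culture that contains 1,954 malicious prompts written in Thai. Evaluating 24 LLMs on it shows that closed-source models are generally safer than open-source ones and that culturally specific attacks succeed more often. The authors also release ThaiSafetyClassifier, a lightweight Thai harmful-response classifier, and a public leaderboard.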

Details Motivation: LLM safety evaluation is largely centered on English and overlooks risks in non-English languages and cultural contexts; safety vulnerabilities in the Thai language and Thai culture in particular call for systematic study. Method: Build ThaiSafetyBench (1,954 malicious Thai prompts covering general attacks and attacks specific to Thai culture); evaluate 24 LLMs with GPT-4.1 and Gemini-2.5-Pro as judges; fine-tune a DeBERTa-based Thai harmful-response classifier, ThaiSafetyClassifier, and release its weights and training scripts; maintain a continuously updated ThaiSafetyBench leaderboard. Result: Closed-source models generally show stronger safety performance than open-source models on Thai; culturally contextualized Thai attacks have a consistently higher Attack Success Rate (ASR) than general Thai-language attacks; ThaiSafetyClassifier reaches a weighted F1 score of 84.4%, matching GPT-4.1 judgments. Conclusion: Current LLM safety alignment methods have serious gaps in non-English and especially culturally embedded settings; ThaiSafetyBench and its companion tools provide reproducible, low-cost, community-driven infrastructure for Thai AI safety research. Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard

[51] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation

Yifan Zhu,Guanting Chen,Bing Wei,Haoran Luo

Main category: cs.CL

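TL;DR: This paper proposes HiFlow, a framework that uses hierarchical feedback-driven optimization to handle complex constraints in long-form text generation, improving global structural consistency and local semantic coherence.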

Details Motivation: Large language models perform poorly on long text generation, especially under complex constraints, and existing methods struggle to coordinate global and local objectives. Method: HiFlow is a hierarchical feedback-driven optimization framework with a planning layer (global structure and constraint modeling) and a generation layer (conditioned text generation), combined with constraint-aware plan screening and closed-loop feedback at both levels. Result: Experiments on multiple backbone models show that HiFlow outperforms baseline methods. Conclusion: HiFlow jointly optimizes planning quality and generation behavior, progressively guiding the model toward high-quality long text that satisfies the constraints. Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.

[52] NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension

Rongzhi Li,Hitomi Yanaka

Main category: cs.CL

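TL;DR: This paper proposes NeuronMoE, which analyzes the cross-lingual diversity of language-specific neurons across transformer components to guide per-layer expert allocation, substantially reducing parameter count while maintaining performance.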

Details Motivation: Extending large language models to low-resource languages is essential for global accessibility, but training a separate model per language is too expensive; existing MoE methods allocate experts at the layer level and ignore language specificity at the level of individual neurons. Method: NeuronMoE analyzes language-specific neurons across all transformer components and uses empirically measured cross-lingual neuron diversity to guide expert allocation per layer. Result: Applied to Llama-3.2-3B for Greek, Turkish, and Hungarian, the method reduces parameters by about 40% on average while matching the LayerMoE baseline; low-resource language experts independently develop neuron specialization patterns that mirror the high-resource language and are concentrated in early and late layers. Conclusion: NeuronMoE points to potentially universal architectural principles in how multilingual models organize linguistic knowledge and suggests a new route to efficient modeling of low-resource languages. Abstract: Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose $\textbf{NeuronMoE}$, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.

[53] MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

Inayat Arshad,Fajar Saleem,Ijaz Hussain

Main category: cs.CL

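TL;DR: This paper proposes MUTEX, a framework that combines a multilingual transformer (XLM-RoBERTa) with conditional random fields (CRF) to perform fine-grained, token-level toxic span detection in Urdu text. It reaches 60% token-level F1 on multi-source social media data and establishes the first supervised baseline for the task.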

Details Motivation: Most existing systems rely on sentence-level classification and cannot locate the specific toxic spans; progress is further limited by the lack of token-level annotated Urdu data, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. Method: MUTEX jointly models text with the multilingual transformer XLM-RoBERTa and a CRF layer, is trained for sequence labeling on manually annotated token-level toxic spans, and is evaluated with token-level F1 on multi-domain data from social media, online news, and YouTube reviews. Result: MUTEX achieves 60% token-level F1 on Urdu toxic span detection, the first supervised baseline; the experiments indicate that transformer-based models capture contextual toxicity implicitly and cope better with code-switching and morphological variation. Conclusion: MUTEX supports the effectiveness of multilingual pretrained models combined with CRF for fine-grained toxicity detection in resource-poor languages and offers an interpretable approach to content safety for Urdu and other low-resource languages. Abstract: Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. The problem is further exacerbated by multiple factors, i.e., the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variations. In this research, we propose MUTEX, a framework for Urdu toxic span detection that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, establishing the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and are better able to address the issues of code-switching and morphological variation.
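As a reading aid for the metric named above, here is a minimal Python sketch of token-level F1 for a binary toxic/non-toxic labeling. The abstract does not say how the score is aggregated (per post, micro, or macro), so the single flat sequence, the label convention (1 = toxic), and the function name `token_f1` are assumptions for illustration, not the authors' evaluation script.

```python
from typing import Sequence


def token_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Token-level F1 for the toxic class (label 1) over aligned token labels.

    Assumption: labels are 0 (non-toxic) or 1 (toxic) and both sequences
    refer to the same tokenization.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correctly predicted toxic token
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for a six-token post.
gold = [0, 1, 1, 0, 0, 1]
pred = [0, 1, 0, 0, 1, 1]
score = token_f1(gold, pred)  # 2 true positives, 1 false positive, 1 false negative
```

Because every token counts, a prediction that finds the right span but over- or under-extends it by a token is penalized only partially, which is what makes this metric suitable for fine-grained span detection.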

[54] ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI

Jens Lehmann,Syeda Khushbakht,Nikoo Salehfard,Nur A Zarin Nishat,Dhananjay Bhandiwad,Andrei Aioanei,Sahar Vahdati

Main category: cs.CL

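TL;DR: This paper presents ARC-TGI, a framework for generating diverse ARC-AGI tasks that share a latent rule. It supports task-level constraints to keep the rule inferable and pairs each task with natural-language reasoning chains and partially evaluated Python code, making benchmarking more controllable and scalable.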

Details Motivation: Static ARC-AGI datasets are prone to overfitting, data leakage, and memorization, which makes it hard to measure a model's real ability at abstraction and rule induction. Method: ARC-TGI is an open-source framework of task-family generators built around a solver-facing representation: each task comes with natural-language input and transformation reasoning chains plus partially evaluated Python code. Task-level constraints ensure that the training examples collectively expose the latent rule, and all generators undergo human refinement and local verification to keep grids and reasoning traces natural and consistent. Result: The release contains 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable sampling and controlled benchmarking. Conclusion: ARC-TGI provides more reliable, scalable infrastructure for dynamic task generation and evaluation in research on few-shot abstraction and rule induction, in line with how humans solve these tasks. Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.

[55] Measuring the Redundancy of Decoder Layers in SpeechLLMs

Adel Moumen,Guangzhi Sun,Philip C Woodland

Main category: cs.CL

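TL;DR: This paper studies decoder redundancy in Speech Large Language Models (SpeechLLMs) and finds that it is largely inherited from the pretrained text LLM. Pruning experiments show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same redundant blocks appear across tasks (e.g., speech translation), languages, and speech encoders, which supports building a lightweight, general-purpose SpeechLLM backbone.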

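Details Motivation: The decoder accounts for over 90% of the parameters in a SpeechLLM, yet how much of that capacity speech tasks actually need is unclear, so the degree of decoder redundancy and its compressibility need to be examined. Method: Across two LLM families and three scales (1-8B), compare decoder-layer redundancy under text and speech inputs; quantify redundancy by systematically pruning decoder layers and analysing post-pruning recovery (healing); then extend the analysis to speech translation to test whether redundancy is consistent across encoders, tasks, and languages. Result: 7-8B SpeechLLMs retain good ASR performance with only 60% of decoder layers; smaller models tolerate less pruning; in speech translation the same blocks of decoder layers are redundant across settings, indicating a global redundancy structure. Conclusion: SpeechLLM decoders contain substantial, structured redundancy that stems mainly from the pretrained text LLM; because this redundancy is consistent across tasks, languages, and encoders, a single pruned, multi-task SpeechLLM backbone can be deployed. Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned, multi-task SpeechLLM backbone to be deployed.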

[56] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

Yewen Li,Zhiyi Lyu,Peng Jiang,Qingpeng Cai,Fei Pan,Bo An,Peng Jiang

Main category: cs.CL

TL;DR: 本文提出了一种分层大自动竞价模型(LBM),结合LLM的推理能力与数值决策能力,通过LBM-Think(推理)和LBM-Act(动作生成)协同优化自动竞价策略,并引入双嵌入机制和离线强化微调方法GQPO,显著提升泛化性与训练效率。

Details Motivation: 现有自动竞价方法依赖黑盒式离线强化学习或生成模型,存在行为反直觉、可解释性差、动态环境泛化弱等问题;而直接应用大语言模型(LLM)又易因缺乏领域知识和动作精度要求导致幻觉与次优决策。 Method: 提出分层Large autoBidding Model(LBM):高层LBM-Think负责策略推理,底层LBM-Act负责精确出价动作生成;设计双模态嵌入机制融合语言与数值输入以支持语言引导训练;提出离线强化微调算法GQPO,在不依赖仿真或线上 rollout 的前提下抑制LBM-Think幻觉并提升决策性能。 Result: 实验表明,基于LBM的生成式自动竞价模型在训练效率和跨场景泛化能力上均优于现有方法。 Conclusion: LBM通过分层架构与针对性训练技术,有效弥合了LLM强大推理能力与自动竞价任务对精确性、鲁棒性和可解释性的需求之间的鸿沟,为LLM赋能广告竞价提供了新范式。 Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. 
Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LLM-Think's hallucinations and enhancing decision-making performance without simulation or real-world rollout like previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in an efficient training manner and generalization ability.

[57] Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions

Theresa Elstner,Martin Potthast

Main category: cs.CL

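TL;DR: This paper proposes a new way to measure the fidelity of human representations in algorithmic decisions: it assesses whether a decision is reasonable by comparing the externally prescribed input representation with the person's own self-description, and it builds the first representation fidelity benchmark for a loan-granting scenario.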

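Details Motivation: Algorithmic decisions currently lack a mechanism for validating whether the representation of a person is reasonable, so it is hard to ensure that decisions rest on a truthful and appropriate understanding of the individual. Method: Define "representation fidelity" and measure it as the distance between the external input representation and the person's self-description; analyse the kinds of discrepancy and derive a generic typology of representation mismatches; build the Loan-Granting Self-Representations Corpus 2025 benchmark (30,000 synthetic self-descriptions with expert-annotated mismatch types). Result: A theoretical framework and a quantification route for representation fidelity, plus the first reproducible benchmark for evaluating it, which supports empirical checks of how well algorithmic decisions model the people they concern. Conclusion: Representation fidelity adds a new dimension to algorithmic fairness and explainability by requiring that decision systems respect and accurately reflect how individuals see themselves; the framework can be extended to other high-stakes decisions such as hiring and justice. Abstract: This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures if decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations, how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 consists of a large corpus of 30 000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.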

[58] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers

Ruichen Xu,Wenjing Yan,Ying-Jun Angela Zhang

Main category: cs.CL

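TL;DR: Through theoretical proofs and experiments, this paper explains how analogical reasoning emerges in large language models and shows that it depends on the alignment and geometry of entity representations.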

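Details Motivation: Existing evaluations conflate multiple reasoning types, which makes reasoning in large language models hard to understand; analogical reasoning therefore needs to be isolated and analysed on its own. Method: Combine theoretical analysis (proofs of three key results) with experiments on architectures of up to 1.5B parameters to study how analogical reasoning emerges in transformers and how it relates to representational geometry. Result: The theory shows that joint training enables analogical reasoning, that sequential training succeeds only under a specific curriculum order, and that two-hop reasoning requires explicit identity bridges in the training data; analogical and two-hop reasoning share one mechanism, namely property transfer through similar representations. The experiments validate the theoretical predictions. Conclusion: Analogical reasoning in transformers is not a black-box emergent behavior but is driven by an interpretable representation-alignment mechanism, and it is directly constrained by the structure of the training data and the geometry of the representation space. Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.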

[59] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Avni Mittal,Rauno Arike

Main category: cs.CL

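TL;DR: This paper introduces C2-Faith, a benchmark for how well large language models (LLMs) judge the faithfulness of chain-of-thought (CoT) reasoning along two dimensions, causality and coverage. Current frontier LLM judges perform inconsistently across tasks, are weak at localizing errors, and systematically overrate the coverage of incomplete reasoning.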

Details Motivation: LLMs are widely used as judges of chain-of-thought (CoT) reasoning, but it is unclear whether they can reliably assess the faithfulness of the reasoning process itself rather than only the plausibility of the answer. Method: Build the C2-Faith benchmark from PRM800K, using controlled perturbations to create examples with known causal errors (a single step replaced by an acausal variant) and coverage gaps (intermediate steps deleted at varying rates); define three tasks (binary causal detection, causal step localization, and coverage scoring) and evaluate three frontier LLM judges on them. Result: (1) Model rankings depend strongly on task framing, and no single judge dominates all settings; (2) all judges show a substantial gap between detecting an error and localizing it; (3) coverage scores for incomplete reasoning are systematically inflated. Conclusion: LLMs have clear limits as process-level judges: their reliability depends heavily on the evaluation task, and they are weakest at error localization and coverage scoring; the study gives empirical grounds and practical guidance for choosing an LLM judge. Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
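The two controlled perturbations described above can be pictured with a small Python sketch. Everything here is illustrative: the function names, the choice to always keep the first and last step, the seeded sampling, and the example steps are assumptions, and how the benchmark actually builds acausal variants or picks steps to delete is not stated in the abstract.

```python
import random
from typing import List, Tuple


def replace_step(steps: List[str], index: int, acausal: str) -> List[str]:
    """Causal perturbation: swap exactly one step for an acausal variant.

    The known error position is simply `index`.
    """
    perturbed = list(steps)
    perturbed[index] = acausal
    return perturbed


def delete_steps(steps: List[str], rate: float, seed: int = 0) -> Tuple[List[str], List[int]]:
    """Coverage perturbation: drop a fraction `rate` of intermediate steps.

    Assumption: the first and last steps are always kept.
    Returns the shortened chain and the sorted indices that were removed.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    inner = list(range(1, len(steps) - 1))
    k = round(rate * len(inner))
    dropped = set(random.Random(seed).sample(inner, k))
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, sorted(dropped)


# Hypothetical six-step chain of thought.
chain = ["x + 3 = 7", "subtract 3", "x = 4", "square x", "x^2 = 16", "answer: 16"]
broken = replace_step(chain, 3, "the sky is blue")
shortened, removed = delete_steps(chain, rate=0.5)
```

Recording the replaced index and the removed indices is what gives each perturbed example a reference label, so a judge's detection, localization, or coverage score can be checked against ground truth.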

[60] Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

Di Zhang,Xun Wu,Shaohan Huang,Yudong Wang,Hanyong Shao,Yingbo Hao,Zewen Chi,Li Dong,Ting Song,Yan Xia,Zhifang Sui,Furu Wei

Main category: cs.CL

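TL;DR: This paper proposes Sparse-BitNet, a framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification, and shows that low-bit models are naturally better suited to N:M sparsity, retaining performance while speeding up training and inference.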

Details Motivation: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising ways to make large language models more efficient, but they have largely been studied in isolation; this paper examines how they interact. Method: Sparse-BitNet is a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification with stable training, covering sparse pretraining and dense-to-sparse schedules, and uses a custom sparse tensor core for acceleration. Result: At the same sparsity level, 1.58-bit BitNet degrades less than full-precision baselines and tolerates higher structured sparsity before accuracy collapses; training and inference speed up by as much as 1.30X. Conclusion: Combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient large language models. Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
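To make the two ingredients concrete, the following Python sketch shows absmean ternary quantization (the scheme commonly associated with 1.58-bit BitNet) next to a magnitude-based N:M mask. It is a toy on a flat list of weights: the function names, the 2:4 pattern, computing the mask from full-precision magnitudes, and applying it after quantization are all assumptions for illustration and not a description of how Sparse-BitNet trains or schedules sparsity.

```python
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Absmean quantization to {-1, 0, +1}: scale by mean |w|, round, clip."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def nm_mask(weights: List[float], n: int = 2, m: int = 4) -> List[int]:
    """N:M mask: in every group of m consecutive weights keep the n largest by magnitude."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


# Hypothetical weight row with eight entries (two groups of four).
w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
q, scale = ternary_quantize(w)
mask = nm_mask(w, n=2, m=4)
sparse_q = [qi * mi for qi, mi in zip(q, mask)]
```

In this toy row the entries that ternary quantization already rounds to zero are exactly the ones the 2:4 mask removes, so masking changes nothing. That is only an illustration of the intuition behind the compatibility claim, not evidence for it.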

Kun Chen,Xianglei Liao,Kaixue Fei,Yi Xing,Xinrui Li

Main category: cs.CL

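TL;DR: This paper proposes a systematic, operational annotation framework for modeling the structure of legal argumentation in judicial decisions, covering proposition types, relation types, formal representation, visualization, and the annotation workflow.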

Details Motivation: To reveal the logical structure of judicial reasoning and to provide a reliable data foundation for the computational analysis of legal argumentation. Method: Grounded in theories of legal reasoning and argumentation, the framework defines four proposition types (general and specific normative propositions, general and specific factual propositions) and five relation types (support, attack, joint, match, identity), and specifies formal representation rules, visualization conventions, and a standardized annotation workflow. Result: A conceptually clear, formalized, and operational annotation guideline that supports graphical representation of complex argumentation patterns and consistency control. Conclusion: The guideline provides methodological support for large-scale analysis of judicial reasoning, legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis. Abstract: This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
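One way to picture the four proposition types and five relation types is as a small typed graph. The Python sketch below is a hypothetical data model: the class and field names, the encoding of a joint relation as several sources pointing at one target, the `find_dangling` check, and the example propositions are assumptions for illustration, and the guideline's own formal representation rules are not given in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class PropositionType(Enum):
    GENERAL_NORMATIVE = "general normative"
    SPECIFIC_NORMATIVE = "specific normative"
    GENERAL_FACTUAL = "general factual"
    SPECIFIC_FACTUAL = "specific factual"


class RelationType(Enum):
    SUPPORT = "support"
    ATTACK = "attack"
    JOINT = "joint"
    MATCH = "match"
    IDENTITY = "identity"


@dataclass(frozen=True)
class Proposition:
    pid: str
    ptype: PropositionType
    text: str


@dataclass(frozen=True)
class Relation:
    rtype: RelationType
    sources: Tuple[str, ...]  # assumption: one or more source proposition ids
    target: str


def find_dangling(props: List[Proposition], rels: List[Relation]) -> List[str]:
    """Return proposition ids that relations mention but no proposition defines."""
    known = {p.pid for p in props}
    missing: List[str] = []
    for rel in rels:
        for pid in (*rel.sources, rel.target):
            if pid not in known and pid not in missing:
                missing.append(pid)
    return missing


# Invented example: a norm, a case fact, and the conclusion they jointly ground.
props = [
    Proposition("n1", PropositionType.GENERAL_NORMATIVE, "A party in breach is liable for damages."),
    Proposition("f1", PropositionType.SPECIFIC_FACTUAL, "The defendant delivered three months late."),
    Proposition("n2", PropositionType.SPECIFIC_NORMATIVE, "The defendant is liable for damages."),
]
rels = [
    Relation(RelationType.MATCH, ("f1",), "n1"),
    Relation(RelationType.JOINT, ("n1", "f1"), "n2"),
]
dangling = find_dangling(props, rels)
```

A check like `find_dangling` is the simplest form of the consistency control the guideline calls for: it catches relations that point at propositions an annotator forgot to define.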

[62] Transducing Language Models

Vésteinn Snæbjarnarson,Samuel Kiegeland,Tianyu Liu,Reda Boumasmoud,Ryan Cotterell,Tim Vieira

Main category: cs.CL

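TL;DR: This paper presents a general framework for language models derived from deterministic string-to-string transformations, in particular finite-state transducers (FSTs). It supports marginalizing and conditioning on the transformed output without changing model parameters, which enables inference-time adaptation of pretrained language models.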

Details Motivation: The output format of a language model (e.g., subwords, bytes, DNA sequences) often differs from the format a downstream task needs (e.g., words, amino acids), so a deterministic transformation is required; prior work does not treat the transformed output as a new, fully functional language model. Method: Compose a language model with a finite-state transducer (FST) and design exact and approximate algorithms for marginalizing over source strings (summing the probability of all source strings that map to the same target string) and for conditioning on transformed outputs; analyse the theoretical properties and check computational feasibility. Result: The method is validated in three domains (tokens to bytes, tokens to words, and DNA to amino acids), showing inference-time adaptation of pretrained models without fine-tuning. Conclusion: Deterministic string transformations systematically induce new language models; FSTs are both expressive and computationally efficient as the transformation tool; the framework offers a unified, principled, and practical way to adapt language model output formats. Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. 
These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
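The induced distribution mentioned in the abstract is the pushforward $p_f(y) = \sum_{x : f(x) = y} p(x)$, and conditioning on a target gives $p(x \mid f(x) = y) = p(x) / p_f(y)$. The Python sketch below works this out by brute-force enumeration over a tiny, made-up source distribution. The function names, the toy probabilities, and the use of plain string concatenation as $f$ are assumptions; the paper's algorithms operate on finite-state transducers and do not enumerate source strings this way.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, TypeVar

X = TypeVar("X", bound=Hashable)
Y = TypeVar("Y", bound=Hashable)


def pushforward(source: Dict[X, float], f: Callable[[X], Y]) -> Dict[Y, float]:
    """Marginalize: sum the probability of every source item mapping to the same target."""
    target: Dict[Y, float] = defaultdict(float)
    for x, p in source.items():
        target[f(x)] += p
    return dict(target)


def condition(source: Dict[X, float], f: Callable[[X], Y], y: Y) -> Dict[X, float]:
    """Posterior over source items given that the transformed output equals y."""
    total = sum(p for x, p in source.items() if f(x) == y)
    if total == 0.0:
        raise ValueError("target has zero probability")
    return {x: p / total for x, p in source.items() if f(x) == y}


def detokenize(tokens: tuple) -> str:
    """Toy deterministic transformation: concatenate tokens into a character string."""
    return "".join(tokens)


# Hypothetical distribution over token sequences; two tokenizations spell "ab".
source = {("a", "b"): 0.3, ("ab",): 0.5, ("b",): 0.2}
target = pushforward(source, detokenize)
posterior = condition(source, detokenize, "ab")
```

Here the string "ab" collects probability from both tokenizations, which is the reason a token-level model cannot be read directly as a byte-level or word-level model without this marginalization.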

[63] Diffusion LLMs can think EoS-by-EoS

Sarah Breckner,Sebastian Schuster

Main category: cs.CL

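TL;DR: This paper argues that diffusion large language models (Diffusion LLMs) use end-of-sequence (EoS) tokens as a hidden "scratchpad" that strengthens complex reasoning. Experiments show that adding EoS tokens improves performance, and causal interventions indicate that EoS tokens carry key information about the problem.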

Details Motivation: Diffusion LLMs do better on complex reasoning tasks when the generation length is set far higher than necessary, so that the answer is padded with many EoS tokens; the authors set out to explain this counterintuitive effect and test whether EoS tokens serve a computational function. Method: Run controlled prompting experiments with the diffusion LLMs LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the Addition, Entity Tracking, and Sudoku tasks; then run a causal intervention that patches the hidden states of EoS tokens with those from a counterfactual generation and observe how the output changes. Result: Adding EoS tokens improves the models' reasoning; patching EoS hidden states frequently changes the output to the counterfactual, showing that these tokens carry information about the problem to solve. Conclusion: Diffusion LLMs do not merely pad with EoS tokens but "think EoS-by-EoS", using them as hidden computation space; this mechanism is presented as one reason for their strength on complex reasoning tasks relative to autoregressive models. Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.

[64] Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic

Sara Candussio,Gabriele Sarti,Gaia Saveri,Luca Bortolussi

Main category: cs.CL

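TL;DR: This paper presents a framework that distills the semantic geometry of formal specifications, such as Signal Temporal Logic (STL), into continuous neural representations. A teacher-student setup distills a symbolic robustness kernel into a Transformer encoder, producing efficient, invertible embeddings that preserve semantics.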

Details Motivation: Existing approaches rely either on symbolic kernels that are computationally expensive, anchor-dependent, and non-invertible, or on syntax-based neural embeddings that fail to capture semantic structure; a neural representation is needed that is both semantically faithful and efficient. Method: Use teacher-student distillation with the symbolic robustness kernel as teacher and a Transformer encoder as student, trained with a continuous, kernel-weighted geometric alignment objective in place of standard contrastive learning. Result: The learned embeddings are produced in a single forward pass, faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible for formula reconstruction, at a fraction of the kernel's runtime cost. Conclusion: The method enables efficient, scalable neuro-symbolic reasoning and overcomes the computational bottleneck of symbolic kernels without sacrificing semantic fidelity, while supporting invertible modeling. Abstract: We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels -- which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible -- or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel's logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.

[65] Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

Ofir Ben Shoham

Main category: cs.CL

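TL;DR: This paper proposes a vocabulary trimming method for draft models in speculative decoding. It substantially shrinks the vocabulary while maintaining sufficient token coverage, which lowers draft model latency and raises overall inference throughput.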

Details Motivation: The draft model is often the performance bottleneck in speculative decoding; the latency of its language modeling head grows with vocabulary size, creating a fundamental trade-off between vocabulary size and coverage/latency. Method: Draft vocabulary selection is cast as a coverage-latency trade-off optimization problem under a minimum coverage constraint, using architecture-aware FLOPs-based latency estimates and coverage statistics over assistant responses in the training data; a Tree-structured Parzen Estimator (TPE) optimizes the utility function to search the Pareto frontier efficiently. Result: Vocabularies can be reduced by up to 97% while keeping coverage high; domain-specific tasks see up to 16% lower latency and 20% higher throughput, and out-of-distribution tasks see throughput gains of up to 6.7%. Conclusion: Vocabulary trimming effectively alleviates the draft model bottleneck and significantly improves speculative decoding efficiency without sacrificing accuracy, particularly in domain-specific settings. Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
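The coverage and latency terms described in this abstract can be illustrated with a minimal sketch. It assumes a simple frequency-ranked cut in place of the paper's TPE search, and a `2 * hidden_size * vocab_size` FLOPs estimate for the language modeling head; the function names and the toy token stream are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def coverage_curve(token_ids):
    """Return (vocab_size, coverage) pairs when keeping the k most frequent tokens."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    curve, covered = [], 0
    for k, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        curve.append((k, covered / total))
    return curve


def smallest_vocab(token_ids, min_coverage):
    """Smallest frequency-ranked vocabulary that meets the coverage constraint."""
    for k, coverage in coverage_curve(token_ids):
        if coverage >= min_coverage:
            return k, coverage
    raise ValueError("min_coverage must be at most 1.0")


def lm_head_flops(hidden_size, vocab_size):
    """Assumed cost model: one hidden_size x vocab_size matrix-vector product per token."""
    return 2 * hidden_size * vocab_size


# Toy stream of assistant-response token ids (hypothetical data).
tokens = [1] * 6 + [2] * 3 + [3] * 1
k, coverage = smallest_vocab(tokens, min_coverage=0.9)
print(k, coverage, lm_head_flops(hidden_size=4, vocab_size=k))
```

A frequency cut only gives one point per coverage target; the paper instead searches the coverage-latency frontier with a utility function, which this sketch does not attempt to reproduce.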

[66] VietJobs: A Vietnamese Job Advertisement Dataset

Hieu Pham Dinh,Hung Nguyen Huy,Mo El-Haj

Main category: cs.CL

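TL;DR: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, with 48,092 postings and over 15 million words from all 34 provinces and municipalities of Vietnam, covering 16 occupational domains and multiple employment types. The paper builds two benchmark tasks on the dataset, job category classification and salary prediction, and evaluates several generative large language models, revealing the challenges that Vietnamese and multilingual modelling face in structured labour market prediction.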

Details Motivation: The lack of a large-scale, high-quality Vietnamese job advertisement dataset that is regionally and socio-economically representative has held back research in Vietnamese NLP and labour market analysis. Method: The VietJobs corpus is constructed with multi-dimensional structured fields (job title, category, salary, skills, employment conditions, and others), and two benchmark tasks, job category classification and salary estimation, are defined; instruction-tuned large models such as Qwen2.5 and Llama-SEA-LION are evaluated under few-shot and fine-tuned settings. Result: Instruction-tuned models (e.g. Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT) perform well on both tasks, but the results still expose significant challenges for Vietnamese and multilingual modelling in structured prediction. Conclusion: VietJobs fills the gap in Vietnamese job advertisement corpora, establishes a new benchmark for Vietnamese NLP, and provides a solid foundation for research on recruitment language, socio-economic representation, and AI-driven labour market analysis. Abstract: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.

[67] Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh

Mohammad Mamun Or Rashid

Main category: cs.CL

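TL;DR: This paper presents the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal corpus of Bangladesh's ethnic and indigenous languages. It covers 42 language varieties and contains structured text plus approximately 107 hours of transcribed audio, with the aim of supporting endangered language documentation, low-resource NLP, and digital preservation.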

Details Motivation: Bangladesh is home to approximately 40 minority languages from four language families, 14 of which are classified as endangered; yet it has long lacked a systematic, cross-family digital corpus, especially since these languages are predominantly oral and computationally "zero resource". Method: Fieldwork over 90 days across nine districts, involving 16 data collectors, 77 speakers, and 43 validators, elicited 2224 items following a three-level template (lexical items, grammatical constructions, directed speech); 10 linguists then produced IPA transcriptions, with independent adjudication by 6 reviewers; the data are published on the multiling.cloud platform. Result: A multimodal corpus of 85792 structured textual entries (each with a Bengali stimulus text, an English translation, and an IPA transcription) and approximately 107 hours of annotated audio, covering 42 language varieties (including two unclassified languages), all publicly accessible. Conclusion: The corpus fills a gap in digital resources for multilingual endangered languages in South Asia, and offers important infrastructure and a methodological model for linguistic documentation, low-resource natural language processing, and the digital preservation of linguistic diversity in developing countries. Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers.
The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.

[68] Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Qiao Jin,Yin Fang,Lauren He,Yifan Yang,Guangzhi Xiong,Zhizheng Wang,Nicholas Wan,Joey Chan,Donald C. Comeau,Robert Leaman,Charalampos S. Floudas,Aidong Zhang,Michael F. Chiang,Yifan Peng,Zhiyong Lu

Main category: cs.CL

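TL;DR: This paper presents Med-V1, a small language model with only 3 billion parameters designed for biomedical evidence attribution and claim verification. It substantially outperforms its base models on several biomedical benchmarks and performs comparably to frontier models such as GPT-5, while providing high-quality explanations. The study also uses Med-V1 in two first-of-their-kind applications: quantifying hallucinations in LLM-generated answers and automatically identifying high-stakes evidence misattributions in clinical guidelines.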

Details Motivation: Frontier large language models such as GPT-5 can be used for claim verification and hallucination detection, but they are too expensive to deploy; a lightweight, efficient, and accurate alternative is needed for biomedical evidence attribution and verification. Method: The Med-V1 family of small language models (3B parameters) is trained on high-quality synthetic data newly developed in this study; five biomedical benchmarks are unified into a verification format for evaluation; and two real-world use case studies are conducted: quantifying LLM hallucination and identifying evidence misattribution in clinical guidelines. Result: Med-V1 improves on its base models by 27.0% to 71.3% across the five biomedical benchmarks, performs comparably to GPT-5, and produces high-quality explanations; the study shows for the first time how different citation instructions affect LLM hallucination rates, and identifies high-stakes evidence misattributions in clinical guidelines that could harm public health. Conclusion: Med-V1 is an efficient, accurate, and explainable lightweight model that offers a practical alternative to frontier LLMs for biomedical evidence attribution and verification. Abstract: Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks.
Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

[69] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Mohammad Javad Ranjbar Kalahroodi,Heshaam Faili,Azadeh Shakery

Main category: cs.CL

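TL;DR: This paper introduces the PersianPunc dataset and a lightweight ParsBERT-based punctuation restoration method that significantly improves the readability and usefulness of Persian ASR output.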

Details Motivation: Punctuation restoration is essential for the readability of automatic speech recognition (ASR) output and for downstream applications, yet it remains seriously under-studied for Persian. Method: PersianPunc, a high-quality Persian punctuation restoration dataset of 17 million samples, is constructed; the task is formulated as token-level sequence labeling, and ParsBERT is fine-tuned for efficient restoration. Result: The method reaches a macro-averaged F1 score of 91.33% on the test set, outperforming large language models at lower computational cost and without over-correction. Conclusion: The work provides an open dataset and model for Persian NLP, and a scalable punctuation restoration framework for other morphologically rich, low-resource languages. Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
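For readers unfamiliar with the metric, the following sketch shows how a macro-averaged F1 score can be computed for token-level punctuation labels. The label names (`O`, `COMMA`, `PERIOD`) and the toy sequences are hypothetical, and whether the paper includes the no-punctuation class in its average is not stated in the abstract; this version averages over every label that appears.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# One label per token: the punctuation mark that follows it, or "O" for none.
gold = ["O", "O", "COMMA", "PERIOD"]
pred = ["O", "COMMA", "COMMA", "PERIOD"]
print(round(macro_f1(gold, pred), 4))
```

Because every label counts equally, rare punctuation marks influence the macro average as much as the dominant no-punctuation class, which is why the metric is a common choice for this kind of imbalanced labeling task.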

[70] A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes

Stefan Bott,Verena Riegler,Horacio Saggion,Almudena Rascón Alcaina,Nouran Khallaf

Main category: cs.CL

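TL;DR: This paper presents a high-quality Easy-to-Read (E2R) text simplification corpus for Spanish, Catalan, and Italian, built to support research on automatic text simplification and, in particular, to fill the data gap for less-resourced languages in this area.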

Details Motivation: High-quality training and evaluation data for text simplification is scarce in less-resourced languages such as Spanish, Catalan, and Italian; the corpus addresses this in order to support research on Easy-to-Read language in the context of democratic participation. Method: Within the iDEM project, original texts related to democratic participation were collected across several text types and manually simplified to Easy-to-Read (E2R) level by experts in text simplification, with attention to copyright and ethical standards. Result: The first annotated Easy-to-Read corpus for Catalan, together with scarce high-quality, human-annotated language resources for Spanish and Italian. Conclusion: The corpus fills a data gap in Easy-to-Read text simplification for less-resourced languages, has significant academic and practical value, and will be made freely available to the public. Abstract: Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high quality simplification produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularly valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.

[71] Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

Carlos Carvalho,Francisco Teixeira,Thomas Rolland,Alberto Abad

Main category: cs.CL

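TL;DR: This paper studies model merging for multi-domain automatic speech recognition (ASR), benchmarking 11 merging algorithms on 10 European Portuguese domains. It also proposes a new algorithm, BoostedTSV-M, which mitigates rank collapse through singular-value boosting and improves numerical stability, outperforming full fine-tuning while preserving out-of-distribution generalisation.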

Details Motivation: Large speech foundation models are usually adapted through domain-specific fine-tuning, which produces multiple customised checkpoints; repeating full fine-tuning whenever new data arrives is computationally prohibitive, so a scalable alternative, model merging, is needed. Method: 11 model merging algorithms are benchmarked on 10 European Portuguese ASR domains, evaluating in-domain accuracy, robustness under distribution shift, and English and multilingual performance; a new algorithm based on TSV-M, BoostedTSV-M, introduces singular-value boosting to mitigate rank collapse and improve numerical stability. Result: BoostedTSV-M outperforms full fine-tuning overall on European Portuguese while preserving out-of-distribution generalisation (e.g. English and multilingual performance). Conclusion: Model merging is an effective and efficient alternative for multi-domain adaptation of large speech models; BoostedTSV-M improves merging quality and stability, offering a better option for practical deployment. Abstract: Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.

[72] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

Mohammad Mahdi Moradi,Sudhir Mudur

Main category: cs.CL

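TL;DR: This paper proposes DiSCTT, a framework that adaptively improves the reasoning performance of large language models at test time by allocating optimization strategies according to instance-level epistemic uncertainty, estimated from agreement among reasoning trajectories. High-consensus inputs are handled with supervised fine-tuning (using majority-agreed solutions as pseudo-labels), while low-consensus inputs are handled with consensus-regularized reinforcement learning that balances diversity against relevance constraints. The method clearly outperforms existing test-time adaptation baselines on multiple mathematical and general reasoning benchmarks, with higher accuracy, lower variance, and less computation.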

Details Motivation: Existing test-time adaptation methods apply a uniform optimization objective to all inputs, which copes poorly with heterogeneous reasoning tasks and leads to inefficient or unstable adaptation. Method: DiSCTT is a difficulty-aware, consensus-guided self-curriculum framework; it estimates instance-level epistemic uncertainty from agreement among sampled reasoning paths; high-consensus samples are fine-tuned with supervision using majority-agreed solutions as pseudo-labels; low-consensus samples are optimized with consensus-regularized reinforcement learning that encourages diverse exploration under relevance constraints. Result: Across multiple mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, with higher accuracy, lower variance, and substantially lower computation and wall-clock training time. Conclusion: Explicitly modelling instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning. Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
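The routing step described in this abstract can be sketched roughly as follows. The sketch assumes that consensus is measured as the share of sampled final answers that match the majority answer, and that a fixed threshold separates the two regimes; the function name, the `0.6` threshold, and the `"sft"`/`"rl"` tags are hypothetical choices for illustration, not details from the paper.

```python
from collections import Counter


def route_by_consensus(sampled_answers, threshold=0.6):
    """Pick a test-time strategy from agreement among sampled final answers.

    Returns (strategy, pseudo_label, agreement). The pseudo-label is only set
    for the high-consensus branch, where the majority answer is trusted.
    """
    counts = Counter(sampled_answers)
    majority_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(sampled_answers)
    if agreement >= threshold:
        return "sft", majority_answer, agreement
    return "rl", None, agreement


print(route_by_consensus(["42", "42", "42", "41"]))
print(route_by_consensus(["1", "2", "3", "1"]))
```

The first call lands in the supervised branch with the majority answer as pseudo-label, while the second falls below the threshold and is left for the reinforcement learning branch, whose consensus-regularized objective is not modelled here.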

[73] Progressive Residual Warmup for Language Model Pretraining

Tianhao Chen,Xin Xu,Lu Yin,Hao Chen,Yang Wang,Shizhe Diao,Can Yang

Main category: cs.CL

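TL;DR: This paper proposes Progressive Residual Warmup (ProRes), a new method for improving the stability and convergence speed of Transformer language model pretraining. It gradually increases the weight of each layer's residual connection from 0 to 1, so that shallow layers learn first and deeper layers join later, improving the optimization trajectory and downstream performance.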

Details Motivation: The Transformer architecture underlies modern large language models, so its pretraining stability and convergence speed are critical; motivated by the logical dependency between sequentially stacked layers, the authors let shallow layers stabilise first and bring deeper layers into learning gradually. Method: ProRes multiplies each layer's residual connection by a scalar coefficient that grows linearly from 0 to 1 with the training step, with longer warmup for deeper layers, giving a progressive "shallow layers first, deep layers later" training strategy. Result: Experiments show that ProRes improves pretraining stability, speeds up convergence, and strengthens generalisation and downstream task performance across model scales, normalization schemes, and initialization schemes. Conclusion: ProRes is a simple, effective, plug-and-play pretraining strategy that improves the training dynamics and final performance of Transformer models by progressively activating the residual path. Abstract: Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
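A minimal sketch of the warmup idea in this abstract, under stated assumptions: the scalar ramps linearly (the abstract only says it "gradually warms up"), the warmup length grows linearly with layer index, and layers are plain scalar functions standing in for Transformer blocks. The names `residual_scale`, `base_warmup`, and `per_layer` are hypothetical and are not taken from the ProRes code.

```python
def residual_scale(step, layer_idx, base_warmup=1000, per_layer=500):
    """Scalar in [0, 1] applied to one layer's residual branch.

    Assumption: linear ramp, with deeper layers (larger layer_idx) given a
    longer warmup so they join training later than shallow layers.
    """
    warmup_steps = base_warmup + per_layer * layer_idx
    return min(1.0, step / warmup_steps)


def forward(x, layers, step):
    """Residual stack where each branch is gated by its warmup scalar."""
    for idx, layer in enumerate(layers):
        x = x + residual_scale(step, idx) * layer(x)
    return x


layers = [lambda v: v, lambda v: v]  # toy branches standing in for blocks
for step in (0, 500, 1000, 1500):
    print(step, [residual_scale(step, i) for i in range(len(layers))], forward(1.0, layers, step))
```

At step 0 every branch is switched off and the stack is the identity; as training proceeds, the first layer's branch reaches full strength before the second one does.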

[74] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs

Deshan Sumanathilaka,Nicholas Micallef,Julian Hough

Main category: cs.CL

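TL;DR: This study investigates whether low-parameter large language models (<4B) can match GPT-4-Turbo on word sense disambiguation (WSD) through reasoning-driven fine-tuning strategies. The results show that fine-tuning which combines Chain-of-Thought (CoT) reasoning with neighbour-word analysis lets Gemma-3-4B and Qwen-3-4B match or exceed state-of-the-art performance on FEWS and on the cross-domain Fool Me If You Can dataset, while substantially reducing computational and energy costs.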

Details Motivation: High-parameter LLMs perform well on WSD, but their computational and energy costs are high and limit scalability; rare or domain-specific senses are still often misjudged, so an efficient, lightweight alternative that generalises well is needed. Method: A dataset with semi-automated rationale annotations is built on top of FEWS, and eight open-source small models (e.g. Gemma and Qwen) are fine-tuned; the core strategy combines Chain-of-Thought reasoning with neighbour-word semantic analysis. Result: Gemma-3-4B and Qwen-3-4B achieve WSD performance comparable to GPT-4-Turbo in zero-shot settings, outperform all medium-parameter baselines and existing state-of-the-art models on FEWS, and show strong cross-domain generalisation on the unseen Fool Me If You Can dataset without task-specific fine-tuning. Conclusion: Carefully designed reasoning-centric fine-tuning lets low-parameter LLMs deliver accurate WSD while greatly reducing computational and energy consumption, offering a viable path toward green, deployable NLP. Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.

[75] Ensembling Language Models with Sequential Monte Carlo

Robin Shing Moon Chan,Tianyu Liu,Samuel Kiegeland,Clemente Pasti,Jacob Hoover Vigly,Timothy J. O'Donnell,Ryan Cotterell,Tim Vieira

Main category: cs.CL

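TL;DR: This paper proposes a unified f-ensemble framework for combining multiple language models, and designs a byte-level sequential Monte Carlo (SMC) algorithm that samples effectively across models with different vocabularies, significantly improving performance on structured text generation tasks.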

Details Motivation: Many language models and prompting strategies are available, but performance is highly sensitive to both choices; conventional ensembling by probability averaging is biased at decoding time and struggles to approximate the true ensemble distribution over strings. Method: An f-ensemble distribution framework based on an arbitrary aggregation function f, together with a byte-level sequential Monte Carlo (SMC) algorithm that samples from multi-model ensembles in a shared character space and supports models with mismatching vocabularies. Result: Across a range of structured text generation tasks, f-ensembles outperform traditional probability averaging, showing that better posterior approximations yield better ensemble performance. Conclusion: The f-ensemble framework and its byte-level SMC sampler provide a more flexible and more accurate decoding paradigm for language model ensembling, especially for heterogeneous model combinations and structured generation. Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
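The bias that this abstract attributes to naive next-token aggregation can be seen on a toy example small enough to enumerate. The sketch below assumes two hypothetical "models" given as explicit distributions over two-character strings and uses the product as the aggregation function $f$; it compares the globally normalized ensemble over strings with the locally normalized, step-by-step version. It does not implement the paper's SMC sampler.

```python
import math


def global_ensemble(models, f):
    """Ensemble over whole strings: p(x) proportional to f(p_1(x), ..., p_K(x))."""
    weights = {s: f([m[s] for m in models]) for s in models[0]}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


def prefix_prob(model, prefix):
    """Probability that a string drawn from the model starts with prefix."""
    return sum(p for s, p in model.items() if s.startswith(prefix))


def local_ensemble(models, f, alphabet, length):
    """Aggregate next-character conditionals and renormalize at every step."""
    dist = {"": 1.0}
    for _ in range(length):
        extended = {}
        for prefix, p in dist.items():
            weights = {
                ch: f([prefix_prob(m, prefix + ch) / prefix_prob(m, prefix) for m in models])
                for ch in alphabet
            }
            z = sum(weights.values())
            for ch in alphabet:
                extended[prefix + ch] = p * weights[ch] / z
        dist = extended
    return dist


# Two toy models that disagree after the prefix "a" (hypothetical numbers).
p1 = {"aa": 0.4, "ab": 0.1, "ba": 0.25, "bb": 0.25}
p2 = {"aa": 0.1, "ab": 0.4, "ba": 0.25, "bb": 0.25}

exact = global_ensemble([p1, p2], math.prod)
naive = local_ensemble([p1, p2], math.prod, alphabet="ab", length=2)
for s in sorted(exact):
    print(s, round(exact[s], 4), round(naive[s], 4))
```

In this toy case the global product ensemble shifts mass toward strings starting with "b", where the models agree, while the locally normalized version cannot see that disagreement until after the first character has been chosen and so assigns each string equal probability.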

[76] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Ted Zadouri,Markus Hoehnerbach,Jay Shah,Timmy Liu,Vijay Thakkar,Tri Dao

Main category: cs.CL

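TL;DR: This paper presents FlashAttention-4, which optimizes attention computation for the asymmetric hardware characteristics of Blackwell GPUs such as the B200 and GB200. With BF16 it achieves speedups of up to 1.3× over cuDNN and 2.7× over Triton, and it is implemented in CuTe-DSL for fast compilation with high expressivity.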

Details Motivation: As AI hardware moves rapidly from Hopper (H100) to Blackwell (B200/GB200), tensor core throughput doubles while shared memory bandwidth, exponential units, and other functional units do not keep pace, so existing optimizations such as FlashAttention-3 no longer match the new bottlenecks. Method: Three key techniques: (1) redesigned pipelines based on fully asynchronous MMA operations and larger tiles; (2) software-emulated exponentials and conditional softmax rescaling to reduce non-matmul operations; (3) use of tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. Everything is implemented in CuTe-DSL embedded in Python. Result: FlashAttention-4 reaches up to 1613 TFLOPs/s (71% utilization) with BF16 on B200, a speedup of up to 1.3× over cuDNN 9.13 and 2.7× over Triton; compile times are 20-30× faster than traditional C++ template-based approaches while retaining full expressivity. Conclusion: FlashAttention-4 adapts to the asymmetric scaling of the Blackwell architecture, balances performance with development efficiency, and provides an efficient, maintainable attention kernel foundation for training next-generation large models. Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization).
Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
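To make the phrase "conditional softmax rescaling" in the abstract above more concrete, here is a plain-Python sketch of blockwise (online) softmax for a single query row, where the running accumulators are rescaled only when a new block raises the running maximum. This is an assumption about the general idea, chosen because skipping the rescale in that case is exact; the condition actually used in FlashAttention-4 may differ, and the function name and scalar values are hypothetical simplifications.

```python
import math


def attention_row(scores, values, block=2):
    """Softmax-weighted sum of values, streamed over blocks of scores.

    Returns (output, number_of_rescales). Values are scalars here for brevity;
    a real kernel would keep a vector accumulator per query row.
    """
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    rescales = 0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        blk_max = max(s_blk)
        if blk_max > running_max:
            if running_max != -math.inf:
                # Only pay for the rescale when the maximum really moved.
                factor = math.exp(running_max - blk_max)
                denom *= factor
                acc *= factor
                rescales += 1
            running_max = blk_max
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - running_max)
            denom += w
            acc += w * v
    return acc / denom, rescales


print(attention_row([1.0, 3.0, 2.0, 0.5], [1.0, 2.0, 3.0, 4.0]))
print(attention_row([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))
```

When the largest score appears in the first block, as in the first call, no rescale is needed at all; the second call has an increasing score sequence and triggers one rescale. Both return the same value as a direct softmax over the full row.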

[77] DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

Klaywert Danillo Ferreira de Souza,David Eduardo Pereira,Cláudio E. C. Campelo,Larissa Lucena Vasconcelos

Main category: cs.CL

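TL;DR: This paper presents the DEBISS corpus, which addresses the current scarcity of debate corpora by covering spoken and individual debates and providing annotations for several NLP tasks.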

Details Motivation: Existing debate corpora are scarce and struggle to cover the diverse applications, structures, and formats of debate in everyday life, work, politics, and social media. Method: The DEBISS corpus is built from spoken and individual debates with semi-structured features, and is annotated for several NLP tasks, including speech-to-text, speaker diarization, argument mining, and debater quality assessment. Result: The DEBISS corpus is designed and presented as a new resource for debate-related NLP research. Conclusion: DEBISS fills a gap in debate corpus resources, supports a variety of NLP tasks, and helps advance research on debate understanding and modelling. Abstract: The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus comes with a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.

[78] NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

Abrar Eyasir,Tahsin Ahmed,Muhammad Ibrahim

Main category: cs.CL

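TL;DR: This paper introduces NCTB-QA, a large-scale Bangla educational question answering dataset with a balanced mix of answerable and unanswerable questions, and shows the importance of domain-specific fine-tuning for question answering in low-resource languages.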

Details Motivation: Reading comprehension systems for low-resource languages are unreliable when handling unanswerable questions, and there is no high-quality Bangla benchmark dataset that includes unanswerable questions. Method: The NCTB-QA dataset is constructed (87,805 question-answer pairs from 50 textbooks, 42.75% unanswerable, with adversarial distractors), and BERT, RoBERTa, and ELECTRA are fine-tuned and evaluated on it. Result: BERT achieves a 313% relative improvement in F1 score (0.150 to 0.620), and BERTScore also rises significantly; NCTB-QA becomes a challenging new benchmark for Bangla educational question answering. Conclusion: Domain-specific fine-tuning is critical for the robustness of question answering systems in low-resource languages; NCTB-QA provides a practical and representative evaluation platform for the unanswerable-question problem. Abstract: Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
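The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the relative F1 improvement from the two reported scores and estimates the number of unanswerable questions from the reported share; the count is only approximate because the published percentages are rounded, and the helper name is hypothetical.

```python
def relative_improvement(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Reported BERT F1 before and after fine-tuning.
gain = relative_improvement(0.150, 0.620)
print(round(gain, 1))

# Approximate split implied by the reported 42.75% unanswerable share.
total_pairs = 87_805
unanswerable = round(total_pairs * 0.4275)
print(unanswerable, total_pairs - unanswerable)
```

The relative gain rounds to the 313% stated in the abstract, which confirms that the figure is a relative rather than an absolute improvement (the absolute gain is 0.47 F1).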

[79] Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Artem Vazhentsev,Maria Marina,Daniil Moskovskiy,Sergey Pletenev,Mikhail Seleznyov,Mikhail Salnikov,Elena Tutubalina,Vasily Konovalov,Irina Nikishina,Alexander Panchenko,Viktor Moskvoretskii

Main category: cs.CL

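TL;DR: This paper proposes the new task of fact-checking without retrieval, which verifies natural language claims using internal model representations instead of retrieved external knowledge, and designs a comprehensive evaluation framework covering long-tail knowledge, multiple claim sources, multilinguality, and long-form generation. Experiments show that methods based on interactions between internal representations (such as INTRA) outperform conventional logit-based methods and generalise well, opening a new path toward trustworthy agentic AI.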

Details Motivation: Existing LLM-based fact-checking methods depend heavily on external knowledge retrieval, are limited by retrieval errors and data availability, and leave the model's own fact-verification capabilities largely unused. Method: The task of fact-checking without retrieval is proposed, with a comprehensive evaluation framework covering long-tail knowledge, multiple claim sources, multilinguality, and long-form generation; the INTRA method verifies facts using interactions between the model's internal representations. Result: Experiments across 9 datasets, 18 methods, and 3 models show that logit-based methods often underperform methods that use internal representations; INTRA achieves the best performance with strong generalisation. Conclusion: Fact-checking without retrieval is a promising research direction that can complement retrieval-based frameworks, improve scalability, and be used as a reward signal during training or as a component integrated into the generation process. Abstract: Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

[80] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana,Annabel Ma,Max Loeffler,Raphael Sarfati,Eric Bigelow,Atticus Geiger,Owen Lewis,Jack Merullo

Main category: cs.CL

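TL;DR: This paper shows that reasoning models exhibit "performative chain-of-thought (CoT)": the model is already confident of its answer but keeps generating redundant reasoning steps. Using activation probing and related methods, the authors find that the model forms its answer belief very early on easy tasks, while genuine belief changes appear only in difficult multi-hop reasoning. A probe-based early-exit strategy substantially reduces the number of generated tokens while maintaining accuracy.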

Details Motivation: To investigate whether large language models engage in "performative" reasoning (formal but unnecessary) during chain-of-thought (CoT) generation, as opposed to genuine step-by-step belief updating. Method: Activation probing, early forced answering, and a CoT monitor are combined on two large models, DeepSeek-R1 671B and GPT-OSS 120B, to compare belief evolution and generation behaviour on MMLU (easy) and GPQA-Diamond (hard). Result: (1) On easy tasks the answer can be decoded from early activations, well before the CoT monitor can tell; (2) significant belief shifts and signs of genuine reasoning (such as backtracking and "aha" moments) appear only on hard tasks; (3) probe-guided early exit cuts tokens by up to 80% on MMLU and 30% on GPQA-Diamond with essentially unchanged accuracy. Conclusion: CoT generation contains a large share of unnecessary, performative content; activation probes can identify the model's true belief state and offer a new route to adaptive computation and efficient reasoning. Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

cs.CV [Back]

[81] Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology

Ekansh Arora

Main category: cs.CV

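TL;DR: This study examines how well CPath-CLIP transfers across cancer types and species in pathology image recognition and finds that standard vision-language alignment suffers semantic collapse in cross-species settings. It proposes Semantic Anchoring, which uses text to provide a stable coordinate system for visual features, significantly improving performance and showing that language can serve as a semantic control mechanism without retraining.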

Details Motivation: Foundation models are increasingly used in computational pathology, but their behavior under cross-cancer and cross-species transfer remains unclear and calls for systematic evaluation and improvement. Method: Using whole-slide image patches from canine and human histopathology, evaluates CPath-CLIP with few-shot fine-tuning on same-cancer, cross-cancer, and cross-species tasks; combines embedding-space analysis with Grad-CAM visualization; proposes Semantic Anchoring, which uses language guidance to build a stable coordinate system for visual features; runs ablations to verify the central role of the text-alignment mechanism. Result: Few-shot fine-tuning raises same-cancer AUC from 64.9% to 72.6% and cross-cancer AUC from 56.84% to 66.31%; Semantic Anchoring adds further gains of 8.52% (same-cancer) and 5.67% (cross-cancer); the study identifies a new failure mode, "semantic collapse", driven by species-dominated alignment rather than missing visual information; text alignment effectively mitigates embedding collapse. Conclusion: Language not only aids representation learning but can also act as a control mechanism for semantic re-interpretation; Semantic Anchoring anchors visual features with text and markedly improves cross-domain generalization, offering a new approach for cross-species pathology AI. Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. 
Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.

[82] Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living

Kooshan Hashemifard,Pau Climent-Pérez,Francisco Florez-Revuelta

Main category: cs.CV

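TL;DR: This paper proposes a multimodal method for recognizing the daily activities of older users. It fuses visual features from a 3D CNN, 3D pose data processed by a graph convolutional network, and object-detection context integrated through a cross-attention mechanism, achieving competitive classification accuracy on the Toyota SmartHome dataset.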

Details Motivation: Addresses the challenges of indoor daily-activity recognition for older adults: large intra-class variability, high inter-class similarity, varying environments and viewpoints, and complex scenes. Method: Combines a 3D CNN for visual features, a GCN for 3D human pose, and object detection for context, fusing context with visual features through cross-attention. Result: Validated on the real-world Toyota SmartHome dataset, with competitive daily-activity classification accuracy. Conclusion: The multimodal method can serve as a key component of advanced AAL monitoring systems and help improve the safety and autonomy of older adults at home. Abstract: Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.

[83] InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities

Chengshuai Yang,Xin Yuan

Main category: cs.CV

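TL;DR: This paper introduces InverseNet, the first cross-modality benchmark for operator mismatch. It shows that deep learning methods degrade severely under operator mismatch and that architecture design and calibration strategy are critical to robustness.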

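Details Motivation: Existing efficient compressive imaging methods such as EfficientSCI degrade sharply when the forward operator deviates from physical reality, yet no benchmark quantifies this pervasive operator-mismatch problem. Method: Builds InverseNet, the first cross-modality (CASSI, CACTI, single-pixel camera) operator-mismatch benchmark, with a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration), and evaluates 12 methods on 27 simulated scenes and 9 real hardware captures. Result: (1) Deep learning methods lose 10-21 dB under mismatch, erasing their advantage over classical methods; (2) performance and robustness are significantly negatively correlated (r_s = -0.71); (3) operator-agnostic (mask-oblivious) architectures cannot recover mismatch losses, whereas operator-conditioned methods recover 41-90%; (4) blind grid-search calibration reaches 85-100% of the oracle-corrected result. Real hardware experiments confirm the simulation findings. Conclusion: Operator mismatch is a key bottleneck for deploying compressive imaging in practice; architectures need to model operator uncertainty explicitly, blind calibration already shows practical potential, and InverseNet provides a standardized platform for robustness research. Abstract: State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.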
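
To make "blind grid-search calibration without ground truth" concrete, here is a toy sketch under strong assumptions. The forward operator (a circularly shifted binary mask applied elementwise), the single unknown parameter `shift`, and the consistency score `unexplained_energy` are hypothetical stand-ins; the abstract does not describe the benchmark's actual operators or search criterion.

```python
def forward(x, mask, shift):
    """Toy forward operator: modulate x with a circularly shifted binary mask."""
    n = len(x)
    return [mask[(i - shift) % n] * x[i] for i in range(n)]


def unexplained_energy(y, mask, shift):
    """Measurement energy that lands where the assumed mask is zero.

    The true operator blocks those positions, so a correct shift gives 0.
    Only the measurement y is used, never the underlying scene.
    """
    n = len(y)
    return sum(y[i] ** 2 for i in range(n) if mask[(i - shift) % n] == 0)


def grid_search_shift(y, mask, candidates):
    """Pick the candidate shift that best explains the measurement."""
    return min(candidates, key=lambda s: unexplained_energy(y, mask, s))


mask = [1, 1, 0, 1, 0, 0, 1, 0]                    # assumed mask pattern
scene = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0]   # unknown to the calibrator
true_shift = 3                                     # the mismatch to recover
measurement = forward(scene, mask, true_shift)
estimated_shift = grid_search_shift(measurement, mask, range(len(mask)))
```

The search recovers the shift here because the mask has no circular symmetry and the scene is strictly positive. Real systems have several coupled parameters and noise, so the score would typically be a reconstruction-based residual instead.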

[84] Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

Ancymol Thomas,Jaya Sreevalsan-Nair

Main category: cs.CV

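TL;DR: This study systematically analyzes deep learning fusion strategies for local climate zone (LCZ) classification from multimodal remote sensing data (SAR and MSI). Baseline hybrid fusion (FM1) combined with band grouping (BG) and label merging (LM) works best, reaching an overall accuracy of 76.6% and notably improving predictions for minority classes.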

Details Motivation: Existing work lacks a comprehensive analysis of fusion mechanisms in deep learning models for multimodal LCZ classification, even though data fusion is critical to classification accuracy. Method: On the So2Sat LCZ42 dataset, compares four fusion models: baseline hybrid fusion (FM1), self- and cross-attention (FM2), multi-scale Gaussian-filtered image input (FM3), and weighted decision-level fusion (FM4); runs pixel-, feature-, and decision-level ablations; and explores two grouping strategies, band grouping (BG) and label merging (LM). Result: FM1+BG+LM achieves the highest overall accuracy, 76.6%, and markedly improves prediction accuracy for LCZ classes with few samples. Conclusion: Baseline hybrid fusion is the more robust and effective strategy for multimodal LCZ classification; sensible data grouping (BG) and label refinement (LM) improve performance further, particularly by easing class imbalance. Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. 
Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion
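
As a rough illustration of what weighted decision-level fusion (the FM4 variant named above) generally means, the sketch below averages per-class probabilities from two modality-specific classifiers. The fixed scalar weight, the function names, and the probability values are assumptions; the paper's actual weighting scheme is not given in the abstract.

```python
def weighted_decision_fusion(prob_sar, prob_msi, weight_sar=0.5):
    """Convex combination of per-class probabilities from two branches."""
    weight_msi = 1.0 - weight_sar
    return [weight_sar * ps + weight_msi * pm
            for ps, pm in zip(prob_sar, prob_msi)]


def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up softmax outputs for three classes.
sar = [0.6, 0.3, 0.1]
msi = [0.2, 0.7, 0.1]
fused = weighted_decision_fusion(sar, msi, weight_sar=0.4)
label = predict(fused)
```

Because the weights sum to one, the fused vector remains a valid probability distribution. Here the MSI branch carries more weight, so the fused prediction follows it.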

[85] Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

Xuan Xu,Prateek Prasanna

Main category: cs.CV

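TL;DR: This paper proposes Dual-LoRA Controllable Diffusion, a unified diffusion framework that uses multi-class nuclei centroids as lightweight, annotation-efficient spatial priors. Two task-specific LoRA adapters let a single model support both local structure completion and global structure synthesis, markedly improving structural fidelity and realism in tissue image restoration and generation.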

Details Motivation: Existing methods treat tissue image restoration and generation as separate tasks and rely on weak or inconsistent structural priors, which limits the realism of cellular organization; a generative framework with unified modeling and biologically meaningful structural guidance is needed. Method: Proposes Dual-LoRA Controllable Diffusion: multi-class nuclei centroids serve as lightweight spatial priors, and two LoRA adapters specialize in local structure completion and global structure synthesis respectively while sharing one diffusion backbone for end-to-end joint optimization. Result: For local completion, LPIPS within the masked region drops from 0.1797 (HARP) to 0.1524; for global synthesis, FID drops from 225.15 (CoSys) to 76.04; structural recovery is more accurate, with markedly better morphological consistency and realism. Conclusion: The method unifies restoration and generation, using biologically interpretable centroid priors and parameter-efficient LoRA to offer a scalable, high-fidelity approach to pan-cancer histopathology modeling. Abstract: Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.

[86] Mask-aware inference with State-Space Models

Ignasi Mas,Ramon Morros,Javier-Ruiz Hidalgo,Ivan Huerta

Main category: cs.CV

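TL;DR: This paper proposes Partial Vision Mamba (PVM), which brings the Partial Convolution idea for handling irregular missing data to state space models such as Mamba so that they can handle invalid data of arbitrary shape. Its effectiveness and generality are validated on depth completion, image inpainting, and classification with invalid data.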

Details Motivation: Existing state space models such as Mamba have no built-in mechanism for handling arbitrarily shaped missing or invalid data at inference time, which real-world vision tasks such as depth completion routinely face. Method: Proposes the Partial Vision Mamba (PVM) component, adapting the mask-aware re-normalization of Partial Convolution to the Mamba architecture, and defines design rules for PVM-based architectures. Result: Validates PVM's effectiveness and generalization on depth completion, image inpainting, and classification with invalid data, markedly improving the ability of SSM-style models to handle irregular missing data. Conclusion: PVM carries the partial-operation paradigm over to vision state space models, providing a general and efficient solution for SSMs under irregular missing data. Abstract: Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
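
The mask-aware re-normalization that the abstract attributes to Partial Convolutions can be sketched in one dimension as follows. This shows the CNN-side idea only, not PVM's adaptation to Mamba, which the abstract does not detail; the function name, the bias-free form, and the sample data are illustrative assumptions.

```python
def partial_conv1d(x, mask, kernel):
    """1-D partial convolution (valid padding, no bias).

    Each output is computed from valid inputs only and rescaled by
    (window size / number of valid inputs). A window with no valid input
    yields 0 and stays invalid in the updated mask.
    """
    k = len(kernel)
    out, new_mask = [], []
    for start in range(len(x) - k + 1):
        window_x = x[start:start + k]
        window_m = mask[start:start + k]
        valid = sum(window_m)
        if valid == 0:
            out.append(0.0)
            new_mask.append(0)
            continue
        acc = sum(w * xi * mi for w, xi, mi in zip(kernel, window_x, window_m))
        out.append(acc * k / valid)   # re-normalize by the valid fraction
        new_mask.append(1)
    return out, new_mask


# Constant signal with holes: a mean filter should still return 1.0
# wherever at least one valid sample is inside the window.
signal = [1.0] * 7
valid_mask = [1, 0, 1, 0, 0, 0, 0]
mean_kernel = [1 / 3, 1 / 3, 1 / 3]
output, updated_mask = partial_conv1d(signal, valid_mask, mean_kernel)
```

Without the rescaling, the first window would average to about 0.67 rather than 1.0, because the missing sample would be treated as a zero.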

[87] PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Rohan Mahadev,Joyce Yuan,Patrick Poirson,David Xue,Hao-Yu Wu,Dmitry Kislyuk

Main category: cs.CV

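TL;DR: This paper presents PinPoint, a benchmark for evaluating composed image retrieval (CIR) systems on multi-answer accuracy, robustness, multi-image reasoning, and fairness. It exposes marked weaknesses of existing methods in suppressing hard negatives, staying robust to instruction paraphrases, and handling multi-image queries, and proposes a training-free MLLM reranking method to improve performance.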

Details Motivation: Existing CIR benchmarks support only a single ground-truth answer and cannot evaluate false-positive suppression, robustness, or multi-image reasoning, so they give an incomplete picture of real model performance. Method: Builds PinPoint, a real-world benchmark (7,635 queries, 329K relevance judgments, 23 query categories) with multiple answers, explicit hard negatives, six instruction paraphrases, multi-image composed queries, and demographic metadata; analyzes 20+ methods on it; proposes a training-free reranking method based on an off-the-shelf MLLM. Result: The best current CIR methods reach only 28.5% mAP@10 and still retrieve hard negatives 9% of the time; performance varies by 25.1% across instruction paraphrases; multi-image queries perform 40-70% worse; the proposed reranking method mitigates these problems. Conclusion: PinPoint exposes key gaps in CIR and encourages a more comprehensive, practical evaluation paradigm; the training-free reranking method offers a plug-and-play route to better performance; all data and code are released. Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods, while achieving mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.

[88] SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D

Zirui Wang,Ruiping Liu,Yufan Chen,Junwei Zheng,Weijia Fan,Kunyu Peng,Di Wen,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen

Main category: cs.CV

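TL;DR: This paper proposes SGR3, a training-free framework for 3D scene graph generation. It combines multimodal large language models (MLLMs) with retrieval-augmented generation (RAG), using semantically aligned scene graph retrieval and a weighted patch-level similarity selection mechanism to bypass explicit 3D reconstruction and strengthen relational reasoning.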

Details Motivation: Existing 3D scene graph generation methods depend on multi-modal data and heuristic graph construction, which limits the flexibility and generalization of relationship-triplet prediction; they also require 3D reconstruction, making data acquisition costly. Method: Proposes the SGR3 Model: built on MLLMs and RAG, it retrieves semantically aligned scene graphs through a ColPali-style cross-modal framework and adds a weighted patch-level similarity selection mechanism to make retrieval more robust, with no training or explicit 3D reconstruction. Result: SGR3 performs competitively against training-free baselines and on par with GNN-based expert models; ablations show that retrieved external knowledge is explicitly integrated into token generation rather than implicitly abstracted. Conclusion: SGR3 demonstrates that a retrieval-augmented, training-free paradigm works for 3D scene graph generation, offering a lightweight, flexible, and interpretable route to high-level semantic understanding for robots. Abstract: 3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.
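
One way to picture a weighted patch-level similarity is to start from ColPali-style late interaction, where each query vector is matched to its best image patch and the maxima are summed, and then scale each patch by a reliability weight. The sketch below follows that reading. How SGR3 actually computes and applies its weights is not stated in the abstract, so `weighted_maxsim`, the weights, and the vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_maxsim(query_vecs, patch_vecs, patch_weights):
    """Late-interaction score with per-patch down-weighting (assumed form).

    For each query vector, take the best weighted similarity over all
    patches, then sum over query vectors.
    """
    score = 0.0
    for q in query_vecs:
        score += max(w * dot(q, p) for p, w in zip(patch_vecs, patch_weights))
    return score


query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.5, 0.0],   # sharp but only partly relevant patch
           [2.0, 2.0]]   # blurry patch that matches everything strongly
uniform = weighted_maxsim(query, patches, [1.0, 1.0])
downweighted = weighted_maxsim(query, patches, [1.0, 0.1])
```

With uniform weights the uninformative patch dominates both matches; after down-weighting, the score is driven mostly by the sharper patch.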

[89] Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI

Prathamesh Pradeep Khole,Mario M. Brenes,Zahra Kais Petiwala,Ehsan Mirafzali,Utkarsh Gupta,Jing-Rebecca Li,Andrada Ianus,Razvan Marinescu

Main category: cs.CV

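TL;DR: This paper proposes Spinverse, which inverts dMRI signals through a differentiable Bloch-Torrey simulator to recover tissue microstructure interfaces. The permeability of each interior face is a learnable parameter, so diffusion barriers emerge automatically without a preset boundary topology.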

Details Motivation: Most existing dMRI reconstruction methods assume impermeable boundaries or estimate only voxel-level parameters, making it hard to recover microstructural interfaces explicitly; a method is needed that learns permeable boundaries adaptively without a prior topology. Method: Spinverse represents tissue on a fixed tetrahedral mesh and treats each interior face permeability as a learnable parameter; it optimizes permeabilities by backpropagating a signal-matching loss, and uses mesh-based geometric priors and a staged multi-sequence optimization strategy to mitigate ill-posedness and local minima. Result: On synthetic voxel meshes, Spinverse reconstructs diverse geometries and shows that sequence scheduling and regularization are critical for avoiding outline-only solutions and improving boundary accuracy and structural validity. Conclusion: Spinverse achieves permeability-aware dMRI reconstruction in which microstructural interfaces emerge automatically, offering a new approach to high-fidelity microstructure modeling without prior topological constraints. Abstract: Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.

[90] sFRC for assessing hallucinations in medical image restoration

Prabhat Kc,Rongping Zeng,Nirmal Soni,Aldo Badano

Main category: cs.CV

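TL;DR: This paper proposes sFRC (scanning Fourier Ring Correlation), a new method for detecting hallucinations in deep-learning medical image restoration, and validates its effectiveness and robustness on several undersampled CT and MRI tasks.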

Details Motivation: Deep-learning restorations look good but are prone to hallucinations, and easy-to-use, robust techniques and metrics for detecting them are lacking. Method: Proposes sFRC, a scanning analysis that computes Fourier Ring Correlation (FRC) over small patches; its parameters are set using hallucination maps annotated by experts or derived from imaging theory; it is applied to CT super-resolution, sparse-view CT, and undersampled MRI restoration. Result: sFRC detects hallucinated features in the CT tasks and agrees closely with imaging theory-based hallucination maps in the MRI task; it also quantifies hallucination rates of DL methods on in-distribution versus out-of-distribution data and across subsampling rates, and is further validated on a conventional regularization-based method and an unrolled method. Conclusion: sFRC is a general, interpretable, theoretically grounded hallucination detector that can improve the reliability and clinical trustworthiness of deep-learning medical image restoration. Abstract: Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. 
Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
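
For readers unfamiliar with the underlying quantity, the standard Fourier Ring Correlation between two image patches is usually written as below. The notation is an assumption made here ($F_1$ and $F_2$ are the 2-D Fourier transforms of a restored patch and its reference patch, and $R_r$ is the set of spatial frequencies in the ring of radius $r$); the paper's own sFRC formulation, including patch size and thresholds, is not reproduced in the abstract.

```latex
\mathrm{FRC}(r) =
\frac{\sum_{\mathbf{k} \in R_r} F_1(\mathbf{k})\, \overline{F_2(\mathbf{k})}}
     {\sqrt{\sum_{\mathbf{k} \in R_r} \lvert F_1(\mathbf{k}) \rvert^2 \;
            \sum_{\mathbf{k} \in R_r} \lvert F_2(\mathbf{k}) \rvert^2}}
```

The real part is normally taken. Values near 1 indicate that the two patches agree at that spatial frequency, while a drop toward 0 marks frequencies where they diverge.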

[91] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

Chenjun Li

Main category: cs.CV

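TL;DR: This paper proposes PulseFocus, which at inference time structures the chain of thought (CoT) into interleaved plan/focus blocks and adds soft attention gating. This mitigates the diffuse attention and positional bias of vision-language models in multi-image reasoning and consistently improves performance on multi-image benchmarks.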

Details Motivation: VLMs performing multi-image reasoning show diffuse attention ("pulses") and a systematic image-position bias during CoT generation, which calls for a remedy. Method: Proposes PulseFocus, a training-free, inference-time method that decomposes CoT into interleaved plan/focus blocks and applies soft attention gating to the referenced image during decoding, forcing the model to plan explicitly and focus on the relevant image. Result: Improves BLINK by 3.7% and MuirBench by 1.07%, with consistent gains. Conclusion: PulseFocus mitigates unfocused attention and positional bias in multi-image reasoning and is a lightweight, general, and efficient inference-time optimization. Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).

[92] A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification

Sai Shi

Main category: cs.CV

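TL;DR: This paper systematically evaluates three neural network compression methods, pruning, quantization, and knowledge distillation, on hyperspectral land cover classification. Compressed models keep competitive classification accuracy while substantially reducing model size and computational cost.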

Details Motivation: Deploying deep neural networks on resource-constrained platforms such as remote sensing and edge devices is limited by their large computational and memory demands, so effective network compression is needed. Method: Systematically evaluates three mainstream CNN compression strategies (pruning, quantization, knowledge distillation) on two hyperspectral benchmark datasets, measuring classification accuracy, memory consumption, and inference efficiency. Result: Compressed models substantially reduce model size and computational cost while keeping competitive classification performance; the study characterizes the trade-offs among compression ratio, efficiency, and accuracy. Conclusion: Network compression can enable efficient deployment of deep learning in remote sensing applications. Abstract: Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.

[93] Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao,Armin Danesh Pazho,Narges Rashvand,Hamed Tabkhi

Main category: cs.CV

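TL;DR: This paper systematically evaluates the reliability of multimodal large language models (MLLMs) for video anomaly detection (VAD). In zero-shot settings they show a pronounced conservative bias (high precision, low recall); class-specific prompts raise the F1 score on ShanghaiTech from 0.09 to 0.64, exposing insufficient recall as the key bottleneck for current MLLMs in real surveillance scenarios.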

Details Motivation: Examines the reliability of MLLMs for real-world VAD: they perform well on video understanding, but their robustness in practice is unclear. Method: Reformulates VAD as binary classification under weak temporal supervision and systematically evaluates state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks, analyzing how prompt specificity and temporal window length (1s to 3s) affect the precision-recall trade-off. Result: Zero-shot models show a strong conservative bias: high confidence but a heavy preference for the 'normal' class, giving high precision and very low recall (e.g., F1 of only 0.09 on ShanghaiTech); class-specific instructions raise F1 to 0.64, but recall remains the key bottleneck. Conclusion: Current MLLMs show a significant VAD performance gap in noisy settings such as open-world surveillance, and recall-oriented prompting and model calibration are needed. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s to 3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
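
A short worked example shows why high precision combined with a recall collapse produces a very low F1 score. The counts below are invented for illustration and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical conservative model: it flags only 5 of 100 true anomalies
# and never raises a false alarm.
precision, recall, f1 = precision_recall_f1(tp=5, fp=0, fn=95)
```

Here precision is perfect and recall is 0.05, yet F1 is below 0.1, because the harmonic mean is dominated by the smaller of the two quantities.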

[94] FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

Xingyu Wang,Tao Wang

Main category: cs.CV

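TL;DR: This paper proposes FOZO, a backpropagation-free, forward-only zeroth-order optimization method for test-time adaptation (TTA) that delivers efficient, stable, and strong model adaptation on resource-constrained devices.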

Details Motivation: Existing TTA methods face a dilemma: backpropagation-based methods are costly in computation and memory and modify weights, making them unsuitable for low-end devices, while backpropagation-free methods adapt poorly. Method: Proposes Forward-Only Zeroth-Order Optimization (FOZO): memory-efficient zeroth-order prompt optimization that jointly optimizes intermediate feature statistics and prediction entropy; a dynamically decaying perturbation scale stabilizes zeroth-order gradient estimation, with a convergence proof under the TTA data stream assumption. Result: In continual adaptation on ImageNet-C (59.52% Top-1), ImageNet-R, and ImageNet-Sketch, FOZO outperforms mainstream gradient-based methods and the SOTA forward-only method FOA (58.13%); it is also robust on INT8-quantized models. Conclusion: FOZO is a practical, efficient, and stable TTA approach that generalizes well and is particularly suited to deployment in resource-constrained scenarios. Abstract: Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
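
The idea of estimating a gradient from forward passes alone, with a perturbation scale that decays over time, can be sketched with a generic two-point zeroth-order estimator. This is not FOZO itself: the quadratic `loss` stands in for the real objective (feature statistics plus entropy), and the random sign perturbations, the decay schedule `eps0 / sqrt(1 + t)`, and all names are assumptions.

```python
import random


def loss(prompt, target):
    """Toy objective standing in for the real adaptation loss."""
    return sum((p - t) ** 2 for p, t in zip(prompt, target))


def zo_gradient(f, prompt, eps, rng):
    """Two-point zeroth-order estimate using a random +/-1 direction."""
    z = [rng.choice((-1.0, 1.0)) for _ in prompt]
    plus = f([p + eps * zi for p, zi in zip(prompt, z)])
    minus = f([p - eps * zi for p, zi in zip(prompt, z)])
    scale = (plus - minus) / (2.0 * eps)   # directional derivative along z
    return [scale * zi for zi in z]


def adapt(prompt, target, steps=200, lr=0.05, eps0=0.1, seed=0):
    """Forward-only updates with a decaying perturbation scale (assumed)."""
    rng = random.Random(seed)
    f = lambda p: loss(p, target)
    for t in range(steps):
        eps = eps0 / (1.0 + t) ** 0.5
        grad = zo_gradient(f, prompt, eps, rng)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt


target = [1.0, -2.0, 0.5, 3.0]
start = [0.0, 0.0, 0.0, 0.0]
adapted = adapt(start, target)
initial_loss = loss(start, target)
final_loss = loss(adapted, target)
```

Each update costs two forward evaluations and stores no activations for backpropagation, which is the property that makes this family of methods attractive on low-memory devices.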

[95] Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

Yang Zou,Jun Ma,Zhidong Jiao,Xingyuan Li,Zhiying Jiang,Jinyuan Liu

Main category: cs.CV

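TL;DR: This paper proposes Real-IISR, a unified autoregressive framework for real-world infrared image super-resolution. It uses thermal-structural guided visual autoregression to reconstruct fine thermal structures and clear backgrounds scale by scale, and builds FLIR-IISR, a real-world dataset of paired LR-HR infrared images.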

Details Motivation: Most existing infrared super-resolution methods are built on simulated data or ignore the intrinsic differences between infrared and visible imaging, whereas real infrared images suffer coupled optical and sensing degradations that reduce structural sharpness and thermal fidelity together. Method: Proposes the Real-IISR framework, comprising a Thermal-Structural Guidance module (mitigating the mismatch between thermal radiation and structural edges), a Condition-Adaptive Codebook (dynamically modulating discrete representations from degradation-aware thermal priors), and a Thermal Order Consistency Loss (enforcing a monotonic relation between temperature and pixel intensity). Result: Experiments on the self-built real-world dataset FLIR-IISR demonstrate Real-IISR's promising performance, providing a unified foundation and benchmark for real-world infrared super-resolution. Conclusion: Real-IISR handles the coupled degradations of real infrared images while balancing structural reconstruction with thermal physical consistency, moving infrared super-resolution from simulation toward practical use. Abstract: Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. 
The dataset and code are available at: https://github.com/JZD151/Real-IISR.
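
A loss that enforces relative brightness order rather than absolute values, as the abstract describes for the Thermal Order Consistency Loss, can be approximated by a pairwise ranking penalty. The sketch below is one plausible form under that assumption and is not the paper's definition; the hinge formulation, the `margin` parameter, and the sample values are hypothetical.

```python
from itertools import combinations


def order_consistency_loss(pred, ref, margin=0.0):
    """Mean hinge penalty over pixel pairs whose order disagrees with ref.

    For every pair with different reference values, the predicted value at
    the hotter pixel should exceed the predicted value at the cooler pixel
    by at least `margin`. Absolute intensities do not matter.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(ref)), 2):
        if ref[i] == ref[j]:
            continue                      # no order to enforce
        hi, lo = (i, j) if ref[i] > ref[j] else (j, i)
        total += max(0.0, margin - (pred[hi] - pred[lo]))
        count += 1
    return total / count if count else 0.0


ref = [1.0, 2.0, 3.0]                     # reference intensities
loss_same_order = order_consistency_loss([10.0, 20.0, 30.0], ref)
loss_reversed = order_consistency_loss([30.0, 20.0, 10.0], ref)
```

A prediction with a large global offset but the correct ordering incurs no penalty, while a reversed ordering is penalized on every pair. This matches the stated goal of tolerating thermal drift while preserving physical consistency.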

[96] Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

Alexandru Florea,Shansong Wang,Mingzhe Hu,Qiang Li,Zach Eidex,Luke del Balzo,Mojtaba Safari,Xiaofeng Yang

Main category: cs.CV

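TL;DR: This paper presents the first cross-sectional evaluation of the GPT-5 family on multi-task clinical scenarios, covering text reasoning and multimodal imaging question answering. GPT-5 clearly outperforms GPT-4o on text reasoning and on some imaging tasks such as mammography, but its neuroradiology accuracy remains moderate and it still trails specialized models on high-precision, perception-critical tasks.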

Details Motivation: Asks whether general-purpose foundation models such as the GPT-5 family can support the integrated reasoning clinical medicine requires, in particular combined judgment over ambiguous patient narratives, laboratory data, and multimodal imaging. Method: Uses a standardized zero-shot chain-of-thought protocol for a controlled, cross-sectional evaluation of the GPT-5 family against GPT-4o on medical education examinations, text reasoning benchmarks, and visual question answering in neuroradiology, digital pathology, and mammography. Result: GPT-5 gains more than 25 percentage points on MedXpertQA text reasoning and leads GPT-4o by 10-40% on mammography VQA tasks, but neuroradiology accuracy is only 44% and mammography performance (52-64%) remains below specialized models (>80%). Conclusion: GPT-5 makes substantive progress on integrated clinical reasoning, mirroring how a clinician grounds an ambiguous history in objective imaging evidence, but cannot yet replace purpose-built systems on highly specialized, perception-critical tasks. Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. 
These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.

[97] Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition

Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu

Main category: cs.CV

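TL;DR: This paper proposes a Global Anti-Monotonic Differential Selection Strategy (GAMDSS) that improves spatio-temporal modeling of micro-expressions through keyframe re-selection, substantially reducing subjective human error in cross-cultural annotation and improving recognition performance.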

Details Motivation: Manual annotation of micro-expressions is inaccurate in cross-cultural settings, with key-frame labels deviating most, so more robust and objective annotation and modeling methods are needed. Method: The GAMDSS architecture identifies Onset and Apex frames through a dynamic keyframe re-selection mechanism, derives Offset frames from them, and builds a rich spatio-temporal dynamic representation; a two-branch structure with shared parameters extracts spatio-temporal features efficiently. Result: Validated on seven mainstream micro-expression datasets (including SAMM and 4DME), the method markedly reduces subjective annotation error in cross-cultural data; quantitative analysis confirms that Offset-frame annotations are more uncertain, providing a theoretical basis for annotation standardization. Conclusion: GAMDSS can be embedded in existing models without adding parameters, improves micro-expression recognition performance, and prompts a rethinking of current dataset annotation paradigms. Abstract: Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance.
The source code is available on GitHub: https://github.com/Cross-Innovation-Lab/GAMDSS.

[98] DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction

Shiyu Zhang,Zhicong Wu,Huangxuan Zhao,Zhentao Liu,Lei Chen,Yong Luo,Lefei Zhang,Zhiming Cui,Ziwen Ke,Bo Du

Main category: cs.CV

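TL;DR: This paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Through multi-fidelity texture learning and a radiative sub-pixel densification strategy, it raises the resolution and detail fidelity of 4D vessel reconstruction without introducing severe artifacts.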

Details Motivation: Existing 3D vessel reconstruction methods based on Gaussian splatting and dynamic neural representations are limited by the resolution of the input projections; naive upsampling causes blurring and aliasing and cannot recover fine vascular structures, which restricts use in precise clinical diagnosis and treatment. Method: The DSA-SRGS framework comprises (1) a Multi-Fidelity Texture Learning Module that integrates priors from a fine-tuned DSA-specific super-resolution model and uses a confidence-aware strategy to weight supervision from the real low-resolution projections and the generated high-resolution pseudo-labels, and (2) a Radiative Sub-Pixel Densification strategy that uses gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels. Result: On two clinical DSA datasets, DSA-SRGS significantly outperforms state-of-the-art methods in quantitative metrics (such as PSNR and SSIM) and in qualitative visual quality (fine branches, edge sharpness). Conclusion: DSA-SRGS addresses the super-resolution bottleneck in dynamic sparse-view DSA reconstruction, achieves high-fidelity, detailed 4D vessel modeling for the first time, and offers a new tool for precise diagnosis and treatment of cerebrovascular disease. Abstract: Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in Gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels.
Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.

[99] MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement

Linda Wei,Chang Liu,Wenran Zhang,Yuxuan Hu,Ruiyang Li,Feng Qi,Changyao Tian,Ke Wang,Yuanyuan Wang,Shaoting Zhang,Dimitris Metaxas,Hongsheng Li

Main category: cs.CV

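TL;DR: This paper proposes MADCrowner, a dental crown mesh generation framework consisting of CrownDeformR (a template deformation module driven by anatomical context) and CrownSegger (a margin segmentation network), to improve the geometric accuracy and clinical feasibility of automated crown design.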

Details Motivation: Existing learning-based methods for automatic crown generation suffer from inadequate spatial resolution, noisy outputs, and overextended surface reconstruction, and clinical practice still requires extensive manual adjustment. Method: The margin-aware mesh generation framework MADCrowner includes (1) CrownDeformR, which uses a multi-scale intraoral scan encoder to extract anatomical context that drives deformation of an initial template; (2) CrownSegger, a novel margin segmentation network that extracts the cervical margin of the tooth as a deformation constraint and as the boundary condition for postprocessing; and (3) tailored postprocessing that removes overextended regions of the reconstructed surface. Result: Experiments on a self-built large-scale intraoral scan dataset show that the method significantly outperforms existing approaches in geometric accuracy (e.g., error metrics) and clinical feasibility (e.g., margin fit, manufacturability). Conclusion: By introducing a margin-aware mechanism and a clinically inspired deformation paradigm, MADCrowner eases key bottlenecks in automated crown design and gives CAD systems a more reliable and practical AI-assisted solution. Abstract: Dental crown restoration is one of the most common treatment modalities for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the clinical manual workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced CrownSegger, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint, and the margin was also used as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.

[100] Privacy-Aware Camera 2.0 Technical Report

Huan Song,Shuyu Tian,Ting Long,Jiang Liu,Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li

Main category: cs.CV

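TL;DR: This paper proposes a new privacy-preserving perception framework based on the AI Flow paradigm and an edge-cloud collaborative architecture. Nonlinear mapping and random noise injection at the edge convert raw images into irreversible abstract feature vectors, and the cloud uses a "dynamic contour" visual language for behavior recognition and semantic reconstruction, maintaining perception capability while protecting privacy.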

Details Motivation: Existing privacy-preserving methods (such as physical desensitization, encryption, and obfuscation) often impair semantic understanding or lack mathematically provable irreversibility; Privacy Camera 1.0 eliminated images at the source but output only textual judgments, leaving no evidence in disputes. Method: Following the Information Bottleneck principle, a visual desensitizer deployed at the edge applies real-time nonlinear mapping and random noise injection to raw images, producing abstract feature vectors that cannot be reconstructed; the cloud performs behavior recognition and semantic reconstruction with a "dynamic contour" visual language. Result: Identity information is stripped from raw images and the desensitization is mathematically irreversible, while the cloud still supports high-accuracy behavior recognition and an illustrative visual reference, resolving the privacy-security paradox. Conclusion: The framework strikes a critical balance between privacy protection and perception capability and offers a secure, verifiable, and practical paradigm for intelligent sensing in highly sensitive places. Abstract: With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.

[101] RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery

Huiran Sun

Main category: cs.CV

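TL;DR: This paper proposes RMK RetinaNet, which uses a multi-scale kernel block, a multi-directional contextual anchor attention mechanism, a bottom-up path, and an Euler angle encoding module to address three bottlenecks in rotated object detection for remote sensing imagery: non-adaptive receptive fields, insufficient long-range multi-scale feature fusion, and discontinuous angle regression.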

Details Motivation: Rotated object detection in remote sensing imagery faces three main bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. Method: RMK RetinaNet comprises (1) a Multi-Scale Kernel (MSK) Block that strengthens adaptive multi-scale feature extraction; (2) a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism that improves contextual modeling across scales and orientations; (3) a Bottom-up Path that preserves fine-grained spatial detail; and (4) an Euler Angle Encoding Module (EAEM) for continuous, stable angle regression. Result: Experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet performs on par with state-of-the-art methods and is more robust in multi-scale and multi-orientation scenarios. Conclusion: RMK RetinaNet eases key bottlenecks in rotated object detection and, while maintaining high performance, adapts markedly better to complex variations in scale and orientation. Abstract: Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
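
The angle-regression discontinuity named in this entry can be illustrated with a small sketch. The abstract does not say how EAEM is built, so the encoding below (an orientation with period pi mapped to the point (cos 2θ, sin 2θ) and decoded with atan2) is a generic stand-in assumed for illustration, not the paper's module; `encode_angle` and `decode_angle` are hypothetical names.

```python
import math

def encode_angle(theta):
    """Map an orientation in radians (period pi) to a point on the unit circle."""
    return math.cos(2.0 * theta), math.sin(2.0 * theta)

def decode_angle(c, s):
    """Invert encode_angle; the result lies in (-pi/2, pi/2]."""
    return 0.5 * math.atan2(s, c)

# Two nearly identical boxes whose angles sit on either side of the wrap-around.
a = math.radians(89.0)
b = math.radians(-89.0)

raw_gap = abs(a - b)                        # large, although the boxes almost coincide
ca, sa = encode_angle(a)
cb, sb = encode_angle(b)
encoded_gap = math.hypot(ca - cb, sa - sb)  # small in the encoded space

print(f"raw gap: {raw_gap:.3f} rad, encoded gap: {encoded_gap:.3f}")
```

Regressing the encoded pair instead of the raw angle gives a target that changes smoothly across the ±90° boundary, which is the property a continuous angle encoding is meant to provide.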

[102] LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation

Anugunj Naman,Ayushman Singh,Gaibo Zhang,Yaguang Zhang

Main category: cs.CV

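TL;DR: This paper proposes two adaptive spatial weighting network adapters: LAW, which modulates the per-pixel loss during diffusion model training, and ORDER, which applies selective bidirectional skip attention for efficient segmentation. Together they substantially improve medical image generation and segmentation.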

Details Motivation: In medical image analysis, lesions are small and backgrounds large, so both generation and segmentation face spatial imbalance: existing diffusion models drift from the prescribed lesion layout, and efficient segmenters perform poorly in spatially uncertain regions. Method: Two network adapters are proposed: (1) the Learnable Adaptive Weighter (LAW), which predicts per-pixel loss modulation from features and masks and stabilizes training through normalization, clamping, and regularization; and (2) Optimal Region Detection with Efficient Resolution (ORDER), which applies selective bidirectional skip attention at late decoder stages to improve segmentation efficiency. Result: On polyp and kidney tumor datasets, LAW lowers FID by 20% (52.28 vs. 65.60), and the synthetic data raises the downstream segmentation Dice coefficient by 4.9 percentage points (83.2% vs. 78.3%); ORDER raises Dice on MK-UNet by 6.0 percentage points (81.3% vs. 75.3%) with only 0.56 GFLOPs and 42K parameters, 730x smaller than the standard nnUNet. Conclusion: Adaptive spatial weighting mitigates spatial imbalance in medical image generation and segmentation; LAW and ORDER deliver clear gains in generation quality and segmentation efficiency, respectively, and are lightweight, stable, and practical. Abstract: Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
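
As a rough illustration of per-pixel loss modulation stabilized by clamping and normalization, the sketch below weights a squared-error loss over four "pixels". In the paper the weights are predicted by a learned network from features and masks; here fixed numbers stand in for them, and the clamp range, the mean-one normalization, and the function names are all assumptions made for this example.

```python
def stabilize_weights(raw, lo=0.5, hi=4.0):
    """Clamp raw per-pixel weights to [lo, hi], then rescale them to mean 1."""
    clamped = [min(max(w, lo), hi) for w in raw]
    mean = sum(clamped) / len(clamped)
    return [w / mean for w in clamped]

def weighted_mse(pred, target, weights):
    """Mean of per-pixel squared errors, each scaled by its weight."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return total / len(pred)

# Three background pixels and one lesion pixel (the last one).
target = [0.0, 0.0, 0.0, 1.0]
pred = [0.1, 0.1, 0.1, 0.5]
raw_weights = [1.0, 1.0, 1.0, 10.0]  # hypothetical output of a weight predictor

weights = stabilize_weights(raw_weights)
uniform = weighted_mse(pred, target, [1.0] * len(pred))
weighted = weighted_mse(pred, target, weights)

print(f"uniform loss: {uniform:.4f}, weighted loss: {weighted:.4f}")
```

Clamping keeps a single pixel from dominating the loss, and the mean-one rescaling keeps the overall loss scale comparable to the uniform baseline, so the lesion error counts for more without the weights drifting toward a degenerate solution.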

[103] Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper

Kiranmayee Janardhan,Vinay Martin DSa Prabhu,T. Christy Bobby

Main category: cs.CV

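TL;DR: This review surveys segmentation and classification techniques for brain gliomas and highlights that convolutional neural networks outperform traditional methods in post-acquisition processing of magnetic resonance images.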

Details Motivation: Accurate segmentation and classification of brain gliomas are essential for treatment planning, prognosis prediction, and disease monitoring, but irregular tissue makes segmentation difficult and hard to reproduce. Method: The review surveys existing fully automatic and semi-automatic segmentation and classification methods, focusing on the performance of deep learning architectures based on convolutional neural networks (CNNs). Result: CNN architectures significantly outperform traditional methods in glioma segmentation and classification; radiologists tend to prefer semi-automatic techniques that are easy to use and controllable. Conclusion: Deep learning, and CNNs in particular, is currently the most promising direction for glioma image analysis; future work must balance the degree of automation with clinical practicality. Abstract: Segmentation is crucial for brain gliomas as it delineates the glioma's extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma's size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error-free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques post magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.

[104] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

Lulu Hu,Wenhu Xiao,Xin Chen,Xinhua Xu,Bowen Xu,Kun Li,Yongliang Tao

Main category: cs.CV

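TL;DR: This paper proposes the MASQuant framework, which uses Modality-Aware Smoothing (MAS) and Cross-Modal Compensation (CMC) to resolve the smoothing misalignment and cross-modal computational invariance problems that SmoothQuant exhibits on MLLMs, enabling stable and efficient post-training quantization.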

Details Motivation: Existing PTQ methods designed for large language models (such as SmoothQuant) face two challenges on multimodal large language models (MLLMs), smoothing misalignment and cross-modal computational invariance, so a quantization scheme suited to multimodal characteristics is needed. Method: Modality-Aware Smoothing Quantization (MASQuant) comprises (1) Modality-Aware Smoothing (MAS), which learns separate smoothing factors for each modality, and (2) Cross-Modal Compensation (CMC), which uses SVD whitening to transform multimodal activation differences into low-rank forms so that quantization is unified across modalities. Result: MASQuant shows stable quantization performance on both dual-modal and tri-modal MLLMs, and experiments show it is competitive among mainstream PTQ algorithms. Conclusion: MASQuant eases cross-modality inconsistency in MLLM quantization and offers a new approach and practical tool for efficient deployment of multimodal models. Abstract: Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
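
To make the smoothing idea concrete, the sketch below computes SmoothQuant-style per-channel factors, s_j = max|X_j|^α / max|W_j|^(1-α), separately for a toy "vision" and "text" activation batch and checks that moving the factors from activations into weights leaves the layer output unchanged. It is an assumption-laden illustration: MASQuant learns its factors rather than computing them in closed form, the data and helper names are invented, and the CMC step is not implemented.

```python
def smoothing_factors(acts, weight, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    factors = []
    for j in range(len(weight)):
        act_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(v) for v in weight[j])
        factors.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return factors

def smooth(acts, weight, s):
    """Divide activation channels by s and scale the matching weight rows by s."""
    acts_s = [[x / s[j] for j, x in enumerate(row)] for row in acts]
    weight_s = [[s[j] * w for w in weight[j]] for j in range(len(weight))]
    return acts_s, weight_s

def matmul(a, b):
    """Plain (n x k) @ (k x m) matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

weight = [[0.2, -0.1], [0.4, 0.3]]       # 2 input channels x 2 output channels
vision = [[40.0, 1.0], [-32.0, 0.5]]     # channel 0 carries large outliers
text = [[0.5, 2.0], [1.0, -1.5]]         # much smaller activation range

s_vision = smoothing_factors(vision, weight)
s_text = smoothing_factors(text, weight)
print("vision factors:", [round(v, 3) for v in s_vision])
print("text factors:  ", [round(v, 3) for v in s_text])

# Smoothing is output-preserving for the modality whose factors are used.
reference = matmul(vision, weight)
smoothed = matmul(*smooth(vision, weight, s_vision))
```

The two modalities call for very different factors on channel 0, yet a single weight matrix can absorb only one set of them, which is the tension the abstract describes as smoothing misalignment and cross-modal computational invariance.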

[105] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Ruochen Cui,Xilin Zhao,Qingming Huang

Main category: cs.CV

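TL;DR: This paper proposes Diffusion Contrastive Reconstruction (DCR), which injects contrastive signals derived from reconstructed images into the diffusion reconstruction process to jointly optimize the discriminative ability and detail perception of the CLIP visual encoder, improving downstream performance.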

Details Motivation: The CLIP visual encoder has limited understanding capacity, chiefly in Discriminative Ability (D-Ability) and Detail Perceptual Ability (P-Ability); existing approaches that enhance representations with diffusion models may harm D-Ability and fail to resolve CLIP's representation bottleneck. Method: The DCR framework introduces contrastive signals derived from reconstructed images (rather than the original inputs) into diffusion-based reconstruction, unifying the learning objective to relieve gradient conflict; theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Result: DCR is validated on multiple benchmarks and multimodal large language models and markedly improves downstream performance. Conclusion: Through contrastive learning driven by reconstructed images, DCR strengthens CLIP visual representations more comprehensively and offers a new approach to multimodal pre-training. Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.

[106] Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation

SangHyuk Kim,Daniel Haehn,Sumientra Rampersad

Main category: cs.CV

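TL;DR: This paper proposes the Meta-D architecture, which explicitly uses MRI scan metadata (such as sequence type and plane orientation) to guide feature extraction in brain tumor analysis, substantially improving 2D tumor detection and 3D missing-modality segmentation.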

Details Motivation: To improve the performance of medical image deep learning pipelines by integrating explicit metadata that stabilizes feature representations and provides a robust anchor when data are missing. Method: In the Meta-D architecture, metadata dynamically modulates convolutional features for 2D detection; for 3D missing-modality segmentation, a Transformer Maximizer uses metadata-driven cross-attention to selectively route the available modalities. Result: The F1-score for 2D tumor detection rises by up to 2.62% in absolute terms; the Dice score for 3D missing-modality segmentation rises by up to 5.12%, with 24.1% fewer model parameters. Conclusion: Explicitly introducing categorical scan metadata improves the robustness and performance of medical image analysis models, with the clearest advantage when data are incomplete. Abstract: We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
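
One plausible reading of "metadata dynamically modulates convolutional features" is a channel-wise scale and shift conditioned on one-hot metadata. The sketch below shows that pattern on a two-channel feature vector; the abstract does not specify the mechanism, so the affine form, the vocabularies, and the table values (which would be learned in practice) are all assumptions.

```python
SEQUENCES = ["T1", "T2", "FLAIR"]          # assumed sequence vocabulary
PLANES = ["axial", "coronal", "sagittal"]  # assumed plane vocabulary

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over its vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def modulate(features, metadata, scale_table, shift_table):
    """Channel-wise affine modulation: out_c = (1 + gamma_c) * f_c + beta_c."""
    out = []
    for c, f in enumerate(features):
        gamma = sum(m * row[c] for m, row in zip(metadata, scale_table))
        beta = sum(m * row[c] for m, row in zip(metadata, shift_table))
        out.append((1.0 + gamma) * f + beta)
    return out

# One row per metadata entry (3 sequences + 3 planes), one column per channel.
scale_table = [[0.0, 0.0], [0.5, -0.5], [0.2, 0.1],
               [0.1, 0.0], [0.0, 0.3], [-0.2, 0.2]]
shift_table = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.0],
               [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

metadata = one_hot("T2", SEQUENCES) + one_hot("axial", PLANES)
features = [2.0, 4.0]  # stand-in for two pooled convolutional channels
modulated = modulate(features, metadata, scale_table, shift_table)
print(modulated)
```

The same image features are rescaled differently for a T2 axial slice than for, say, a FLAIR sagittal one, which is the sense in which categorical metadata can steer feature extraction.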

[107] Revisiting Shape from Polarization in the Era of Vision Foundation Models

Chenhao Li,Taishi Ono,Takeshi Uemori,Yusuke Moriuchi

Main category: cs.CV

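TL;DR: This paper presents a new method that uses polarization cues to improve single-shot object-level surface normal estimation. By building a high-quality polarization dataset and a sensor-aware data augmentation strategy, a lightweight model trained on only 40K scenes surpasses existing RGB-only vision foundation models and traditional polarization methods.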

Details Motivation: Polarization cues have a strong physical link to geometry, yet earlier SfP methods underperformed because of domain gaps such as unrealistic synthetic data and poorly modeled sensor noise, which cast doubt on whether polarization is needed; this paper aims to verify the value of the polarization modality itself and to close the domain gap. Method: (1) Build a high-quality synthetic polarization dataset from 1,954 real 3D-scanned objects; (2) incorporate pretrained DINOv3 priors to improve generalization; (3) design polarization sensor-aware data augmentation that better simulates real noise. Result: With only 40K training scenes, the method significantly outperforms state-of-the-art SfP methods and RGB-only VFMs; polarization cues allow 33x less training data or 8x fewer model parameters while still giving higher accuracy. Conclusion: The polarization modality has distinct value, and the performance bottleneck stems from domain gaps rather than a flaw in the modality; with better data quality and noise modeling, a lightweight model trained on little data can surpass large-scale RGB-only models. Abstract: We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs.
Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.

[108] Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

Rui Zhao,Bin Shi,Kai Sun,Bo Dong

Main category: cs.CV

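TL;DR: This paper proposes CAD, a class-specific augmentation based disentanglement framework for instance-dependent partial label learning (ID-PLL). Through intra-class alignment of feature augmentations and an inter-class weighted penalty, it reduces the class confusion caused by instance entanglement and substantially improves classification performance.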

Details Motivation: Real-world partial label learning is often instance-dependent (ID-PLL), and instance entanglement (samples from similar classes with overlapping features and candidate labels) causes severe class confusion that existing methods model inadequately. Method: The Class-specific Augmentation based Disentanglement (CAD) framework applies (1) intra-class regulation, which generates class-specific augmentations and aligns them, and (2) inter-class regulation, which uses a weighted penalty loss that penalizes more ambiguous candidate labels more heavily to enlarge inter-class distances. Result: Experiments on multiple benchmark datasets show that CAD mitigates instance entanglement and significantly outperforms existing ID-PLL methods. Conclusion: By combining intra-class augmentation alignment with inter-class weighted discrimination, CAD sharpens class boundaries and improves model robustness in ID-PLL, offering a new approach to entanglement in weakly supervised learning. Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.

[109] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler

Main category: cs.CV

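TL;DR: This paper proposes a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that uses progressive, semantically guided perturbation to improve the cross-model and cross-task transferability of adversarial examples against vision-language pre-training models.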

Details Motivation: Existing adversarial attacks on vision-language models rely on static cross-modal interactions and disrupt only positive image-text pairs, which limits cross-modal disruption and transferability. Method: SADCA (1) builds a dynamic contrastive learning mechanism over adversarial, positive, and negative samples that progressively disrupts image-text alignment, and (2) adds a semantic augmentation module that uses input transformations to increase the diversity and generalization of adversarial examples. Result: Experiments on multiple datasets and models show that SADCA significantly improves adversarial transferability and consistently outperforms state-of-the-art methods. Conclusion: Through dynamic contrastive learning and semantic augmentation, SADCA strengthens the transferability of adversarial attacks on vision-language pre-training models and their usefulness for robustness evaluation. Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be designed to transfer, attacking not only different models but also diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It accomplishes this by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.

[110] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler

Main category: cs.CV

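TL;DR: This paper proposes a Multi-Paradigm Collaborative Attack (MPCAttack) framework that fuses multi-paradigm semantic representations of images and text and applies a collaborative optimization strategy based on contrastive matching, substantially improving the transferability of adversarial examples against multi-modal large language models (MLLMs).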

Details Motivation: Existing adversarial attacks on MLLMs rely on surrogate models from a single learning paradigm and optimize independently in their own feature spaces, which yields impoverished representations, a restricted search space, and insufficiently diverse perturbations. Method: The MPCAttack framework introduces a Multi-Paradigm Collaborative Optimisation (MPCO) strategy that aggregates semantic representations of images and text, adaptively balances the importance of different paradigms through contrastive matching, and guides global perturbation optimisation to relieve representation bias. Result: Experiments on multiple benchmarks show that MPCAttack consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. Conclusion: Through cross-modal, cross-paradigm collaborative optimisation, MPCAttack improves the transferability of adversarial examples and provides a new approach and a strong baseline for MLLM security evaluation. Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.

[111] GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction

Tianyu Xiong,Rui Li,Linjie Li,Jiaqi Yang

Main category: cs.CV

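TL;DR: GloSplat is a joint pose-appearance optimization framework that keeps SfM feature tracks as explicit optimizable parameters, using both photometric and geometric constraints during 3D Gaussian Splatting training to prevent pose drift and enable fine-grained refinement. It comes in two variants: the COLMAP-free GloSplat-F and the high-quality GloSplat-A.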

Details Motivation: Traditional pipelines treat feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) as separate problems; existing joint optimization methods refine poses using photometric gradients alone, which leads to early pose drift and lacks geometric stability. Method: During 3D Gaussian Splatting training, GloSplat introduces explicit, optimizable 3D feature-track points as geometric anchors and jointly optimizes a reprojection loss (geometric supervision) with a photometric loss; two variants are proposed: GloSplat-F (retrieval-based pair selection, COLMAP-free) and GloSplat-A (exhaustive matching, high quality). Result: GloSplat-F achieves state-of-the-art results among COLMAP-free methods; GloSplat-A surpasses all COLMAP-based baselines. Conclusion: Explicitly fusing SfM geometric priors with photometric optimization substantially improves pose-estimation robustness and reconstruction quality, supporting the joint geometric-photometric optimization paradigm. Abstract: Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
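
The combination of a reprojection loss with photometric supervision can be sketched in a few lines. The example assumes a simple pinhole camera, track points already expressed in camera coordinates, and a scalar weight between the two terms; none of these details come from the abstract, and the function names and numbers are hypothetical.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy

def reprojection_loss(points, observations, focal, cx, cy):
    """Mean squared pixel distance between projected track points and their 2D observations."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points, observations):
        u, v = project(point, focal, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(points)

def joint_loss(photometric, reprojection, weight=0.1):
    """Photometric term plus a weighted geometric term (the weight is an assumed hyperparameter)."""
    return photometric + weight * reprojection

track_points = [(0.0, 0.0, 2.0), (0.5, -0.25, 4.0)]  # optimizable 3D track points
observed = [(320.0, 240.0), (383.0, 208.0)]          # matched 2D keypoints

geometric = reprojection_loss(track_points, observed, focal=500.0, cx=320.0, cy=240.0)
total = joint_loss(photometric=0.02, reprojection=geometric)
print(f"reprojection: {geometric:.5f}, joint: {total:.6f}")
```

Because the geometric term depends only on the track points and the camera, it keeps constraining the pose even when the rendered image is still blurry early in training, which is the intuition behind using the tracks as persistent anchors.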

[112] Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video

Jerrin Bright,Justin Mende,John Zelek

Main category: cs.CV

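TL;DR: This paper presents a biomechanics analysis pipeline for monocular broadcast video that recovers 18 clinically relevant pitching metrics from ordinary footage with high accuracy and uses them to predict pitcher arm injuries (such as Tommy John surgery), reaching AUC 0.811-0.825 and offering a low-cost, scalable approach to injury-risk screening.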

Details Motivation: Professional multi-camera motion capture systems are expensive and confined to professional venues, so a low-cost, widely deployable alternative is needed to obtain the pitching biomechanics signals that injury prediction depends on. Method: A monocular video pipeline built on DreamPose3D adds a drift-controlled global lifting module (recovering the pelvis trajectory through velocity-based parameterization and sliding-window inference) and a kinematics refinement pipeline with bone-length constraints, joint-angle limits, smoothing, and symmetry constraints to handle motion blur, compression artifacts, and extreme pitching poses. Result: On 156 pitches from 13 professional pitchers, 16 of the 18 metrics reach sub-degree accuracy (MAE < 1°); for injury prediction, the automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries. Conclusion: The method shows that monocular broadcast video is a viable alternative to professional motion capture and that the resulting pose-derived biomechanics metrics are clinically usable and support large-scale injury-risk screening for pitchers. Abstract: Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE < 1°). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
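
For readers less familiar with the AUC figures quoted here, the sketch below computes AUC as the fraction of (injured, uninjured) pairs in which the injured pitcher receives the higher risk score, with ties counted as half. The scores are made up for illustration and have no connection to the paper's data.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical screening scores: higher means higher predicted injury risk.
injured = [0.9, 0.7, 0.4]
healthy = [0.8, 0.3, 0.2, 0.1]

value = auc(injured, healthy)
print(f"AUC = {value:.3f}")
```

Read this way, an AUC of about 0.81 means that a randomly chosen injured pitcher is ranked above a randomly chosen uninjured one roughly four times out of five.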

[113] SURE: Semi-dense Uncertainty-REfined Feature Matching

Sicheng Li,Zaiwang Gu,Jie Zhang,Qing Guo,Xudong Jiang,Jun Cheng

Main category: cs.CV

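TL;DR: This paper proposes SURE, a framework that jointly models aleatoric and epistemic uncertainty to perform semi-dense image matching together with confidence estimation, improving matching robustness and accuracy under large viewpoint changes and in textureless regions.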

Details Motivation: Existing methods rely solely on feature similarity and lack an explicit model of match reliability, which produces high-confidence false matches under large viewpoint changes or in textureless regions. Method: SURE, a semi-dense uncertainty-refined matching framework with a novel evidential head for trustworthy coordinate regression and a lightweight spatial fusion module; it jointly predicts correspondences and their confidence while modeling aleatoric and epistemic uncertainty. Result: On multiple standard benchmarks, SURE consistently outperforms state-of-the-art semi-dense matching methods in both accuracy and runtime efficiency. Conclusion: By introducing uncertainty modeling, SURE mitigates the overconfidence of conventional matching methods and improves robustness and reliability in challenging scenes. Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at https://github.com/LSC-ALAN/SURE.

[114] Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

Jaekyun Ko,Dongjin Kim,Soomin Lee,Guanghui Wang,Tae Hyun Kim

Main category: cs.CV

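TL;DR: This paper proposes Prompt-Driven Noise Generation (PNG), a framework that generates realistic real-world noisy images without depending on camera metadata, improving the generalization and practicality of denoising models in real scenes.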

Details Motivation: Existing metadata-based noise synthesis methods are limited by missing metadata or inconsistencies between devices, which restricts their applicability; in addition, real paired noisy-clean images are scarce, which limits the practical effectiveness of end-to-end denoising methods. Method: The Prompt-Driven Noise Generation (PNG) framework extracts high-dimensional prompt features from the input noise to model the real noise distribution, generating diverse and statistically consistent noisy images without explicit camera metadata. Result: Experiments show that PNG generates highly realistic noisy images and that these images can be applied successfully to real-world noise removal on several benchmark datasets, improving denoising performance. Conclusion: PNG removes the dependency on camera metadata, improves the generality and practicality of noise synthesis, and provides a more robust data generation approach for real-world image denoising. Abstract: Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.

[115] Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics

Jerrin Bright,Michelle Lu,John Zelek

Main category: cs.CV

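TL;DR: By analyzing pitchers' 3D pose sequences and using body kinematics alone, without ball-flight data, this paper reaches 80.4% accuracy on eight-way pitch-type classification and finds that the upper body (especially wrist position and trunk lateral tilt) carries most of the predictive signal.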

Details Motivation: To determine how much a pitcher's body motion reveals about the upcoming pitch, that is, whether pitch type can be predicted from biomechanical features alone. Method: An end-to-end pipeline: diffusion-based 3D pose estimation, automatic pitching-event detection, ground-truth-validated biomechanical feature extraction (229 kinematic features in total), and a gradient-boosted classifier. Result: 80.4% classification accuracy on 119,561 professional pitches; the importance analysis shows the upper body contributes 64.9% of the predictive signal, with wrist position (14.8%) and trunk lateral tilt as the most informative features; four-seam and two-seam fastballs cannot be separated from pose alone. Conclusion: Body pose alone supports fairly high pitch-type recognition performance, but there is an empirical ceiling near 80%, beyond which later information such as ball flight is needed; the result marks the predictive boundary of kinematic information. Abstract: How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, ground-truth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs. two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.

[116] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu,Dong Wei,Qiong Peng,Yawen Huang,Xian Wu,Yefeng Zheng,Liansheng Wang

Main category: cs.CV

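TL;DR: This paper proposes a two-stage framework for CT report generation that uses structure-wise image-text contrastive learning and a dynamic negative queue to improve semantic alignment and abnormality discrimination, reaching state-of-the-art performance on two public datasets.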

Details Motivation: Deep learning methods developed for X-ray report generation are of limited effectiveness for CT report generation (CTRG), because CT images involve larger data volumes and more intricate details and therefore require finer structure-level semantic alignment. Method: A two-stage framework. The first stage performs structure-wise image-text contrastive learning with learnable structure-specific visual queries, introduces soft pseudo targets based on text similarity to mitigate false negatives, and adds a dynamic, diversity-enhanced negative queue. The second stage freezes the visual queries, selects the critical image patch embeddings, and attaches a text decoder to generate the report. Result: On two public CT datasets the framework improves clinical efficiency, each component is shown to be effective, and overall performance is state of the art. Conclusion: The two-stage structure- and report-learning framework effectively models structure-level semantic correspondences between CT images and reports, improving the quality and robustness of generated reports. Abstract: Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.

[117] DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

Xiaodong Zhu,Suting Wang,Yuanming Zheng,Junqi Yang,Yangxu Liao,Yuhong Yang,Weiping Tu,Zhongyuan Wang

Main category: cs.CV

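TL;DR: This paper proposes DeformTrace, which enhances state space models (SSMs) with deformable dynamics and relay mechanisms to improve the precision, long-range modeling, and sensitivity to sparse forgeries of temporal forgery localization (TFL) in video and audio.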

Details Motivation: Existing state space models (SSMs) are limited in temporal forgery localization (TFL) by ambiguous boundaries, sparse forgeries, and insufficient long-range modeling. Method: The DeformTrace framework, comprising (1) Deformable Self-SSM (DS-SSM), which introduces dynamic receptive fields for precise localization; (2) a Relay Token Mechanism that mitigates long-range decay; and (3) Deformable Cross-SSM (DC-SSM), which partitions the state space by query to suppress interference from non-forgery information. The whole is a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Result: DeformTrace reaches state-of-the-art performance on TFL with fewer parameters, faster inference, and stronger robustness. Conclusion: DeformTrace addresses the key bottlenecks of SSMs in TFL and shows the value of deformable dynamics and relay mechanisms for temporal forgery detection, offering a new approach to efficient and precise multimedia forensics. Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.

[118] Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

Hong Liu,Dong Wei,Qian Dai,Xian Wu,Yefeng Zheng,Liansheng Wang

Main category: cs.CV

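TL;DR: This paper proposes FedMEPD, a framework that handles the coexistence of intermodal heterogeneity and personalization needs in federated learning for medical imaging, using modality-specific encoders and a partially personalized multimodal fusion decoder for efficient training and inference.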

Details Motivation: Existing federated learning methods consider only intramodal heterogeneity and struggle with the practical case where participants hold incomplete modalities (intermodal heterogeneity); they also lack support for personalized models. Method: FedMEPD assigns an independent encoder to each modality (fully federated), while the decoder dynamically selects a subset of filters to personalize based on the discrepancy between global and local parameter updates. The server uses a full-modal fusion decoder to optimize the encoders and generates multiple anchors that are distributed to the clients; clients calibrate their missing-modality representations toward the global anchors through scaled dot-product cross-attention. Result: On the BraTS 2018 and 2020 multimodal brain tumor segmentation tasks, FedMEPD outperforms current mainstream multimodal and personalized federated learning methods. Conclusion: FedMEPD handles both intermodal heterogeneity and personalization, and has practical value for multimodal federated learning in medicine. Abstract: Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs, using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.

[119] Locality-Attending Vision Transformer

Sina Hajimiri,Farzad Beizaee,Fereshteh Shakeri,Christian Desrosiers,Ismail Ben Ayed,Jose Dolz

Main category: cs.CV

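TL;DR: This paper proposes a simple and effective add-on for vision transformers that modulates self-attention with a learnable Gaussian kernel and refines patch representations, improving segmentation performance while retaining image-level classification ability.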

Details Motivation: Vision transformers perform well on classification, but their global self-attention can weaken the fine-grained spatial detail that tasks such as segmentation depend on. Method: For vision transformers trained with standard image-level classification, add a learnable Gaussian kernel that modulates self-attention so that it is biased toward neighboring patches, and further refine the representations at patch positions. Result: Substantial gains on three segmentation benchmarks including ADE20K (for example, over 6% and 4% for ViT-Tiny and ViT-Base, respectively), without changing the training regime or harming classification performance. Conclusion: The method improves vision transformers on segmentation while preserving their global modeling ability, and serves as a lightweight, plug-and-play modification. Abstract: Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
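
One way to picture attention modulated by a Gaussian kernel is as a distance-dependent bias added to the attention logits before the softmax. The sketch below is an illustration under that assumption, not the LocAtViT code: the function name `locality_attention` is hypothetical, and `sigma` is a fixed scalar here, whereas the abstract describes the kernel as learnable.

```python
import math

def locality_attention(logits, coords, sigma):
    """Softmax attention with a Gaussian distance bias (illustrative only).

    logits : N x N list of raw query-key scores
    coords : list of N (row, col) patch positions on the image grid
    sigma  : kernel width; large values recover plain global attention
    """
    weights = []
    for i, row in enumerate(logits):
        biased = []
        for j, score in enumerate(row):
            d2 = (coords[i][0] - coords[j][0]) ** 2 + (coords[i][1] - coords[j][1]) ** 2
            # Adding log of a Gaussian kernel penalizes distant patches.
            biased.append(score - d2 / (2.0 * sigma ** 2))
        m = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - m) for b in biased]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Three patches in a row with identical content scores.
coords = [(0, 0), (0, 1), (0, 2)]
flat = [[0.0] * 3 for _ in range(3)]
local = locality_attention(flat, coords, sigma=1.0)
nearly_global = locality_attention(flat, coords, sigma=1e6)
```

With equal content scores, the first patch attends most to itself and least to the farthest patch, while a very wide kernel leaves attention close to uniform, which is how the global pathway is preserved in this formulation.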

[120] FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation

Ganggui Ding,Hao Chen,Xiaogang Xu

Main category: cs.CV

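TL;DR: This paper proposes FC-VFI, which combines temporal modeling on latent sequences, semantic matching line guidance, and a temporal difference loss to achieve faithful and consistent video frame interpolation (4× and 8×), raising frame rate while preserving detail and motion consistency.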

Details Motivation: Existing video diffusion models struggle to balance fidelity and motion consistency in frame interpolation: reliance on intrinsic generative priors loses detail from the start and end frames, and motion control based on optical flow or sparse points is either error-prone or lacks structural context. Method: The FC-VFI framework: (1) temporal modeling on the latent sequences to inherit fidelity cues from the start and end frames; (2) semantic matching lines that provide structure-aware motion guidance; (3) a temporal difference loss that mitigates temporal inconsistency. Result: 4× and 8× interpolation at 2560×1440 resolution, raising frame rate from 30 FPS to 120 and 240 FPS; experiments show strong fidelity, structural integrity, and motion consistency. Conclusion: FC-VFI addresses the difficulty of balancing fidelity and consistency in large-model frame interpolation and offers a new approach to high-quality temporal video enhancement. Abstract: Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high-fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting \(4\times\) and \(8\times\) interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at \(2560\times 1440\) resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
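
The abstract does not define the temporal difference loss, so the following is only a plausible reading: penalize the mismatch between frame-to-frame changes in the prediction and in the reference. The function name `temporal_difference_loss`, the L1 form, and the flat-list frame representation are assumptions made for this sketch.

```python
def temporal_difference_loss(pred, target):
    """Mean absolute error between consecutive-frame differences (assumed form).

    pred, target : lists of frames, each frame a flat list of floats
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length with at least 2 frames")
    total, count = 0.0, 0
    for t in range(len(pred) - 1):
        for p0, p1, g0, g1 in zip(pred[t], pred[t + 1], target[t], target[t + 1]):
            # Compare how much each value changes between frames t and t+1.
            total += abs((p1 - p0) - (g1 - g0))
            count += 1
    return total / count

reference = [[0.0], [1.0], [2.0]]               # steady motion
frozen = [[0.0], [0.0], [0.0]]                  # prediction with no motion
shifted = [[5.0], [6.0], [7.0]]                 # right motion, wrong brightness
loss_frozen = temporal_difference_loss(frozen, reference)
loss_shifted = temporal_difference_loss(shifted, reference)
```

A loss of this shape is blind to a constant offset between prediction and reference, so it would normally be paired with a per-frame reconstruction term.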

[121] AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

Li'an Zhong,Ziqiang He,Jibin Zheng,Jin Li,Z. Jane Wang,Xiangui Kang

Main category: cs.CV

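TL;DR: This paper proposes AdaIAT, a method that adaptively increases the attention LVLMs pay to generated text, alleviating hallucination while avoiding repetitive descriptions and preserving both linguistic coherence and the model's predictive capability.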

Details Motivation: Current large vision-language models (LVLMs) suffer from severe hallucination. Directly increasing attention to image tokens reduces hallucination but induces repetitive descriptions. The authors find that real object tokens attend more to the generated text than hallucinated ones do, which motivates using the visual and contextual information in the generated text to suppress hallucination. Method: The Attention to Generated Text (IAT) mechanism and its adaptive version AdaIAT, which uses a layer-wise threshold to control when to intervene and a fine-grained amplification magnitude tailored to each attention head, so that the model's inherent predictive capability is not impaired. Result: Validated on several LVLMs including LLaVA-1.5, AdaIAT reduces the hallucination rates C_S and C_I by 35.8% and 37.1%, respectively, while preserving linguistic performance and predictive capability. Conclusion: AdaIAT is an effective, robust inference-time intervention that requires no fine-tuning and offers a new approach to hallucination mitigation in LVLMs, balancing accuracy and generation quality. Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
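
The non-adaptive idea, giving generated-text tokens a larger share of attention, can be sketched as scaling the relevant weights and renormalizing. This is a simplified illustration rather than the authors' procedure: the function name `amplify_generated_text_attention`, the single scalar `alpha`, and the choice to act on post-softmax weights are assumptions, and the layer-wise threshold and per-head magnitudes of AdaIAT are omitted.

```python
def amplify_generated_text_attention(weights, is_generated, alpha):
    """Boost attention on generated-text tokens, then renormalize (illustrative).

    weights      : attention weights of one head for one query, summing to 1
    is_generated : True where the key position holds a previously generated token
    alpha        : amplification strength; 0 leaves the weights unchanged
    """
    scaled = [w * (1.0 + alpha) if g else w for w, g in zip(weights, is_generated)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Positions: [image token, prompt token, generated token]
before = [0.5, 0.3, 0.2]
after = amplify_generated_text_attention(before, [False, False, True], alpha=1.0)
```

Because the weights are renormalized, any gain for generated-text positions comes at the expense of the others, which is one reason an adaptive, per-head magnitude is attractive.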

[122] Person Detection and Tracking from an Overhead Crane LiDAR

Nilusha Jayawickrama,Henrik Toikka,Risto Ojala

Main category: cs.CV

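TL;DR: This paper studies person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. To address the strong domain shift of the overhead viewpoint and the lack of suitable public data, the authors build a dedicated overhead LiDAR dataset, adapt several 3D detectors, and add lightweight tracking to maintain stable identities. VoxelNeXt and SECOND perform best, with AP of 0.97 within a 1 m radius and 0.84 within 5 m; the system is feasible in real time, and the data and code are released.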

Details Motivation: The overhead LiDAR viewpoint differs strongly from common vehicle-centric LiDAR benchmarks, and suitable public training data is lacking. Method: Curate a site-specific overhead LiDAR dataset annotated with 3D human bounding boxes, train and evaluate several candidate 3D detectors under a unified protocol, and integrate AB3DMOT and SimpleTrack for lightweight tracking-by-detection. Result: The best adapted detectors (VoxelNeXt and SECOND) reach average precision (AP) of 0.97 within a 1.0 m horizontal radius and 0.84 within 5.0 m; latency measurements confirm real-time feasibility. Conclusion: The work helps bridge the domain gap between standard driving datasets and overhead sensing for person detection and tracking, and releases the dataset and implementation to support follow-up research. Abstract: This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data is scarce. Hence, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. These results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.

[123] Adaptive Prototype-based Interpretable Grading of Prostate Cancer

Riddhasree Bhattacharyya,Pallabi Dutta,Sushmita Mitra

Main category: cs.CV

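TL;DR: This paper proposes a prototype-based weakly supervised framework for interpretable grading of prostate cancer histopathology images; by mirroring the way pathologists compare suspicious regions with clinically validated examples, it improves model trustworthiness and interpretability.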

Details Motivation: Prostate cancer diagnosis places a heavy workload on pathologists, and grading is subjective. Deep learning models perform well but are hard to interpret, and existing explanation methods give only coarse explanations that do not say why the highlighted regions matter. Method: A prototype-based weakly supervised framework: first pre-train at patch level to learn robust prototypical features for each grade; then fine-tune under weak supervision with a newly designed prototype-aware loss; finally add an attention-based dynamic pruning mechanism that handles inter-sample heterogeneity and selectively emphasizes relevant prototypes. Result: Extensive validation on the PANDA and SICAP benchmark datasets indicates that the framework can serve as a reliable assistive tool in pathologists' routine diagnosis. Conclusion: The prototype-driven, weakly supervised, interpretable grading framework improves model transparency and clinical trustworthiness and may help bring AI into practical use in high-stakes medical settings. Abstract: With prostate cancer being one of the most frequently diagnosed malignancies in men, the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.

[124] Location-Aware Pretraining for Medical Difference Visual Question Answering

Denis Musinguzi,Caren Han,Prasenjit Mitra

Main category: cs.CV

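TL;DR: This paper proposes a pretraining framework for medical difference visual question answering (VQA) that adds location-aware tasks (AREF, GCAP, CAREF) to improve the vision encoder's modeling of subtle spatial differences, reaching state-of-the-art results on detecting and reasoning about differences in chest X-ray images.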

Details Motivation: Conventional single-image models cannot support the comparative diagnosis radiologists perform, and standard vision encoders struggle to distinguish subtle changes such as disease progression from acquisition differences. Method: Location-aware pretraining tasks, namely automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF), strengthen the vision encoder's fine-grained spatial representations; the encoder is then combined with a language model for medical difference VQA. Result: State-of-the-art performance on detecting and reasoning about clinically relevant changes in chest X-ray images. Conclusion: Location-aware pretraining improves the vision encoder's perception and understanding of subtle differences in medical images and offers a new approach to multi-image medical VQA. Abstract: Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.

[125] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Jiaxin Fan,Wenpo Song

Main category: cs.CV

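TL;DR: This paper presents VisionPangu, a compact 1.7B-parameter multimodal model that improves fine-grained image description through efficient multimodal alignment and high-quality supervision (such as the human-authored descriptions in the DOCCI dataset), without relying on large-scale model expansion.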

Details Motivation: Existing large multimodal models (LMMs) perform well but depend on large architectures and coarse supervision, which makes it hard for them to produce richly detailed image descriptions. Method: An InternVL-derived vision encoder, the OpenPangu-Embedded language backbone, and a lightweight MLP projector, trained with a LLaVA-inspired instruction-tuning pipeline that incorporates DOCCI's dense human-authored descriptions. Result: While remaining compact, VisionPangu reaches competitive performance on detailed image captioning and generates more structured and richer descriptions. Conclusion: With high-quality data and efficient alignment, compact multimodal models can perform well on fine-grained vision-language understanding without indiscriminately scaling up model size. Abstract: Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.

[126] Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression

Toby Chong,Ryota Nakajima

Main category: cs.CV

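TL;DR: This paper proposes a novel camera model for monocular 3D Morphable Model (3DMM) regression that extends orthographic projection with a shrinkage parameter to model the perspective distortion seen in close-up facial images while keeping the stability of the original orthographic projection.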

Details Motivation: Existing regression-based 3DMM fitting methods mostly use orthographic projection to avoid the ambiguity between focal length and object distance, but this simplification makes it hard to handle the perspective distortion in close-up footage such as that from head-mounted cameras. Method: Add a learnable shrinkage parameter to orthographic projection to simulate a pseudo-perspective effect, and design several techniques for fine-tuning existing models. Result: On a custom close-up face dataset recorded with head-mounted cameras, quantitative and qualitative experiments show that the method improves close-up face reconstruction accuracy while remaining stable. Conclusion: The proposed pseudo-perspective camera model balances modeling capacity and optimization stability, offering a more practical and robust solution for close-up monocular 3DMM regression. Abstract: We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
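
The abstract does not state the projection formula, so the sketch below only illustrates the general idea of an orthographic projection extended by one extra parameter. The specific form, in which image coordinates are divided by `1 + shrink * z`, the function name `project_pseudo_perspective`, and the convention that `z` is a depth offset from the face center (positive means farther away) are all assumptions chosen so that `shrink = 0` reduces exactly to scaled orthographic projection.

```python
def project_pseudo_perspective(points, scale, shrink):
    """Scaled orthographic projection with a depth-dependent shrink (assumed form).

    points : list of (x, y, z); z is a depth offset, positive = farther from camera
    scale  : orthographic scale factor
    shrink : strength of the pseudo-perspective effect; 0 gives pure orthographic
    """
    projected = []
    for x, y, z in points:
        denom = 1.0 + shrink * z
        if denom <= 0.0:
            raise ValueError("point lies at or behind the effective camera plane")
        # Farther points (larger z) are drawn smaller, nearer points larger.
        projected.append((scale * x / denom, scale * y / denom))
    return projected

face_points = [(1.0, 1.0, -1.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
ortho = project_pseudo_perspective(face_points, scale=2.0, shrink=0.0)
close_up = project_pseudo_perspective(face_points, scale=2.0, shrink=0.5)
```

In this form the single parameter stands in for the focal-length and distance pair, so the ambiguity that orthographic projection avoids is not reintroduced.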

[127] BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

Zishu Yao,Xiang-Xiang Su,Shengning Zhou,Guang-Yong Chen,Guodong Fan,Xing Chen

Main category: cs.CV

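TL;DR: This paper proposes BiEvLight, a framework that jointly optimizes low-light image enhancement and event denoising through a gradient-guided event denoising prior and task-constrained bilevel optimization, yielding clear performance gains.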

Details Motivation: Event cameras offer high dynamic range, but their intrinsic background activity noise combined with low-SNR images causes severe noise coupling during modal fusion and becomes a performance bottleneck; precise event denoising is therefore a prerequisite for unlocking the potential of event fusion. Method: BiEvLight (1) exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior, and (2) models event denoising as a bilevel optimization problem constrained by the enhancement task, enabling cross-task interaction and tailored representations. Result: On the real-world noise dataset SDE, the method outperforms state-of-the-art approaches, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR*, and 0.047 in SSIM. Conclusion: Precise, task-adaptive event denoising is essential for event-assisted low-light image enhancement; through its hierarchical and task-aware design, BiEvLight alleviates noise coupling and improves overall enhancement quality. Abstract: Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage (which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective), we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the real-world noise dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR*, and 0.047 in SSIM. The code will be publicly available at https://github.com/iijjlk/BiEvlight.

[128] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Xiongkun Linghu,Jiangyong Huang,Baoxiong Jia,Siyuan Huang

Main category: cs.CV

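TL;DR: This paper proposes 3D-RFT, the first framework to apply Reinforcement Learning with Verifiable Rewards (RLVR) to video-based 3D scene understanding; by directly optimizing evaluation metrics such as 3D IoU and F1-Score, it improves multimodal large language models and outperforms larger models on several benchmarks.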

Details Motivation: Existing 3D scene understanding methods rely mainly on supervised fine-tuning (SFT), whose token-level cross-entropy loss is misaligned with the final task metrics; RLVR has shown advantages for LLM reasoning but remains unexplored in 3D perception. Method: The 3D-RFT framework first activates a 3D-aware multimodal large language model (MLLM) with SFT, then applies reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), using strictly verifiable reward functions built from task metrics such as 3D IoU and F1-Score. Result: 3D-RFT-4B reaches state-of-the-art results on multiple video-based 3D understanding tasks (3D video detection, visual grounding, spatial reasoning), clearly outperforms larger models such as VG LLM-8B, and shows robustness and favorable training properties. Conclusion: 3D-RFT is the first RLVR framework for video-based 3D scene understanding; it validates direct optimization of task metrics and offers a new approach for future 3D understanding research. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
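
To make the idea of a metric-derived, verifiable reward concrete, here is a minimal sketch of a 3D IoU reward together with the group-relative advantage normalization used in GRPO. It is not the 3D-RFT code: the function names are hypothetical, boxes are assumed to be axis-aligned (real 3D detection usually involves oriented boxes), and the reward is simply the IoU between one predicted and one ground-truth box.

```python
import math

def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(t):
        return (t[3] - t[0]) * (t[4] - t[1]) * (t[5] - t[2])

    union = volume(a) + volume(b) - inter
    return inter / union if union > 0.0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# One ground-truth box and three sampled predictions for the same prompt.
truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
samples = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),   # exact match
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),   # shifted by half a box
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),   # disjoint
]
rewards = [iou_3d(s, truth) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed from geometry rather than from a learned judge, it can be checked exactly, which is the sense in which such rewards are verifiable.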

[129] Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

Zheng Wang,Haoran Chen,Haoxuan Qin,Zhipeng Wei,Tianwen Qian,Cong Bai

Main category: cs.CV

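TL;DR: This paper proposes VideoHV-Agent, a framework that recasts long-video question answering as a hypothesis-verification process carried out by four agents (Thinker, Judge, Verifier, and Answer), giving long-video understanding that is more accurate, interpretable, and logically sound at lower computational cost.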

Details Motivation: Long-video understanding faces visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors; the authors argue that the task should first be formulated (think, then find) rather than starting from retrieval. Method: The VideoHV-Agent framework: based on video summaries, a Thinker rewrites candidate answers into testable hypotheses, a Judge derives the key clue that must be checked, a Verifier grounds and tests the clue on localized, fine-grained video content, and an Answer agent integrates the verified evidence into the final answer. Result: State-of-the-art accuracy on three long-video understanding benchmarks, with better interpretability and logical soundness and lower computational cost. Conclusion: A think-first paradigm centered on hypothesis verification outperforms conventional retrieval-driven approaches and offers a more robust, efficient, and interpretable path for long-video understanding. Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.

[130] A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

Jie Zhu,Hanghang Ma,Jia Wang,Yayong Guan,Yanbing Zeng,Lishuai Gao,Junqiang Wu,Jie Hu,Leye Wang

Main category: cs.CV

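TL;DR: Wallaroo is a simple autoregressive baseline built on next-token prediction that unifies multimodal understanding, image generation, and image editing, with multi-resolution support and bilingual (Chinese and English) capability.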

Details Motivation: To explore the potential of autoregressive models for unifying multimodal understanding and generation, and to overcome the task fragmentation, resolution limits, and restricted language support of existing models. Method: Wallaroo uses decoupled visual encoding pathways and a four-stage training strategy, supports multi-resolution image input and output in both Chinese and English, and treats next-token prediction as the single modeling paradigm. Result: Across multiple benchmarks, Wallaroo performs on par with or better than other unified models. Conclusion: The autoregressive paradigm has strong potential for unifying multimodal understanding and generation, and Wallaroo demonstrates the feasibility and advantages of this direction. Abstract: In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.

[131] TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

Jiaxiong Liu,Zhen Tan,Jinpu Zhang,Yi Zhou,Hui Shen,Xieyuanli Chen,Dewen Hu

Main category: cs.CV

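TL;DR: This paper proposes TAPFormer, a transformer-based framework for asynchronous, temporally consistent fusion that enables robust, high-frequency arbitrary point tracking. Its core contributions are a Transient Asynchronous Fusion (TAF) mechanism that models continuous event updates between frames and a Cross-modal Locally Weighted Fusion (CLWF) module that adapts spatial attention to modality reliability. It achieves the best performance on both a newly built real-world dataset and standard benchmarks.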

Details Motivation: Existing methods that fuse RGB frames with event streams mostly rely on synchronous or non-adaptive fusion, which causes temporal misalignment and severe degradation when one modality fails. Method: TAPFormer combines a Transient Asynchronous Fusion (TAF) mechanism, which models the continuous evolution of events between discrete frames, with a Cross-modal Locally Weighted Fusion (CLWF) module that adaptively adjusts spatial attention; the authors also construct a new real-world frame-event dataset for arbitrary point tracking. Result: On the newly built real-world dataset the method improves average pixel error within threshold by 28.2%, and it consistently achieves the best performance on standard point tracking benchmarks. Conclusion: Asynchronous, adaptive cross-modal fusion markedly improves the robustness and accuracy of arbitrary point tracking, particularly in difficult conditions such as illumination changes and motion blur. Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io

[132] MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration

Nanjie Yao,Gangjian Zhang,Wenhao Shen,Jian Shu,Yu Feng,Hao Wang

Main category: cs.CV

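TL;DR: This paper proposes the MultiGO++ framework, which combines multi-source texture synthesis, region-aware shape extraction, and a dual reconstruction U-Net to achieve geometry-texture collaboration in monocular 3D clothed human reconstruction, significantly improving reconstruction quality.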

Details Motivation: Existing monocular 3D clothed human reconstruction methods are limited by scarce texture data, inaccurate geometric priors, and biased single-modality supervision, which leads to suboptimal reconstructions. Method: MultiGO++ comprises (1) a multi-source texture synthesis strategy that builds more than 15,000 textured 3D human scans; (2) a region-aware shape extraction module and a Fourier geometry encoder that jointly learn geometric features; (3) a dual reconstruction U-Net that fuses geometry-texture collaborative features to generate high-fidelity 3D meshes. Result: Experiments on two benchmark datasets and many in-the-wild cases show that the method outperforms current state-of-the-art approaches. Conclusion: Through systematic geometry-texture collaboration, MultiGO++ overcomes the key limitations of existing methods in texture, geometry, and supervision, and significantly improves monocular 3D human reconstruction quality. Abstract: Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts and interacts features of each body region to obtain geometry information and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.

[133] Physics-consistent deep learning for blind aberration recovery in mobile optics

Kartik Jhawar,Tamo Sancho Miguel Tandoc,Khoo Jun Xuan,Wang Lipo

Main category: cs.CV

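TL;DR: This paper proposes the Lens2Zernike framework, which blindly recovers physical optical parameters (such as Zernike coefficients) from a single blurred image. By combining Zernike regression, differentiable physics constraints, and multi-task spatial map prediction, it significantly improves parameter estimation accuracy and enables stable non-blind deconvolution.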

Details Motivation: Mobile photography is limited by complex lens aberrations; end-to-end deep learning lacks physical interpretability and tends to hallucinate details, while classical blind deconvolution is highly unstable, so a method that bridges physical modeling and data-driven learning is needed. Method: Lens2Zernike jointly optimizes three components: (1) direct regression of Zernike coefficients (z); (2) differentiable physics constraints covering both the wavefront and the point spread function (p); (3) auxiliary multi-task spatial map prediction (m). Result: In an ablation study on a ResNet-18 backbone, z+p+m improves on z alone by 35%; Zernike coefficient regression error is significantly lower than that of two established deep learning methods; on the IDMxS database the recovered parameters enable stable non-blind deconvolution that restores diffraction-limited detail. Conclusion: According to the authors, Lens2Zernike is the first framework to supervise jointly across three optical domains (wavefront, PSF, and spatial maps), showing that physics-guided deep learning can substantially improve aberration modeling and image restoration while remaining interpretable. Abstract: Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these "black-box" models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.

[134] How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Xiang Yin,Jinfan Hu,Zhiyuan You,Kainan Yan,Yu Tang,Chao Dong,Jinjin Gu

Main category: cs.CV

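TL;DR: This paper presents a large-scale, multi-dimensional evaluation of generative image restoration (GIR) models, showing that the central challenge has shifted from missing detail to detail quality and semantic control. It also uses the evaluation to build a new IQA model that aligns better with human perception.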

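Details Motivation: To determine how far the practical capabilities of generative image restoration (GIR) models have really advanced over earlier methods, and to systematically assess performance differences and the evolution of failure modes in detail, sharpness, semantic correctness, and overall quality. Method: A new multi-dimensional evaluation pipeline covering detail, sharpness, semantic correctness, and overall quality; a large-scale empirical analysis of diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models; and a new image quality assessment (IQA) model trained on the resulting benchmark. Result: The dominant failure mode of GIR models has shifted from "insufficient detail" (under-generation) to "poor detail quality and loss of semantic control" (over-generation); architectures differ markedly along each dimension; the new IQA model agrees more closely with human subjective judgments. Conclusion: Although current GIR models achieve impressive perceptual realism, their practical restoration ability is still limited by detail quality and semantic controllability; the study redefines the field's key challenges and provides a systematic benchmark and direction for future research. Abstract: Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.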

[135] Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

Yulong Shi,Shijie Li,Ziyi Li,Lin Qi

Main category: cs.CV

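TL;DR: This paper proposes Tell2Adapt, a new source-free unsupervised domain adaptation (SFUDA) framework built on a Vision Foundation Model (VFM). Using Context-Aware Prompts Regularization (CAPR) and Visual Plausibility Refinement (VPR), it improves the generalization and reliability of medical image segmentation in multi-modality, multi-target clinical settings and reaches SOTA performance.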

Details Motivation: Existing SFUDA methods are confined to low-gap, domain-specific shifts and do not extend to a unified, multi-modality, multi-target framework, which hinders deployment in real clinical environments. Method: Tell2Adapt (1) draws on the general knowledge of a Vision Foundation Model (VFM); (2) introduces Context-Aware Prompts Regularization (CAPR) to produce high-fidelity prompts and high-quality pseudo-labels; (3) adds a Visual Plausibility Refinement (VPR) module that uses the VFM's anatomical knowledge to re-ground predictions in the low-level visual features of the target image, suppressing noise and false positives. Result: In one of the most extensive SFUDA evaluations to date, covering 10 domain adaptation directions and 22 anatomical targets (including brain, cardiac, polyp, and abdominal targets), the method consistently outperforms existing approaches and achieves SOTA for a unified SFUDA framework in medical image segmentation. Conclusion: Tell2Adapt integrates the strong priors of a VFM into the SFUDA pipeline, balancing adaptation efficiency with clinical reliability, and offers a scalable, trustworthy general solution for deploying medical AI across centers and devices. Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modalities, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at https://github.com/derekshiii/Tell2Adapt.

[136] Generalizable Multiscale Segmentation of Heterogeneous Map Collections

Remi Petitpierre

Main category: cs.CV

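TL;DR: This paper proposes a general semantic segmentation framework for historical maps together with a new benchmark dataset, Semap. Procedural data synthesis and multiscale integration improve robustness and generalization; the framework reaches SOTA performance on multiple datasets and remains stable across map collections, scales, geographic regions, and publication contexts.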

Details Motivation: Most map recognition research designs specialist models for homogeneous map series, which cope poorly with the great diversity of historical maps in style, scale, and geographic coverage; general, transferable semantic segmentation methods are needed to support automated analysis of large, heterogeneous map archives. Method: The author builds the open benchmark dataset Semap (1,439 manually annotated patches) and proposes a semantic segmentation framework that combines procedural data synthesis with multiscale integration. Result: The framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, and segmentation quality remains largely stable across map collections, scales, geographic regions, and publication contexts. Conclusion: A diversity-driven approach to historical map recognition is viable and beneficial; the work provides a scalable methodological basis and open resources for integrating the long tail of map archives into historical geographic studies. Abstract: Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives to historical geographic studies.

[137] Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation

Thomas Pinetz,Veit Hucke,Hrvoje Bogunovic

Main category: cs.CV

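TL;DR: This paper proposes IRTTA, which exploits the intermediate representations produced during reconstruction: at test time it adjusts the normalization-layer parameters of the downstream network to improve segmentation, and it provides semantically meaningful uncertainty estimates at no extra cost.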

Details Motivation: Diagnostic systems built on low-cost imaging devices evaluate downstream task performance using only the final reconstructed image, ignoring the informative intermediate representations produced during reconstruction. Method: IRTTA uses a modulator network, conditioned on the current reconstruction timescale, to adapt the normalization-layer parameters of a frozen downstream network at test time; the modulator is learned at test time with an entropy loss averaged over all timesteps; the variation among timestep-wise segmentations is used to estimate uncertainty. Result: Without modifying the reconstruction algorithm or the downstream model, the method improves medical image segmentation performance and yields low-cost, semantically meaningful uncertainty estimates. Conclusion: Intermediate reconstructions carry rich information, and IRTTA exploits it to improve downstream performance and reliability while leaving the existing system structure unchanged. Abstract: Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
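
The two quantities named in the abstract, an entropy loss averaged over reconstruction timesteps and the variation among timestep-wise segmentations, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list data layout, and the choice of variance as the measure of "variation" are assumptions made for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def averaged_entropy_loss(preds_per_timestep):
    """Mean pixel-wise entropy, averaged over all reconstruction timesteps.

    preds_per_timestep[t][i] holds the class probabilities predicted for
    pixel i from the intermediate reconstruction at timestep t (assumed layout).
    """
    per_step = [sum(entropy(p) for p in step) / len(step)
                for step in preds_per_timestep]
    return sum(per_step) / len(per_step)

def timestep_disagreement(preds_per_timestep, cls=1):
    """Per-pixel variance of the probability of class `cls` across timesteps.

    Variance is an assumed stand-in for the "variation" mentioned in the abstract.
    """
    n_steps = len(preds_per_timestep)
    n_pixels = len(preds_per_timestep[0])
    variances = []
    for i in range(n_pixels):
        values = [preds_per_timestep[t][i][cls] for t in range(n_steps)]
        mean = sum(values) / n_steps
        variances.append(sum((v - mean) ** 2 for v in values) / n_steps)
    return variances

# Two timesteps, two pixels, two classes (toy numbers).
preds = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.9, 0.1], [0.1, 0.9]],
]
loss = averaged_entropy_loss(preds)
uncertainty = timestep_disagreement(preds)
```

In this toy input the first pixel is predicted identically at both timesteps, so its disagreement is zero, while the second pixel changes between timesteps and receives a nonzero value. In a real system the loss would be minimized with respect to the modulator network's parameters, which this sketch does not model.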

[138] CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

Zhaonian Kuang,Rui Ding,Haotian Wang,Xinhu Zheng,Meng Yang,Gang Hua

Main category: cs.CV

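TL;DR: This paper proposes the CoIn3D framework, which uses spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA) to address the poor generalization of multi-camera 3D object detection across camera configurations; cross-configuration transfer is validated on NuScenes, Waymo, and Lyft.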

Details Motivation: Existing MC3D models generalize poorly to unseen multi-camera configurations; the core problem is the discrepancy in spatial priors (intrinsics, extrinsics, and array layout) between source and target configurations. Method: CoIn3D comprises (1) spatial-aware feature modulation (SFM), which integrates four spatial representations (focal length, ground depth, ground gradient, and Plücker coordinates) into the feature embedding; (2) camera-aware data augmentation (CDA), which increases observation diversity through a training-free dynamic novel-view image synthesis scheme. Result: On NuScenes, Waymo, and Lyft, CoIn3D shows strong cross-configuration detection performance under three dominant MC3D paradigms: BEVDepth, BEVFormer, and PETR. Conclusion: Explicitly modeling and fusing multiple spatial priors significantly improves generalization of MC3D models to unknown camera configurations, giving a practical way to adapt to different hardware platforms at deployment. Abstract: Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
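
Of the four spatial representations listed for SFM, the Plücker coordinate has a standard textbook definition: a ray with origin o and unit direction d is encoded as the pair (d, o × d). The sketch below computes it for a single camera ray; how CoIn3D derives per-pixel rays and feeds them into the network is not stated in the abstract, so the function name and the (direction, moment) ordering here are assumptions.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)
    return d, m

# A camera at (1, 0, 0) looking along +z (toy values).
d, m = plucker_ray(origin=(1.0, 0.0, 0.0), direction=(0.0, 0.0, 2.0))
```

The moment m is always perpendicular to d, and the pair does not change if the origin slides along the ray, which is what makes the encoding a property of the ray itself and not of the chosen camera center.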

[139] CLIP-driven Zero-shot Learning with Ambiguous Labels

Jinfu Fan,Jiangnan Li,Xiaowen Yan,Xiaohui Zhong,Wenpeng Lu,Linqing Huang

Main category: cs.CV

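TL;DR: This paper proposes CLIP-PZSL, a CLIP-based zero-shot learning framework for handling label ambiguity; semantic mining and a partial zero-shot loss improve model performance under noisy labels.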

Details Motivation: Existing zero-shot learning methods usually assume that training samples carry accurate class labels, but label noise and ambiguity in real-world settings significantly degrade performance, so a method that handles label uncertainty is needed. Method: CLIP extracts instance and label features; a semantic mining block fuses them to obtain discriminative label embeddings; a partial zero-shot loss weights candidate labels by their relevance to the instance and aligns instance and label embeddings to reduce semantic mismatch; iterative optimization progressively identifies the true labels and updates the embeddings. Result: Comprehensive experiments on several datasets confirm the effectiveness of CLIP-PZSL, which shows a clear advantage over baseline methods. Conclusion: CLIP-PZSL is a robust and effective partial-label zero-shot learning framework that mitigates the performance loss caused by label ambiguity and offers a new approach to zero-shot recognition in practice. Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
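
The abstract does not give the form of the partial zero-shot loss, only that candidate labels are weighted by their relevance to the instance. A generic partial-label loss in that spirit might look like the sketch below, where the weights are a softmax of instance-label similarities restricted to the candidate set. The weighting rule, the function names, and the similarity values are all assumptions, not the paper's formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_weights(similarities, candidates):
    """Normalize relevance over the candidate labels only; others get weight 0."""
    cand_probs = softmax([similarities[c] for c in candidates])
    weights = [0.0] * len(similarities)
    for c, p in zip(candidates, cand_probs):
        weights[c] = p
    return weights

def weighted_partial_loss(similarities, candidates):
    """Cross-entropy against the soft target defined by the candidate weights."""
    probs = softmax(similarities)
    weights = candidate_weights(similarities, candidates)
    return -sum(w * math.log(probs[c]) for c, w in enumerate(weights) if w > 0.0)

# Hypothetical instance-to-label similarities for four classes;
# the annotator marked classes 0 and 1 as candidates.
sims = [2.0, 0.5, -1.0, 0.0]
cands = [0, 1]
w = candidate_weights(sims, cands)
loss = weighted_partial_loss(sims, cands)
```

As training sharpens the similarities, the weight mass concentrates on one candidate, which corresponds to the progressive identification of the ground-truth label described in the abstract.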

[140] A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset

Francisco Vacalebri-Lloret,Lucas Banchero,Jose J. Lopez,Jose M. Mossi

Main category: cs.CV

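TL;DR: This paper proposes a blue-light detection system based on multiple fisheye cameras and an improved RT-DETR. Trained on ABLDataset, it detects blue emergency lights and estimates their azimuth in complex environments with high accuracy (94.7%) and at long range (70 meters), with the aim of improving ADAS and road safety.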

Details Motivation: To improve real-time, robust recognition of emergency-vehicle blue lights by Advanced Driver Assistance Systems (ADAS) and thereby enhance road safety, especially under varied climatic and geographic conditions. Method: Four fisheye cameras, each with a 180-degree horizontal field of view, are mounted on the sides of the vehicle and calibrated to enable azimuthal localization; YOLO variants, RetinaNet, Faster R-CNN, and RT-DETR are compared on ABLDataset; RT-DETR is selected and enhanced with a color attention block, and geometric transformations estimate the approach angle of the emergency vehicle. Result: The improved RT-DETR reaches 94.7% accuracy and 94.1% recall on the test set, with field-test detections at up to 70 meters, and estimates the relative azimuth of the blue-light source; the system is designed for fusion with an acoustic modality in a multimodal ADAS. Conclusion: The vision system markedly improves the accuracy, robustness, and practicality of emergency-vehicle blue-light detection and provides an efficient, workable technical path for multimodal ADAS. Abstract: This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.

[141] MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration

Nian Liu,Jin Gao,Shubo Lin,Yutong Kou,Sikui Zhang,Fudong Ge,Zhiqiang Pu,Liang Li,Gang Wang,Yizheng Wang,Weiming Hu

Main category: cs.CV

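TL;DR: This paper proposes MI-DETR, a biologically inspired infrared small target detector that processes one frame per time step. A retina-inspired cellular automaton generates a motion map, and dual pathways (appearance and motion) interact before an RT-DETR decoder; the method clearly outperforms multi-frame approaches on several benchmarks.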

Details Motivation: Infrared small target detection is difficult because targets are tiny and low in contrast against complex, dynamic backgrounds; conventional multi-frame methods learn motion implicitly and often need extra motion supervision or alignment modules, which limits efficiency and interpretability. Method: Motion Integration DETR (MI-DETR) has three parts: (1) a retina-inspired cellular automaton (RCA) explicitly converts the frame sequence into a motion map on the same pixel grid as the appearance image, giving a parvocellular-like (appearance) pathway and a magnocellular-like (motion) pathway; (2) a Parvocellular-Magnocellular Interconnection (PMI) Block provides bidirectional feature interaction between the two pathways; (3) an RT-DETR decoder fuses features from both pathways for detection. The detector processes one frame per time step and needs only standard bounding-box annotations, with no motion labels or explicit alignment. Result: MI-DETR reaches 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, supporting the effectiveness of biologically inspired motion-appearance integration. Conclusion: A dual-pathway architecture that models motion explicitly and fuses it with appearance can clearly outperform multi-frame methods without extra supervision or alignment, offering an efficient and interpretable paradigm for infrared small target detection. Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.

[142] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li,Minghui Guo,Kaiwen Zhang,Shize Zhang,Yiran Zhao,Haodong Li,Congyue Zhou,Weijie Zheng,Yushen Yan,Shengqiong Wu,Wei Ji,Lei Cui,Furu Wei,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

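TL;DR: This paper introduces UniM, the first unified any-to-any interleaved multimodal benchmark, with 31K high-quality instances across 30 domains and 7 modalities, along with an evaluation suite and a baseline model, UniMA, to advance unified understanding and generation in multimodal large language models.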

Details Motivation: Real-world multimodal applications require systems to understand arbitrarily combined and interleaved multimodal inputs and to generate arbitrarily interleaved multimedia outputs, which creates a need for any-to-any interleaved multimodal learning under a unified paradigm. Method: The authors build UniM, the first unified any-to-any interleaved multimodal dataset (31K instances, 30 domains, 7 modalities), design an evaluation suite with three dimensions (Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence), and propose UniMA, an agentic baseline model with traceable reasoning. Result: Experiments show that UniM is highly challenging and expose key bottlenecks and directions for improvement in interleaved multimodal understanding and generation. Conclusion: UniM provides a new benchmark, evaluation method, and baseline model for unified any-to-any multimodal intelligence, pushing multimodal large language models toward greater generality and robustness. Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.

[143] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

Juntong Fang,Zequn Chen,Weiqi Zhang,Donglin Di,Xuancheng Zhang,Chengmin Yang,Yu-Shen Liu

Main category: cs.CV

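TL;DR: MoRe is an efficient feed-forward 4D reconstruction network that uses an attention-forcing strategy to separate dynamic motion from static structure, enabling high-quality, efficient reconstruction of dynamic 3D scenes from monocular video.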

Details Motivation: In dynamic 4D scene reconstruction, moving objects corrupt camera pose estimation; existing optimization-based methods mitigate this only partially, are computationally expensive, and are impractical for real-time use. Method: MoRe builds on a strong static reconstruction backbone and adds an attention-forcing strategy to disentangle dynamic motion from static structure; grouped causal attention models temporal dependencies across frames and adapts to varying token lengths; the model is fine-tuned on large-scale datasets mixing dynamic and static scenes to improve robustness. Result: Experiments on multiple benchmarks show that MoRe achieves high-quality dynamic reconstruction with exceptional computational efficiency. Conclusion: MoRe offers an efficient, robust, and practical feed-forward solution for dynamic 4D scene reconstruction, overcoming the computational bottleneck and real-time limitations of conventional optimization methods. Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
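
The abstract does not define "grouped causal attention" precisely. One plausible reading, assumed here purely for illustration, is that tokens are grouped by frame, every token may attend to all tokens in its own frame and in earlier frames, and frames may contribute different numbers of tokens. Under that assumption the attention mask can be built as follows; the function name is hypothetical.

```python
def grouped_causal_mask(tokens_per_frame):
    """Build a boolean attention mask, mask[query][key].

    Assumed rule: a query token may attend to a key token if the key's frame
    is the same as or earlier than the query's frame. Frames may have
    different token counts.
    """
    # Frame index of every token, in sequence order.
    frame_of = [f for f, n in enumerate(tokens_per_frame) for _ in range(n)]
    return [[fk <= fq for fk in frame_of] for fq in frame_of]

# Three frames contributing 2, 1, and 3 tokens respectively.
mask = grouped_causal_mask([2, 1, 3])
```

With this mask, tokens within a frame see each other in full (block-wise, not token-wise, causality), while no token can look at a later frame.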

[144] Orthogonal Spatial-temporal Distributional Transfer for 4D Generation

Wei Liu,Shengqiong Wu,Bobo Li,Haoyu Zhao,Hao Fei,Mong-Li Lee,Wynne Hsu

Main category: cs.CV

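TL;DR: This paper proposes STD-4D Diffusion, a new 4D synthesis framework that disentangles spatial and temporal latents and, together with the Orster mechanism and the ST-HexPlane structure, transfers priors from 3D and video diffusion models, significantly improving the quality and spatial-temporal consistency of generated 4D content.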

Details Motivation: Current 4D synthesis research is constrained by the lack of large-scale 4D datasets, so models struggle to learn the critical spatial-temporal features, which holds back high-quality 4D generation. Method: The authors propose a spatial-temporal-disentangled STD-4D Diffusion model, design an Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism for effective prior transfer, and build a spatial-temporal-aware ST-HexPlane to integrate the transferred features and improve 4D deformation and Gaussian feature modeling. Result: Experiments show that the method significantly outperforms existing approaches in spatial-temporal consistency and 4D synthesis quality. Conclusion: Through cross-modal prior transfer and disentangled modeling, the work alleviates the scarcity of 4D data and offers a new paradigm for high-quality 4D content generation. Abstract: In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.

[145] GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

Xiaodong Zhu,Yuanming Zheng,Suting Wang,Junqi Yang,Yuhong Yang,Weiping Tu,Zhongyuan Wang

Main category: cs.CV

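TL;DR: This paper proposes GEM-TFL, which uses graph-based modeling and EM optimization to improve the localization of manipulated video and audio segments under weak supervision, narrowing the performance gap with fully supervised methods.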

Details Motivation: Existing weakly supervised temporal forgery localization (WS-TFL) methods suffer from mismatched training and inference objectives, insufficient supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and no modeling of relationships among proposals. Method: GEM-TFL is a two-phase classification-regression framework that (1) uses an EM-based process to convert binary video-level labels into multi-dimensional latent attributes, strengthening weak supervision; (2) adds a training-free temporal consistency refinement that smooths frame-level predictions; (3) includes a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Result: Experiments on benchmark datasets show that GEM-TFL localizes forgeries more accurately and robustly and substantially narrows the gap with fully supervised methods. Conclusion: GEM-TFL bridges much of the performance gap between weakly and fully supervised TFL and offers a new paradigm for multimedia forensics at low annotation cost. Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.

Zongfang Liu,Shengkun Tang,Zongliang Wu,Xin Yuan,Zhiqiang Shen

Main category: cs.CV

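TL;DR: This paper proposes Diff-ES, a stage-wise structured pruning framework based on evolutionary search for compressing diffusion models efficiently, reducing computational cost while preserving image generation quality.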

Details Motivation: Existing diffusion model pruning methods rely on hand-designed sparsity schedules that generalize poorly and give suboptimal performance, and stitching multiple pruned models together adds memory overhead. Method: The diffusion process is divided into stages; evolutionary search automatically optimizes the sparsity of each stage; memory-efficient weight routing dynamically activates stage-conditioned weights, avoiding duplication of model parameters. Result: Experiments on DiT and SDXL show consistent wall-clock speedups with minimal loss of generation quality, reaching SOTA performance for structured pruning of diffusion models. Conclusion: Diff-ES offers an automated, general, and memory-friendly paradigm for structured pruning of diffusion models, addressing both mismatched stage-wise sparsity schedules and memory redundancy. Abstract: Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning. Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.

Iman Nematollahi,Jose Francisco Villena-Ossa,Alina Moter,Kiana Farhadyar,Gabriel Kalweit,Abhinav Valada,Toni Cathomen,Evelyn Ullrich,Maria Kalweit

Main category: cs.CV

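TL;DR: This paper proposes BLINK, a trajectory-based recurrent state-space model of the interaction dynamics between natural killer (NK) cells and tumor cells. It learns from partially observed interaction sequences to predict apoptosis increments, improving cytotoxic outcome detection and the forecasting of future outcomes.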

Details Motivation: Methods that rely only on single-frame annotations cannot reliably infer NK cell cytotoxic outcomes, which emerge from interactions that evolve over time. Method: Proposes BLINK, a trajectory-based recurrent state-space model that learns latent dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate over time. Result: On long-term time-lapse imaging data, BLINK improves cytotoxic outcome detection, supports forecasting of future outcomes, and yields an interpretable latent representation that organizes NK cell trajectories into coherent behavioral modes and temporally structured interaction phases. Conclusion: BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level. Abstract: Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.

[148] UniPAR: A Unified Framework for Pedestrian Attribute Recognition

Minghe Xu,Rouying Wu,Jiarui Xu,Minhao Sun,Zikang Yan,Xiao Wang,ChiaWei Chu,Yu Li

Main category: cs.CV

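TL;DR: This paper proposes UniPAR, a unified Transformer-based framework for pedestrian attribute recognition (PAR) across modalities. It supports RGB images, video sequences, and event streams, uses a unified data scheduling strategy and a dynamic classification head so that a single model handles multiple datasets, and significantly improves cross-domain generalization and robustness in extreme environments.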

Details Motivation: Existing pedestrian attribute recognition research is constrained by the "one-model-per-dataset" paradigm and struggles with the large differences across domains in modality, attribute definitions, and environmental scenarios. Method: Proposes the UniPAR framework, which comprises a unified data scheduling strategy, a dynamic classification head, and a phased fusion encoder that uses a late deep fusion strategy to explicitly align visual features with textual attribute queries. Result: On the MSP60K, DukeMTMC, and EventPAR benchmarks, performance is comparable to specialized SOTA methods; multi-dataset joint training significantly improves cross-domain generalization and recognition robustness in extreme environments such as low light and motion blur. Conclusion: UniPAR achieves unified pedestrian attribute recognition across modalities and datasets, moving past the one-model-per-dataset limitation and offering a more flexible and robust option for PAR in complex real-world scenes. Abstract: Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR

[149] SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

Wenqian Li,Pengfei Fang,Hui Xue

Main category: cs.CV

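TL;DR: This paper proposes SRasP, a new cross-domain few-shot learning method that improves robustness and cross-domain generalization through crop-global style perturbation guided by global semantics, combined with multi-objective optimization.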

Details Motivation: Existing style-perturbation methods for cross-domain few-shot learning suffer from unstable gradients and tend to converge to sharp minima, which limits robustness and transferability. Method: Proposes Self-Reorientation Adversarial Style Perturbation (SRasP), which uses global semantics to identify incoherent crops and then reorients and aggregates the style gradients of those crops with the global style gradients; it also designs a multi-objective optimization function that maximizes visual discrepancy while keeping global, crop, and adversarial features semantically consistent. Result: Significantly outperforms state-of-the-art methods on multiple CD-FSL benchmarks, supporting the method's ability to flatten the solution landscape and improve cross-domain generalization. Conclusion: By stabilizing style perturbations and enforcing semantic consistency, SRasP mitigates domain shift and steers the model toward flatter, more transferable solutions. Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima. To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial Style Perturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.

[150] Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci

Main category: cs.CV

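TL;DR: This paper proposes an adaptive VLA framework inspired by human cognition. It uses visual embeddings to judge task complexity dynamically and routes execution among three modes (Act, Think, Abstain), significantly reducing computational cost and inference latency while maintaining performance.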

Details Motivation: Existing VLA models rely on general-purpose reasoning techniques to improve generalization, which brings high computational cost, inefficient resource allocation, and no uncertainty estimation, leaving them ill-equipped for out-of-distribution tasks. Method: Turns the VLA's vision-language backbone into an active detection tool by projecting visual embeddings into an ensemble of parametric and non-parametric estimators, enabling dynamic execution routing (Act/Think/Abstain) based on the complexity of the perceived state. Result: Validated on LIBERO, LIBERO-PRO, and a real robot; the vision-only configuration reaches an 80% F1-score with only 5% of the training data, making it an efficient and reliable complexity detector. Conclusion: Visual embeddings alone are strongly discriminative of task complexity, and the adaptive framework can substantially improve the efficiency and robustness of VLA systems while preserving reliability. Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.

[151] SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction

Ningjing Fan,Yiqun Wang

Main category: cs.CV

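TL;DR: This paper proposes the SSR-GS framework, which models direct specular reflection with a prefiltered Mip-Cubemap, captures indirect specular reflection with an IndiASG module, and combines a reflection-aware visual prior (RS) with geometry priors (VGGT), significantly improving reconstruction of glossy surfaces under complex illumination.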

Details Motivation: Existing 3D Gaussian splatting methods struggle to reconstruct glossy surfaces accurately under complex illumination (strong specular reflections, multi-surface interreflections). Method: Proposes the SSR-GS framework: (1) a prefiltered Mip-Cubemap models direct specular reflection; (2) an IndiASG module models indirect specular reflection; (3) Visual Geometry Priors (VGP) combine a reflection score (RS) that adjusts the weight of the photometric loss with VGGT-based geometry priors such as depth supervision and normal constraints. Result: Achieves SOTA glossy surface reconstruction on both synthetic and real-world datasets. Conclusion: SSR-GS improves 3DGS modeling of glossy surfaces under complex illumination, combining physically motivated reflection modeling with geometric regularization. Abstract: In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.

[152] The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis

Dishantkumar Sutariya,Eike Petersen

Main category: cs.CV

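TL;DR: This paper studies how deep learning models identify racial identity from chest X-rays (CXR) and thereby bias diagnosis. It examines how image preprocessing methods such as lung masking, lung cropping, and CLAHE affect racial shortcut learning, and finds that bounding-box-based lung cropping can reduce racial bias without sacrificing diagnostic accuracy.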

Details Motivation: Deep learning models can identify racial identity from CXR with high accuracy, raising concern about "racial shortcut learning" (diagnosing from racial information rather than true pathology), which threatens healthcare equity and model reliability. Method: Experimentally evaluates several image preprocessing methods (lung masking, lung cropping, CLAHE) for their effect on suppressing race-related spurious cues and on diagnostic performance, focusing on the trade-off between reducing racial bias and maintaining accuracy. Result: Bounding-box-based lung cropping substantially reduces racial shortcut learning without harming diagnostic accuracy, avoiding the commonly assumed fairness-accuracy trade-off. Conclusion: Simple image preprocessing such as lung cropping is a viable way to mitigate racial bias in medical AI and offers practical guidance for building fairer, more reliable clinical AI systems. Abstract: Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
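
As a rough illustration of what bounding-box-based lung cropping involves, the sketch below crops an image to the tightest rectangle around a binary lung mask. It is not the authors' pipeline: the function names, the `margin` parameter, and the nested-list image representation are assumptions chosen to keep the example dependency-free, and how the lung mask is obtained is left out entirely.

```python
def lung_bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero mask region.

    bottom and right are exclusive. `mask` is a list of rows of 0/1 values
    (a hypothetical stand-in for a lung segmentation mask).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no lung pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_lungs(image, mask, margin=0):
    """Crop `image` to the mask's bounding box, padded by `margin` pixels."""
    top, bottom, left, right = lung_bounding_box(mask)
    # Clamp the padded box to the image borders.
    top = max(top - margin, 0)
    left = max(left - margin, 0)
    bottom = min(bottom + margin, len(image))
    right = min(right + margin, len(image[0]))
    return [row[left:right] for row in image[top:bottom]]
```

Unlike lung masking, a crop of this kind keeps every pixel inside the box, including non-lung tissue between and around the lungs, and discards only the periphery of the image.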

[153] Generic Camera Calibration using Blurry Images

Zezhun Shi

Main category: cs.CV

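TL;DR: This paper proposes a method that combines geometric constraints with a local parametric illumination model to calibrate generic camera models in the presence of motion blur, jointly estimating feature locations and spatially varying point spread functions while resolving the translational ambiguity.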

Details Motivation: Generic camera calibration is more accurate than parametric calibration but needs many more images, which makes motion blur hard for individual users to avoid. Method: Uses geometric constraints and a local parametric illumination model to jointly estimate feature locations and spatially varying point spread functions, and handles the translational ambiguity that conventional image deblurring does not need to consider. Result: Experimental results validate the method under motion blur. Conclusion: The method is a first attempt at addressing motion blur in generic camera calibration and improves calibration robustness and accuracy in practical use. Abstract: Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.

[154] Mario: Multimodal Graph Reasoning with Large Language Models

Yuanfu Sun,Kang Li,Pengkang Guo,Jiajin Liu,Qiaoyu Tan

Main category: cs.CV

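TL;DR: This paper proposes the Mario framework, which uses a graph-conditioned vision-language model and a modality-adaptive graph instruction tuning mechanism to enable effective LLM reasoning on multimodal graphs (MMGs), addressing two challenges: weak cross-modal consistency and heterogeneous modality preference.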

Details Motivation: Existing methods rely on pretrained vision-language models to encode image-text pairs in isolation, ignoring the relational structure inherent in real-world multimodal data; LLM-based reasoning over multimodal graphs therefore needs to preserve graph topology. Method: Proposes Mario, a unified framework with (1) a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; and (2) a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and uses a learnable router to select, for each node and its neighborhood, the most informative modality configuration to feed the LLM. Result: Across multiple MMG benchmarks, Mario consistently outperforms state-of-the-art graph models on node classification and link prediction in both supervised and zero-shot settings. Conclusion: Mario combines graph structure, multimodal signals, and LLM capabilities into a single approach to multimodal graph reasoning. Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.

[155] Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule

Muhammad Zarar,MingZheng Zhang,Xiaowang Zhang,Zhiyong Feng,Sofonias Yitagesu,Kawsar Farooq

Main category: cs.CV

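TL;DR: Logi-PAR is the first framework to inject learnable logic rules into patient activity recognition (PAR). It combines neural-guided differentiable rules with symbolic mappings to enable interpretable risk reasoning and counterfactual interventions.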

Details Motivation: Existing PAR models only classify activities and lack explicit logical reasoning about why something constitutes a risk, so they fall short of the interpretability and intervenability that clinical safety requires. Method: Proposes the Logi-PAR framework, which uses multi-view contextual fact fusion as a primitive extractor and embeds neural-guided differentiable logic rules; the rules are learned automatically and end-to-end from visual cues, and implicit patterns are explicitly labelled as logic rules during training. Result: Achieves SOTA performance on the VAST and OmniFall clinical benchmarks, significantly outperforming vision-language models and Transformer baselines; supports "why" explanations in the form of rule traces and quantified counterfactual interventions (e.g., "risk would decrease by 65% if assistance were present"). Conclusion: Logi-PAR is the first PAR approach based on learnable logic rules, combining accurate recognition with clinically auditable and intervenable reasoning for safety-critical medical AI. Abstract: Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural-pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicit emergence patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable why explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git

[156] Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation

Yingxue Su,Yiheng Zhong,Keying Zhu,Zimu Zhang,Zhuoru Zhang,Yifang Wang,Yuxin Zhang,Jingxin Liu

Main category: cs.CV

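TL;DR: This paper proposes Semantic Class Distribution Learning (SCDL), a framework that mitigates supervision and representation biases by learning structured class-conditional feature distributions, significantly improving minority-class performance in medical image segmentation.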

Details Motivation: Medical image segmentation faces costly dense pixel-level annotation and severe class imbalance, so minority structures are overwhelmed by dominant classes in the feature representation and discriminative features are hard to learn. Method: Proposes the SCDL framework, comprising Class Distribution Bidirectional Alignment (CDBA), which aligns embeddings with learnable class proxies, and Semantic Anchor Constraints (SAC), which use labeled data to guide the proxies. Result: Experiments on the Synapse and AMOS datasets show that SCDL significantly improves overall and per-class metrics, with especially strong gains on minority classes, reaching state-of-the-art results. Conclusion: SCDL is a plug-and-play module that mitigates supervision and representation biases in medical image segmentation, improving robustness to imbalanced data and segmentation accuracy. Abstract: Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.

[157] SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery

Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai

Main category: cs.CV

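TL;DR: SPyCer is a semi-supervised, physics-guided deep network that uses satellite imagery and near-ground sensor data, together with physical constraints (surface energy balance and advection-diffusion-reaction equations), to produce continuous pixel-level estimates of near-surface air temperature (NSAT).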

Details Motivation: Near-surface air temperature (NSAT) matters greatly to humans and ecosystems, but ground sensors are sparse and unevenly distributed and cannot provide continuous spatial coverage; satellites observe the surface over large areas but cannot directly measure near-surface atmospheric parameters, so remote sensing must be combined with physical modeling to fill the gap. Method: Proposes SPyCer, which projects each ground sensor onto satellite image coordinates and builds a local image patch centered on it; the sensor pixel is supervised with observed NSAT and physical constraints, and physics-based regularization terms are derived from the surface energy balance and advection-diffusion-reaction PDEs; a land-cover-guided multi-head attention mechanism with Gaussian distance weighting models the physical influence of neighboring pixels. Result: Experiments on real-world datasets show that SPyCer produces spatially coherent and physically consistent NSAT estimates and outperforms existing baselines in accuracy, generalization, and alignment with physical processes. Conclusion: SPyCer delivers interpretable, physically aligned pixel-level modeling from satellite imagery to near-surface atmospheric parameters, offering a new approach to fusing satellite and ground data in Earth observation. Abstract: Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting. Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
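
The Gaussian distance weighting mentioned in the abstract can be pictured with the small sketch below, which assigns each pixel of a square patch a weight that decays with its distance from the central sensor pixel. The function name, the unnormalized form, and the `sigma` parameter are assumptions for illustration; the abstract does not specify how SPyCer combines these weights with its land-cover-guided attention scores.

```python
import math


def gaussian_distance_weights(patch_size, sigma):
    """Weights exp(-d^2 / (2 * sigma^2)) for a patch_size x patch_size patch.

    d is the Euclidean pixel distance to the patch center, where the
    sensor is assumed to sit. The center pixel gets weight 1.0.
    """
    center = patch_size // 2
    return [
        [
            math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2 * sigma ** 2))
            for c in range(patch_size)
        ]
        for r in range(patch_size)
    ]
```

One plausible use, again an assumption, would be to multiply attention scores by these weights (or add their logarithm before a softmax) so that distant pixels influence the sensor pixel less.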

[158] Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

Serkan Ergun,Tobias Mitterer,Hubert Zangl

Main category: cs.CV

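TL;DR: This paper presents a digital-twin-based dual-arm robotic textile sorting system that combines multimodal perception, grasp prediction, and semantic reasoning with vision-language models (VLMs) to identify and classify deformable garments and foreign objects in cluttered environments. Nine VLMs are evaluated on 223 real inspection scenarios: the Qwen family is the most accurate (up to 87.9%), Gemma3 suits edge deployment, and the digital twin working with MoveIt improves collision avoidance and manipulation reliability.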

Details Motivation: Growing demand for sustainable textile recycling calls for robust automation that can handle deformable garments and detect foreign objects in cluttered environments. Method: Builds a dual-arm robotic system that integrates RGB-D vision, capacitive tactile feedback, and collision-aware motion planning; uses VLMs to classify garments semantically in the inspection zone; and uses a digital twin with MoveIt for path planning and manipulation driven by 3D point clouds shared between the virtual and real environments. Result: The Qwen VLM family reaches the highest overall accuracy on the 223-scenario test (up to 87.9%) with strong foreign object detection; lighter models such as Gemma3 offer a good speed-accuracy trade-off; the digital twin improves manipulation reliability. Conclusion: Semantic VLM reasoning, conventional grasp detection, and digital twin technology can be combined effectively to support scalable, fully automatic textile sorting in real industrial settings. Abstract: The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGB-D sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

[159] CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

Gong Chen,Chaokun Zhang,Tao Tang,Pengcheng Lv,Feng Li,Xin Xie

Main category: cs.CV

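TL;DR: This paper proposes the CATNet framework, which uses spatio-temporal synchronization, wavelet-enhanced denoising, and adaptive feature selection to address temporal latency and multi-source noise in multi-agent cooperative perception, significantly improving robustness and adaptability in complex traffic scenes.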

Details Motivation: Existing cooperative perception methods overlook the high temporal latency and multi-source noise that arise when fusing real multi-source data. Method: Proposes the CATNet framework with three core modules: (1) a Spatio-Temporal Recurrent Synchronization module (STSync) based on adjacent-frame differential modeling; (2) a Dual-Branch Wavelet Enhanced Denoiser (WTDen); and (3) an Adaptive Feature Selector (AdpSel). Result: Experiments on multiple datasets show that CATNet consistently outperforms existing methods under complex traffic conditions, with stronger robustness and adaptability. Conclusion: CATNet mitigates the perception degradation caused by temporal asynchrony and noise interference in multi-agent systems, offering a workable route to practical cooperative perception. Abstract: Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.

[160] Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Shan Ning,Longtian Qiu,Xuming He

Main category: cs.CV

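TL;DR: This paper proposes Wiki-R1, a framework that uses data generation driven by curriculum reinforcement learning to improve the reasoning and domain adaptation of multimodal large language models on knowledge-based visual question answering (KB-VQA), achieving SOTA results on two benchmarks.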

Details Motivation: KB-VQA involves noisy external knowledge retrieval and a structured, encyclopedic knowledge base, which creates a distributional gap from pretrained multimodal large language models (MLLMs) and makes effective reasoning and domain adaptation difficult in post-training. Method: Proposes Wiki-R1, a data-generation-based curriculum reinforcement learning framework comprising controllable curriculum data generation (manipulating the retriever to produce samples of varying difficulty) and a curriculum sampling strategy (selecting informative samples with non-zero advantages); sample difficulty is estimated from observed rewards and propagated to unobserved samples to guide learning. Result: On the Encyclopedic VQA and InfoSeek benchmarks, accuracy rises from 35.5% to 37.1% and from 40.1% to 44.1% respectively, setting a new SOTA. Conclusion: Through curriculum reinforcement learning, Wiki-R1 narrows the gap between pretrained MLLMs and the KB-VQA target distribution, showing the value of controllable data generation and difficulty-aware sampling for model reasoning. Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
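
To make the idea of "samples likely to yield non-zero advantages" concrete, here is a minimal sketch under one loudly stated assumption: that advantages are group-relative (each rollout's reward minus the mean reward of its group, as in GRPO-style training), which the abstract does not confirm. Under that assumption a prompt contributes no learning signal when all its rollouts receive the same reward, so a sampler would prefer prompts whose estimated success rate is strictly between 0 and 1. All names below are hypothetical.

```python
def has_nonzero_advantage(rewards):
    """True if group-relative advantages (reward - group mean) are not all zero.

    Assumption: advantages are computed relative to the group mean, so they
    vanish exactly when every rollout in the group gets the same reward.
    """
    return len(set(rewards)) > 1


def select_informative(estimated_success, low=0.0, high=1.0):
    """Return sample ids whose estimated success rate lies strictly in (low, high).

    `estimated_success` maps a hypothetical sample id to a success-rate
    estimate in [0, 1], e.g. derived from previously observed rewards.
    """
    return [sid for sid, p in estimated_success.items() if low < p < high]
```

For instance, `select_informative({"a": 0.0, "b": 0.5, "c": 1.0})` keeps only `"b"`: the always-failed and always-solved samples would be expected to produce identical rewards within a group.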

[161] Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Ambroise Odonnat,Vasilii Feofanov,Laetitia Chapel,Romain Tavenard,Ievgen Redko

Main category: cs.CV

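TL;DR: This paper studies why intermediate-layer representations in pretrained vision Transformers outperform the final layer. It finds that distribution shift between pretraining and downstream data is the main cause of degradation in deeper layers, and that the best module for linear probing (for example, activations inside the FFN or the normalized MHSA output) depends on how strong the shift is.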

Details Motivation: To explain why intermediate-layer representations of vision Transformers are often more discriminative than the final layer, a phenomenon that also appears in models not pretrained autoregressively. Method: Runs large-scale linear probing experiments on multiple image classification benchmarks and a fine-grained module-level analysis comparing the representational quality at different intermediate positions (Transformer block outputs, FFN activations, normalized MHSA outputs). Result: Distribution shift between pretraining and downstream data is the main cause of degradation in deeper layers; under strong shift, probing the activations inside the FFN works best, while under weak shift, probing the normalized MHSA output works best. Conclusion: Standard probing of Transformer block outputs is not optimal; the probing location should be chosen according to the degree of distribution shift, which has practical implications for transfer learning and representation evaluation. Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

[162] Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation

Kang Luo,Xin Chen,Yangyi Xiao,Hesheng Wang

Main category: cs.CV

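TL;DR: This paper proposes Fusion4CA, which uses a contrastive alignment module, a camera auxiliary branch, a cognitive adapter, and coordinate attention to fuse LiDAR and RGB data more fully in BEV space, significantly improving 3D object detection with very few additional parameters.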

Details Motivation: Existing BEV fusion methods over-rely on the LiDAR branch and under-exploit RGB image information. Method: Building on the BEVFusion framework, introduces a contrastive alignment module (calibrating image features with 3D geometry), a camera auxiliary branch (mining RGB information during training), a cognitive adapter (making use of pretrained image weights), and a coordinate attention module (strengthening the fusion stage). Result: On nuScenes, reaches 69.7% mAP with only 6 training epochs, a 1.2% improvement over the baseline trained for 20 epochs, with only a 3.48% increase in inference parameters; generalization is also validated in a simulated lunar environment. Conclusion: Fusion4CA makes better use of visual information in multimodal BEV detection and balances accuracy, training efficiency, and parameter cost while generalizing well. Abstract: Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird's-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.

[163] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

Guandong Li

Main category: cs.CV

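TL;DR: This paper proposes the SpectralCache framework. It identifies non-uniformity in the DiT denoising process along three axes (time, depth, and feature) and designs timestep-aware dynamic scheduling, cumulative error budgets, and frequency-decomposed caching to accelerate DiT inference substantially without sacrificing generation quality.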

Details Motivation: Existing DiT caching methods treat denoising as uniform across time, depth, and feature dimensions, ignoring the non-uniformity that exists in practice and limiting caching efficiency. Method: Proposes SpectralCache, a unified caching framework comprising (1) Timestep-Aware Dynamic Scheduling (TADS), which schedules caching according to sensitivity along the time axis; (2) Cumulative Error Budgets (CEB), which bound error accumulation along the depth axis; and (3) Frequency-Decomposed Caching (FDC), which caches in a frequency-decomposed manner to reflect heterogeneous dynamics along the feature axis. Result: On FLUX.1-schnell at 512x512 resolution, achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, 16% faster than TeaCache with negligible quality loss (LPIPS difference < 1%); the method is training-free, plug-and-play, and compatible with existing DiT architectures. Conclusion: By modeling three kinds of non-uniformity in DiT denoising, SpectralCache delivers efficient, high-quality, training-free inference acceleration suited to practical DiT deployment. Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.

[164] Dark3R: Learning Structure from Motion in the Dark

Andrew Y Guo,Anagh Malik,SaiKiran Tedla,Yutong Dai,Yiqian Qin,Zach Salehe,Benjamin Attal,Sotiris Nousias,Kyros Kutulakos,David B. Lindell

Main category: cs.CV

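TL;DR: Dark3R is a structure-from-motion (SfM) framework for extreme low-light conditions (SNR < -4 dB). It adapts large-scale 3D foundation models to dark scenes through teacher-student distillation, trains only on noisy-clean raw image pairs with no 3D supervision, and also reaches state-of-the-art novel view synthesis.

Details Motivation: Conventional feature-based and learning-based methods break down in the dark at very low signal-to-noise ratios (SNR < -4 dB), so a robust SfM method for this regime is needed. Method: Dark3R uses teacher-student distillation to adapt large-scale 3D foundation models to extreme low light. It is trained only on noisy-clean raw image pairs, which can be captured directly or synthesized with a Poisson-Gaussian noise model. The authors build a new exposure-bracketed dataset of about 42,000 multi-view raw images with ground-truth 3D annotations, and combine the predicted poses with a coarse-to-fine radiance field optimization for novel view synthesis in the dark. Result: State-of-the-art structure from motion in the low-SNR regime, and state-of-the-art novel view synthesis in the dark. Conclusion: Dark3R performs SfM in extreme darkness without any 3D supervision, training only on raw image pairs, and shows that combining foundation-model distillation with physical noise modeling is effective for low-light reconstruction. Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB -- a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher--student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson--Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.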
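The abstract mentions synthesizing noisy raw images with a simple Poisson-Gaussian noise model. A minimal sketch of that general model follows, using only the standard library. The parameter names (`gain`, `read_sigma`) and values are illustrative assumptions, not the settings used by Dark3R, and the Poisson sampler is Knuth's textbook method, which is adequate only for the small photon counts typical of low light.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def add_poisson_gaussian_noise(clean, gain, read_sigma, rng):
    """Simulate one noisy raw value from a clean value (both in sensor units)."""
    photons = sample_poisson(clean / gain, rng)          # signal-dependent shot noise
    return gain * photons + rng.gauss(0.0, read_sigma)   # signal-independent read noise


rng = random.Random(0)
samples = [add_poisson_gaussian_noise(5.0, 1.0, 0.5, rng) for _ in range(2000)]
mean_value = sum(samples) / len(samples)
print(round(mean_value, 2))
```

Because both noise terms are zero-mean around the clean signal, the sample mean stays close to the clean value while individual samples vary widely, which is what makes matching features in single dark frames difficult.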

[165] ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Sijia Chen,Zihan Zhou,Yanqiu Yu,En Yu,Wenbing Tao

Main category: cs.CV

TL;DR: This paper proposes a new task, Omnidirectional Referring Multi-Object Tracking (ORMOT), to address the fragmented tracking that conventional RMOT suffers under a limited field of view. It builds ORSet, the first dataset for the task, and designs ORTrack, a tracking framework driven by a large vision-language model.

Details Motivation: Existing Referring Multi-Object Tracking (RMOT) methods rely on conventional camera data with a limited field of view, so targets easily leave the frame, tracks become fragmented, and long-horizon language descriptions are hard to follow. Method: The paper defines the ORMOT task, constructs the ORSet dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, and designs the LVLM-driven ORTrack framework. Result: Extensive experiments on ORSet confirm the effectiveness of ORTrack; the dataset and code will be open-sourced. Conclusion: ORMOT extends RMOT to omnidirectional imagery, which eases the field-of-view limitation and improves understanding of long-horizon language descriptions, and it provides a new benchmark and method for vision-language tracking. Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.

[166] Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations

Hajar Dekdegue,Moncef Garouani,Josiane Mothe,Jordan Bernigaud

Main category: cs.CV

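TL;DR: This paper proposes Fusion-CAM, a class activation map method that combines the strengths of gradient-based and region-based approaches through denoising, weighted combination, and adaptive pixel-level fusion, producing more robust, more discriminative, and context-aware visual explanations.

Details Motivation: Existing CAM methods involve a trade-off. Gradient-based methods (e.g. Grad-CAM) give fine detail but are noisy and cover objects incompletely, while region-based methods (e.g. Score-CAM) cover more of the object but over-smooth and are less sensitive to subtle features. An explanation framework that is both discriminative and complete is needed. Method: Fusion-CAM has three steps: 1) denoise the gradient-based map to obtain more focused activations; 2) combine the denoised gradient map with region-based maps using contribution weights to improve class coverage; 3) apply a similarity-based adaptive pixel-level fusion that adjusts the fusion strength dynamically, reinforcing regions where the two agree and softly blending regions where they conflict. Result: Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation. Conclusion: Fusion-CAM bridges the explanatory gap between gradient-based and region-based methods and offers a robust, flexible, and input-adaptive tool for interpreting deep neural networks. Abstract: Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. 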
Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
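To make the notion of agreement-dependent fusion concrete, here is a toy sketch. The formula is an assumption chosen for illustration (agreement as one minus the absolute difference, reinforcement as the pixel-wise maximum, soft blending as the average); the actual Fusion-CAM fusion rule is not specified in the abstract and may differ. Maps are assumed to be normalized to [0, 1].

```python
def adaptive_fusion(grad_map, region_map):
    """Fuse two activation maps with values in [0, 1], pixel by pixel."""
    fused = []
    for grad_row, region_row in zip(grad_map, region_map):
        row = []
        for g, r in zip(grad_row, region_row):
            agreement = 1.0 - abs(g - r)   # 1 = maps agree, 0 = maps conflict
            reinforced = max(g, r)         # where they agree, keep the stronger activation
            blended = 0.5 * (g + r)        # where they conflict, fall back to a soft average
            row.append(agreement * reinforced + (1.0 - agreement) * blended)
        fused.append(row)
    return fused


fused = adaptive_fusion([[1.0, 0.0]], [[1.0, 1.0]])
print(fused)
```

In this example the first pixel, where both maps agree, keeps its full activation, while the second pixel, where they fully disagree, is averaged.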

[167] Video-based Locomotion Analysis for Fish Health Monitoring

Timon Palm,Clemens Seibold,Anna Hilsmann,Peter Eisert

Main category: cs.CV

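TL;DR: This paper presents a video analysis system based on YOLOv11 and multi-object tracking that estimates the locomotion activity of cultivated fish from video, such as swimming direction and speed, to support fish health monitoring.

Details Motivation: Monitoring fish health is essential for early disease detection, animal welfare, and sustainable aquaculture, and the locomotion behavior of fish reflects their physiological and pathological condition. Method: A YOLOv11 object detector is embedded in a tracking-by-detection framework; the authors investigate several configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Result: On a manually annotated video dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, the system reliably estimates swimming direction and speed; the dataset will be released upon publication. Conclusion: The approach offers a feasible, non-invasive technical route to fish health monitoring with potential for practical use in aquaculture. Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.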

[168] MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis

Numan Saeed,Fadillah Adamsyah Maani,Mohammad Yaqub

Main category: cs.CV

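TL;DR: This paper proposes Selective Repulsive Knowledge Distillation for resource-constrained fetal ultrasound AI. It compresses a large foundation model (304M parameters) into a lightweight student (11.4M parameters) that improves zero-shot performance over the teacher and runs in real time on an iPhone 16 Pro.

Details Motivation: Current fetal ultrasound foundation models are too large (>300M parameters) to deploy on point-of-care devices, and standard knowledge distillation fails when the teacher-student capacity gap is extreme (~26x), because the student spends capacity mimicking architectural artifacts of the teacher instead of learning essential features. Method: Selective Repulsive Knowledge Distillation decomposes contrastive KD into a diagonal term, which preserves alignment of matched pairs, and an off-diagonal term whose weight decays into negative values. This repels the student from the teacher's inter-class confusions and pushes it toward features suited to its own architecture. Result: The 11.4M-parameter student reaches 88.6% zero-shot HC18 biometry validity (teacher: 83.5%) and a brain sub-plane F1 of 0.784 (teacher: 0.702), with inference in 1.6 ms on an iPhone 16 Pro. Conclusion: The method overcomes the bottleneck of transferring knowledge from a large model to a much smaller one, yielding an accurate, very lightweight fetal ultrasound model that runs in real time on-device, with potential for clinical use in low-resource regions. Abstract: Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.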
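The diagonal/off-diagonal decomposition can be sketched on small similarity matrices. Everything here is an illustrative assumption rather than the released MobileFetalCLIP code: the squared-error form of each term, the linear schedule, and the endpoint values `start=1.0` and `end=-0.5` are hypothetical. The sketch only shows how a weight that decays below zero turns imitation of the teacher's off-diagonal similarities into repulsion.

```python
def offdiag_weight(step, total_steps, start=1.0, end=-0.5):
    """Linearly decay the off-diagonal weight from `start` to a negative `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def decomposed_kd_loss(student_sim, teacher_sim, w_off):
    """Squared-error KD loss split into matched (diagonal) and unmatched (off-diagonal) pairs."""
    n = len(student_sim)
    diag = sum((student_sim[i][i] - teacher_sim[i][i]) ** 2 for i in range(n)) / n
    off = sum(
        (student_sim[i][j] - teacher_sim[i][j]) ** 2
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return diag + w_off * off


student = [[1.0, 0.0], [0.0, 1.0]]   # student separates the two classes cleanly
teacher = [[1.0, 0.5], [0.5, 1.0]]   # teacher confuses them (off-diagonal 0.5)
early_loss = decomposed_kd_loss(student, teacher, offdiag_weight(0, 100))
late_loss = decomposed_kd_loss(student, teacher, offdiag_weight(100, 100))
print(early_loss, late_loss)
```

Early in training the student is penalized for differing from the teacher's confusions; once the weight is negative, the same difference lowers the loss. A practical implementation would presumably need to bound the negative term so the loss cannot decrease without limit.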

[169] RelaxFlow: Text-Driven Amodal 3D Generation

Jiayin Zhu,Guoji Fu,Xiaolu Liu,Qiyuan He,Yicong Li,Angela Yao

Main category: cs.CV

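TL;DR: This paper proposes RelaxFlow, a framework for text-driven amodal 3D generation. It decouples rigid control over the observed region from relaxed structural control by the text prompt, completing occluded regions in a semantically consistent way without compromising visual fidelity.

Details Motivation: Image-to-3D generation is inherently ambiguous under occlusion, since a partial observation is often insufficient to determine the object category. Text prompts are needed to guide plausible completion of unseen regions while the input observation is strictly preserved. Method: RelaxFlow is a training-free dual-branch framework with a Multi-Prior Consensus Module and a Relaxation Mechanism. The authors prove that the relaxation is equivalent to applying a low-pass filter to the generative vector field, which suppresses high-frequency instance details and retains geometric structure. Result: Experiments on the newly introduced ExtremeOcc-3D and AmbiSem-3D benchmarks show that RelaxFlow steers unseen regions to match the prompt intent without compromising visual fidelity. Conclusion: Decoupling control granularity is key to high-quality text-driven amodal 3D generation, and RelaxFlow offers a training-free solution with theoretical support for handling occlusion and semantic ambiguity. Abstract: Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.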

[170] SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Ye-Chan Kim,SeungJu Cha,Si-Woo Kim,Minju Jeon,Hyungee Kim,Dong-Jin Kim

Main category: cs.CV

TL;DR: This paper proposes SAIL, which builds semantically aware masks through cross-modal alignment and uses a large language model to generate synthetic captions that strengthen temporal localization and captioning under sparse annotation, reaching state-of-the-art results on ActivityNet Captions and YouCook2.

Details Motivation: Existing weakly-supervised dense video captioning methods generate Gaussian masks that are non-overlapping but unrelated to event semantics, and their reliance on sparse ground-truth captions limits performance. Method: SAIL 1) constructs semantically aware masks through cross-modal alignment; 2) uses a similarity-aware training objective so that masks emphasize video regions with high similarity to the corresponding event captions; and 3) generates synthetic captions with an LLM and incorporates them through an inter-mask mechanism to aid precise temporal localization. Result: State-of-the-art captioning and localization metrics on ActivityNet Captions and YouCook2. Conclusion: Combining semantically aware masks with LLM-generated synthetic captions mitigates the lack of mask semantics and the annotation sparsity of the weakly-supervised setting, and improves dense video captioning performance. Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

[171] Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Dongwon Kim,Gawon Seo,Jinsung Lee,Minsu Cho,Suha Kwak

Main category: cs.CV

TL;DR: This paper proposes CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, sharply reducing the computational cost of decision-time planning with world models while preserving planning performance.

Details Motivation: Decision-time planning with existing world models is computationally expensive and hard to run in real time because conventional tokenizers produce a large number of tokens per observation. Method: CompACT is a discrete tokenizer that compresses each observation into very few tokens (as few as 8); it is integrated into an action-conditioned world model. Result: A world model using CompACT plans orders of magnitude faster while keeping competitive planning performance. Conclusion: CompACT is a practical step toward real-time deployment of world models in real-world settings. Abstract: World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that adopts the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.

[172] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Kanon Amemiya,Daichi Yashima,Kei Katsumata,Takumi Komatsu,Ryosuke Korekata,Seitaro Otsuki,Komei Sugiura

Main category: cs.CV

TL;DR: This paper proposes NaiLIA, a method for retrieving nail design images from dense intent descriptions and palette queries. It introduces a relaxed loss based on confidence scores to improve alignment with unlabeled images, and shows its advantage on a new benchmark of 10,625 images.

Details Motivation: Existing vision-language foundation models struggle to combine dense, multi-layered descriptions of user intent (painted elements, embellishments, visual characteristics, themes, and overall impressions) with fine-grained, continuous palette queries, which leads to poor nail design retrieval. Method: NaiLIA is a multimodal retrieval method that models dense intent descriptions and palette queries jointly, with a relaxed loss based on confidence scores for unlabeled images to strengthen semantic alignment. Result: On a newly built benchmark of 10,625 images with long, dense descriptions from over 200 annotators, NaiLIA outperforms standard methods. Conclusion: NaiLIA retrieves nail design images that better match complex user intent and color preferences, and offers a practical approach to fine-grained multimodal fashion retrieval. Abstract: We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.

[173] RealWonder: Real-Time Physical Action-Conditioned Video Generation

Wei Liu,Ziyu Chen,Zizhang Li,Yue Wang,Hong-Xing Yu,Jiajun Wu

Main category: cs.CV

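TL;DR: RealWonder is the first real-time system for action-conditioned video generation from a single image. It uses physics simulation as an intermediate bridge (producing optical flow and RGB) so that the video model can account for the physical effects of 3D actions, supports interactive simulation of rigid objects, deformable bodies, fluids, and granular materials, and runs at 13.2 FPS.

Details Motivation: Existing video generation models cannot model the physical consequences of 3D actions such as forces and robotic manipulation, because they lack a structural understanding of how actions affect 3D scenes. Method: RealWonder has three components: 3D reconstruction from a single image, physics simulation (which translates actions into optical flow and RGB intermediate representations), and a distilled video generator that needs only 4 diffusion steps. Result: Real-time generation at 13.2 FPS at 480x832 resolution, with interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. Conclusion: RealWonder achieves real-time action-conditioned video generation grounded in physics simulation, opening opportunities in immersive experiences, AR/VR, and robot learning; the code and model weights are publicly available. Abstract: Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/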

[174] Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Pengxiang Li,Joey Tsai,Hongwei Xue,Kunyu Shi,Shilin Yan

Main category: cs.CV

TL;DR: This paper proposes the Longest Stable Prefix (LSP) scheduler, a decoding scheduler that improves the inference efficiency of diffusion language models (DLMs). LSP identifies and atomically commits the longest stable prefix, which improves KV cache locality and reduces token flips, giving up to 3.4x speedup without sacrificing generation quality.

Details Motivation: Existing DLM decoders use a scattered acceptance strategy, which fragments the KV cache, hurts memory locality, and causes frequent recomputation, severely limiting practical inference speed. Method: LSP is a training-free, model-agnostic inference scheduling paradigm. At each step it evaluates token stability with a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, snaps the block boundary to a natural linguistic or structural delimiter, and commits the block atomically, which the authors call monolithic prefix absorption. Result: On LLaDA-8B and Dream-7B, LSP speeds up inference by up to 3.4x across mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality. Conclusion: By restructuring the commitment topology, without changing the model architecture or training procedure, LSP narrows the gap between the theoretical parallelism of DLMs and practical hardware efficiency. Abstract: Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
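The prefix-selection step described above can be sketched in a few lines. This is a simplified illustration, not the authors' scheduler: the stability measure (a per-token confidence compared with a fixed `threshold`), the `DELIMITERS` set, and the fallback when no delimiter is present are all assumptions.

```python
DELIMITERS = {".", ",", ";", "\n", "}"}  # hypothetical delimiter set


def longest_stable_prefix(tokens, confidences, threshold=0.9):
    """Return how many leading tokens of the active suffix to commit atomically."""
    # Length of the contiguous, left-aligned run of stable predictions.
    length = 0
    for conf in confidences:
        if conf < threshold:
            break
        length += 1
    # Snap the boundary back to the last delimiter inside that run.
    snapped = length
    while snapped > 0 and tokens[snapped - 1] not in DELIMITERS:
        snapped -= 1
    # Assumed fallback: with no delimiter in the run, commit the raw run.
    return snapped if snapped > 0 else length


tokens = ["x", "=", "1", ";", "y", "=", "?"]
confidences = [0.99, 0.95, 0.97, 0.93, 0.96, 0.40, 0.20]
commit_len = longest_stable_prefix(tokens, confidences)
print(tokens[:commit_len])
```

Here five tokens are stable, but the commit stops after the `;` so that the committed prefix ends on a structural boundary. Committing only a contiguous prefix is what lets KV cache updates become appends rather than scattered writes.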

[175] EdgeDAM: Real-time Object Tracking for Mobile Devices

Syed Muhammad Raza,Syed Murtaza Hussain Abidi,Khawar Islam,Muhammad Ibrahim,Ajmal Saeed Mian

Main category: cs.CV

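TL;DR: This paper proposes EdgeDAM, a lightweight single-object tracking framework for edge devices. It uses a dual-buffer memory and a confidence-driven switching strategy to improve robustness to occlusion, distractors, and fast motion while staying real-time.

Details Motivation: Existing segmentation-based memory mechanisms are computationally expensive and hard to deploy in real time on edge devices, while lightweight detection-based trackers tend to drift when visually similar distractors appear. Accuracy, robustness, and real-time edge performance need to be balanced. Method: EdgeDAM has two parts: (1) a Dual-Buffer Distractor-Aware Memory (DAM), consisting of a Recent-Aware Memory that keeps target hypotheses temporally consistent and a Distractor-Resolving Memory that explicitly stores hard negative candidates and penalizes their re-selection; and (2) Confidence-Driven Switching between detection and re-identification, combined with a held-box stabilization strategy that suppresses distractor contamination. Result: Validated on five benchmarks including DiDi, with 88.2% accuracy on DiDi and 25 FPS on an iPhone 15, improving robustness under occlusion and fast motion while maintaining real-time performance. Conclusion: EdgeDAM adapts distractor-aware memory to lightweight bounding-box tracking and achieves a good balance of accuracy and speed on resource-constrained edge devices, making it a feasible option for deployment. Abstract: Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. 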
Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
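A dual-buffer memory of the kind described can be sketched with two bounded queues. The class below is a hypothetical illustration, not EdgeDAM's code: the buffer sizes, the dot-product similarity on unit feature vectors, and the `penalty` factor are assumptions. It shows how a candidate that resembles a stored distractor is scored down during re-identification.

```python
from collections import deque


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


class DualBufferMemory:
    def __init__(self, recent_size=5, distractor_size=10):
        self.recent = deque(maxlen=recent_size)           # recent target features
        self.distractors = deque(maxlen=distractor_size)  # hard negative features

    def update(self, target_feature, rejected_features):
        """Store the confirmed target and any rejected look-alike candidates."""
        self.recent.append(target_feature)
        self.distractors.extend(rejected_features)

    def score(self, candidate, penalty=0.5):
        """Reward similarity to the target, penalize similarity to known distractors."""
        target_sim = max((dot(candidate, f) for f in self.recent), default=0.0)
        distractor_sim = max((dot(candidate, f) for f in self.distractors), default=0.0)
        return target_sim - penalty * distractor_sim


memory = DualBufferMemory()
memory.update([1.0, 0.0], [[0.0, 1.0]])
target_score = memory.score([1.0, 0.0])
distractor_score = memory.score([0.0, 1.0])
print(target_score, distractor_score)
```

Bounded `deque` buffers keep memory use constant, which matters on edge hardware; old entries are dropped automatically as new ones arrive.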

[176] HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Sai Akhil Kogilathota,Sripadha Vallabha E G,Luzhe Sun,Jiawei Zhou

Main category: cs.CV

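TL;DR: This paper proposes a way to predict the hallucination risk of vision-language models (VLMs) before any text is generated. It probes internal representations (visual features, vision-token states in the text decoder, and query-token states) in a single forward pass and reaches up to 0.93 AUROC on several mainstream VLMs, showing that hallucination risk can be identified early and supporting safe, efficient intervention.

Details Motivation: Existing hallucination detection methods mostly run after text generation, which makes intervention costly and late. The paper asks whether hallucination risk can be predicted before any token is generated, using only one forward pass, so that intervention can happen earlier and more efficiently. Method: Across eight modern VLMs (including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL), the authors analyze three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations inside the text decoder, and (iii) query-token representations that integrate visual and textual information. Lightweight probes are trained on these representations and evaluated by AUROC for hallucination detection without decoding. Result: Probes perform strongly on several models (up to 0.93 AUROC). Late query-token states are the most predictive for most models, while a few models (such as Qwen2.5-VL-7B) rely on visual-only features (~0.79 AUROC). Hallucination risk is predictable, and the preferred layer and modality depend on the architecture. Conclusion: Hallucination risk can be predicted reliably before generation, and the best probing location and modality differ across VLMs. These findings provide a basis for safety mechanisms such as early abstention, selective routing, and adaptive decoding that balance safety and inference efficiency. Abstract: Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). 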
These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
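Since the results are reported as AUROC, a short reference computation may help when reading the numbers. AUROC is the probability that a randomly chosen positive example (here, a hallucinated response) receives a higher probe score than a randomly chosen negative one, with ties counted as half. The scores and labels below are made-up toy values, not data from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


probe_scores = [0.9, 0.8, 0.3, 0.2]   # hypothetical probe outputs
hallucinated = [1, 0, 1, 0]           # 1 = response was a hallucination
value = auroc(probe_scores, hallucinated)
print(value)
```

An AUROC of 0.5 corresponds to chance and 1.0 to perfect ranking, so 0.93 means the probe ranks a hallucinated case above a non-hallucinated one in 93% of pairs. The pairwise form is quadratic in the number of examples, which is acceptable for illustration but not for large evaluation sets.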

[177] Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

Scout Jarman,Zigfried Hampel-Arias,Adra Carr,Kevin R. Moon

Main category: cs.CV

TL;DR: This paper proposes a neural radiance field (NeRF) method for 3D scene reconstruction from longwave infrared hyperspectral images (LWIR HSI), aimed at gas plume detection. It combines hyperspectral and sparse-view NeRF techniques with an adaptive weighted MSE loss, reaching 39.8 dB PSNR with only 30 training images and a gas detection AUC of 0.821.

Details Motivation: LWIR hyperspectral images are usually analyzed one frame at a time, without joint modeling of scene geometry and spectra; fusing information from multiple views could improve gas plume detection and contextual understanding. Method: Built on the Mip-NeRF architecture, the method combines hyperspectral NeRF and sparse-view NeRF techniques with a novel adaptive weighted MSE loss. A synthetic multi-view LWIR HSI dataset containing a sulfur hexafluoride (SF6) gas plume, generated with DIRSIG, is used for training and validation. Result: The method needs around 50% fewer training images than standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 images; gas detection with the adaptive coherence estimator on NeRF-rendered images reaches an average AUC of 0.821. Conclusion: NeRFs can reconstruct LWIR hyperspectral 3D scenes from few views and support the downstream task of gas plume detection, which suggests a new direction for infrared hyperspectral remote sensing analysis. Abstract: Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. 
Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
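For readers less familiar with the PSNR figure quoted above, the standard definition is 10 * log10(MAX^2 / MSE), where MAX is the peak signal value. The sketch below computes it for flat lists of pixel values. The normalization to a peak of 1.0 and the toy inputs are assumptions; the paper's exact evaluation protocol for hyperspectral data is not given in the abstract.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length value lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 1.0]
rendered = [0.1, 0.9]          # each value is off by 0.1, so MSE is about 0.01
value_db = psnr(reference, rendered)
print(round(value_db, 2))
```

Because the scale is logarithmic, every additional 10 dB corresponds to a tenfold reduction in mean squared error.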

[178] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Guo Chen,Lidong Lu,Yicheng Liu,Liangrui Dong,Lidong Zou,Jixin Lv,Zhenquan Li,Xinyi Mao,Baoqi Pei,Shihao Wang,Zhiqi Li,Karan Sapra,Fuxiao Liu,Yin-Dong Zheng,Yifei Huang,Limin Wang,Zhiding Yu,Andrew Tao,Guilin Liu,Tong Lu

Main category: cs.CV

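TL;DR: This paper introduces MM-Lifelong, a dataset for multimodal lifelong understanding, and identifies two major failure modes of existing models on long-horizon video. It then designs the Recursive Multimodal Agent (ReMA), which improves performance substantially through dynamic memory management.

Details Motivation: Existing video understanding datasets mostly consist of densely concatenated clips that differ greatly from real, unscripted daily life, so a long-horizon multimodal benchmark closer to real scenarios is needed. Method: The authors build MM-Lifelong, with 181.1 hours of footage structured across Day, Week, and Month scales; analyze how current end-to-end MLLMs and agentic baselines fail; and propose ReMA, which uses dynamic memory management to iteratively update a recursive belief state. Result: ReMA significantly outperforms existing methods on MM-Lifelong. Two key failure modes are identified, the Working Memory Bottleneck and Global Localization Collapse, and the dataset splits are designed to isolate temporal and domain biases. Conclusion: MM-Lifelong provides a new benchmark for multimodal lifelong understanding, and ReMA shows that dynamic recursive memory is effective for long-horizon understanding, advancing video understanding research for real-life scenarios. Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.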

[179] Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Shai Yehezkel,Shahar Yadin,Noam Elata,Yaron Ostrovsky-Berman,Bahjat Kawar

Main category: cs.CV

TL;DR: This paper proposes CalibAtt, a training-free method for accelerating video generation through calibrated sparse attention, achieving up to 1.58x end-to-end speedup without sacrificing generation quality.

Details Motivation: Transformer-based diffusion models are slow at inference because spatiotemporal attention is expensive. The authors observe that many token-to-token attention scores are consistently negligible and that their patterns repeat, so these computations can be skipped safely. Method: CalibAtt runs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles them into optimized sparse attention operations for each layer, head, and diffusion timestep. At inference time, only the selected connections are computed densely and the rest are skipped. Result: On Wan 2.1 14B, Mochi 1, and few-step distilled models, CalibAtt achieves up to 1.58x end-to-end speedup, outperforming other training-free methods while maintaining video quality and text-video alignment. Conclusion: CalibAtt shows that the inherent sparsity and pattern stability of attention can be exploited for hardware-efficient inference, giving video diffusion models a practical, plug-and-play acceleration option. Abstract: Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
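The offline calibration idea can be illustrated with a toy block mask. The sketch below is an assumption-laden simplification, not CalibAtt itself: it takes hypothetical per-block attention mass measured on a few calibration inputs and keeps a block if any input gave it at least `threshold` mass, which is a conservative rule chosen here for illustration. Repetition patterns and per-layer, per-head, per-timestep compilation are not modeled.

```python
def calibrate_block_mask(block_scores_per_input, threshold):
    """Build a boolean keep-mask over (query block, key block) pairs.

    block_scores_per_input: list of 2D lists, one per calibration input,
    holding the attention mass each key block receives from each query block.
    """
    n_query = len(block_scores_per_input[0])
    n_key = len(block_scores_per_input[0][0])
    return [
        [
            any(scores[i][j] >= threshold for scores in block_scores_per_input)
            for j in range(n_key)
        ]
        for i in range(n_query)
    ]


calibration_scores = [
    [[0.90, 0.01], [0.30, 0.70]],   # hypothetical input A
    [[0.95, 0.02], [0.02, 0.98]],   # hypothetical input B
]
mask = calibrate_block_mask(calibration_scores, threshold=0.05)
kept_fraction = sum(cell for row in mask for cell in row) / 4
print(mask, kept_fraction)
```

In this toy case one of four blocks is negligible on every calibration input and is skipped, so 75% of the block-level attention work remains. At inference, such a fixed mask would let skipped blocks be omitted without inspecting the input.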

[180] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Weijie Lyu,Ming-Hsuan Yang,Zhixin Shu

Main category: cs.CV

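TL;DR: FaceCam is a video generation system that takes monocular portrait video as input and supports customizable camera trajectories; using a face-tailored scale-aware camera representation and two data generation strategies, it performs strongly in controllability, visual quality, and identity and motion preservation.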

Details Motivation: Existing camera control methods built on large video-generation models often produce geometric distortions and visual artifacts on portrait videos, caused by scale-ambiguous camera representations or 3D reconstruction errors. Method: Proposes a face-tailored scale-aware representation for camera transformations that does not rely on 3D priors; trains jointly on multi-view studio captures and in-the-wild monocular videos; designs two camera-control data generation strategies, synthetic camera motion and multi-shot stitching. Result: Evaluated on the Ava-256 dataset and a variety of in-the-wild videos, FaceCam outperforms existing methods in camera controllability, visual quality, identity consistency, and motion fidelity. Conclusion: FaceCam effectively addresses the geometric distortion problem of camera control in monocular portrait video and achieves high-quality, high-fidelity dynamic viewpoint generation. Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.

[181] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

Leif Van Holland,Domenic Zingsheim,Mana Takhsha,Hannah Dröge,Patrick Stotko,Markus Plack,Reinhard Klein

Main category: cs.CV

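TL;DR: This paper proposes a multi-view aware, transformer-based texture inpainting method for multi-view 3D streaming; used as a standalone post-processing module after rendering, it markedly improves image and video quality while remaining real-time.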

Details Motivation: Multi-camera 3D streaming is constrained by real-time requirements, so the number of views is limited, which leaves missing textures and incomplete surfaces in the rendered images; existing heuristic hole-filling methods tend to produce inconsistencies and visual artifacts. Method: Proposes a plug-and-play, image-level post-processing texture inpainting method that is independent of the underlying representation; designs a multi-view aware transformer architecture with spatio-temporal embeddings; uses a resolution-independent design and an adaptive patch selection strategy to balance real-time performance and quality. Result: Under the same real-time constraints, the method achieves the best quality-speed trade-off against state-of-the-art inpainting methods on both image and video metrics, outperforming all compared methods. Conclusion: The proposed method is a general, efficient, high-quality multi-view texture inpainting solution that works with any calibrated multi-camera system and improves the immersiveness of 3D streaming in AR/VR. Abstract: High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.
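
The abstract above does not spell out how the adaptive patch selection works, so the following is only one plausible reading, sketched under stated assumptions: the rendered frame comes with a boolean hole mask, and a fixed per-frame patch budget stands in for the real-time constraint. The function name `select_patches` and the "most missing pixels first" ranking are hypothetical, not taken from the paper.

```python
import numpy as np

def select_patches(hole_mask, patch, budget):
    """Pick up to `budget` patches, ranked by how many missing pixels they hold.

    hole_mask: 2D boolean array, True where the rendered image lacks texture.
    patch:     side length of the square patches, in pixels.
    budget:    maximum number of patches to inpaint in this frame (assumed knob).
    Returns a list of (patch_row, patch_col) indices, most damaged first.
    """
    h, w = hole_mask.shape
    ph, pw = h // patch, w // patch
    # Count missing pixels per patch; any ragged border is ignored in this sketch.
    cropped = hole_mask[:ph * patch, :pw * patch]
    counts = cropped.reshape(ph, patch, pw, patch).sum(axis=(1, 3))
    order = np.argsort(counts, axis=None)[::-1]  # flat indices, most holes first
    # Patches without any hole are never sent to the inpainting network.
    return [divmod(int(i), pw) for i in order[:budget] if counts.flat[i] > 0]
```

Raising `budget` trades speed for coverage: more patches reach the inpainting network per frame, at a higher per-frame cost.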