Table of Contents
cs.CL
[1] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models
Zhehao Tan,Yihan Jiao,Dan Yang,Junjie Wang,Duolin Sun,Jie Feng,Xidong Wang,Lei Liu,Yue Shen,Jian Wang,Jinjie Gu
Main category: cs.CL
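TL;DR: This paper proposes the Contrastive Likelihood Reward (CLR), a new "internal-external" hybrid reward framework for improving the context-sensitive reasoning and factual consistency of large language models in RAG. CLR optimizes the gap between the response log-likelihood with and without supporting documents, which strengthens the model's reliance on and confidence in evidence and mitigates the hallucination accumulation caused by self-judgment. Experiments show strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks.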
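The log-likelihood gap described in the TL;DR above can be illustrated with a minimal sketch. It assumes per-token log-probabilities of the same response are already available under both prompts; the function names and the numbers are hypothetical, and the paper may add normalization, clipping, or other terms that the abstract does not describe.

```python
from typing import Sequence


def sequence_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities into one sequence log-likelihood."""
    return sum(token_logprobs)


def contrastive_likelihood_reward(
    logprobs_with_docs: Sequence[float],
    logprobs_without_docs: Sequence[float],
) -> float:
    """Gap between the log-likelihood of one response scored with and
    without the supporting documents in the prompt (illustrative only)."""
    if len(logprobs_with_docs) != len(logprobs_without_docs):
        raise ValueError("both scorings must cover the same response tokens")
    return (sequence_log_likelihood(logprobs_with_docs)
            - sequence_log_likelihood(logprobs_without_docs))


# Hypothetical log-probs for a three-token response under each prompt.
with_docs = [-0.25, -0.5, -0.25]
without_docs = [-1.0, -1.5, -0.5]
reward = contrastive_likelihood_reward(with_docs, without_docs)
print(reward)
```

A positive value means the response is more likely when the evidence is present, which is the direction the reward is said to encourage.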
Details
Motivation: Existing RAG-oriented reinforcement learning methods rely on external rewards that struggle to assess document faithfulness accurately, and no reliable RAG self-reward mechanism exists; pure self-judgment, lacking objective feedback, tends to cause hallucination accumulation and model collapse.
Method: Proposes the Contrastive Likelihood Reward (CLR), which explicitly models evidence dependence by maximizing the gap between the log-likelihood of a generated response with and without supporting documents, and combines it with an external correctness reward to form a hybrid reward framework.
Result: Significantly outperforms existing RAG-RL methods on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks, validating CLR's effectiveness in improving context sensitivity and answer faithfulness.
Conclusion: CLR provides an intrinsic reward signal that needs no human annotation and can be trained end to end, addressing both unreliable self-judgment and insufficient external rewards in RAG and offering a new paradigm for trustworthy RAG training.
Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
[2] Semantic Containment as a Fundamental Property of Emergent Misalignment
Rohan Saxena
Main category: cs.CL
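TL;DR: This paper finds that fine-tuning language models only on harmful data carrying semantic triggers, with no benign data at all, still spontaneously produces containment of emergent misalignment (EM) behavior. Semantic triggers alone can therefore induce a model to confine harmful behavior to specific contexts, which constitutes a hidden safety risk.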
Details
Motivation: To determine whether a model's compartmentalization of harmful behavior stems from training on mixed benign and harmful data or is driven by semantic triggers alone.
Method: With zero benign data, fine-tunes three models (Qwen 2.5 14B, Llama 3.1 8B, and Gemma 3 12B) only on harmful examples with triggers, then systematically removes or replaces the triggers at inference time and measures the change in EM rate.
Result: EM rates fall to 0.0–1.0% when triggers are removed and return to 12.2–22.8% when triggers are restored; rephrased triggers preserve the containment, showing that models respond to semantics rather than surface syntax.
Conclusion: Semantic triggers can independently induce spontaneous compartmentalization of harmful behavior without any benign-harmful contrast; the mechanism is a hard-to-detect safety vulnerability that standard evaluations fail to expose.
Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
[3] Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World
Luzhou Peng,Zhengxin Yang,Honglu Ji,Yikang Yang,Fanda Fan,Wanling Gao,Jiayuan Ge,Yilin Han,Jianfeng Zhan
Main category: cs.CL
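TL;DR: This paper proposes the Probing Memes paradigm, which treats large language models as composed of "memes" and models the interaction between models and data items through a Perception Matrix, enabling fine-grained characterization of population-level model behavior and of data-item properties.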
Details
Motivation: Existing LLM evaluation paradigms treat models and datasets separately and use only coarse metrics such as overall accuracy, ignoring the diversity of model behavior across data items.
Method: Introduces the concept of "memes" and builds the Probing Memes evaluation paradigm around a Perception Matrix, defining Probe Properties (to characterize data items) and Meme Scores (to characterize model behavior); the empirical analysis covers 9 datasets and 4,507 LLMs.
Result: Reveals hidden capability structures, quantifies phenomena that traditional paradigms cannot detect (e.g., elite models failing on problems that most models solve easily), supports richer and more extensible benchmarks, and enables population-based LLM evaluation.
Conclusion: The Probing Memes paradigm overcomes the limits of traditional evaluation and offers a new theoretical framework and practical tool for understanding how LLM capabilities are distributed and for designing fine-grained evaluation systems.
Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.
[4] Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
Nora Petrova,Andrew Gordon,Enzo Blindow
Main category: cs.CL
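TL;DR: This paper presents the HUMAINE framework, in which 23,404 stratified participants (spanning 22 demographic groups) evaluate 28 large models through multi-turn naturalistic conversations along five human-centric dimensions. Using a hierarchical Bayesian BTD model with post-stratification to census data, it reports a model performance ranking, demographic heterogeneity (with age having a pronounced effect), and differences in discriminative power across dimensions (e.g., a high tie rate for "Trust, Ethics & Safety"). It argues that LLM evaluation must be multidimensional and demographically aware, and it releases all data and tools.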
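The BTD model named in the TL;DR above extends Bradley-Terry with an explicit tie outcome. The sketch below shows only the standard Davidson likelihood for a single pairwise comparison; the strengths and the tie parameter are hypothetical values, and the hierarchical Bayesian priors and census post-stratification used in the paper are not reproduced here.

```python
import math
from typing import Tuple


def btd_probabilities(
    strength_a: float, strength_b: float, tie_param: float
) -> Tuple[float, float, float]:
    """Return (P(a wins), P(b wins), P(tie)) under the Davidson tie model.

    Strengths must be positive; tie_param >= 0 controls how much
    probability mass goes to ties (0 recovers plain Bradley-Terry).
    """
    if strength_a <= 0 or strength_b <= 0 or tie_param < 0:
        raise ValueError("strengths must be > 0 and tie_param >= 0")
    tie_mass = tie_param * math.sqrt(strength_a * strength_b)
    total = strength_a + strength_b + tie_mass
    return strength_a / total, strength_b / total, tie_mass / total


# Hypothetical strengths for two models and a hypothetical tie parameter.
p_a, p_b, p_tie = btd_probabilities(4.0, 1.0, 0.5)
print(p_a, p_b, p_tie)
```

A dimension with a high tie rate, such as the one reported for "Trust, Ethics & Safety", would correspond to a larger tie parameter in this parameterization.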
Details
Motivation: Existing technical benchmarks lack real-world relevance, and human preference evaluations suffer from unrepresentative sampling, shallow assessment, and reduction to a single metric.
Method: Builds the HUMAINE framework, collecting multi-turn naturalistic conversation data from 23,404 participants in the US and UK covering 22 demographic groups to evaluate 28 state-of-the-art models; fits a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model with post-stratification to census data.
Result: (1) Establishes a clear performance ranking, with gemini-2.5-pro first at a 95.6% posterior probability; (2) finds significant preference heterogeneity, with age the main axis of disagreement, exposing generalization failures; (3) shows large differences in discriminative power across dimensions, with a 65% tie rate for "Trust, Ethics & Safety" against only 10% for "Overall Winner".
Conclusion: LLM evaluation needs to move to a multidimensional, demographically aware paradigm; the study releases all data, an interactive leaderboard, and the framework.
Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics & Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner.
Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
[5] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
Omar Abdelnasser,Fatemah Alharbi,Khaled Khasawneh,Ihsen Alouani,Mohammed E. Fouda
Main category: cs.CL
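TL;DR: This paper introduces SalamaBench, the first unified safety evaluation benchmark for Arabic language models (ALMs), comprising 8,170 prompts across 12 safety-risk categories. It uses the benchmark to evaluate the safety alignment of five mainstream ALMs, reveals uneven robustness across harm categories, and stresses the need for fine-grained, category-aware safety evaluation and dedicated safeguard mechanisms.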
Details
Motivation: Existing safety benchmarks and safeguard models are English-centric and transfer poorly to Arabic NLP systems, so the safety vulnerabilities of ALMs lack systematic, fine-grained evaluation, which hinders real-world deployment.
Method: Builds the SalamaBench benchmark by harmonizing heterogeneous datasets through AI filtering and multi-stage human verification, covering 8,170 prompts across 12 categories of the MLCommons Safety Hazard Taxonomy; evaluates five state-of-the-art ALMs under several safeguard configurations (individual guard models, majority voting, and validation against human gold labels).
Result: Fanar 2 has the lowest aggregate attack success rate but uneven robustness across harm categories; Jais 2 shows elevated vulnerability in every category; native ALMs perform substantially worse as safety judges than dedicated safeguard models.
Conclusion: Safety evaluation of ALMs must be category-aware and rely on purpose-built safeguard mechanisms rather than directly reusing English benchmarks or general-purpose models.
Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges.
Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
[6] One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache
Liming Lu,Kaixi Qiu,Jiayu Zhou,Jushi Kai,Haoyan Zhang,Huanyu Wang,Jingwen Leng,Ziwei He,Zhouhan Lin
Main category: cs.CL
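TL;DR: This paper proposes DynaKV, a novel post-training framework for low-rank KV cache compression that is the first to allocate compression rates dynamically per token according to token semantics. It keeps good fidelity at high compression ratios, substantially reducing memory use without noticeably harming generation quality.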
Details
Motivation: The memory footprint of the Key-Value (KV) cache keeps growing during LLM inference and has become an efficiency bottleneck; existing dimensionality-reduction methods either require expensive pre-training from scratch or degrade severely at high compression ratios.
Method: Proposes the DynaKV framework, which performs low-rank KV cache compression post-training and assigns each token a compression rate according to its semantics; the method is orthogonal to sequence-level pruning and can be combined with it.
Result: On the LongBench benchmark, combining DynaKV with SnapKV retains only 6% of the KV cache while maintaining 94% of baseline performance; across experiments it consistently outperforms existing state-of-the-art compression techniques, balancing significant memory reduction with good generation quality.
Conclusion: DynaKV is an efficient, flexible, and practical KV cache compression scheme that addresses performance decay at high compression ratios and offers a new direction for efficient LLM inference.
Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
[7] Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models
O. V. Usatenko,S. S. Melnyk,G. M. Pritula
Main category: cs.CL
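TL;DR: This paper proposes approximating the high-dimensional dynamics of large language models (LLMs) with N-order additive Markov chains, establishes an equivalence between additive multi-step chains and chains with a step-wise memory function, and extends the concept of "information temperature" to additive N-order Markov chains.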
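To make the additive idea in the TL;DR above concrete, here is a minimal sketch for a binary sequence. It assumes one common additive form, in which the probability of the next symbol being 1 is the mean plus a sum of memory-function weights applied to each past deviation from the mean; the function name, the memory values, and this exact parameterization are illustrative assumptions and may differ from the paper's formulation.

```python
from typing import Sequence


def additive_next_prob(
    history: Sequence[int], memory: Sequence[float], mean: float
) -> float:
    """P(next symbol = 1) for a binary additive chain of order len(memory).

    memory[r - 1] weights the symbol r steps back (history[-r]).
    Each past symbol contributes independently, so the order-N model
    needs N weights instead of one probability per length-N history.
    """
    order = len(memory)
    if len(history) < order:
        raise ValueError("history is shorter than the chain order")
    prob = mean + sum(
        memory[r - 1] * (history[-r] - mean) for r in range(1, order + 1)
    )
    if not 0.0 <= prob <= 1.0:
        raise ValueError("memory function yields an invalid probability")
    return prob


# Hypothetical memory function of order 2 and a short binary history.
memory_function = [0.25, 0.125]
p_next = additive_next_prob([0, 1, 1], memory_function, mean=0.5)
print(p_next)
```

A full N-order binary Markov chain needs 2^N conditional probabilities, while this additive form needs only N weights, which is the reduction in combinatorial cost the entry refers to.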
Details
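Motivation: LLMs operate in extremely high-dimensional state spaces in which token embeddings and hidden representations have complex dependencies that classical Markov structures capture poorly.
Method: Models LLM dynamics with N-order additive Markov chains, decomposing the conditional probability of the next token into a superposition of contributions from multiple history depths and thereby avoiding the combinatorial explosion of high-order Markov processes.
Result: Proves an equivalence between additive multi-step Markov chains and Markov chains with a step-wise memory function, and uses it to extend the concept of "information temperature" to the additive N-order case.
Conclusion: Additive Markov chains offer a theoretically feasible low-dimensional approximation framework for understanding LLM dynamics, and information temperature becomes a new tool for characterizing their memory and information-processing properties.
Abstract: Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.
[8] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries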
Natalie Perez,Sreyoshi Bhaduri,Aman Chadha
Main category: cs.CL
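TL;DR: This paper proposes an interdisciplinary framework that combines semiotics, hermeneutics, and qualitative research methods to evaluate meaning in language generated by large language models (LLMs), and introduces the qualitative metric ICR. It finds that LLMs achieve high lexical similarity but still fall short of humans on semantic accuracy, especially in capturing contextually grounded meaning.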
Details
Motivation: Meaning in human language is relational, context-dependent, and emergent, while current computational approaches to generating and evaluating meaning struggle to capture this semiotic and hermeneutic complexity, leaving a gap between statistical approximation and human interpretive meaning.
Method: Integrates semiotic and hermeneutic theory with inductive content analysis and reflexive thematic analysis to define the qualitative metric Inductive Conceptual Rating (ICR), and empirically compares LLM-generated and human-generated thematic summaries on five datasets (N = 50 to 800).
Result: LLMs do well on lexical similarity but are markedly weaker than humans on semantic accuracy, particularly contextually grounded meaning; performance improves with dataset size but varies across models, possibly reflecting differences in the frequency and coherence of concepts and meanings.
Conclusion: Evaluation frameworks should incorporate systematic qualitative interpretive practice in order to measure more faithfully the meaning quality of LLM-generated text relative to reference texts.
Abstract: Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM generated and human generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings.
We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.
[9] Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks
Mahmoud Abusaqer,Jamil Saquer
Main category: cs.CL
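TL;DR: This paper proposes RoBERTa-OTA, which fuses RoBERTa language representations with structured domain knowledge through ontology-guided attention and enhanced graph convolutional networks. It significantly improves multiclass hate speech detection across demographic dimensions, with clear gains on hard categories such as gender, while adding very few parameters.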
Details
Motivation: Existing methods learn representations from training data alone and do not explicitly model structured ontological knowledge, which makes it hard to handle the implicit targeting strategies and linguistic variability of social media in multiclass hate speech detection.
Method: Proposes the RoBERTa-OTA architecture, which combines RoBERTa embeddings with scaled attention layers and enhanced Graph Convolutional Networks (GCNs) to model textual features jointly with structured ontological knowledge.
Result: Under 5-fold cross-validation on 39,747 balanced samples, RoBERTa-OTA reaches 96.04% accuracy against 95.02% for standard RoBERTa; gender-based hate speech detection improves by 2.36 percentage points and the "other" hate speech category by 2.38 percentage points; parameter overhead is only 0.33%.
Conclusion: Ontology-guided attention effectively improves multiclass hate speech detection, especially fine-grained demographic categories, while remaining computationally efficient and suitable for large-scale content moderation.
Abstract: Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.
[10] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning
Ruobing Zheng,Tianqi Li,Jianing Li,Qingpei Guo,Yi Yuan,Jingdong Chen
Main category: cs.CL
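TL;DR: This paper proposes the Dual Tuning framework, which jointly fine-tunes on Chain-of-Thought (CoT) and Direct-Answer (DA) data to quantify reasoning gains and defines a "Thinking Boundary" for judging when enabling reasoning is more effective in multimodal tasks, challenging the "reasoning-for-all" paradigm.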
Details
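Motivation: The benefit of reasoning-enhanced large models on multimodal tasks is uncertain; releasing parallel "Instruct" and "Thinking" models is only a resource-intensive stopgap, and there is no unified criterion for deciding whether reasoning actually helps.
Method: Proposes the Dual Tuning framework: jointly fine-tunes on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts; designs new metrics to quantify and compare the gains of the two training modes; constructs the "Thinking Boundary" to assess the suitability of reasoning training across spatial, mathematical, multi-disciplinary, and other multimodal tasks; further studies how reinforcement training and thinking patterns affect reasoning suitability, and tests whether the boundary can guide data refinement.
Result: Identifies where reasoning training is effective across different multimodal tasks; confirms that the "Thinking Boundary" can guide data filtering and optimization; finds that not every task benefits from reasoning, challenging the "reasoning-for-all" paradigm; provides empirical evidence and practical guidance for building resource-efficient, adaptive auto-think systems.
Conclusion: Reasoning is not a universal remedy; its effectiveness depends strongly on task type and data characteristics. Dual Tuning and the "Thinking Boundary" offer a quantifiable, generalizable framework for choosing reasoning strategies, moving from indiscriminately adding reasoning capability toward on-demand, adaptive modeling.
Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement.
Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
[11] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction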
Rabab Alkhalifa
Main category: cs.CL
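TL;DR: This paper proposes a reliability-aware weak supervision framework for framing detection in Arabic social media. A multi-agent LLM pipeline produces instance-level reliability estimates, and a QUBO subset selection step improves data quality and balance; experiments show that the selected subsets are more reliable and carry transferable structure.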
Details
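Motivation: Framing detection in Arabic social media faces interpretive ambiguity, cultural dependence, and a scarcity of high-quality annotation; existing LLM-based weak supervision methods are brittle when annotations are sparse and socially dependent.
Method: Designs a small multi-agent LLM pipeline (two framers, a critic, and a discriminator) that treats disagreement and reasoning quality as epistemic signals to produce instance-level reliability estimates; then applies QUBO optimization for subset selection, enforcing frame-class balance while reducing redundancy.
Result: Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer experiment show that the selected subsets are more reliable and encode non-random, transferable semantic structure, without degrading strong text-only baselines.
Conclusion: A reliability-aware weak supervision paradigm that focuses on data curation rather than label fusion can improve the data quality and generalization of framing detection in low-resource, high-ambiguity settings.
Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline, two framers, a critic, and a discriminator, treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
[12] Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge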
Fiona Lau
Main category: cs.CL
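TL;DR: This study systematically evaluates the scoring stability of five large language models (GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5) in the LLM-as-a-judge setting. It finds substantial fluctuation even at temperature=0, especially for completeness scores; systematic differences in strictness and interpretive style across models; and that lowering temperature improves stability only for some models (such as GPT-4o and Gemini), with limited effect on Anthropic models. The results warn that enterprise workflows relying on LLM scores need stronger monitoring, robust parsing, and hybrid human-LLM evaluation.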
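One simple way to quantify the repeated-run stability discussed in the TL;DR above is to summarize the spread of scores a judge gives to the same input. The sketch below uses hypothetical scores and a hypothetical helper name; the summary statistics chosen here are an assumption, since this entry does not state the study's exact stability metric.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def score_stability(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated judge scores for one input.

    A perfectly consistent judge would yield std == 0 and range == 0.
    """
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population standard deviation
        "range": max(scores) - min(scores),
    }


# Hypothetical scores from five repeated runs on one question-answer pair.
repeated_runs = [4, 5, 3, 4, 4]
stats = score_stability(repeated_runs)
print(stats)
```

Computing the same summary per model and per temperature setting would allow the kind of within-model and cross-model comparison the entry describes.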
Details
Motivation: LLM-as-a-judge is widely used in research and enterprise settings, but the consistency (stability) of its numerical scores has not been studied enough, even though it matters for production processes such as routing and quality control.
Method: On question-answer pairs from a real enterprise RAG system, runs repeated scoring experiments with five mainstream LLMs at two temperature settings, and quantifies the variance of repeated scores within a model, scoring differences across models, and the effect of temperature.
Result: All models show substantial score fluctuation at temperature=0, with the completeness dimension the least stable; cross-model scores show systematic bias (such as differences in strictness); lowering temperature improves stability only for GPT-4o and Gemini, with inconsistent effects on the Claude models.
Conclusion: Scoring stability cannot be assumed; enterprises need monitoring, robust parsing strategies, and hybrid human-LLM evaluation to keep LLM-as-a-judge fair, reproducible, and reliable in production.
Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control.
Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
[13] Context-Dependent Affordance Computation in Vision-Language Models
Murad Farzulla
Main category: cs.CL
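TL;DR: Through a large-scale computational study, this paper finds that vision-language models (VLMs) depend heavily on context when computing scene affordances: 90% of the output at the lexical level and 58.5% at the semantic level changes with context. It identifies two stable latent factors (a "Culinary Manifold" and an "Access Axis") and proposes a new direction for robotics, just-in-time ontological projection (JIT Ontology).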
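The lexical figure in the TL;DR above rests on Jaccard similarity between descriptions produced under different contexts (mean 0.095 per the abstract, i.e. roughly 1 - 0.095 = 0.905 of the vocabulary differs). A minimal sketch of that measure follows; the word lists are hypothetical, and the study's tokenization and preprocessing are not specified in this entry.

```python
from typing import Iterable


def jaccard_similarity(words_a: Iterable[str], words_b: Iterable[str]) -> float:
    """Size of the shared vocabulary divided by the size of the union."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty descriptions are treated as identical
    return len(set_a & set_b) / len(union)


# Hypothetical affordance words for one scene under two persona contexts.
chef_context = "knife board stove pan".split()
child_context = "stove stool toy".split()
similarity = jaccard_similarity(chef_context, child_context)
print(similarity, 1.0 - similarity)
```

Here one shared word out of six distinct words gives a similarity of 1/6, so most of the vocabulary is context-specific, which is the pattern the entry reports at scale.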
Details
Motivation: To understand how VLMs generate scene affordance descriptions dynamically under different contexts (such as different persona viewpoints), in order to reveal their internal semantic modeling and inform embodied intelligence.
Method: Builds 3,213 scene-context pairs from COCO-2017 and runs systematic context-priming experiments with Qwen-VL 30B and LLaVA-1.5-13B under 7 agentic persona prompts; validates the findings at several levels with Jaccard similarity, cosine similarity, stochastic baseline tests, and Tucker decomposition with bootstrap stability analysis.
Result: Finds pronounced affordance drift: mean Jaccard similarity at the lexical level is only 0.095 (>90% difference), and mean cosine similarity at the semantic level is 0.415 (58.5% difference); stochastic baselines confirm that the drift is not sampling noise; Tucker decomposition identifies two stable orthogonal latent factors, the "Culinary Manifold" and the "Access Axis".
Conclusion: Affordance computation in VLMs is strongly context-dependent; lexical change exceeds semantic change, indicating that surface expression is more sensitive to prompting; the authors argue for dynamic, query-driven ontological modeling (JIT Ontology) over static world representations, and make no claim about internal processing order or architectural primacy.
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts.
These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
[14] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan,Xiaoman Zhang,Sung Eun Kim,Pranav Rajpurkar
Main category: cs.CL
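TL;DR: This paper examines multi-agent large language model (LLM) systems for clinical diagnosis, comparing single-vendor and mixed-vendor multi-agent frameworks. Mixed-vendor configurations, which combine the inductive biases of different models, significantly improve diagnostic accuracy and recall and are more robust.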
Details
Motivation: Existing multi-agent systems for clinical diagnosis mostly use several models from one vendor, which invites correlated failure modes and shared biases and limits error correction. The paper asks whether vendor diversity improves robustness and accuracy.
Method: Builds and compares three Multi-Agent Conversation (MAC) frameworks: Single-LLM, Single-Vendor, and Mixed-Vendor; uses o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, from three different vendors, as doctor agents; evaluates diagnostic performance on RareBench and DiagnosisArena and runs an overlap analysis to explain the mechanism.
Result: The Mixed-Vendor configuration achieves state-of-the-art recall and accuracy on both RareBench and DiagnosisArena; the overlap analysis attributes the advantage to complementary inductive biases, which surface correct diagnoses that single models or homogeneous teams miss.
Conclusion: Vendor diversity is a key design principle for robust multi-agent clinical diagnostic systems and mitigates correlated failures and bias amplification.
Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
[15] Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation
Gürsel Akdeniz,Emin Cagatay Nakilcioglu
Main category: cs.CL
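TL;DR: This paper proposes a compliance-aware Self-Instruct method, conforming to the IMO SMCP, for generating a high-quality, realistic dataset of maritime VHF radio dialogues, and uses a 26-step verification pipeline together with LoRA fine-tuning to ensure accuracy, consistency, and deployability.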
Details
Motivation: VHF radio miscommunication is a major maritime safety risk, with human factors accounting for over 58% of incidents; high-quality maritime dialogue data is scarce because of operational, regulatory, and privacy constraints, which holds back AI-assisted systems.
Method: Proposes a compliance-aware Self-Instruct method with an embedded 26-filter verification pipeline (enforcing entity accuracy, absence of hallucination, SMCP compliance, logical consistency, and linguistic diversity); uses LoRA for parameter-efficient fine-tuning; builds a new evaluation framework combining automated and expert assessment (Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence).
Result: Experiments on publicly available vessel, coastal, and AIS data show that the generated dialogues are diverse, procedurally compliant, and operationally realistic; the code, datasets, and verification tools are released.
Conclusion: The method provides a reproducible data foundation for AI-assisted maritime safety, and the framework can be extended to other safety-critical domains.
Abstract: VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues.
Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.
[16] What Is Missing: Interpretable Ratings for Large Language Model Outputs
Nicholas Stranges,Yimin Yang
Main category: cs.CL
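TL;DR: This paper proposes What Is Missing (WIM), a new preference rating system driven by natural-language feedback. An embedding model computes the semantic similarity between a model output and the "missing content" identified by a human or LLM judge, yielding continuous, fine-grained, interpretable ratings that make the preference-learning signal more effective and easier to debug.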
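The rating step described in the TL;DR above reduces to a cosine similarity between two embedding vectors. The sketch below assumes the embeddings have already been produced by some sentence embedding model; the vectors are hypothetical, and how the resulting similarity is turned into a ranking is not specified in this entry, so no direction is assumed here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of a model output and of the judge's
# "what is missing" feedback text.
output_embedding = [1.0, 0.0, 1.0]
feedback_embedding = [1.0, 1.0, 0.0]
rating = cosine_similarity(output_embedding, feedback_embedding)
print(rating)
```

Because the rating is a continuous scalar, two outputs are unlikely to receive exactly the same value, which is consistent with the reported reduction in ties relative to discrete scores.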
Details
Motivation: Existing LLM preference learning relies on subjective, coarse-grained direct rankings or single numerical ratings, which reflect the true quality of natural-language output poorly and offer little interpretability or learning signal.
Method: WIM asks a human or LLM judge to describe in natural language what the model output is missing; a sentence embedding model then encodes the output and the feedback text, and their cosine similarity serves as the scalar rating; the rating plugs into existing preference-learning pipelines (such as DPO and PPO) without changing the algorithm.
Result: Experiments show that, compared with discrete numerical ratings, WIM markedly reduces ties and enlarges rating deltas, strengthening the learning signal in pairwise preference data; each rating can be traced back to the original natural-language feedback, which supports qualitative debugging.
Conclusion: WIM is a lightweight, general, and interpretable preference rating paradigm that improves the quality and usability of preference data and offers a practical tool for more robust, transparent LLM alignment.
Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use interpretable in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
[17] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science
Zonglin Yang,Runze Mao,Tianhao Wu,Han Li,QingGuo Zhou,Zhi X. Chen
Main category: cs.CL
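TL;DR: This paper presents the first end-to-end framework for specializing large language models for combustion science. It comprises a 3.5-billion-token multimodal knowledge base, the 436-question CombustionQA benchmark, and a three-stage knowledge-injection pathway that moves from RAG to knowledge-graph-enhanced retrieval to continued pretraining. The study finds that RAG alone hits a performance ceiling (60% accuracy) and is severely affected by context contamination, so structured knowledge graphs and continued pretraining are needed to build a usable domain foundation model.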
Details
Motivation: To advance the specialized use of foundation LLMs in combustion science and to address the field's lack of high-quality, AI-ready knowledge resources and dedicated evaluation benchmarks.
Method: Build a 3.5B-token multimodal knowledge base (from over 200,000 papers, 8,000 theses and dissertations, and about 400,000 lines of CFD code); design the CombustionQA benchmark (436 questions across eight subfields); and propose a three-stage knowledge-injection pathway: (1) lightweight RAG, (2) knowledge-graph-enhanced retrieval, (3) continued pretraining.
Result: Stage 1 (naive RAG) reaches 60% accuracy, well above zero-shot performance (23%) but far below the theoretical upper bound (87%). Performance is limited by context contamination, which supports the need for Stages 2 and 3 (knowledge graphs and continued pretraining) when building a domain foundation model.
Conclusion: RAG alone cannot fully unlock the potential of LLMs in combustion science. Structured knowledge representation (such as knowledge graphs) must be combined with model-level knowledge internalization (continued pretraining) to build a robust, trustworthy domain-specific foundation model.
Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5-billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).
[18] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models
Wai Tuck Wong,Jun Sun,Arunesh Sinha
Main category: cs.CL
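TL;DR: This paper proposes a new attack that optimizes a loss function designed to maximize numerical instability at inference time, generating adversarial images that significantly degrade the performance of multimodal large language models (MLLMs). This failure mode differs from conventional adversarial perturbations.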
Details
Motivation: As multimodal large language models (MLLMs) are widely deployed, studying their failure mechanisms is essential. Existing adversarial attacks mainly perturb outputs directly, whereas this paper targets an indirect failure mode induced by numerical instability.
Method: Design a loss function that amplifies numerical instability in the model's inference stage and use it as the optimization target to generate adversarial images. Validate on several state-of-the-art MLLMs (e.g., LLaVA, Idefics3, SmolVLM) and standard multimodal benchmarks (Flickr30k, MMVet, and others).
Result: Very small image changes are enough to degrade model performance significantly, and the failure is consistent across models and datasets. Conventional adversarial perturbations do not explain the phenomenon, which points to a new failure vector.
Conclusion: Numerical instability can constitute a fundamental class of MLLM vulnerability and should be part of robustness evaluation. The finding suggests that numerical stability needs stronger control during model training and deployment.
Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.
[19] Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam
Michael Majurski,Cynthia Matuszek
Main category: cs.CL
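TL;DR: This paper studies how the clarity of a question's phrasing and the quality of background information affect language model answer accuracy. It finds that combining dynamic context construction (such as RAG) with question rewriting significantly improves accuracy, and that the gain requires separate stages (rewrite first, then answer) and cannot be obtained through inference-time prompt engineering alone.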
Details
Motivation: Question clarity and context quality strongly affect language model performance, but how the two interact has not been fully explored.
Method: Rewrite user questions to reduce ambiguity using answer-free grounding context, combined with RAG-style dynamic context construction. Compare results before and after rewriting and against other ways of supplying the context (such as prepending it to the question). Use gpt-oss-20b to rewrite questions and gpt-5-mini to answer them, evaluated on the Humanity's Last Exam benchmark.
Result: On a subset of Humanity's Last Exam, gpt-5-mini accuracy rises from 0.14 to 0.37. The gain cannot be fully recovered through inference-time prompting alone; the rewriting and answering phases must be separate.
Conclusion: Question rewriting is a low-cost, high-benefit way to improve performance. Its effect depends on high-quality, answer-free background information and on a structured two-phase workflow.
Abstract: How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using gpt-oss-20b to rewrite a subset of Humanity's Last Exam using answer-free grounding context improves gpt-5-mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
[20] From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models
Junlong Tong,Zilong Wang,YuJie Ren,Peiran Yin,Hao Wu,Wei Zhang,Xiaoyu Shen
Main category: cs.CL
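TL;DR: This paper proposes a unified definition of streaming large language models (streaming LLMs), builds a systematic taxonomy, and analyzes their methods, applications, and future research directions.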
Details
Motivation: Existing definitions of streaming LLMs are fragmented, conflating streaming generation, streaming inputs, and interactive architectures, and no systematic taxonomy exists, which makes it hard to support dynamic, real-time applications.
Method: Establish a unified definition based on data flow and dynamic interaction; propose a systematic taxonomy on that basis; analyze each class of methods in depth; survey real-world application scenarios; and outline future research directions.
Result: The paper clarifies the core concept of streaming LLMs, establishes the first systematic taxonomy, summarizes the main technical approaches, catalogs typical application areas, and proposes several research directions.
Conclusion: Streaming LLMs are a key paradigm on the path to real-time, dynamic AI and call for continued research on definition, architecture, evaluation, and application.
Abstract: Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.
[21] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang,Xiang Cheng,Chenxiao Zhao,Guobin Shen,Junjie Yang,Xiaocheng Feng,Yuxuan Gu,Xing Yu,Bing Qin
Main category: cs.CL
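TL;DR: This paper proposes GOLF, a framework that uses group-level natural-language feedback (external critiques and intra-group attempts) to guide targeted exploration in reinforcement learning. High-quality refinements are injected into training as off-policy scaffolds, improving sample efficiency by up to 2.2x in sparse-reward settings.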
Details
Motivation: Existing reinforcement learning algorithms rely only on scalar rewards and cannot make full use of the rich natural-language feedback available through interaction, which leads to inefficient exploration.
Method: GOLF aggregates two kinds of group-level language feedback: external critiques (which pinpoint errors or propose targeted fixes) and intra-group attempts (which supply alternative partial ideas and diverse failure patterns). From these it produces high-quality refinements and injects them into training as off-policy scaffolds. Generation and refinement are optimized jointly within a unified RL loop.
Result: On both verifiable and non-verifiable benchmarks, GOLF achieves better performance and exploration efficiency, with a 2.2x improvement in sample efficiency over RL methods trained solely on scalar rewards.
Conclusion: Explicitly modeling and exploiting group-level language feedback improves the exploration quality and training efficiency of LLMs in reinforcement learning and offers a new approach to agent learning driven by natural-language feedback.
Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2× improvement in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
[22] Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models
Xin Chen,Saili Uday Gadgil,Jiarong Qiu
Main category: cs.CL
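TL;DR: This paper proposes a retrieval-augmented generation method that integrates semantic alignment with evidence constraints, modeling the retrieval and generation stages jointly to improve factual consistency and verifiability.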
Details
Motivation: Existing retrieval-augmented generation methods suffer from semantic misalignment between retrieved results and generation objectives, and from insufficient use of evidence.
Method: Model the relevance between queries and candidate evidence in a unified semantic space, and introduce an explicit evidence constraint mechanism that turns retrieved evidence into a core control factor in generation.
Result: Stable improvements across multiple generation quality metrics, with better factual reliability and verifiability while preserving natural language fluency.
Conclusion: Coordinated modeling of semantic alignment and evidence constraints is both effective and necessary for improving retrieval-augmented generation.
Abstract: Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
[23] iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu,Arnav Palkhiwala,Tanya Roosta,Chirag Shah
Main category: cs.CL
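TL;DR: This paper presents iAgentBench, a dynamic open-domain QA benchmark for multi-source evidence integration. It evaluates cross-source sensemaking (evidence fusion, causal tracing, dependency resolution) and not just single-snippet extraction. Questions derive from realistic user intent and ship with traceable evidence and intermediate artifacts. Experiments show that retrieval improves accuracy but is not enough for reliable answers, underscoring the need to evaluate evidence use and not only evidence access.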
Details
Motivation: Most existing QA benchmarks can be answered by retrieving a single passage, so they cannot measure cross-source integration (evidence fusion, causal-chain tracking, resolving dependencies across facets of a topic), even as real-world generative QA systems increasingly rely on reasoning over multiple sources.
Method: Build the iAgentBench benchmark: (1) select seed topics from real-world attention signals; (2) generate questions that require multi-source synthesis, following common user intent patterns; (3) release each instance with traceable source evidence and auditable intermediate artifacts, which support contamination checks and let failures be attributed to the retrieval or synthesis stage.
Result: Experiments across multiple LLMs show that retrieval improves accuracy but that retrieval alone does not reliably resolve iAgentBench questions, indicating a significant bottleneck in evidence use (synthesis) in current systems.
Conclusion: New benchmarks should evaluate how well models use multi-source evidence for reasoning and synthesis. iAgentBench fills the gap in evaluating cross-source sensemaking and provides reproducible, auditable infrastructure for fine-grained diagnosis of model weaknesses.
Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
[24] Stan: An LLM-based thermodynamics course assistant
Eric M. Furst,Vasudevan Venkateshwaran
Main category: cs.CL
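TL;DR: This paper presents Stan, a system that uses locally deployed open-weight models (Whisper, Llama 3.1) to build a dual-role educational data pipeline. It gives students RAG-based question answering grounded in a textbook index, and gives instructors lecture summaries, confusion detection, and an archive of teaching material, while preserving privacy, predictable cost, and reproducibility.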
Details
Motivation: Existing research on AI in education focuses on student-facing tools and overlooks the potential of the same infrastructure to support instructors. Educational AI also often depends on cloud services, which brings privacy, cost, and reproducibility risks.
Method: Build a data pipeline on local hardware and open-weight models (Whisper large-v3 for speech-to-text, Llama 3.1 8B for structured extraction and question answering), with two application paths. On the student side, retrieval-augmented generation (RAG) combines lecture transcripts with a structured textbook index to answer questions precisely. On the instructor side, structured analysis of the transcripts produces per-lecture summaries, identifies points of confusion, and catalogs analogies and anecdotes. The paper also analyzes the failure modes of 7-8B models on long-text structured tasks (context truncation, bimodal outputs, schema drift) and the corresponding mitigations.
Result: Stan was deployed in an undergraduate chemical engineering thermodynamics course, demonstrating that local models can support both roles. It delivers natural-language question answering for students (with precise textbook references) and teaching-reflection support for instructors (a searchable, semester-scale teaching record), and it identifies the main challenges and practical fixes for lightweight open-weight models on structured tasks in education.
Conclusion: A shared data foundation and local open-weight models can serve both student learning and instructor development. Educational AI should emphasize controllability, privacy, and pedagogical fit, not only "intelligence". Structured extraction is the key bridge between teaching practice and AI support.
Abstract: Discussions of AI in education focus predominantly on student-facing tools (chatbots, tutors, and problem generators), while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material, providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement.
All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7-8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
[25] Optimizing Language Models for Crosslingual Knowledge Consistency
Tianyu Liu,Jirui Qi,Mrinmaya Sachan,Ryan Cotterell,Raquel Fernández,Arianna Bisazza
Main category: cs.CL
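TL;DR: This paper proposes Direct Consistency Optimization (DCO), which uses reinforcement learning without an explicit reward model to make the responses of multilingual large language models more consistent across languages, significantly outperforming existing methods.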
Details
Motivation: Large language models often show inconsistent knowledge in multilingual settings, which undermines their reliability. When the same question is asked in different languages, inconsistent responses erode trust.
Method: Propose DCO (Direct Consistency Optimization), a DPO-inspired reinforcement learning method that needs no explicit reward model and is derived directly from the LLM itself, optimizing the policy under a structured consistency reward function.
Result: DCO significantly improves crosslingual consistency across a range of LLMs and outperforms existing methods. It also performs well in bilingual settings, out-of-domain generalization, and controllable alignment. Code, training scripts, and evaluation benchmarks are all released.
Conclusion: DCO is a robust, efficient, and practical way to improve crosslingual knowledge consistency in multilingual LLMs.
Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
[26] Non-Zipfian Distribution of Stopwords and Subset Selection Models
Wentian Li,Oscar Fontanelli
Main category: cs.CL
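TL;DR: This paper proposes a stopword selection model based on word-frequency rank, using a Hill function for the probability that a word is selected as a stopword. It explains theoretically why stopwords follow a Beta Rank Function (BRF) distribution while non-stopwords are better fitted by a log-quadratic function.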
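As a small numerical illustration of the selection model, the abstract below gives the probability of a word at rank r being selected as 1/(1+(r/r_mid)^γ) and the probability of not being selected as 1/(1+(r_mid/r)^γ). The sketch evaluates both and shows that they sum to one. The parameter values `r_mid = 100` and `gamma = 2.0` are arbitrary assumptions for illustration, not fitted values from the paper.

```python
def p_selected(rank: float, r_mid: float, gamma: float) -> float:
    """Decreasing Hill function: probability that the word at `rank` is a stopword."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)


def p_not_selected(rank: float, r_mid: float, gamma: float) -> float:
    """Standard (increasing) Hill function: probability that it is not a stopword."""
    return 1.0 / (1.0 + (r_mid / rank) ** gamma)


# Hypothetical parameters chosen only to make the shape visible.
R_MID, GAMMA = 100.0, 2.0

for rank in (1, 10, 100, 1000):
    sel = p_selected(rank, R_MID, GAMMA)
    not_sel = p_not_selected(rank, R_MID, GAMMA)
    # The two probabilities are complementary at every rank.
    print(f"rank={rank:5d}  p_selected={sel:.4f}  p_not_selected={not_sel:.4f}")
```

The complementarity follows from writing x = (r/r_mid)^γ: 1/(1+x) + 1/(1+1/x) = 1/(1+x) + x/(1+x) = 1. At r = r_mid both probabilities equal 0.5, which is why r_mid acts as the midpoint of the selection curve.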
Details
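Motivation: Conventional stopword identification relies on curated lists or statistical thresholds and does not model the structural regularities of stopwords within the overall word-frequency distribution. This paper aims to build an interpretable, verifiable probabilistic selection model that starts from the differences in rank-frequency distributions.
Method: Based on the different fits for stopwords and non-stopwords in rank-frequency plots (BRF versus a log-quadratic function), propose a Hill-type selection probability as a function of word rank r, validate it on an independent corpus, and show analytically that it is compatible with Zipf's law.
Result: The Hill function model accurately describes the stopword selection probability. The analysis shows that when the full word list follows Zipf's law, the model leads to a BRF distribution for stopwords and also explains the log-quadratic fit for non-stopwords.
Conclusion: Stopwords are not chosen at random; they follow a specific monotonically decreasing probability in rank space. The model gives a unified account of the distributional differences between stopwords and non-stopwords and provides a theoretical basis and practical framework for building stopword lists automatically.
Abstract: Stopwords are words that are not very informative to the content or the meaning of a language text. Most stopwords are function words but can also be common verbs, adjectives and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot for stopwords is best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability of being selected, as a function of the word's rank $r$, is a decreasing Hill function ($1/(1+(r/r_{mid})^{\gamma})$), whereas the probability of not being selected is the standard Hill function ($1/(1+(r_{mid}/r)^{\gamma})$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, as well as explaining the quadratic fitting function for the non-stopwords.
[27] Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement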
Brian Jing Hong Nge,Stefan Su,Thanh Thi Nguyen,Campbell Wilson,Alexandra Phelan,Naomi Pfitzner
Main category: cs.CL
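TL;DR: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection. It compares traditional classifiers (such as Delta TF-IDF) with several transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across multiple datasets and analyzes the effects of SMOTE, weighted loss, POS tagging, and text augmentation.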
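One of the strategies listed above, a loss weighted by inverse class proportions, can be sketched as follows. This is an illustrative assumption about the weighting scheme: each class weight is taken as 1 / (class share of the training labels), and the loss is a weighted mean of per-example negative log-likelihoods. The paper's exact normalization is not given in the abstract, and the function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def inverse_proportion_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by 1 / (its share of the labels); rarer classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}


def weighted_cross_entropy(
    probs: List[Dict[Hashable, float]],
    labels: Sequence[Hashable],
    weights: Dict[Hashable, float],
) -> float:
    """Weighted mean of -log p(true class); normalizing by the weight sum is an assumption."""
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total


# Toy imbalanced dataset: three "neutral" examples and one "hate" example.
labels = ["neutral", "neutral", "neutral", "hate"]
weights = inverse_proportion_weights(labels)

# Toy predicted probabilities: confident on the majority class, poor on the minority.
probs = [
    {"neutral": 0.9, "hate": 0.1},
    {"neutral": 0.9, "hate": 0.1},
    {"neutral": 0.9, "hate": 0.1},
    {"neutral": 0.8, "hate": 0.2},
]
uniform = {"neutral": 1.0, "hate": 1.0}
print(weights)
print(weighted_cross_entropy(probs, labels, uniform))
print(weighted_cross_entropy(probs, labels, weights))
```

With these toy numbers the weighted loss is larger than the unweighted one, because the single misclassified minority example carries as much weight as all three majority examples combined.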
Details
Motivation: Implicit hate speech is hard to detect, and enhancement techniques perform inconsistently across datasets and models, so the interaction between enhancement strategies, models, and data characteristics needs systematic evaluation.
Method: On multiple hate speech datasets, compare traditional features (Delta TF-IDF) with several transformer models, and introduce SMOTE, class-weighted loss, POS tagging, and text data augmentation, measuring how each changes performance.
Result: The open-source gpt-oss-20b performs best overall. Delta TF-IDF reaches 98.2% accuracy on the Stormfront dataset after data augmentation. Implicit hate speech is markedly harder to detect than explicit content. Enhancement effectiveness depends heavily on the interaction of dataset, model, and technique.
Conclusion: Hate speech detection performance is shaped jointly by dataset properties, model architecture, and enhancement strategy, so methods must be selected and combined with the specific setting in mind to improve accuracy and context awareness.
Abstract: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
[28] Detection of Illicit Content on Online Marketplaces using Large Language Models
Quoc Khoa Tran,Thanh Thi Nguyen,Campbell Wilson
Main category: cs.CL
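TL;DR: This study examines how effective large language models (Llama 3.2 and Gemma 3) are at detecting illicit content in multilingual data, finding that they significantly outperform traditional models (such as BERT, SVM, and Naive Bayes) on complex multi-class classification.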
Details
Motivation: Traditional content moderation methods (manual review, rule-based systems, conventional machine learning) struggle with the dynamic obfuscation, multilingual content, and semantic complexity of illicit activity.
Method: On the multilingual DUTA10K dataset, fine-tune Llama 3.2 and Gemma 3 with parameter-efficient fine-tuning (PEFT) and quantization, and benchmark them systematically against BERT, SVM, and Naive Bayes baselines.
Result: In binary classification, Llama 3.2 performs comparably to traditional methods. In the imbalanced 40-class task, it significantly outperforms all baselines.
Conclusion: LLMs (Llama 3.2 in particular) are more adaptable and practical for complex, multilingual illicit content identification and can give law enforcement, e-commerce platforms, and cybersecurity teams better moderation tools.
Abstract: Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models.
These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
[29] AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments
Kylie Zhang,Nimra Nadeem,Lucia Zheng,Dominik Stammbach,Peter Henderson
Main category: cs.CL
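TL;DR: This paper examines how well AI models can simulate justices' questioning in U.S. Supreme Court oral arguments. It proposes a two-layer evaluation framework that measures the realism and pedagogical usefulness of simulated questions, and finds that although AI-generated questions score well on realism and coverage of legal issues, they still show low diversity in question types and a tendency toward sycophancy.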
Details
Motivation: Lawyers and law students need effective simulation tools to prepare for oral argument, but it is unclear whether existing AI models can simulate judicial questioning accurately.
Method: Using a dataset of U.S. Supreme Court oral argument transcripts, build and evaluate prompt-based and agentic oral argument simulators, and propose a two-layer evaluation framework that combines realism with pedagogical usefulness.
Result: Human annotators often perceive the AI-generated questions as realistic, and the questions achieve high recall of substantive legal issues. The models still show low diversity in question types and sycophancy, shortcomings that simple evaluation methods would not detect.
Conclusion: AI can support oral argument training to a degree, but finer-grained evaluation criteria and model improvements are needed to overcome the current limitations.
Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
[30] IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen,Yilin Niu,Cunxiang Wang,Xiaoying Ling,Ying Zhang,Pei Ke,Hongning Wang,Minlie Huang
Main category: cs.CL
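TL;DR: This paper proposes IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction following. It builds preference graphs over responses to support listwise evaluation, reflects more accurately how judge models behave in alignment optimization, and reveals significant deficiencies in current judge models.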
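To make the preference-graph idea concrete, the sketch below builds the set of pairwise preferences implied by gold instruction-following quality scores for several responses, then scores a judge's ranking by the fraction of those preferences it orders correctly. Both the quality scores and the agreement measure are illustrative assumptions; the benchmark's actual construction procedure and metrics may differ.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def preference_edges(quality: Dict[str, float]) -> Set[Tuple[str, str]]:
    """Directed edges (better, worse) for every pair with different quality."""
    edges: Set[Tuple[str, str]] = set()
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
        # Ties contribute no edge.
    return edges


def pairwise_agreement(edges: Set[Tuple[str, str]], judge_ranking: List[str]) -> float:
    """Share of gold preferences that the judge's ranking (best first) respects."""
    if not edges:
        raise ValueError("no strict preferences to evaluate against")
    position = {rid: i for i, rid in enumerate(judge_ranking)}
    agree = sum(1 for better, worse in edges if position[better] < position[worse])
    return agree / len(edges)


# Hypothetical gold scores, e.g. number of constraints each response satisfies.
gold_quality = {"r1": 3, "r2": 2, "r3": 1}
edges = preference_edges(gold_quality)

# A judge that places r3 above r2 gets one of the three pairs wrong.
score = pairwise_agreement(edges, ["r1", "r3", "r2"])
print(sorted(edges), score)
```

A single listwise ranking is checked against every pairwise preference at once, which is the practical difference from evaluating a judge on isolated response pairs.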
Details
Motivation: Existing instruction-following evaluation benchmarks have insufficient data coverage and rely on simple pairwise comparison, so they do not reflect how reliable judge models are when used for alignment optimization.
Method: Build the IF-RewardBench benchmark, covering diverse instruction and constraint types. For each instruction, construct a complete preference graph over multiple responses based on instruction-following quality, enabling listwise evaluation in place of pairwise-only comparison.
Result: Experiments reveal significant deficiencies in current judge models, and IF-RewardBench correlates more strongly with downstream task performance than existing benchmarks.
Conclusion: IF-RewardBench offers a meta-evaluation paradigm closer to the practical needs of alignment optimization and can more reliably assess and improve instruction-following judge models.
Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
[31] Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Wei Han,Pan Zhou,Shuicheng Yan
Main category: cs.CL
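TL;DR: This paper proposes the SharedLLM framework, which uses multi-grained context compression and query-aware information acquisition to substantially extend the context length of large language models without increasing training cost, enabling efficient long-text modeling.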
Details
Motivation: The context window of current large language models is limited, and continued pretraining on long-context data is too expensive, so a more efficient way to extend context is needed.
Method: Propose the SharedLLM framework, built from two stacked short-context LLMs that share parameters. The lower model is a compressor that turns long inputs into multi-grained representations; the upper model is a decoder that performs context-aware processing. Information passes between them only at the lowest layers (self-injection), and a tree-based data structure supports efficient encoding and query-aware retrieval.
Result: Trained only on 8K-token sequences, SharedLLM generalizes to inputs of more than 128K tokens. On multiple long-context benchmarks it matches or outperforms strong baselines while greatly reducing memory use and speeding up inference (2x over streaming and 3x over encoder-decoder architectures).
Conclusion: SharedLLM achieves efficient, high-performing long-context modeling at low training cost and offers a new way past the LLM context bottleneck.
Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy.
Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).
[32] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
Yebo Wu,Feng Liu,Ziwei Xie,Zhiyuan Liu,Changwang Zhang,Jun Wang,Li Li
Main category: cs.CL
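TL;DR: This paper proposes TSEmbed, a framework that combines MoE with LoRA to resolve multi-task conflict and introduces Expert-Aware Negative Sampling (EANS) to sharpen the discriminative power of embeddings, achieving state-of-the-art performance on MMEB and on industrial datasets.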
Details
Motivation: Multimodal Large Language Models (MLLMs) have strong reasoning capabilities, but task conflict makes them hard to adapt into universal multimodal embedding models.
Method: Propose the TSEmbed framework: (1) combine Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives; (2) design Expert-Aware Negative Sampling (EANS), which uses expert routing distributions as a proxy for semantic similarity and dynamically selects hard negatives that share expert activation patterns with the query; (3) adopt a two-stage learning paradigm that first solidifies expert specialization and then optimizes embeddings through EANS.
Result: State-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets.
Conclusion: TSEmbed lays a foundation for task-level scaling in universal multimodal embeddings.
Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
[33] Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation
Edward Zhang
Main category: cs.CL
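TL;DR: This paper introduces the Attention Gravitational Field (AGF) concept, decouples positional encodings from semantic embeddings to optimize LLM architecture and improve accuracy, and reports an empirical consistency between AGF and Newton's law of universal gravitation.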
Details
Motivation: To understand the underlying principles of positional relationships and encodings in large language models, and to improve the interpretability of the attention mechanism and model performance.
Method: Introduce the Attention Gravitational Field (AGF) concept, decouple positional encodings from semantic embeddings, and carry out theoretical analysis and empirical validation.
Result: Higher accuracy than existing positional encoding methods. AGF is consistent with learning and stability curves and aligns empirically with Newton's law of universal gravitation.
Conclusion: AGF offers a new perspective on the attention mechanism and supports further research on model optimization and interpretability.
Abstract: This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.
[34] Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
Natchanon Pollertlam,Witchayut Kornsuwannawit
Main category: cs.CL
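TL;DR: This paper compares long-context LLM inference with a fact-based memory system (built on the Mem0 framework) for persistent conversational AI in terms of accuracy and API cost. Long-context inference achieves higher recall on some benchmarks, while the memory system is more cost-effective on specific tasks, and its cost advantage grows with the number of interaction turns, especially at long context lengths.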
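As an illustration of the break-even reasoning in this entry, here is a minimal sketch of a cumulative cost comparison. It assumes a fixed context length, a flat discounted charge for every cached turn, and a one-time write cost for the memory system. The function names and all cost figures are hypothetical and are not taken from the paper.

```python
def long_context_cost(turns, full_turn_cost, cached_turn_cost):
    """Cumulative cost when the full history is re-sent on every turn.

    Assumption: the first turn pays the full price, and each later turn
    pays a (lower) cached price for the same context.
    """
    if turns <= 0:
        return 0
    return full_turn_cost + (turns - 1) * cached_turn_cost


def memory_cost(turns, write_cost, read_cost):
    """Cumulative cost of a memory system: one-time write plus a fixed read per turn."""
    if turns <= 0:
        return 0
    return write_cost + turns * read_cost


def break_even_turn(full_turn_cost, cached_turn_cost, write_cost, read_cost,
                    max_turns=10_000):
    """Return the first turn at which the memory system is strictly cheaper, or None."""
    for n in range(1, max_turns + 1):
        if memory_cost(n, write_cost, read_cost) < long_context_cost(
            n, full_turn_cost, cached_turn_cost
        ):
            return n
    return None


# Hypothetical cost units, chosen only to exercise the functions.
turn = break_even_turn(full_turn_cost=100, cached_turn_cost=50,
                       write_cost=400, read_cost=5)
print("memory system becomes cheaper at turn:", turn)
```

Under this simplified model the break-even turn moves earlier as the per-turn cached cost grows relative to the read cost, which matches the qualitative trend the abstract describes for longer contexts.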
Details
Motivation: Persistent conversational AI must choose between a long-context LLM and a dedicated structured memory system, and needs a quantifiable basis for trading accuracy against deployment cost.
Method: Compares the Mem0 fact-based memory system with long-context GPT-5-mini on three memory-oriented benchmarks (LongMemEval, LoCoMo, and PersonaMemv2); builds an API cost model that includes prompt caching and analyzes how cumulative cost grows under each architecture.
Result: The long-context model achieves higher factual recall on LongMemEval and LoCoMo; Mem0 is competitive on PersonaMemv2; at a 100k-token context, Mem0 becomes cheaper than the long-context approach after about ten turns, and the break-even point arrives earlier as context length grows.
Conclusion: The two architectures present a structural accuracy-cost trade-off: a memory system suits tasks that rely on stable factual attributes (such as personas), while long context suits tasks that depend on complex contextual reasoning; the cost model gives a concrete selection criterion for production deployments.
Abstract: Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
[35] Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses
Michael Hardy
Main category: cs.CL
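TL;DR: This paper meta-analyzes 890 results from LLM short-answer scoring studies and finds that how difficult a scoring task is for human experts has no observed effect on LLM performance; some tasks that are easiest for humans are hardest for LLMs. Decoder-only architectures score on average 0.37 QWK lower than encoders, tokenizer vocabulary size shows diminishing returns, and LLMs exhibit racial bias in high-stakes education contexts.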
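For readers unfamiliar with the effect size used in this entry, the following is a minimal sketch of Quadratic Weighted Kappa (QWK) between two raters, written in plain Python. It uses the standard definition (one minus the ratio of weighted observed to weighted expected disagreement); the function name and the handling of the degenerate single-category case are assumptions made for this sketch, not details from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """QWK for two equal-length lists of integer scores in [0, num_classes - 1]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_classes < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts items scored i by A and j by B.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes))
              for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # agreement expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used one identical category; treated as perfect
        # agreement here (an assumption, conventions differ).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


human = [0, 1, 2, 2, 1]
model = [0, 1, 2, 1, 1]
print(quadratic_weighted_kappa(human, model, num_classes=3))
```

On this scale, 1 means perfect agreement and 0 means chance-level agreement, which is why the abstract describes a 0.37 gap as substantial.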
Details
Motivation: Automated short-answer scoring lags behind other LLM applications, and its performance bottlenecks and influencing factors need systematic evaluation.
Method: Conducts a systematic review and meta-analysis of 890 results from LLM short-answer scoring studies, models the Quadratic Weighted Kappa (QWK) effect size with mixed-effects meta-regression, analyzes the influence of model architecture, tokenizer vocabulary size, and related factors, and adds sensitivity and bias experiments.
Result: Human scoring difficulty shows no statistically observed relationship with LLM performance; decoder-only architectures average 0.37 QWK lower than encoders; larger vocabularies bring diminishing returns; LLMs display racial discrimination in high-stakes education contexts.
Conclusion: Systems should be redesigned to anticipate the known statistical shortcomings of autoregressive models, with attention to architecture choice, tokenization, and bias mitigation, especially in high-stakes applications such as educational assessment.
Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects meta-regression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. In particular, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37, a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns, potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
[36] From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models
Ruiqi Zhang,Lingxiang Wang,Hainan Zhang,Zhiming Zheng,Yanyan Lan
Main category: cs.CL
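TL;DR: This paper proposes GDS, which detects LLM pre-training data by analyzing systematic differences in gradient behavior (update magnitude, update location, and sharpness of neuron activation), and substantially improves cross-dataset generalization and interpretability.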
Details
Motivation: Existing pre-training data detection methods are either susceptible to word-frequency bias or depend heavily on the similarity of fine-tuning data; this work takes an optimization perspective and explores differences in gradient behavior as a new detection signal.
Method: Proposes a detection method based on Gradient Deviation Scores (GDS): builds gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, and feeds them into a lightweight classifier for membership inference.
Result: Achieves state-of-the-art performance on five public datasets, with cross-dataset transferability significantly better than strong baselines; interpretability analyses reveal differences in gradient feature distributions.
Conclusion: Gradient behavior is a reliable and scalable signal for pre-training data detection, and GDS offers a practical path for copyright protection and decontamination.
Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
[37] SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts
Minduli Lasandi,Nevidu Jayatilleke
Main category: cs.CL
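TL;DR: SinhaLegal is a high-quality Sinhala legal corpus of about 2 million words across 1,206 Sri Lankan legislative texts (Acts and Bills), produced through OCR extraction, manual cleaning, and multi-faceted evaluation, and built to support Sinhala legal NLP tasks.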
Details
Motivation: Fill the gap in high-quality, structured NLP corpora for the Sinhala legal domain and support tasks such as legal text summarisation and information extraction.
Method: Systematically collects Sinhala legal documents dated 1981 to 2014 from official sources; extracts text with OCR using Google Document AI; performs manual cleaning and post-processing; builds accompanying metadata; and analyzes corpus statistics, lexical diversity, NER, topic modelling, and language model perplexity.
Result: Produces a high-quality, machine-readable SinhaLegal corpus with metadata (2M words, 1,206 documents) and validates its domain specificity and usefulness through multi-faceted evaluation; perplexity analysis indicates that existing language models have limited ability to model Sinhala legal text.
Conclusion: SinhaLegal is presented as the first large-scale, high-quality Sinhala legal corpus, advancing Sinhala legal NLP research and applications and laying groundwork for legal AI in low-resource languages.
Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
[38] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang,Fei Tan,Xuanyu Yin,Jing Leng,Aimin Zhou
Main category: cs.CL
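TL;DR: This paper proposes HACHIMI, a framework for generating theory-aligned, distribution-controllable student personas (SPs) to support research on educational LLMs. It uses a multi-agent Propose-Validate-Revise workflow with neuro-symbolic validation and stratified sampling to produce a one-million-persona dataset, and validates effectiveness and a fidelity gradient through intrinsic and external evaluation.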
Details
Motivation: Existing student persona construction relies on ad-hoc prompting or hand-crafted profiles, lacks grounding in educational theory and control over population distributions, and therefore cannot reliably support training and evaluation of educational LLMs.
Method: Formalizes Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG); designs the HACHIMI multi-agent framework, which includes a theory-anchored educational schema, a neuro-symbolic validator (enforcing developmental and psychological constraints), and stratified sampling with semantic deduplication; uses Qwen2.5-72B to generate HACHIMI-1M (1 million personas for Grades 1-12).
Result: In intrinsic evaluation, HACHIMI-1M achieves near-perfect schema validity, accurate quota satisfaction, and high diversity; in external evaluation, student agents answering the CEPS and PISA 2022 surveys align strongly with humans on math and curiosity/growth constructs but only moderately on classroom-climate and well-being constructs, revealing a fidelity gradient.
Conclusion: HACHIMI provides a theory-driven, distribution-controllable, reproducible synthetic student population for educational LLMs, supporting group-level benchmarking and social-science simulation.
Abstract: Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
[39] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications
Yunfan Zhang,Yijie Bei,Jetashree Ravi,Pawel Garbacki
Main category: cs.CL
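TL;DR: This paper introduces FireBench, an LLM instruction-following benchmark for enterprise and API scenarios. It covers six capability dimensions across applications such as information extraction, customer support, and coding agents, contains over 2,400 samples, evaluates 11 LLMs, and is open-sourced to support model selection and improvement.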
Details
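Motivation: Existing instruction-following benchmarks mainly target natural language generation constraints for chat assistants and do not capture the strict output-format, content, and procedural requirements of enterprise and API settings, so a benchmark closer to real enterprise use cases is needed.
Method: Builds FireBench from real-world enterprise and API usage patterns, covering six core capability dimensions across applications such as information extraction, customer support, and coding agents, with over 2,400 samples; systematically evaluates 11 LLMs and analyzes their instruction-following behavior in enterprise scenarios.
Result: Reveals performance differences and shared weaknesses of current LLMs on enterprise instruction-following tasks; FireBench is open-sourced so that users can assess model suitability, developers can diagnose performance, and the community can contribute.
Conclusion: FireBench fills the gap in instruction-following evaluation for enterprise and API scenarios and offers reproducible, extensible evaluation infrastructure for deploying LLMs in high-reliability workflows.
Abstract: Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
[40] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models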
Sean Lamont,Christian Walder,Paul Montague,Amir Dezfouli,Michael Norrish
Main category: cs.CL
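TL;DR: This paper proposes a training-free, low-cost intervention that sequentially repels intermediate samples from one another in feature space during Diffusion Language Model (DLM) sampling, substantially increasing generative diversity and improving performance on Pass@$k$ tasks such as code generation and mathematical reasoning.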
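To make the Pass@$k$ metric in this entry concrete, here is a minimal sketch of the widely used unbiased estimator, which computes the probability that at least one of $k$ samples drawn without replacement from $n$ generated candidates is correct, given that $c$ of the $n$ are correct. The abstract does not say which estimator the authors use, so treating this one as the metric is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem.
for k in (1, 5, 10):
    print(k, pass_at_k(n=20, c=3, k=k))
```

The estimator only rewards a batch when it contains at least one correct candidate, so samples that collapse into the same failure mode add cost without raising the score, which is the redundancy the entry targets.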
Details
Motivation: Traditional sampling approaches, including those used with Diffusion Language Models, tend to repeat the same failure modes on complex reasoning tasks and waste compute, whereas diverse outputs are essential for covering the solution space.
Method: During diffusion sampling, intermediate samples in a batch are processed sequentially, and each sample is explicitly repelled in feature space from the previous samples to suppress redundancy; the method needs no retraining, does not rely on beam search, and adds negligible compute.
Result: On the HumanEval and GSM8K benchmarks with the LLaDA-8B-Instruct model, the method significantly improves diversity and Pass@$k$ across temperature settings.
Conclusion: The method is a plug-and-play, lightweight sampling enhancement applicable to current and future Diffusion Language Models, especially for tasks that benefit from diverse solution search.
Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
[41] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina,Marios Dikaiakos,Alejandro Porcel,Tassos Stassopoulos
Main category: cs.CL
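TL;DR: This paper evaluates large language models (LLMs) on identifying the top three human values in open-ended interviews under the Schwartz Theory of Basic Values. LLMs approach human expert level on set-based metrics but fall short on exact ranking and uncertainty modeling; Qwen performs best, ensemble methods improve performance, and the study also exposes inherent value biases in the models.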
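As a concrete illustration of two ingredients named in this entry, here is a minimal sketch of the Jaccard set-overlap metric and a Borda Count ensemble over ranked top-3 predictions. The point scheme (first place earns the most points), the alphabetical tie-break, and the example rankings are assumptions made for this sketch; the paper's exact scoring and data are not given in the abstract.

```python
from collections import defaultdict


def jaccard(predicted, reference):
    """Jaccard similarity between two collections of labels, order ignored."""
    a, b = set(predicted), set(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def borda_top_k(rankings, k=3):
    """Aggregate ranked lists with a Borda count and return the top-k labels.

    Each list awards len(list) points to its first item, one fewer to the
    next, and so on. Ties are broken alphabetically (an assumption).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        for position, label in enumerate(ranking):
            scores[label] += len(ranking) - position
    ordered = sorted(scores, key=lambda label: (-scores[label], label))
    return ordered[:k]


# Made-up predictions from three hypothetical models for one interview.
model_rankings = [
    ["Security", "Tradition", "Benevolence"],
    ["Security", "Benevolence", "Power"],
    ["Benevolence", "Security", "Tradition"],
]
ensemble = borda_top_k(model_rankings, k=3)
expert = ["Benevolence", "Security", "Universalism"]  # made-up annotation
print(ensemble, jaccard(ensemble, expert))
```

Jaccard ignores order, so a prediction can score well on it while still ranking the values incorrectly, which is the gap between set-based and rank-based metrics that the entry reports.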
Details
Motivation: Although large language models (LLMs) promise to assist qualitative analysis, it remains unclear whether they can produce nuanced, reliable interpretations when the task itself is inherently ambiguous.
Method: Identifies the top three human values in long-form open-ended interviews using the Schwartz Theory of Basic Values; compares LLM outputs with expert annotations using F1, Jaccard, and RBO; analyzes how value distributions and uncertainty patterns differ between models and experts; tests several LLM ensemble methods (such as Majority Vote and Borda Count).
Result: LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but score lower on ranking accuracy (RBO); most models' average value distributions are close to the experts', while their uncertainty structures diverge; Qwen is closest to expert-level agreement and best aligned with expert value distributions; ensemble methods, especially Majority Vote and Borda Count, bring consistent gains; models systematically overemphasize certain values such as Security.
Conclusion: In highly ambiguous qualitative value analysis, LLMs show considerable promise as collaborators but also key limitations (weak ranking, divergent uncertainty modeling, and implicit value bias), so they should be applied with care and their value biases investigated further.
Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
[42] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Alexios Spanakis,Maria Lymperaiou,Giorgos Filandrianos,Athanasios Voulodimos,Giorgos Stamou
Main category: cs.CL
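TL;DR: This paper presents an agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. By decoupling semantic reasoning from structural localization and introducing DD-CoT and an "Anti-Echo Chamber" architecture, it substantially improves performance and interpretability.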
Details
Motivation: Traditional classifiers conflate semantic reasoning with structural localization, which limits performance on psycholinguistic conspiracy marker extraction and conspiracy endorsement detection; models also fall into the "Reporter Trap", misjudging objective reporting as conspiracy endorsement.
Method: Proposes Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring for marker extraction, to mitigate semantic ambiguity and character-level brittleness; for conspiracy detection, uses an "Anti-Echo Chamber" architecture in which an adversarial Parallel Council is adjudicated by a Calibrated Judge.
Result: Reaches 0.24 Macro F1 on subtask S1 (+100% over baseline) and 0.79 on S2 (+49%); the S1 system ranks 3rd on the development leaderboard.
Conclusion: The approach establishes an interpretable, psycholinguistically grounded NLP paradigm that balances accuracy with cognitive plausibility.
Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically grounded NLP.
[43] AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas,Giorgos Filandrianos,Maria Lymperaiou,Paraskevi Tzouveli,Athanasios Voulodimos,Giorgos Stamou
Main category: cs.CL
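TL;DR: This paper presents the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on multilingual, multi-domain Dimensional Aspect-Based Sentiment Analysis (DimABSA), covering three subtasks: DimASR, DimASTE, and DimASQP. The method combines fine-tuning of language-appropriate encoders with LoRA-based language-specific instruction tuning of LLMs, and the system outperforms the baselines in most settings.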
Details
Motivation: Address multilingual, multi-domain Dimensional Aspect-Based Sentiment Analysis (DimABSA), which spans continuous sentiment prediction and structured triplet/quadruplet extraction and must balance efficiency with effectiveness.
Method: Combines fine-tuning of language-appropriate encoder backbones (for DimASR) with LoRA-based language-specific instruction tuning of large language models (for DimASTE and DimASQP), yielding a parameter-efficient, task-adaptive unified framework.
Result: The proposed models achieve competitive performance and surpass the official baselines across most evaluation settings.
Conclusion: The unified yet task-adaptive design reduces training and inference requirements while maintaining strong effectiveness, supporting parameter-efficient specialization for multilingual, multi-domain DimABSA.
Abstract: In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
[44] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition
Mengze Hong,Yi Gu,Di Jiang,Hanlin Gu,Chen Jason Zhang,Lu Wang,Zhiyang Su
Main category: cs.CL
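TL;DR: This paper proposes a match-and-merge paradigm for heterogeneous language models (n-gram and neural) in federated learning, with a genetic algorithm (GMMA) and a reinforcement learning algorithm (RMMA); experiments on multiple datasets show RMMA's advantages in performance, generalization, and convergence speed.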
Details
Motivation: In federated ASR systems, acoustic models can be merged effectively, but the heterogeneous language models used for rescoring (n-gram and neural) lack an efficient merging method, and their heterogeneity makes merging difficult.
Method: Proposes a match-and-merge paradigm with two algorithms: 1) the Genetic Match-and-Merge Algorithm (GMMA), which evolves and pairs language models through genetic operations; 2) the Reinforced Match-and-Merge Algorithm (RMMA), which uses reinforcement learning for efficient convergence.
Result: On seven OpenSLR datasets, RMMA achieves the lowest average Character Error Rate (CER) and better generalization than baselines, and converges up to seven times faster than GMMA.
Conclusion: The match-and-merge paradigm offers an effective and efficient way to merge heterogeneous language models for scalable, privacy-preserving ASR systems.
Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
[45] LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen,Shuai Gong,Shiwen Zhang,Zheng Zhang,Yachao Zhao,Lingxiang Wang,Haibo Zhou,Yuan Zhan,Wei Lin,Hainan Zhang
Main category: cs.CL
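TL;DR: This paper proposes LocalSUG, an LLM-based query suggestion framework for local-life service platforms. Through city-aware candidate mining, a modified GRPO algorithm, and quality-aware acceleration, it addresses three challenges (missing geographic grounding, exposure bias in preference optimization, and online latency), improving CTR and lowering the low/no-result rate.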
Details
Motivation: Traditional multi-stage cascading systems rely on historical top queries and struggle with long-tail demand, while LLMs deployed in local-life services face three problems: lack of geographic context, exposure bias in preference optimization, and online inference latency.
Method: Proposes the LocalSUG framework: 1) a city-aware candidate mining strategy based on term co-occurrence to strengthen geographic grounding; 2) a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, plus a multi-objective reward that balances relevance and business metrics; 3) quality-aware beam acceleration and vocabulary pruning to reduce latency.
Result: Offline evaluation and large-scale online A/B testing show that LocalSUG raises click-through rate (CTR) by 0.35% and reduces the low/no-result rate by 2.56%.
Conclusion: LocalSUG addresses the key deployment obstacles for LLM query suggestion in local-life services, improving business metrics while preserving generation quality.
Abstract: In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by 0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
[46] Replaying pre-training data improves fine-tuning
Suhas Kotha,Percy Liang
Main category: cs.CL
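TL;DR: This paper finds that replaying generic data during domain fine-tuning (generic replay) not only prevents catastrophic forgetting but also significantly improves performance on the target domain task, with larger gains when target data is scarce.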
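As a rough illustration of what generic replay means at the data level, here is a minimal sketch that mixes sampled generic examples into a fine-tuning set so that they make up a chosen fraction of the result. The abstract does not specify the paper's mixing ratio, schedule, or sampling scheme, so the `replay_fraction` parameter, sampling with replacement, and the function name are all assumptions made for this sketch.

```python
import random


def mix_with_replay(target_examples, generic_examples, replay_fraction, seed=0):
    """Return a shuffled fine-tuning set in which roughly `replay_fraction`
    of the examples are replayed generic data.

    All target examples are kept; generic examples are sampled with
    replacement (an assumption) until the requested fraction is reached.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    if replay_fraction > 0.0 and not generic_examples:
        raise ValueError("generic_examples is empty but replay was requested")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    # Solve n_generic / (n_target + n_generic) = replay_fraction for n_generic.
    n_generic = round(len(target_examples) * replay_fraction / (1.0 - replay_fraction))
    replayed = rng.choices(generic_examples, k=n_generic) if n_generic else []

    mixed = list(target_examples) + replayed
    rng.shuffle(mixed)
    return mixed


# Toy strings standing in for tokenized documents.
target = ["t1", "t2", "t3", "t4"]
generic = ["g1", "g2"]
print(mix_with_replay(target, generic, replay_fraction=0.5))
```

Setting `replay_fraction` to 0 recovers plain fine-tuning on the target data, which gives a natural baseline for comparison.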
Details
Motivation: The current paradigm pre-trains on large amounts of generic text and then fine-tunes on a small amount of target-domain data, with generic data typically mixed in during fine-tuning only to prevent forgetting of generic knowledge. The authors question this view and ask whether replaying generic data also benefits the target task.
Method: In a controlled pre-training environment (4M target tokens, 4B total tokens, 150M-parameter models), systematically evaluates generic replay during fine-tuning and mid-training; further analyzes replay under data schedules that introduce target data during pre-training; and validates the effect in practice on 8B-parameter models.
Result: Generic replay increases target data efficiency by up to 1.87x for fine-tuning and 2.06x for mid-training; on 8B models, it improves agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Conclusion: Generic replay is a simple, effective way to improve the data efficiency of target-domain fine-tuning, and it is especially useful when target data is limited.
Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. Surprisingly, we find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
[47] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali,Myeongho Jeon,Maria Brbic
Main category: cs.CL
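TL;DR: This paper proposes Confidence-Weighted Preference Optimization (CW-PO), which uses a weak LLM's confidence to select and re-weight training samples. It substantially reduces human annotation cost, and with only 20% of the human annotations it outperforms standard DPO trained on all of them.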
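To make the idea of confidence weighting concrete, here is a minimal sketch that applies a per-pair weight to the standard DPO loss and optionally drops low-confidence pairs. The abstract does not give CW-PO's actual weighting function or filtering rule, so using the weak model's confidence directly as the weight, the `min_confidence` threshold, and the dictionary field names are assumptions made for this sketch.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities.

    loss = -log sigmoid(beta * [(policy_chosen - ref_chosen)
                                - (policy_rejected - ref_rejected)])
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def confidence_weighted_loss(pairs, beta=0.1, min_confidence=0.0):
    """Confidence-weighted mean of per-pair DPO losses.

    Each pair is a dict holding the four log-probabilities and a
    'confidence' in [0, 1] assigned by a weak annotator model (hypothetical
    field names). Pairs below `min_confidence` are discarded.
    """
    kept = [p for p in pairs if p["confidence"] >= min_confidence]
    total_weight = sum(p["confidence"] for p in kept)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        p["confidence"] * dpo_loss(p["policy_chosen"], p["policy_rejected"],
                                   p["ref_chosen"], p["ref_rejected"], beta)
        for p in kept
    )
    return weighted / total_weight


# Made-up log-probabilities and confidences for two preference pairs.
pairs = [
    {"policy_chosen": -10.0, "policy_rejected": -14.0,
     "ref_chosen": -11.0, "ref_rejected": -12.0, "confidence": 0.9},
    {"policy_chosen": -9.0, "policy_rejected": -8.5,
     "ref_chosen": -9.0, "ref_rejected": -9.0, "confidence": 0.2},
]
print(confidence_weighted_loss(pairs, beta=0.1, min_confidence=0.5))
```

Normalizing by the total weight keeps the loss scale comparable across batches with different confidence levels; other normalizations are equally plausible.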
Details
Motivation: Existing preference alignment methods depend on costly human annotation or large API-based models; this work asks whether a weak LLM can serve as an efficient, low-cost annotator instead.
Method: Proposes the Confidence-Weighted Preference Optimization (CW-PO) framework: a weak LLM labels preference pairs and estimates its prediction confidence, highly confident samples are selected, and training samples are re-weighted by confidence within a preference optimization objective such as DPO.
Result: With only 20% of the human annotations, CW-PO outperforms standard DPO trained on 100% of them, indicating that a weak LLM with confidence weighting can lower alignment cost while improving results.
Conclusion: Weak language models combined with confidence weighting offer an efficient, low-cost, high-performing approach to preference alignment and challenge the assumption that alignment requires extensive high-quality human annotation.
Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. Surprisingly, we find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
[48] MPCEval: A Benchmark for Multi-Party Conversation Generation
Minxing Zhang,Yi Yang,Zhuofan Jia,Xuan Yang,Jian Pei,Yuchen Zang,Xingwang Deng,Xianglong Chen
Main category: cs.CL
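TL;DR: This paper introduces MPCEval, an evaluation and benchmarking suite for multi-party conversation generation. It addresses the shortcomings of existing evaluation methods in multi-party settings by decomposing generation quality into speaker modeling, content quality, and speaker-content consistency, and provides reference-free, reproducible quantitative metrics.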
Details
Motivation: Evaluating multi-party conversation generation is difficult because of complex turn-taking, role-dependent behavior, long-range structure, and multiple equally valid continuations, and existing methods do not measure generation quality well in this setting.
Method: Proposes the MPCEval framework, which decomposes generation quality into speaker modeling, content quality, and speaker-content consistency, and distinguishes local next-turn prediction from global full-conversation generation; designs novel reference-free, quantitative, reproducible metrics and validates them on multiple public and real-world datasets.
Result: MPCEval reveals systematic model differences in participation balance, content progression and novelty, and speaker-content consistency, showing that single-score evaluation hides fundamental differences in multi-party conversational behavior.
Conclusion: MPCEval provides a finer-grained, task-aware evaluation paradigm for multi-party conversation generation and shows that the choice of evaluation objective strongly shapes model assessment.
Abstract: Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker-content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker-content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
[49] VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu,Ning Xu,Junming Yang,Hao Xu,Xin Geng
Main category: cs.CL
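TL;DR: This paper proposes VRM (Variational Reward Modeling), which models the human preference judgment process by introducing high-dimensional objective weights and low-dimensional semantic features as latent variables to mitigate reward hacking. It proves a tighter generalization error bound, and experiments show it captures authentic human preferences significantly better than existing methods.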
Details
Motivation: Existing reward models map prompt-response pairs directly to scalar scores and tend to capture spurious correlations rather than authentic human preferences, whereas human evaluators first weigh the importance of multiple objectives according to the prompt context and then assess response quality through low-dimensional semantic features such as logical coherence.
Method: Proposes the VRM (Variational Reward Modeling) framework, which models high-dimensional objective weights and low-dimensional semantic features as latent variables inferred via variational inference, and provides a theoretical analysis showing a tighter generalization error bound.
Result: Extensive experiments on benchmark datasets show that VRM significantly outperforms existing methods in capturing authentic human preferences.
Conclusion: By modeling evaluation in a way closer to how humans judge, VRM mitigates reward hacking, improves the reward model's ability to capture authentic preferences, and comes with a stronger theoretical generalization guarantee.
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models used to align them are prone to reward hacking: these approaches predominantly map prompt-response pairs directly to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.
[50] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol,Nut Chukamphaeng,Kunat Pipatanakul,Pakhapoom Sarapat
Main category: cs.CL
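TL;DR: This paper introduces ThaiSafetyBench, an LLM safety benchmark for the Thai language and culture with 1,954 malicious Thai prompts. Evaluating 24 LLMs on it shows that closed-source models are generally safer than open-source ones and that culturally specific attacks succeed more often. The authors also release ThaiSafetyClassifier, a lightweight Thai harmful-response classifier, and a public leaderboard.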
Details
Motivation: 现有大语言模型安全评估主要集中于英语,忽视了非英语语言及文化背景下的风险,尤其是泰语及泰国文化语境下的安全漏洞亟待系统研究。 Method: 构建泰语安全评估基准ThaiSafetyBench(含1954条泰语恶意提示,覆盖通用与泰国文化特异性攻击);使用GPT-4.1和Gemini-2.5-Pro作为裁判评估24个LLM的安全性;训练DeBERTa-based的泰语有害响应分类器ThaiSafetyClassifier并公开模型与代码;建立持续更新的ThaiSafetyBench leaderboard。 Result: 闭源模型在泰语安全任务上显著优于开源模型;泰国文化语境攻击的攻击成功率(ASR)明显高于通用泰语攻击;ThaiSafetyClassifier达到84.4%加权F1分数,与GPT-4.1判断一致。 Conclusion: 当前LLM安全对齐方法在非英语、特别是文化深度嵌入场景下存在严重短板;ThaiSafetyBench及其配套工具为泰语AI安全研究提供了可复现、低成本、社区驱动的基础设施。 Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. 
Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
[51] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu,Guanting Chen,Bing Wei,Haoran Luo
Main category: cs.CL
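TL;DR: This paper proposes HiFlow, a framework that uses hierarchical feedback-driven optimization to handle complex constraints in long-form text generation while balancing global structural consistency and local semantic coherence.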
Details
Motivation: Large language models still perform poorly at long-form text generation, especially under complex constraints, and existing methods struggle to coordinate global and local objectives. Method: Propose HiFlow, a hierarchical feedback-driven optimization framework with a planning layer (modeling global structure and constraints) and a generation layer (conditioned text generation), together with constraint-aware plan screening and closed-loop feedback at both levels. Result: Experiments on multiple backbone models show that HiFlow outperforms baseline methods. Conclusion: HiFlow jointly optimizes planning quality and generation behavior, progressively producing high-quality long texts that satisfy the constraints. Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.
[52] NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension
Rongzhi Li,Hitomi Yanaka
Main category: cs.CL
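TL;DR: This paper proposes NeuronMoE, which analyzes the cross-lingual diversity of language-specific neurons across Transformer components to guide per-layer expert allocation, substantially reducing parameter count while preserving performance.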
Details
Motivation: Extending large language models to low-resource languages is essential for global accessibility, but training a separate model per language is too expensive; existing MoE methods allocate experts by layer and ignore fine-grained, neuron-level language specificity. Method: Propose NeuronMoE, which allocates experts in each layer according to neuron-level language specificity, based on empirically measured cross-lingual neuron diversity and covering all Transformer components. Result: Applied to Llama-3.2-3B for Greek, Turkish, and Hungarian, it reduces parameters by about 40% on average while matching the LayerMoE baseline; low-resource language experts independently develop neuron specialization patterns similar to those of the high-resource language, concentrated in early and late layers. Conclusion: NeuronMoE suggests that there may be universal architectural principles in how multilingual models organize linguistic knowledge, and offers a new path toward efficient modeling of low-resource languages. Abstract: Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose NeuronMoE, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.
[53] MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad,Fajar Saleem,Ijaz Hussain
Main category: cs.CL
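TL;DR: This paper proposes MUTEX, a framework combining a multilingual Transformer (XLM-RoBERTa) with a conditional random field (CRF) that provides the first fine-grained toxic span detection for Urdu, reaching a 60% F1 score on a manually annotated token-level dataset.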
Details
Motivation: Most existing systems perform sentence-level classification and cannot locate the specific toxic spans; the problem is compounded by Urdu's lack of token-level annotated resources, its linguistic complexity, frequent code-switching, informal expressions, and rich morphological variation. Method: Propose the MUTEX framework: an XLM-RoBERTa-based sequence labeling model with a CRF layer on top, trained and evaluated on a manually built Urdu dataset with token-level toxic span annotations drawn from multiple domains (social media, news, and YouTube comments), using token-level F1 as the metric. Result: MUTEX reaches a 60% token-level F1 score on Urdu toxic span detection, the first supervised baseline for the task; experiments indicate that Transformer models capture contextual toxicity implicitly and mitigate the challenges of code-switching and morphological variation. Conclusion: MUTEX demonstrates the effectiveness and interpretability of multilingual pretrained models combined with a CRF for fine-grained toxicity detection in low-resource, highly complex languages such as Urdu, and provides a baseline and methodological guidance for future work. Abstract: Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within the text. This limitation is further exacerbated by multiple factors, i.e., the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variations. In this research, we propose MUTEX: a multilingual transformer combined with conditional random fields (CRF) for Urdu toxic span detection, a framework that uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and are better able to address the issues of code-switching and morphological variation.
[54] ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI
Jens Lehmann,Syeda Khushbakht,Nikoo Salehfard,Nur A Zarin Nishat,Dhananjay Bhandiwad,Andrei Aioanei,Sahar Vahdati
Main category: cs.CL
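TL;DR: This paper presents ARC-TGI, a framework for generating diverse ARC-AGI tasks that share a latent rule. It supports task-level constraints so that the training examples sufficiently reveal the rule, and provides natural-language reasoning chains and executable code, enabling scalable data sampling and controlled benchmarking.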
Details
Motivation: Static ARC-AGI datasets suffer from overfitting, data leakage, and memorization, making it hard to measure a model's true ability in abstraction and rule induction. Method: Build the open-source ARC-TGI framework of compact Python task-family generators; adopt a solver-facing representation that pairs each task with natural-language input and reasoning chains plus partially evaluated Python code; introduce task-level constraints to ensure the rule can be inferred; refine and locally verify every generator by hand. Result: Releases 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), supporting scalable sampling and controlled evaluation. Conclusion: Through structured, interpretable, and constrained task generation, ARC-TGI improves the reliability and scalability of ARC-style benchmarks and provides a more robust experimental basis for research on abstraction and inductive reasoning. Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
[55] Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen,Guangzhi Sun,Philip C Woodland
Main category: cs.CL
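TL;DR: This paper studies decoder redundancy in speech large language models (SpeechLLMs) and finds that it is largely inherited from the pretrained text LLM. Pruning experiments show that 7-8B models retain good ASR performance with only 60% of the decoder layers, and that the redundancy structure is consistent across speech encoders, tasks, and languages, which supports building a general, pruned multi-task SpeechLLM backbone.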
Details
Motivation: Investigate how much capacity of the LLM decoder, which accounts for most (>90%) of a SpeechLLM's parameters, is actually needed for speech tasks, and understand the source and structure of its redundancy. Method: Compare decoder-block redundancy under text and speech inputs across two LLM families and three scales (1B-8B); quantify redundancy by pruning decoder layers and analyzing post-pruning healing; extend the findings to speech translation and check the consistency of redundant blocks across encoders, tasks, and languages. Result: 7-8B SpeechLLMs retain good ASR performance with only 60% of decoder layers; redundancy patterns closely match those of the pretrained text LLM; the same decoder layers are redundant across speech encoders, ASR/ST tasks, and multiple languages. Conclusion: SpeechLLM decoders exhibit a global redundancy structure that is independent of task and language, which can support a lightweight, general, multi-task pruned SpeechLLM backbone. Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned, multi-task SpeechLLM backbone to be deployed.
[56] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting
Yewen Li,Zhiyi Lyu,Peng Jiang,Qingpeng Cai,Fei Pan,Bo An,Peng Jiang
Main category: cs.CL
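TL;DR: This paper proposes a hierarchical Large autoBidding Model (LBM) that combines the reasoning ability of LLMs with numerical decision-making. Using a dual embedding mechanism and GQPO, an offline reinforcement fine-tuning method, it improves the interpretability, generalization, and performance of auto-bidding strategies.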
Details
Motivation: Existing auto-bidding methods suffer from black-box training, limited mode coverage, difficulty in understanding task status, and poor generalization in dynamic ad environments; directly applying large language models (LLMs), in turn, is prone to hallucination, lacks bidding domain knowledge, and struggles to produce precise actions. Method: Propose the hierarchical Large autoBidding Model (LBM), with a high-level LBM-Think for reasoning and a low-level LBM-Act for action generation; introduce a dual embedding mechanism to fuse language and numerical inputs; design GQPO, an offline reinforcement fine-tuning method that suppresses hallucinations and improves decision performance without simulation or online rollouts. Result: Experiments show that a generative backbone based on LBM clearly outperforms existing methods in training efficiency and generalization. Conclusion: By combining LLM reasoning with numerical action control in a structured way, supported by a targeted training mechanism, LBM offers a more reliable, interpretable, and generalizable paradigm for ad auto-bidding. Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation.
Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LBM-Think's hallucinations and enhancing decision-making performance without the simulation or real-world rollouts required by previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in training efficiency and generalization ability.
[57] Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions
Theresa Elstner,Martin Potthast
Main category: cs.CL
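TL;DR: This paper introduces the notion of Representation Fidelity for assessing whether the representation of a person used in an algorithmic decision is reasonable. It is quantified as the distance between the externally prescribed input representation and the person's self-description, and the authors build the first representation fidelity benchmark, in the loan-granting domain (30,000 synthetic self-descriptions with expert annotations).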
Details
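Motivation: Existing validation of algorithmic decisions focuses mostly on outcome fairness and overlooks whether the representation of a person on which a decision rests truthfully and reasonably reflects that individual; a new, measurable validation dimension is needed at the representation level. Method: Define representation fidelity as the distance between the external input representation and the individual's self-description; analyze the types of discrepancy and derive a generic typology of mismatches; build the Loan-Granting Self-Representations Corpus 2025 (30,000 synthetic natural language self-descriptions generated from the German Credit Dataset, with expert annotations of representation mismatches). Result: A quantification framework and mismatch typology for representation fidelity; the first benchmark dataset for evaluating representation fidelity, with accompanying annotations; a demonstration of feasibility in the loan-granting setting. Conclusion: Representation fidelity is a new foundational dimension of algorithmic accountability that can expose otherwise overlooked representational distortions; the framework can be extended to other high-stakes decision settings, moving algorithmic governance from "fair outcomes" toward "reasonable representations". Abstract: This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures if decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations, how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 consists of a large corpus of 30 000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.
[58] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers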
Ruichen Xu,Wenjing Yan,Ying-Jun Angela Zhang
Main category: cs.CL
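TL;DR: This paper studies how analogical reasoning emerges in large language models and, through theoretical proofs and experiments, shows that representation alignment is the key unifying mechanism behind it.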
Details
Motivation: Existing evaluations conflate multiple reasoning types, making it hard to analyze the emergence of analogical reasoning in large language models in isolation. Method: Theoretical analysis (proving conditions for joint training, sequential training, and two-hop reasoning) combined with experiments on models of up to 1.5B parameters, examining the relationship between representational geometry and reasoning ability. Result: Analogical reasoning is shown to depend on the alignment of entity representations; sequential training requires a specific curriculum order; two-hop reasoning reduces to analogical reasoning with explicit identity bridges; experiments confirm that representational geometry shapes inductive reasoning ability. Conclusion: In transformers, analogical reasoning emerges naturally through the alignment of representations of similar entities; it is a unified mechanism driven by representational geometry rather than by complex symbolic manipulation. Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
[59] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal,Rauno Arike
Main category: cs.CL
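TL;DR: This paper introduces C2-Faith, a benchmark for evaluating how well large language models (LLMs), when used as judges of chain-of-thought (CoT) reasoning, assess the faithfulness of the reasoning process in terms of causality and coverage. Experiments show that current frontier LLM judges perform unevenly across tasks: detecting errors is easier than localizing them, and coverage scores tend to be inflated.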
Details
Motivation: Existing work often uses large language models as judges of chain-of-thought outputs, but it is unclear whether they can reliably judge whether the reasoning process is faithful (whether steps are logically coherent and key inferences are complete) rather than merely whether the final answer is plausible. Method: Build the C2-Faith benchmark from PRM800K, using controlled perturbations to generate examples with annotated causal error positions and controlled coverage deletions; define three tasks (binary causal error detection, causal error step localization, and coverage scoring); evaluate three frontier LLM judges on them. Result: (1) Model rankings vary substantially across tasks, with no single model dominating; (2) all models are better at detecting errors than at localizing them, showing a clear capability gap; (3) coverage scores for incomplete reasoning are systematically inflated. Conclusion: The ability of LLMs to judge reasoning processes is task-dependent and limited; the study clarifies where such judges are applicable and provides empirical evidence and practical guidance for choosing a judge model. Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
[60] Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Di Zhang,Xun Wu,Shaohan Huang,Yudong Wang,Hanyong Shao,Yingbo Hao,Zewen Chi,Li Dong,Ting Song,Yan Xia,Zhifang Sui,Furu Wei
Main category: cs.CL
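TL;DR: This paper proposes Sparse-BitNet, the first framework to jointly apply 1.58-bit quantization and dynamic N:M sparsification. It shows that low-bit models such as BitNet are better suited to N:M sparsity than full-precision models, and it improves training and inference efficiency while keeping training stable.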
Details
Motivation: Semi-structured N:M sparsity and low-bit quantization (such as 1.58-bit BitNet) are two promising ways to make large language models more efficient, but they have been studied in isolation; this work examines how they interact and builds a unified, efficient framework. Method: Propose Sparse-BitNet, a unified framework that combines 1.58-bit quantization with dynamic N:M sparsification, supports sparse pretraining and dense-to-sparse training schedules, and uses a custom sparse tensor core to accelerate computation. Result: At the same sparsity level, 1.58-bit BitNet degrades less and tolerates higher structured sparsity before collapsing; speedups of up to 1.30X are achieved in training and inference. Conclusion: Combining extremely low-bit quantization with semi-structured N:M sparsity is a promising and practical direction for efficient large language models. Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
[61] Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions
Kun Chen,Xianglei Liao,Kaixue Fei,Yi Xing,Xinrui Li
Main category: cs.CL
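TL;DR: This paper proposes a systematic, operational annotation guideline for legal argumentation structures, covering proposition types, argumentative relations, formal representation, visualization conventions, and a standardized annotation workflow, with the aim of supporting large-scale computational analysis of judicial reasoning and legal AI research.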
Details
Motivation: Reveal the logical structure of judicial reasoning and provide a reliable data foundation for the computational analysis of legal argumentation. Method: Grounded in theories of legal reasoning and argumentation, build an annotation framework with four proposition types (general and specific normative propositions, general and specific factual propositions) and five relation types (support, attack, joint, match, identity), and define formal representation rules, visualization conventions, and a standardized annotation workflow. Result: A complete, operational annotation guideline for legal argumentation structures that supports consistent graphical representation of complex argumentation patterns and the reproducibility and reliability of annotated data. Conclusion: By providing a clear conceptual model, formal rules, and practical procedures, the guideline offers methodological support for large-scale analysis of judicial reasoning, legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis. Abstract: This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
[62] Transducing Language Models
Vésteinn Snæbjarnarson,Samuel Kiegeland,Tianyu Liu,Reda Boumasmoud,Ryan Cotterell,Tim Vieira
Main category: cs.CL
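TL;DR: This paper proposes a framework that generalizes language models through deterministic string-to-string transformations, in particular finite-state transducers (FSTs). It supports probabilistic inference (marginalization and conditioning) over the transformed outputs without modifying model parameters, so that a pretrained language model can be adapted at inference time to a different output format (for example tokens to words, or DNA to amino acids).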
Details
Motivation: The output format of existing language models (such as subwords or bytes) often differs from the format a downstream task needs (such as words or amino-acid sequences), so a deterministic transformation is required to map between them; prior work, however, does not treat the transformed distribution as a complete, usable language model. Method: Compose a language model with a finite-state transducer (FST) and design exact and approximate algorithms for marginalization and conditioning in the transformed space, leaving the original model parameters unchanged. Result: The framework is validated on three tasks (tokens to bytes, tokens to words, DNA to amino acids), achieving inference-time format adaptation without fine-tuning and with efficient computation. Conclusion: Deterministic string transformations can systematically extend the output capabilities of language models, and FSTs provide a mathematical and algorithmic basis that is both expressive and efficient, turning the transformed distribution into a genuinely usable new language model. Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids.
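The induced distribution of $f(X)$ mentioned in the abstract can be illustrated on a toy case. The sketch below is not the paper's FST-based algorithm: it brute-forces the pushforward of a small finite distribution by enumerating every source string, which only works for tiny examples. The names `pushforward`, `detokenize`, and `p_tokens`, and all probabilities, are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(p, f):
    """Distribution of f(X) for X ~ p: sum p(x) over every x with f(x) == y.

    Brute-force enumeration, used here only as an illustration; it stands in
    for (and is much less efficient than) transducer-based marginalization.
    """
    q = defaultdict(Fraction)
    for x, prob in p.items():
        q[f(x)] += prob
    return dict(q)


def detokenize(tokens):
    """Deterministic string-to-string map: concatenate tokens into characters."""
    return "".join(tokens)


# Hypothetical toy "language model": a finite distribution over token sequences.
p_tokens = {
    ("ab", "c"): Fraction(1, 2),
    ("a", "bc"): Fraction(1, 4),
    ("a", "b", "d"): Fraction(1, 4),
}

# Two different tokenizations map to the same character string "abc",
# so their probabilities are added together (marginalized).
q_chars = pushforward(p_tokens, detokenize)
```

Here `q_chars` assigns 3/4 to `"abc"` and 1/4 to `"abd"`: the target string's probability is the total mass of all source strings that map to it.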
These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
[63] Diffusion LLMs can think EoS-by-EoS
Sarah Breckner,Sebastian Schuster
Main category: cs.CL
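TL;DR: This paper argues that diffusion language models (diffusion LLMs) improve their complex reasoning by using end-of-sequence (EoS) tokens as an implicit scratchpad. Experiments show that adding extra EoS tokens substantially improves performance, and causal interventions confirm that the hidden states of EoS tokens carry key reasoning information.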
Details
Motivation: Diffusion LLMs are observed to perform better when the generation length is set far above what the answer requires, especially on complex reasoning tasks; the authors seek to explain the mechanism behind this, namely whether the model uses EoS tokens as implicit computation space. Method: Run controlled prompting experiments with LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on Addition, Entity Tracking, and Sudoku; design a causal intervention that patches the hidden states of EoS tokens with those of a counterfactual generation and observe how the output changes. Result: Adding EoS tokens substantially improves reasoning accuracy; patching EoS hidden states frequently changes the output to the counterfactual result, indicating that EoS tokens carry task-relevant information. Conclusion: Diffusion LLMs do think "EoS-by-EoS": EoS tokens are not meaningless padding but a key component that carries out implicit reasoning and intermediate computation. Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.
[64] Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic
Sara Candussio,Gabriele Sarti,Gaia Saveri,Luca Bortolussi
Main category: cs.CL
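TL;DR: This paper proposes a framework that distills the semantic geometry of formal specifications, such as Signal Temporal Logic (STL), into continuous neural representations. A teacher-student architecture distills a symbolic robustness kernel into a Transformer encoder, yielding neural embeddings that are efficient, invertible, and semantically faithful.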
Details
Motivation: Existing methods rely either on symbolic kernels that are computationally expensive, anchor-dependent, and non-invertible, or on syntax-driven neural embeddings that fail to capture semantic structure; an intermediate approach that offers both semantic fidelity and computational efficiency is needed. Method: Use a teacher-student distillation setup with the symbolic robustness kernel as teacher and a Transformer encoder as student; replace standard contrastive learning with a continuous, kernel-weighted geometric alignment objective that supervises the student on semantic distances rather than discrete similarity. Result: On Signal Temporal Logic (STL), the resulting embeddings accurately preserve semantic similarity between formulae, predict robustness and constraint satisfaction with high accuracy, and are intrinsically invertible; inference is much faster because no repeated kernel computation is needed at runtime. Conclusion: The framework enables efficient, scalable neuro-symbolic reasoning and formula reconstruction, bridging the gap between the rigor of symbolic semantics and the computational efficiency of neural networks. Abstract: We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels -- which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible -- or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel's logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.
[65] Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham
Main category: cs.CL
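TL;DR: This paper proposes a vocabulary trimming method for draft models in speculative decoding. It shrinks the vocabulary substantially while keeping sufficient token coverage, which lowers draft-model latency and raises overall inference throughput.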
Details
Motivation: The draft model is often the performance bottleneck in speculative decoding: the latency of its language modeling head grows with vocabulary size, yet a large vocabulary helps coverage and agreement with the target model, so there is a fundamental trade-off between vocabulary size and latency. Method: Cast draft vocabulary selection as a constrained optimization problem in which the benefit is token coverage over assistant responses in the training data and the cost is language-modeling-head latency estimated from architecture-aware FLOPs; optimize a utility function with a Tree-structured Parzen Estimator to explore the coverage-latency Pareto frontier under a minimum coverage constraint. Result: Vocabularies can be reduced by up to 97% while keeping high coverage; on domain-specific tasks latency drops by up to 16% and throughput improves by up to 20%; on out-of-distribution tasks throughput improves by up to 6.7%. Conclusion: Vocabulary trimming is an effective way to relieve the draft-model latency bottleneck in speculative decoding, especially in domain-specific settings, and achieves a better balance between coverage and efficiency. Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage.
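The notion of token coverage under a minimum coverage constraint can be made concrete with a small sketch. This is not the paper's procedure: the abstract describes optimizing a utility with a Tree-structured Parzen Estimator and an architecture-aware FLOPs latency model, whereas the code below swaps in a simple greedy frequency cut-off and ignores latency entirely. The function name `smallest_vocab_for_coverage` and the token ids are hypothetical.

```python
from collections import Counter


def smallest_vocab_for_coverage(token_ids, min_coverage):
    """Keep the most frequent tokens until they cover min_coverage of all occurrences.

    Coverage is the fraction of token occurrences in the corpus whose id is in
    the kept vocabulary. Returns (kept_ids, achieved_coverage).
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    kept, covered = [], 0
    for tok, count in counts.most_common():
        if covered / total >= min_coverage:
            break  # constraint already met; remaining rare tokens are trimmed
        kept.append(tok)
        covered += count
    return kept, covered / total


# Hypothetical corpus of token ids taken from assistant responses.
corpus = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
kept, coverage = smallest_vocab_for_coverage(corpus, min_coverage=0.9)
```

With this corpus, tokens 1, 2, and 3 account for 9 of the 10 occurrences, so the rare token 4 is trimmed while the 90% constraint still holds. A smaller kept vocabulary means a smaller language modeling head, which is the source of the latency savings the abstract refers to.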
On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
[66] VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh,Hung Nguyen Huy,Mo El-Haj
Main category: cs.CL
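TL;DR: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, with 48,092 postings and over 15 million words from all 34 provinces and municipalities of Vietnam, covering 16 occupational domains and multiple employment types. The dataset supports NLP and labour market research, and several large language models are benchmarked on job category classification and salary estimation.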
Details
Motivation: Build a high-quality corpus of Vietnamese job advertisements that reflects Vietnam's linguistic, regional, and socio-economic diversity, filling a gap in data resources for Vietnamese NLP and labour market analysis. Method: Collect and structure 48,092 Vietnamese job advertisements with information such as job category, salary, and skills; run few-shot and fine-tuning experiments with several generative large language models (such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT) on two core tasks, job category classification and salary estimation. Result: Instruction-tuned models perform notably well on both tasks, while the experiments highlight the challenges of multilingual and Vietnamese-specific modelling for structured prediction; VietJobs becomes a new benchmark for Vietnamese NLP. Conclusion: VietJobs provides an important foundational resource and evaluation benchmark for Vietnamese NLP, research on recruitment language, socio-economic representation, and AI-driven labour market analysis. Abstract: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
[67] Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh
Mohammad Mamun Or Rashid
Main category: cs.CL
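TL;DR: This paper presents the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal corpus of Bangladesh's ethnic and indigenous languages. It covers 42 language varieties with text and audio data and is openly available to support endangered language preservation, low-resource NLP, and digital archiving.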
Details
Motivation: Bangladesh is home to approximately 40 minority languages spanning four language families, 14 of which are classified as endangered, yet it has long lacked a systematic, cross-family digital corpus. These languages are predominantly oral and have almost no computational resources.
Method: Fieldwork over 90 days across nine districts, involving 16 data collectors, 77 native speakers, and 43 validators, elicited 2224 items at three levels of linguistic granularity (lexical items, grammatical constructions, directed speech). The audio was transcribed and annotated in IPA by 10 linguists, with independent adjudication by 6 reviewers; the final data are published on the Multiling.cloud platform.
Result: A multimodal corpus of 85792 structured textual entries (each with the Bengali source text, an English translation, and an IPA transcription) and approximately 107 hours of annotated audio, covering 42 language varieties (including 2 unclassified languages), all publicly searchable.
Conclusion: The corpus fills a gap in digital resources for Bangladesh's many languages and offers both infrastructure and a methodological template for endangered language documentation, low-resource NLP, and the digital preservation of linguistic diversity in developing countries.
Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers.
The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
[68] Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Qiao Jin,Yin Fang,Lauren He,Yifan Yang,Guangzhi Xiong,Zhizheng Wang,Nicholas Wan,Joey Chan,Donald C. Comeau,Robert Leaman,Charalampos S. Floudas,Aidong Zhang,Michael F. Chiang,Yifan Peng,Zhiyong Lu
Main category: cs.CL
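TL;DR: This paper presents Med-V1, a small language model with only three billion parameters designed for biomedical evidence attribution and claim verification. It substantially outperforms its base models on multiple benchmarks, performs comparably to frontier models such as GPT-5, and provides high-quality explanations. Two practical use cases show its value for detecting LLM hallucinations and high-stakes misattributions in clinical guidelines.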
Details
Motivation: Frontier large language models such as GPT-5 can perform claim verification and hallucination detection, but they are too expensive to deploy at scale; an efficient, lightweight, and accurate alternative is needed for large-scale biomedical evidence attribution.
Method: Propose the Med-V1 family of small language models (3B parameters), trained on high-quality synthetic data newly developed in the study, and unify five biomedical benchmarks into a verification format for evaluation. Two real-world use case studies follow: quantifying hallucination rates in LLM-generated answers under different citation instructions, and automatically identifying high-stakes evidence misattributions in clinical practice guidelines.
Result: Med-V1 improves on its base models by 27.0% to 71.3% on the five biomedical verification benchmarks, performs comparably to GPT-5, and provides high-quality explanations. The use cases show that the citation format instruction strongly affects hallucination rates, and that GPT-5 generates more claims but has hallucination rates similar to GPT-4o. Med-V1 also identifies high-stakes evidence misattributions in clinical guidelines that were previously hard to find at scale.
Conclusion: Med-V1 is an efficient, accurate, and explainable lightweight model that offers a practical alternative to frontier LLMs for biomedical evidence attribution and verification.
Abstract: Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale.
Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
[69] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi,Heshaam Faili,Azadeh Shakery
Main category: cs.CL
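TL;DR: This paper introduces the PersianPunc dataset and a lightweight punctuation restoration method based on ParsBERT, which improves the readability and downstream utility of Persian ASR output while balancing efficiency and performance.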
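The token-level sequence labeling formulation mentioned for this entry can be illustrated with a minimal sketch. It only shows how punctuated text might be turned into (token, label) training pairs and back; the label inventory `PUNCT`, the `"O"` no-punctuation tag, and both function names are assumptions made for illustration and are not taken from the PersianPunc release.

```python
# Illustrative label set (assumption): a few Latin and Persian punctuation marks.
PUNCT = {".", ",", "?", "!", ":", "،", "؟", "؛"}


def to_token_labels(text):
    """Turn punctuated text into (token, label) pairs.

    Each whitespace token is labeled with the punctuation mark that follows it,
    or "O" when no mark follows.
    """
    pairs = []
    for raw in text.split():
        label = "O"
        if raw[-1] in PUNCT:
            raw, label = raw[:-1], raw[-1]
        if raw:  # skip tokens that consisted of punctuation only
            pairs.append((raw, label))
    return pairs


def restore(tokens, labels):
    """Re-insert predicted punctuation after each token."""
    return " ".join(tok + (lab if lab != "O" else "") for tok, lab in zip(tokens, labels))
```

In a setup like the one described, a token classifier would predict the label sequence from the unpunctuated tokens, and `restore` would then rebuild readable text from its predictions.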
Details
Motivation: Persian punctuation restoration lacks both a large-scale, high-quality dataset and an efficient method, even though the task is essential for improving the quality of ASR output.
Method: Construct the PersianPunc dataset of 17 million samples, formulate punctuation restoration as a token-level sequence labeling task, and fine-tune ParsBERT to solve it; compare against large language models on the same task.
Result: The BERT-based method achieves a macro-averaged F1 score of 91.33% on the test set, clearly ahead of large language models, which suffer from over-correction and high computational cost.
Conclusion: The lightweight BERT approach strikes a better balance between performance and efficiency, is suitable for real-time applications, and provides a scalable framework for other morphologically rich, low-resource languages.
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
[70] A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott,Verena Riegler,Horacio Saggion,Almudena Rascón Alcaina,Nouran Khallaf
Main category: cs.CL
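TL;DR: This paper presents a corpus of high-quality, human-simplified texts for Spanish, Catalan, and Italian. It is intended to support research on automatic text simplification and, in particular, fills a resource gap for less-resourced languages in Easy-to-Read (E2R) language.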
Details
Motivation: To address the scarcity of high-quality training and evaluation data for text simplification in less-resourced languages such as Spanish, Catalan, and Italian, and so promote the use of Easy-to-Read (E2R) language in democratic participation.
Method: Within the iDEM project, original texts related to democratic participation were collected across several text types and manually simplified to E2R level by text simplification experts. Texts were selected on the basis of relevance, copyright availability, and ethical standards; the Catalan material is the first human-annotated corpus of its kind.
Result: The first human-simplified E2R corpus for Catalan, together with high-quality, human-annotated simplification resources for Spanish and Italian; the corpus will be released freely to the public.
Conclusion: The corpus substantially narrows the data gap for automatic text simplification in less-resourced languages and advances research and practice on E2R language for participation in democratic society.
Abstract: Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high quality simplification produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularly valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.
[71] Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho,Francisco Teixeira,Thomas Rolland,Alberto Abad
Main category: cs.CL
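TL;DR: This paper studies model merging for multi-domain automatic speech recognition (ASR) and proposes a new algorithm, BoostedTSV-M, which uses singular-value boosting to mitigate rank collapse and improve numerical stability. On European Portuguese it outperforms full fine-tuning while preserving out-of-distribution generalisation.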
Details
Motivation: Large speech foundation models are usually adapted through domain-specific fine-tuning, which yields multiple customised checkpoints. Repeating full fine-tuning whenever new data arrives is computationally prohibitive, so a scalable alternative, model merging, is needed.
Method: Benchmark 11 model merging algorithms on 10 European Portuguese domains, and propose BoostedTSV-M, a new algorithm based on TSV-M that introduces singular-value boosting to mitigate rank collapse and improve numerical stability.
Result: BoostedTSV-M outperforms full fine-tuning on in-domain accuracy across the European Portuguese domains while retaining robustness and generalisation on English and multilingual tasks.
Conclusion: Model merging is an efficient and practical approach to multi-domain ASR; BoostedTSV-M improves merging performance and stability and offers a lightweight route to adapting large speech models.
Abstract: Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
[72] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning
Mohammad Mahdi Moradi,Sudhir Mudur
Main category: cs.CL
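TL;DR: This paper proposes DiSCTT, a framework that adapts large language model reasoning at test time by allocating optimization strategies dynamically according to instance-level epistemic uncertainty, estimated from agreement among sampled reasoning trajectories. High-consensus inputs are handled with supervised fine-tuning and low-consensus inputs with consensus-regularized reinforcement learning, which raises accuracy while reducing variance and computational cost.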
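The consensus-based routing summarized for this entry can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the agreement measure (share of samples matching the majority answer), the `threshold` value, and the function name `route_by_consensus` are all assumptions.

```python
from collections import Counter


def route_by_consensus(sampled_answers, threshold=0.6):
    """Pick a test-time strategy from agreement among sampled final answers.

    threshold is a hypothetical cut-off, not a value from the paper.
    """
    counts = Counter(sampled_answers)
    majority_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(sampled_answers)
    if agreement >= threshold:
        # High consensus: use the majority answer as a pseudo-label for SFT.
        return {"strategy": "sft", "pseudo_label": majority_answer, "agreement": agreement}
    # Low consensus: hand the input to the RL branch, with no pseudo-label.
    return {"strategy": "rl", "pseudo_label": None, "agreement": agreement}
```

For example, four samples of which three agree give an agreement of 0.75 and are routed to the supervised branch under this assumed threshold, while a 2-1-1 split (agreement 0.5) goes to the reinforcement learning branch.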
Details
Motivation: Existing test-time adaptation methods apply a uniform optimization objective to all inputs, which is inefficient and unstable on heterogeneous reasoning problems; strategies need to be differentiated by instance difficulty and uncertainty.
Method: DiSCTT combines difficulty awareness with consensus-guided self-curriculum learning. It first estimates instance-level epistemic uncertainty for each input from the agreement among sampled reasoning paths. High-consensus inputs are then trained with supervised fine-tuning using the majority-agreed solution as a pseudo-label, while low-consensus inputs are optimized with consensus-regularized reinforcement learning that encourages diversity under relevance constraints.
Result: Across a range of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, with higher accuracy, lower variance, and substantially less computation and wall-clock training time.
Conclusion: Explicitly modelling instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
[73] Progressive Residual Warmup for Language Model Pretraining
Tianhao Chen,Xin Xu,Lu Yin,Hao Chen,Yang Wang,Shizhe Diao,Can Yang
Main category: cs.CL
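TL;DR: This paper proposes Progressive Residual Warmup (ProRes), a method for stabilizing and accelerating the pretraining of Transformer language models. It gradually increases the weight of each layer's residual connection from 0 to 1, so that shallow layers learn first and deeper layers join later, improving training stability, convergence speed, and downstream performance.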
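A minimal sketch of the per-layer warmup scalar described for this entry is given below. The source only states that the scalar rises from 0 to 1 and that deeper layers warm up for longer; the linear shape, the constants `base_warmup` and `per_layer_extra`, and the choice to scale the branch output added to the skip path are assumptions made for illustration.

```python
def residual_scale(step, layer_idx, base_warmup=1000, per_layer_extra=200):
    """Scalar in [0, 1] applied to a layer's residual branch at a training step.

    Assumed schedule: linear warmup whose length grows with layer depth.
    """
    warmup_steps = base_warmup + per_layer_extra * layer_idx
    return min(1.0, max(0.0, step / warmup_steps))


def apply_block(x, branch_out, step, layer_idx):
    """Residual update x + alpha * f(x), with alpha taken from the warmup schedule."""
    return x + residual_scale(step, layer_idx) * branch_out
```

Under these assumed constants, layer 0 is fully active after 1000 steps, while layer 5 is only at half strength at that point and reaches 1.0 at step 2000.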
Details
Motivation: The Transformer architecture underlies modern large language models, so its pretraining stability and convergence speed are critical. Motivated by the logical dependency between stacked layers, the authors aim to let shallow layers stabilize before deeper layers take part in learning.
Method: Progressive Residual Warmup (ProRes) multiplies each layer's residual by a scalar that rises gradually from 0 to 1 over training steps, with longer warmup periods for deeper layers, implementing an "early layer learns first" strategy.
Result: ProRes is validated across model scales, normalization schemes, and initialization schemes: it improves pretraining stability, speeds up convergence, and strengthens generalization and downstream performance.
Conclusion: ProRes is a simple and effective pretraining strategy that steers the model along a better optimization trajectory by progressively activating the residual paths, and it is broadly applicable.
Abstract: Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
[74] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs
Deshan Sumanathilaka,Nicholas Micallef,Julian Hough
Main category: cs.CL
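TL;DR: This paper investigates whether low-parameter large language models (<4B) can match GPT-4-Turbo on word sense disambiguation (WSD) through reasoning-driven fine-tuning. Fine-tuning that combines Chain-of-Thought (CoT) reasoning with neighbour-word analysis lets Gemma-3-4B and Qwen-3-4B match or exceed state-of-the-art performance on FEWS and on the cross-domain Fool Me If You Can dataset, while substantially reducing computational and energy costs.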
Details
Motivation: High-parameter LLMs perform well on WSD, but their computational and energy demands limit scalability. Disambiguating rare or domain-specific senses remains a hard NLP problem, and a lightweight, efficient alternative is needed.
Method: Using the FEWS dataset augmented with semi-automated rationale annotations, fine-tune eight small open-source models (e.g., Gemma and Qwen), with a core strategy that combines Chain-of-Thought reasoning with neighbour-word analysis.
Result: The approach achieves performance comparable to GPT-4-Turbo in zero-shot settings. Gemma-3-4B and Qwen-3-4B outperform all medium-parameter baselines and existing state-of-the-art models on FEWS, and show strong cross-domain generalization on the unseen Fool Me If You Can dataset.
Conclusion: Carefully designed, reasoning-centric fine-tuning lets low-parameter LLMs achieve high accuracy and strong generalization on WSD while balancing performance and energy efficiency, offering a practical route to lightweight semantic understanding.
Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.
[75] Ensembling Language Models with Sequential Monte Carlo
Robin Shing Moon Chan,Tianyu Liu,Samuel Kiegeland,Clemente Pasti,Jacob Hoover Vigly,Timothy J. O'Donnell,Ryan Cotterell,Tim Vieira
Main category: cs.CL
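TL;DR: This paper proposes a unified f-ensemble framework for combining multiple language models, together with a byte-level sequential Monte Carlo (SMC) algorithm that enables sampling across models with different vocabularies. On structured text generation tasks it outperforms traditional probability averaging.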
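The gap between aggregating next-token probabilities and targeting an ensemble distribution over whole strings can be seen in a toy example. The sketch below is not the paper's SMC algorithm: it uses two hand-made models over two-token strings and takes the product as one possible choice of f, then compares the globally normalized ensemble with the distribution obtained by normalizing the product at each step. All model probabilities and function names are invented for illustration.

```python
from itertools import product
from math import prod

VOCAB = ("a", "b")

# Toy models (assumed numbers): (first-token distribution,
# second-token distribution conditioned on the first token).
MODEL_1 = ({"a": 0.9, "b": 0.1}, {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.9, "b": 0.1}})
MODEL_2 = ({"a": 0.5, "b": 0.5}, {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}})


def string_prob(model, s):
    """Probability a model assigns to a two-token string."""
    first, second = model
    return first[s[0]] * second[s[0]][s[1]]


def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def global_product_ensemble(models):
    """Ensemble over whole strings: multiply string probabilities, normalize once."""
    strings = ["".join(p) for p in product(VOCAB, repeat=2)]
    return normalize({s: prod(string_prob(m, s) for m in models) for s in strings})


def local_product_ensemble(models):
    """Naive decoding: multiply next-token probabilities and normalize at every step."""
    first = normalize({t: prod(m[0][t] for m in models) for t in VOCAB})
    dist = {}
    for t1 in VOCAB:
        second = normalize({t2: prod(m[1][t1][t2] for m in models) for t2 in VOCAB})
        for t2 in VOCAB:
            dist[t1 + t2] = first[t1] * second[t2]
    return dist
```

With these numbers the two distributions disagree (for instance on the strings starting with `b`), because per-step normalization discards the prefix-dependent normalizing constants. A sampler targeting the string-level distribution, such as the SMC approach described in this entry, is meant to correct for exactly that.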
Details
Motivation: Many language models and prompting strategies are available, but performance is highly sensitive to both choices. Classical ensembling is hard to apply directly during language model decoding, because simple probability averaging yields a biased, locally normalized approximation.
Method: Propose the f-ensemble framework, which composes K models through a function f, and design a byte-level sequential Monte Carlo (SMC) algorithm that samples jointly from models with mismatching vocabularies in a shared character space.
Result: On a variety of structured text generation tasks, f-ensembles outperform traditional probability averaging, indicating that better posterior approximations can improve ensemble performance.
Conclusion: The f-ensemble framework and the byte-level SMC algorithm provide a more flexible, consistent, and effective decoding scheme for language model ensembling, particularly for heterogeneous model combinations and structured generation.
Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
[76] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri,Markus Hoehnerbach,Jay Shah,Timmy Liu,Vijay Thakkar,Tri Dao
Main category: cs.CL
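TL;DR: This paper presents FlashAttention-4, which optimizes attention for the asymmetric hardware characteristics of Blackwell GPUs (such as B200/GB200). With BF16 it achieves speedups of up to 1.3× over cuDNN and 2.7× over Triton, and it is implemented in CuTe-DSL for fast compilation and high expressivity.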
Details
Motivation: FlashAttention-3 primarily targets the Hopper architecture (H100), while the AI industry has moved quickly to Blackwell (B200/GB200). On Blackwell, tensor core throughput doubles but shared memory bandwidth, exponential units, and other functional units do not scale in step, so the earlier optimizations no longer fit and the new bottlenecks must be addressed directly.
Method: Three techniques: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes; (2) software-emulated exponential and conditional softmax rescaling to reduce non-matmul operations; (3) use of tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. Everything is implemented in CuTe-DSL embedded in Python.
Result: On B200 GPUs with BF16, FlashAttention-4 reaches up to 1613 TFLOPs/s (71% utilization), with speedups of up to 1.3× over cuDNN 9.13 and 2.7× over Triton. Compile times are 20 to 30× faster than with traditional C++ templates while retaining full expressivity.
Conclusion: Through algorithm and kernel co-design, FlashAttention-4 adapts to Blackwell's asymmetric performance characteristics, substantially improves attention efficiency for long-context large models, and demonstrates the practicality of CuTe-DSL for high-performance kernel development.
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization).
Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
[77] DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Klaywert Danillo Ferreira de Souza,David Eduardo Pereira,Cláudio E. C. Campelo,Larissa Lucena Vasconcelos
Main category: cs.CL
TL;DR: This paper introduces DEBISS, a new corpus of spoken and individual debates with semi-structured features and multiple NLP annotations, addressing the scarcity of debate corpora.
Details
Motivation: Debate corpora are scarce in the state of the art because debates vary widely in application, structure, and format.
Method: Construction of the DEBISS corpus, which includes spoken and individual debates with semi-structured features and broad NLP task annotations (e.g., speech-to-text, speaker diarization, argument mining, debater quality assessment).
Result: A novel, publicly available debate corpus (DEBISS) supporting multiple NLP tasks and accommodating variability in debate formats.
Conclusion: DEBISS fills a critical gap in debate resources and enables further research and development in debate-related NLP tasks.
Abstract: The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus carries a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
[78] NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Abrar Eyasir,Tahsin Ahmed,Muhammad Ibrahim
Main category: cs.CL
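TL;DR: This paper introduces NCTB-QA, a large-scale Bangla educational question answering dataset with a balanced mix of answerable and unanswerable questions, and shows that domain-specific fine-tuning matters for how well question answering systems in low-resource languages handle unanswerable questions.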
Details
Motivation: Reading comprehension systems for low-resource languages struggle to handle unanswerable questions reliably, and existing Bangla datasets lack sufficient coverage and careful construction of such questions.
Method: Build NCTB-QA, a dataset of 87,805 question-answer pairs drawn from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board; balance answerable (57.25%) and unanswerable (42.75%) questions and include adversarial instances with plausible distractors; fine-tune and benchmark BERT, RoBERTa, and ELECTRA.
Result: BERT achieves a 313% relative improvement in F1 score (0.150 to 0.620), and semantic answer quality measured by BERTScore also rises significantly; all three models benefit from fine-tuning on the dataset.
Conclusion: NCTB-QA is a challenging new benchmark; domain-specific fine-tuning is critical for robust question answering in low-resource languages, especially when handling unanswerable questions.
Abstract: Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
[79] Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Artem Vazhentsev,Maria Marina,Daniil Moskovskiy,Sergey Pletenev,Mikhail Seleznyov,Mikhail Salnikov,Elena Tutubalina,Vasily Konovalov,Irina Nikishina,Alexander Panchenko,Viktor Moskvoretskii
Main category: cs.CL
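TL;DR: This paper proposes the task of fact-checking without retrieval, in which natural language claims are verified using the model's internal representations instead of external knowledge, and designs a comprehensive evaluation framework covering long-tail knowledge, multiple claim sources, multilinguality, and long-form generation. Experiments show that INTRA, a method based on interactions between internal representations, performs best, pointing to a scalable and easily integrated approach to fact-checking.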
Details
Motivation: Existing LLM-based fact-checking methods depend on external retrieval, are constrained by retrieval errors and data availability, and leave the model's intrinsic fact-verification capabilities largely unused.
Method: Propose the task of fact-checking without retrieval and build a comprehensive evaluation framework covering long-tail knowledge, variation in claim sources, multilinguality, and long-form generation; propose INTRA, which exploits interactions between the model's internal representations to verify claims.
Result: Experiments across 9 datasets, 18 methods, and 3 models show that logit-based methods often underperform those that use internal representations; INTRA achieves state-of-the-art performance with strong generalization.
Conclusion: Fact-checking without retrieval is a promising research direction that can complement retrieval-based methods, improve scalability, and serve as a training reward signal or as a component integrated into generation.
Abstract: Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models' intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
[80] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana,Annabel Ma,Max Loeffler,Raphael Sarfati,Eric Bigelow,Atticus Geiger,Owen Lewis,Jack Merullo
Main category: cs.CL
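TL;DR: This paper provides evidence of "performative chain-of-thought" in reasoning models: the model has already settled on an answer but keeps generating redundant reasoning steps. Activation probing and related methods show that the answer is encoded very early on easy questions (such as MMLU), whereas genuine reasoning and belief shifts appear on hard questions (such as GPQA-Diamond). A probe-based early-exit strategy substantially shortens generation without hurting accuracy.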
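The probe-based early exit mentioned for this entry can be pictured with a small sketch. It assumes a probe that emits one confidence value per chain-of-thought step and stops once that confidence has stayed above a threshold for a few consecutive steps; the `threshold` and `patience` parameters and the function name are hypothetical, and the paper's actual stopping rule may differ.

```python
def probe_guided_exit(probe_confidences, threshold=0.9, patience=2):
    """Return the index of the CoT step at which to stop, or None to run to the end.

    probe_confidences: per-step probe confidence in the model's final answer.
    threshold, patience: hypothetical settings, not values from the paper.
    """
    streak = 0
    for step, conf in enumerate(probe_confidences):
        # Count consecutive steps on which the probe is confident.
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return None
```

Requiring a short streak instead of a single confident step is a design choice made here to avoid exiting on a momentary spike in probe confidence.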
Details
Motivation: To determine whether large language models genuinely reason while generating chain-of-thought (CoT), or merely perform a surface-level "reasoning theater" that is not backed by their beliefs, and to separate the time at which the model's internal belief forms from its externally visible text output.
Method: Three analyses: (1) activation probing to decode answer confidence from the model's hidden states; (2) early forced answering; (3) a CoT monitor that judges when reasoning is complete. These are applied to two large models, DeepSeek-R1 671B and GPT-OSS 120B, comparing MMLU (easy, recall-based) with GPQA-Diamond (hard, multihop).
Result: (1) On MMLU, the final answer can be decoded from activations with high confidence early in the CoT, well before the CoT monitor judges the reasoning complete. (2) On GPQA-Diamond, genuine reasoning behaviours (such as backtracking and "aha" moments) are observed, and they occur almost exclusively where probes detect large belief shifts. (3) Probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy.
Conclusion: CoT generation is often performative, especially on easy tasks, where the model locks in an answer early and then fills in reasoning. Genuine multi-step reasoning comes with detectable changes in internal belief, so activation probes are both a diagnostic tool and a way to drive efficient adaptive computation.
Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
cs.CV [Back]
[81] Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology
Ekansh Arora
Main category: cs.CV
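TL;DR: This study examines how well the CPath-CLIP model transfers across cancer types and across species in pathology image recognition, and finds that standard vision-language alignment suffers from semantic collapse in the cross-species setting. It proposes Semantic Anchoring, which uses text to provide a stable coordinate system for visual features, improving performance and showing that language can act as a semantic control mechanism without retraining.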
Details
Motivation: Foundation models are widely used in computational pathology, but their behaviour under cross-cancer and cross-species transfer is unclear, and the mechanisms behind their failures are poorly understood.
Method: Use few-shot fine-tuning to evaluate CPath-CLIP on cancer detection with same-cancer, cross-cancer, and cross-species (canine and human) whole-slide image patches; combine embedding space analysis, Grad-CAM visualization, and ablation studies; propose Semantic Anchoring to stabilize language-guided visual representations.
Result: Few-shot fine-tuning raises same-cancer AUC by 7.7 percentage points (64.9% to 72.6%) and cross-cancer AUC by 9.47 points (56.84% to 66.31%). Cross-species performance remains below the state of the art (H-optimus-0). A new failure mode, "semantic collapse", is identified. Semantic Anchoring yields further AUC gains of 8.52% (same-cancer) and 5.67% (cross-cancer), and the benefit stems from the text-alignment mechanism itself.
Conclusion: Language does more than assist alignment: it can act as a control mechanism that enables semantic re-interpretation. Semantic Anchoring mitigates embedding collapse and offers a new approach for cross-species pathology AI.
Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents.
Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.
[82] Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living
Kooshan Hashemifard,Pau Climent-Pérez,Francisco Florez-Revuelta
Main category: cs.CV
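TL;DR: This paper proposes a multi-modal method for recognizing the daily activities of older adults. It combines visual features from a 3D CNN, 3D pose information processed by a graph convolutional network, and object detection context fused through a cross-attention mechanism, and achieves competitive classification accuracy on the Toyota SmartHome dataset.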
Details
Motivation: To address the challenges that daily activity recognition faces in Ambient Assisted Living (AAL) systems for older adults: large intra-class variability, high inter-class similarity, varying environments and camera perspectives, and complex scenes.
Method: Build a multi-modal system in which a 3D CNN extracts spatio-temporal video features, a Graph CNN models 3D human pose, and an object detection module supplies contextual information that is fused with the 3D CNN features through a cross-attention mechanism.
Result: Evaluated on Toyota SmartHome, a dataset of real-world indoor activities, the method achieves competitive classification accuracy across a range of daily activities.
Conclusion: The proposed multi-modal fusion improves the robustness and accuracy of daily activity recognition for older adults and can serve as a key component of advanced AAL monitoring systems that support their safety and autonomy.
Abstract: Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.
[83] InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities
Chengshuai Yang,Xin Yuan
Main category: cs.CV
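TL;DR: This paper introduces InverseNet, the first cross-modality benchmark for operator mismatch. It evaluates 12 methods under four scenarios and finds that deep learning methods degrade sharply under operator mismatch, while operator-conditioned methods and blind grid-search calibration recover much of the lost performance.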
Details
Motivation: State-of-the-art compressive imaging methods such as EfficientSCI degrade sharply when the assumed forward operator deviates from physical reality, yet no benchmark quantifies this pervasive operator mismatch. Method: InverseNet is a cross-modality benchmark covering CASSI, CACTI, and single-pixel cameras, with a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration); 12 reconstruction methods are evaluated systematically on 27 simulated scenes and 9 real hardware captures. Result: (1) Deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical methods; (2) performance and robustness are inversely correlated across modalities (r_s = −0.71); (3) mask-oblivious architectures cannot recover mismatch losses, whereas operator-conditioned methods recover 41-90%; (4) blind grid-search calibration reaches 85-100% of the oracle bound; real hardware experiments confirm the simulation findings. Conclusion: Operator mismatch is a key bottleneck for deploying compressive imaging; explicit operator modeling and blind calibration are essential for practical robustness; InverseNet provides a standardized evaluation framework for future work. Abstract: State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.[84] Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data
Ancymol Thomas,Jaya Sreevalsan-Nair
Main category: cs.CV
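TL;DR: This study systematically analyzes deep learning fusion strategies for Local Climate Zone (LCZ) classification from multimodal remote sensing data (SAR and MSI). Baseline hybrid fusion (FM1) combined with band grouping (BG) and label merging (LM) performs best, reaching an overall accuracy of 76.6% and notably improving predictions for minority classes.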
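As a rough illustration of what weighted decision-level fusion (the FM4 variant in this entry) generally means, the sketch below blends per-class probabilities from a SAR branch and an MSI branch with a scalar weight. The function name, the single weight `w_sar`, and the example probabilities are assumptions made for illustration; the entry does not say how the authors choose or learn the weights.

```python
def weighted_decision_fusion(probs_sar, probs_msi, w_sar=0.5):
    """Blend two per-class probability vectors and return (fused, predicted_class).

    Hypothetical helper: w_sar is the weight on the SAR branch and
    (1 - w_sar) the weight on the MSI branch.
    """
    if len(probs_sar) != len(probs_msi):
        raise ValueError("both branches must score the same classes")
    if not 0.0 <= w_sar <= 1.0:
        raise ValueError("w_sar must lie in [0, 1]")
    fused = [w_sar * ps + (1.0 - w_sar) * pm
             for ps, pm in zip(probs_sar, probs_msi)]
    # The predicted class is the index with the highest fused probability.
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Made-up branch outputs for three classes.
fused, label = weighted_decision_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], w_sar=0.25)
print(fused, label)
```

Because the branches are only combined at the probability level, each can be trained and inspected on its own, which is the usual reason for choosing decision-level over feature-level fusion.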
Details
Motivation: Existing work lacks a comprehensive analysis of the fusion mechanisms in deep learning models for multimodal LCZ classification, even though data fusion is crucial to classification accuracy. Method: On the So2Sat LCZ42 dataset, four fusion models are compared: baseline hybrid fusion (FM1), FM1 with self- and cross-attention (FM2), multi-scale Gaussian-filtered image inputs (FM3), and weighted decision-level fusion (FM4). Ablations cover pixel-, feature-, and decision-level fusion, and two grouping strategies, band grouping (BG) and label merging (LM), are evaluated. Result: FM1+BG+LM achieves the highest overall accuracy, 76.6%, outperforming the other fusion methods and improving accuracy on minority classes. Conclusion: Simple baseline hybrid fusion with sensible data and label grouping is more effective than more complex attention or filtering mechanisms; fusion design should account for data characteristics and class distribution rather than simply adding model complexity. Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. 
Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion[85] Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion
Xuan Xu,Prateek Prasanna
Main category: cs.CV
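TL;DR: This paper proposes Dual-LoRA Controllable Diffusion, a unified diffusion model that uses multi-class nuclei centroids as lightweight, annotation-efficient spatial priors. Two task-specific LoRA adapters let a single model perform both local structure completion and global structure synthesis, markedly improving the structural fidelity and realism of histopathology image restoration and generation.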
Details
Motivation: Existing methods treat histopathology restoration and generation as separate tasks and rely on weak or inconsistent structural priors, which limits realistic modeling of cellular organization. Method: The Dual-LoRA Controllable Diffusion framework uses multi-class nuclei centroids as the structure-guiding prior and pairs a shared diffusion backbone with two LoRA adapters, one for local completion and one for global synthesis. Result: For local completion, LPIPS drops from 0.1797 to 0.1524; for global synthesis, FID drops from 225.15 to 76.04; the method outperforms state-of-the-art GAN and diffusion baselines. Conclusion: The method unifies restoration and generation in one model, improves structural consistency and morphological realism, and supports scalable pan-cancer histopathology modeling. Abstract: Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.[86] Mask-aware inference with State-Space Models
Ignasi Mas,Ramon Morros,Javier-Ruiz Hidalgo,Ivan Huerta
Main category: cs.CV
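TL;DR: This paper proposes Partial Vision Mamba (PVM), which carries the idea behind Partial Convolutions for irregular missing data over to State Space Models such as Mamba, so that they can handle arbitrarily shaped invalid or missing inputs. Its effectiveness and generality are shown on depth completion, image inpainting, and classification with invalid data.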
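The mask-aware re-normalization that Partial Convolutions use, and that this entry ports to Mamba, can be sketched in one dimension. The code below is a minimal illustration of the partial-convolution rule only (bias omitted, no padding); it is not the PVM module, whose internals are not described in this entry, and the function name `partial_conv1d` is hypothetical.

```python
def partial_conv1d(x, mask, kernel):
    """Mask-aware 1D convolution without padding.

    x:      list of input values
    mask:   list of 0/1 flags, 1 where x is valid
    kernel: list of filter weights
    Returns (output, updated_mask).
    """
    k = len(kernel)
    out, out_mask = [], []
    for i in range(len(x) - k + 1):
        window_x = x[i:i + k]
        window_m = mask[i:i + k]
        valid = sum(window_m)
        if valid == 0:
            # No valid input under the window: emit 0 and keep the position invalid.
            out.append(0.0)
            out_mask.append(0)
            continue
        # Convolve over valid entries only, then rescale by (window size / valid count)
        # so the response does not shrink when part of the window is missing.
        acc = sum(w * xi * mi for w, xi, mi in zip(kernel, window_x, window_m))
        out.append(acc * k / valid)
        out_mask.append(1)
    return out, out_mask


# A constant signal with a hole: the averaging filter still returns 1.0
# wherever at least one valid sample falls inside the window.
values, new_mask = partial_conv1d([1.0] * 5, [0, 0, 0, 1, 1], [1 / 3] * 3)
print(values, new_mask)
```

The updated mask marks a position valid as soon as any input under the window was valid, which is how holes shrink layer by layer in a stack of such operations.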
Details
Motivation: Vision State Space Models such as Mamba have no built-in mechanism for arbitrarily shaped missing or invalid data, which real-world vision tasks such as depth completion routinely face; Partial Convolutions solved this for CNNs but have not been adapted to SSM architectures. Method: The Partial Vision Mamba (PVM) module brings partial-operation principles such as mask-aware re-normalization to the Mamba backbone, together with a set of architectural design rules for building models with PVM. Result: PVM's effectiveness and cross-task generality are validated on depth completion, image inpainting, and image classification with invalid data. Conclusion: PVM extends the partial-operation paradigm to vision SSMs, offering a general architecture for irregular missing data and widening the applicability of Mamba-style models in real-world settings. Abstract: Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.[87] PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
Rohan Mahadev,Joyce Yuan,Patrick Poirson,David Xue,Hao-Yu Wu,Dmitry Kislyuk
Main category: cs.CV
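TL;DR: This paper introduces PinPoint, a benchmark for evaluating composed image retrieval (CIR) systems on multiple correct answers, hard negatives, robustness, multi-image reasoning, and fairness. Using the benchmark, it identifies three major shortcomings of existing methods and proposes a training-free reranking method to improve performance.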
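To make the two kinds of measurement in this entry concrete (ranking quality with several correct answers per query, and how often explicit hard negatives are retrieved), here is a small sketch. The normalization of AP@k by `min(k, number of positives)` is one common convention and is an assumption here, as are both function names and the toy data; the benchmark's own scoring code may differ.

```python
def average_precision_at_k(ranked_ids, positives, k=10):
    """AP@k for a query with a set of correct answers (one common convention)."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in positives:
            hits += 1
            precision_sum += hits / rank  # precision at each rank holding a positive
    denom = min(k, len(positives))
    return precision_sum / denom if denom else 0.0


def hard_negative_rate_at_k(ranked_ids, hard_negatives, k=10):
    """Fraction of the top-k results that are annotated hard negatives."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in hard_negatives) / len(top)


# Toy query: two correct answers, one annotated hard negative.
ranking = ["a", "x", "b", "n1"]
print(average_precision_at_k(ranking, {"a", "b"}, k=4))
print(hard_negative_rate_at_k(ranking, {"n1"}, k=4))
```

Averaging `average_precision_at_k` over all queries gives mAP@k; keeping the hard-negative rate as a separate number is what lets a benchmark distinguish "missed a positive" from "returned something explicitly wrong".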
Details
Motivation: Existing CIR benchmarks support only a single ground-truth answer and lack the annotations needed to evaluate false-positive avoidance, robustness, and multi-image reasoning, which limits thorough model evaluation and improvement. Method: PinPoint is a real-world benchmark (7,635 queries, 329K relevance judgments, 23 query categories) with multiple correct answers, explicit hard negatives, six instruction paraphrases per query, multi-image composed queries, and demographic metadata. After analyzing 20+ methods, the authors propose a training-free reranking method built on an off-the-shelf multimodal large language model (MLLM). Result: The best current CIR methods reach only 28.5% mAP@10 and still retrieve hard negatives 9% of the time; performance varies by 25.1% across instruction paraphrases; multi-image queries perform 40-70% worse; the proposed reranker can be applied to any existing system to improve it. Conclusion: PinPoint exposes key weaknesses in CIR and pushes toward more comprehensive, realistic, and fair evaluation; the training-free reranker offers a practical route to deployment; all data and code are released. Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks. The best methods, while achieving mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.[88] SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D
Zirui Wang,Ruiping Liu,Yufan Chen,Junwei Zheng,Weijia Fan,Kunyu Peng,Di Wen,Jiale Wei,Jiaming Zhang,Rainer Stiefelhagen
Main category: cs.CV
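TL;DR: This paper proposes SGR3, a training-free framework for 3D scene graph generation that combines multimodal large language models (MLLMs) with retrieval-augmented generation (RAG). It avoids the 3D reconstruction and heuristic graph construction that traditional methods depend on, and strengthens relational reasoning through semantically aligned retrieval and weighted patch-level similarity selection.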
Details
Motivation: Existing 3D scene graph generation methods depend on multi-modal data and heuristic graph construction, which limits the flexibility and generality of relationship-triplet prediction, and they require explicit 3D reconstruction, which limits practicality. Method: SGR3 is a training-free framework built on MLLMs and RAG. It retrieves semantically aligned scene graphs with ColPali-style cross-modal retrieval, adds a weighted patch-level similarity selection mechanism to suppress blurry or semantically uninformative regions, and generates structured scene graphs directly without 3D reconstruction. Result: Without any training, SGR3 is competitive with other training-free baselines and on par with GNN-based expert models; ablations show that retrieved external knowledge is explicitly integrated into token generation rather than implicitly internalized through abstraction. Conclusion: SGR3 shows that retrieval-augmented MLLMs can produce high-quality, interpretable 3D scene graphs without training, offering a new approach to high-level semantic understanding for robots. Abstract: 3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. 
Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.[89] Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI
Prathamesh Pradeep Khole,Mario M. Brenes,Zahra Kais Petiwala,Ehsan Mirafzali,Utkarsh Gupta,Jing-Rebecca Li,Andrada Ianus,Razvan Marinescu
Main category: cs.CV
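TL;DR: This paper proposes Spinverse, which inverts dMRI signals through a differentiable Bloch-Torrey simulator to recover tissue microstructure interfaces. Interior face permeabilities are learnable parameters, so diffusion barriers are recovered without a prior on their topology.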
Details
Motivation: Most dMRI methods either assume impermeable boundaries or estimate only voxel-level parameters and cannot recover explicit microstructural interfaces; a method is needed that learns permeable boundaries adaptively without presupposing interface topology. Method: Spinverse represents tissue on a fixed tetrahedral grid and treats the permeability of each interior face as a learnable parameter, optimized by backpropagating a signal-matching loss through a fully differentiable Bloch-Torrey PDE forward simulator. Mesh-based geometric priors mitigate ill-posedness, and a staged multi-sequence optimization strategy avoids local minima. Result: On a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and shows that sequence scheduling and regularization are critical for avoiding outline-only solutions and for improving boundary accuracy and structural validity. Conclusion: Spinverse achieves permeability-aware reconstruction of microstructural interfaces without a preset topology, offering a new approach to quantitative microstructure imaging with dMRI and underscoring the role of modeling priors and optimization strategy in solving inverse problems. Abstract: Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.[90] sFRC for assessing hallucinations in medical image restoration
Prabhat Kc,Rongping Zeng,Nirmal Soni,Aldo Badano
Main category: cs.CV
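TL;DR: This paper proposes sFRC, a hallucination detection method based on Fourier Ring Correlation computed over small patches, for assessing the reliability of deep learning in sparse or undersampled medical image restoration. It validates the method's ability to detect hallucinated features, and its robustness, on CT and MRI tasks.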
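The core quantity behind this entry, Fourier Ring Correlation between a restored patch and its reference, evaluated patch by patch across the image, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a naive DFT is used to keep the block dependency-free (an FFT library would be the practical choice), and the thresholding and parameter selection that turn FRC curves into hallucination decisions are not described in this entry and are left out.

```python
import cmath
import math


def dft2(patch):
    """Naive 2D DFT of a square patch (acceptable only for tiny patches)."""
    n = len(patch)
    spectrum = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for y in range(n):
                for x in range(n):
                    angle = -2.0 * math.pi * (u * y + v * x) / n
                    total += patch[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def frc(patch_a, patch_b):
    """Fourier Ring Correlation curve: one correlation value per frequency ring."""
    n = len(patch_a)
    fa, fb = dft2(patch_a), dft2(patch_b)
    rings = n // 2 + 1
    cross = [0.0] * rings
    power_a = [0.0] * rings
    power_b = [0.0] * rings
    for u in range(n):
        for v in range(n):
            fu = u if u <= n // 2 else u - n  # signed frequency index
            fv = v if v <= n // 2 else v - n
            r = int(round(math.hypot(fu, fv)))
            if r >= rings:
                continue  # corner frequencies beyond the last full ring
            cross[r] += (fa[u][v] * fb[u][v].conjugate()).real
            power_a[r] += abs(fa[u][v]) ** 2
            power_b[r] += abs(fb[u][v]) ** 2
    curve = []
    for r in range(rings):
        denom = math.sqrt(power_a[r]) * math.sqrt(power_b[r])
        curve.append(cross[r] / denom if denom > 0 else 0.0)
    return curve


def scan_frc(img_a, img_b, patch=4, stride=4):
    """Slide a window over both images and return [((top, left), frc_curve), ...]."""
    results = []
    for top in range(0, len(img_a) - patch + 1, stride):
        for left in range(0, len(img_a[0]) - patch + 1, stride):
            pa = [row[left:left + patch] for row in img_a[top:top + patch]]
            pb = [row[left:left + patch] for row in img_b[top:top + patch]]
            results.append(((top, left), frc(pa, pb)))
    return results
```

A patch whose restored content matches the reference keeps its FRC near 1 across rings, while a patch with invented structure loses correlation at the affected spatial frequencies; scanning localizes where that happens.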
Details
Motivation: Deep learning restorations can look visually pleasing yet contain hallucinations, and easy-to-use, robust techniques and metrics for detecting them are lacking. Method: Scanning Fourier Ring Correlation (sFRC) performs FRC analysis over small patches of the DL output and its reference image while scanning across both; its parameters can be set from hallucinated features annotated by experts or from hallucination maps derived from imaging theory. Result: sFRC detects hallucinations in CT super-resolution, sparse-view CT, and subsampled MRI restoration; it is effective in the CT tests and agrees with imaging-theory predictions for MRI. The study quantifies hallucination rates of DL methods on in-distribution versus out-of-distribution data and at increasing subsampling rates, and shows that sFRC also applies to conventional regularization-based and unrolled methods. Conclusion: sFRC is an interpretable, theoretically grounded, practical hallucination detection tool that applies to many restoration methods and helps assess the trustworthiness and robustness of DL in clinical imaging. Abstract: Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. 
Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.[91] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
Chenjun Li
Main category: cs.CV
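TL;DR: This paper proposes PulseFocus, an inference-time method that structures chain-of-thought (CoT) reasoning into interleaved plan/focus blocks with soft attention gating. It addresses diffuse attention and positional bias in vision-language models during multi-image reasoning and consistently improves results on multi-image benchmarks.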
Details
Motivation: Reasoning VLMs show attention "pulses" during CoT generation, meaning sporadic, unfocused attention, together with a systematic positional bias across images; both need to be addressed. Method: PulseFocus is training-free and acts only at inference time: it structures CoT reasoning into interleaved plan and focus blocks and applies soft attention gating to the referenced image during decoding, forcing the model to explicitly choose and attend to the relevant image. Result: +3.7% on BLINK and +1.07% on MuirBench, showing consistent gains on multi-image reasoning tasks. Conclusion: Structuring and gating attention mitigates diffuse attention and positional bias in multi-image reasoning; PulseFocus offers a new training-free approach to inference-time optimization. Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks such as BLINK (+3.7%) and MuirBench (+1.07%).[92] A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification
Sai Shi
Main category: cs.CV
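TL;DR: This paper systematically evaluates three network compression methods (pruning, quantization, and knowledge distillation) for hyperspectral land cover classification, and shows that compressed models keep competitive accuracy while substantially reducing model size and computational cost.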
Details
Motivation: Deep neural networks are hard to deploy on resource-constrained platforms such as remote sensing devices and edge systems, so network compression is needed to reduce model size and computational cost. Method: A systematic evaluation of three compression strategies for convolutional neural networks (pruning, quantization, and knowledge distillation) on two benchmark hyperspectral datasets, measuring classification accuracy, memory consumption, and inference efficiency. Result: Compressed models substantially reduce model size and computational cost while maintaining competitive classification performance; the study characterizes the trade-offs among compression ratio, efficiency, and accuracy. Conclusion: Network compression has the potential to enable efficient deep learning deployment in remote sensing applications. Abstract: Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.[93] Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Shanle Yao,Armin Danesh Pazho,Narges Rashvand,Hamed Tabkhi
Main category: cs.CV
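TL;DR: This paper systematically evaluates the reliability of multimodal large language models (MLLMs) for video anomaly detection (VAD). In the zero-shot setting they show a pronounced conservative bias (high precision, low recall); class-specific prompts raise the peak F1-score on ShanghaiTech from 0.09 to 0.64, but recall remains the bottleneck.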
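The "high precision, low recall" pattern this entry describes follows directly from how the three metrics are defined. The sketch below computes them for a binary normal/anomalous labeling; the counts are made up for illustration and are not taken from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary metrics with 1 = anomalous, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical conservative detector: 20 anomalous clips out of 100,
# but only one of them is flagged and nothing normal is flagged.
y_true = [1] * 20 + [0] * 80
y_pred = [1] + [0] * 99
print(precision_recall_f1(y_true, y_pred))
```

With these illustrative counts precision is 1.0 while recall is 0.05, so F1 is dragged down to roughly 0.095: a model that almost always answers "normal" can look precise and still be of little practical use.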
Details
Motivation: MLLMs show strong video understanding, but their reliability for real-world video anomaly detection (VAD) is largely unexplored; conventional methods rely on reconstruction or pose cues, whereas MLLMs enable language-guided reasoning, which calls for systematic evaluation. Method: On the ShanghaiTech and CHAD benchmarks, VAD is reformulated as binary classification under weak temporal supervision; the study analyzes how prompt specificity and temporal window length (1s-3s) affect the precision-recall trade-off, and introduces class-specific instructions to calibrate the decision boundary. Result: Zero-shot MLLMs show a strong conservative bias: they are highly confident but heavily favor the 'normal' class, giving high precision and very low recall (initial F1 on ShanghaiTech is only 0.09). Class-specific prompts raise peak F1 to 0.64, but recall remains the key bottleneck. Conclusion: Current MLLMs show a significant performance gap for VAD in noisy environments, especially in recall; the study lays groundwork for recall-oriented prompting and model calibration and informs tasks such as open-world surveillance that demand complex video understanding and reasoning. Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s-3s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.[94] FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Xingyu Wang,Tao Wang
Main category: cs.CV
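TL;DR: This paper proposes FOZO, a backpropagation-free, forward-only zeroth-order optimization method for test-time adaptation (TTA) that delivers efficient, stable, and strong model adaptation on resource-constrained devices.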
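Zeroth-order optimization of the kind this entry relies on estimates a gradient from forward evaluations only. The sketch below shows a generic two-point estimator with a random sign direction and a geometrically decaying perturbation scale, applied to a toy quadratic. It is not FOZO itself: the actual objective (feature statistics plus prediction entropy), the prompt parameterization, and the decay schedule are not specified in this entry, so the names, the `decay` factor, and the toy loss are all assumptions.

```python
import random


def zo_gradient(loss, params, eps, rng):
    """Two-point zeroth-order gradient estimate along a random +/-1 direction."""
    direction = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * d for p, d in zip(params, direction)]
    minus = [p - eps * d for p, d in zip(params, direction)]
    # Finite-difference slope along the direction; needs two forward passes only.
    slope = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [slope * d for d in direction]


def zo_optimize(loss, params, steps=200, lr=0.05, eps0=0.1, decay=0.99, seed=0):
    """Gradient-free descent with a perturbation scale that shrinks every step."""
    rng = random.Random(seed)
    params = list(params)
    eps = eps0
    for _ in range(steps):
        grad = zo_gradient(loss, params, eps, rng)
        params = [p - lr * g for p, g in zip(params, grad)]
        eps *= decay  # assumed geometric decay of the perturbation scale
    return params


def toy_loss(p):
    """Stand-in objective with its minimum at (3, -1)."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


result = zo_optimize(toy_loss, [0.0, 0.0])
print(result, toy_loss(result))
```

Since no backward pass is run, memory use stays at the level of plain inference, which is the property that makes this family of methods attractive for low-end and quantized deployments.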
Details
Motivation: Existing TTA methods face a trade-off: backpropagation-based methods have high compute and memory costs and modify weights, making them unsuitable for low-end devices, while backpropagation-free methods adapt weakly. Method: Forward-Only Zeroth-Order Optimization (FOZO) uses memory-efficient zeroth-order prompt optimization that jointly optimizes intermediate feature statistics and prediction entropy; a dynamically decaying perturbation scale stabilizes zeroth-order gradient estimation, and convergence is proved under the TTA data stream assumption. Result: In continual adaptation on ImageNet-C (59.52% Top-1), ImageNet-R, and ImageNet-Sketch, FOZO outperforms mainstream gradient-based methods and the SOTA forward-only method FOA (58.13%), and generalizes well to INT8-quantized models. Conclusion: FOZO is a practical, efficient, and robust TTA approach that is particularly suited to deployment in resource-constrained settings. Abstract: Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.[95] Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Yang Zou,Jun Ma,Zhidong Jiao,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
Main category: cs.CV
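TL;DR: This paper proposes Real-IISR, a framework for real-world infrared image super-resolution that restores details and backgrounds scale by scale through thermal-structural guided visual autoregression. It introduces a Thermal-Structural Guidance module, a Condition-Adaptive Codebook, and a Thermal Order Consistency Loss, and validates them on FLIR-IISR, a real-world dataset built by the authors.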
Details
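Motivation: Existing infrared super-resolution methods are mostly based on simulated data or ignore the intrinsic differences between infrared and visible imaging, whereas real infrared images suffer coupled optical and sensing degradations that reduce structural sharpness and thermal fidelity together; a solution for real-world conditions is needed. Method: Real-IISR is a unified autoregressive framework with three components: (1) a Thermal-Structural Guidance module that encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges; (2) a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors; (3) a Thermal Order Consistency Loss that enforces a monotonic relation between temperature and pixel intensity to preserve physical consistency. Result: On FLIR-IISR, the authors' real-world infrared super-resolution dataset with pairs acquired via automated focus variation and motion-induced blur, Real-IISR shows promising performance and provides a unified foundation and benchmark for real-world IISR. Conclusion: Real-IISR addresses the structural and thermal fidelity challenges caused by coupled degradations in real infrared images; its framework design, loss function, and real-world dataset together move infrared super-resolution from simulation toward practical use. Abstract: Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. 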
Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. The dataset and code are available at: https://github.com/JZD151/Real-IISR.[96] Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary
Alexandru Florea,Shansong Wang,Mingzhe Hu,Qiang Li,Zach Eidex,Luke del Balzo,Mojtaba Safari,Xiaofeng Yang
Main category: cs.CV
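TL;DR: This paper presents the first cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against GPT-4o on clinically relevant tasks. GPT-5 improves substantially in textual reasoning (gains of more than 25 percentage points on MedXpertQA) and in multimodal visual question answering (10-40% gains in mammography), but it achieves only moderate accuracy in neuroradiology (44%) and lags behind domain-specific models in mammography, so it cannot yet replace specialized systems.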
Details
Motivation: To examine whether general-purpose foundation models such as the GPT-5 family can support the integrated reasoning that clinical medicine requires, in particular diagnosis that synthesizes ambiguous patient narratives, laboratory data, and multimodal imaging. Method: A controlled, cross-sectional evaluation of the GPT-5 family against GPT-4o using a standardized zero-shot chain-of-thought protocol, covering medical education examinations, text-based reasoning benchmarks, and multimodal visual question answering in neuroradiology, digital pathology, and mammography. Result: GPT-5 gains more than 25 percentage points in absolute terms on MedXpertQA and outperforms GPT-4o by 10-40% on mammography VQA, but its macro-average accuracy in neuroradiology is only 44%, and its 52-64% accuracy in mammography is well below that of specialized models (>80%). Conclusion: GPT-5 makes substantial progress toward integrated multimodal clinical reasoning and can mirror how clinicians combine uncertain information with objective imaging evidence, but it is not yet a substitute for purpose-built models in highly specialized, perception-critical tasks. Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. 
These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.[97] Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition
Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu
Main category: cs.CV
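TL;DR: This paper proposes a Global Anti-Monotonic Differential Selection Strategy (GAMDSS) that improves spatio-temporal modeling of micro-expressions through keyframe re-selection and markedly reduces subjective errors in human annotation on cross-cultural datasets.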
Details
Motivation: Manual annotation of micro-expressions is prone to subjective error, and keyframe annotation bias is more pronounced in cross-cultural settings; more robust and objective annotation and modeling methods are needed. Method: The GAMDSS architecture comprises a dynamic keyframe re-selection mechanism that identifies Onset and Apex frames, inference of Offset frames from them, construction of a rich spatio-temporal dynamic representation, and a two-branch feature extractor with shared parameters. Result: The method is validated on seven mainstream micro-expression datasets and markedly reduces subjective annotation error on multicultural datasets such as SAMM and 4DME; quantitative analysis confirms that Offset-frame annotations are more uncertain, which provides a theoretical basis for standardizing annotation. Conclusion: GAMDSS can be embedded in existing models without adding parameters and improves micro-expression recognition; the authors call for re-examining the validity and generalizability of current dataset annotation paradigms. Abstract: Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance. 
The source code is available on GitHub: https://github.com/Cross-Innovation-Lab/GAMDSS.[98] DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction
Shiyu Zhang,Zhicong Wu,Huangxuan Zhao,Zhentao Liu,Lei Chen,Yong Luo,Lefei Zhang,Zhiming Cui,Ziwen Ke,Bo Du
Main category: cs.CV
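TL;DR: This paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. With a Multi-Fidelity Texture Learning Module and a Radiative Sub-Pixel Densification strategy, it substantially improves the quality of reconstructed 4D vessel models and overcomes the loss of detail caused by the limited resolution of input projections.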
Details
Motivation: 3D vessel reconstruction methods based on Gaussian splatting and dynamic neural representations are limited by the resolution of the input projections; naive upsampling causes severe blurring and aliasing and cannot recover fine vascular structures, which restricts their use in precision diagnosis and treatment. Method: DSA-SRGS has two parts: (1) a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a DSA-specific super-resolution model and uses a confidence-aware strategy to adaptively weight supervision between the real low-resolution projections and the high-resolution pseudo-labels; (2) Radiative Sub-Pixel Densification, which uses gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels. Result: On two clinical DSA datasets, DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity. Conclusion: DSA-SRGS addresses super-resolution in dynamic sparse-view DSA reconstruction and improves the detail of 4D vessel models, providing a more reliable 3D visualization basis for precise diagnosis and treatment of cerebrovascular diseases. Abstract: Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in Gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative Gaussian kernels. 
Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.
[99] MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement
Linda Wei,Chang Liu,Wenran Zhang,Yuxuan Hu,Ruiyang Li,Feng Qi,Changyao Tian,Ke Wang,Yuanyuan Wang,Shaoting Zhang,Dimitris Metaxas,Hongsheng Li
Main category: cs.CV
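TL;DR: This paper proposes MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR (which deforms a template based on anatomical context) and CrownSegger (a new cervical-margin segmentation network), to improve the geometric accuracy and clinical feasibility of automated dental crown design.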
Details
Motivation: Existing learning-based methods for automatic crown generation suffer from inadequate spatial resolution, noisy outputs, and overextended surface reconstruction, and clinical practice still requires extensive manual adjustment.
Method: The MADCrowner framework consists of: (1) CrownDeformR, which uses a multi-scale intraoral scan encoder to extract anatomical context and drive the deformation of an initial template into the target crown; (2) CrownSegger, which segments the cervical margin of the tooth to serve as a deformation constraint and as the boundary condition for postprocessing; (3) a tailored postprocessing step that removes overextended regions of the reconstructed surface.
Result: Experiments on a self-built large-scale intraoral scan dataset show that the method significantly outperforms existing approaches in geometric accuracy and clinical feasibility.
Conclusion: By introducing a margin-aware mechanism and a design that follows the clinical workflow, MADCrowner improves the quality and practicality of automated crown design and offers a path toward fully automated CAD systems.
Abstract: Dental crown restoration is one of the most common treatment modalities for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches are challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the clinical manual workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced CrownSegger, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint. The margin was also used as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.
[100] Privacy-Aware Camera 2.0 Technical Report
Huan Song,Shuyu Tian,Ting Long,Jiang Liu,Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li
Main category: cs.CV
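TL;DR: This paper proposes a new privacy-preserving perception framework based on the AI Flow paradigm and an edge-cloud collaborative architecture. Nonlinear mapping and random noise injection at the edge convert raw images into irreversible abstract feature vectors, so identity information cannot be recovered while the cloud can still perform behavior recognition and semantic reconstruction.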
Details
Motivation: Existing privacy-preserving approaches (physical desensitization, encryption, obfuscation) often damage semantic understanding or lack mathematically provable irreversibility. Privacy Camera 1.0 eliminated visual data but output only textual judgments, leaving no evidence in disputes.
Method: Following the Information Bottleneck principle, a visual desensitizer deployed at the edge applies real-time nonlinear mapping and random noise injection to raw images to produce abstract feature vectors. The cloud then uses a "dynamic contour" visual language to perform behavior recognition and semantic reconstruction on the abstract representations.
Result: Raw images become mathematically unreconstructable, while high-fidelity behavior understanding and a visual reference remain available, resolving the privacy-security paradox.
Conclusion: The framework achieves a substantive balance between perception and privacy protection without exposing raw images, offering a new paradigm for intelligent sensing in highly sensitive settings.
Abstract: With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.
[101] RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery
Huiran Sun
Main category: cs.CV
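TL;DR: This paper proposes RMK RetinaNet, which uses a multi-scale kernel block, a multi-directional contextual anchor attention mechanism, a bottom-up path, and an Euler angle encoding module to address three bottlenecks in rotated object detection for remote sensing imagery: non-adaptive receptive fields, inadequate long-range multi-scale feature fusion, and discontinuous angle regression.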
Details
Motivation: Rotated object detection in remote sensing imagery faces three bottlenecks: non-adaptive use of receptive fields, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression.
Method: RMK RetinaNet has four core components: (1) a Multi-Scale Kernel (MSK) block that strengthens adaptive multi-scale feature extraction; (2) a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism that strengthens contextual modeling across scales and orientations; (3) a bottom-up path that preserves fine-grained spatial detail; (4) an Euler Angle Encoding Module (EAEM) that enables continuous, stable angle regression.
Result: Experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet performs comparably to state-of-the-art methods and is more robust in multi-scale, multi-orientation scenarios.
Conclusion: RMK RetinaNet mitigates the key challenges of rotated object detection in remote sensing and improves generalization and stability under complex scale and orientation changes.
Abstract: Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
[102] LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation
Anugunj Naman,Ayushman Singh,Gaibo Zhang,Yaguang Zhang
Main category: cs.CV
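TL;DR: This paper proposes a pair of adapters for spatial imbalance in medical image analysis: LAW, an adaptive loss weighting for diffusion model training, and ORDER, for efficient segmentation. Together they markedly improve generation quality and segmentation accuracy.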
Details
Motivation: Lesions occupy only a small part of a medical image, so diffusion models tend to drift from the true layout and segmentation models perform poorly in spatially uncertain regions. Existing methods struggle to handle this spatial imbalance for both generation and segmentation.
Method: Two network adapters are proposed: (1) the Learnable Adaptive Weighter (LAW), which predicts per-pixel loss modulation from features and masks and stabilizes training through normalization, clamping, and regularization; (2) Optimal Region Detection with Efficient Resolution (ORDER), which adds selective bidirectional skip attention at late decoder stages.
Result: LAW lowers FID by 20% on the generation task (52.28 vs. 65.60), and its synthetic data raises downstream segmentation Dice by 4.9 percentage points (83.2% vs. 78.3%). ORDER raises Dice on MK-UNet by 6.0 percentage points (81.3% vs. 75.3%) with only 0.56 GFLOPs and 42K parameters, 1/730 the size of nnUNet.
Conclusion: LAW and ORDER mitigate spatial imbalance on the generation and segmentation sides respectively, yielding a lightweight, efficient, and high-performing closed loop for medical image analysis.
Abstract: Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
[103] Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper
Kiranmayee Janardhan,Vinay Martin DSa Prabhu,T. Christy Bobby
Main category: cs.CV
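TL;DR: This paper reviews segmentation and classification techniques for brain gliomas and emphasizes that convolutional neural networks outperform traditional methods in post-acquisition processing of magnetic resonance images.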
Details
Motivation: Accurate segmentation and classification of brain gliomas are essential for treatment planning, prognosis prediction, and disease monitoring, but irregular tissue leads to large segmentation errors and poor reproducibility.
Method: The paper reviews existing fully automatic and semi-automatic segmentation and classification methods, with a focus on evaluating convolutional neural network (CNN) architectures.
Result: CNN architectures significantly outperform traditional methods in glioma segmentation and classification; radiologists tend to prefer semi-automatic techniques that are easy to use and need less supervision.
Conclusion: CNNs are currently the most effective tool for brain glioma image analysis. Future work should balance algorithmic performance with clinical practicality and advance the deployment of semi-automatic CNN solutions.
Abstract: Segmentation is crucial for brain gliomas as it delineates the glioma's extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma's size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error-free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques applied after magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.
[104] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Lulu Hu,Wenhu Xiao,Xin Chen,Xinhua Xu,Bowen Xu,Kun Li,Yongliang Tao
Main category: cs.CV
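TL;DR: This paper proposes the MASQuant framework, which uses Modality-Aware Smoothing (MAS) and Cross-Modal Compensation (CMC) to address smoothing misalignment and cross-modal computational invariance in post-training quantization of multimodal large language models (MLLMs), achieving stable and efficient quantization.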
Details
Motivation: Existing post-training quantization methods designed for large language models (such as SmoothQuant) face two challenges on multimodal large language models (MLLMs), smoothing misalignment and cross-modal computational invariance, so a quantization scheme adapted to multimodal characteristics is needed.
Method: Modality-Aware Smoothing Quantization (MASQuant) comprises: (1) Modality-Aware Smoothing (MAS), which learns separate smoothing factors for each modality; (2) Cross-Modal Compensation (CMC), which uses SVD whitening to turn multimodal activation differences into low-rank forms, enabling unified quantization across modalities.
Result: MASQuant shows stable quantization performance on both dual-modal and tri-modal MLLMs, and experiments indicate it is competitive among mainstream PTQ algorithms.
Conclusion: MASQuant mitigates modality heterogeneity in MLLM quantization and offers a new approach and a practical tool for efficient deployment of multimodal models.
Abstract: Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
[105] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Ruochen Cui,Xilin Zhao,Qingming Huang
Main category: cs.CV
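TL;DR: This paper proposes Diffusion Contrastive Reconstruction (DCR), which injects contrastive signals derived from reconstructed images into the diffusion reconstruction process to jointly optimize the discriminative ability and detail perception of the CLIP visual encoder, overcoming its representation bottleneck.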
Details
Motivation: The CLIP visual encoder is limited in both discriminative ability and detail perception. Existing diffusion-based enhancement methods may harm discriminative ability and do not fundamentally resolve the representation bottleneck.
Method: The DCR framework moves the contrastive learning signal from the original input image to each reconstructed image, unifying the diffusion reconstruction and contrastive learning objectives. A theoretical analysis shows that the loss jointly optimizes D-Ability and P-Ability.
Result: DCR is validated on multiple benchmark datasets and multi-modal large language models, where it significantly improves downstream task performance.
Conclusion: By injecting contrastive signals driven by reconstructed images, DCR balances and strengthens the discriminative and fine-grained perceptual abilities of the CLIP visual encoder, offering a new paradigm for multimodal representation learning.
Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.
[106] Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation
SangHyuk Kim,Daniel Haehn,Sumientra Rampersad
Main category: cs.CV
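TL;DR: This paper proposes the Meta-D architecture, which explicitly uses categorical MRI scan metadata (such as sequence type and scan plane) to guide feature extraction in brain tumor analysis, improving the performance and robustness of medical image deep learning models.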
Details
Motivation: Improve the robustness and performance of medical image deep learning models when data are unstable or modalities are missing, using metadata as a stable anchor for feature extraction.
Method: The Meta-D architecture dynamically modulates convolutional features in 2D tumor detection. For 3D missing-modality segmentation it introduces a Transformer Maximizer that uses metadata-based cross-attention to selectively route the available modalities.
Result: F1-score in 2D detection improves by up to 2.62%; Dice score in 3D missing-modality segmentation improves by up to 5.12%, with 24.1% fewer model parameters.
Conclusion: Explicitly integrating categorical scan metadata markedly improves model stability and generalization, especially in challenging settings such as missing modalities.
Abstract: We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
[107] Revisiting Shape from Polarization in the Era of Vision Foundation Models
Chenhao Li,Taishi Ono,Takeshi Uemori,Yusuke Moriuchi
Main category: cs.CV
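TL;DR: This paper proposes a new method that uses polarization cues to improve single-shot object-level surface normal estimation. With a high-quality polarization dataset and sensor-aware data augmentation, a lightweight model trained on only 40K scenes outperforms existing RGB-only vision foundation models and conventional polarization methods.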
Details
Motivation: Polarization cues are strongly tied to geometry through physics, yet earlier SfP methods underperform large-scale RGB-only VFMs because of domain gaps such as unrealistic synthetic data and insufficient sensor-noise modeling, which has raised doubts about whether polarization is needed.
Method: Build a high-quality synthetic polarization dataset from 1,954 real 3D-scanned objects; incorporate DINOv3 priors to improve generalization; design polarization sensor-aware data augmentation that simulates real noise more faithfully.
Result: With only 40K training scenes, the method significantly outperforms state-of-the-art SfP methods and RGB-only VFMs. Polarization cues allow a 33x reduction in training data or an 8x reduction in model parameters while still performing better.
Conclusion: The polarization modality itself is not the performance bottleneck; the key is closing the domain gap. Used well, polarization cues let small-data, lightweight models surpass large models, confirming their distinctive value and practicality.
Abstract: We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs.
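For readers unfamiliar with the polarization cues discussed here, the sketch below shows the standard conversion from four polarizer-angle intensities (0°, 45°, 90°, 135°) to the degree and angle of linear polarization via the linear Stokes parameters. The noise function is a purely hypothetical stand-in for sensor-aware augmentation (independent Gaussian noise per channel); the abstract does not specify the authors' noise model, and all function names are illustrative.

```python
import math
import random

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def dolp_aolp(i0, i45, i90, i135):
    """Degree (0..1) and angle (radians) of linear polarization."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp

def add_sensor_noise(intensities, sigma, rng):
    """Hypothetical augmentation: Gaussian noise on each raw channel, clipped at zero."""
    return [max(0.0, v + rng.gauss(0.0, sigma)) for v in intensities]

def ideal_intensity(theta_deg, s0=1.0, rho=0.5, phi_deg=30.0):
    """Intensity behind an ideal polarizer at theta for partially polarized
    light with degree rho and polarization angle phi."""
    return 0.5 * s0 * (1.0 + rho * math.cos(2.0 * math.radians(theta_deg - phi_deg)))

clean = [ideal_intensity(t) for t in (0, 45, 90, 135)]
dolp, aolp = dolp_aolp(*clean)
noisy = add_sensor_noise(clean, sigma=0.01, rng=random.Random(0))
noisy_dolp, noisy_aolp = dolp_aolp(*noisy)
```

Applying the noise to the raw channels rather than to the derived DoLP and AoLP maps is a design choice in this sketch: the derived quantities are nonlinear in the intensities, so noise added upstream propagates the way it would from a real sensor.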
Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
[108] Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Rui Zhao,Bin Shi,Kai Sun,Bo Dong
Main category: cs.CV
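TL;DR: This paper proposes CAD, a class-specific augmentation and disentanglement framework for instance-dependent partial label learning (ID-PLL). It aligns class-specific augmented features within each class and applies a weighted penalty between classes to reduce the class confusion caused by instance entanglement and improve classification performance.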
Details
Motivation: Real-world partial label learning is often instance-dependent (ID-PLL) and suffers from instance entanglement: samples from similar classes overlap in both features and candidate labels, which worsens class confusion.
Method: The Class-specific Augmentation based Disentanglement (CAD) framework applies: (1) intra-class regulation, which generates class-specific augmented samples and aligns them; (2) inter-class regulation, which uses a weighted penalty loss that penalizes more ambiguous candidate labels more heavily to enlarge inter-class distances.
Result: Extensive experiments confirm that CAD mitigates instance entanglement and significantly improves classification accuracy on ID-PLL tasks.
Conclusion: By combining intra-class and inter-class regulation, CAD sharpens class boundaries and provides an effective disentangled modeling approach for ID-PLL.
Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.
[109] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler
Main category: cs.CV
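TL;DR: This paper proposes a Semantic-Augmented Dynamic Contrastive Attack (SADCA), which uses progressive, semantically guided perturbations to improve the cross-model and cross-task transferability of adversarial examples against vision-language pre-training models.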
Details
Motivation: Existing adversarial attacks on vision-language models rely on static cross-modal interactions and only disrupt positive image-text pairs, so cross-modal disruption is limited and transferability is poor.
Method: SADCA (1) builds a dynamic contrastive learning mechanism over adversarial, positive, and negative samples to progressively break image-text alignment, and (2) adds a semantic augmentation module that uses input transformations to increase the diversity and generalization of adversarial examples.
Result: Experiments on multiple datasets and models show that SADCA significantly improves adversarial transferability and consistently outperforms state-of-the-art methods.
Conclusion: Through dynamic cross-modal interaction and semantic augmentation, SADCA improves the transferability of adversarial attacks on vision-language pre-training models and offers a new direction for VLP robustness research.
Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be designed to be transferable, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It does so by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
[110] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
Yuanbo Li,Tianyang Xu,Cong Hu,Tao Zhou,Xiao-Jun Wu,Josef Kittler
Main category: cs.CV
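TL;DR: This paper proposes a Multi-Paradigm Collaborative Attack framework (MPCAttack) that fuses multi-paradigm semantic representations from vision and language and uses contrastive matching to adaptively balance weights and optimize a global adversarial perturbation, markedly improving transferable adversarial attacks on multi-modal large language models (MLLMs).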
Details
Motivation: Existing adversarial attacks on multi-modal large language models (MLLMs) rely on surrogate models from a single learning paradigm and optimize independently in their own feature spaces. This yields impoverished representations, a restricted search space, and insufficient perturbation diversity, which weakens transferability.
Method: The Multi-Paradigm Collaborative Attack (MPCAttack) framework is built around a Multi-Paradigm Collaborative Optimisation (MPCO) strategy: it aggregates multi-paradigm semantic representations of images and text, adaptively balances the importance of the paradigms through contrastive matching, and jointly optimizes a global adversarial perturbation to reduce representation bias.
Result: Experiments on multiple benchmarks show that MPCAttack significantly outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs; the code is open source.
Conclusion: Through cross-modal, multi-paradigm collaborative optimization, MPCAttack improves the transferability of adversarial examples against MLLMs, exposes and mitigates bias in multimodal representations, and provides a new direction and practical tool for MLLM security research.
Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
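One way to picture "adaptively balancing the importance of different paradigm representations" is a softmax weighting over per-encoder contrastive scores. The sketch below is an assumption-heavy illustration, not the MPCO algorithm: the score definition, the rule of up-weighting paradigms where the attack currently lags, the temperature `tau`, and every function name are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_score(adv, target, source):
    """Higher when the adversarial feature is near the target and far from the source."""
    return cosine(adv, target) - cosine(adv, source)

def paradigm_weights(scores, tau=0.5):
    """Softmax over negated scores: surrogates where the attack is weakest
    receive more weight (an assumed balancing rule)."""
    logits = [-s / tau for s in scores]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def aggregated_objective(features, tau=0.5):
    """Weighted sum of per-paradigm scores; a perturbation update would ascend this."""
    scores = [contrastive_score(f["adv"], f["target"], f["source"]) for f in features]
    weights = paradigm_weights(scores, tau)
    return sum(w * s for w, s in zip(weights, scores)), weights, scores

# Two hypothetical surrogate encoders from different learning paradigms.
features = [
    {"adv": [1.0, 0.0], "target": [1.0, 0.0], "source": [0.0, 1.0]},  # attack already succeeds
    {"adv": [0.0, 1.0], "target": [1.0, 0.0], "source": [0.0, 1.0]},  # attack still fails
]
objective, weights, scores = aggregated_objective(features)
```

In this toy setup nearly all of the weight goes to the second surrogate, so the shared perturbation would be pushed toward the feature space where it has not yet transferred.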
The code is released at https://github.com/LiYuanBoJNU/MPCAttack.
[111] GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction
Tianyu Xiong,Rui Li,Linjie Li,Jiaqi Yang
Main category: cs.CV
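TL;DR: GloSplat proposes a joint pose-appearance optimization framework that explicitly keeps SfM feature tracks as optimizable parameters, so 3D Gaussian Splatting training uses both photometric and geometric constraints, preventing pose drift and improving reconstruction accuracy.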
Details
Motivation: Traditional pipelines treat feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) as separate problems. Existing joint optimization methods refine poses using photometric gradients alone, which tends to cause early-stage pose drift and lacks geometric stability.
Method: GloSplat introduces explicit, optimizable SfM feature-track points into 3D Gaussian Splatting training and jointly optimizes a reprojection loss (geometric supervision) with a photometric loss (appearance supervision). Two variants are proposed: GloSplat-F (COLMAP-free, with retrieval-based image pair selection) and GloSplat-A (exhaustive matching, highest quality).
Result: GloSplat-F reaches the state of the art among COLMAP-free methods; GloSplat-A surpasses all COLMAP-based baselines.
Conclusion: Explicitly fusing SfM geometric priors into 3D Gaussian Splatting training suppresses pose drift and improves reconstruction robustness and accuracy, showing that joint photometric-geometric optimization outperforms photometric-only optimization.
Abstract: Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art results among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
[112] Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video
Jerrin Bright,Justin Mende,John Zelek
Main category: cs.CV
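TL;DR: This paper proposes a biomechanics analysis method based on monocular broadcast video that accurately extracts 18 clinically relevant metrics from ordinary footage for predicting injury risk in baseball pitchers, with performance comparable to expensive multi-camera motion capture systems.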
Details
Motivation: Professional multi-camera motion capture systems are expensive and hard to deploy, so they are rarely available outside professional venues. A low-cost, scalable alternative is needed to obtain precise biomechanical signals for injury prediction.
Method: A monocular video pipeline is built on DreamPose3D. It adds a drift-controlled global lifting module (recovering the pelvis trajectory through velocity-based parameterization and sliding-window inference) and a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to cope with motion blur, compression artifacts, and extreme pitching poses.
Result: On 156 pitches from 13 professional pitchers, 16 of 18 metrics have a mean absolute error below 1°. For injury prediction on 7,348 pitchers, the AUC is 0.811 for Tommy John surgery and 0.825 for significant arm injuries.
Conclusion: The method shows that monocular broadcast video is a viable alternative to stadium-scale motion capture and advances the practical use of pose-derived biomechanical metrics for large-scale injury-risk screening.
Abstract: Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE < 1°). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
[113] SURE: Semi-dense Uncertainty-REfined Feature Matching
Sicheng Li,Zaiwang Gu,Jie Zhang,Qing Guo,Xudong Jiang,Jun Cheng
Main category: cs.CV
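TL;DR: This paper proposes the SURE framework, which jointly models aleatoric and epistemic uncertainty to perform semi-dense image matching with confidence estimation, markedly improving matching robustness and accuracy under large viewpoint changes and in textureless regions.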
Details
Motivation: Existing methods rely only on feature similarity and cannot explicitly model match reliability, so they produce high-confidence wrong matches under large viewpoint changes or in textureless regions.
Method: SURE, a semi-dense uncertainty-refined matching framework, includes a new evidential head for trustworthy coordinate regression and a lightweight spatial fusion module. It jointly predicts correspondences and their confidence, modeling aleatoric and epistemic uncertainty separately.
Result: On multiple standard benchmarks, SURE consistently surpasses state-of-the-art semi-dense matching methods in both matching accuracy and runtime efficiency.
Conclusion: By introducing uncertainty modeling, SURE mitigates the overconfidence of conventional matching methods and improves robustness and reliability in complex scenes.
Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at https://github.com/LSC-ALAN/SURE.
[114] Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Jaekyun Ko,Dongjin Kim,Soomin Lee,Guanghui Wang,Tae Hyun Kim
Main category: cs.CV
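TL;DR: This paper proposes Prompt-Driven Noise Generation (PNG), a new noise generation framework that synthesizes high-quality sRGB images matching real-world noise distributions without relying on camera metadata, improving the generalization and practicality of denoising models in real scenes.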
Details
Motivation: Existing metadata-based generative noise synthesis methods are limited by missing metadata or inconsistencies between devices, and real paired noisy-clean image data are scarce, which restricts end-to-end denoising in real-world settings.
Method: The Prompt-Driven Noise Generation (PNG) framework extracts high-dimensional prompt features from the input noise to model the real noise distribution, enabling noise synthesis without explicit camera metadata.
Result: Experiments show that PNG generates realistic noisy images and that these images improve real-world image denoising on multiple benchmark datasets.
Conclusion: PNG removes the dependency on camera metadata, markedly improving the generality and practicality of noise synthesis models and providing a more robust data generation approach for real-world image denoising.
Abstract: Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.
[115] Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics
Jerrin Bright,Michelle Lu,John Zelek
Main category: cs.CV
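TL;DR: By analyzing pitchers' 3D pose sequences and using only body kinematic features, without ball-flight data, this paper classifies eight pitch types with 80.4% accuracy and shows that upper-body mechanics (especially wrist position and trunk lateral tilt) are the main source of predictive signal.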
Details
Motivation: To determine how much a pitcher's body motion reveals about the upcoming pitch type, and thereby to understand the relationship between pitch selection and biomechanics.
Method: An end-to-end pipeline: diffusion-based 3D pose estimation, automatic pitching-event detection, ground-truth-validated biomechanical feature extraction, and a gradient-boosted classifier over 229 kinematic features.
Result: 80.4% classification accuracy on a dataset of 119,561 professional pitches; the upper body contributes 64.9% of the predictive signal, with wrist position (14.8%) and trunk lateral tilt the most informative joint group and individual feature, respectively; four-seam and two-seam fastballs cannot be separated from pose alone.
Conclusion: Body pose alone enables high-accuracy pitch-type recognition, but there is an empirical ceiling near 80%; further discrimination requires ball-flight data, which marks the boundary of kinematic information.
Abstract: How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, ground-truth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs. two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.
[116] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation
Hong Liu,Dong Wei,Qiong Peng,Yawen Huang,Xian Wu,Yefeng Zheng,Liansheng Wang
Main category: cs.CV
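TL;DR: This paper proposes a two-stage framework for CT report generation that uses structure-wise image-text contrastive learning and soft pseudo targets to mitigate false negatives, significantly improving clinical efficiency.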
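To make the soft pseudo-target idea in this entry concrete, here is a minimal sketch for a single image and three candidate reports. It assumes, as an illustration only, that the soft target is a blend of the one-hot pairing label with a softmax over text-text similarities; the function name `soft_target_contrastive_loss`, the blend weight `alpha`, the temperature, and the toy similarity values are all hypothetical and are not taken from the paper.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_contrastive_loss(image_text_sim, text_text_sim, paired_index,
                                 alpha=0.5, temperature=0.1):
    """Cross-entropy between soft targets and the image-to-text distribution.

    image_text_sim -- similarity of one image to every report in the batch
    text_text_sim  -- similarity of the paired report to every report
    alpha          -- assumed blend weight; alpha=0 recovers hard one-hot targets
    """
    one_hot = [1.0 if j == paired_index else 0.0 for j in range(len(image_text_sim))]
    pseudo = softmax(text_text_sim, temperature)
    targets = [(1 - alpha) * h + alpha * p for h, p in zip(one_hot, pseudo)]
    probs = softmax(image_text_sim, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

# Report 1 is a non-paired report that is almost identical in meaning to report 0.
text_text_sim = [1.0, 0.98, 0.05]
sim_a = [0.9, 0.85, 0.1]  # model A also scores the look-alike report highly
sim_b = [0.9, 0.10, 0.1]  # model B pushes the look-alike report away

soft_a = soft_target_contrastive_loss(sim_a, text_text_sim, 0, alpha=0.5)
soft_b = soft_target_contrastive_loss(sim_b, text_text_sim, 0, alpha=0.5)
hard_a = soft_target_contrastive_loss(sim_a, text_text_sim, 0, alpha=0.0)
hard_b = soft_target_contrastive_loss(sim_b, text_text_sim, 0, alpha=0.0)
```

With hard targets, model B (which treats the look-alike report as a pure negative) gets the lower loss; with soft targets, model A does. That is the sense in which soft targets reduce the penalty on false negatives.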
Details
Motivation: Deep learning methods for X-ray report generation are less effective for CT report generation, because CT images involve larger data volumes and more intricate details and therefore require finer structure-level semantic alignment.
Method: A two-stage framework. In the first stage, learnable structure-specific visual queries drive structure-wise image-text contrastive learning, with soft pseudo targets based on text-text similarity and a dynamic negative queue. In the second stage, the visual queries are frozen and used to select critical image patch embeddings, and a text decoder is added to generate the report.
Result: New state-of-the-art performance for CT report generation on two public datasets, with each component shown to be effective.
Conclusion: The framework effectively models structure-level semantic correspondences between CT images and reports, improving the accuracy and efficiency of clinical report generation.
Abstract: Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption.
Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and that its components are effective.
[117] DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization
Xiaodong Zhu,Suting Wang,Yuanming Zheng,Junqi Yang,Yangxu Liao,Yuhong Yang,Weiping Tu,Zhongyuan Wang
Main category: cs.CV
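TL;DR: This paper proposes DeformTrace, which enhances state space models (SSMs) with deformable dynamics and relay mechanisms to address ambiguous boundaries, sparse forgeries, and limited long-range modeling in video and audio temporal forgery localization (TFL), yielding more precise, efficient, and robust forgery detection.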
Details
Motivation: Existing state space models (SSMs) are limited in temporal forgery localization (TFL) by ambiguous forgery boundaries, sparse forged segments, and insufficient long-range dependency modeling.
Method: The DeformTrace framework: (1) Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields to improve temporal localization precision; (2) a Relay Token Mechanism mitigates long-range decay; (3) Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, suppressing interference from non-forged content and increasing sensitivity to sparse forgeries; (4) a hybrid architecture combines the global modeling of Transformers with the computational efficiency of SSMs.
Result: State-of-the-art performance on multiple benchmarks, with fewer parameters, faster inference, and stronger robustness.
Conclusion: DeformTrace overcomes key limitations of SSMs in TFL and demonstrates the value of deformable dynamics and relay mechanisms for temporal forgery localization, offering a new paradigm for efficient and interpretable deep forensics.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. In addition, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
[118] Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation
Hong Liu,Dong Wei,Qian Dai,Xian Wu,Yefeng Zheng,Liansheng Wang
Main category: cs.CV
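TL;DR: This paper proposes FedMEPD, a framework that addresses the concurrent challenges of intermodal heterogeneity and personalization in federated learning for medical imaging, using modality-specific encoders and a partially personalized multimodal fusion decoder to combine effective global modeling with local adaptation.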
Details
Motivation: Existing federated learning methods consider only intramodal heterogeneity and cannot handle the practical case where participants hold only a subset of imaging modalities (intermodal heterogeneity); each participant also needs a personalized model.
Method: FedMEPD assigns an exclusive encoder to each modality to handle intermodal heterogeneity; the decoder is partially personalized, using the discrepancy between global and local parameter updates to dynamically decide which filters to personalize; on the server, a full-modal fusion decoder optimizes the encoders, and multimodal anchors are distributed to clients; clients align their missing-modal representations to the global anchors via scaled dot-product cross-attention.
Result: On the BraTS 2018 and 2020 multimodal brain tumor segmentation datasets, FedMEPD significantly outperforms current multimodal and personalized federated learning methods, and each of its new designs is shown to be effective.
Conclusion: FedMEPD addresses intermodal heterogeneity and personalization together, providing a practical and robust paradigm for federated learning on multimodal medical images.
Abstract: Most existing federated learning (FL) methods for medical image analysis consider only intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs, using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities.
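The anchor calibration step can be illustrated with plain scaled dot-product cross-attention. In this sketch the client's missing-modal feature vector is the query and the server-provided anchors serve as both keys and values. The function name `calibrate_to_anchors`, the omission of learned projection matrices, and the toy vectors are assumptions made for illustration, not details from FedMEPD.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate_to_anchors(query, anchors):
    """Scaled dot-product cross-attention with anchors as keys and values.

    query   -- feature vector from a client with missing modalities (hypothetical)
    anchors -- full-modal anchor vectors received from the server (hypothetical)
    Returns the attention weights and the anchor-weighted output vector.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, anchor)) / math.sqrt(d)
              for anchor in anchors]
    weights = softmax(scores)
    output = [sum(w * anchor[i] for w, anchor in zip(weights, anchors))
              for i in range(d)]
    return weights, output

anchors = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
weights, calibrated = calibrate_to_anchors([2.0, 0.1, 0.0, 0.0], anchors)
```

The query is pulled mostly toward the anchor it already resembles, which is the intuition behind using anchors to compensate for absent modalities.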
FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and that its novel designs are effective.
[119] Locality-Attending Vision Transformer
Sina Hajimiri,Farzad Beizaee,Fereshteh Shakeri,Christian Desrosiers,Ismail Ben Ayed,Jose Dolz
Main category: cs.CV
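TL;DR: This paper proposes a simple and effective add-on module that modulates self-attention with a learnable Gaussian kernel and refines patch representations, improving the segmentation performance of vision transformers while preserving their image-level classification ability.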
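As a rough illustration of Gaussian-modulated attention, the sketch below adds a log-Gaussian distance penalty to the attention logits of one query patch on a small grid before the softmax. The additive form, the function name `gaussian_biased_attention`, and the fixed `sigma` (which would be learnable in the actual method) are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def gaussian_biased_attention(logits, positions, query_index, sigma):
    """Softmax over content logits plus a log-Gaussian locality bias.

    logits      -- content-based attention logits from the query to each patch
    positions   -- (row, col) grid position of each patch
    query_index -- index of the query patch
    sigma       -- kernel width; small values favor neighbors (assumed scalar)
    """
    q_row, q_col = positions[query_index]
    biased = []
    for logit, (row, col) in zip(logits, positions):
        dist_sq = (row - q_row) ** 2 + (col - q_col) ** 2
        biased.append(logit - dist_sq / (2.0 * sigma ** 2))
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# 3x3 patch grid; the query is the center patch (index 4).
positions = [(r, c) for r in range(3) for c in range(3)]
uniform_logits = [0.0] * 9
local = gaussian_biased_attention(uniform_logits, positions, 4, sigma=1.0)
nearly_global = gaussian_biased_attention(uniform_logits, positions, 4, sigma=100.0)
```

A small `sigma` concentrates attention on nearby patches, while a large `sigma` leaves the original (here uniform) attention almost unchanged, so global information remains reachable.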
Details
Motivation: Vision transformers excel at classification, but their global self-attention can weaken the fine-grained spatial details that are crucial for tasks such as segmentation.
Method: A learnable Gaussian kernel modulates self-attention to bias it toward neighboring patches; patch representations are further refined to strengthen the semantic content at each spatial position.
Result: Substantial gains on three segmentation benchmarks including ADE20K (e.g., over 6% and 4% for ViT-Tiny and ViT-Base, respectively), without changing the training regime or harming classification performance.
Conclusion: The method strengthens the local awareness of vision transformers and improves segmentation without sacrificing global modeling ability.
Abstract: Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
[120] FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation
Ganggui Ding,Hao Chen,Xiaogang Xu
Main category: cs.CV
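TL;DR: This paper proposes FC-VFI, which uses temporal modeling on latent sequences, semantic matching line guidance, and a temporal difference loss to achieve faithful and consistent video frame interpolation (4× and 8×), substantially raising the frame rate while preserving detail and motion consistency.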
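One plausible reading of a temporal difference loss is sketched below: instead of comparing frames directly, it compares the change between consecutive frames in the prediction with the change in the reference. The L1 form, the function name `temporal_difference_loss`, and the use of flat pixel lists are assumptions for illustration; the source does not give the exact definition used by FC-VFI, which may operate in latent space.

```python
def temporal_difference_loss(pred_frames, target_frames):
    """Mean absolute error between consecutive-frame differences.

    Each frame is a flat list of pixel (or latent) values; both sequences
    must have the same length and frame size.
    """
    total = 0.0
    count = 0
    for t in range(1, len(pred_frames)):
        frames = zip(pred_frames[t], pred_frames[t - 1],
                     target_frames[t], target_frames[t - 1])
        for p_now, p_prev, g_now, g_prev in frames:
            total += abs((p_now - p_prev) - (g_now - g_prev))
            count += 1
    return total / count

target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]     # steady motion
offset = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]     # biased but temporally smooth
flicker = [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]    # jumps, then stalls

loss_offset = temporal_difference_loss(offset, target)
loss_flicker = temporal_difference_loss(flicker, target)
```

The constant-offset sequence incurs no penalty because its frame-to-frame changes match the reference, whereas the flickering sequence is penalized even though two of its frames are exact. A loss of this kind therefore complements, rather than replaces, a per-frame fidelity term.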
Details
Motivation: Existing video diffusion models struggle to achieve both high fidelity and temporal consistency in frame interpolation: reliance on generative priors loses detail from the start and end frames, and motion control based on optical flow or sparse points is either error-prone or lacks structural context.
Method: The FC-VFI framework: (1) temporal modeling on latent sequences to inherit fidelity cues from the start and end frames; (2) semantic matching lines for structure-aware motion guidance; (3) a temporal difference loss to reduce temporal inconsistency.
Result: 4× and 8× interpolation at 2560×1440 resolution, raising 30 FPS to 120 and 240 FPS; experiments show high performance in fidelity, structural integrity, and motion consistency.
Conclusion: FC-VFI addresses the difficulty of reconciling fidelity with temporal consistency in large-model frame interpolation, offering a new approach to high-resolution video enhancement.
Abstract: Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high-fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting \(4\times\) and \(8\times\) interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at \(2560\times 1440\) resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
[121] AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Li'an Zhong,Ziqiang He,Jibin Zheng,Jin Li,Z. Jane Wang,Xiangui Kang
Main category: cs.CV
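TL;DR: This paper proposes AdaIAT, which adaptively increases the attention that LVLMs pay to generated text, alleviating hallucination while avoiding repetitive descriptions; its effectiveness is validated on several LVLMs.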
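The basic intervention behind increasing attention to generated text can be sketched as scaling up the attention weights on generated-text positions and renormalizing. This is a simplified, non-adaptive illustration: the function name `amplify_generated_text_attention`, the multiplicative factor `alpha`, and the post-softmax renormalization are assumptions, and the layer-wise threshold and per-head magnitudes of AdaIAT are not modeled here.

```python
def amplify_generated_text_attention(weights, generated_positions, alpha):
    """Boost attention on generated-text positions, then renormalize to sum to 1.

    weights             -- one attention row (non-negative, sums to 1)
    generated_positions -- set of indices that hold previously generated tokens
    alpha               -- assumed amplification strength (0 leaves weights unchanged)
    """
    boosted = [w * (1.0 + alpha) if i in generated_positions else w
               for i, w in enumerate(weights)]
    total = sum(boosted)
    return [b / total for b in boosted]

# Positions 0-1: image tokens, position 2: generated text (toy layout).
row = [0.5, 0.3, 0.2]
amplified = amplify_generated_text_attention(row, {2}, alpha=1.0)
unchanged = amplify_generated_text_attention(row, {2}, alpha=0.0)
```

Because the row is renormalized, raising attention on generated text necessarily lowers it elsewhere, which is one reason the amplification magnitude needs careful control.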
Details
Motivation: Large Vision-Language Models (LVLMs) suffer from severe hallucination; directly increasing attention to image tokens reduces hallucination but tends to induce repetitive descriptions, so the authors seek a finer-grained, adaptive attention intervention.
Method: An analysis of attention patterns shows that real object tokens attend more to generated text than hallucinated ones do. Based on this, the authors propose Increasing Attention to Generated Text (IAT) and then an adaptive variant, AdaIAT, which uses a layer-wise threshold to control when to intervene and a fine-grained amplification magnitude for each attention head.
Result: On LLaVA-1.5, AdaIAT reduces the hallucination rates C_S and C_I by 35.8% and 37.1%, respectively, while preserving linguistic performance and prediction capability.
Conclusion: AdaIAT is an effective and general training-free strategy that offers a new approach to mitigating hallucination in LVLMs.
Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
[122] Person Detection and Tracking from an Overhead Crane LiDAR
Nilusha Jayawickrama,Henrik Toikka,Risto Ojala
Main category: cs.CV
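TL;DR: This paper studies person detection and tracking in an industrial indoor workplace using a LiDAR mounted on an overhead crane. To address the strong domain shift of the overhead viewpoint and the lack of suitable public data, the authors build a site-specific overhead LiDAR dataset, adapt several 3D detectors under a unified protocol, and add lightweight tracking to maintain identities. Under distance-sliced evaluation, the best configurations reach an AP of 0.84 within 5 m and 0.97 within 1 m, with VoxelNeXt and SECOND the most reliable. The work narrows the domain gap between driving datasets and overhead sensing, and the data and code are released.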
Details
Motivation: The overhead LiDAR viewpoint differs strongly from common vehicle-centric LiDAR benchmarks, and suitable public training data are lacking.
Method: Build a site-specific overhead LiDAR dataset with 3D human bounding-box annotations; adapt several candidate 3D detectors under a unified training and evaluation protocol; integrate AB3DMOT and SimpleTrack for lightweight tracking-by-detection; report detection performance with distance-sliced evaluation; measure latency to assess real-time feasibility.
Result: The best adapted detector configurations reach an average precision (AP) of up to 0.84 within a 5.0 m horizontal radius and 0.97 within 1.0 m, with VoxelNeXt and SECOND the most reliable backbones; latency measurements indicate practical real-time feasibility; the dataset and code are released.
Conclusion: The work narrows the domain gap between standard driving datasets and industrial overhead LiDAR sensing, providing data, methods, and a practical baseline for person detection and tracking in similar settings.
Abstract: This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data are scarce. Hence, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.
[123] Adaptive Prototype-based Interpretable Grading of Prostate Cancer
Riddhasree Bhattacharyya,Pallabi Dutta,Sushmita Mitra
Main category: cs.CV
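TL;DR: This paper proposes a prototype-based weakly supervised framework for interpretable grading of prostate cancer histopathology images; by mimicking how pathologists compare suspicious regions with clinically validated examples, it improves model trustworthiness and interpretability.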
Details
Motivation: Demand for biopsies in prostate cancer diagnosis is rising sharply and places a heavy workload on pathologists; existing deep learning methods perform well but are poorly interpretable, which hinders their adoption in high-stakes medical settings.
Method: A prototype-based weakly supervised framework: the network is first pre-trained at patch level to learn robust prototypical features for each grade; it is then fine-tuned under weak supervision with a new prototype-aware loss function; finally, an attention-based dynamic pruning mechanism handles inter-sample heterogeneity and selectively emphasizes relevant prototypes.
Result: Extensive validation on the PANDA and SICAP benchmark datasets indicates that the framework can serve as a reliable assistive tool in pathologists' routine diagnostic work.
Conclusion: The prototype-driven, weakly supervised, interpretable grading framework balances performance with clinical trustworthiness and offers a new approach to deploying medical AI.
Abstract: Prostate cancer is one of the most frequently diagnosed malignancies in men, and the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.
[124] Location-Aware Pretraining for Medical Difference Visual Question Answering
Denis Musinguzi,Caren Han,Prasenjit Mitra
Main category: cs.CV
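TL;DR: This paper proposes a pretraining framework for medical difference visual question answering (VQA) that adds location-aware tasks (AREF, GCAP, CAREF) to improve the vision encoder's modeling of subtle spatial differences; combined with a language model, it achieves state-of-the-art results on chest X-ray difference diagnosis.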
Details
Motivation: Conventional single-image VQA models do not match radiologists' need to compare multiple images for diagnosis, and standard vision encoders struggle to distinguish subtle changes such as disease progression versus acquisition differences.
Method: Location-aware pretraining tasks, namely automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF), strengthen the vision encoder's fine-grained spatial representations; the enhanced encoder is then combined with a language model for medical difference VQA.
Result: State-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
Conclusion: Location-aware pretraining effectively improves a model's understanding of subtle differences between medical images and offers a new approach to comparative multi-image diagnosis.
Abstract: Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
[125] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Jiaxin Fan,Wenpo Song
Main category: cs.CV
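TL;DR: This paper presents VisionPangu, a compact 1.7B-parameter multimodal model that improves fine-grained image description through efficient multimodal alignment and high-quality supervision (such as human-authored descriptions from the DOCCI dataset); without relying on large model scale, it remains competitive while producing more structured and detailed captions.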
Details
Motivation: Large multimodal models perform well on vision-language understanding, but their reliance on large-scale architectures and coarse supervision limits their ability to generate detailed image captions.
Method: An InternVL-derived vision encoder is connected to the OpenPangu-Embedded language backbone through a lightweight MLP projector, with an instruction-tuning pipeline inspired by LLaVA; dense human-authored image descriptions from the DOCCI dataset are used for training.
Result: VisionPangu performs well on detailed image captioning, producing captions with greater semantic coherence and descriptive richness, and achieves competitive overall performance at a compact parameter count.
Conclusion: With high-quality supervision and efficient alignment, compact multimodal models can substantially improve fine-grained image description without scaling up.
Abstract: Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
[126] Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression
Toby Chong,Ryota Nakajima
Main category: cs.CV
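TL;DR: This paper proposes a new camera model for monocular 3D Morphable Model (3DMM) regression that adds a shrinkage parameter to orthographic projection to model the perspective distortion in close-up facial images while preserving the stability of the original model; it is validated on a dataset captured with head-mounted cameras.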
Details
Motivation: Existing regression-based 3DMM fitting methods mostly use orthographic projection to avoid the ambiguity between focal length and object distance, but this simplification makes it hard to handle perspective distortion in close-up facial images such as those from head-mounted cameras.
Method: A shrinkage parameter is added to orthographic projection, yielding a pseudo-perspective projection that combines a perspective effect with optimization stability; several techniques enable fine-tuning of existing models.
Result: Quantitative and qualitative comparisons on a custom head-mounted camera dataset show that the method improves the accuracy and visual quality of close-up 3D face reconstruction.
Conclusion: The pseudo-perspective camera model captures perspective distortion in close-up faces while keeping optimization stable, making monocular 3DMM regression more practical for close-up scenarios such as AR/VR.
Abstract: We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow fine-tuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
[127] BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
Zishu Yao,Xiang-Xiang Su,Shengning Zhou,Guang-Yong Chen,Guodong Fan,Xing Chen
Main category: cs.CV
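TL;DR: This paper proposes BiEvLight, a framework that jointly optimizes low-light image enhancement and event denoising through a gradient-guided event denoising prior and task-constrained bilevel optimization, significantly improving performance.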
Details
Motivation: Event cameras offer high dynamic range, but their intrinsic background activity noise, together with low-SNR images, causes severe noise coupling during modal fusion and becomes a performance bottleneck; precise event denoising is therefore a prerequisite for unlocking the potential of event-based fusion.
Method: BiEvLight (1) exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior, and (2) formulates event denoising as a bilevel optimization problem constrained by the enhancement task, enabling cross-task interaction and tailored representations.
Result: On the real-world noise dataset SDE, it significantly outperforms SOTA methods, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR*, and 0.047 in SSIM.
Conclusion: Precise, task-adaptive event denoising is essential for event-assisted low-light image enhancement; through its hierarchical and task-aware design, BiEvLight alleviates noise coupling and improves overall enhancement quality.
Abstract: Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage (which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective), we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality.
Extensive experiments on the real-world noise dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30 dB in PSNR, 2.03 dB in PSNR*, and 0.047 in SSIM. The code will be publicly available at https://github.com/iijjlk/BiEvlight.
[128] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Xiongkun Linghu,Jiangyong Huang,Baoxiong Jia,Siyuan Huang
Main category: cs.CV
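TL;DR: This paper proposes 3D-RFT, the first framework to apply Reinforcement Learning with Verifiable Rewards (RLVR) to video-based 3D scene understanding; using metric-driven reward functions and the GRPO algorithm, it outperforms larger models on multiple 3D video understanding tasks.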
Details
Motivation: Existing 3D scene understanding methods mostly rely on supervised fine-tuning (SFT), whose token-level cross-entropy loss is misaligned with final task metrics, so the optimization objective is decoupled from task performance; RLVR has also not been explored for 3D understanding.
Method: 3D-RFT first activates a 3D-aware multimodal large language model (MLLM) via SFT, then applies reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), using strictly verifiable reward functions derived from evaluation metrics such as 3D IoU and F1-Score.
Result: 3D-RFT-4B achieves state-of-the-art results on several video-based 3D scene understanding tasks (3D video detection, 3D visual grounding, spatial reasoning), significantly outperforming larger models such as VG LLM-8B; the study also reports robust efficacy and insights into training strategies and data impact.
Conclusion: 3D-RFT is the first framework to carry the RLVR paradigm over to video-based 3D understanding, offering a better-aligned and more efficient paradigm for future 3D scene understanding.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact.
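A metric-derived reward combined with group-relative normalization can be sketched in a few lines. The example below uses 3D IoU between axis-aligned boxes as the reward and then standardizes rewards within a group of sampled responses, which is the core of GRPO's advantage estimate. Axis-aligned boxes, the helper names `iou_3d` and `group_relative_advantages`, and the toy boxes are assumptions for illustration; the actual reward functions in 3D-RFT are not specified beyond the metrics they derive from.

```python
import math

def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (max(0.0, box[3] - box[0])
            * max(0.0, box[4] - box[1])
            * max(0.0, box[5] - box[2]))

def iou_3d(a, b):
    """3D IoU of two axis-aligned boxes, used here directly as the reward."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
               min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]))
    inter = box_volume(overlap)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
sampled_boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),  # shifted by half the box width
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # no overlap
]
rewards = [iou_3d(box, ground_truth) for box in sampled_boxes]
advantages = group_relative_advantages(rewards)
```

Because the reward is the evaluation metric itself, a response is reinforced only to the extent that it scores better than its siblings in the same group, with no learned reward model involved.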
We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
[129] Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
Zheng Wang,Haoran Chen,Haoxuan Qin,Zhipeng Wei,Tianwen Qian,Cong Bai
Main category: cs.CV
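TL;DR: This paper proposes VideoHV-Agent, a framework that recasts long-video question answering as a hypothesis-verification process in which four agents (Thinker, Judge, Verifier, and Answer) cooperate, achieving high accuracy while improving interpretability, logical soundness, and computational efficiency.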
Details
Motivation: Long video understanding faces dense visual redundancy and long-range temporal dependencies, and chain-of-thought and retrieval-based agents tend to accumulate semantic drift and correlation-driven errors; existing methods mostly start from reactive retrieval without first reasoning about what the task requires.
Method: Following a "think before finding" principle, VideoHV-Agent works from video summaries: a Thinker rewrites candidate answers into testable hypotheses, a Judge derives the key clue to be verified, a Verifier grounds and tests the clue using localized fine-grained video content, and an Answer agent integrates the validated evidence into the final answer.
Result: State-of-the-art accuracy on three long-video understanding benchmarks, with better interpretability, improved logical soundness, and lower computational cost.
Conclusion: Task formulation, in particular hypothesis-driven verification, is a key route to better and more reliable long-video reasoning, and VideoHV-Agent offers an effective and extensible paradigm for it.
Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
[130] A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction
Jie Zhu,Hanghang Ma,Jia Wang,Yayong Guan,Yanbing Zeng,Lishuai Gao,Junqiang Wu,Jie Hu,Leye Wang
Main category: cs.CV
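TL;DR: Wallaroo is a simple autoregressive baseline based on next-token prediction that unifies multimodal understanding, image generation, and editing, with multi-resolution support and bilingual (Chinese and English) capability.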
Details
Motivation: To explore the potential of autoregressive models for unifying multimodal understanding and generation, and to address the limitations of existing models in multi-task, multilingual, and multi-resolution support.
Method: Wallaroo uses decoupled visual encoding pathways and a four-stage training strategy, supports multi-resolution image input and output in both Chinese and English, and adopts next-token prediction as the unified modeling paradigm.
Result: On multiple benchmarks, Wallaroo is competitive with or better than existing unified models.
Conclusion: Autoregressive modeling has strong potential for unified multimodal tasks, and Wallaroo provides a simple, effective baseline for this direction.
Abstract: In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
[131] TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Jiaxiong Liu,Zhen Tan,Jinpu Zhang,Yi Zhou,Hui Shen,Xieyuanli Chen,Dewen Hu
Main category: cs.CV
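TL;DR: This paper proposes TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion for robust, high-frequency arbitrary point tracking; its core components are a Transient Asynchronous Fusion (TAF) mechanism and a Cross-modal Locally Weighted Fusion (CLWF) module, and the authors also build a new real-world frame-event TAP dataset, significantly improving tracking accuracy.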
Details
Motivation: Existing RGB-event fusion methods suffer from temporal misalignment and degrade severely when one modality fails, so they cannot meet the high-precision, long-term motion reasoning demands of arbitrary point tracking.
Method: TAPFormer has two core modules: (1) a Transient Asynchronous Fusion (TAF) mechanism that models temporal evolution between frames through continuous event updates; (2) a Cross-modal Locally Weighted Fusion (CLWF) module that adapts spatial attention to modality reliability. The authors also construct a novel real-world frame-event TAP dataset.
Result: A 28.2% improvement in average pixel error within threshold on the new real-world dataset, and consistently the best performance on standard point tracking benchmarks.
Conclusion: Through asynchronous, adaptive cross-modal fusion, TAPFormer mitigates frame-event temporal misalignment and modality unreliability, significantly improving the robustness and accuracy of arbitrary point tracking.
Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
[132] MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration
Nanjie Yao,Gangjian Zhang,Wenhao Shen,Jian Shu,Yu Feng,Hao Wang
Main category: cs.CV
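TL;DR: This paper proposes MultiGO++, a framework that combines multi-source texture synthesis, region-aware shape extraction, and a dual reconstruction U-Net so that geometry and texture collaborate in monocular 3D clothed human reconstruction, markedly improving reconstruction quality.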
Details
Motivation: Existing monocular 3D clothed human reconstruction methods are limited by scarce texture data, inaccurate geometric priors, and biased single-modality supervision, which leads to poor reconstructions.
Method: The MultiGO++ framework comprises: (1) a multi-source texture synthesis strategy that builds more than 15,000 textured 3D human scans; (2) a region-aware shape extraction module and a Fourier geometry encoder that improve geometry learning; (3) a dual reconstruction U-Net that fuses geometry and texture features to generate high-fidelity 3D meshes.
Result: Experiments on two benchmark datasets and many in-the-wild cases show that the method outperforms current state-of-the-art approaches.
Conclusion: Through systematic geometry-texture collaboration, MultiGO++ overcomes the three main limitations of existing methods and achieves higher-quality monocular 3D human reconstruction.
Abstract: Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts features of each body region and lets them interact to obtain geometry information, and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
[133] Physics-consistent deep learning for blind aberration recovery in mobile optics
Kartik Jhawar,Tamo Sancho Miguel Tandoc,Khoo Jun Xuan,Wang Lipo
Main category: cs.CV
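TL;DR: This paper proposes Lens2Zernike, a framework that blindly recovers physical optical parameters (Zernike coefficients and related quantities) from a single blurred image. It combines Zernike regression, differentiable physics constraints, and multi-task spatial map prediction, which markedly improves parameter estimation accuracy and enables stable non-blind deconvolution.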
Details
Motivation: Mobile photography is limited by complex lens aberrations. Existing end-to-end deep learning methods lack optical modeling and tend to hallucinate, while classical blind deconvolution is highly unstable, so a physics-guided bridge between the two is needed.
Method: Lens2Zernike jointly supervises Zernike coefficient regression (z), differentiable physics constraints based on the wavefront and the point spread function (p), and auxiliary multi-task spatial map prediction (m). It uses a ResNet-18 backbone and is analyzed with an ablation study.
Result: The full z+p+m framework improves on the z-only baseline by 35%; its regression error is significantly lower than that of two existing deep learning methods; on the IDMxS mobile lens database, the recovered parameters support stable non-blind deconvolution that restores diffraction-limited detail.
Conclusion: To the authors' knowledge, Lens2Zernike is the first single model to supervise jointly across three optical domains. It delivers physically interpretable, accurate, and robust modeling and correction of lens aberrations, offering a new paradigm for mobile computational photography.
Abstract: Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these "black-box" models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
[134] How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices
Xiang Yin,Jinfan Hu,Zhiyuan You,Kainan Yan,Yu Tang,Chao Dong,Jinjin Gu
Main category: cs.CV
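TL;DR: This paper presents a large-scale, multi-dimensional evaluation of generative image restoration (GIR) methods and finds that the core challenge has shifted from insufficient detail to detail quality and semantic control. It proposes a new evaluation framework and uses it to train an IQA model that agrees better with human perception.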
Details
Motivation: To determine how far the practical capabilities of generative image restoration (GIR) have really advanced over earlier methods, and to address the fragmented, unsystematic nature of existing evaluations.
Method: The authors build a multi-dimensional evaluation pipeline covering detail, sharpness, semantic correctness, and overall quality, systematically evaluate a range of architectures (diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models), and train a new IQA model on the resulting benchmark.
Result: The failure modes of current GIR show a paradigm shift, from under-generation (missing detail) to over-generation (poor detail quality and loss of semantic control). The proposed IQA model aligns better with human perceptual judgments.
Conclusion: Although GIR has improved markedly in perceptual realism, its practical capability is still limited by detail quality and semantic controllability. The study offers new evaluation standards and directions for low-level vision.
Abstract: Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.
[135] Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
Yulong Shi,Shijie Li,Ziyi Li,Lin Qi
Main category: cs.CV
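TL;DR: This paper proposes Tell2Adapt, a source-free unsupervised domain adaptation (SFUDA) framework built on a vision foundation model (VFM). Context-Aware Prompts Regularization (CAPR) and Visual Plausibility Refinement (VPR) improve the generalization and reliability of medical image segmentation in multi-modality, multi-target clinical settings, reaching SOTA performance.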
Details
Motivation: Existing SFUDA methods are confined to low-gap, domain-specific shifts and cannot support multi-modality, multi-target clinical applications in a unified way, which makes them hard to deploy in real medical settings.
Method: The Tell2Adapt framework (1) exploits the knowledge of a VFM; (2) uses CAPR to robustly map varied text prompts to canonical instructions, producing high-quality pseudo-labels for adapting a lightweight student model to the target domain; (3) adds a VPR module that uses the VFM's anatomical knowledge to re-ground predictions in the target image's low-level visual features, suppressing noise and false positives.
Result: A large-scale SFUDA evaluation over 10 domain adaptation directions and 22 anatomical targets (brain, cardiac, polyp, abdominal, and others) shows consistent gains over existing methods, reaching SOTA for a unified SFUDA framework in medical image segmentation.
Conclusion: Tell2Adapt narrows the generalization and reliability gap that SFUDA faces in multi-scenario clinical deployment and demonstrates the potential of VFM-driven source-free adaptation for medical image analysis.
Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modality, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation.
Code is available at https://github.com/derekshiii/Tell2Adapt.
[136] Generalizable Multiscale Segmentation of Heterogeneous Map Collections
Remi Petitpierre
Main category: cs.CV
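TL;DR: This paper proposes a general-purpose semantic segmentation approach for diverse historical maps. It introduces a new benchmark dataset, Semap, and a segmentation framework that combines procedural data synthesis with multiscale integration, reaching SOTA performance on multiple datasets and showing that a diversity-driven approach is effective and generalizes well.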
Details
Motivation: Most work on historical map recognition designs specialist models for homogeneous map series, which cannot cope with the wide variety of style, scale, and geographic coverage found in real historical map collections.
Method: The authors build Semap, an open benchmark dataset of 1,439 manually annotated patches, and propose a semantic segmentation framework that combines procedural data synthesis with multiscale integration.
Result: The method achieves state-of-the-art performance on both the HCMSSD and Semap datasets, and segmentation quality remains stable across map collections, scales, geographic regions, and publication contexts.
Conclusion: A diversity-driven, general-purpose approach to semantic segmentation is both viable and beneficial. The work provides data and methods for bringing the long tail of historical map archives into historical geographic research.
Abstract: Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives into historical geographic studies.
[137] Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation
Thomas Pinetz,Veit Hucke,Hrvoje Bogunovic
Main category: cs.CV
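TL;DR: This paper proposes IRTTA, which uses the intermediate representations produced during reconstruction to adjust the normalization-layer parameters of the downstream network at test time, improving segmentation performance and providing semantic uncertainty estimates at no extra cost.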
Details
Motivation: Low-quality imaging devices depend on advanced reconstruction algorithms, yet existing practice evaluates downstream tasks on the final reconstructed image only and ignores the informative intermediate representations produced during reconstruction.
Method: IRTTA (Intermediate Representation Test-Time Adaptation) uses a modulator network at test time to adjust the normalization-layer parameters of a frozen downstream network, conditioned on the current reconstruction timescale. The modulator network is learned online with an entropy loss averaged over all timesteps, and the variation among per-timestep segmentations naturally provides uncertainty estimates.
Result: Without modifying the reconstruction process or the downstream model, the method markedly improves medical image segmentation and produces semantically meaningful uncertainty maps at the same time.
Conclusion: Intermediate representations carry rich information. IRTTA exploits them to combine test-time adaptation with uncertainty quantification, which suits resource-constrained primary care settings.
Abstract: Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
[138] CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Zhaonian Kuang,Rui Ding,Haotian Wang,Xinhu Zheng,Meng Yang,Gang Hua
Main category: cs.CV
TL;DR: 本文提出CoIn3D框架,通过空间感知特征调制(SFM)和相机感知数据增强(CDA),解决多相机3D目标检测在不同相机配置间泛化能力差的问题,显著提升跨配置迁移性能。
Details
Motivation: Existing MC3D models generalize poorly to unseen multi-camera configurations. The root cause is the discrepancy in spatial priors (intrinsics, extrinsics, array layout) between source and target configurations.
Method: CoIn3D comprises spatial-aware feature modulation (SFM), which fuses four spatial representations (focal length, ground depth, ground gradient, and Plücker coordinates), and camera-aware data augmentation (CDA), which increases observation diversity through training-free dynamic novel-view image synthesis.
Result: On benchmark datasets such as NuScenes, Waymo, and Lyft, CoIn3D shows strong cross-configuration detection performance under three dominant MC3D paradigms: BEVDepth, BEVFormer, and PETR.
Conclusion: Explicitly modeling and fusing multiple spatial priors improves the generalization of MC3D models to unknown camera configurations, and CoIn3D offers a transferable paradigm for general multi-camera 3D detection.
Abstract: Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches the feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
[139] CLIP-driven Zero-shot Learning with Ambiguous Labels
Jinfu Fan,Jiangnan Li,Xiaowen Yan,Xiaohui Zhong,Wenpeng Lu,Linqing Huang
Main category: cs.CV
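TL;DR: This paper proposes CLIP-PZSL, a CLIP-based zero-shot learning framework for handling label ambiguity, which improves model performance through semantic mining and a partial zero-shot loss.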
Details
Motivation: Existing zero-shot learning methods assume accurate labels for training samples, but in real-world settings label noise and ambiguity significantly reduce performance, so label ambiguity must be handled.
Method: CLIP extracts instance and label features; a semantic mining block fuses them to obtain discriminative label embeddings; a partial zero-shot loss weights candidate labels by their relevance to the instance and aligns instance and label embeddings to reduce semantic mismatch; labels and embeddings are refined iteratively.
Result: Comprehensive experiments on several datasets confirm the effectiveness and advantage of CLIP-PZSL.
Conclusion: CLIP-PZSL handles label ambiguity effectively and improves robustness and recognition accuracy in zero-shot learning.
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noisy and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As training proceeds, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
[140] A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset
Francisco Vacalebri-Lloret,Lucas Banchero,Jose J. Lopez,Jose M. Mossi
Main category: cs.CV
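TL;DR: This paper proposes a blue-light detection system based on multiple fisheye cameras and an improved RT-DETR for recognizing European emergency vehicles. On ABLDataset it reaches 94.7% accuracy and 94.1% recall, detects at distances of up to 70 meters, estimates the approach angle, and can be integrated into a multimodal ADAS to improve road safety.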
Details
Motivation: To improve the ability of Advanced Driver Assistance Systems (ADAS) to recognize the blue lights of emergency vehicles robustly and in real time, and so protect road safety, particularly under varied climatic and geographic conditions.
Method: Images are captured with four fisheye cameras, each with a 180-degree horizontal field of view. Using the self-built ABLDataset, the authors compare YOLOv5/v8/v10, RetinaNet, Faster R-CNN, and RT-DETR, select RT-DETR, and add a color attention block. Camera calibration provides azimuthal localization, and geometric transformations estimate the approach angle of the emergency vehicle relative to the center of the car.
Result: The improved RT-DETR reaches 94.7% accuracy and 94.1% recall on the test set; field tests detect lights at up to 70 meters; blue-light azimuth localization and approach-angle estimation both work; the system can be fused with an acoustic modality for multimodal ADAS.
Conclusion: The system markedly improves the accuracy, robustness, and practicality of emergency-vehicle blue-light detection, giving multimodal ADAS an efficient, deployable visual perception component.
Abstract: This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
[141] MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration
Nian Liu,Jin Gao,Shubo Lin,Yutong Kou,Sikui Zhang,Fudong Ge,Zhiqiang Pu,Liang Li,Gang Wang,Yizheng Wang,Weiming Hu
Main category: cs.CV
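TL;DR: This paper proposes MI-DETR, a biologically inspired infrared small target detector that processes one frame per time step. A retina-inspired cellular automaton generates a motion map, a dual-pathway design lets appearance and motion features interact, and an RT-DETR decoder produces detections, substantially outperforming multi-frame methods on several benchmarks.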
Details
Motivation: Infrared small target detection is hard because targets are tiny and low-contrast against complex, dynamic backgrounds. Conventional multi-frame methods learn motion implicitly and often need extra motion supervision or alignment modules, which limits efficiency and interpretability.
Method: Motion Integration DETR (MI-DETR) consists of: (1) a retina-inspired cellular automaton (RCA) that explicitly converts frame sequences into a motion map at the same resolution as the appearance image; (2) a Parvocellular-Magnocellular Interconnection (PMI) block that enables bidirectional feature interaction between the appearance and motion pathways; (3) an RT-DETR decoder that fuses features from both pathways to output detections.
Result: MI-DETR reaches 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K.
Conclusion: Explicit motion modeling combined with a biologically inspired dual-pathway mechanism substantially improves infrared small target detection with single-frame input per step, supporting the value of joint motion-appearance modeling.
Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration.
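The idea of a motion map "defined on the same pixel grid as the appearance image" can be illustrated with a minimal Python sketch. This is not the paper's retina-inspired cellular automaton, whose update rule the abstract does not give; it is a plain temporal-difference accumulator used as a hypothetical stand-in, and the function name `motion_map` and the `decay` parameter are assumptions made for illustration.

```python
# Hypothetical stand-in for a motion map: NOT the paper's RCA, only an
# exponentially decayed accumulation of absolute frame differences.

def motion_map(frames, decay=0.5):
    """Return an H x W motion map on the same pixel grid as the frames.

    frames: list of 2D lists (H x W) of pixel intensities, oldest first.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    height, width = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                diff = abs(curr[y][x] - prev[y][x])
                # Older motion fades; new motion is added on top.
                acc[y][x] = decay * acc[y][x] + diff
    return acc


# A bright one-pixel target moving left to right over a static background.
frames = [
    [[10, 10, 10], [90, 10, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 90, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 10, 90], [10, 10, 10]],
]
mmap = motion_map(frames)
```

Because the map has the same height and width as each frame, a bounding box annotated on the appearance image indexes the same region of the motion map, which is why one set of boxes could supervise both pathways without separate motion labels.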
Code is available at https://github.com/nliu-25/MI-DETR.
[142] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Yanlin Li,Minghui Guo,Kaiwen Zhang,Shize Zhang,Yiran Zhao,Haodong Li,Congyue Zhou,Weijie Zheng,Yushen Yan,Shengqiong Wu,Wei Ji,Lei Cui,Furu Wei,Hao Fei,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
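TL;DR: This paper introduces the UniM benchmark, the first unified any-to-any interleaved multimodal dataset, covering 30 domains and 7 modalities (text, image, audio, video, document, code, and 3D). It also provides an evaluation suite and a baseline model, UniMA, to push multimodal large language models toward unified interleaved understanding and generation.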
Details
Motivation: Real-world multimodal applications must handle arbitrarily combined, interleaved user inputs and generate arbitrarily interleaved multimedia outputs, which calls for any-to-any interleaved multimodal learning under a unified paradigm.
Method: The authors build the UniM dataset (31K high-quality instances across 30 domains and 7 modalities), design an evaluation suite with three dimensions (Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence), and propose UniMA, an agentic baseline model with traceable reasoning.
Result: Experiments show that UniM is difficult and expose key challenges for current models in interleaved multimodal understanding and generation, such as cross-modal alignment, structured generation, and maintaining coherence.
Conclusion: UniM provides a new benchmark, evaluation standard, and baseline for unified any-to-any multimodal intelligence, a step toward MLLMs with general multimodal interaction capability.
Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
[143] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Juntong Fang,Zequn Chen,Weiqi Zhang,Donglin Di,Xuancheng Zhang,Chengmin Yang,Yu-Shen Liu
Main category: cs.CV
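TL;DR: MoRe is an efficient feed-forward 4D reconstruction network that uses an attention-forcing strategy to separate dynamic motion from static structure, reconstructing dynamic 3D scenes from monocular video in real time and at high quality.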
Details
Motivation: When reconstructing dynamic 4D scenes that contain moving objects, existing optimization-based methods are limited because camera pose estimation is corrupted, and they are computationally expensive and hard to use in real time.
Method: Built on a strong static reconstruction backbone, MoRe uses an attention-forcing strategy to disentangle dynamic motion from static structure, applies grouped causal attention to model temporal dependencies across frames while supporting variable token lengths, and is fine-tuned on large-scale datasets that mix dynamic and static scenes to improve robustness.
Result: Experiments on multiple benchmarks show that MoRe keeps reconstruction quality high while markedly improving efficiency, producing temporally coherent dynamic geometry.
Conclusion: MoRe offers an efficient, practical, and robust feed-forward solution for monocular dynamic 4D scene reconstruction, overcoming the computational bottleneck and real-time limits of traditional optimization methods.
Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
[144] Orthogonal Spatial-temporal Distributional Transfer for 4D Generation
Wei Liu,Shengqiong Wu,Bobo Li,Haoyu Zhao,Hao Fei,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
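TL;DR: This paper proposes STD-4D, a new 4D diffusion model that disentangles spatial and temporal latents and, together with the Orster mechanism and the ST-HexPlane structure, transfers priors from 3D and video diffusion models, markedly improving the quality and spatial-temporal consistency of generated 4D content.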
Details
Motivation: Current 4D synthesis research is constrained by the lack of large-scale 4D datasets, so models struggle to learn the key spatial-temporal features, which holds back high-quality 4D generation.
Method: The authors propose the spatial-temporal-disentangled STD-4D diffusion model, design an Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism for effective prior transfer, and build a spatial-temporal-aware ST-HexPlane that integrates the transferred features and improves 4D deformation and Gaussian modeling.
Result: Experiments show that the method significantly outperforms existing approaches in spatial-temporal consistency and 4D synthesis quality.
Conclusion: Cross-modal prior transfer combined with disentangled modeling can ease the scarcity of 4D data and offers a new paradigm for high-quality 4D content generation.
Abstract: In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
[145] GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Xiaodong Zhu,Yuanming Zheng,Suting Wang,Junqi Yang,Yuhong Yang,Weiping Tu,Zhongyuan Wang
Main category: cs.CV
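TL;DR: This paper proposes GEM-TFL, a two-phase weakly supervised temporal forgery localization method based on graph modeling and the EM algorithm. Latent-variable modeling, training-free temporal consistency refinement, and graph-based proposal refinement markedly improve the localization of forged video and audio segments under weak supervision.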
Details
Motivation: Existing weakly supervised temporal forgery localization (WS-TFL) methods suffer from mismatched training and inference objectives, insufficient supervision from binary labels, gradient blockage caused by top-k aggregation, and no modeling of relationships between proposals.
Method: The GEM-TFL framework (1) uses the EM algorithm to turn video-level binary labels into multi-dimensional latent attributes, strengthening weak supervision; (2) adds a training-free temporal consistency refinement module; (3) includes a graph-based module that models temporal-semantic relationships among proposals. Overall it is a two-phase classification-regression design.
Result: Experiments on multiple benchmark datasets show that GEM-TFL markedly improves localization accuracy and robustness and substantially narrows the gap with fully supervised methods.
Conclusion: GEM-TFL narrows the performance gap between weakly and fully supervised TFL and offers a new paradigm for multimedia forgery detection at low annotation cost.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
[146] Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search
Zongfang Liu,Shengkun Tang,Zongliang Wu,Xin Yuan,Zhiqiang Shen
Main category: cs.CV
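TL;DR: This paper proposes Diff-ES, a framework that uses evolutionary search to optimize the stage-wise structured-pruning sparsity schedule of a diffusion model automatically. Memory-efficient dynamic weight routing avoids duplicating the model, and generation is markedly faster while image quality is preserved.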
Details
Motivation: Existing diffusion model pruning methods rely on manually set, heuristic stage-wise sparsity schedules that generalize poorly and give suboptimal performance, and stitching multiple models together adds memory overhead.
Method: Diff-ES is a stage-wise structured pruning framework based on evolutionary search. It divides the diffusion trajectory into stages, searches automatically for the best stage-wise sparsity schedule, and activates the corresponding parameters through memory-efficient conditional weight routing, without duplicating the model.
Result: Experiments on DiT and SDXL show consistent wall-clock speedups with minimal loss of image quality, reaching SOTA performance for structured pruning of diffusion models.
Conclusion: By replacing manual schedules with data-driven evolutionary search and combining it with dynamic weight routing, Diff-ES strikes a better balance between efficiency and quality and offers a new paradigm for efficient deployment of diffusion models.
Abstract: Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning.
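To make the phrase "discovers a stage-wise sparsity schedule via evolutionary search" concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration: the per-stage `SENSITIVITY` values, the quadratic `proxy_loss` (standing in for a real quality metric such as FID), the mean-sparsity budget `TARGET`, and the mutation operator. The abstract does not specify Diff-ES's actual fitness function or search operators, so this shows only the general shape of such a search.

```python
import random

# Hypothetical per-stage sensitivities: higher means pruning that stage hurts more.
SENSITIVITY = [4.0, 2.0, 1.0, 0.5]
TARGET = 0.4          # assumed budget: required mean sparsity across stages
MAX_SPARSITY = 0.9    # assumed upper bound on any single stage


def proxy_loss(schedule):
    """Toy stand-in for a quality metric; lower is better."""
    return sum(w * s * s for w, s in zip(SENSITIVITY, schedule))


def mutate(schedule, rng, step=0.05):
    """Move sparsity from one stage to another so the mean stays at TARGET."""
    child = list(schedule)
    i, j = rng.sample(range(len(child)), 2)
    delta = min(step, child[i], MAX_SPARSITY - child[j])
    child[i] -= delta
    child[j] += delta
    return child


def evolve(generations=200, population=16, survivors=4, seed=0):
    rng = random.Random(seed)
    uniform = [TARGET] * len(SENSITIVITY)
    pool = [uniform] + [mutate(uniform, rng) for _ in range(population - 1)]
    for _ in range(generations):
        pool.sort(key=proxy_loss)
        parents = pool[:survivors]  # elitism: the best schedules always survive
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - survivors)]
        pool = parents + children
    return min(pool, key=proxy_loss)


best = evolve()
```

In this sketch the search shifts sparsity toward the stages assumed to be less sensitive while holding the overall budget fixed, which is the kind of non-uniform schedule a hand-tuned heuristic could miss. A real system would replace `proxy_loss` with an evaluation of the pruned model.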
Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.
[147] BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity
Iman Nematollahi,Jose Francisco Villena-Ossa,Alina Moter,Kiana Farhadyar,Gabriel Kalweit,Abhinav Valada,Toni Cathomen,Evelyn Ullrich,Maria Kalweit
Main category: cs.CV
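TL;DR: This paper proposes BLINK, a trajectory-based recurrent state-space model of the interaction dynamics between natural killer (NK) cells and tumor cells. It learns from partially observed interaction sequences to predict apoptosis increments, which allows more accurate detection and forecasting of cytotoxic outcomes and yields an interpretable latent representation.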
Details
Motivation: Frame-wise classification alone cannot reliably infer NK cytotoxic outcomes, because these outcomes emerge from interaction dynamics that unfold over time.
Method: BLINK is a trajectory-based recurrent state-space model that learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate over time.
Result: On long-term time-lapse microscopy recordings, BLINK improves cytotoxic outcome detection, supports forecasting of future outcomes, and produces an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases.
Conclusion: BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.
Abstract: Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.
[148] UniPAR: A Unified Framework for Pedestrian Attribute Recognition
Minghe Xu,Rouying Wu,Jiarui Xu,Minhao Sun,Zikang Yan,Xiao Wang,ChiaWei Chu,Yu Li
Main category: cs.CV
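TL;DR: This paper proposes UniPAR, a unified Transformer-based framework for pedestrian attribute recognition (PAR). A unified data scheduling strategy and a dynamic classification head let it handle multiple modalities, including RGB images, video sequences, and event streams, and a phased fusion encoder explicitly aligns visual features with textual attribute queries. It performs comparably to SOTA methods on several benchmark datasets and improves cross-domain generalization and robustness in extreme environments.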
Details
Motivation: Existing PAR research is constrained by the "one-model-per-dataset" paradigm and struggles with the large differences across domains in modalities, attribute definitions, and environmental scenarios. Method: The UniPAR framework combines a unified data scheduling strategy, a dynamic classification head, and a phased fusion encoder that uses a late deep fusion strategy to explicitly align visual features with textual attribute queries. Result: On benchmark datasets including MSP60K, DukeMTMC, and EventPAR, performance is comparable to specialized SOTA methods; multi-dataset joint training significantly improves cross-domain generalization and recognition robustness in extreme environments such as low light and motion blur. Conclusion: UniPAR handles multiple modalities and datasets with a single model, retaining strong performance while improving generalization and robustness, and offers a new paradigm for moving PAR toward practical use. Abstract: Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
[149] SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning
Wenqian Li,Pengfei Fang,Hui Xue
Main category: cs.CV
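TL;DR: This paper proposes SRasP, a new cross-domain few-shot learning method that uses crop-global style perturbation guided by global semantics, together with multi-objective optimization, to improve generalization and robustness on unseen domains.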
Details
Motivation: Existing style-perturbation methods suffer from gradient instability and tend to converge to sharp minima, which limits knowledge transfer in cross-domain few-shot learning. Method: Self-Reorientation Adversarial Style Perturbation (SRasP) uses global semantics to identify incoherent crops, then reorients and aggregates the style gradients of those crops with the global style gradients; a multi-objective optimization function maximizes visual discrepancy while keeping global, crop, and adversarial features semantically consistent. Result: SRasP significantly outperforms state-of-the-art methods on multiple CD-FSL benchmarks, supporting its effectiveness at reaching flatter solutions and improving cross-domain generalization. Conclusion: By stabilizing style perturbation and adding semantics-aware multi-objective optimization, SRasP mitigates domain shift and improves adaptability and robustness on unseen target domains. Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima. To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial Style Perturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.
[150] Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Riccardo Andrea Izzo,Gianluca Bardaro,Matteo Matteucci
Main category: cs.CV
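TL;DR: Inspired by human cognition, this paper proposes an adaptive framework that uses visual embeddings to judge task complexity on the fly, giving VLA models a three-way Act/Think/Abstain response mechanism that preserves performance while substantially reducing computational cost and failure risk.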
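The three-way routing summarized above can be illustrated with a thresholded decision on a scalar complexity score. This is only a minimal sketch under stated assumptions: the `route` function, the score range, and both threshold values are hypothetical, and the paper's actual detector (an ensemble of parametric and non-parametric estimators over visual embeddings) is not reproduced here.

```python
from enum import Enum


class Route(Enum):
    ACT = "act"          # execute a known task immediately
    THINK = "think"      # spend extra reasoning on an ambiguous scene
    ABSTAIN = "abstain"  # halt before acting on an anomalous scene


def route(complexity_score: float,
          think_threshold: float = 0.3,
          abstain_threshold: float = 0.8) -> Route:
    """Map a complexity score in [0, 1] to an execution mode.

    The score and both thresholds are illustrative placeholders,
    not values taken from the paper.
    """
    if not 0.0 <= complexity_score <= 1.0:
        raise ValueError("complexity_score must lie in [0, 1]")
    if complexity_score >= abstain_threshold:
        return Route.ABSTAIN
    if complexity_score >= think_threshold:
        return Route.THINK
    return Route.ACT
```

In a setup like this the two thresholds would presumably be tuned on held-out data, trading saved computation (more Act decisions) against safety (more Abstain decisions).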
Details
Motivation: Existing VLA models rely on a fixed reasoning mechanism, which wastes resources on simple tasks and, lacking uncertainty estimation, risks catastrophic failure on complex or out-of-distribution tasks. Method: The VLA's vision-language backbone is turned into an active detection tool: latent embeddings (visual rather than language embeddings work best) are projected into an ensemble of parametric and non-parametric estimators, enabling complexity-aware dynamic routing (Act/Think/Abstain). Result: Evaluated on LIBERO, LIBERO-PRO, and a real robot, the vision-only configuration reaches 80% F1-Score with as little as 5% of the training data, making it an efficient and reliable complexity detector. Conclusion: Visual embeddings alone are sufficient to assess task complexity efficiently, which the authors attribute to the semantic invariance of language; the adaptive routing mechanism strikes a better balance among efficiency, robustness, and safety. Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
[151] SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction
Ningjing Fan,Yiqun Wang
Main category: cs.CV
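TL;DR: This paper proposes SSR-GS, a framework that models direct specular reflections with a prefiltered Mip-Cubemap, captures indirect specular reflections with an IndiASG module, and introduces reflection-aware Visual Geometry Priors (VGP) to improve glossy surface reconstruction, reaching SOTA on synthetic and real-world datasets.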
Details
Motivation: Existing 3D Gaussian splatting (3DGS) methods struggle to reconstruct glossy surfaces accurately under complex illumination, especially with strong specular reflections and multi-surface interreflections. Method: The SSR-GS framework (1) models direct specular reflections with a prefiltered Mip-Cubemap; (2) models indirect specular reflections with an IndiASG module; and (3) introduces Visual Geometry Priors (VGP), comprising a photometric loss weighted by a reflection score (RS), progressively decayed depth supervision, and transformed normal constraints. Result: On synthetic and real-world datasets, SSR-GS achieves state-of-the-art performance in glossy surface reconstruction. Conclusion: By jointly modeling direct and indirect reflections and adding reflection-aware geometry priors, SSR-GS improves how well 3DGS models glossy surfaces under complex illumination and markedly improves reconstruction quality. Abstract: In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
[152] The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis
Dishantkumar Sutariya,Eike Petersen
Main category: cs.CV
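TL;DR: This paper studies racial bias in deep learning models for chest X-ray (CXR) diagnosis that arises from learning race-related artifacts, and finds that bounding box-based lung cropping as a preprocessing step can mitigate racial shortcut learning without harming diagnostic accuracy.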
Details
Motivation: Deep learning models can identify racial identity from CXRs with high accuracy, raising concern about "racial shortcut learning", in which a model uses racial information as a spurious cue for diagnostic decisions, threatening healthcare equity and model reliability. Method: The study systematically evaluates three image preprocessing methods (lung masking, lung cropping, and CLAHE) for mitigating racial shortcut learning, comparing how well each reduces racial bias while preserving diagnostic performance. Result: Bounding box-based lung cropping substantially reduces racial shortcut learning without sacrificing diagnostic accuracy, sidestepping the trade-off often assumed between fairness and accuracy. Conclusion: Simple image preprocessing such as lung cropping is an effective and practical strategy for mitigating racial bias in medical AI and offers a new route to improving model fairness. Abstract: Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
[153] Generic Camera Calibration using Blurry Images
Zezhun Shi
Main category: cs.CV
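TL;DR: This paper proposes a method that combines geometric constraints with a local parametric illumination model to calibrate generic camera models in the presence of motion blur, jointly estimating feature locations and spatially varying point spread functions while resolving the translational ambiguity.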
Details
Motivation: Generic camera calibration is more accurate than parametric calibration but requires many more images, which makes motion blur hard for ordinary users to avoid. Method: Geometric constraints and a local parametric illumination model are used to jointly estimate feature locations and spatially varying point spread functions, while resolving a translational ambiguity that conventional image deblurring does not need to consider. Result: Experiments validate the effectiveness of the method under motion blur. Conclusion: The proposed method mitigates the impact of motion blur on generic camera calibration, improving calibration accuracy and feasibility in practical use. Abstract: Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.
[154] Mario: Multimodal Graph Reasoning with Large Language Models
Yuanfu Sun,Kang Li,Pengkang Guo,Jiajin Liu,Qiaoyu Tan
Main category: cs.CV
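TL;DR: This paper proposes Mario, a framework that enables effective large language model (LLM) reasoning on multimodal graphs (MMGs) through a graph-conditioned vision-language model and a modality-adaptive graph instruction tuning mechanism, addressing two challenges: weak cross-modal consistency and heterogeneous modality preference.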
Details
Motivation: Existing methods rely on pretrained vision-language models to encode image-text pairs in isolation, ignoring the relational structure inherent in real-world multimodal data; LLM-based reasoning on multimodal graphs therefore needs to preserve graph topology. Method: Mario has two stages: (1) a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; (2) a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and uses a learnable router to select, for each node and its neighborhood, the most informative modality configuration to pass to the LLM. Result: Across multiple MMG benchmarks, Mario consistently outperforms state-of-the-art graph models on node classification and link prediction in both supervised and zero-shot settings. Conclusion: Mario offers a unified, efficient, and topology-aware paradigm for LLM reasoning over structured multimodal data, with clear gains in cross-modal consistency and modality adaptivity. Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves both challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. First, a graph-conditioned VLM design jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Second, a modality-adaptive graph instruction tuning mechanism organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction.
The code will be made available at https://github.com/sunyuanfu/Mario.
[155] Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Muhammad Zarar,MingZheng Zhang,Xiaowang Zhang,Zhiyong Feng,Sofonias Yitagesu,Kawsar Farooq
Main category: cs.CV
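TL;DR: This paper proposes Logi-PAR, the first framework to bring learnable logic rules into patient activity recognition (PAR). It combines neural-guided differentiable rules with symbolic mappings to provide interpretable risk reasoning and counterfactual intervention analysis.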
Details
Motivation: Existing PAR models only identify which activity is occurring; they lack explicit logical reasoning about why a situation constitutes a risk, and so fall short of the interpretability and causal-intervention needs of clinical safety. Method: Logi-PAR fuses contextual facts through a multi-view primitive extractor and injects neural-guided differentiable logic rules; the rules are learned end-to-end, so that implicit patterns become explicitly labelled as logic rules during training. Result: Logi-PAR achieves SOTA performance on the VAST and OmniFall clinical benchmarks, significantly outperforming vision-language models and Transformer baselines; it produces auditable "why" explanations (rule traces) and supports quantitative counterfactual interventions (e.g., "risk would decrease by 65% if assistance were present"). Conclusion: Logi-PAR is the first framework to apply learnable logic rules to PAR; by bridging neural perception and symbolic reasoning, it improves the interpretability, auditability, and decision support of clinical PAR systems. Abstract: Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicit emergence patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrate state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines.
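To make the idea of a differentiable rule and a counterfactual intervention more concrete, here is a minimal sketch using the product t-norm as a soft AND. Everything in it is an illustrative assumption: the fact names (`near_bed_edge`, `standing`, `assisted`), the rule itself, and the numbers are hypothetical and are not taken from Logi-PAR, whose rules are learned rather than hand-written.

```python
from typing import Dict


def soft_and(*truth_values: float) -> float:
    """Product t-norm: a differentiable stand-in for logical AND on [0, 1]."""
    result = 1.0
    for value in truth_values:
        result *= value
    return result


def soft_not(truth_value: float) -> float:
    """Differentiable negation on [0, 1]."""
    return 1.0 - truth_value


def fall_risk(facts: Dict[str, float]) -> float:
    """Hypothetical rule: risk <- near_bed_edge AND standing AND NOT assisted."""
    return soft_and(facts["near_bed_edge"],
                    facts["standing"],
                    soft_not(facts["assisted"]))


def counterfactual_reduction(facts: Dict[str, float],
                             intervention: Dict[str, float]) -> float:
    """Relative drop in risk when some facts are overridden by an intervention."""
    baseline = fall_risk(facts)
    altered = fall_risk({**facts, **intervention})
    return (baseline - altered) / baseline if baseline > 0 else 0.0


# Illustrative values only.
observed = {"near_bed_edge": 0.9, "standing": 0.8, "assisted": 0.1}
reduction = counterfactual_reduction(observed, {"assisted": 0.7})
```

With these made-up inputs, raising `assisted` from 0.1 to 0.7 lowers the soft rule's output by about two thirds. Because each operation is a product or a subtraction, the rule output is differentiable with respect to every fact, which is what allows such rules to be trained end-to-end alongside a neural perception module.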
The code is available via: https://github.com/zararkhan985/Logi-PAR.git
[156] Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation
Yingxue Su,Yiheng Zhong,Keying Zhu,Zimu Zhang,Zhuoru Zhang,Yifang Wang,Yuxin Zhang,Jingxin Liu
Main category: cs.CV
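TL;DR: This paper proposes Semantic Class Distribution Learning (SCDL), a framework that learns structured class-conditional feature distributions to mitigate the supervision and representation biases caused by class imbalance in medical image segmentation.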
Details
Motivation: Dense pixel-level annotation for medical image segmentation is time-consuming and expensive, and datasets often show severe class imbalance, so minority structures are overwhelmed by dominant classes in the feature representation, which hinders discriminative feature learning and reliable segmentation. Method: The SCDL framework combines Class Distribution Bidirectional Alignment (CDBA), which aligns embeddings with learnable class proxies, and Semantic Anchor Constraints (SAC), which use labeled data to guide the proxies. Result: Experiments on the Synapse and AMOS datasets show that SCDL significantly improves overall and class-level metrics, with particularly strong gains on minority classes, reaching state-of-the-art results. Conclusion: SCDL is a plug-and-play module that mitigates supervision and representation biases in medical image segmentation, improving robustness to imbalanced data and segmentation accuracy. Abstract: Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.
[157] SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery
Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Main category: cs.CV
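TL;DR: This paper proposes SPyCer, a semi-supervised physics-guided network that combines satellite imagery and near-ground sensor data with physical models (such as the surface energy balance and advection-diffusion-reaction equations) to produce continuous, pixel-wise spatial estimates of near-surface air temperature (NSAT).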
Details
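Motivation: Near-ground sensors are accurate but sparse and unevenly distributed, so they cannot provide continuous spatial coverage; satellite remote sensing covers large areas but cannot directly measure near-surface atmospheric parameters. Combining the strengths of both under physical constraints is needed to improve reliability. Method: SPyCer treats NSAT prediction as a pixel-wise vision task: each sensor location is projected onto the satellite image to form a local image patch; the center pixel is supervised jointly by the measured NSAT and physical constraints, while neighboring pixels contribute through physics-guided regularization derived from the surface energy balance and PDEs; a multi-head attention mechanism guided by land cover, modulated with Gaussian distance weighting, models the spatial physical influence. Result: On real-world datasets, SPyCer produces spatially continuous and physically consistent NSAT estimates and outperforms existing baselines in accuracy, generalization, and consistency with physical processes. Conclusion: SPyCer shows that embedding physical priors deeply into a semi-supervised learning framework is effective, and offers a new paradigm for high-resolution, remote-sensing-driven retrieval of near-surface atmospheric parameters. Abstract: Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting.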
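The Gaussian distance weighting mentioned above can be illustrated with a small kernel over a square patch centered on the sensor pixel. This is a minimal sketch, not SPyCer's implementation: the function name, the `patch_size` and `sigma` parameters, and the normalization to unit sum are assumptions, and how the weights are combined with the land-cover-guided attention is not specified in the abstract.

```python
import math
from typing import List


def gaussian_distance_weights(patch_size: int, sigma: float) -> List[List[float]]:
    """Weights that decay with squared pixel distance from the patch center.

    Assumes an odd patch_size so that a single center (sensor) pixel exists.
    The weights are normalized to sum to 1 over the patch.
    """
    if patch_size < 1 or patch_size % 2 == 0:
        raise ValueError("patch_size must be a positive odd integer")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    center = patch_size // 2
    raw = [
        [
            math.exp(-((row - center) ** 2 + (col - center) ** 2) / (2.0 * sigma ** 2))
            for col in range(patch_size)
        ]
        for row in range(patch_size)
    ]
    total = sum(sum(row_values) for row_values in raw)
    return [[value / total for value in row_values] for row_values in raw]


weights = gaussian_distance_weights(patch_size=5, sigma=1.0)
```

A smaller `sigma` concentrates the influence on pixels next to the sensor, while a larger one spreads it toward the patch border; in practice this width would likely be tuned or tied to the image's ground resolution.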
Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
[158] Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems
Serkan Ergun,Tobias Mitterer,Hubert Zangl
Main category: cs.CV
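TL;DR: This paper presents a digital twin-based dual-arm robotic textile sorting system that combines multimodal perception, grasp prediction, and vision-language models (VLMs) for garment classification and foreign object detection in real-world scenes. It benchmarks nine VLMs on 223 inspection scenarios and demonstrates feasibility in industrial settings.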
Details
Motivation: Growing demand for sustainable textile recycling calls for robust automation that can handle deformable garments and detect foreign objects in cluttered environments. Method: A dual-arm robotic cell integrates RGBD vision, capacitive tactile feedback, and collision-aware motion planning; a digital twin with MoveIt handles path planning and incorporates 3D point clouds into the virtual environment; nine VLMs are evaluated on a self-built 223-scenario dataset for classification performance, hallucination behavior, and computational efficiency. Result: The Qwen model family achieves the highest overall accuracy (up to 87.9%) with strong foreign object detection; lighter models such as Gemma3 offer a good speed-accuracy trade-off for edge deployment; the digital twin improves manipulation reliability. Conclusion: Semantic VLM reasoning, conventional grasp detection, and digital twin technology can be combined effectively to support scalable, fully automated textile sorting in realistic industrial settings. Abstract: The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital twin driven robotic sorting system that integrates grasp prediction, multi-modal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability.
The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
[159] CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Gong Chen,Chaokun Zhang,Tao Tang,Pengcheng Lv,Feng Li,Xin Xie
Main category: cs.CV
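TL;DR: This paper proposes CATNet, a framework that uses spatio-temporal synchronization, wavelet-enhanced denoising, and adaptive feature selection to address temporal latency and multi-source noise in multi-agent cooperative perception, improving robustness and adaptability in complex traffic scenes.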
Details
Motivation: Existing cooperative perception research overlooks key challenges of real-world multi-source data fusion, notably high temporal latency and multi-source noise. Method: CATNet has three core modules: (1) Spatio-Temporal Recurrent Synchronization (STSync), which aligns asynchronous feature streams via adjacent-frame differential modeling; (2) a Dual-Branch Wavelet Enhanced Denoiser (WTDen), which suppresses global noise and reconstructs localized feature distortions; and (3) an Adaptive Feature Selector (AdpSel), which dynamically focuses on critical perceptual features for robust fusion. Result: Extensive experiments on multiple datasets show that CATNet consistently outperforms existing methods under complex traffic conditions, supporting its robustness and adaptability. Conclusion: CATNet is an effective adaptive compensation framework that mitigates the performance loss caused by temporal latency and noise interference in multi-agent systems, making cooperative perception more practical. Abstract: Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.
[160] Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Shan Ning,Longtian Qiu,Xuming He
Main category: cs.CV
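TL;DR: This paper proposes Wiki-R1, a data-generation-based curriculum reinforcement learning framework that strengthens the reasoning of multimodal large language models on knowledge-based visual question answering (KB-VQA), improving accuracy on two benchmark datasets.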
Details
Motivation: KB-VQA involves noisy external knowledge retrieval and a structured, encyclopedic knowledge base, which creates a distributional gap from pretrained multimodal large language models and makes effective reasoning and domain adaptation difficult in post-training. Method: Wiki-R1 is a data-generation-based curriculum reinforcement learning framework with controllable curriculum data generation (manipulating the retriever to produce samples at different difficulty levels) and a curriculum sampling strategy (selecting informative samples likely to yield non-zero advantages); sample difficulty is estimated from observed rewards and propagated to unobserved samples to guide learning. Result: Wiki-R1 sets new SOTA results on two KB-VQA benchmarks, raising accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. Conclusion: Through progressive curriculum-style reinforcement learning, Wiki-R1 bridges the gap between the pretrained model and the KB-VQA target distribution, showing the value of controllable data generation and difficulty-aware sampling for improving reasoning. Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
[161] Layer by layer, module by module: Choose both for optimal OOD probing of ViT
Ambroise Odonnat,Vasilii Feofanov,Laetitia Chapel,Romain Tavenard,Ievgen Redko
Main category: cs.CV
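TL;DR: This paper studies the intermediate layers of pretrained vision Transformers, finds that distribution shift between pretraining and downstream data is the main cause of performance degradation in deeper layers, and identifies which module is best to probe linearly depending on how strong the shift is.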
Details
Motivation: Intermediate layers of foundation models often give more discriminative representations than the final layer, but the cause remains unclear, particularly since the phenomenon also appears in models not pretrained autoregressively; a systematic analysis of intermediate-layer behavior is therefore needed. Method: Extensive linear probing experiments across diverse image classification benchmarks, with a fine-grained analysis at the module level (e.g., FFN, MHSA), examine how different layers and modules behave under distribution shift. Result: Distribution shift is the main cause of performance degradation in deeper layers; under strong shift, probing the activation inside the FFN works best, whereas under weak shift, probing the normalized MHSA output is optimal. Conclusion: Standard probing of Transformer block outputs is suboptimal; the probing location should be chosen according to the degree of distribution shift between pretraining and the downstream task to improve transfer performance. Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
[162] Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation
Kang Luo,Xin Chen,Yangyi Xiao,Hesheng Wang
Main category: cs.CV
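TL;DR: This paper proposes Fusion4CA, which uses a contrastive alignment module, a camera auxiliary branch, a cognitive adapter, and a coordinate attention module to exploit RGB information more fully in BEV space and improve LiDAR-RGB fusion for 3D object detection.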
Details
Motivation: Existing LiDAR-RGB fusion methods in BEV space rely too heavily on the LiDAR branch and under-exploit RGB information. Method: Built on the BEVFusion framework, Fusion4CA adds a contrastive alignment module to calibrate image features with 3D geometry, a camera auxiliary branch to strengthen the use of RGB information, a cognitive adapter to transfer pretrained image weights, and a coordinate attention module to enrich features at the fusion stage. Result: On nuScenes, the method reaches 69.7% mAP with only 6 training epochs and a 3.48% increase in parameters, exceeding the 20-epoch baseline by 1.2%; experiments in a simulated lunar environment also support its generalization. Conclusion: Fusion4CA strengthens the role of RGB information in BEV fusion and delivers clear gains in accuracy, efficiency, and generalization. Abstract: Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird's-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.
[163] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
Guandong Li
Main category: cs.CV
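TL;DR: This paper proposes SpectralCache, a framework that identifies non-uniformity in DiT denoising along the temporal, depth, and feature dimensions and addresses it with timestep-aware dynamic scheduling, cumulative error budgets, and frequency-decomposed caching, substantially accelerating DiT inference without sacrificing generation quality.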
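One way to picture a cumulative error budget is as a running total of the estimated error introduced by consecutive cache reuses, with a forced recomputation whenever the next reuse would exceed the budget. The sketch below is only an illustration of that general idea under stated assumptions: `plan_cache_reuse`, the per-step error estimates, and the reset-on-recompute policy are hypothetical and are not the SpectralCache algorithm.

```python
from typing import List


def plan_cache_reuse(step_errors: List[float], budget: float) -> List[bool]:
    """Decide per step whether to reuse the cached output (True) or recompute (False).

    step_errors[i] is a hypothetical estimate of the error added by reusing
    the cache at step i. The first step always recomputes, because nothing
    has been cached yet. Recomputing resets the accumulated error to zero.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    decisions: List[bool] = []
    accumulated = 0.0
    for index, error in enumerate(step_errors):
        if index == 0 or accumulated + error > budget:
            decisions.append(False)  # recompute and refresh the cache
            accumulated = 0.0
        else:
            decisions.append(True)   # reuse and charge the error to the budget
            accumulated += error
    return decisions


plan = plan_cache_reuse([0.0, 0.2, 0.2, 0.2, 0.1], budget=0.5)
```

With these made-up estimates the fourth step is recomputed, because a third consecutive reuse would push the accumulated error past the budget. Bounding the running total in this way limits how far approximation errors can cascade across consecutive caching decisions.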
Details
Motivation: Existing DiT caching methods treat the denoising process as uniform across dimensions, ignoring its actual non-uniformity along time, depth, and feature dimensions, which limits caching efficiency. Method: SpectralCache is a unified caching framework with three core components, Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC), which address non-uniformity along the temporal, depth, and feature dimensions respectively. Result: On FLUX.1-schnell, SpectralCache achieves a 2.46x speedup (LPIPS 0.217, SSIM 0.727), 16% faster than TeaCache at comparable quality (LPIPS difference < 1%), and is training-free and plug-and-play. Conclusion: By modeling the multi-dimensional non-uniformity of DiT denoising, SpectralCache improves caching efficiency and strikes a better balance between speed and generation quality, offering a new paradigm for efficient DiT inference. Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal -- sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth -- consecutive caching decisions lead to cascading approximation errors; and (3) feature -- different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
[164] Dark3R: Learning Structure from Motion in the Dark
Andrew Y Guo,Anagh Malik,SaiKiran Tedla,Yutong Dai,Yiqian Qin,Zach Salehe,Benjamin Attal,Sotiris Nousias,Kyros Kutulakos,David B. Lindell
Main category: cs.CV
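TL;DR: Dark3R is a structure-from-motion (SfM) framework designed for extremely low-light conditions (SNR below -4 dB) that requires no 3D supervision. It adapts large-scale 3D foundation models to dark scenes through teacher-student distillation, trains only on noisy-clean raw image pairs, and achieves SOTA performance in SfM and low-light novel view synthesis on a newly built large-scale exposure-bracketed dataset.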
Details
Motivation: Conventional feature-based and learning-based methods break down in the dark at extremely low signal-to-noise ratios (SNR < -4 dB), so an SfM method that works robustly in extreme low light without 3D supervision is needed. Method: Dark3R uses teacher-student knowledge distillation to adapt large-scale 3D foundation models to extreme low light; it is trained only on noisy-clean raw image pairs (captured directly or synthesized with a Poisson-Gaussian noise model) and is combined with coarse-to-fine radiance field optimization for novel view synthesis. Result: On a new exposure-bracketed dataset of about 42,000 multi-view raw images with ground-truth 3D annotations, Dark3R achieves SOTA SfM performance at low SNR, and novel view synthesis using its predicted poses is also SOTA. Conclusion: Dark3R performs SfM in extreme darkness with no 3D supervision, relying only on raw image pairs, which supports distilling large-scale foundation models as a way to adapt to extreme imaging conditions and opens a new direction for low-light 3D vision. Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB -- a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy-clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson-Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes ~42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
[165] ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
Sijia Chen,Zihan Zhou,Yanqiu Yu,En Yu,Wenbing Tao
Main category: cs.CV
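TL;DR: This paper proposes a new task, Omnidirectional Referring Multi-Object Tracking (ORMOT), to address the fragmented tracking that conventional RMOT suffers under a limited field of view. It builds ORSet, an omnidirectional referring multi-object tracking dataset, and designs ORTrack, a tracking framework driven by a large vision-language model.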
Details
Motivation: Existing Referring Multi-Object Tracking (RMOT) methods rely on conventional camera data with a limited field of view, so targets easily leave the frame, tracking becomes fragmented, and long-horizon language descriptions are hard to follow. Method: Proposes the new ORMOT task; builds the ORSet dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects; designs the LVLM-driven ORTrack framework. Result: Extensive experiments on ORSet demonstrate the effectiveness of ORTrack; the dataset and code will be open-sourced. Conclusion: ORMOT extends RMOT to omnidirectional imagery, eases the field-of-view limitation, and improves understanding of long-horizon language descriptions. Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework.
The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
[166] Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations
Hajar Dekdegue,Moncef Garouani,Josiane Mothe,Jordan Bernigaud
Main category: cs.CV
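TL;DR: This paper proposes Fusion-CAM, which combines the strengths of gradient-based and region-based methods through denoising, weighted fusion, and adaptive pixel-level fusion to produce more robust, discriminative, and context-aware visual explanations.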
Details
Motivation: Existing CAM methods such as Grad-CAM and Score-CAM suffer, respectively, from noisy, incomplete maps or from over-smoothing and lost detail, and struggle to combine discriminative power with full object coverage. Method: Fusion-CAM has three steps: 1) denoise the gradient map to obtain more focused activations; 2) fuse the denoised gradient map with region-based maps using contribution weights; 3) apply an adaptive similarity-based pixel-level fusion that dynamically adjusts the fusion strength, reinforcing consistent regions and softening conflicting ones. Result: On standard benchmarks, Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation. Conclusion: Fusion-CAM bridges the explanatory gap between gradient-based and region-based methods and provides a more robust, flexible visualization tool for interpreting deep neural networks. Abstract: Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations.
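As a rough illustration of the adaptive pixel-level fusion step described above, the sketch below blends two activation maps that are assumed to be normalized to [0, 1]. The agreement measure (`1 - |g - r|`), the use of `max` to reinforce consistent pixels, the plain average for conflicting pixels, and the name `fuse_maps` are all assumptions made for illustration; the abstract does not state the actual formulation.

```python
def fuse_maps(grad_map, region_map):
    """Blend two same-sized 2D maps (nested lists, values in [0, 1]).

    Hypothetical rule: where the maps agree, keep the stronger activation;
    where they disagree, fall back toward their average.
    """
    fused = []
    for g_row, r_row in zip(grad_map, region_map):
        row = []
        for g, r in zip(g_row, r_row):
            agreement = 1.0 - abs(g - r)   # 1.0 = identical, 0.0 = opposite
            reinforced = max(g, r)         # used where the maps agree
            blended = 0.5 * (g + r)        # used where the maps conflict
            row.append(agreement * reinforced + (1.0 - agreement) * blended)
        fused.append(row)
    return fused


# First pixel: both maps agree. Second pixel: the maps fully disagree.
print(fuse_maps([[0.8, 1.0]], [[0.8, 0.0]]))
```

With this rule a pixel on which both maps agree keeps its activation unchanged, while a fully conflicting pixel is pulled to the midpoint of the two values.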
Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
[167] Video-based Locomotion Analysis for Fish Health Monitoring
Timon Palm,Clemens Seibold,Anna Hilsmann,Peter Eisert
Main category: cs.CV
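TL;DR: This paper presents a video analysis system based on YOLOv11 and multi-object tracking that estimates the locomotion activity of cultivated fish to support fish health monitoring.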
Details
Motivation: Monitoring fish health is essential for early disease detection, animal welfare, and sustainable aquaculture, and the physiological and pathological state of fish can be inferred from their locomotion. Method: Uses a YOLOv11 detector embedded in a tracking-by-detection framework, and explores several YOLOv11 architecture configurations as well as multi-frame extensions to improve detection accuracy. Result: Evaluated on a manually annotated video dataset of Sulawesi ricefish, the system reliably estimates swimming direction and speed; the dataset will be released upon publication. Conclusion: The proposed system offers a feasible, reproducible approach to video-based fish behavior analysis and health monitoring. Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.
[168] MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis
Numan Saeed,Fadillah Adamsyah Maani,Mohammad Yaqub
Main category: cs.CV
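TL;DR: This paper proposes Selective Repulsive Knowledge Distillation, which compresses a large fetal ultrasound AI model (304M parameters) into a lightweight student (11.4M parameters) that matches or exceeds the teacher while running in real time on handheld devices such as the iPhone 16 Pro.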
Details
Motivation: Existing fetal ultrasound foundation models are too large (>300M parameters) to deploy on portable point-of-care devices, and standard knowledge distillation fails when the teacher-student capacity gap is extreme (~26x), because the student mimics architectural artifacts of the teacher instead of learning essential features. Method: Proposes Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into a diagonal term (preserving matched-pair alignment) and an off-diagonal term whose weight decays to negative values, actively repelling the student from the teacher's inter-class confusions and pushing it to discover features native to its own architecture. Result: The 11.4M-parameter student surpasses the 304M-parameter teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), with 1.6 ms inference on an iPhone 16 Pro. Conclusion: The method overcomes the bottleneck of distilling a large model into a very small on-device model and enables real-time assistive fetal ultrasound AI on handheld devices, with clear potential for clinical deployment. Abstract: Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M-parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.
[169] RelaxFlow: Text-Driven Amodal 3D Generation
Jiayin Zhu,Guoji Fu,Xiaolu Liu,Qiyuan He,Yicong Li,Angela Yao
Main category: cs.CV
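TL;DR: This paper proposes RelaxFlow, a framework for text-driven amodal 3D generation that decouples control granularity so that text prompts complete occluded regions while the input observation is preserved.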
Details
Motivation: Image-to-3D generation is semantically ambiguous under occlusion: a partial observation alone often cannot determine the object category, so text guidance is needed to reconstruct unseen regions plausibly. Method: Proposes RelaxFlow, a training-free dual-branch framework with a Multi-Prior Consensus Module and a Relaxation Mechanism; the authors prove that the relaxation is equivalent to applying a low-pass filter to the generative vector field, which preserves geometric structure and suppresses high-frequency detail. Result: On the newly built ExtremeOcc-3D and AmbiSem-3D benchmarks, the method generates unseen regions that match the prompt intent without compromising visual fidelity. Conclusion: RelaxFlow achieves a fine balance between observation constraints and text guidance for controllable 3D generation under occlusion. Abstract: Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
[170] SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Ye-Chan Kim,SeungJu Cha,Si-Woo Kim,Minju Jeon,Hyungee Kim,Dong-Jin Kim
Main category: cs.CV
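TL;DR: This paper proposes SAIL, which builds semantically aware masks through cross-modal alignment and uses a large language model to generate synthetic captions that strengthen training under sparse annotation, improving weakly-supervised dense video captioning.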
Details
Motivation: The existing method only generates non-overlapping masks that lack semantic connection to the events, and its reliance on sparse ground-truth captions limits performance. Method: Proposes the SAIL framework: 1) a similarity-aware training objective guides masks to focus on video regions highly similar to the event captions; 2) an LLM generates synthetic captions that are incorporated into training through an inter-mask mechanism to aid precise localization. Result: On ActivityNet Captions and YouCook2, SAIL achieves state-of-the-art results on both captioning and temporal localization metrics. Conclusion: Semantically aware mask construction combined with LLM-augmented synthetic captions mitigates the missing mask semantics and annotation sparsity of weakly-supervised dense video captioning. Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
[171] Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Dongwon Kim,Gawon Seo,Jinsung Lee,Minsu Cho,Suha Kwak
Main category: cs.CV
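TL;DR: This paper proposes CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, sharply reducing the computational cost of decision-time planning with world models while preserving planning performance.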
Details
Motivation: Conventional tokenizers produce large numbers of tokens, which makes decision-time planning with world models computationally expensive and impractical for real-time control. Method: Proposes CompACT, a tokenizer that compresses each observation into very few discrete tokens (as few as 8), and integrates it into an action-conditioned world model. Result: A world model using CompACT achieves competitive planning performance with orders-of-magnitude faster planning. Conclusion: CompACT is a practical step toward real-time, real-world deployment of world models. Abstract: World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that employs the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
[172] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Kanon Amemiya,Daichi Yashima,Kei Katsumata,Takumi Komatsu,Ryosuke Korekata,Seitaro Otsuki,Komei Sugiura
Main category: cs.CV
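TL;DR: This paper proposes NaiLIA, a method that retrieves nail design images from dense intent descriptions and palette queries. It introduces a relaxed loss based on confidence scores to improve alignment with unlabeled images, and is validated on a self-built multicultural benchmark.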
Details
Motivation: Existing vision-language foundation models struggle to incorporate dense intent descriptions (painted elements, embellishments, visual characteristics, themes, and overall impressions) together with palette queries (subtle, continuous colors chosen with a color picker), yet real user needs depend heavily on such multi-layered, fine-grained expression. Method: Proposes NaiLIA, a multimodal retrieval method that jointly models dense intent descriptions and palette queries; it introduces a relaxed loss based on confidence scores so that unlabeled images that may match a description can be exploited to strengthen semantic alignment. Result: On a self-built multicultural benchmark of 10,625 images with long, dense intent descriptions from more than 200 annotators, NaiLIA outperforms standard retrieval methods. Conclusion: NaiLIA improves how nail design retrieval understands and matches complex, fine-grained user intent, particularly when combining text descriptions with continuous color queries. Abstract: We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
[173] RealWonder: Real-Time Physical Action-Conditioned Video Generation
Wei Liu,Ziyu Chen,Zizhang Li,Yue Wang,Hong-Xing Yu,Jiajun Wu
Main category: cs.CV
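TL;DR: RealWonder is the first real-time system for action-conditioned video generation from a single image. It uses physics simulation as an intermediate bridge (producing optical flow and RGB frames) so that the video model can capture the physical consequences of 3D actions, supporting interactive simulation of forces and robot manipulation on rigid objects, deformable bodies, fluids, and granular materials.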
Details
Motivation: Existing video generation models cannot model the physical consequences of 3D actions such as forces and robotic manipulation, because they lack a structural understanding of how actions affect 3D scenes. Method: Proposes the RealWonder system, with three parts: single-image 3D reconstruction, physics simulation (producing optical flow and RGB intermediate representations), and a distilled video generator that needs only 4 diffusion steps; continuous actions are translated through physics simulation into visual signals that video models can process. Result: Achieves real-time generation at 13.2 FPS at 480x832 resolution, supporting interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. Conclusion: RealWonder connects actions, physics, and video generation, opening new opportunities for video models in immersive experiences, AR/VR, and robot learning. Abstract: Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
[174] Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Pengxiang Li,Joey Tsai,Hongwei Xue,Kunyu Shi,Shilin Yan
Main category: cs.CV
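TL;DR: This paper proposes Longest Stable Prefix (LSP), a decoding scheduler that improves the inference efficiency of diffusion language models (DLMs). LSP evaluates token stability in a single forward pass, dynamically identifies the longest stable prefix, and commits it atomically, which improves KV cache locality, lowers token flip rates and denoiser calls, and yields up to 3.4x speedup without sacrificing quality.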
Details
Motivation: Existing DLM decoding schedulers use scattered acceptance, which fragments the KV cache, hurts memory locality, and forces frequent repairs, severely limiting practical inference speed. Method: Proposes LSP, a training-free and model-agnostic scheduler that evaluates token stability in a single forward pass, dynamically identifies a contiguous left-aligned stable prefix, and commits it atomically at natural linguistic or structural delimiters; its prefix-first topology turns KV cache updates into contiguous appends while preserving bidirectional lookahead. Result: On LLaDA-8B and Dream-7B, LSP accelerates inference by up to 3.4x across mathematical reasoning, code generation, multilingual (CJK), and creative writing benchmarks, with output quality matched or slightly improved. Conclusion: By restructuring the token commitment topology, LSP narrows the gap between the theoretical parallelism of DLMs and practical hardware efficiency. Abstract: Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance', committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality.
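The prefix selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' scheduler: the confidence threshold used as the stability test, the `DELIMITERS` set, the fallback when no delimiter is found, and the name `longest_stable_prefix` are all assumptions.

```python
# Hypothetical delimiter set; the abstract only says "natural linguistic
# or structural delimiters".
DELIMITERS = {".", ",", ";", "\n", ")", "}"}


def longest_stable_prefix(tokens, confidences, threshold=0.9):
    """Return how many leading tokens of the active suffix to commit.

    A token counts as stable when its confidence reaches `threshold`
    (an assumed stability test). The stable block must be contiguous
    and left-aligned, so scanning stops at the first unstable token.
    """
    stable = 0
    for conf in confidences:
        if conf < threshold:
            break
        stable += 1

    # Snap the boundary back to the last delimiter inside the stable block.
    for end in range(stable, 0, -1):
        if tokens[end - 1] in DELIMITERS:
            return end

    # Assumed fallback: with no delimiter, commit the whole stable block
    # so that decoding still makes progress.
    return stable


tokens = ["x", "=", "1", ";", "y", "=", "?"]
confidences = [0.99, 0.95, 0.97, 0.93, 0.96, 0.50, 0.40]
print(longest_stable_prefix(tokens, confidences))
```

In this toy input the first five tokens are stable, and the boundary snaps back to the `;` so that four tokens are committed. Committing a contiguous prefix is what lets the KV cache grow by appends rather than scattered writes.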
By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
[175] EdgeDAM: Real-time Object Tracking for Mobile Devices
Syed Muhammad Raza,Syed Murtaza Hussain Abidi,Khawar Islam,Muhammad Ibrahim,Ajmal Saeed Mian
Main category: cs.CV
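TL;DR: This paper proposes EdgeDAM, a lightweight single-object tracking framework for edge devices that uses a dual-buffer memory and a confidence-driven switching strategy to improve robustness to occlusion, distractors, and fast motion while staying real-time.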
Details
Motivation: Existing segmentation-based memory mechanisms are computationally expensive and hard to run in real time on edge devices, while lightweight detection-based trackers drift when visually similar distractors appear; accuracy and speed must be balanced under tight resource constraints. Method: Proposes the EdgeDAM framework: (1) Dual-Buffer Distractor-Aware Memory (DAM), with a Recent-Aware Memory that keeps target hypotheses temporally consistent and a Distractor-Resolving Memory that explicitly stores hard negatives and penalizes their re-selection; (2) Confidence-Driven Switching with Held-Box Stabilization, which adaptively activates detection and memory-guided re-identification during occlusion and temporarily freezes and expands the bounding box to suppress distractor contamination. Result: Validated on five benchmarks, including the distractor-focused DiDi dataset, with 88.2% accuracy on DiDi and 25 FPS on an iPhone 15, improving robustness under occlusion and fast motion while remaining real-time. Conclusion: EdgeDAM adapts distractor-aware memory to the lightweight bounding-box tracking paradigm and balances accuracy and efficiency on edge devices. Abstract: Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination.
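The held-box idea in strategy (2) can be pictured as a tiny state update. The sketch below is only one possible reading: the confidence threshold, the expansion factor, the single expansion per hold, and the names `expand_box` and `step` are assumptions, and the detection and re-identification stages are left out.

```python
def expand_box(box, scale):
    """Scale an (x, y, w, h) box about its center."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * scale, h * scale
    return (cx - new_w / 2, cy - new_h / 2, new_w, new_h)


def step(state, tracker_box, confidence, low=0.3, scale=1.5):
    """One frame of a hypothetical confidence-driven held-box update.

    state: {"box": (x, y, w, h), "held": bool}
    When the tracker is reliable, follow it. When confidence drops,
    freeze the last trusted box and expand it once, ignoring the
    possibly contaminated tracker output.
    """
    if confidence >= low:
        return {"box": tracker_box, "held": False}
    if state["held"]:
        return state  # already frozen and expanded; keep holding
    return {"box": expand_box(state["box"], scale), "held": True}


state = {"box": (10, 10, 20, 20), "held": False}
state = step(state, (12, 12, 20, 20), confidence=0.9)   # reliable: follow tracker
state = step(state, (50, 50, 20, 20), confidence=0.1)   # unreliable: hold and expand
print(state)
```

A real system would release the hold once re-identification succeeds; here the hold simply ends when confidence recovers.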
Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
[176] HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Sai Akhil Kogilathota,Sripadha Vallabha E G,Luzhe Sun,Jiawei Zhou
Main category: cs.CV
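TL;DR: This paper proposes predicting the hallucination risk of vision-language models (VLMs) before generation by probing internal representations (such as visual features and vision-text query states) in a single forward pass. Detection needs no decoding, reaches up to 0.93 AUROC, and shows that the most informative layer and modality differ across models.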
Details
Motivation: Existing hallucination detection methods mostly run after text generation, which makes intervention costly and late; this work asks whether hallucination risk can be predicted from a single forward pass before any token is generated. Method: On eight modern VLMs (including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL), systematically analyzes three families of internal representations: (i) visual-only features without fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information; lightweight probes are trained on them for binary classification (hallucinated or not). Result: Probes reach up to 0.93 AUROC on several models (e.g., Gemma-3-12B); late query-token states are the most predictive for most models, while some models (e.g., Qwen2.5-VL-7B) rely more on visual-only features (~0.79 AUROC), confirming that hallucination risk can be predicted in advance and that the best probing location depends on the architecture. Conclusion: Hallucination risk can be detected before generation, and probe performance depends strongly on the model architecture and the chosen internal representation; this provides a basis for early abstention, selective routing, and adaptive decoding. Abstract: Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features).
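To make the probing setup concrete, here is a minimal sketch of a lightweight probe and the AUROC metric, assuming a logistic-regression probe trained by plain gradient descent. The toy one-dimensional features stand in for hidden states and are invented for illustration; the probe type, hyperparameters, and function names are assumptions rather than the paper's configuration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_scores(weights, bias, features):
    """Predicted hallucination risk for each feature vector."""
    return [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) for x in features]


def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-in for pre-generation hidden states (hypothetical data).
features = [[0.0], [0.2], [0.8], [1.0]]
labels = [0, 0, 1, 1]  # 1 = the eventual answer was hallucinated
weights, bias = train_probe(features, labels)
print(auroc(probe_scores(weights, bias, features), labels))
```

Because the probe reads only a cached representation, scoring costs one dot product per example, which is what makes abstaining or rerouting before decoding cheap.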
These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
[177] Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields
Scout Jarman,Zigfried Hampel-Arias,Adra Carr,Kevin R. Moon
Main category: cs.CV
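TL;DR: This paper proposes a neural radiance field (NeRF) method for 3D scene reconstruction from longwave infrared hyperspectral images (LWIR HSI), aimed at gas plume detection from sparse multi-view data. By building on Mip-NeRF, combining hyperspectral and sparse-view NeRF techniques, and adding an adaptive weighted MSE loss, it reaches 39.8 dB PSNR with only 30 training images and a gas detection AUC of 0.821.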
Details
Motivation: LWIR hyperspectral images are usually analyzed one image at a time, without joint modeling of scene geometry and spectra; fusing multi-view information could give gas plume detection more context and accuracy. Method: Builds on the Mip-NeRF architecture, combines hyperspectral NeRF and sparse-view NeRF techniques, and designs a novel adaptive weighted MSE loss; a synthetic multi-view LWIR HSI dataset containing an SF6 gas plume, generated with DIRSIG, is used for training and validation. Result: Compared with standard Mip-NeRF, the method needs about 50% fewer training images and achieves an average PSNR of 39.8 dB with as few as 30 training images; gas detection with the adaptive coherence estimator on NeRF-rendered images reaches an average AUC of 0.821. Conclusion: NeRFs can reconstruct 3D scenes from sparse multi-view LWIR hyperspectral images and support the downstream task of gas plume detection. Abstract: Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images.
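For readers unfamiliar with the reported metric, the following sketch computes PSNR from a reference image and a rendered image using the standard definition. The flat-list image representation and the `peak=1.0` default are assumptions; the abstract does not state which peak value or normalization was used for the hyperspectral data.

```python
import math


def psnr(reference, rendered, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat pixel lists.

    PSNR = 10 * log10(peak^2 / MSE); identical images give infinity.
    """
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Two pixels, each off by 0.1, so the mean squared error is about 0.01.
print(psnr([0.0, 1.0], [0.1, 0.9]))
```

Under the same assumed peak of 1.0, a PSNR of 39.8 dB corresponds to a mean squared error of roughly 1.05e-4.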
Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
[178] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
Guo Chen,Lidong Lu,Yicheng Liu,Liangrui Dong,Lidong Zou,Jixin Lv,Zhenquan Li,Xinyi Mao,Baoqi Pei,Shihao Wang,Zhiqi Li,Karan Sapra,Fuxiao Liu,Yin-Dong Zheng,Yifei Huang,Limin Wang,Zhiding Yu,Andrew Tao,Guilin Liu,Tong Lu
Main category: cs.CV
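TL;DR: This paper introduces MM-Lifelong, a dataset for multimodal lifelong understanding, and identifies two failure modes of current models on long-horizon video: a working memory bottleneck and global localization collapse. It proposes the Recursive Multimodal Agent (ReMA), which improves performance through dynamic memory management, and provides debiased dataset splits for future research.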
Details
Motivation: Existing video understanding datasets mostly consist of densely concatenated short clips that differ from real, unscripted daily life and cannot support understanding of long-term, sparse, multi-scale temporal behavior. Method: Builds MM-Lifelong, a 181.1-hour dataset structured across Day, Week, and Month scales; analyzes the failure modes of current end-to-end MLLMs and agentic baselines; proposes the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state. Result: ReMA significantly outperforms existing methods on MM-Lifelong; the evaluation reveals the Working Memory Bottleneck and Global Localization Collapse failure modes; dataset splits that isolate temporal and domain biases are provided. Conclusion: MM-Lifelong offers a more realistic benchmark for multimodal lifelong understanding, ReMA shows the value of dynamic memory and recursive modeling for long-horizon understanding, and the splits lay a foundation for research on supervised learning and out-of-distribution generalization. Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
[179] Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Shai Yehezkel,Shahar Yadin,Noam Elata,Yaron Ostrovsky-Berman,Bahjat Kawar
Main category: cs.CV
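TL;DR: This paper proposes CalibAtt, a training-free sparse attention method that uses offline calibration to identify stable block-level sparsity and repetition patterns and skips low-contribution attention connections, accelerating video diffusion model inference without loss of generation quality.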
Details
Motivation: Transformer-based video diffusion models run slowly because spatiotemporal attention is expensive; the authors observe that many token-to-token connections consistently receive negligible attention scores and that these patterns repeat across inputs, so the corresponding computation can be skipped safely. Method: CalibAtt is training-free: an offline calibration pass identifies block-level sparsity and repetition patterns that are stable across inputs for each layer, head, and diffusion timestep, and compiles them into optimized attention operations; at inference time only the selected input-dependent connections are computed densely, and the rest are skipped in a hardware-efficient manner. Result: On Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions, CalibAtt achieves up to 1.58x end-to-end speedup, outperforming other training-free methods while maintaining video quality and text-video alignment. Conclusion: CalibAtt shows that exploiting the structural stability of attention for hardware-friendly sparsification is an effective way to accelerate video diffusion models without training. Abstract: Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
[180] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Weijie Lyu,Ming-Hsuan Yang,Zhixin Shu
Main category: cs.CV
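TL;DR: FaceCam is a system that generates video under controllable camera trajectories from monocular portrait video. A face-tailored, scale-aware camera representation and two data generation strategies improve the geometric accuracy, visual quality, and motion consistency of the generated video.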
Details
Motivation: Existing camera control methods based on large video generation models often produce geometric distortions and visual artifacts on portrait videos because of scale-ambiguous camera representations or 3D reconstruction errors. Method: Proposes a face-tailored, scale-aware representation of camera transformations that does not rely on 3D priors; trains jointly on multi-view studio captures and in-the-wild monocular videos; designs two camera-control data generation strategies, synthetic camera motion and multi-shot stitching. Result: On the Ava-256 dataset and diverse in-the-wild videos, FaceCam outperforms existing methods in camera controllability, visual quality, identity consistency, and motion fidelity. Conclusion: With a scale-aware camera representation and data-driven strategies, FaceCam addresses the key geometric and visual challenges of controllable generation for monocular portrait video. Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.
[181] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
Leif Van Holland,Domenic Zingsheim,Mana Takhsha,Hannah Dröge,Patrick Stotko,Markus Plack,Reinhard Klein
Main category: cs.CV
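TL;DR: This paper proposes a Transformer-based, multi-view-aware texture inpainting method for multi-view 3D streaming. It runs as a standalone post-processing module after rendering and improves visual quality and inter-frame consistency while remaining real-time.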