Skip to content

Table of Contents

cs.CL [Back]

[1] TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

Chunliang Chen,Ming Guan,Xiao Lin,Jiaxu Li,Qiyi Wang,Xiangyu Chen,Jixiang Luo,Changzhi Sun,Dell Zhang,Xuelong Li

Main category: cs.CL

TL;DR: 本文提出了一种名为TeleMem的统一长时多模态记忆系统,通过叙事动态提取和结构化写入流程,提升大语言模型在长期交互中的表现,支持高效的记忆管理与多模态推理。

Details Motivation: 大语言模型在长期对话中受限于注意力机制,难以持续维护用户状态;现有检索增强方法缺乏可靠的记忆更新机制,易导致幻觉、效率低下和多模态支持不足。 Method: 提出TeleMem系统,采用叙事动态提取确保仅保留基于对话的真实信息;设计结构化写入流水线,实现记忆的批量处理、检索、聚类与合并;引入多模态记忆模块结合ReAct式推理,形成观察-思考-行动闭环。 Result: 在ZH-4O长期角色扮演游戏基准上,TeleMem相比Mem0基线提升了19%准确率,减少43% token使用,并实现2.1倍速度提升。 Conclusion: TeleMem有效解决了长期交互中的记忆一致性、存储效率和多模态理解问题,显著提升了大模型在复杂长期任务中的性能。 Abstract: Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning.To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.

[2] Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms

Yueze Liu,Ajay Nagi Reddy Kumdam,Ronit Kanjilal,Hao Yang,Yichi Zhang

Main category: cs.CL

TL;DR: 本文提出了VEJA框架,通过价值观、经历、判断和能力四个核心概念来提升角色扮演模型的角色真实性,并通过实验验证其优于现有方法。

Details Motivation: 现有的角色扮演模型在捕捉角色内在世界动态交互方面存在不足,导致无法生成真实可信的角色行为。 Method: 提出VEJA框架,强调价值观、经历、判断和能力在角色构建中的作用,并基于该框架进行数据整理;通过人工构建的VEJA数据集与最先进的合成数据方法对比,使用LLM-as-judge评估生成质量。 Result: 实验结果显示,VEJA框架下的数据集在角色一致性、深度和叙事连贯性方面显著优于基线模型。 Conclusion: 为实现更具深度和连续性的角色扮演代理,需转向以概念为基础的数据构建范式,VEJA提供了一个有前景的方向。 Abstract: Modern roleplaying models are increasingly sophisticated, yet they consistently struggle to capture the essence of believable, engaging characters. We argue this failure stems from training paradigms that overlook the dynamic interplay of a character's internal world. Current approaches, including Retrieval-Augmented Generation (RAG), fact-based priming, literature-based learning, and synthetic data generation, exhibit recurring limitations in modeling the deliberative, value-conflicted reasoning that defines human interaction. In this paper, we identify four core concepts essential for character authenticity: Values, Experiences, Judgments, and Abilities (VEJA). We propose the VEJA framework as a new paradigm for data curation that addresses these systemic limitations. To illustrate the qualitative ceiling enabled by our framework, we present a pilot study comparing a manually curated, VEJA-grounded dataset against a state-of-the-art synthetic baseline. Using an LLM-as-judge evaluation, our findings demonstrate a significant quality gap, suggesting that a shift toward conceptually grounded data curation, as embodied by VEJA, is necessary for creating roleplaying agents with genuine depth and narrative continuity. The full dataset is available at https://github.com/HyouinKyoumaIRL/Operation-Veja

[3] Lexical and Statistical Analysis of Bangla Newspaper and Literature: A Corpus-Driven Study on Diversity, Readability, and NLP Adaptation

Pramit Bhattacharyya,Arnab Bhattacharya

Main category: cs.CL

TL;DR: 本文通过对孟加拉语文学与报纸文本的语料库驱动分析,探讨其词汇多样性、结构复杂性与可读性,发现文学语料在词汇丰富性、结构变化和模型困惑度等方面均高于报纸语料,且更符合齐普夫定律,融入文学数据能提升下游任务性能。

Details Motivation: 探究孟加拉语文学与报纸文本在语言特征上的差异,并评估不同类型语料对语言模型及下游任务的影响。 Method: 基于Vacaspati(文学)和IndicCorp(报纸)两大孟加拉语语料库,采用TTR、HLR、Bigram多样性、平均音节与词长、齐普夫定律拟合度、困惑度、熵与冗余度等指标进行量化分析,并结合n-gram模型与可读性指数(Flesch与Coleman-Liau)进行比较。 Result: 文学语料虽规模较小,但在词汇多样性、结构复杂性、困惑度、熵值等方面显著高于报纸语料;更符合齐普夫定律;融合文学与报纸数据能提升下游任务表现;文学文本的可读性更低,表明其语言更复杂。 Conclusion: 孟加拉语文学文本具有更高的语言复杂性与多样性,其语料对语言模型训练和下游任务有积极贡献,建议在NLP研究中更多地纳入文学语料以提升模型泛化能力。 Abstract: In this paper, we present a comprehensive corpus-driven analysis of Bangla literary and newspaper texts to investigate their lexical diversity, structural complexity and readability. We undertook Vacaspati and IndicCorp, which are the most extensive literature and newspaper-only corpora for Bangla. We examine key linguistic properties, including the type-token ratio (TTR), hapax legomena ratio (HLR), Bigram diversity, average syllable and word lengths, and adherence to Zipfs Law, for both newspaper (IndicCorp) and literary corpora (Vacaspati).For all the features, such as Bigram Diversity and HLR, despite its smaller size, the literary corpus exhibits significantly higher lexical richness and structural variation. Additionally, we tried to understand the diversity of corpora by building n-gram models and measuring perplexity. Our findings reveal that literary corpora have higher perplexity than newspaper corpora, even for similar sentence sizes. This trend can also be observed for the English newspaper and literature corpus, indicating its generalizability. We also examined how the perfor- mance of models on downstream tasks is influenced by the inclusion of literary data alongside newspaper data. Our findings suggest that inte- grating literary data with newspapers improves the performance of models on various downstream tasks. We have also demonstrated that a literary corpus adheres more closely to global word distribution proper- ties, such as Zipfs law, than a newspaper corpus or a merged corpus of both literary and newspaper texts. Literature corpora also have higher entropy and lower redundancy values compared to a newspaper corpus. We also further assess the readability using Flesch and Coleman-Liau in- dices, showing that literary texts are more complex.

[4] Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization

Hanyu Li,Jiangshan Duo,Bofei Gao,Hailin Zhang,Sujian Li,Xiaotie Deng,Liang Zhao

Main category: cs.CL

TL;DR: 提出一种基于软强化学习的样本级压缩方法,减少大模型推理链长度20-40%,同时保持或提升准确性,并展现跨领域泛化能力。

Details Motivation: 现有思维链推理存在“过度思考陷阱”,导致计算成本高且延迟大,而全局静态控制策略可能误伤必要的推理过程。 Method: 采用样本级别的软强化学习压缩方法,在模型已掌握问题且已有简洁推理路径时,对低效长推理路径进行惩罚。 Result: 平均响应长度减少20-40%,准确率相当或更高;在数学训练后能泛化到代码、指令遵循和常识问答等未见任务,压缩效果显著且准确率稳定或提升。 Conclusion: 该压缩方法应成为高效推理模型开发的标准后训练阶段,可实现更准确且更简洁的推理。 Abstract: Chain-of-thought reasoning in large language models often creates an "overthinking trap," leading to excessive computational cost and latency for unreliable accuracy gains. Prior work has typically relied on global, static controls that risk penalizing necessary reasoning. We introduce a sample-level, soft reinforcement learning compression method that penalizes inefficiently long rollouts, but only on problems where the model has already mastered and already produced a more concise rollout. Our experiments show that this method reduces average response length by 20-40% with comparable or higher accuracy. Crucially, the compression exhibits strong cross-domain generalization; a model trained on math spontaneously shortens responses on unseen tasks like code, instruction following, and general knowledge QA, with stable or improved accuracy. We demonstrate a stable post-training curriculum (accuracy-compression-accuracy) that can ultimately produce models that are more accurate and reason more concisely, arguing that such compression method should be a standard phase in developing efficient reasoning models.

[5] A Multi-Stage Workflow for the Review of Marketing Content with Reasoning Large Language Models

Alberto Purpura,Emily Chen,Swapnil Shinde

Main category: cs.CL

TL;DR: 本文提出了一种利用微调推理大语言模型(LLMs)自动识别营销内容合规性问题的多阶段工作流,无需依赖外部知识表示,并比较了不同微调策略(如SFT和GRPO)的效果,评估了小规模LLM生成推理标记的能力以及不同奖励函数对GRPO训练的影响。

Details Motivation: 为了提高营销内容审查的自动化程度并确保其符合特定要求,研究者希望探索不依赖外部知识表示的合规性检测方法,并充分利用推理LLMs的潜力。 Method: 采用多阶段工作流,使用经过监督微调(SFT)和组相对策略优化(GRPO)等策略微调的推理大语言模型,训练小型LLM生成推理标记,并评估不同奖励函数组合对GRPO性能的影响。 Result: 提出了无需外部知识表示即可识别合规问题的新方法;比较了SFT与GRPO的微调效果;验证了小型LLM在生成推理标记方面的有效性;揭示了不同奖励函数选择对GRPO训练结果的显著影响。 Conclusion: 该研究表明,通过合适的微调策略和奖励机制设计,即使是小型推理LLM也能有效识别营销内容中的合规问题,为自动化内容审查提供了可行且高效的解决方案。 Abstract: Reasoning Large Language Models (LLMs) have shown promising results when tasked with solving complex problems. In this paper, we propose and evaluate a multi-stage workflow that leverages the capabilities of fine-tuned reasoning LLMs to assist in the review process of marketing content, making sure they comply with a given list of requirements. The contributions of this paper are the following: (i) we present a novel approach -- that does not rely on any external knowledge representation -- for the automatic identification of compliance issues in textual content; (ii) compare the effectiveness of different fine-tuning strategies like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) in training models to solve this problem; (iii) we evaluate the effectiveness of training small LLMs to generate reasoning tokens before providing their final response; (iv) we evaluate how the choice and combinations of different reward functions affects the performance of a model trained with GRPO.

[6] AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning

Yiwen Shao,Wei Liu,Jiahong Li,Tianzi Wang,Kun Wei,Meng Yu,Dong Yu

Main category: cs.CL

TL;DR: 本文提出了一种无需人工标注指令数据的语音大语言模型训练范式SIFT,并基于此构建了AZeroS模型,在少量参数更新和中等规模数据下实现了语音理解与生成的最先进性能。

Details Motivation: 现有的语音-LLM依赖大量任务特定的指令微调数据,标注成本高且模型泛化能力差,因此需要一种更高效、泛化性更强的训练方法。 Method: 提出Self-Generated Instruction-Free Tuning (SIFT) 范式:利用冻结的大型语言模型根据语音对应的文本自动生成监督信号,仅训练连接音频编码器与LLM的轻量级投影模块,无需人工构造问答对。 Result: 基于约25,000小时带ASR转录和3,000小时带副语言标签的公开语音-文本数据,仅更新两个轻量投影模块(各2380万参数),在VoiceBench、AIR-Bench Foundation (Speech) 和 AIR-Bench Chat (Speech) 等基准上达到最先进的性能。 Conclusion: SIFT范式能理论最优地提升语音-LLM对未见任务的泛化能力,AZeroS验证了该方法在低训练成本下实现高性能的可行性,推动了无需人工指令标注的语音大模型发展。 Abstract: Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks. Building upon this paradigm, we introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora, including approximately 25,000 hours of speech with ASR transcripts and 3,000 hours of speech with paralinguistic labels. Built upon Qwen2.5-7B-Instruct, the model updates only two lightweight projection modules (23.8 million parameters each), while keeping both the LLM and audio encoders frozen. Despite the minimal training cost and modest data scale, AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks, including VoiceBench, AIR-Bench Foundation (Speech), and AIR-Bench Chat (Speech).

[7] Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece

Anshul Kumar

Main category: cs.CL

TL;DR: 本研究通过701条《薄伽梵歌》平行经文,比较梵语、英语和印地语在不同分词器下的分词效率,发现梵语每字符信息密度更高,分词数量约为英语和印地语的一半;翻译版本分词数增加约20倍。尽管最新分词器(如GPT-4o和Gemini)缓解了语言偏见,但仍未能完全捕捉梵语的紧凑性,提示当前模型可能对非英语用户造成成本偏见。

Details Motivation: 探究梵语是否因其形态和语法特性具有更高的信息密度,并量化其在现代分词器下的分词效率,揭示当前LLM分词系统对非英语语言可能存在的计算成本偏见。 Method: 使用包含梵语、英语、印地语及梵语转写文本的701条平行经文数据集,评估SentencePiece、GPT系列及Gemini等主流分词器在分词数量、字符每分词(效率)和分词每字符(成本)等指标上的表现。 Result: 在无偏的SPM基准下,梵语的分词数比英语/印地语少约一半(~2x差异);英文/印地文翻译版的分词数增加约20倍;GPT-o200k-base和Gemini最新分词器显著减少了偏见,但仍未完全体现梵语的紧凑性。 Conclusion: 梵语具有更高的语义密度和分词效率,当前分词器仍偏向英语,可能导致非英语用户的计算成本上升;研究为改进未来分词器设计提供了依据,并展示了梵语在高效编码中的潜力。 Abstract: Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita, which comprises three languages-Sanskrit, English, and Hindi along with transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest generation tokenizers from Gemini and GPT. We use metrics of token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary resulted in an approximately 20x increase in token count. GPT o200k base (latest, used by GPT-4o) and Gemini (latest) reduce bias by a significant degree compared to GPT cl100k base (used until GPT-4), but still fail to fully capture Sanskrit's compactness. This matters because there might be a penalty bias for non-English users, which inflates the token count. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at https://github.com/anshulkr713/sanskrit-token-efficiency

[8] Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning

Yue Zhou,Xiaobo Guo,Belhassen Bayar,Srinivasan H. Sengamedu

Main category: cs.CL

TL;DR: Amory是一种新的工作记忆框架,通过在离线期间增强代理推理来构建结构化记忆表示,显著提升长期对话中的推理性能,同时将响应时间减少50%。

Details Motivation: 长期对话代理面临计算可扩展性挑战,传统方法将对话分割为孤立的记忆片段,缺乏人类记忆的连贯性和细腻性。 Method: Amory将对话片段组织为情景叙事,通过动量机制整合记忆,并将外围事实语义化为语义记忆;在检索时采用基于连贯性的推理。 Result: 在LOCOMO基准上,Amory性能达到与全上下文推理相当的水平,响应时间减少50%,且记忆覆盖和响应质量优于基于嵌入的方法。 Conclusion: 结构化记忆形成与连贯性驱动的检索能有效提升长期对话系统的效率与效果,动量感知的记忆整合显著提高记忆质量。 Abstract: Long-term conversational agents face a fundamental scalability challenge as interactions extend over time: repeatedly processing entire conversation histories becomes computationally prohibitive. Current approaches attempt to solve this through memory frameworks that predominantly fragment conversations into isolated embeddings or graph representations and retrieve relevant ones in a RAG style. While computationally efficient, these methods often treat memory formation minimally and fail to capture the subtlety and coherence of human memory. We introduce Amory, a working memory framework that actively constructs structured memory representations through enhancing agentic reasoning during offline time. Amory organizes conversational fragments into episodic narratives, consolidates memories with momentum, and semanticizes peripheral facts into semantic memory. At retrieval time, the system employs coherence-driven reasoning over narrative structures. Evaluated on the LOCOMO benchmark for long-term reasoning, Amory achieves considerable improvements over previous state-of-the-art, with performance comparable to full context reasoning while reducing response time by 50%. Analysis shows that momentum-aware consolidation significantly enhances response quality, while coherence-driven retrieval provides superior memory coverage compared to embedding-based approaches.

[9] How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning?

Yufeng Wang,Lu Wei,Lin Liu,Hao Xu,Haibin Ling

Main category: cs.CL

TL;DR: 本文提出了一种基于思维链(CoT)的提示框架,用于评估大语言模型(LLM)在从串联质谱数据推理分子结构中的能力,发现当前LLM虽能生成语法正确的SMILES结构,但在化学准确性方面仍不足。

Details Motivation: 从串联质谱(MS/MS)直接确定完整分子结构是一个长期挑战,现有方法难以应对复杂的碎片模式和广阔的化学空间,而大语言模型在科学推理任务中展现出潜力,但其在化学解释方面的能力尚不清楚。 Method: 将化学家的推理步骤(如双键等价物分析、中性损失识别和碎片组装)形式化为结构化提示,采用零样本设置,在MassSpecGym数据集上评估多个先进LLM(如Claude-3.5-Sonnet、GPT-4o-mini和Llama-3系列)的分子结构预测能力。 Result: 实验结果显示,LLM能够生成语法有效且部分合理的SMILES结构,但在分子式一致性和结构相似性等指标上表现不佳,推理过程与正确预测之间缺乏可靠关联。 Conclusion: 当前的大语言模型在化学结构解析任务中具有一定的解释潜力,但尚未达到化学准确性要求;未来需结合领域知识与强化学习,发展更具化学基础的AI推理方法。 Abstract: Mass spectrometry (MS) is a powerful analytical technique for identifying small molecules, yet determining complete molecular structures directly from tandem mass spectra (MS/MS) remains a long-standing challenge due to complex fragmentation patterns and the vast diversity of chemical space. Recent progress in large language models (LLMs) has shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. In this work, we introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. We formalize expert chemists' reasoning steps-such as double bond equivalent (DBE) analysis, neutral loss identification, and fragment assembly-into structured prompts and assess multiple state-of-the-art LLMs (Claude-3.5-Sonnet, GPT-4o-mini, and Llama-3 series) in a zero-shot setting using the MassSpecGym dataset. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions. These findings highlight both the interpretive potential and the current limitations of LLM-based reasoning for molecular elucidation, providing a foundation for future work that combines domain knowledge and reinforcement learning to achieve chemically grounded AI reasoning.

[10] $\texttt{AMEND++}$: Benchmarking Eligibility Criteria Amendments in Clinical Trials

Trisha Das,Mandis Beigi,Jacob Aptekar,Jimeng Sun

Main category: cs.CL

TL;DR: 本文提出了一个名为“入排标准修改预测”的新NLP任务,旨在预测临床试验初始方案的入排标准是否会被后续修改。为此发布了AMEND++基准套件,包括两个数据集:AMEND和经LLM去噪处理的AMEND_LLM,并提出了一种基于修改感知的掩码语言模型(CAMLM)预训练策略,实验证明该方法能有效提升预测性能,有助于更稳健、低成本的临床试验设计。

Details Motivation: 临床试验中频繁的方案修改,尤其是入排标准的修改,常导致延迟、成本增加和行政负担。提前预测可能的修改有助于优化试验设计。 Method: 构建了包含版本历史和修改标签的数据集AMEND,并通过LLM去噪生成高质量子集AMEND_LLM;提出Change-Aware Masked Language Modeling (CAMLM) 预训练方法,利用历史修改信息学习对修改敏感的文本表示。 Result: 在多种基线模型上实验表明,采用CAMLM策略显著提升了入排标准修改预测的准确性,验证了其有效性。 Conclusion: CAMLM结合AMEND++数据集为预测临床试验入排标准修改提供了有效的新方法,有望减少试验设计中的不确定性与成本。 Abstract: Clinical trial amendments frequently introduce delays, increased costs, and administrative burden, with eligibility criteria being the most commonly amended component. We introduce \textit{eligibility criteria amendment prediction}, a novel NLP task that aims to forecast whether the eligibility criteria of an initial trial protocol will undergo future amendments. To support this task, we release $\texttt{AMEND++}$, a benchmark suite comprising two datasets: $\texttt{AMEND}$, which captures eligibility-criteria version histories and amendment labels from public clinical trials, and $\verb|AMEND_LLM|$, a refined subset curated using an LLM-based denoising pipeline to isolate substantive changes. We further propose $\textit{Change-Aware Masked Language Modeling}$ (CAMLM), a revision-aware pretraining strategy that leverages historical edits to learn amendment-sensitive representations. Experiments across diverse baselines show that CAMLM consistently improves amendment prediction, enabling more robust and cost-effective clinical trial design.

[11] Why LoRA Fails to Forget: Regularized Low-Rank Adaptation Against Backdoors in Language Models

Hoang-Chau Luong,Lingwei Chen

Main category: cs.CL

TL;DR: LoRA在去除大模型后门行为时效果差,主要因其频谱强度不足和对齐不良;本文提出RoRA,通过增强频谱强度和修正对齐来提升去毒效果。

Details Motivation: 揭示LoRA在防御后门攻击中的根本弱点,并提出更有效的改进方法。 Method: 分析LoRA更新的频谱特性,提出RoRA,包含正则化、约束和重缩放策略。 Result: RoRA显著降低攻击成功率,同时保持干净数据准确率。 Conclusion: LoRA的脆弱性源于频谱问题而非低秩本身,RoRA通过频谱优化有效提升遗忘能力。 Abstract: Low-Rank Adaptation (LoRA) is widely used for parameter-efficient fine-tuning of large language models, but it is notably ineffective at removing backdoor behaviors from poisoned pretrained models when fine-tuning on clean dataset. Contrary to the common belief that this weakness is caused primarily by low rank, we show that LoRA's vulnerability is fundamentally spectral. Our analysis identifies two key factors: LoRA updates (i) possess insufficient spectral strength, with singular values far below those of pretrained weights, and (ii) exhibit unfavorable spectral alignment, weakly matching clean-task directions while retaining overlap with trigger-sensitive subspaces. We further establish a critical scaling threshold beyond which LoRA can theoretically suppress trigger-induced activations, and we show empirically that standard LoRA rarely reaches this regime. We introduce Regularized Low-Rank Adaptation (RoRA), which improves forgetting by increasing spectral strength and correcting alignment through clean-strengthened regularization, trigger-insensitive constraints, and post-training spectral rescaling. Experiments across multiple NLP benchmarks and attack settings show that RoRA substantially reduces attack success rates while maintaining clean accuracy.

[12] SyntaxMind at BLP-2025 Task 1: Leveraging Attention Fusion of CNN and GRU for Hate Speech Detection

Md. Shihab Uddin Riad

Main category: cs.CL

TL;DR: 本文提出了一个用于孟加拉语仇恨言论检测的统一模型,结合BanglaBERT、GRU和CNN多分支结构,并引入注意力机制,在两个子任务中分别取得第二和第五名的成绩。

Details Motivation: 针对孟加拉语仇恨言论检测任务缺乏高效统一模型的问题,设计能够同时捕捉上下文语义和局部语言特征的深度学习架构。 Method: 采用BanglaBERT获取文本嵌入,通过并行的GRU和CNN分支提取时序和局部特征,结合注意力机制与全连接层进行分类。 Result: 在Subtask 1A中取得0.7345的micro F1分数(排名第二),在Subtask 1B中取得0.7317的micro F1分数(排名第五)。 Conclusion: 所提出的多分支融合模型在孟加拉语仇恨言论检测中表现出色,验证了其在不同子任务中的鲁棒性和有效性。 Abstract: This paper describes our system used in the BLP-2025 Task 1: Hate Speech Detection. We participated in Subtask 1A and Subtask 1B, addressing hate speech classification in Bangla text. Our approach employs a unified architecture that integrates BanglaBERT embeddings with multiple parallel processing branches based on GRUs and CNNs, followed by attention and dense layers for final classification. The model is designed to capture both contextual semantics and local linguistic cues, enabling robust performance across subtasks. The proposed system demonstrated high competitiveness, obtaining 0.7345 micro F1-Score (2nd place) in Subtask 1A and 0.7317 micro F1-Score (5th place) in Subtask 1B.

[13] A Rising Tide Lifts All Boats: MTQE Rewards for Idioms Improve General Translation Quality

Ishika Agarwal,Zhenlin He,Dhruva Patil,Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: 本研究探讨了使用MTQE模型作为奖励函数进行GRPO风格微调,以提升神经机器翻译系统对非组合性表达(如习语)的翻译能力,在中文和印地语数据集上取得了显著性能提升。

Details Motivation: 非组合性表达(如习语、谚语和隐喻)因其意义无法从单个词语推导而出,且具有文化和字面双重含义,给神经机器翻译带来了挑战。 Method: 采用基于机器翻译质量估计(MTQE)模型作为奖励函数的GRPO风格微调方法,训练模型更好地翻译习语。 Result: 在中文和印地语习语数据集上,习语翻译准确率提升约14点,一般非习语翻译隐式提升约8点,跨语言翻译能力提升约6点。 Conclusion: 该工作量化了非组合性表达的翻译差距,并为提升大语言模型在跨文化和比喻语言理解方面的能力提供了有效路径。 Abstract: Non-compositional expressions (e.g., idioms, proverbs, and metaphors) pose significant challenges for neural machine translation systems because their meanings cannot be derived from individual words alone. These expressions encode rich, cultural meaning, and have both figurative and literal meanings, making accurate translation difficult. Because models are fairly good at translating compositional text, we investigate GRPO-style fine-tuning using Machine Translation Quality Estimation (MTQE) models as reward functions to train models to better translate idioms. Using Chinese and Hindi idiom datasets, we find that idiom translation abilities improve by ~14 points, general, non-idiomatic translation implicitly improves by ~8 points, and cross-lingual translation abilities (trained on one language, evaluated on another) improves by ~6 points. Overall, our work quantifies the non-compositional translation gap and offers insights for developing LLMs with stronger cross-cultural and figurative language understanding.

[14] Annotating Dimensions of Social Perception in Text: The First Sentence-Level Dataset of Warmth and Competence

Mutaz Ayesh,Saif M. Mohammad,Nedjma Ousidhoum

Main category: cs.CL

TL;DR: 本文介绍了W&C-Sent,首个句子级别标注温暖(包含信任与社交性)和能力维度的数据集,旨在推动NLP与计算社会科学交叉研究。

Details Motivation: 尽管温暖和能力是社会心理学中重要的评价维度,但在NLP中尚未得到充分关注,现有词级别词典无法完整捕捉文本中的上下文表达。 Method: 构建了一个包含1600多个英文句子-目标对的数据集,标注了信任、社交性和能力三个维度,数据来自社交媒体,并评估了多种大语言模型在该任务上的表现。 Result: 提出了W&C-Sent数据集,详细描述了数据收集、标注和质量控制流程,并通过实验评估了LLMs识别文本中温暖与能力的能力。 Conclusion: W&C-Sent为分析语言中的温暖与能力提供了新资源,有助于促进NLP与社会科学研究的融合。 Abstract: Warmth (W) (often further broken down into Trust (T) and Sociability (S)) and Competence (C) are central dimensions along which people evaluate individuals and social groups (Fiske, 2018). While these constructs are well established in social psychology, they are only starting to get attention in NLP research through word-level lexicons, which do not completely capture their contextual expression in larger text units and discourse. In this work, we introduce Warmth and Competence Sentences (W&C-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence--target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in W&C-Sent are from social media and often express attitudes and opinions about specific individuals or social groups (the targets of our annotations). We describe the data collection, annotation, and quality-control procedures in detail, and evaluate a range of large language models (LLMs) on their ability to identify trust, sociability, and competence in text. W&C-Sent provides a new resource for analyzing warmth and competence in language and supports future research at the intersection of NLP and computational social science.

[15] On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Jeff Chan-Jan Sju,Liang-Hsuan Tseng,Yi-Cheng Lin,Yen-Chun Kuo,Ju-Chieh Chou,Kai-Wei Chang,Hung-yi Lee,Carlos Busso

Main category: cs.CL

TL;DR: 本文提出了一系列新的基于似然性和生成性的评估方法,用于替代传统的全局标记困惑度(global token perplexity),以更准确地评估生成式口语语言模型的性能。实验表明,新指标与人类评分的相关性更强,并揭示了当前最优模型与人类水平之间的差距被显著低估。

Details Motivation: 现有的口语语言模型评估方法(如全局标记困惑度)直接借用文本领域的指标,忽略了语音和文本模态之间的本质差异,可能导致对模型性能的误判。因此,需要设计更适合语音特性的评估方式。 Method: 提出了多种替代全局标记困惑度的评估方法,包括基于似然性和生成性的指标,并通过与人类打分的平均意见得分(MOS)进行相关性分析来验证其有效性。 Result: 新提出的评估指标与人类MOS评分具有更强的相关性,能够更真实地反映生成质量;在新指标下,当前最优模型与人类表现之间的差距明显缩小。 Conclusion: 合适的评估方法对于准确衡量口语语言模型的发展至关重要,现有基于文本困惑度的评估方式不足以充分捕捉语音生成的质量。 Abstract: Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using ``global token perplexity'', which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.

[16] What Matters When Building Universal Multilingual Named Entity Recognition Models?

Jonas Golde,Patrick Haller,Alan Akbik

Main category: cs.CL

TL;DR: 本文系统地研究了多语言命名实体识别(NER)中的关键设计决策,提出了名为Otter的通用多语言NER模型,支持100多种语言,在性能和效率上均优于现有基线模型。

Details Motivation: 现有研究在架构、训练目标和数据源方面的决策缺乏系统性评估,阻碍了领域进展,因此需要对这些因素进行独立分析以明确其影响。 Method: 通过在多种语言上对模型架构、Transformer骨干网络、训练目标和数据组成进行大规模消融实验,识别出最优配置,并基于此构建Otter模型。 Result: Otter模型在F1分数上比GLiNER-x-base高出5.3个百分点,性能与Qwen3-32B等大型生成模型相当,但效率更高。 Conclusion: 系统性的设计选择评估对提升多语言NER至关重要,Otter展示了高效且高性能的多语言NER可行性,推动了该领域的可复现性和进一步研究。 Abstract: Recent progress in universal multilingual named entity recognition (NER) has been driven by advances in multilingual transformer models and task-specific architectures, loss functions, and training datasets. Despite substantial prior work, we find that many critical design decisions for such models are made without systematic justification, with architectural components, training objectives, and data sources evaluated only in combination rather than in isolation. We argue that these decisions impede progress in the field by making it difficult to identify which choices improve model performance. In this work, we conduct extensive experiments around architectures, transformer backbones, training objectives, and data composition across a wide range of languages. Based on these insights, we introduce Otter, a universal multilingual NER model supporting over 100 languages. Otter achieves consistent improvements over strong multilingual NER baselines, outperforming GLiNER-x-base by 5.3pp in F1 and achieves competitive performance compared to large generative models such as Qwen3-32B, while being substantially more efficient. We release model checkpoints, training and evaluation code to facilitate reproducibility and future research.

[17] Average shortest-path length in word-adjacency networks: Chinese versus English

Jakub Dec,Michał Dolina,Stanisław Drożdż,Jarosław Kwapień,Jin Liu,Tomasz Stanisz

Main category: cs.CL

TL;DR: 通过将标点符号视为普通词汇,研究中文和英文文学作品中词邻接网络的拓扑结构,发现标点符号对网络平均最短路径长度有显著影响,尤其在中文中忽略标点会导致路径长度明显增大。

Details Motivation: 标点符号承载情感信息、逻辑分组和避免歧义,且前期研究表明其在齐夫分析中表现类似词汇,有助于提升作者归属识别。因此,探索其在网络结构中的作用具有重要意义。 Method: 构建随时间增长的词-标点邻接网络,分析不同语言(中文与英文)及翻译文本中平均最短路径长度 $L(N)$ 随网络规模 $N$ 的变化,并用增长网络模型拟合实证结果。 Result: 当包含标点符号时,中英文的 $L(N)$ 渐近行为相似;但若忽略标点,中文的 $L(N)$ 显著增大。模型与实证数据吻合良好。 Conclusion: 标点符号在语言网络结构中扮演关键角色,尤其在中文中不可忽视,将其纳入分析可更准确反映语言的复杂网络特性。 Abstract: Complex networks provide powerful tools for analyzing and understanding the intricate structures present in various systems, including natural language. Here, we analyze topology of growing word-adjacency networks constructed from Chinese and English literary works written in different periods. Unconventionally, instead of considering dictionary words only, we also include punctuation marks as if they were ordinary words. Our approach is based on two arguments: (1) punctuation carries genuine information related to emotional state, allows for logical grouping of content, provides a pause in reading, and facilitates understanding by avoiding ambiguity, and (2) our previous works have shown that punctuation marks behave like words in a Zipfian analysis and, if considered together with regular words, can improve authorship attribution in stylometric studies. We focus on a functional dependence of the average shortest path length $L(N)$ on a network size $N$ for different epochs and individual novels in their original language as well as for translations of selected novels into the other language. We approximate the empirical results with a growing network model and obtain satisfactory agreement between the two. We also observe that $L(N)$ behaves asymptotically similar for both languages if punctuation marks are included but becomes sizably larger for Chinese if punctuation marks are neglected.

[18] Talking to Extraordinary Objects: Folktales Offer Analogies for Interacting with Technology

Martha Larson

Main category: cs.CL

TL;DR: 本文探讨了民间故事中的语言使用如何为技术交互提供灵感,强调语言和智能不必与人类特征绑定。

Details Motivation: 摆脱语音和语言技术中的人类拟态化倾向,寻找新的设计灵感。 Method: 通过分析民间故事中非人类对象使用语言的案例进行类比研究。 Result: 发现民间故事中的物体多样且令人印象深刻,语言能力不一定与人类性或智能相关联。 Conclusion: 民间故事可为语音和语言交互技术的设计提供有价值的启示和创新思路。 Abstract: Speech and language are valuable for interacting with technology. It would be ideal to be able to decouple their use from anthropomorphization, which has recently met an important moment of reckoning. In the world of folktales, language is everywhere and talking to extraordinary objects is not unusual. This overview presents examples of the analogies that folktales offer. Extraordinary objects in folktales are diverse and also memorable. Language capacity and intelligence are not always connected to humanness. Consideration of folktales can offer inspiration and insight for using speech and language for interacting with technology.

[19] AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Hao Yu,Tianyi Xu,Michael A. Hedderich,Wassim Hamidouche,Syed Waqas Zamir,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: 本文提出了AfriqueLLM,一个通过持续预训练(CPT)适应20种非洲语言的开源大语言模型系列,并系统研究了数据组成对多语言和推理能力的影响。

Details Motivation: 尽管大型语言模型日益多语言化,但针对非洲语言的开源模型仍显著落后于闭源系统,且在数学推理等复杂任务上表现有限,主要受限于低资源语言语料库的领域覆盖不均和任务相关知识缺失。 Method: 基于Llama 3.1、Gemma 3 和 Qwen 3 等多种基础模型,使用包含26B token的数据进行持续预训练(CPT),并系统调整数据混合比例,包括数学、代码和合成翻译数据,评估其在多语言基准上的下游性能。 Result: 实验表明,数据组成是CPT性能提升的主要驱动因素;加入数学、代码和合成翻译数据能持续提升模型表现,包括推理类任务;在固定架构中,更大规模通常带来更好性能,但跨家族比较时架构选择比规模更重要;基础模型的强多语言能力不能可靠预测CPT后的结果;此外,最佳模型还提升了长上下文性能,如文档级翻译。 Conclusion: 强大的模型架构结合任务对齐的数据设计,比单纯的模型规模或多语言基础性能更能有效提升低资源语言的适应效果,为开放多语言模型的发展提供了可复现的路径。 Abstract: Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present \texttt{AfriqueLLM}, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).

[20] MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan

Sebastian Nehrdich,Kurt Keutzer

Main category: cs.CL

TL;DR: 本文提出了MITRA框架,用于挖掘古代佛教文献中的多语言平行段落,构建了包含174万对平行句的大规模语料库,并开发了领域特定的预训练语言模型Gemma 2 MITRA,在机器翻译和语义嵌入任务上均达到先进水平。

Details Motivation: 古代佛教文献中存在大量跨语言的未标注文本对应关系,手动分析极为困难,亟需自动化工具支持多语言文本比对与理解。 Method: 提出MITRA-parallel挖掘流程,构建Sanskrit、中文和藏文之间的大规模平行句对语料库,基于此开发并微调领域专用的语言模型Gemma 2 MITRA。 Result: Gemma 2 MITRA-MT在多语言到英文的机器翻译任务上达到最先进性能,超越更大规模的开源模型;Gemma 2 MITRA-E在新提出的语义嵌入基准上表现优异。 Conclusion: MITRA框架为处理古代多语言文献提供了有效工具,推动了NLP技术在佛教及古典亚洲文献研究中的应用。 Abstract: Ancient Buddhist literature features frequent, yet often unannotated, textual parallels spread across diverse languages: Sanskrit, Pāli, Buddhist Chinese, Tibetan, and more. The scale of this material makes manual examination prohibitive. We present the MITRA framework, which consists of a novel pipeline for multilingual parallel passage mining, MITRA-parallel, a large-scale corpus of 1.74 million parallel sentence pairs between Sanskrit, Chinese, and Tibetan, and the development of the domain-specific pretrained language model Gemma 2 MITRA. We present Gemma 2 MITRA-MT, a version of this base model fine-tuned on machine translation tasks, reaching state-of-the-art performance for machine translation of these languages into English and outperforming even much larger open-source models. We also present Gemma 2 MITRA-E, a semantic embedding model that shows state-of-the-art performance on a novel, detailed semantic embedding benchmark. We make the parallel dataset, model weights, and semantic similarity benchmark openly available to aid both NLP research and philological studies in Buddhist and classical Asian literature.

[21] Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding

Yijiang River Dong,Tiancheng Hu,Zheng Hui,Nigel Collier

Main category: cs.CL

TL;DR: 本文提出了一种无需训练的系统提示强度方法,通过调整目标与默认提示间的logits差异来增强大语言模型对特定角色的遵循能力,在多个基准测试中显著提升了模型的可控性和准确性。

Details Motivation: 大语言模型在复杂指令上表现出色,但由于后训练形成的强先验,难以摆脱其固有的助手人格,限制了对冲突指令的响应能力。 Method: 引入系统提示强度,通过比较目标系统提示和默认系统提示的logits,提取并放大目标人格外的独有行为信号,使用标量因子alpha进行调节。 Result: 在五个不同基准测试中取得显著提升:IFEval严格准确率最高提升+8.5,OffTopicEval拒绝率降低45个百分点,Prompt-Steering可引导性提升13%。 Conclusion: 该方法无需重新训练即可动态控制模型行为,为调节系统提示强度提供了有效且灵活的解决方案。 Abstract: Large language models excel at complex instructions yet struggle to deviate from their helpful assistant persona, as post-training instills strong priors that resist conflicting instructions. We introduce system prompt strength, a training-free method that treats prompt adherence as a continuous control. By contrasting logits from target and default system prompts, we isolate and amplify the behavioral signal unique to the target persona by a scalar factor alpha. Across five diverse benchmarks spanning constraint satisfaction, behavioral control, pluralistic alignment, capability modulation, and stylistic control, our method yields substantial improvements: up to +8.5 strict accuracy on IFEval, +45pp refusal rate on OffTopicEval, and +13% steerability on Prompt-Steering. Our approach enables practitioners to modulate system prompt strength, providing dynamic control over model behavior without retraining.

[22] Value of Information: A Framework for Human-Agent Communication

Yijiang River Dong,Tiancheng Hu,Zheng Hui,Caiqi Zhang,Ivan Vulić,Andreea Bobu,Nigel Collier

Main category: cs.CL

TL;DR: 本文提出了一种基于信息价值(VoI)的决策理论框架,使大型语言模型代理能够动态权衡提问带来的效用增益与对用户的认知负担,无需调参即可在不同场景中自适应平衡任务风险、问题模糊性和用户努力。

Details Motivation: 大型语言模型代理在面对用户请求不明确时,难以决定是根据不完整信息行动还是打断用户寻求澄清,现有方法依赖脆弱的置信阈值或未考虑不同决策的风险差异。 Method: 引入基于信息价值(VoI)的决策理论框架,在推理时动态计算提问的期望效用增益与用户认知成本之间的权衡,无需超参数调整。 Result: 在四个不同领域(20问游戏、医学诊断、航班预订和电子商务)的实验表明,VoI方法始终匹配或优于最佳手动调优基线,在高成本情境下效用最多提高1.36分。 Conclusion: 该工作提供了一个无参数、可自适应的代理通信框架,能显式平衡任务风险、查询歧义和用户努力,提升了LLM代理在现实任务中的决策能力。 Abstract: Large Language Model (LLM) agents deployed for real-world tasks face a fundamental dilemma: user requests are underspecified, yet agents must decide whether to act on incomplete information or interrupt users for clarification. Existing approaches either rely on brittle confidence thresholds that require task-specific tuning, or fail to account for the varying stakes of different decisions. We introduce a decision-theoretic framework that resolves this trade-off through the Value of Information (VoI), enabling agents to dynamically weigh the expected utility gain from asking questions against the cognitive cost imposed on users. Our inference-time method requires no hyperparameter tuning and adapts seamlessly across contexts-from casual games to medical diagnosis. Experiments across four diverse domains (20 Questions, medical diagnosis, flight booking, and e-commerce) show that VoI consistently matches or exceeds the best manually-tuned baselines, achieving up to 1.36 utility points higher in high-cost settings. This work provides a parameter-free framework for adaptive agent communication that explicitly balances task risk, query ambiguity, and user effort.

[23] Structured Episodic Event Memory

Zhengxuan Lu,Dongfang Li,Yukun Shi,Beilun Wang,Longyue Wang,Baotian Hu

Main category: cs.CL

TL;DR: 提出了一种名为SEEM的分层记忆框架,结合图记忆与动态情景记忆,提升大语言模型在复杂推理和长期交互中的叙事连贯性和逻辑一致性。

Details Motivation: 现有基于静态RAG的记忆方法在复杂推理中存在检索分散、缺乏结构依赖的问题,难以支持智能体的动态关联性交互需求。 Method: 基于认知框架理论,构建包含图记忆层和动态情景记忆层的SEED框架,将交互流转化为带有精确来源指针的事件框架,并引入代理联想融合与逆向溯源扩展(RPE)机制以重构碎片化证据中的叙事上下文。 Result: 在LoCoMo和LongMemEval基准上的实验表明,SEEM显著优于基线模型,在叙事连贯性和逻辑一致性方面表现更优。 Conclusion: SEEM通过结构化记忆组织有效提升了大语言模型智能体在长期交互中的记忆利用能力和复杂推理性能。 Abstract: Current approaches to memory in Large Language Models (LLMs) predominantly rely on static Retrieval-Augmented Generation (RAG), which often results in scattered retrieval and fails to capture the structural dependencies required for complex reasoning. For autonomous agents, these passive and flat architectures lack the cognitive organization necessary to model the dynamic and associative nature of long-term interaction. To address this, we propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM transforms interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers. Furthermore, we introduce an agentic associative fusion and Reverse Provenance Expansion (RPE) mechanism to reconstruct coherent narrative contexts from fragmented evidence. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines, enabling agents to maintain superior narrative coherence and logical consistency.

[24] Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

Sazia Tabasum Mim,Jack Morris,Manish Dhakal,Yanming Xiu,Maria Gorlatova,Yi Ding

Main category: cs.CL

TL;DR: 本文探讨了单模态大语言模型(LLM)能否通过文本推理自身的信息需求,并向多模态模型提供有效反馈以优化其表现。作者提出一种方法,使语言代理能向视觉-语言模型(VLM)提供偏好反馈,从而提升VLM生成描述的质量。实验表明,该方法可显著提高VLM性能,最高绝对准确率提升达13%,且人类评估验证了AI反馈的有效性,LLM选择与人类判断的一致率达64.6%。

Details Motivation: 探索在不直接修改LLM结构的前提下,通过更可扩展的方式为现有LLM添加多模态能力,核心问题是:仅依赖文本的单模态LLM是否能自主识别信息需求并指导VLM优化输出。 Method: 设计一个语言代理框架,让单模态LLM基于其对任务的理解生成偏好反馈,用于指导VLM调整生成的多模态描述;通过迭代式反馈机制实现LLM与VLM之间的交互优化。 Result: 实验结果显示,利用LLM提供的反馈可使VLM生成的场景描述质量显著提升,最高带来13%的绝对准确率增益;人类研究显示LLM的偏好选择与人类判断有64.6%的一致性,验证了反馈的有效性;消融实验揭示了方法起效的关键因素及其局限性。 Conclusion: 单模态LLM能够通过自我驱动的反馈机制有效引导VLM优化输出,证明了无需增强模型结构即可通过代理式反馈实现高效、可扩展的多模态能力扩展路径。 Abstract: To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: Can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer this, we propose a method that enables a language agent to give feedback to a vision-language model (VLM) to adapt text generation to the agent's preferences. Our results from different experiments affirm this hypothesis, showing that LLM preference feedback significantly enhances VLM descriptions. Using our proposed method, we find that the VLM can generate multimodal scene descriptions to help the LLM better understand multimodal context, leading to improvements of maximum 13% in absolute accuracy compared to the baseline multimodal approach. Furthermore, a human study validated our AI-driven feedback, showing a 64.6% preference alignment rate between the LLM's choices and human judgments. Extensive experiments provide insights on how and why the method works and its limitations.

[25] NC-Bench: An LLM Benchmark for Evaluating Conversational Competence

Robert J. Moore,Sungeun An,Farhan Ahmed,Jay Pankaj Gala

Main category: cs.CL

TL;DR: NC-Bench是一个基于IBM自然对话框架的新基准,用于评估大语言模型在对话形式与结构上的表现,涵盖基础对话能力、检索增强生成和复杂请求三类任务。

Details Motivation: 现有基准多关注对话内容,而忽视了对话的形式与结构;NC-Bench旨在填补这一空白,基于人类对话的基本原则评估模型的通用对话能力。 Method: 基于IBM自然对话框架(NCF),设计三个测试集:基础对话能力集、RAG集和复杂请求集,评估模型在多种交互模式下的上下文适当性回应能力。 Result: 在6个开源模型上的初步评估显示:模型在基础回答任务中表现良好,但在修复任务(尤其是重复)中表现较差,在结束序列中表现不一,对复杂多轮请求最具挑战性;Qwen在基础集上表现优异,Granite在RAG和复杂请求集中领先。 Conclusion: NC-Bench提供了一个轻量、可扩展且理论驱动的框架,能够超越主题或任务特定的基准,有效评估和改进LLM的自然对话能力。 Abstract: The Natural Conversation Benchmark (NC-Bench) introduce a new approach to evaluating the general conversational competence of large language models (LLMs). Unlike prior benchmarks that focus on the content of model behavior, NC-Bench focuses on the form and structure of natural conversation. Grounded in the IBM Natural Conversation Framework (NCF), NC-Bench comprises three distinct sets. The Basic Conversation Competence set evaluates fundamental sequence management practices, such as answering inquiries, repairing responses, and closing conversational pairs. The RAG set applies the same sequence management patterns as the first set but incorporates retrieval-augmented generation (RAG). The Complex Request set extends the evaluation to complex requests involving more intricate sequence management patterns. Each benchmark tests a model's ability to produce contextually appropriate conversational actions in response to characteristic interaction patterns. Initial evaluations across 6 open-source models and 14 interaction patterns show that models perform well on basic answering tasks, struggle more with repair tasks (especially repeat), have mixed performance on closing sequences, and find complex multi-turn requests most challenging, with Qwen models excelling on the Basic set and Granite models on the RAG set and the Complex Request set. By operationalizing fundamental principles of human conversation, NC-Bench provides a lightweight, extensible, and theory-grounded framework for assessing and improving the conversational abilities of LLMs beyond topical or task-specific benchmarks.

[26] Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models

Jingmin An,Wei Liu,Qian Wang,Fang Fang

Main category: cs.CL

TL;DR: 本文提出了一种名为“时间旅行引擎”(TTE)的新框架,用于揭示大语言模型中时间信息的连续几何结构,并实现对历时性语义演化的精确控制。

Details Motivation: 大语言模型在处理时间相关任务时表现良好,但其内部如何编码和组织时间信息尚不清楚。研究者希望揭示模型中时间表示的机制,并实现对时间维度的可控操作。 Method: 通过分析残差流中的历时语言模式,构建一个共享的线性时间流形(chronological manifold),并在此基础上设计TTE框架,直接调制模型的潜在表征以实现跨时代的风格、词汇和概念迁移,同时防止未来信息泄露。 Result: 实验证明TTE能有效引导模型生成符合特定历史时期特征的文本;进一步发现中文和英文模型的时间子空间具有拓扑同构性,表明不同语言共享统一的时间演化几何逻辑。 Conclusion: 时间在大语言模型的潜在空间中以连续可导航的几何结构存在,TTE为控制模型的时间推理提供了新的可解释性工具,并揭示了跨语言时间表示的普遍性。 Abstract: Time functions as a fundamental dimension of human cognition, yet the mechanisms by which Large Language Models (LLMs) encode chronological progression remain opaque. We demonstrate that temporal information in their latent space is organized not as discrete clusters but as a continuous, traversable geometry. We introduce the Time Travel Engine (TTE), an interpretability-driven framework that projects diachronic linguistic patterns onto a shared chronological manifold. Unlike surface-level prompting, TTE directly modulates latent representations to induce coherent stylistic, lexical, and conceptual shifts aligned with target eras. By parameterizing diachronic evolution as a continuous manifold within the residual stream, TTE enables fluid navigation through period-specific "zeitgeists" while restricting access to future knowledge. Furthermore, experiments across diverse architectures reveal topological isomorphism between the temporal subspaces of Chinese and English-indicating that distinct languages share a universal geometric logic of historical evolution. These findings bridge historical linguistics with mechanistic interpretability, offering a novel paradigm for controlling temporal reasoning in neural networks.

[27] LitVISTA: A Benchmark for Narrative Orchestration in Literary Text

Mingzhe Lu,Yiwen Wang,Yanbing Liu,Qi You,Chong Liu,Ruize Qin,Haoyu Dong,Wenyu Zhang,Jiarui Zhang,Yue Hu,Yunpeng Li

Main category: cs.CL

TL;DR: 本文提出了VISTA Space这一高维叙事表征框架和基于文学文本的结构化基准LitVISTA,旨在解决现有大语言模型在生成叙事时忽视复杂故事结构的问题。研究发现当前模型难以统一捕捉叙事功能与结构,即使先进推理模式提升也有限。

Details Motivation: 现有大语言模型注重因果连贯性,但忽略了人类叙事中的复杂节奏、张力和情感动态,导致模型生成叙事与人类叙事存在结构性错配。 Method: 提出VISTA Space作为统一的人类与模型叙事表征框架,并构建LitVISTA这一结构化标注基准,用于系统评估模型的叙事协调能力;对GPT、Claude、Grok和Gemini等前沿大模型进行oracle评估。 Result: 评估结果显示现有大模型无法构建统一的全局叙事视图,在联合捕捉叙事功能与结构方面表现不佳,且现有改进的推理模式对此类文学理解任务提升有限。 Conclusion: 当前大语言模型在文学叙事组织能力上存在系统性缺陷,需发展能兼顾因果连贯与复杂叙事结构的新方法。 Abstract: Computational narrative analysis aims to capture rhythm, tension, and emotional dynamics in literary texts. Existing large language models can generate long stories but overly focus on causal coherence, neglecting the complex story arcs and orchestration inherent in human narratives. This creates a structural misalignment between model- and human-generated narratives. We propose VISTA Space, a high-dimensional representational framework for narrative orchestration that unifies human and model narrative perspectives. We further introduce LitVISTA, a structurally annotated benchmark grounded in literary texts, enabling systematic evaluation of models' narrative orchestration capabilities. We conduct oracle evaluations on a diverse selection of frontier LLMs, including GPT, Claude, Grok, and Gemini. Results reveal systematic deficiencies: existing models fail to construct a unified global narrative view, struggling to jointly capture narrative function and structure. Furthermore, even advanced thinking modes yield only limited gains for such literary narrative understanding.

[28] PRISP: Privacy-Safe Few-Shot Personalization via Lightweight Adaptation

Junho Park,Dohoon Kim,Taesup Moon

Main category: cs.CL

TL;DR: PRISP是一种轻量级且隐私安全的个性化框架,适用于数据、计算资源有限且隐私要求严格的场景,通过任务描述生成任务感知的LoRA参数,实现高效的用户个性化。

Details Motivation: 现有大模型个性化方法通常需要大量数据和计算资源,并存在隐私风险,难以满足实际部署后个性化的需求。 Method: 提出PRISP框架,利用Text-to-LoRA超网络从任务描述中生成任务感知的LoRA参数,并结合少量用户数据优化部分参数及小型附加模块。 Result: 在LaMP基准的少样本变体上实验表明,PRISP在降低计算开销的同时,相比先前方法取得了更优的整体性能,并消除了隐私风险。 Conclusion: PRISP在资源受限和隐私敏感的实际场景下,实现了高效、安全的大模型个性化。 Abstract: Large language model (LLM) personalization aims to adapt general-purpose models to individual users. Most existing methods, however, are developed under data-rich and resource-abundant settings, often incurring privacy risks. In contrast, realistic personalization typically occurs after deployment under (i) extremely limited user data, (ii) constrained computational resources, and (iii) strict privacy requirements. We propose PRISP, a lightweight and privacy-safe personalization framework tailored to these constraints. PRISP leverages a Text-to-LoRA hypernetwork to generate task-aware LoRA parameters from task descriptions, and enables efficient user personalization by optimizing a small subset of task-aware LoRA parameters together with minimal additional modules using few-shot user data. Experiments on a few-shot variant of the LaMP benchmark demonstrate that PRISP achieves strong overall performance compared to prior approaches, while reducing computational overhead and eliminating privacy risks.

[29] IndRegBias: A Dataset for Studying Indian Regional Biases in English and Code-Mixed Social Media Comments

Debasmita Panda,Akash Anil,Neelesh Kumar Shukla

Main category: cs.CL

TL;DR: 本文提出了一种针对印度地区偏见的新数据集IndRegBias,并通过多级标注策略评估了开源大语言模型和印度语言模型在检测地区偏见及其严重性方面的表现,发现微调方法显著提升了检测性能。

Details Motivation: 地区偏见在NLP研究中受到的关注较少,主要由于数据提取困难、标注存在主观性和偏见常与其他社会偏见交织导致代表性不足。 Method: 收集了25,000条来自Reddit和YouTube的关于印度地区议题的评论,构建IndRegBias数据集;采用多级标注策略标注偏见严重程度;使用零样本、少样本和微调方法评估大语言模型和印度语言模型的表现。 Result: 零样本和少样本方法在大多数模型中检测地区偏见及严重性的准确率较低,而微调方法显著提高了模型性能。 Conclusion: 微调是提升大语言模型检测印度地区偏见及其严重性的有效方法,IndRegBias为未来研究提供了重要资源。 Abstract: Warning: This paper consists of examples representing regional biases in Indian regions that might be offensive towards a particular region. While social biases corresponding to gender, race, socio-economic conditions, etc., have been extensively studied in the major applications of Natural Language Processing (NLP), biases corresponding to regions have garnered less attention. This is mainly because of (i) difficulty in the extraction of regional bias datasets, (ii) disagreements in annotation due to inherent human biases, and (iii) regional biases being studied in combination with other types of social biases and often being under-represented. This paper focuses on creating a dataset IndRegBias, consisting of regional biases in an Indian context reflected in users' comments on popular social media platforms, namely Reddit and YouTube. We carefully selected 25,000 comments appearing on various threads in Reddit and videos on YouTube discussing trending topics on regional issues in India. Furthermore, we propose a multilevel annotation strategy to annotate the comments describing the severity of regional biased statements. To detect the presence of regional bias and its severity in IndRegBias, we evaluate open-source Large Language Models (LLMs) and Indic Language Models (ILMs) using zero-shot, few-shot, and fine-tuning strategies. We observe that zero-shot and few-shot approaches show lower accuracy in detecting regional biases and severity in the majority of the LLMs and ILMs. However, the fine-tuning approach significantly enhances the performance of the LLM in detecting Indian regional bias along with its severity.

[30] Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Minghui Jia,Qichao Zhang,Ali Luo,Linjing Li,Shuo Ye,Hailing Lu,Wen Hou,Dongbin Zhao

Main category: cs.CL

TL;DR: 本文提出了一种名为Spec-o3的工具增强型视觉-语言代理,用于自动化天文学中罕见天体光谱的专家级检查,显著提升了识别性能和可解释性。

Details Motivation: 深度学习分类器在天文学中泛化性和可解释性有限,导致稀有天体候选体的最终审核仍依赖专家人工目视检查,已成为现代光谱巡天数据爆炸下的主要瓶颈。 Method: 提出Spec-o3,一种结合工具使用的视觉-语言智能体,通过多模态链式推理模拟专家分析过程;采用两阶段后训练策略:先在专家检查轨迹上进行监督微调,再在稀有类型验证任务上进行基于结果的强化学习。 Result: 在LAMOST的五个稀有天体识别任务上,Spec-o3将macro-F1分数从28.3提升至76.5(基于7B参数基础模型),优于现有专有视觉语言模型和专用深度模型,并展现出跨巡天数据集(LAMOST到SDSS/DESI)的强泛化能力;专家评估确认其推理过程连贯且符合物理规律。 Conclusion: Spec-o3实现了接近专家水平的自动化光谱检查,具备高可解释性与跨任务泛化能力,为应对未来大规模光谱数据提供了可信赖的解决方案。 Abstract: Due to the limited generalization and interpretability of deep learning classifiers, The final vetting of rare celestial object candidates still relies on expert visual inspection--a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. Spec-o3 is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, Spec-o3 establishes a new State-of-the-Art, boosting the macro-F1 score from 28.3 to 76.5 with a 7B parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision-making. Code, data, and models are available at \href{https://github.com/Maxwell-Jia/spec-o3}{Project HomePage}.

[31] MedRAGChecker: Claim-Level Verification for Biomedical Retrieval-Augmented Generation

Yuelyu Ji,Min Gu Kwak,Hang Zhang,Xizhi Wu,Chenyu Li,Yanshan Wang

Main category: cs.CL

TL;DR: MedRAGChecker 是一种用于生物医学检索增强生成的声明级验证框架,通过结合证据支持的自然语言推理和知识图谱一致性信号,检测生成答案中的不支持或矛盾声明,并提供细粒度诊断。

Details Motivation: 生物医学RAG生成的长篇回答常包含缺乏支持或相互矛盾的声明,存在安全隐患,亟需细粒度验证工具来提升可靠性。 Method: 将生成答案分解为原子声明,结合基于证据的自然语言推理(NLI)与生物医学知识图谱(KG)一致性信号评估声明支持度,并通过集成验证器与类别特定的可靠性加权进行聚合分析。 Result: 在四个生物医学问答基准上的实验表明,MedRAGChecker 能有效识别不支持和矛盾声明,揭示不同生成器的风险特征,尤其在安全关键的生物医学关系上表现突出。 Conclusion: MedRAGChecker 提供了一种可扩展、细粒度的诊断方法,能够区分检索与生成错误,提升生物医学RAG系统的可信度与安全性。 Abstract: Biomedical retrieval-augmented generation (RAG) can ground LLM answers in medical literature, yet long-form outputs often contain isolated unsupported or contradictory claims with safety implications. We introduce MedRAGChecker, a claim-level verification and diagnostic framework for biomedical RAG. Given a question, retrieved evidence, and a generated answer, MedRAGChecker decomposes the answer into atomic claims and estimates claim support by combining evidence-grounded natural language inference (NLI) with biomedical knowledge-graph (KG) consistency signals. Aggregating claim decisions yields answer-level diagnostics that help disentangle retrieval and generation failures, including faithfulness, under-evidence, contradiction, and safety-critical error rates. To enable scalable evaluation, we distill the pipeline into compact biomedical models and use an ensemble verifier with class-specific reliability weighting. Experiments on four biomedical QA benchmarks show that MedRAGChecker reliably flags unsupported and contradicted claims and reveals distinct risk profiles across generators, particularly on safety-critical biomedical relations.

[32] Atomic-SNLI: Fine-Grained Natural Language Inference through Atomic Fact Decomposition

Minghui Huang

Main category: cs.CL

TL;DR: 本文提出了Atomic-SNLI数据集,以提升自然语言推理中的细粒度原子级推理能力,同时保持句子级别的性能,并增强模型的可解释性。

Details Motivation: 现有NLI系统多为句子级别且缺乏解释性,原子级推理虽有潜力,但因模型在细粒度推理上表现差而受限。 Method: 通过分解SNLI构建Atomic-SNLI数据集,采用语言学指导的生成策略添加精心策划的原子级样本,并对模型进行微调。 Result: 在Atomic-SNLI上微调的模型显著提升了原子级推理能力,同时保持了良好的句子级推理性能。 Conclusion: Atomic-SNLI有助于实现更准确、透明和可解释的事实级自然语言推理。 Abstract: Current Natural Language Inference (NLI) systems primarily operate at the sentence level, providing black-box decisions that lack explanatory power. While atomic-level NLI offers a promising alternative by decomposing hypotheses into individual facts, we demonstrate that the conventional assumption that a hypothesis is entailed only when all its atomic facts are entailed fails in practice due to models' poor performance on fine-grained reasoning. Our analysis reveals that existing models perform substantially worse on atomic level inference compared to sentence level tasks. To address this limitation, we introduce Atomic-SNLI, a novel dataset constructed by decomposing SNLI and enriching it with carefully curated atomic level examples through linguistically informed generation strategies. Experimental results demonstrate that models fine-tuned on Atomic-SNLI achieve significant improvements in atomic reasoning capabilities while maintaining strong sentence level performance, enabling both accurate judgements and transparent, explainable results at the fact level.

[33] Exposía: Academic Writing Assessment of Exposés and Peer Feedback

Dennis Zyska,Alla Rozovskaya,Ilia Kuznetsov,Iryna Gurevych

Main category: cs.CL

TL;DR: Exposía是一个首个公开的将写作与反馈评估关联的高等教育数据集,支持学术写作评价的研究,包含学生研究提案及同行和教师的反馈,并用于评估开源大语言模型在自动评分任务中的表现。

Details Motivation: 为了支持教育导向的学术写作评估方法研究,需要一个真实反映多阶段写作过程并包含高质量评估标注的数据集。 Method: 收集计算机科学本科课程中学生的科研提案和同行/教师反馈文本,并基于教学驱动的细粒度评分标准进行人工打分;使用该数据集对开源大语言模型在提案和反馈文本的自动评分任务上进行基准测试。 Result: 最先进的LLMs在不需要领域知识的评分维度上表现良好,但在内容相关维度上性能下降;模型更倾向于与高分评分者(如教师)一致;同时发现多维度联合提示策略效果最佳。 Conclusion: Exposía为学术写作与反馈评估提供了宝贵资源,且研究表明结合多方面特征的提示方法最有效,对课堂应用具有重要意义。 Abstract: We present Exposía, the first public dataset that connects writing and feedback assessment in higher education, enabling research on educationally grounded approaches to academic writing evaluation. Exposía includes student research project proposals and peer and instructor feedback consisting of comments and free-text reviews. The dataset was collected in the "Introduction to Scientific Work" course of the Computer Science undergraduate program that focuses on teaching academic writing skills and providing peer feedback on academic writing. Exposía reflects the multi-stage nature of the academic writing process that includes drafting, providing and receiving feedback, and revising the writing based on the feedback received. Both the project proposals and peer feedback are accompanied by human assessment scores based on a fine-grained, pedagogically-grounded schema for writing and feedback assessment that we develop. We use Exposía to benchmark state-of-the-art open-source large language models (LLMs) for two tasks: automated scoring of (1) the proposals and (2) the student reviews. The strongest LLMs attain high agreement on scoring aspects that require little domain knowledge but degrade on dimensions evaluating content, in line with human agreement values. We find that LLMs align better with the human instructors giving high scores. Finally, we establish that a prompting strategy that scores multiple aspects of the writing together is the most effective, an important finding for classroom deployment.

[34] SimLLM: Fine-Tuning Code LLMs for SimPy-Based Queueing System Simulation

Jun-Qi Chen,Kun Zhang,Rui Zheng,Ying Zhong

Main category: cs.CL

TL;DR: 本文提出了一种多阶段微调框架,用于提升开源大语言模型(如Qwen-Coder和DeepSeek-Coder)在生成SimPy排队系统仿真代码方面的能力,使其在可执行性、输出格式一致性和指令遵循方面显著优于直接使用闭源模型,并降低了成本与数据隐私风险。

Details Motivation: 由于闭源大模型(如GPT-4o)在生成SimPy仿真代码时存在高成本和数据隐私问题,本文旨在探索通过领域特定微调提升开源小模型在专用任务上的表现,以提供更安全、经济的替代方案。 Method: 采用包含两个阶段监督微调(SFT)和一个阶段直接偏好优化(DPO)的多阶段微调框架,对Qwen-Coder-7B和DeepSeek-Coder-6.7B两个开源代码大模型进行训练,训练数据为精心构建的SimPy排队系统相关代码数据集。 Result: 实验表明,微调后的模型在代码可执行性、输出格式合规性和指令一致性方面均有显著提升,接近甚至部分超越闭源模型的表现,验证了该方法的有效性。 Conclusion: 领域特定的多阶段微调能够有效将紧凑型开源代码模型转化为可靠的SimPy仿真代码生成器,为教育、研究和运营决策提供了低成本、高隐私保护的实用替代方案。 Abstract: The Python package SimPy is widely used for modeling queueing systems due to its flexibility, simplicity, and smooth integration with modern data analysis and optimization frameworks. Recent advances in large language models (LLMs) have shown strong ability in generating clear and executable code, making them powerful and suitable tools for writing SimPy queueing simulation code. However, directly employing closed-source models like GPT-4o to generate such code may lead to high computational costs and raise data privacy concerns. To address this, we fine-tune two open-source LLMs, Qwen-Coder-7B and DeepSeek-Coder-6.7B, on curated SimPy queueing data, which enhances their code-generating performance in executability, output-format compliance, and instruction-code consistency. Particularly, we proposed a multi-stage fine-tuning framework comprising two stages of supervised fine-tuning (SFT) and one stage of direct preference optimization (DPO), progressively enhancing the model's ability in SimPy-based queueing simulation code generation. Extensive evaluations demonstrate that both fine-tuned models achieve substantial improvements in executability, output-format compliance, and instruct consistency. These results confirm that domain-specific fine-tuning can effectively transform compact open-source code models into reliable SimPy simulation generators which provide a practical alternative to closed-source LLMs for education, research, and operational decision support.

[35] CSR-RAG: An Efficient Retrieval System for Text-to-SQL on the Enterprise Scale

Rajpreet Singh,Novak Boškov,Lawrence Drabeck,Aditya Gudal,Manzoor A. Khan

Main category: cs.CL

TL;DR: 提出了一种新的混合检索增强生成系统CSR-RAG,用于企业级自然语言到SQL的转换,具有高召回率、高精度和低延迟。

Details Motivation: 企业级应用中需要在生成SQL前进行表检索,而现有学术基准未充分覆盖此需求。 Method: 设计了结合上下文、结构和关系检索的CSR-RAG系统,以实现高效准确的企业级数据库检索。 Result: 在企业基准测试中,CSR-RAG达到最高40%的精确率和超过80%的召回率,平均查询生成延迟仅30ms。 Conclusion: CSR-RAG在保持极低延迟的同时实现了良好的检索性能,适用于现代基于大模型的企业级系统。 Abstract: Natural language to SQL translation (Text-to-SQL) is one of the long-standing problems that has recently benefited from advances in Large Language Models (LLMs). While most academic Text-to-SQL benchmarks request schema description as a part of natural language input, enterprise-scale applications often require table retrieval before SQL query generation. To address this need, we propose a novel hybrid Retrieval Augmented Generation (RAG) system consisting of contextual, structural, and relational retrieval (CSR-RAG) to achieve computationally efficient yet sufficiently accurate retrieval for enterprise-scale databases. Through extensive enterprise benchmarks, we demonstrate that CSR-RAG achieves up to 40% precision and over 80% recall while incurring a negligible average query generation latency of only 30ms on commodity data center hardware, which makes it appropriate for modern LLM-based enterprise-scale systems.

[36] EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Pei Yang,Wanyi Chen,Ke Wang,Lynn Ai,Eric Yang,Tianyu Shi

Main category: cs.CL

TL;DR: EVM-QuestBench 是一个面向 EVM 兼容链上自然语言交易脚本生成的执行导向型基准,通过动态评估和模块化设计提升对执行准确性和安全性的评测能力。

Details Motivation: 现有评测方法常忽略区块链场景中的执行准确性和安全性,而链上交易的微小错误可能导致用户不可逆的损失,因此需要更严格的评估基准。 Method: 提出 EVM-QuestBench,采用动态评估机制:从模板池采样指令,从预定义区间抽取数值参数,并在分叉的 EVM 链上通过快照隔离执行脚本,使用验证器核验结果;任务分为原子和复合两类,后者引入步骤效率衰减。 Result: 包含 107 个任务(62 个原子,45 个复合),模块化架构支持快速扩展;评估了 20 个模型,发现性能差距显著,单动作精度与多步流程完成之间存在持续不对称性。 Conclusion: EVM-QuestBench 能有效评估自然语言到链上交易脚本生成的准确性与安全性,揭示当前模型在复杂任务上的不足,为后续研究提供可靠基准。 Abstract: Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.

[37] Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning

Yusuke Yamauchi,Akiko Aizawa

Main category: cs.CL

TL;DR: 本文提出了一种通过在超球面上进行对比学习,将圆形情感表示引入语言模型嵌入的方法,探讨了心理环形模型在深度学习架构中的应用效果与权衡。

Details Motivation: 长期以来,心理学研究使用环形模型来组织情绪,但这些模型很少被直接整合到语言模型的表示学习中,其几何有效性尚未得到充分探索。 Method: 通过在超球面上进行对比学习,诱导语言模型嵌入中的环形情感表示。 Result: 圆形对齐在可解释性和对抗降维的鲁棒性方面表现更优,但在高维设置和细粒度分类任务上不如传统设计。 Conclusion: 将心理环形模型应用于深度学习架构存在权衡:虽然提升了可解释性,但在某些性能指标上有所牺牲。 Abstract: Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.

[38] Stylistic Evolution and LLM Neutrality in Singlish Language

Linus Tze En Foo,Weihan Angela Ng,Wenkai Li,Lynnette Hui Xian Ng

Main category: cs.CL

TL;DR: 该研究通过分析十年间的非正式数字文本,探讨了新加坡英语(Singlish)的语言演变,并提出一种风格相似性框架来量化其历时变化,同时评估了大语言模型在生成具有时间中立性的Singlish文本方面的局限性。

Details Motivation: 理解Singlish在社会与技术变迁背景下的动态演化过程,并探究当前大语言模型在模拟社会方言和时序变异方面的表现与不足。 Method: 提出一个风格相似性框架,比较词汇结构、语用、心理语言学及编码器衍生特征在不同年份之间的差异,以量化Singlish的历时变化,并测试大语言模型在提示和微调后生成时间中立文本的能力。 Result: 发现Singlish在语气、表达性和句子结构方面存在显著的历时变化;尽管大语言模型能生成表面真实的Singlish文本,但仍保留可检测的时间信号,无法实现真正的时间中立输出。 Conclusion: Singlish持续动态演变,而当前的大语言模型在建模社会方言和时序变异方面仍有局限,需进一步改进以捕捉真实语言的复杂演化。 Abstract: Singlish is a creole rooted in Singapore's multilingual environment and continues to evolve alongside social and technological change. This study investigates the evolution of Singlish over a decade of informal digital text messages. We propose a stylistic similarity framework that compares lexico-structural, pragmatic, psycholinguistic, and encoder-derived features across years to quantify temporal variation. Our analysis reveals notable diachronic changes in tone, expressivity and sentence construction over the years. Conversely, while some LLMs were able to generate superficially realistic Singlish messages, they do not produce temporally neutral outputs, and residual temporal signals remain detectable despite prompting and fine-tuning. Our findings highlight the dynamic evolution of Singlish, as well as the capabilities and limitations of current LLMs in modeling sociolectal and temporal variations in the colloquial language.

[39] Detecting LLM-Generated Text with Performance Guarantees

Hongyi Zhou,Jin Zhu,Ying Yang,Chengchun Shi

Main category: cs.CL

TL;DR: 本文提出了一种新的LLM生成文本检测器StatDetectLLM,能够在无需水印或特定模型信息的情况下,高效准确地区分人类和LLM生成的文本,并支持统计推断。

Details Motivation: 由于大型语言模型(LLM)生成的文本高度拟人化,可能引发虚假新闻、误导性报告和学术不端等问题,因此需要有效的检测工具来识别文本是否由LLM生成。 Method: 训练了一个基于CPU平台的分类器,不依赖水印或具体LLM信息,通过机器学习方法区分人类与LLM撰写的文本,并引入统计推断功能。 Result: 该分类器在Hugging Face平台上部署,相比现有检测器具有更高的分类准确率、良好的I类错误控制、高统计功效和计算效率。 Conclusion: 所提出的检测器在实用性、准确性与统计严谨性方面优于现有方法,为应对LLM滥用提供了有效工具。 Abstract: Large language models (LLMs) such as GPT, Claude, Gemini, and Grok have been deeply integrated into our daily life. They now support a wide range of tasks -- from dialogue and email drafting to assisting with teaching and coding, serving as search engines, and much more. However, their ability to produce highly human-like text raises serious concerns, including the spread of fake news, the generation of misleading governmental reports, and academic misconduct. To address this practical problem, we train a classifier to determine whether a piece of text is authored by an LLM or a human. Our detector is deployed on an online CPU-based platform https://huggingface.co/spaces/stats-powered-ai/StatDetectLLM, and contains three novelties over existing detectors: (i) it does not rely on auxiliary information, such as watermarks or knowledge of the specific LLM used to generate the text; (ii) it more effectively distinguishes between human- and LLM-authored text; and (iii) it enables statistical inference, which is largely absent in the current literature. Empirically, our classifier achieves higher classification accuracy compared to existing detectors, while maintaining type-I error control, high statistical power, and computational efficiency.

[40] How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

Shivam Adarsh,Maria Maistro,Christina Lioma

Main category: cs.CL

TL;DR: 首次从几何角度研究上下文如何改变大语言模型中真值向量的方向和幅度,发现不同规模模型对相关与无关上下文的响应机制不同。

Details Motivation: 理解上下文如何影响大语言模型中真值表示仍属空白,尤其是真值向量在引入上下文后的变化机制尚不清楚。 Method: 通过测量有无上下文时真值向量之间的方向变化(θ)和幅度变化,在四个大模型和四个数据集上进行分析。 Result: 发现早期层中真值向量近似正交,中层趋于收敛;加入上下文通常增大向量幅度,增强真假分离;大模型主要通过方向变化区分上下文相关性,小模型则依赖幅度差异;与参数知识冲突的上下文引发更大的几何变化。 Conclusion: 这是首个对上下文如何在激活空间中变换真值向量进行几何刻画的研究,揭示了上下文与模型规模在真值表示中的交互机制。 Abstract: Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work, however how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($θ$) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($θ$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.

[41] Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

Jen-tse Huang,Chang Chen,Shiyang Lai,Wenxuan Wang,Michelle R. Kaufman,Mark Dredze

Main category: cs.CL

TL;DR: 本文提出了一种针对多模态大语言模型(MLLMs)在短视频虚假信息识别中的评估框架,使用包含200个视频的高质量标注数据集,涵盖四种健康领域,并细粒度标注了欺骗模式、实验错误、逻辑谬误和伪造声明。研究评估了八种前沿MLLM在五种模态设置下的表现,发现Gemini-2.5-Pro在多模态下表现最佳,而o3最差,并揭示了模型易受权威频道等社会线索认知偏差的影响。

Details Motivation: 由于短视频平台成为虚假信息传播的主要渠道,且现有MLLMs在应对与认知偏差交织的多媒体误导内容时鲁棒性尚不明确,因此需要系统评估其在真实场景下的辨识能力。 Method: 构建了一个高质量、人工标注的包含200个短视频的数据集,覆盖四个健康领域,标注三类欺骗模式(实验错误、逻辑谬误、伪造主张),并基于国家标准和学术文献进行验证;在五种模态设置下评估八种前沿MLLM的表现,并分析社会线索对模型判断的影响。 Result: Gemini-2.5-Pro在多模态设置中表现最好(信念得分71.5/100),o3最差(35.2);模型易受如权威频道ID等社会线索引发的认知偏差影响。 Conclusion: 当前MLLMs在识别短视频中的复杂误导信息方面仍有显著局限,尤其在处理社会线索诱导的认知偏差时表现不佳,亟需提升其推理鲁棒性和抗偏见能力。 Abstract: Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns, experimental errors, logical fallacies, and fabricated claims, each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

[42] N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs

Mohamed Sharafath,Aravindh Annamalai,Ganesh Murugan,Aravindakumar Venugopalan

Main category: cs.CL

TL;DR: N2N-GQA是首个用于开放域混合表格-文本问答的零样本框架,通过将检索结果构建成动态证据图,显著提升了多跳问答性能,在OTT-QA上达到48.80 EM,接近微调模型的表现。

Details Motivation: 标准的RAG流程将文档处理为扁平列表,导致检索噪声干扰多跳推理过程,难以捕捉证据间的关联。 Method: 提出N2N-GQA框架,将检索到的文档作为图节点,利用语义关系构建边,形成动态证据图,识别连接推理步骤的关键桥梁文档。 Result: 在OTT-QA数据集上,基于图的证据整理带来了19.9点EM提升,N2N-GQA达到48.80 EM,与微调模型相当(CORE: 49.0 EM),接近优化系统(COS: 56.9 EM)。 Conclusion: 结构化的图组织证据对可扩展的零样本多跳问答至关重要,简单的可解释图构建方法可媲美复杂的微调方法。 Abstract: Multi-hop question answering over hybrid table-text data requires retrieving and reasoning across multiple evidence pieces from large corpora, but standard Retrieval-Augmented Generation (RAG) pipelines process documents as flat ranked lists, causing retrieval noise to obscure reasoning chains. We introduce N2N-GQA. To our knowledge, it is the first zeroshot framework for open-domain hybrid table-text QA that constructs dynamic evidence graphs from noisy retrieval outputs. Our key insight is that multi-hop reasoning requires understanding relationships between evidence pieces: by modeling documents as graph nodes with semantic relationships as edges, we identify bridge documents connecting reasoning steps, a capability absent in list-based retrieval. On OTT-QA, graph-based evidence curation provides a 19.9-point EM improvement over strong baselines, demonstrating that organizing retrieval results as structured graphs is critical for multihop reasoning. N2N-GQA achieves 48.80 EM, matching finetuned retrieval models (CORE: 49.0 EM) and approaching heavily optimized systems (COS: 56.9 EM) without any task specific training. This establishes graph-structured evidence organization as essential for scalable, zero-shot multi-hop QA systems and demonstrates that simple, interpretable graph construction can rival sophisticated fine-tuned approaches.

[43] Pragya: An AI-Based Semantic Recommendation System for Sanskrit Subhasitas

Tanisha Raorane,Prasenjit Kole

Main category: cs.CL

TL;DR: 本文提出了Pragya,一个基于检索增强生成(RAG)框架的梵文格言(Subhasitas)语义推荐系统,结合IndicBERT和Mistral大模型实现主题检索、翻译与解释,显著提升文化内容的可访问性。

Details Motivation: Sanskrit Subhasitas蕴含丰富的文化与哲学智慧,但因语言和语境障碍在数字时代难以传播,亟需技术手段促进其现代应用。 Method: 构建包含200条带主题标注的Subhasitas数据集,使用IndicBERT生成句子嵌入进行语义检索,并通过Mistral大语言模型生成音译、翻译和上下文解释。 Result: 语义检索在准确率和相关性上显著优于关键词匹配,用户研究显示生成式摘要有效提升了内容可理解性和可访问性。 Conclusion: Pragya是首个将检索与生成结合用于Sanskrit Subhasitas的系统,成功连接文化遗产与现代AI技术,为古典文本的数字化传播提供了新范式。 Abstract: Sanskrit Subhasitas encapsulate centuries of cultural and philosophical wisdom, yet remain underutilized in the digital age due to linguistic and contextual barriers. In this work, we present Pragya, a retrieval-augmented generation (RAG) framework for semantic recommendation of Subhasitas. We curate a dataset of 200 verses annotated with thematic tags such as motivation, friendship, and compassion. Using sentence embeddings (IndicBERT), the system retrieves top-k verses relevant to user queries. The retrieved results are then passed to a generative model (Mistral LLM) to produce transliterations, translations, and contextual explanations. Experimental evaluation demonstrates that semantic retrieval significantly outperforms keyword matching in precision and relevance, while user studies highlight improved accessibility through generated summaries. To our knowledge, this is the first attempt at integrating retrieval and generation for Sanskrit Subhasitas, bridging cultural heritage with modern applied AI.

[44] Efficient and Reliable Estimation of Named Entity Linking Quality: A Case Study on GutBrainIE

Marco Martinelli,Stefano Marchesin,Gianmaria Silvello

Main category: cs.CL

TL;DR: 提出了一种基于分层两阶段聚类抽样的框架,用于在有限标注预算下以统计保证估计大规模生物医学命名实体链接(NEL)的准确性。

Details Motivation: 由于专家标注成本高且语料库规模大,大规模评估生物医学命名实体链接(NEL)的质量具有挑战性。 Method: 将NEL准确率估计建模为一个约束优化问题,采用分层两阶段聚类抽样(STWCS),结合标签分层和全局表面形式聚类,在不依赖NEL标注的情况下进行抽样设计。 Result: 在GutBrainIE语料库上,仅需标注2,749个三元组(占24.6%),即可达到≤0.05的误差范围,整体准确率为0.915±0.0473;相比简单随机抽样,相同样本量下可减少约29%的专家标注时间。 Conclusion: 该框架具有通用性,可用于其他NEL基准和需要可扩展、统计稳健准确率评估的信息抽取系统。 Abstract: Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging due to the high cost of expert annotations and the large size of corpora. In this paper, we present a sampling-based framework to estimate the NEL accuracy of large-scale IE corpora under statistical guarantees and constrained annotation budgets. We frame NEL accuracy estimation as a constrained optimization problem, where the objective is to minimize expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent works on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of NEL annotations. Applied to 11,184 NEL annotations in GutBrainIE -- a new biomedical corpus openly released in fall 2025 -- our framework reaches a MoE $\leq 0.05$ by manually annotating only 2,749 triples (24.6%), leading to an overall accuracy estimate of $0.915 \pm 0.0473$. A time-based cost model and simulations against a Simple Random Sampling (SRS) baseline show that our design reduces expert annotation time by about 29% at fixed sample size. The framework is generic and can be applied to other NEL benchmarks and IE pipelines that require scalable and statistically robust accuracy assessment.

[45] Labels have Human Values: Value Calibration of Subjective Tasks

Mohammed Fayiz Parappan,Ricardo Henao

Main category: cs.CL

TL;DR: 本文提出了MultiCalibrated Subjective Task Learner (MC-STL) 框架,通过将标注聚类为不同的人类价值观簇并进行簇特异性校准,提升NLP系统在主观任务中的表现。

Details Motivation: 主观任务中的标注往往反映不同的价值观,忽略这种差异会导致模型偏差和不一致的预测,因此需要一种能够识别并适应多种人类价值观的方法。 Method: MC-STL通过三种方式(标注理由的相似性、专家价值观分类或评分者的社会文化描述)对标注进行聚类,并为每个簇学习特定的嵌入表示以校准预测。 Result: 在多个数据集和任务(如毒性检测、冒犯性内容识别和偏好对齐)中,MC-STL在判别能力、价值特异性校准和分歧感知指标上均优于忽略潜在价值观结构的基线方法。 Conclusion: MC-STL能有效建模主观任务中的多元价值观,提升模型的公平性和预测性能,适用于多种NLP主观学习场景。 Abstract: Building NLP systems for subjective tasks requires one to ensure their alignment to contrasting human values. We propose the MultiCalibrated Subjective Task Learner framework (MC-STL), which clusters annotations into identifiable human value clusters by three approaches (similarity of annotator rationales, expert-value taxonomies or rater's sociocultural descriptors) and calibrates predictions for each value cluster by learning cluster-specific embeddings. We demonstrate MC-STL on several subjective learning settings, including ordinal, binary, and preference learning predictions, and evaluate it on multiple datasets covering toxic chatbot conversations, offensive social media posts, and human preference alignment. The results show that MC-STL consistently outperforms the baselines that ignore the latent value structure of the annotations, delivering gains in discrimination, value-specific calibration, and disagreement-aware metrics.

[46] MedEinst: Benchmarking the Einstellung Effect in Medical LLMs through Counterfactual Differential Diagnosis

Wenting Chen,Zhongrui Zhu,Guolin Huang,Wenxuan Wang

Main category: cs.CL

TL;DR: 本文提出MedEinst,一个用于检测大语言模型在临床诊断中因Einstellung效应而误诊的反事实基准,并提出ECR-Agent框架以基于证据的推理机制提升诊断鲁棒性。

Details Motivation: 现有医学基准无法检测大模型在临床诊断中依赖统计捷径而非个体化证据的问题(即Einstellung效应),导致在非典型病例中出现误诊。 Method: 构建包含5,383对临床案例的MedEinst基准,每对包括控制组和陷阱组;提出ECR-Agent,包含动态因果推理(DCI)和批评驱动的图与记忆演化(CGME)两个模块,实现结构化、可审计的诊断推理。 Result: 在17个LLM上的实验显示,尽管前沿模型基线准确率高,但普遍存在严重偏见陷阱率;ECR-Agent显著降低该问题,提升诊断可靠性。 Conclusion: 单纯高准确率不足以保证临床安全性,需通过结构化因果推理和持续知识进化来对抗模型中的固有偏见,ECR-Agent为迈向可靠AI诊断提供了可行路径。 Abstract: Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung Effect in clinical diagnosis--relying on statistical shortcuts rather than patient-specific evidence, causing misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a "trap" case with altered discriminative evidence that flips the diagnosis. We measure susceptibility via Bias Trap Rate--probability of misdiagnosing traps despite correctly diagnosing controls. Extensive Evaluation of 17 LLMs shows frontier models achieve high baseline accuracy but severe bias trap rates. Thus, we propose ECR-Agent, aligning LLM reasoning with Evidence-Based Medicine standard via two components: (1) Dynamic Causal Inference (DCI) performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and evidence audit for final diagnosis; (2) Critic-Driven Graph and Memory Evolution (CGME) iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. Source code is to be released.

[47] Efficient Aspect Term Extraction using Spiking Neural Network

Abhishek Kumar Mishra,Arya Somasundaram,Anup Das,Nagarajan Kandasamy

Main category: cs.CL

TL;DR: 本文提出了一种基于脉冲神经网络(SNN)的高效节能方法SpikeATE,用于方面项提取(ATE)任务,在保持与主流深度神经网络相当性能的同时显著降低能耗。

Details Motivation: 现有的ATE方法多依赖高能耗的深度神经网络,缺乏能效友好的替代方案,因此需要探索更可持续的模型。 Method: 提出SpikeATE模型,采用具有稀疏激活和事件驱动推理的SNN,利用三元脉冲神经元和直接脉冲训练结合伪梯度微调来捕捉词间时序依赖。 Result: 在四个SemEval基准数据集上,SpikeATE的性能与最先进的DNN模型相当,但能耗显著更低。 Conclusion: SNN是一种实用且可持续的ATE解决方案,为低功耗自然语言处理任务提供了新方向。 Abstract: Aspect Term Extraction (ATE) identifies aspect terms in review sentences, a key subtask of sentiment analysis. While most existing approaches use energy-intensive deep neural networks (DNNs) for ATE as sequence labeling, this paper proposes a more energy-efficient alternative using Spiking Neural Networks (SNNs). Using sparse activations and event-driven inferences, SNNs capture temporal dependencies between words, making them suitable for ATE. The proposed architecture, SpikeATE, employs ternary spiking neurons and direct spike training fine-tuned with pseudo-gradients. Evaluated on four benchmark SemEval datasets, SpikeATE achieves performance comparable to state-of-the-art DNNs with significantly lower energy consumption. This highlights the use of SNNs as a practical and sustainable choice for ATE tasks.

[48] Do Language Models Reason Across Languages?

Yan Meng,Wafaa Mohammed,Christof Monz

Main category: cs.CL

TL;DR: 本文提出了一种双跳问答设置,研究语言模型在多语言环境下跨文档推理的能力,发现模型对答案所在文档的语言变化更敏感,并揭示其推理过程缺乏忠实的分步分解,导致组合错误;为此提出了一种三阶段SUBQ提示方法,显著提升了准确率。

Details Motivation: 探究语言模型是否能在多语言信息源中进行有效的跨语言推理,特别是在需要多步推理的场景下,理解模型在不同语言变体中的表现差异及其推理忠实性。 Method: 设计了一个简单的双跳问答任务,通过两份多语言文档进行推理,并采用子问题逐步评估的方式分析模型的推理过程;提出了三阶段SUBQ提示方法来引导多步推理。 Result: 发现语言模型对答案段落文档的语言变化更敏感;最多33%的情况下模型未能正确推导桥梁信息却仍给出正确答案;约18%的组合失败案例中,子问题正确但最终双跳问题失败;SUBQ提示方法将准确率从10.1%提升至66.5%。 Conclusion: 语言模型在多语言设置下的推理并不遵循忠实的分步逻辑,缺乏有效的推理分解会导致组合错误;提出的SUBQ提示方法能有效改善多步推理性能。 Abstract: The real-world information sources are inherently multilingual, which naturally raises a question about whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information, despite the equal importance of both documents for answering a question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to around 18% composition failure, where both sub-questions are answered correctly but fail for the final two-hop questions. To mitigate this, we propose a simple three-stage SUBQ prompting method to guide the multi-step reasoning with sub-questions, which boosts accuracy from 10.1% to 66.5%.

[49] What makes for an enjoyable protagonist? An analysis of character warmth and competence

Hannes Rosenbusch

Main category: cs.CL

TL;DR: 电影主角的温暖和能力特质对IMDb评分有理论一致但影响较小的正向作用,且这些效应在不同类型片中无显著差异;男性主角略显冷漠但男性主导电影评分更高,表明角色性格仅是影响评分的众多因素之一。

Details Motivation: 基于心理学和文学理论,探讨电影主角的温暖与能力特质是否影响观众评分,并检验这种影响是否因电影类型而异。 Method: 使用包含2,858部影视作品的Movie Scripts Corpus,通过AI辅助标注识别主角,并利用LLM_annotate工具包量化其温暖与能力水平;采用预注册的贝叶斯回归分析检验特质与IMDb评分的关系及其在类型上的调节效应。 Result: 发现主角的温暖和能力与IMDb评分呈理论预期方向的小效应正相关,类型调节作用不显著;男性主角比女性主角略缺乏温暖,且男性主导电影评分更高,该关联强度远超温暖/能力本身的影响;AI标注效果总体良好(人-LLM一致性r = .83),但偶有不足。 Conclusion: 尽管观众偏好温暖、有能力的角色,但其对电影评分的影响有限,说明角色性格只是影响评价的诸多因素之一;AI辅助标注适用于大规模文本分析,具备较高可靠性。 Abstract: Drawing on psychological and literary theory, we investigated whether the warmth and competence of movie protagonists predict IMDb ratings, and whether these effects vary across genres. Using 2,858 films and series from the Movie Scripts Corpus, we identified protagonists via AI-assisted annotation and quantified their warmth and competence with the LLM_annotate package ([1]; human-LLM agreement: r = .83). Preregistered Bayesian regression analyses revealed theory-consistent but small associations between both warmth and competence and audience ratings, while genre-specific interactions did not meaningfully improve predictions. Male protagonists were slightly less warm than female protagonists, and movies with male leads received higher ratings on average (an association that was multiple times stronger than the relationships between movie ratings and warmth/competence). These findings suggest that, although audiences tend to favor warm, competent characters, the effects on movie evaluations are modest, indicating that character personality is only one of many factors shaping movie ratings. AI-assisted annotation with LLM_annotate and gpt-4.1-mini proved effective for large-scale analyses but occasionally fell short of manually generated annotations.

[50] InFi-Check: Interpretable and Fine-Grained Fact-Checking of LLMs

Yuzhuo Bai,Shuzheng Si,Kangyang Luo,Qingyi Wang,Wenhao Li,Gang Chen,Fanchao Qi,Maosong Sun

Main category: cs.CL

TL;DR: 本文提出了InFi-Check框架,用于对大语言模型(LLM)输出进行可解释且细粒度的事实核查,包括数据合成、基准构建和InFi-Checker模型,实现了最先进的性能和强泛化能力。

Details Motivation: 现有事实核查方法多为二分类,缺乏可解释性且无法识别细粒度错误类型,难以有效应对LLM幻觉问题。 Method: 提出可控的数据合成流程生成带证据、错误类型标签、解释和修正的高质量数据;构建大规模训练数据和人工验证的基准InFi-Check-FG;基于此训练InFi-Checker模型,联合输出证据、错误分类、解释与修正。 Result: InFi-Checker在InFi-Check-FG基准上达到SOTA性能,并在多种下游任务中表现出强泛化能力。 Conclusion: InFi-Check框架显著提升了事实性评估的实用性与可信度,推动了对LLM幻觉的细粒度、可解释检测。 Abstract: Large language models (LLMs) often hallucinate, yet most existing fact-checking methods treat factuality evaluation as a binary classification problem, offering limited interpretability and failing to capture fine-grained error types. In this paper, we introduce InFi-Check, a framework for interpretable and fine-grained fact-checking of LLM outputs. Specifically, we first propose a controlled data synthesis pipeline that generates high-quality data featuring explicit evidence, fine-grained error type labels, justifications, and corrections. Based on this, we further construct large-scale training data and a manually verified benchmark InFi-Check-FG for fine-grained fact-checking of LLM outputs. Building on these high-quality training data, we further propose InFi-Checker, which can jointly provide supporting evidence, classify fine-grained error types, and produce justifications along with corrections. Experiments show that InFi-Checker achieves state-of-the-art performance on InFi-Check-FG and strong generalization across various downstream tasks, significantly improving the utility and trustworthiness of factuality evaluation.

[51] Will it Merge? On The Causes of Model Mergeability

Adir Rahamim,Asaf Yehudai,Boaz Carmeli,Leshem Choshen,Yosi Mass,Yonatan Belinkov

Main category: cs.CL

TL;DR: 本文研究了模型融合的成功因素,提出了可衡量的融合性定义,并发现基础模型的知识是影响融合效果的主要因素,进而探索了一种加权融合方法以更好保留基础模型中的弱知识。

Details Motivation: 理解为何某些模型比其他模型更容易成功融合,揭示模型融合成功或失败的关键因素。 Method: 提出一个可测量的融合性定义,并分析多种可能导致高或低融合性的因素,重点考察基础模型知识的影响,同时探索一种简单的加权融合技术。 Result: 发现基础模型对训练实例的掌握程度显著影响融合性:基础模型已较好掌握的实例所对应的微调模型更具可融合性;基于此提出加权融合方法能更好保留弱知识。 Conclusion: 基础模型的知识水平是决定模型融合成败的关键因素,合理的加权融合策略可以提升融合效果,尤其是在保留较弱知识方面表现更优。 Abstract: Model merging has emerged as a promising technique for combining multiple fine-tuned models into a single multitask model without retraining. However, the factors that determine whether merging will succeed or fail remain poorly understood. In this work, we investigate why specific models are merged better than others. To do so, we propose a concrete, measurable definition of mergeability. We investigate several potential causes for high or low mergeability, highlighting the base model knowledge as a dominant factor: Models fine-tuned on instances that the base model knows better are more mergeable than models fine-tuned on instances that the base model struggles with. Based on our mergeability definition, we explore a simple weighted merging technique that better preserves weak knowledge in the base model.

[52] Evaluating Cross-Lingual Unlearning in Multilingual Language Models

Tyler Lizzo,Larry Heck

Main category: cs.CL

TL;DR: 本文首次全面评估了多语言大模型中的跨语言遗忘,发现大多数遗忘算法在非训练语言中无法有效删除事实,而子空间投影方法表现优异,揭示了权重空间中的共享语言结构对跨语言遗忘的重要性。

Details Motivation: 研究多语言大模型中跨语言遗忘的有效性,探索现有遗忘算法在不同语言和脚本变体下的表现及其局限性。 Method: 使用七种语言/脚本变体的翻译版TOFU基准测试主流遗忘算法,并分析任务子空间的几何特性。 Result: 大多数遗忘算法在非训练语言中失败,但子空间投影方法能实现强跨语言遗忘且性能损失最小;分析显示存在共享的中间语言结构,移除它会影响所有语言,而移除语言特定成分则仅影响单一语言。 Conclusion: 多语言遗忘依赖于权重空间的几何结构,支持未来采用基于子空间的方法来构建更有效的遗忘系统。 Abstract: We present the first comprehensive evaluation of cross-lingual unlearning in multilingual LLMs. Using translated TOFU benchmarks in seven language/script variants, we test major unlearning algorithms and show that most fail to remove facts outside the training language, even when utility remains high. However, subspace-projection consistently outperforms the other methods, achieving strong cross-lingual forgetting with minimal degradation. Analysis of learned task subspaces reveals a shared interlingua structure: removing this shared subspace harms all languages, while removing language-specific components selectively affects one. These results demonstrate that multilingual forgetting depends on geometry in weight space, motivating subspace-based approaches for future unlearning systems.

[53] IDRBench: Interactive Deep Research Benchmark

Yingchaojie Feng,Qiang Huang,Xiaoya Xie,Zhaorui Yang,Jun Yu,Wei Chen,Anthony K. H. Tung

Main category: cs.CL

TL;DR: 本文提出了IDRBench,首个用于系统评估交互式深度研究的基准,结合多智能体框架、用户模拟器和交互感知评估体系,揭示了交互对研究质量与效率的权衡影响。

Details Motivation: 现有深度研究系统多为自主运行,忽视动态用户反馈和交互成本,难以应对实际中研究目标模糊且不断演变的情况,因此需要一个能衡量交互效果与代价的新基准。 Method: 提出IDRBench,包含模块化的多智能体研究框架、按需交互机制、可扩展的参考导向用户模拟器,以及联合评估交互收益(质量与对齐性)和成本(轮次与令牌消耗)的交互感知评估套件。 Result: 在七个主流大语言模型上的实验表明,交互显著提升研究质量和鲁棒性,常超越模型能力差异的影响,但也暴露出交互效率上的显著权衡。 Conclusion: 交互是提升深度研究系统性能的关键因素,IDRBench为未来研究提供了评估和优化人机协作的新标准。 Abstract: Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

[54] Characterising Toxicity in Generative Large Language Models

Zhiyao Zhang,Yazan Mash'Al,Yuhan Wu

Main category: cs.CL

TL;DR: 本文研究了大型语言模型在提示下生成有毒内容的程度,并探讨了影响生成毒性输出的语言学因素(包括词汇和句法层面)。

Details Motivation: 尽管注意力机制和Transformer架构推动了自然语言处理的发展,但语言模型仍可能生成不当、冒犯性或有害的“有毒”输出,且现有对齐方法(如基于人类反馈的强化学习)容易被绕过,因此需要深入探究毒性生成的原因和影响因素。 Method: 分析大型语言模型在不同提示下的输出,考察其生成毒性内容的倾向,并从词汇和句法两个层面分析影响毒性生成的语言学因素。 Result: 揭示了语言模型在特定提示下生成有毒内容的程度,并识别出影响毒性输出的关键语言学特征。 Conclusion: 语言模型的毒性生成不仅受提示内容影响,还与特定的词汇和句法结构密切相关,未来需结合语言学洞察设计更鲁棒的缓解策略。 Abstract: In recent years, the advent of the attention mechanism has significantly advanced the field of natural language processing (NLP), revolutionizing text processing and text generation. This has come about through transformer-based decoder-only architectures, which have become ubiquitous in NLP due to their impressive text processing and generation capabilities. Despite these breakthroughs, language models (LMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses. We will collectively refer to these as ``toxic'' outputs. Although methods like reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. Therefore, this paper examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors -- both lexical and syntactic -- that influence the production of such outputs in generative models.

[55] GRASP LoRA: GRPO Guided Adapter Sparsity Policy for Cross Lingual Transfer

Besher Hassan,Xiuying Chen

Main category: cs.CL

TL;DR: 本文提出了GRASP LoRA,一种基于GRPO控制器的参数高效微调方法,将全局稀疏性作为可学习变量,动态优化适配器剪枝比例,显著降低计算成本并提升跨语言迁移性能。

Details Motivation: 现有的适配器微调方法通常通过网格搜索选择固定的全局剪枝比例,该过程计算昂贵、依赖大量开发集且无法找到最优解。因此需要一种更高效、自适应的稀疏策略。 Method: 引入GRASP LoRA,使用GRPO控制器在训练过程中周期性地在小型微开发集上探测候选剪枝比例,并根据奖励信号在线更新单一全局剪枝比例;基于合并的源和目标LoRA适配器,在冻结主干模型上操作,最终以固定比例进行一次合并与剪枝微调。 Result: 在从英语到阿拉伯语和中文的跨语言迁移任务中(如XL-Sum摘要和MLQA问答),GRASP LoRA在语义保真度、内容覆盖率和答案质量上优于强基线方法,相比网格搜索显著减少端到端运行时间,降低对大型开发集的依赖。 Conclusion: GRASP LoRA通过将剪枝比例学习自动化,取代了繁琐的网格搜索,实现了更高效、实用的适配器重用,特别适合低资源部署场景。 Abstract: Parameter efficient fine tuning is a way to adapt LLMs to new languages when compute or data are limited, yet adapter pipelines usually choose a global prune ratio by grid search. This practice is computationally expensive and development set intensive, since it repeats training, freezes sparsity, and misses fractional optima. We introduce GRASP LoRA (GRPO Guided Adapter Sparsity Policy), which treats global sparsity as a learnable control variable. A GRPO controller interleaves with training, periodically probing candidate prune ratios on a small micro development set and updating a single global prune ratio online from its reward signal. It operates on merged source and target LoRA adapters on a frozen backbone and replaces grid search with one controller run that learns a prune ratio, followed by a single final merge and prune fine tuning run with pruning fixed to that ratio. On cross lingual transfer from English into Arabic and Chinese, including XL-Sum summarization and MLQA extractive question answering with Llama 3 8B, GRASP LoRA improves semantic faithfulness, content coverage, and answer quality over strong target only and merge and prune baselines. It reduces end to end runtime by multiple times relative to grid search, lowers reliance on large development sets, and makes adapter reuse practical for low resource deployment.

[56] Evaluating Accounting Reasoning Capabilities of Large Language Models

Jie Zhou,Xin Chen,Jie Zhang,Hai Li,Jie Wang,Zhe Li

Main category: cs.CL

TL;DR: 本文提出了垂直领域会计推理的概念,并基于GLM模型训练数据特征设计了评估标准,用于系统研究和提升大模型在会计领域的推理能力。

Details Motivation: 大语言模型正在改变多个领域的学习与研究方式,但在专业领域(如会计)的有效集成仍是企业数字化转型的关键挑战。 Method: 通过分析代表性GLM模型的训练数据特征,提出针对会计推理任务的评估标准,并在GLM-6B、GLM-130B、GLM-4和GPT-4上进行实验评估。 Result: 提示词设计显著影响模型表现,GPT-4展现出最强的会计推理能力,但现有模型仍不足以满足实际企业会计应用需求。 Conclusion: 当前大模型在会计推理方面仍有局限,需进一步优化以释放其在企业实践中的全部价值。 Abstract: Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains, such as accounting, is a key challenge for enterprise digital transformation. To address this, we define vertical domain accounting reasoning and propose evaluation criteria derived from an analysis of the training data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. Results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.

[57] Towards Computational Chinese Paleography

Yiran Rex Ma

Main category: cs.CL

TL;DR: 本文探讨了人工智能驱动下古文字学的计算转型,从自动化视觉任务发展为集成化的数字研究生态系统。

Details Motivation: 推动古文字学与人工智能深度融合,解决当前研究中数据稀缺和AI与人文学科需求脱节的问题。 Method: 梳理数字资源与技术演进路径,分析从传统计算机视觉到深度学习、Transformer及多模态大模型的方法转变。 Result: 总结了该领域的核心挑战,并提出了面向少样本、多模态和以人为中心的研究方向。 Conclusion: 未来应构建支持学术协作的智能系统,增强而非替代人文研究者的专业能力。 Abstract: Chinese paleography, the study of ancient Chinese writing, is undergoing a computational turn powered by artificial intelligence. This position paper charts the trajectory of this emerging field, arguing that it is evolving from automating isolated visual tasks to creating integrated digital ecosystems for scholarly research. We first map the landscape of digital resources, analyzing critical datasets for oracle bone, bronze, and bamboo slip scripts. The core of our analysis follows the field's methodological pipeline: from foundational visual processing (image restoration, character recognition), through contextual analysis (artifact rejoining, dating), to the advanced reasoning required for automated decipherment and human-AI collaboration. We examine the technological shift from classical computer vision to modern deep learning paradigms, including transformers and large multimodal models. Finally, we synthesize the field's core challenges -- notably data scarcity and a disconnect between current AI capabilities and the holistic nature of humanistic inquiry -- and advocate for a future research agenda focused on creating multimodal, few-shot, and human-centric systems to augment scholarly expertise.

[58] MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

Zheyuan Liu,Dongwhi Kim,Yixin Wan,Xiangchi Yuan,Zhaoxuan Tan,Fengran Mo,Meng Jiang

Main category: cs.CL

TL;DR: 本文提出了MTMCS-Bench,一个用于评估多模态大语言模型在多轮对话中上下文安全性的新基准,揭示了现有模型在逐步风险识别与实用性之间的权衡问题。

Details Motivation: 现有的上下文安全性评测基准多为单轮对话,难以捕捉恶意意图的渐进发展以及同一视觉场景下可能存在的良性与滥用行为,因此需要更贴近现实的多轮评测机制。 Method: 构建了一个包含超过3万个多模态和单模态样本的多轮多模态上下文安全性基准(MTMCS-Bench),设计了升级型风险和情境切换风险两种评估模式,并采用成对的安全与不安全对话进行结构化评估。 Result: 在8个开源和7个专有MLLM上的实验表明,模型普遍存在安全性与实用性的权衡:要么忽略渐进风险,要么过度拒绝安全请求;当前的五种防护机制能缓解部分问题,但无法完全解决多轮上下文风险。 Conclusion: MTMCS-Bench为评估MLLM在真实多轮交互中的上下文安全性提供了有效工具,揭示了现有模型和防护机制的局限性,推动更细粒度的安全性研究。 Abstract: Multimodal large language models (MLLMs) are increasingly deployed as assistants that interact through text and images, making it crucial to evaluate contextual safety when risk depends on both the visual scene and the evolving dialogue. Existing contextual safety benchmarks are mostly single-turn and often miss how malicious intent can emerge gradually or how the same scene can support both benign and exploitative goals. We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings, escalation-based risk and context-switch risk. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. It contains over 30 thousand multimodal (image+text) and unimodal (text-only) samples, with metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Across eight open-source and seven proprietary MLLMs, we observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues. Finally, we evaluate five current guardrails and find that they mitigate some failures but do not fully resolve multi-turn contextual risks.

[59] GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

Shubhashis Roy Dipta,Khairul Mahbub,Nadia Najjar

Main category: cs.CL

TL;DR: 本文提出了一个名为GanitLLM的孟加拉语数学推理模型,以及一个难度感知的孟加拉语数学语料库和基于课程学习的GRPO训练流程,显著提升了在低资源语言下的多步数学推理能力。

Details Motivation: 现有的大语言模型在处理低资源语言如孟加拉语的多步数学推理时表现不佳,通常依赖英文推理后翻译,或因奖励稀疏问题导致强化学习方法失效。 Method: 构建了一个经过严格过滤和去污染的孟加拉语数学数据集Ganit,并引入自动难度标注;提出Curriculum-GRPO方法,结合SFT与GRPO的多阶段训练、难度感知采样及可验证的格式、数值正确性和语言推理奖励机制。 Result: 在Bn-MGSM和Bn-MSVAMP两个基准上,GanitLLM-4B相比Qwen3-4B基础模型分别提升了8和7个准确率点,解决方案平均长度从943词减少到193词,孟加拉语推理token占比从14%提升至超过88%。 Conclusion: 通过构建高质量、难度分级的数据集并采用课程式强化学习策略,可在低资源语言中实现高效、准确且语言一致的数学推理,为非英语语言的AI推理提供了可行路径。 Abstract: We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.

[60] Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling

Keito Inoshita,Xiaokang Zhou,Akira Kawai

Main category: cs.CL

TL;DR: 本文提出了一种名为MEM-MCL的混合学习模型,结合多阶段进化模型融合与元数据驱动的课程学习,以提升大语言模型在多任务情感分析中的性能。

Details Motivation: 传统情感分析方法难以应对现实场景中多任务并行的需求,而现有大语言模型在情感任务上的精度不足,且缺乏对任务元数据和课程学习的有效利用。 Method: 通过指令微调构建面向特定情感任务的专家模型,采用进化算法进行多阶段模型融合,并引入基于任务难度的元数据驱动课程学习策略优化训练过程,使用弱标注数据提升跨任务性能。 Result: 实验结果表明,MEM-MCL在多数情感分析子任务上优于传统大语言模型,展现出更强的准确性和可扩展性。 Conclusion: 所提出的MEM-MCL框架有效提升了大语言模型在复杂情感分析任务中的表现,验证了模型融合与课程学习结合的潜力。 Abstract: The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training. However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks. While offering flexibility, LLMs in sentiment-specific tasks often fall short of the required accuracy. Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve the learning performance while reducing computational costs. The use of task meta-data and curriculum learning to optimize learning processes remains underexplored, while sentiment analysis is a critical task in NLP that requires high accuracy and scalability across multiple subtasks. In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Meta data driven Curriculum Learning (MEM-MCL), to enhance the sentiment analysis in large language modeling. In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model. The merging process is optimized with weak data to enhance performance across tasks. The curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks.

[61] EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs

Jewon Yeom,Jaewon Sok,Seonghyeon Park,Jeongjae Park,Taesup Kim

Main category: cs.CL

TL;DR: 本文提出了一种名为EpiCaR的新训练方法,旨在提升大语言模型的推理能力与校准性,通过将推理训练重构为认知学习问题,在准确性和置信度校准之间实现帕累托最优,并显著降低推理计算开销。

Details Motivation: 现有基于自我训练的推理增强方法过度强化成功路径,导致模型过自信、丧失不确定性表达能力,引发对齐中的模型崩溃问题。 Method: 将推理训练视为认知学习问题,提出EpiCaR训练目标,结合显式自我评估信号,在迭代监督微调框架中联合优化推理性能和校准性。 Result: 在Llama-3和Qwen-3系列模型上验证,EpiCaR在准确性和校准性方面均优于基线方法,尤其在具备足够推理能力的模型中表现突出;并在GSM8K和MBPP等任务上展现良好泛化能力,推理计算量减少至1/3(K=10即可达到K=30的性能)。 Conclusion: EpiCaR能有效平衡推理能力和置信度校准,缓解模型崩溃问题,为高效、可信的推理提供了可行框架。 Abstract: Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.

[62] Garbage Attention in Large Language Models: BOS Sink Heads and Sink-aware Pruning

Jaewon Sok,Jewon Yeom,Seonghyeon Park,Jeongjae Park,Taesup Kim

Main category: cs.CL

TL;DR: 本文提出了BOS sink现象作为解释大语言模型中层间冗余性的一个关键机制,并利用该机制设计了一种有效的注意力头剪枝方法,在保持模型性能的同时实现了高效压缩。

Details Motivation: 大语言模型存在显著的冗余,尤其是深层组件的冗余原因尚不明确,本文旨在系统地解释这一现象并提供更可靠的模型压缩依据。 Method: 通过引入BOS sink评分来识别具有高冗余性的注意力头,分析其在不同层中的行为,并提出基于该评分的剪枝策略,在多个主流模型上进行实验验证。 Result: 实验表明,高BOS sink得分的注意力头确实贡献极小,移除它们后模型性能几乎不受影响;相比基于权重或激活的剪枝方法,该方法更可靠且适用于不同序列长度。 Conclusion: BOS sink现象为理解大模型中的结构冗余提供了直观且稳健的功能解释,基于此的剪枝策略优于传统幅度-based 方法,表明注意力结构特性是更优的模型压缩基础。 Abstract: Large Language Models (LLMs) are known to contain significant redundancy, yet a systematic explanation for why certain components, particularly in higher layers, are more redundant has remained elusive. In this work, we identify the BOS sink phenomenon as a key mechanism driving this layer-wise sensitivity. We show that attention heads with high BOS sink scores are strongly associated with functional redundancy: such heads, especially in deeper layers, contribute little to predictive performance and effectively serve as \emph{dumping grounds} for superfluous attention weights. This provides a concrete functional explanation for the structural redundancy reported in prior studies. Leveraging this insight, we introduce a simple pruning strategy that removes high-BOS sink heads. Experiments on Gemma-3, Llama-3.1, and Qwen3 demonstrate that this approach identifies redundant transformer components more reliably than weight- or activation-based criteria, while preserving performance close to dense baselines even under aggressive pruning. Moreover, we find that the behavior of sink heads remains stable across different sequence lengths. Overall, our results suggest that structural properties of attention offer a more intuitive and robust basis for model compression than magnitude-based methods.

[63] CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering

Zili Wei,Xiaocui Yang,Yilin Wang,Zihan Wang,Weidong Bao,Shi Feng,Daling Wang,Yifei Zhang

Main category: cs.CL

TL;DR: 本文提出CIRAG模型,通过构建-整合机制和自适应多粒度生成模块,解决现有iRAG方法在多跳问答中的贪婪扩展和粒度不匹配问题,并引入轨迹蒸馏提升推理效率。

Details Motivation: 现有iRAG方法存在单路径贪婪扩展导致误差传播、以及单一证据粒度难以平衡噪声控制与上下文充分性的问题。 Method: 提出CIRAG模型,包括迭代的构建-整合模块(保留多条证据链)、自适应级联多粒度生成模块(从三元组到句子再到段落逐步扩展上下文),以及轨迹蒸馏技术(将教师模型的整合策略蒸馏至轻量级学生模型)。 Result: 实验表明,CIRAG在多个数据集上优于现有的iRAG方法,尤其在长距离推理任务中表现更优。 Conclusion: CIRAG有效缓解了多跳问答中的误差累积和粒度失配问题,通过多路径证据保留与渐进式上下文扩展,实现了更鲁棒且高效的推理。 Abstract: Triple-based Iterative Retrieval-Augmented Generation (iRAG) mitigates document-level noise for multi-hop question answering. However, existing methods still face limitations: (i) greedy single-path expansion, which propagates early errors and fails to capture parallel evidence from different reasoning branches, and (ii) granularity-demand mismatch, where a single evidence representation struggles to balance noise control with contextual sufficiency. In this paper, we propose the Construction-Integration Retrieval and Adaptive Generation model, CIRAG. It introduces an Iterative Construction-Integration module that constructs candidate triples and history-conditionally integrates them to distill core triples and generate the next-hop query. This module mitigates the greedy trap by preserving multiple plausible evidence chains. Besides, we propose an Adaptive Cascaded Multi-Granularity Generation module that progressively expands contextual evidence based on the problem requirements, from triples to supporting sentences and full passages. Moreover, we introduce Trajectory Distillation, which distills the teacher model's integration policy into a lightweight student, enabling efficient and reliable long-horizon reasoning. Extensive experiments demonstrate that CIRAG achieves superior performance compared to existing iRAG methods.

[64] Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition

Ayman Mansour

Main category: cs.CL

TL;DR: 本文研究了针对低资源苏丹阿拉伯语方言的语音识别系统,通过自训练和TTS合成数据增强方法,在低成本资源下显著提升了Whisper模型性能,并建立了首个苏丹方言基准。

Details Motivation: 苏丹阿拉伯语等低资源方言缺乏专门的自动语音识别(ASR)研究,现有模型表现不佳,亟需有效且经济的方法来推动此类边缘化语言的ASR发展。 Method: 采用OpenAI Whisper模型,结合两种数据增强策略:基于无标签语音的自训练伪标签和Klaam TTS生成的合成语音;在低成本计算平台(Kaggle免费版和Lightning.ai试用)上进行微调。 Result: Whisper-Medium结合自训练与TTS增强后,在评估集上达到57.1%的词错误率(WER),在域外测试集为51.6%,显著优于零样本多语言Whisper(78.8% WER)和专用于MSA的阿拉伯语模型(73.8–123% WER)。 Conclusion: 战略性数据增强能有效克服低资源方言的数据匮乏问题,为开发低资源阿拉伯语及其他边缘语言的ASR系统提供了可行路径;所有模型、基准和训练流程已公开发布以促进后续研究。 Abstract: Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (Kaggle free tier and Lightning.ai trial), demonstrating that strategic data augmentation can overcome resource limitations for low-resource dialects and provide a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.

[65] Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Yubo Wang,Juntian Zhang,Yichen Wu,Yankai Lin,Nils Lukas,Yuhan Liu

Main category: cs.CL

TL;DR: 本文提出了一种名为Laser的新范式,通过动态窗口对齐学习(DWAL)重构视觉推理过程,克服了现有链式思维和潜在推理方法的信息带宽瓶颈与语义过早坍缩问题。Laser采用“先森林后树木”的认知层次结构,在保持全局特征概率叠加的同时实现高效、可解释的视觉推理,在6个基准上达到SOTA,并显著减少推理所需的token数量。

Details Motivation: 现有的链式思维方法因文本推理存在信息带宽瓶颈,而潜在推理方法常因自回归目标僵化导致语义过早坍缩,限制了多步视觉推理的效果。 Method: 提出Laser框架,采用动态窗口对齐学习(DWAL),将潜在状态与未来语义的动态有效性窗口对齐,形成‘先森林后树木’的认知层级,并通过自精化的叠加机制稳定训练过程,同时保留可解码路径以维持可解释性。 Result: 在6个基准测试上,Laser平均超越强基线Monet达5.03%,推理token减少超过97%,并在分布外数据上展现出强泛化能力。 Conclusion: Laser通过引入动态对齐与分层语义建模,有效解决了视觉语言模型中推理效率与语义完整性之间的矛盾,为高效且可解释的多步视觉推理提供了新方向。 Abstract: While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.

[66] AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents

Xuannan Liu,Xiao Yang,Zekun Li,Peipei Li,Ran He

Main category: cs.CL

TL;DR: 本文提出了一个新任务——LLM代理中幻觉的自动归因,并发布了AgentHallu基准,包含多框架、多领域的带标注轨迹数据、幻觉分类体系和多层次人工标注,用于识别导致幻觉的关键步骤及其原因。实验表明该任务极具挑战性,当前最优模型在步骤定位上的准确率仅为41.1%。

Details Motivation: 现有的幻觉检测主要集中在单轮生成上,难以应对基于LLM的代理在多步推理过程中中间步骤幻觉传播的问题。因此需要一种能精确定位引发幻觉的初始步骤并解释其成因的方法,以提升代理系统的可靠性与透明度。 Method: 提出‘幻觉归因’这一新任务,构建AgentHallu基准:包含693条高质量轨迹(涵盖7个代理框架和5个领域)、五类14子类的幻觉分类体系,以及包含二元标签、责任步骤和因果解释的多层次人工标注。评估了13种主流模型在该任务上的表现。 Result: 实验显示当前领先模型在该任务上表现有限,最佳模型仅达到41.1%的步骤定位准确率;其中工具使用类幻觉最难检测,准确率仅为11.6%。结果表明多步幻觉归因极具挑战性。 Conclusion: AgentHallu为研究多步推理中LLM代理的幻觉归因提供了重要基础,有助于推动构建更鲁棒、透明和可靠的代理系统。 Abstract: As LLM-based agents operate over sequential multi-step reasoning, hallucinations arising at intermediate steps risk propagating along the trajectory, thus degrading overall reliability. Unlike hallucination detection in single-turn responses, diagnosing hallucinations in multi-step workflows requires identifying which step causes the initial divergence. To fill this gap, we propose a new research task, automated hallucination attribution of LLM-based agents, aiming to identify the step responsible for the hallucination and explain why. To support this task, we introduce AgentHallu, a comprehensive benchmark with: (1) 693 high-quality trajectories spanning 7 agent frameworks and 5 domains, (2) a hallucination taxonomy organized into 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, and Tool-Use) and 14 sub-categories, and (3) multi-level annotations curated by humans, covering binary labels, hallucination-responsible steps, and causal explanations. We evaluate 13 leading models, and results show the task is challenging even for top-tier models (like GPT-5, Gemini-2.5-Pro). The best-performing model achieves only 41.1\% step localization accuracy, where tool-use hallucinations are the most challenging at just 11.6\%. We believe AgentHallu will catalyze future research into developing robust, transparent, and reliable agentic systems.

[67] PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection

Jinhan Liu,Yibo Yang,Ruiying Lu,Piotr Piekos,Yimeng Chen,Peng Wang,Dandan Guo

Main category: cs.CL

TL;DR: 本文提出了一种名为位置衰减重加权(PDR)的训练免费框架,用于提升大语言模型中预训练数据检测的效果,特别强调了高熵初始token的重要性。

Details Motivation: 现有的基于似然的方法在黑盒、零样本设置下检测预训练数据时存在挑战,且通常使用均匀权重聚合令牌级分数,忽略了自回归生成中的信息理论动态。 Method: 引入位置衰减重加权(PDR),通过对早期位置的令牌级分数进行放大,对后期位置的噪声进行抑制,以利用自回归生成过程中的信息动态特性。 Result: 大量实验表明,PDR可以作为强健的先验,通常能够增强多种先进方法在多个基准上的表现。 Conclusion: PDR是一种有效的训练-free和即插即用框架,能显著提高现有方法在检测大语言模型预训练数据方面的性能。 Abstract: Detecting pre-training data in Large Language Models (LLMs) is crucial for auditing data privacy and copyright compliance, yet it remains challenging in black-box, zero-shot settings where computational resources and training data are scarce. While existing likelihood-based methods have shown promise, they typically aggregate token-level scores using uniform weights, thereby neglecting the inherent information-theoretic dynamics of autoregressive generation. In this paper, we hypothesize and empirically validate that memorization signals are heavily skewed towards the high-entropy initial tokens, where model uncertainty is highest, and decay as context accumulates. To leverage this linguistic property, we introduce Positional Decay Reweighting (PDR), a training-free and plug-and-play framework. PDR explicitly reweights token-level scores to amplify distinct signals from early positions while suppressing noise from later ones. Extensive experiments show that PDR acts as a robust prior and can usually enhance a wide range of advanced methods across multiple benchmarks.

[68] Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model

Zhongzheng Wang,Yuanhe Tian,Hongzhi Wang,Yan Song

Main category: cs.CL

TL;DR: 本文提出了一种基于多模态大语言模型的生成式可解释框架,用于多模态方面级情感分析(MABSA),通过引入依赖句法引导的情感线索策略和自动生成的解释数据集,同时提升情感分类准确性和可解释性。

Details Motivation: 现有MABSA方法主要依赖判别式分类,缺乏显式的可解释性,难以提供细粒度的情感归因。因此,需要一种既能准确预测情感又能生成自然语言解释的统一框架。 Method: 将MABSA重构为生成式任务,基于多模态大语言模型(MLLMs)采用提示生成范式,联合生成情感预测与自然语言解释;提出依赖句法引导的情感线索策略,剪枝并文本化以方面为中心的依存句法树,增强模型对不同情感方面的区分能力;利用MLLMs构建带情感解释的新数据集进行微调。 Result: 实验表明该方法在情感分类准确性上取得一致提升,同时能够生成忠实且基于方面的自然语言解释,增强了模型的可解释性与推理能力。 Conclusion: 所提出的生成式框架有效结合了多模态理解与可解释性,在MABSA任务中实现了更优的性能和更强的解释能力,展示了生成式方法在细粒度情感分析中的潜力。 Abstract: Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lacking explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Based on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm, jointly producing sentiment and explanation. To further enhance aspect-oriented reasoning capabilities, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations to fine-tune. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy, but also produces faithful, aspect-grounded explanations.

[69] †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems

Zabir Al Nazi,Shubhashis Roy Dipta,Sudipta Kar

Main category: cs.CL

TL;DR: 本文提出了DAGGER方法,通过生成可执行的计算图来提升在包含干扰信息的数学问题中的推理鲁棒性和效率,并在低资源语言(如孟加拉语)中验证了其有效性。

Details Motivation: 研究现有思维链提示在低资源语言和无关上下文干扰下的数学推理表现,探索其鲁棒性不足的问题。 Method: 构建了一个包含语义干扰项的孟加拉语数学基准DISTRACTMATH-BN,提出DAGGER方法,将数学问题求解建模为带干扰节点显式建模的可执行计算图生成,并结合监督微调与组相对策略优化进行训练。 Result: 实验表明标准模型在干扰下性能下降高达41分,而推理专用模型也下降14-20分;DAGGER在未使用干扰样本训练的情况下,达到相当准确率且仅用11%的token。 Conclusion: 结构化中间表示能显著提升数学推理在噪声环境和低资源场景下的鲁棒性与推理效率,优于自由形式的推理方式。 Abstract: Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Evaluating seven models ranging from 3B to 12B parameters, we observe substantial performance degradation under distractors: standard models drop by up to 41 points, while reasoning-specialized models decline by 14 to 20 points despite consuming five times more tokens. We propose †DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuning Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization achieves comparable weighted accuracy on augmented benchmarks while using 89 percent fewer tokens than reasoning models. Importantly, this robustness emerges without explicit training on distractor-augmented examples. Our results suggest that enforcing structured intermediate representations improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.

[70] BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models

William Guey,Wei Zhang,Pei-Luen Patrick Rau,Pierrick Bougault,Vitor D. de Moura,Bertan Ucar,Jose O. Gomes

Main category: cs.CL

TL;DR: 本文提出了BiasLab,一个开源、模型无关的评估框架,用于通过多语言、鲁棒性导向的实验设计来量化大语言模型输出层面的偏见。

Details Motivation: 由于对提示词敏感、多语言覆盖有限以及缺乏标准化指标,当前在高风险场景中评估大语言模型偏见具有挑战性。 Method: BiasLab采用严格的双重框架构建镜像探针对,通过确定性目标替换生成反向断言,并使用随机化指令包装和固定选择Likert格式进行重复评估,利用基于LLM的裁判将响应归一化为一致性标签。 Result: 该框架支持多种偏见维度(如人口、文化、政治等)的评估,生成包括效应量和中立率在内的定量偏见指标,并产出结构化报告与可视化结果。 Conclusion: BiasLab提供了一种标准化、跨语言且对框架敏感的偏见测量方法,补充了内在和数据集层面的审计,有助于研究者和机构进行模型基准测试与部署决策。 Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes contexts where their outputs influence real-world decisions. However, evaluating bias in LLM outputs remains methodologically challenging due to sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics that enable reliable comparison across models. This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias through a multilingual, robustness-oriented experimental design. BiasLab constructs mirrored probe pairs under a strict dual-framing scheme: an affirmative assertion favoring Target A and a reverse assertion obtained by deterministic target substitution favoring Target B, while preserving identical linguistic structure. To reduce dependence on prompt templates, BiasLab performs repeated evaluation under randomized instructional wrappers and enforces a fixed-choice Likert response format to maximize comparability across models and languages. Responses are normalized into agreement labels using an LLM-based judge, aligned for polarity consistency across framings, and aggregated into quantitative bias indicators with descriptive statistics including effect sizes and neutrality rates. The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics, and produces reproducible artifacts such as structured reports and comparative visualizations. BiasLab contributes a standardized methodology for cross-lingual and framing-sensitive bias measurement that complements intrinsic and dataset-based audits, enabling researchers and institutions to benchmark robustness and make better-informed deployment decisions.

[71] Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Masahiro Kaneko

Main category: cs.CL

TL;DR: 本文提出了一种名为Paraphrasing Adversarial Attack(PAA)的黑盒优化方法,通过保持语义不变的改写提升论文在LLM评审中的得分,揭示了当前评审系统易受改写攻击的漏洞,并探讨了检测与防御的可能性。

Details Motivation: 现有的针对LLM评审系统的攻击依赖于提示注入,会改变原文内容,难以区分是模型对注入敏感还是评审本身不鲁棒。因此,需要一种不改变论文主张的攻击方式来更准确地评估评审系统的安全性。 Method: 提出PAA(Paraphrasing Adversarial Attack),一种黑盒优化方法,利用上下文学习,基于已有改写及其评分生成新的候选改写,搜索能在保持语义等价和语言自然性的前提下获得更高评审分数的文本。 Result: 在五个机器学习与NLP会议场景下,使用三种LLM评审者和五种攻击模型的实验表明,PAA能持续提高评审分数而不改变论文主张;人工评估确认改写保持了原意与自然性;发现被攻击论文的评审文本困惑度上升,可作为潜在检测信号;同时发现改写提交可在一定程度上缓解此类攻击。 Conclusion: LLM评审系统容易受到语义保持型改写攻击的影响,PAA揭示了这一安全隐患,提示需加强评审模型的鲁棒性,并为检测和防御此类攻击提供了可行方向。 Abstract: The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.

[72] Fine-grained Verbal Attack Detection via a Hierarchical Divide-and-Conquer Framework

Quan Zheng,Yuanhe Tian,Ming Wang,Yan Song

Main category: cs.CL

TL;DR: 提出了一种基于时空信息的分层细粒度框架,用于中文社交媒体中的言语攻击识别,并发布了包含层级回复结构的新数据集,实验表明该方法优于依赖参数规模的大型模型。

Details Motivation: 现有研究在建模对话结构和上下文依赖方面存在不足,尤其是在隐式攻击普遍存在的中文社交媒体中,难以有效识别基于语境的隐性言语攻击。 Method: 构建了“分层攻击评论检测”数据集,明确编码层级回复结构和时间顺序;提出分而治之的细粒度框架,将攻击检测分解为多个子任务,由轻量级专用模型分别处理显式检测、隐式意图推断和目标识别。 Result: 在新提出的數據集及基准意图检测数据集上的实验表明,使用该框架的小型模型显著优于依赖参数扩展的大型单一模型。 Conclusion: 通过结构化任务分解和对对话结构的精细建模,能更有效地识别复杂上下文中的言语攻击,为低资源设置下的攻击检测提供了高效且可扩展的解决方案。 Abstract: In the digital era, effective identification and analysis of verbal attacks are essential for maintaining online civility and ensuring social security. However, existing research is limited by insufficient modeling of conversational structure and contextual dependency, particularly in Chinese social media where implicit attacks are prevalent. Current attack detection studies often emphasize general semantic understanding while overlooking user response relationships, hindering the identification of implicit and context-dependent attacks. To address these challenges, we present the novel "Hierarchical Attack Comment Detection" dataset and propose a divide-and-conquer, fine-grained framework for verbal attack recognition based on spatiotemporal information. The proposed dataset explicitly encodes hierarchical reply structures and chronological order, capturing complex interaction patterns in multi-turn discussions. Building on this dataset, the framework decomposes attack detection into hierarchical subtasks, where specialized lightweight models handle explicit detection, implicit intent inference, and target identification under constrained context. Extensive experiments on the proposed dataset and benchmark intention detection datasets show that smaller models using our framework significantly outperform larger monolithic models relying on parameter scaling, demonstrating the effectiveness of structured task decomposition.

[73] Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

Shaoning Sun,Mingzhu Cai,Huang He,Bingjin Chen,Siqi Bao,Yujiu Yang,Hua Wu,Haifeng Wang

Main category: cs.CL

TL;DR: 本文提出了语言模型在强化学习中表现差异的根源——概率空间中的“分布清晰度”(distributional clarity),并通过三阶段分析揭示了这一现象背后的机制,提出了一种基于轮廓系数(Silhouette Coefficient)的量化方法,并设计了相应的训练策略,显著提升了不同模型家族在数学推理任务上的表现。

Details Motivation: 观察到不同语言模型在相同强化学习训练下表现差异巨大,现有数据驱动方法无法充分解释此现象,因此需要从模型内部结构特性出发,探索导致这种差异的根本原因。 Method: 提出‘分布清晰度’概念,使用轮廓系数(Silhouette Coefficient, S)量化模型对正确与错误回答的概率分布的类内紧凑性和类间分离性;通过三阶段分析(现象-机制-解释)验证其作用,并设计Silhouette-Aware Reweighting策略,在训练中优先关注低S值样本以提升模型表现。 Result: 实验表明轮廓系数S与强化学习性能高度相关,低S值模型存在严重逻辑错误和推理不稳定;所提重加权策略在六个数学基准上均带来提升,最高在AIME24上增益达5.9分。 Conclusion: 分布清晰度是一种根本且可训练的属性,决定了模型对强化学习的适应性(RL-Friendliness),为改进模型训练提供了新的视角和可行路径。 Abstract: Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: \textbf{distributional clarity} in probability space. Through a three-stage analysis-from phenomenon to mechanism to interpretation-we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the \textbf{Silhouette Coefficient} ($S$) and demonstrate that (1) high $S$ correlates strongly with RL performance; (2) low $S$ is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-$S$ samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.

[74] TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

Tianhua Zhang,Kun Li,Junan Li,Yunxiang Li,Hongyin Luo,Xixin Wu,James Glass,Helen Meng

Main category: cs.CL

TL;DR: 本文提出了TreePS-RAG,一种基于树结构的在线强化学习框架,用于代理式检索增强生成(RAG),通过蒙特卡洛估计实现无需中间标注的细粒度步骤优势计算,在保持结果奖励的同时提升多跳问答性能。

Details Motivation: 现有基于强化学习的RAG方法依赖稀疏的最终奖励,难以进行有效的步骤级信用分配;而引入过程监督的方法通常依赖离线数据或昂贵的中间标注,存在分布偏移或成本高的问题。 Method: 将代理式RAG推理建模为 rollout 树结构,每个推理步骤对应一个节点,利用蒙特卡洛方法通过后代结果估计各步效用,实现细粒度的过程优势计算,并设计高效的在线树构建策略以在有限计算预算下保持探索多样性。 Result: 在七个多跳和通用问答基准上、多种模型规模下,TreePS-RAG在与Search-R1等强基线相当的rollout成本下, consistently 且显著优于仅结果监督和领先的过程监督RL方法。 Conclusion: TreePS-RAG通过树形建模实现了有效的在线过程级信用分配,无需中间标注即可提升代理式RAG的推理性能,兼顾效率与效果。 Abstract: Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.

[75] Symphonym: Universal Phonetic Embeddings for Cross-Script Toponym Matching via Teacher-Student Distillation

Stephen Gadd

Main category: cs.CL

TL;DR: 本文提出了一种名为Symphonym的神经嵌入系统,能够将来自20种文字系统的地名映射到统一的128维语音空间中,实现跨语言和跨文字的地名匹配。

Details Motivation: 现有方法依赖于特定语言的语音算法或音译规则,在地名跨越不同文字系统时表现不佳,例如无法识别用西里尔字母或阿拉伯字母书写的“Moscow”指向同一城市。因此需要一种能有效处理跨文字地名匹配的方法。 Method: 提出Symphonym系统:使用基于发音特征(通过Epitran和PanPhon)训练的Teacher网络生成目标嵌入,Student网络则从原始字符学习逼近这些嵌入;采用三阶段课程学习在5700万地名数据上进行训练,并在推理时仅使用轻量级Student网络。 Result: 在MEHDIE希伯来-阿拉伯语基准测试上达到89.2%的Recall@1,优于Levenshtein(81.5%)和Jaro-Winkler(78.5%);Student网络与Teacher网络在第二阶段达到96.6%的余弦相似度。 Conclusion: Symphonym能高效实现跨文字系统的地名语音匹配,适用于大规模历史地名数据库中的模糊语音匹配与检索,未来可在World Historical Gazetteer等系统中广泛应用。 Abstract: Linking place names across languages and writing systems is a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches rely on language-specific phonetic algorithms or transliteration rules that fail when names cross script boundaries -- no string metric can determine that "Moscow" when rendered in Cyrillic or Arabic refer to the same city. I present Symphonym, a neural embedding system that maps toponyms from 20 writing systems into a unified 128-dimensional phonetic space. A Teacher network trained on articulatory phonetic features (via Epitran and PanPhon) produces target embeddings, while a Student network learns to approximate these from raw characters. At inference, only the lightweight Student (1.7M parameters) is required, enabling deployment without runtime phonetic conversion. Training uses a three-phase curriculum on 57 million toponyms from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. Phase 1 trains the Teacher on 467K phonetically-grounded triplets. Phase 2 aligns the Student to Teacher outputs across 23M samples, achieving 96.6% cosine similarity. Phase 3 fine-tunes on 3.3M hard negative triplets -- negatives sharing prefix and script with the anchor but referring to different places -- to sharpen discrimination. Evaluation on the MEHDIE Hebrew-Arabic benchmark achieves 89.2% Recall@1, outperforming Levenshtein (81.5%) and Jaro-Winkler (78.5%). The system is optimised for cross-script matching; same-script variants can be handled by complementary string methods. Symphonym will enable fuzzy phonetic reconciliation and search across the World Historical Gazetteer's 67 million toponyms. Code and models are publicly available.

[76] X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Jie Wu,Haoling Li,Xin Zhang,Jiani Guo,Jane Luo,Steven Liu,Yangyu Huang,Ruihang Chu,Scarlett Li,Yujiu Yang

Main category: cs.CL

TL;DR: 本文提出了一种完全合成的数据生成方法SynthSmith,用于训练代码大模型X-Coder,无需依赖真实世界数据,在竞争性编程任务中展现出卓越的推理能力。

Details Motivation: 现有代码大模型严重依赖真实世界数据,限制了其可扩展性,尤其在需要高强度推理的竞赛级编程任务中表现不足。 Method: 提出SynthSmith,通过基于特征的合成方法生成包含多样化挑战性任务、验证过的解法和测试用例的全合成数据集,并用于监督微调与强化学习训练X-Coder模型系列。 Result: X-Coder 7B模型在LiveCodeBench v5上达到62.9 avg@8、v6上55.8的通过率,超越更大参数量的DeepCoder和AReal模型;验证了合成数据的扩展规律,并分析了有效扩展维度及强化学习的关键因素。 Conclusion: 高质量的合成数据结合分阶段训练能显著提升代码推理能力,减少对真实编码数据的依赖,为代码大模型提供了可持续的训练路径。 Abstract: Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.

[77] RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Haonan Bian,Zhiyuan Yao,Sen Hu,Zishan Xu,Shaolei Zhang,Yifu Guo,Ziliang Yang,Xueran Han,Huacan Wang,Ronghao Chen

Main category: cs.CL

TL;DR: RealMem是首个基于真实项目场景的大规模长期记忆基准,旨在评估大语言模型在跨会话、目标演进的项目导向交互中的记忆管理能力。

Details Motivation: 现有记忆评测主要集中在闲聊或任务型对话,缺乏对长期、项目导向型交互中动态目标和上下文依赖的评估,难以反映真实应用场景下的记忆需求。 Method: 提出RealMem基准,包含11个场景下2000多个跨会话对话,通过项目构建、多智能体对话生成和记忆调度管理的合成 pipeline 模拟记忆的动态演化过程。 Result: 实验表明当前的记忆系统在处理真实项目中的长期状态跟踪和动态上下文依赖方面存在显著挑战。 Conclusion: RealMem填补了项目导向型长期记忆评测的空白,为构建具备持续一致性能力的自主智能体提供了重要评估工具。 Abstract: As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).

[78] Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition

Nathan Roll,Pranav Bhalerao,Martijn Bartelds,Arjun Pawar,Yuka Tatsumi,Tolulope Ogunremi,Chen Shani,Calbert Graham,Meghan Sumner,Dan Jurafsky

Main category: cs.CL

TL;DR: 本文提出了“架构指纹”(Architectural Fingerprinting)方法,用于分析Transformer和Conformer在语音语言建模中的不同处理策略。研究发现Conformer采用“早分类”策略,而Transformer则“后整合”,揭示了二者在层级信息处理上的根本差异,并为不同应用场景提供设计指导。

Details Motivation: 尽管Transformer和Conformer在语音建模中表现相当,但其背后是否采用相似的信息处理机制尚不清楚,因此需要一种能分离架构影响的方法来揭示其内在归纳偏置。 Method: 提出“架构指纹”探针框架,在24个预训练编码器(39M-3.3B参数)的受控实验中,分析模型各层对音素、说话人、口音等特征的编码时序。 Result: Conformer更早解析音素类别(提前29%)和说话人性别(提前16%),表现为“早分类”;Transformer则将音素、口音和时长信息延迟至深层(49-57%深度)整合,表现为“后整合”。 Conclusion: Transformer与Conformer具有本质不同的信息处理层次结构,前者适合依赖上下文的任务,后者可能更优于低延迟流式应用,为模型选择与设计提供了可解释的依据。 Abstract: In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.

[79] LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents

Davide Baldelli,Ali Parviz,Amal Zouaq,Sarath Chandar

Main category: cs.CL

TL;DR: 本文探讨了大型语言模型(LLM)作为自主代理在执行依赖隐藏状态的交互任务时的局限性,提出并定义了“私有状态交互任务”(PSITs),并通过理论与实验表明:仅依赖公共对话历史的模型无法同时保证一致性和保密性。作者进一步提出一种具备显式私有工作记忆的新架构,实验证明该设计能有效解决此问题,是构建真正交互式语言代理的关键。

Details Motivation: 随着LLMs从文本补全转向自主代理,标准聊天界面缺乏私有工作记忆,限制其处理需维护隐藏状态的交互任务的能力。本文旨在揭示这一根本缺陷,并探索解决方案。 Method: 1. 提出PSITs概念,形式化需要维护隐藏信息的任务;2. 理论证明仅基于公共历史的代理无法在PSITs中同时满足一致性和保密性;3. 设计自洽性测试协议,在分叉对话路径中评估模型对隐藏秘密的保持能力;4. 构建具有显式私有工作记忆的新架构,并进行实证比较。 Result: 标准聊天式LLM和基于检索的记忆基线在自洽性测试中均失败,无论模型规模如何,说明语义检索不足以实现真正的状态维护。而引入显式私有工作记忆的架构能够恢复一致性,成功完成PSITs。 Conclusion: 私有工作记忆是语言代理可靠执行依赖隐藏状态的交互任务的必要组件,现有基于纯对话历史的架构存在根本性局限。 Abstract: As LLMs move from text completion toward autonomous agents, they remain constrained by the standard chat interface, which lacks private working memory. This raises a fundamental question: can agents reliably perform interactive tasks that depend on hidden state? We define Private State Interactive Tasks (PSITs), which require agents to generate and maintain hidden information while producing consistent public responses. We show theoretically that any agent restricted to the public conversation history cannot simultaneously preserve secrecy and consistency in PSITs, yielding an impossibility theorem. To empirically validate this limitation, we introduce a self-consistency testing protocol that evaluates whether agents can maintain a hidden secret across forked dialogue branches. Standard chat-based LLMs and retrieval-based memory baselines fail this test regardless of scale, demonstrating that semantic retrieval does not enable true state maintenance. To address this, we propose a novel architecture incorporating an explicit private working memory; we demonstrate that this mechanism restores consistency, establishing private state as a necessary component for interactive language agents.

[80] UETQuintet at BioCreative IX - MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval

Quoc-An Nguyen,Thi-Minh-Thu Vu,Bich-Dat Nguyen,Dinh-Quang-Minh Tran,Hoang-Quynh Le

Main category: cs.CL

TL;DR: 本文提出了一种用于生物医学问答的模型,能够有效处理直接和顺序性问题,通过多源信息检索和上下文学习提升性能,在BioCreative IX - MedHopQA数据集上取得了0.84的精确匹配得分,排名第二。

Details Motivation: 生物医学问答系统在处理复杂的医疗查询时面临挑战,尤其是多跳推理和医疗数据的复杂性,现有方法难以兼顾效率与准确性。 Method: 将顺序性问题分解为子问题链进行逐步推理,而直接问题则直接处理以提高效率;结合多源信息检索和上下文学习提供丰富上下文以支持答案生成。 Result: 在BioCreative IX - MedHopQA共享任务数据集上,模型取得了0.84的精确匹配(Exact Match)分数,位列当前排行榜第二名。 Conclusion: 该模型能有效应对生物医学问答中的多跳推理与复杂数据挑战,为医学研究和实践提供了高效且通用的解决方案。 Abstract: Biomedical Question Answering systems play a critical role in processing complex medical queries, yet they often struggle with the intricate nature of medical data and the demand for multi-hop reasoning. In this paper, we propose a model designed to effectively address both direct and sequential questions. While sequential questions are decomposed into a chain of sub-questions to perform reasoning across a chain of steps, direct questions are processed directly to ensure efficiency and minimise processing overhead. Additionally, we leverage multi-source information retrieval and in-context learning to provide rich, relevant context for generating answers. We evaluated our model on the BioCreative IX - MedHopQA Shared Task datasets. Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard. These results highlight the model's capability to meet the challenges of Biomedical Question Answering, offering a versatile solution for advancing medical research and practice.

[81] MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education

Dongsuk Jang,Ziyao Shangguan,Kyle Tegtmeyer,Anurag Gupta,Jan Czerminski,Sophie Chheang,Arman Cohan

Main category: cs.CL

TL;DR: 本文提出了一种名为MedTutor的新型系统,通过检索增强生成(RAG)技术从临床病例报告中自动生成基于证据的教学内容和选择题,以辅助住院医师培训。该系统结合本地医学教材与最新文献检索,经重排序模型筛选后由大语言模型生成高质量教育材料,并通过放射科医生评估和LLM-as-a-Judge大规模评测验证其临床与教学价值,结果显示LLM评价与专家判断有中等一致性,强调仍需专家监督。

Details Motivation: 住院医师在学习复杂病例报告时面临获取准确、可靠医学知识耗时且困难的问题,现有学习方式效率低,缺乏自动化、证据驱动的教育支持工具。 Method: 提出MedTutor系统,采用检索增强生成(RAG)架构:输入临床病例报告,通过混合检索机制查询本地医学知识库及PubMed、Semantic Scholar API获取最新研究,使用先进重排序模型对检索结果进行过滤和排序,最后利用大语言模型生成结构化教育内容和多选题。 Result: 三位放射科医生评估认为系统输出具有高临床和教学价值;大规模LLM-as-a-Judge评估显示LLM评分与人类专家判断之间存在中等相关性,表明LLM可辅助但尚不能替代专家评审。 Conclusion: MedTutor能有效从临床病例中生成高质量、基于证据的教育内容,提升住院医师培训效率,但自动化评估仍需结合专家监督以确保准确性与可靠性。 Abstract: The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system's architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model and then an LLM generates the final long-form output describing the main educational content regarding the case-report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large scale evaluation using an LLM-as-a Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLMs outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight.

[82] Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization

Yiming Liang,Fang Zhao

Main category: cs.CL

TL;DR: 本论文探讨了基于Transformer的成分句法分析器在中古荷兰语(低资源历史语言)上的应用,通过多语言联合训练和跨领域数据利用显著提升了性能。

Details Motivation: 中古荷兰语等低资源历史语言的成分句法分析研究较少,现有工作多集中于依存句法分析,且面临数据稀疏和领域异质性问题。 Method: 采用基于Transformer的成分句法分析器,结合高资源辅助语言进行联合训练,并评估微调、数据融合和特征分离等跨领域适应策略。 Result: 联合训练使F1分数最高提升0.73,地理和时间上更接近的语言带来更大增益;微调与数据融合效果相当;跨领域性能需每领域至少约200个样例才能有效提升。 Conclusion: 所提出的多语言联合训练和跨领域适应方法显著优于现有的PCFG解析器,为低资源历史语言的成分句法分析提供了有效路径。 Abstract: Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the greatest gains achieved from languages that are geographically and temporally closer to Middle Dutch. We further evaluate strategies for leveraging newly annotated data from additional domains, finding that fine-tuning and data combination yield comparable improvements, and our neural parser consistently outperforms the currently used PCFG-based parser for Middle Dutch. We further explore feature-separation techniques for domain adaptation and demonstrate that a minimum threshold of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.

[83] TurkBench: A Benchmark for Evaluating Turkish Large Language Models

Çağrı Toraman,Ahmet Kaan Sever,Ayse Aysu Cengiz,Elif Ecem Arslan,Görkem Sevinç,Mete Mert Birdal,Yusuf Faruk Güldemir,Ali Buğra Kanburoğlu,Sezen Felekoğlu,Osman Gürlek,Sarp Kantar,Birsen Şahin Kütük,Büşra Tufan,Elif Genç,Serkan Coşkun,Gupse Ekin Demir,Muhammed Emin Arayıcı,Olgun Dursun,Onur Gungor,Susan Üsküdarlı,Abdullah Topraksoy,Esra Darıcı

Main category: cs.CL

TL;DR: 本文提出了TurkBench,一个用于评估土耳其语生成式大语言模型的综合基准,包含8,151个样本和21个子任务,涵盖知识、语言理解、推理、内容审核、土耳其语语法与词汇及指令遵循六大类别。

Details Motivation: 由于当前大型语言模型发展迅速,但针对土耳其语等非英语语言的评估基准尚不完善,因此需要构建专门的评测工具以准确评估模型在特定语言环境下的表现。 Method: 构建了一个名为TurkBench的多任务评估基准,包含21个子任务、共8,151个数据样本,分为六大类:知识、语言理解、推理、内容审核、土耳其语语法与词汇、指令遵循,并公开发布于Hugging Face平台。 Result: TurkBench提供了丰富的任务类型和文化相关数据,能够有效帮助研究人员评估模型性能并识别改进方向。 Conclusion: TurkBench是一个全面且具有文化相关性的土耳其语模型评估工具,有助于推动非英语语言模型的发展与优化。 Abstract: With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench

[84] Solar Open Technical Report

Sungrae Park,Sanghoon Kim,Jungho Cho,Gyoungjin Gim,Dawoon Jung,Mikyoung Cha,Eunhae Choo,Taekgyu Hong,Minbyul Jeong,SeHwan Joo,Minsoo Khang,Eunwon Kim,Minjeong Kim,Sujeong Kim,Yunsu Kim,Hyeonju Lee,Seunghyun Lee,Sukyung Lee,Siyoung Park,Gyungin Shin,Inseo Song,Wonho Song,Seonghoon Yang,Seungyoun Yi,Sanghoon Yoon,Jeonghyun Ko,Seyoung Song,Keunwoo Choi,Hwalsuk Lee,Sunghun Kim,Du-Seong Chang,Kyunghyun Cho,Junsuk Choe,Hwaran Lee,Jae-Gil Lee,KyungTae Lim,Alice Oh

Main category: cs.CL

TL;DR: Solar Open是一个102B参数的双语混合专家语言模型,专注于服务资源匮乏的语言,通过数据合成、渐进式课程学习和高效的强化学习框架SnapPO,在英语和韩语基准上实现了具有竞争力的性能。

Details Motivation: 解决资源匮乏语言在大规模语言模型训练中的数据稀缺、领域覆盖不足和推理能力弱的问题。 Method: 1) 合成4.5T高质量、特定领域且面向强化学习的数据;2) 设计覆盖20万亿token的渐进式课程学习策略,联合优化数据构成、质量阈值和领域覆盖;3) 提出并应用高效的强化学习优化框架SnapPO。 Result: Solar Open在英语和韩语的多个基准测试中表现出与现有先进模型相当的性能,验证了所提方法在低资源语言建模中的有效性。 Conclusion: 通过数据合成、课程设计和高效RL优化的系统性方法,可以有效构建面向资源匮乏语言的高性能大模型,为公平化AI发展提供了可行路径。 Abstract: We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.

[85] Codified Foreshadowing-Payoff Text Generation

Longfei Yun,Kun Zhou,Yupeng Hou,Letian Peng,Jingbo Shang

Main category: cs.CL

TL;DR: 本文提出了Codified Foreshadowing-Payoff Generation (CFPG) 框架,通过将叙事连续性转化为可执行的因果谓词,解决大语言模型在长距离叙事依赖(如伏笔与呼应)上的缺失问题。

Details Motivation: 大语言模型在生成故事时常常无法有效实现伏笔与呼应之间的逻辑连接,现有评估方法也多关注表层连贯性而忽视结构性缺陷。 Method: 从BookSum语料库中挖掘并编码伏笔-触发-呼应三元组,构建CFPG框架,利用结构化监督确保伏笔在时间和逻辑上被正确实现。 Result: 实验表明,CFPG在呼应准确性和叙事一致性方面显著优于标准提示基线方法。 Conclusion: 显式地对叙事机制进行编码是提升大语言模型从表面流畅到真正叙事能力的关键。 Abstract: Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. However, despite advances in story generation, large language models (LLMs) frequently fail to bridge these long-range narrative dependencies, often leaving "Chekhov's guns" unfired even when the necessary context is present. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. In this paper, we introduce Codified Foreshadowing-Payoff Generation (CFPG), a novel framework that reframes narrative quality through the lens of payoff realization. Recognizing that LLMs struggle to intuitively grasp the "triggering mechanism" of a foreshadowed event, CFPG transforms narrative continuity into a set of executable causal predicates. By mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, we provide structured supervision that ensures foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. Experiments demonstrate that CFPG significantly outperforms standard prompting baselines in payoff accuracy and narrative alignment. Our findings suggest that explicitly codifying narrative mechanics is essential for moving LLMs from surface-level fluency to genuine narrative competence.

[86] Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Wang Yang,Debargha Ganguly,Xinpeng Li,Chaoda Song,Shouren Wang,Vikash Singh,Vipin Chaudhary,Xiaotian Han

Main category: cs.CL

TL;DR: 本文发现混合推理语言模型中的“Think/No-think”指令对推理行为的控制主要由少数触发词驱动,而非指令本身。基于此提出了一种无需训练的提示方法Mid-Think,通过结合特定触发词实现中间预算推理,在准确率与推理长度之间取得更好权衡,并在RL训练中显著提升性能和效率。

Details Motivation: 现有Think/No-think指令控制推理模式的效果可能并非来自指令语义,而是潜在的触发词影响,需揭示真实驱动机制并设计更高效的推理控制方法。 Method: 通过注意力分析和受控提示实验识别出关键触发词(如"Okay"促进推理,换行符后"found"抑制推理),提出Mid-Think提示格式,结合这些触发词实现中间推理控制,无需额外训练。 Result: Mid-Think在准确率-长度权衡上持续优于固定token和基于提示的基线方法;应用于SFT后的RL训练时,使Qwen3-8B在AIME上从69.8%提升至72.4%,GPQA上从58.5%提升至61.1%,训练时间减少约15%。 Conclusion: 推理行为的关键控制因素是底层触发词而非高层指令,Mid-Think利用这一现象实现了高效推理控制,适用于推理时调节与强化学习训练,提升了性能与训练效率。 Abstract: Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading ``Okay'' token induces reasoning behavior, while the newline pattern following ``'' suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.

[87] Task Arithmetic with Support Languages for Low-Resource ASR

Emma Rafkin,Dan DeGenaro,Xiulin Yang

Main category: cs.CL

TL;DR: 本文提出了一种基于任务算术的资源受限语音识别方法,通过在高资源语言上训练的Whisper模型微调生成任务向量,并将其线性组合以提升低资源语言的语音识别性能。

Details Motivation: 由于许多低资源语言缺乏足够的标注数据,现有的ASR方法难以有效应用,因此需要一种能够利用相关高资源语言知识来提升低资源语言识别效果的方法。 Method: 将语言视为任务,对Whisper ASR系统的变体进行微调以生成任务向量,并通过在线性组合中优化权重,结合高资源和低资源语言的任务向量。 Result: 该方法在多个高低资源语言配对中均一致地改善了目标低资源语言的词错误率(WER)。 Conclusion: 任务算术适用于低资源语音识别,通过迁移高资源语言的知识可有效提升模型在低资源语言上的表现。 Abstract: The development of resource-constrained approaches to automatic speech recognition (ASR) is of great interest due to its broad applicability to many low-resource languages for which there is scant usable data. Existing approaches to many low-resource natural language processing tasks leverage additional data from higher-resource languages that are closely related to a target low-resource language. One increasingly popular approach uses task arithmetic to combine models trained on different tasks to create a model for a task where there is little to no training data. In this paper, we consider training on a particular language to be a task, and we generate task vectors by fine-tuning variants of the Whisper ASR system. For pairings of high- and low-resource languages, we merge task vectors via a linear combination, optimizing the weights of the linear combination on the downstream word error rate on the low-resource target language's validation set. We find that this approach consistently improves performance on the target languages.

[88] When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models

Jiaqi Zhao,Qiang Huang,Haodong Chen,Xiaoxing You,Jun Yu

Main category: cs.CL

TL;DR: 本文提出了CLEAR框架,用于系统评估多语言大模型在面对跨语言知识冲突时的处理能力,揭示了任务类型对冲突解决策略的影响。

Details Motivation: 由于大语言模型在不同语言中的知识分布不均,当外部证据与内部记忆冲突时会出现跨语言知识冲突问题,但这一现象在非英语环境中研究不足。 Method: 提出CLEAR框架,将冲突分解为四种渐进场景,并构建覆盖10种语言的多语言QA基准ConflictQA和ConflictingQA,评估六种代表性大模型。 Result: 实验发现:在推理密集型任务中,语言资源丰富度主导冲突解决;而在实体中心型事实冲突中,语言亲缘性起决定作用,低资源但语言相近的语言可胜过高资源远缘语言。 Conclusion: 跨语言知识冲突的解决机制依赖于任务类型,语言资源和语言亲缘性在不同任务中发挥不同作用,这为多语言模型设计提供了新视角。 Abstract: Large Language Models (LLMs) encode vast world knowledge across multiple languages, yet their internal beliefs are often unevenly distributed across linguistic spaces. When external evidence contradicts these language-dependent memories, models encounter \emph{cross-lingual knowledge conflict}, a phenomenon largely unexplored beyond English-centric settings. We introduce \textbf{CLEAR}, a \textbf{C}ross-\textbf{L}ingual knowl\textbf{E}dge conflict ev\textbf{A}luation f\textbf{R}amework that systematically examines how multilingual LLMs reconcile conflicting internal beliefs and multilingual external evidence. CLEAR decomposes conflict resolution into four progressive scenarios, from multilingual parametric elicitation to competitive multi-source cross-lingual induction, and systematically evaluates model behavior across two complementary QA benchmarks with distinct task characteristics. We construct multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages and evaluate six representative LLMs. Our experiments reveal a task-dependent decision dichotomy. In reasoning-intensive tasks, conflict resolution is dominated by language resource abundance, with high-resource languages exerting stronger persuasive power. In contrast, for entity-centric factual conflicts, linguistic affinity, not resource scale, becomes decisive, allowing low-resource but linguistically aligned languages to outperform distant high-resource ones.

[89] Engineering of Hallucination in Generative AI: It's not a Bug, it's a Feature

Tim Fingscheidt,Patrick Blumenberg,Björn Möller

Main category: cs.CL

TL;DR: 本文探讨了生成式人工智能中的“幻觉”现象,提出适度的幻觉并非缺陷,而可能是实现理想生成效果的关键特征。

Details Motivation: 观察到生成式AI在适度幻觉下表现更佳,挑战了幻觉仅为负面问题的传统观点。 Method: 通过概率工程的简单方法,鼓励生成式AI进行有限度的幻觉,以改善生成效果。 Result: 发现适度的幻觉有助于提升生成模型的表现,使输出更符合预期。 Conclusion: 生成式AI中的幻觉可能不是bug,而是一种有益的特征,关键在于控制其程度。 Abstract: Generative artificial intelligence (AI) is conquering our lives at lightning speed. Large language models such as ChatGPT answer our questions or write texts for us, large computer vision models such as GAIA-1 generate videos on the basis of text descriptions or continue prompted videos. These neural network models are trained using large amounts of text or video data, strictly according to the real data employed in training. However, there is a surprising observation: When we use these models, they only function satisfactorily when they are allowed a certain degree of fantasy (hallucination). While hallucination usually has a negative connotation in generative AI - after all, ChatGPT is expected to give a fact-based answer! - this article recapitulates some simple means of probability engineering that can be used to encourage generative AI to hallucinate to a limited extent and thus lead to the desired results. We have to ask ourselves: Is hallucination in gen-erative AI probably not a bug, but rather a feature?

[90] Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge

Zhuoyi Yang,Yurun Song,Iftekhar Ahmed,Ian Harris

Main category: cs.CL

TL;DR: 本文系统比较了参数化与非参数化知识注入方法在开放域多跳问答中的效果,发现检索增强生成(RAG)在处理时间新颖知识时表现优异,而监督微调整体准确率最高,表明不同知识注入机制对多跳推理的支持存在本质差异。

Details Motivation: 由于现有研究对不同知识注入方法在多跳问答、尤其是涉及时间上新颖知识的任务中的相对有效性缺乏充分理解,本文旨在系统评估并比较这些方法的效果。 Method: 评估了无监督微调(持续预训练)、监督微调和检索增强生成(RAG)三种方法,在三个7B参数规模的开源大语言模型上,使用QASC标准数据集和一个新构建的包含10,000多个基于2024年Wikipedia事件的多跳问题数据集进行实验。 Result: 无监督微调相比基础模型提升有限,表明仅靠持续预训练不足以提高多跳推理准确性;RAG在依赖时间新颖信息的问题上带来显著且一致的改进;监督微调在所有模型和数据集上达到最高的整体准确率。 Conclusion: 不同知识注入机制在支持多跳问答方面存在根本差异,当需要外部或组合知识时,基于检索的方法尤为重要。 Abstract: Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as finetuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models' pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.

[91] The Need for a Socially-Grounded Persona Framework for User Simulation

Pranav Narayanan Venkit,Yu Li,Yada Pruksachatkun,Chien-Sheng Wu

Main category: cs.CL

TL;DR: 本文提出了一个名为SCOPE的社交心理框架,用于构建和评估基于124名美国参与者的社会心理学数据的大语言模型人格。研究发现,仅基于人口统计学特征的人格在行为预测上表现较差,而加入社会心理维度(如价值观和身份)可显著提升模型表现并降低偏见。SCOPE人格在SimBench测试中优于默认提示和NVIDIA Nemotron人格。

Details Motivation: 现有LLM人格多依赖粗略的人口统计学属性或摘要,缺乏社会心理深度,导致模拟人类行为时偏差大、预测力弱。本文旨在通过更精细的社会心理结构提升人格的真实性和预测能力。 Method: 设计了一个包含141项、耗时两小时的社会心理协议,收集124名美国参与者的数据,构建SCOPE人格框架;在七个大模型上比较不同人格类型(人口统计、社会心理、非人口统计)对行为模拟的影响,并在SimBench基准(441个问题)上评估性能。 Result: 人口统计学特征仅解释约1.5%的人类反应相似性方差;加入社会心理因素后行为预测更准确、过度强调现象减少;基于价值观和身份的非人口统计人格表现出更强的一致性且偏见更低;SCOPE人格在SimBench上优于默认提示和Nemotron人格,且能增强后者的表现。 Conclusion: 人格质量取决于社会心理结构而非人口统计模板或摘要,高质量人格应基于深层心理与价值观特征构建。 Abstract: Synthetic personas are widely used to condition large language models (LLMs) for social simulation, yet most personas are still constructed from coarse sociodemographic attributes or summaries. We revisit persona creation by introducing SCOPE, a socially grounded framework for persona construction and evaluation, built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants. Across seven models, we find that demographic-only personas are a structural bottleneck: demographics explain only ~1.5% of variance in human response similarity. Adding sociopsychological facets improves behavioral prediction and reduces over-accentuation, and non-demographic personas based on values and identity achieve strong alignment with substantially lower bias. These trends generalize to SimBench (441 aligned questions), where SCOPE personas outperform default prompting and NVIDIA Nemotron personas, and SCOPE augmentation improves Nemotron-based personas. Our results indicate that persona quality depends on sociopsychological structure rather than demographic templates or summaries.

[92] ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity A REM-Inspired System Design for Emergent Creative Ideation

Makoto Sato

Main category: cs.CL

TL;DR: 本文提出了ReMIND,一个受REM睡眠启发的模块化框架,用于增强大语言模型在创造性思维中产生意外洞见的能力。该框架通过分离探索与整合阶段,有效平衡新颖性与一致性。

Details Motivation: 大语言模型在创造性思维中难以同时保持新颖性和内部一致性,现有方法在提升多样性时往往牺牲连贯性,因此需要一种系统性设计来促进偶然性灵感的生成与稳定。 Method: 提出ReMIND框架,包含四个阶段:wake(低温生成语义基线)、dream(高温进行探索性生成)、judge(粗粒度过滤不连贯输出并提取候选想法)、re-wake(重新表述为连贯最终输出),每个阶段由独立的大语言模型实现。 Result: 参数扫描显示ReMIND能可靠地促进语义探索并保持下游稳定性;基于嵌入的分析证实了‘dream’阶段存在显著语义偏移,外部评估表明高质量想法是偶发出现的,而非单一指标的极值。 Conclusion: 偶然性创意是一种稀有事件过程,需通过系统级设计来塑造有利条件;ReMIND提供了一个研究计算性偶然灵感的通用框架,并展示了模块化大语言模型编排如何桥接探索与稳定化。 Abstract: Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. While stochastic sampling promotes novelty, it often degrades consistency. Here, we propose ReMIND, a REM-inspired modular framework for ideation. ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter incoherent outputs and extract candidate ideas; and re-wake, which re-articulates selected ideas into coherent final outputs. By instantiating each stage as an independent LLM, ReMIND enables functional separation between exploration and consolidation. Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability. Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations reveal that high-quality ideas emerge sporadically rather than as extrema along any single metric. These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system level design that shapes the conditions under which valuable ideas can emerge and be stabilized. ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization.

[93] Measuring Iterative Temporal Reasoning with TimePuzzles

Zhengxiang Wang,Zeyu Dong

Main category: cs.CL

TL;DR: TimePuzzles 是一个用于评估大语言模型迭代时间推理能力的约束型日期推断任务,通过算法生成包含时间锚点和日历关系的谜题,发现当前模型在无工具辅助下表现不佳,揭示了其在可靠使用工具方面的缺陷。

Details Motivation: 现有大语言模型在复杂时间推理任务中的表现缺乏系统评估,尤其在跨文化和多步推理场景下,需要一个可控且可扩展的基准来诊断其时间推理能力。 Method: 提出 TimePuzzles 数据集,每个样本由事实性时间锚点和日历关系构成,支持一个或多个有效解;通过算法生成确保多样性与可控性,并在 13 个主流 LLM 上进行测试,评估其在有无工具(如网页搜索、代码解释器)辅助下的表现差异。 Result: GPT-5 在无工具情况下准确率为 49.3%,其余模型均低于 31%;使用网页搜索显著提升性能,代码解释器效果不一;当约束被重写为显式日期时,所有模型表现大幅改善。 Conclusion: TimePuzzles 能有效区分模型的时间推理能力,暴露当前模型在工具调用和约束理解上的不足,提供了一个简单、低成本的诊断工具,用于评估工具增强型时间推理。 Abstract: We introduce TimePuzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, TimePuzzles well distinguishes their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset's simplicity. Web search consistently yields substantial gains and using code interpreter shows mixed effects, but all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, TimePuzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.

[94] Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?

Genta Indra Winata,David Anugraha,Patrick Amadeus Irawan,Anirban Das,Haneul Yoo,Paresh Dashore,Shreyas Kulkarni,Ruochen Zhang,Haruki Sakajo,Frederikus Hudi,Anaelia Ovalle,Syrielle Montariol,Felix Gaschi,Michael Anugraha,Rutuj Ravindra Puranik,Zawad Hayat Ahmed,Adril Putra Merin,Emmanuele Chersoni

Main category: cs.CL

TL;DR: 本文提出了CodeMixQA,一个用于评估大语言模型在混合语言(code-switching)环境下理解和生成能力的新基准,揭示了当前模型在推理和生成方面的局限性,并提供了构建更强大多语言模型的见解。

Details Motivation: 尽管代码切换在多语言交流中普遍存在,但大语言模型在混合语言环境下的鲁棒性仍缺乏充分理解。 Method: 引入CodeMixQA基准,包含16种不同并行的代码切换语言对变体,涵盖多种地理区域和切换模式,并进行人类标注;在此基础上评估模型在问答任务中的推理行为及生成文本的自然性和语义保真度。 Result: 发现现有大语言模型在混合语言推理和生成方面存在持续挑战,特别是在语义一致性和自然性上表现不足。 Conclusion: 当前大语言模型在处理代码切换文本时仍有显著缺陷,需针对性改进以提升多语言鲁棒性,作者已开源数据集与代码。 Abstract: Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. We introduce CodeMixQA a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs. We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.

[95] Structured Reasoning for Large Language Models

Jinyi Han,Zixiang Di,Zishang Jiang,Ying Liao,Jiaqing Liang,Yongqi Wang,Yanghua Xiao

Main category: cs.CL

TL;DR: 提出Structured Reasoning (SCR)框架,通过解耦推理轨迹并采用动态终止监督与两阶段强化学习,提升大模型推理效率与自我验证能力,显著减少输出长度。

Details Motivation: 大语言模型在推理过程中常生成冗余或无效的思维步骤,尤其在已得出正确答案后仍进行不必要的验证与修改,缺乏对关键推理能力的结构化监督。 Method: 提出Generate-Verify-Revise范式,构建结构化训练数据,引入动态终止监督,并采用分阶段强化学习:第一阶段训练生成与自验证,第二阶段训练修订能力,避免学习信号干扰。 Result: 在三个主干模型上实验表明,SCR显著提升了推理效率和自我验证准确性,输出token长度最多减少50%。 Conclusion: SCR通过结构化解耦和分阶段训练,有效提升大模型推理效率与可控性,为高效推理提供了新范式。 Abstract: Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.

Manzong Huang,Chenyang Bu,Yi He,Xingrui Zhuo,Xindong Wu

Main category: cs.CL

TL;DR: 本文提出了一种名为Relink的新型GraphRAG框架,采用“推理即构建”范式,通过动态构建查询特定的证据图来解决现有知识图谱不完整和信噪比低的问题。

Details Motivation: 现有的“先构建后推理”范式受限于知识图谱的不完整性和高噪声问题,导致推理路径断裂或引入误导信息,影响问答性能。 Method: 提出Relink框架:1)从原始文本语料库中提取潜在关系池以实例化缺失事实,动态修复推理路径;2)设计统一的、与查询相关的评估策略,联合考虑知识图谱和潜在关系中的候选事实,选择对回答最有益的内容并剔除干扰项。 Result: 在五个开放域问答基准上的实验表明,Relink相比当前领先的GraphRAG方法平均EM提升5.4%,F1提升5.2%。 Conclusion: Relink通过“推理即构建”的新范式,有效克服了传统静态知识图谱在检索增强生成中的局限性,显著提升了问答准确性和推理路径的保真度。 Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) mitigates hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are constrained by a prevailing \textit{build-then-reason} paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two critical challenges. First, the KG's inherent incompleteness often breaks reasoning paths. Second, the graph's low signal-to-noise ratio introduces distractor facts, presenting query-relevant but misleading knowledge that disrupts the reasoning process. To address these challenges, we argue for a \textit{reason-and-construct} paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, \textbf{Relink} instantiates required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink employs a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This empowers Relink to actively discard distractor facts and construct the most faithful and precise evidence path for each query. Extensive experiments on five Open-Domain Question Answering benchmarks show that Relink achieves significant average improvements of 5.4\% in EM and 5.2\% in F1 over leading GraphRAG baselines, demonstrating the superiority of our proposed framework.

[97] MI-PRUN: Optimize Large Language Model Pruning via Mutual Information

Hao Zhang,Zhibin Zhang,Guangxin Wu,He Chen,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

TL;DR: 本文提出了一种基于互信息的LLM剪枝方法MI-PRUN,通过隐藏状态转移评估冗余块,并结合数据处理不等式(DPI)和Fast-Block-Select算法实现全局最优与高效性。

Details Motivation: 大型语言模型计算和内存开销大,现有块剪枝方法不稳定且难以达到全局最优。 Method: 利用互信息评估隐藏状态转移来识别冗余块,结合数据处理不等式分析块重要性,并设计Fast-Block-Select算法迭代更新块组合以寻找全局最优解。 Result: 在多个模型和数据集上的实验表明该方法具有良好的稳定性、压缩性和推理加速效果。 Conclusion: MI-PRUN能有效稳定地压缩大模型,在保持性能的同时显著提升推理效率。 Abstract: Large Language Models (LLMs) have become indispensable across various domains, but this comes at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from models. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to attain globally optimal solutions. In this paper, we propose a mutual information based pruning method MI-PRUN for LLMs. Specifically, we leverages mutual information to identify redundant blocks by evaluating transitions in hidden states. Additionally, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of entire contiguous blocks and that of individual blocks. Moreover, we develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution while significantly improving the efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.

[98] The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Chen Shani,Yuval Reif,Nathan Roll,Dan Jurafsky,Ekaterina Shutova

Main category: cs.CL

TL;DR: 本文综述了多语言语言模型中语言间性能差异的原因,指出这些差距更多源于建模选择(如分词、编码和数据暴露)而非语言本身的复杂性,并提出了改进多语言模型均衡性的设计建议。

Details Motivation: 多语言语言模型在不同语言上的表现不均,本文旨在探究这种差距是源于语言内在难度还是建模方式的选择偏差。 Method: 通过梳理现有文献,围绕语言表征与资源分配选择对性能差距的影响进行分析,并将语言特征(如正字法、形态学、句法等)与具体建模机制关联。 Result: 发现当分词、编码和数据暴露得到标准化时,多数语言间的性能差距会缩小,表明当前建模选择是造成差异的主要原因。 Conclusion: 多语言模型中的语言不平衡问题主要来自设计选择而非语言本质难度,通过优化分词、采样、架构和评估方法可实现更公平的性能分布。 Abstract: Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.

[99] ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models

Huipeng Ma,Luan Zhang,Dandan Song,Linmei Hu,Yuhang Tian,Jun Yang,Changzhi Zhou,Chenhao Li,Yizhou Jin,Xudong Li,Meng Lin,Mingxing Zhang,Shuhao Zhang

Main category: cs.CL

TL;DR: 提出ActiShade方法,通过检测和激活被掩盖的知识来缓解多跳推理中的知识掩盖问题,提升检索增强生成的准确性。

Details Motivation: 现有基于LLM生成内容作为检索查询的多轮RAG方法容易出现知识掩盖现象,导致关键信息丢失,引发错误累积。 Method: ActiShade迭代地识别查询中被掩盖的关键短语,结合原始查询和关键短语检索相关文档,并基于检索结果生成新查询用于下一轮推理。 Result: 在多个数据集和大语言模型上实验表明,ActiShade优于现有的多跳推理方法。 Conclusion: 通过显式激活被掩盖的知识,ActiShade有效减少了多轮检索-生成过程中的错误累积,提升了多-hop推理性能。 Abstract: In multi-hop reasoning, multi-round retrieval-augmented generation (RAG) methods typically rely on LLM-generated content as the retrieval query. However, these approaches are inherently vulnerable to knowledge overshadowing - a phenomenon where critical information is overshadowed during generation. As a result, the LLM-generated content may be incomplete or inaccurate, leading to irrelevant retrieval and causing error accumulation during the iteration process. To address this challenge, we propose ActiShade, which detects and activates overshadowed knowledge to guide large language models (LLMs) in multi-hop reasoning. Specifically, ActiShade iteratively detects the overshadowed keyphrase in the given query, retrieves documents relevant to both the query and the overshadowed keyphrase, and generates a new query based on the retrieved documents to guide the next-round iteration. By supplementing the overshadowed knowledge during the formulation of next-round queries while minimizing the introduction of irrelevant noise, ActiShade reduces the error accumulation caused by knowledge overshadowing. Extensive experiments show that ActiShade outperforms existing methods across multiple datasets and LLMs.

[100] The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

Weihao Xuan,Qingcheng Zeng,Heli Qi,Yunze Xiao,Junjue Wang,Naoto Yokoya

Main category: cs.CL

TL;DR: 本研究探讨了工具集成的基于大语言模型的自主代理在多轮任务中的校准问题,发现不同工具类型(如检索类与验证类)显著影响其置信度校准,并提出一种结合强化学习的微调框架,有效提升代理在校准性和准确性上的表现,增强了其在复杂真实场景中的可信度与泛化能力。

Details Motivation: 确保使用工具的自主代理具备可靠的置信度校准是实现其可信性的关键挑战,尤其是在动态、多轮任务中,现有研究对工具类型如何影响校准尚不充分。 Method: 通过试点研究分析不同类型工具(如网络搜索与代码解释器)对代理置信度的影响,提出一种联合优化任务准确性和校准性的强化学习微调框架,并设计综合奖励基准进行评估。 Result: 发现证据类工具易导致严重过度自信,而验证类工具可缓解校准偏差;所提RL方法在多种工具环境下均显著改善校准性,并展现出在网页噪声环境和数学推理等跨域任务中的良好泛化能力。 Conclusion: 工具类型显著影响语言模型代理的校准行为,需采用针对性的校准策略;本工作为构建能可靠表达不确定性的自知型代理奠定了基础,对高风险实际应用具有重要意义。 Abstract: Autonomous agents based on large language models (LLMs) are rapidly evolving to handle multi-turn tasks, but ensuring their trustworthiness remains a critical challenge. A fundamental pillar of this trustworthiness is calibration, which refers to an agent's ability to express confidence that reliably reflects its actual performance. While calibration is well-established for static models, its dynamics in tool-integrated agentic workflows remain underexplored. In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. Specifically, our pilot study identifies that evidence tools (e.g., web search) systematically induce severe overconfidence due to inherent noise in retrieved information, while verification tools (e.g., code interpreters) can ground reasoning through deterministic feedback and mitigate miscalibration. To robustly improve calibration across tool types, we propose a reinforcement learning (RL) fine-tuning framework that jointly optimizes task accuracy and calibration, supported by a holistic benchmark of reward designs. We demonstrate that our trained agents not only achieve superior calibration but also exhibit robust generalization from local training environments to noisy web settings and to distinct domains such as mathematical reasoning. Our results highlight the necessity of domain-specific calibration strategies for tool-use agents. More broadly, this work establishes a foundation for building self-aware agents that can reliably communicate uncertainty in high-stakes, real-world deployments.

[101] Document-Level Zero-Shot Relation Extraction with Entity Side Information

Mohan Raj Chanthran,Soon Lay Ki,Ong Huey Fang,Bhawani Selvaretnam

Main category: cs.CL

TL;DR: 本文提出了一种新的文档级零样本关系抽取方法DocZSRE-SI,利用实体侧边信息(如实体提及描述和上位词)替代依赖大语言模型生成的合成数据,有效提升了低资源语言(如马来西亚英语)下的关系抽取性能,平均macro F1-score提升11.6%。

Details Motivation: 现有基于大语言模型的方法在低资源语言中面临语言多样性与事实错误的问题,难以准确生成适用于零样本关系抽取的训练数据。 Method: 提出DocZSRE-SI框架,引入实体提及描述和实体提及上位词作为侧边信息,构建不依赖LLM生成数据的零样本关系抽取模型。 Result: 在多个基准上平均macro F1-score提升了11.6%,显著优于基线模型和现有方法,尤其在马来西亚英语等低资源语言场景下表现更优。 Conclusion: DocZSRE-SI通过利用实体侧边信息,提供了一种高效、可靠且可扩展的零样本关系抽取方案,克服了LLM生成数据带来的误差,特别适用于低资源语言环境。 Abstract: Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilizing Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.

[102] Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

Kalvin Chang,Yiwen Shao,Jiahong Li,Dong Yu

Main category: cs.CL

TL;DR: 本文提出了一种通过ASR-only数据训练的语音编码器,实现了中文方言与普通话之间的跨方言语义对齐,并构建了一个新的中文方言语音基准,为未来方言到普通话的语音大模型奠定了基础。

Details Motivation: 由于大多数中文方言主要以口语形式存在,且在语音技术上落后于普通话,因此构建方言到普通话的语音大语言模型(speech-LLMs)更具实用性。但需要具备跨方言语义对齐的语音表示来支持这一目标。 Method: 利用仅包含自动语音识别(ASR)数据的语料训练一个语音编码器,通过在一个新构建的中文方言语音基准上的语音到语音检索任务,验证其跨方言语义对齐能力。 Result: 所提出的语音编码器在新建的中文方言语音基准上实现了有效的语音到语音检索,表现出良好的跨方言语义对齐能力,同时在中文方言ASR任务上达到最先进的性能。 Conclusion: 本研究通过ASR-only数据实现了跨方言语义对齐的语音表示,推动了中文方言语音技术的发展,为构建方言到普通话的语音大模型提供了重要基础。 Abstract: Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.

[103] ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios

Changzai Pan,Jie Zhang,Kaiwen Wei,Chenshuo Pan,Yu Zhao,Jingwang Huang,Jian Yang,Zhenhe Wu,Haoyang Zeng,Xiaoyan Gu,Weichao Sun,Yanbo Zhai,Yujie Mao,Zhuoru Jiang,Jiang Zhong,Shuangyong Song,Yongxiang Li,Zhongjiang He

Main category: cs.CL

TL;DR: 本文提出了ReasonTabQA,一个大规模双语基准,用于评估工业场景中的表格问答系统,并引入了TabCodeRL方法,通过强化学习提升逻辑推理路径的生成。

Details Motivation: 现有表格问答基准未能充分覆盖工业场景中的复杂特性,如多表结构、嵌套表头和大规模数据,导致对真实世界应用的支持不足。 Method: 提出ReasonTabQA基准,包含1932个来自30个行业领域的表格,并提供答案和显式推理链标注;同时提出TabCodeRL方法,利用表格感知的可验证奖励引导逻辑推理路径生成。 Result: 在ReasonTabQA和其他4个TableQA数据集上的实验表明,TabCodeRL显著提升了开源大模型的表现,但在ReasonTabQA上仍存在性能差距。 Conclusion: 工业场景下的表格问答具有高度复杂性,现有方法仍有局限,ReasonTabQA为未来研究提供了更具挑战性的评估平台。 Abstract: Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.

[104] PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling

Huachuan Qiu,Zhaoming Chen,Yuqian Chen,Yuan Xie,Yu Lu,Zhenzhong Lan

Main category: cs.CL

TL;DR: 本文提出了一种基于对话轨迹建模的新型客户模拟框架PsyCLIENT,用于提升心理咨询训练与评估,并发布了首个开源中文客户画像数据集PsyCLIENT-CP,实验表明其在真实性和训练效果上显著优于基线方法。

Details Motivation: 现有基于大模型的客户模拟方法在客户画像多样性、行为建模原则性框架以及中文场景支持方面存在不足,限制了其在心理咨询训练与评估中的应用效果。 Method: 提出PsyCLIENT框架,通过引入真实世界对话轨迹(包含显式行为标签和内容约束)来引导大语言模型生成,结合新构建的中文客户画像数据集PsyCLIENT-CP进行模拟。 Result: 经专业心理咨询师评估,PsyCLIENT在真实性与训练有效性上显著优于基线方法,模拟客户在辨别任务中达到约95%的专家混淆率,几乎无法与真实客户区分。 Conclusion: 对话轨迹建模能有效连接理论客户画像与动态真实模拟,为心理健康教育与研究提供了可靠的技术方案,且代码与数据将开源以促进后续研究。 Abstract: LLM-based client simulation has emerged as a promising tool for training novice counselors and evaluating automated counseling systems. However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behaviors, and (3) a scarcity in Chinese-language settings. To address these limitations, we propose PsyCLIENT, a novel simulation framework grounded in conversational trajectory modeling. By conditioning LLM generation on predefined real-world trajectories that incorporate explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions. We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics. Comprehensive evaluations involving licensed professional counselors demonstrate that PsyCLIENT significantly outperforms baselines in terms of authenticity and training effectiveness. Notably, the simulated clients are nearly indistinguishable from human clients, achieving an about 95\% expert confusion rate in discrimination tasks. These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research. Code and data will be released to facilitate future research in mental health counseling.

[105] Mitrasamgraha: A Comprehensive Classical Sanskrit Machine Translation Dataset

Sebastian Nehrdich,David Allport,Sven Sellmer,Jivnesh Sandhan,Manoj Balaji Jagadeeshan,Pawan Goyal,Sujeet Kumar,Kurt Keutzer

Main category: cs.CL

TL;DR: 本文介绍了Mitrasamgraha,一个大规模、高质量的梵语到英语机器翻译数据集,包含391,548个双语文本对,覆盖超过三千年历史和多种领域,旨在解决复杂语言现象(如复合词、哲学概念和多层隐喻)带来的翻译挑战,并支持对时间和领域影响的细粒度研究。

Details Motivation: 尽管机器翻译在高资源语言上取得进展,但面对诗意语言、哲学概念和多层隐喻等复杂表达时仍存在显著不足,尤其是梵语文献因其语言复杂性和跨千年、跨领域的文本特性而尤为困难,且现有公开资源匮乏。 Method: 构建了一个名为Mitrasamgraha的大规模梵英翻译数据集,包含训练、验证和测试集,所有数据均带有时间和领域标注;在此基础上对商用和开源模型(如NLLB和Gemma)进行基准测试和微调,并探索上下文学习对性能的影响。 Result: 该数据集规模超过此前最大的梵语数据集Itih=asa四倍以上,实验显示微调后的模型性能显著提升,但在处理复杂复合词、哲学概念和多层隐喻方面仍面临挑战;同时发现时间与领域标注有助于分析MT性能差异。 Conclusion: Mitrasamgraha为梵语机器翻译提供了重要资源,推动了对低资源、高复杂性语言的翻译研究,揭示了当前模型在处理深层语言结构上的局限性,并为未来研究提供了可扩展的基础。 Abstract: While machine translation is regarded as a "solved problem" for many high-resource languages, close analysis quickly reveals that this is not the case for content that shows challenges such as poetic language, philosophical concepts, multi-layered metaphorical expressions, and more. Sanskrit literature is a prime example of this, as it combines a large number of such challenges in addition to inherent linguistic features like sandhi, compounding, and heavy morphology, which further complicate NLP downstream tasks. It spans multiple millennia of text production time as well as a large breadth of different domains, ranging from ritual formulas via epic narratives, philosophical treatises, poetic verses up to scientific material. As of now, there is a strong lack of publicly available resources that cover these different domains and temporal layers of Sanskrit. We therefore introduce Mitrasamgraha, a high-quality Sanskrit-to-English machine translation dataset consisting of 391,548 bitext pairs, more than four times larger than the largest previously available Sanskrit dataset Itih=asa. It covers a time period of more than three millennia and a broad range of historical Sanskrit domains. In contrast to web-crawled datasets, the temporal and domain annotation of this dataset enables fine-grained study of domain and time period effects on MT performance. We also release a validation set consisting of 5,587 and a test set consisting of 5,552 post-corrected bitext pairs. We conduct experiments benchmarking commercial and open models on this dataset and fine-tune NLLB and Gemma models on the dataset, showing significant improvements, while still recognizing significant challenges in the translation of complex compounds, philosophical concepts, and multi-layered metaphors. We also analyze how in-context learning on this dataset impacts the performance of commercial models

[106] How to predict creativity ratings from written narratives: A comparison of co-occurrence and textual forma mentis networks

Roberto Passaro,Edith Haim,Massimo Stella

Main category: cs.CL

TL;DR: 本文介绍了一种从短篇创意文本构建和分析语义网络的工作流程,比较了词共现网络和文本形式思维网络(TFMN),并展示了它们在预测人类创造力评分中的应用。

Details Motivation: 为创造力研究等认知领域提供基于网络的实用分析方法,并帮助研究人员理解不同文本到网络建模方式对预测性能的影响。 Method: 使用1029篇短篇故事语料库,进行文本预处理、网络构建、特征提取(包括结构指标、扩散激活指数和情感得分)以及回归模型的应用,评估不同网络构建策略对拓扑结构和预测效果的影响。 Result: TFMN在所有设置下均优于共现网络(最佳MAE:TFMN为0.581,共现窗口3为0.592);结构特征表现最好(TFMN MAE=0.591),情感特征较差(MAE=0.711),扩散激活贡献最小(MAE=0.788)。 Conclusion: TFMN比词共现网络更有效,尤其在网络结构特征方面,适合用于预测文本创造力,本文提供了可复现的开放工作流程,适用于新手和资深研究者。 Abstract: This tutorial paper provides a step-by-step workflow for building and analysing semantic networks from short creative texts. We introduce and compare two widely used text-to-network approaches: word co-occurrence networks and textual forma mentis networks (TFMNs). We also demonstrate how they can be used in machine learning to predict human creativity ratings. Using a corpus of 1029 short stories, we guide readers through text preprocessing, network construction, feature extraction (structural measures, spreading-activation indices, and emotion scores), and application of regression models. We evaluate how network-construction choices influence both network topology and predictive performance. Across all modelling settings, TFMNs consistently outperformed co-occurrence networks through lower prediction errors (best MAE = 0.581 for TFMN, vs 0.592 for co-occurrence with window size 3). Network-structural features dominated predictive performance (MAE = 0.591 for TFMN), whereas emotion features performed worse (MAE = 0.711 for TFMN) and spreading-activation measures contributed little (MAE = 0.788 for TFMN). This paper offers practical guidance for researchers interested in applying network-based methods for cognitive fields like creativity research. we show when syntactic networks are preferable to surface co-occurrence models, and provide an open, reproducible workflow accessible to newcomers in the field, while also offering deeper methodological insight for experienced researchers.

[107] BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation

Xuan Li,Yining Wang,Haocai Luo,Shengping Liu,Jerry Liang,Ying Fu,Weihuang,Jun Yu,Junnan Zhu

Main category: cs.CL

TL;DR: 提出了一种基于贝叶斯推理和Dempster-Shafer证据理论的多模态检索框架BayesRAG,通过跨模态一致性和布局连贯性提升视觉丰富文档中的检索效果。

Details Motivation: 现有检索增强生成方法在处理图文混合文档时,因模态间孤立建模而难以捕捉语义强化和布局一致性,导致检索性能受限。 Method: 采用贝叶斯推断与证据理论,将多模态检索结果间的语义与布局一致性建模为概率证据,计算文本-图像组合的后验关联概率,优化检索排序。 Result: 在多个具有挑战性的多模态基准测试上显著优于当前最先进方法,提升了检索的准确性和鲁棒性。 Conclusion: BayesRAG建立了一种新的多模态检索融合范式,有效解决了异构模态隔离问题,增强了检索结果的可靠性。 Abstract: Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.

[108] Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

Yanzhi Tian,Cunxiang Wang,Zeming Liu,Heyan Huang,Wenbo Yu,Dawei Song,Jie Tang,Yuhang Guo

Main category: cs.CL

TL;DR: 提出了一种新的翻译评估框架RATE,用于解决非字面翻译中传统机器翻译评估指标的不准确性问题,并通过新构建的MENT数据集验证了其有效性。

Details Motivation: 由于社交媒体、文学等复杂语言领域中的非字面表达导致传统机器翻译评估指标和LLM-as-a-Judge方法存在不准确性和评分不一致问题,因此需要更可靠的评估方法。 Method: 构建了一个专注于非字面翻译的元评估数据集MENT,包含四个领域、多种翻译系统输出及7,530个人工标注评分;在此基础上提出RATE框架,采用具有反思能力的核心代理动态调用专业子代理进行评估。 Result: 实验显示传统指标和LLM-as-a-Judge存在知识截止和评分不一致问题;RATE在元评分上至少提升3.2分,并在通用领域也表现出鲁棒性。 Conclusion: RATE框架能有效提升对非字面翻译的评估可靠性,兼具领域适应性与稳定性,为未来MT评估提供了新范式。 Abstract: Large Language Models (LLMs) have significantly advanced Machine Translation (MT), applying them to linguistically complex domains-such as Social Network Services, literature etc. In these scenarios, translations often require handling non-literal expressions, leading to the inaccuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problem. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.

[109] DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models

Shaokai He,Kaiwen Wei,Xinyi Zeng,Xiang Chen,Xue Yang,Zhenyang Li,Jiang Zhong,Yu Tian

Main category: cs.CL

TL;DR: Diffusion LLMs也存在“反转诅咒”问题,本文通过分析其成因提出一种新的实体-关系建模方法DiffER,有效缓解了该问题。

Details Motivation: 尽管扩散语言模型(DLLMs)是双向训练的,但仍表现出反转诅咒现象,需探究根本原因并提出改进方法。 Method: 提出Diffusion Entity-Relation Modeling (DiffER),采用整体实体掩码、分布对称和关系增强的数据构建策略来解决实体碎片化、数据不对称和关系缺失问题。 Result: 实验表明DiffER能有效缓解Diffusion LLM中的反转诅咒问题,在多种设置下显著提升性能。 Conclusion: 反转诅咒不仅与训练方向性有关,还受数据结构和实体处理方式影响;DiffER为构建更健壮的双向语言模型提供了新思路。 Abstract: The "reversal curse" refers to the phenomenon where large language models (LLMs) exhibit predominantly unidirectional behavior when processing logically bidirectional relationships. Prior work attributed this to autoregressive training -- predicting the next token inherently favors left-to-right information flow over genuine bidirectional knowledge associations. However, we observe that Diffusion LLMs (DLLMs), despite being trained bidirectionally, also suffer from the reversal curse. To investigate the root causes, we conduct systematic experiments on DLLMs and identify three key reasons: 1) entity fragmentation during training, 2) data asymmetry, and 3) missing entity relations. Motivated by the analysis of these reasons, we propose Diffusion Entity-Relation Modeling (DiffER), which addresses the reversal curse through entity-aware training and balanced data construction. Specifically, DiffER introduces whole-entity masking, which mitigates entity fragmentation by predicting complete entities in a single step. DiffER further employs distribution-symmetric and relation-enhanced data construction strategies to alleviate data asymmetry and missing relations. Extensive experiments demonstrate that DiffER effectively alleviates the reversal curse in Diffusion LLMs, offering new perspectives for future research.

[110] Controlled Self-Evolution for Algorithmic Code Optimization

Tu Hu,Ronghao Chen,Shuo Zhang,Jianghao Yin,Mou Xiao Feng,Jingping Liu,Shaolei Zhang,Wenqi Jiang,Yuqi Fang,Sen Hu,Yi Xu,Huacan Wang

Main category: cs.CL

TL;DR: 本文提出了一种名为受控自进化(CSE)的新方法,通过多样化规划初始化、反馈引导的遗传进化和分层进化记忆,显著提升了代码生成中自我进化的探索效率,在EffiBench-X上优于现有方法。

Details Motivation: 现有自进化方法因初始化偏差、无指导的随机操作和经验利用不足,导致在有限预算下难以发现复杂优质解,效率低下。 Method: 提出CSE框架,包含三个核心组件:多样化规划初始化(生成结构不同的算法策略)、遗传进化(用反馈机制实现定向变异和组合交叉)、分层进化记忆(跨任务和任务内记录成功与失败经验)。 Result: 在EffiBench-X上的实验表明,CSE在多种大语言模型基础上 consistently 超过所有基线方法,且从早期代际即表现出更高效率,并在整个进化过程中持续提升性能。 Conclusion: CSE通过结构化控制机制有效解决了自进化过程中的探索低效问题,显著提升了代码生成的质量与效率,具备良好的通用性和可扩展性。 Abstract: Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks.To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels.Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.

[111] Reward Modeling from Natural Language Human Feedback

Zongqi Wang,Rui Wang,Yuchuan Wu,Yiyao Yu,Pinyi Zhang,Shaoning Sun,Yujiu Yang,Yongbin Li

Main category: cs.CL

TL;DR: 本文提出了一种基于自然语言人类反馈的奖励建模方法(RM-NLHF),通过利用人类批评与生成批评之间的相似性作为训练奖励,解决了传统二元偏好标签导致的奖励噪声问题,并引入Meta Reward Model(MetaRM)以推广到无人工批评的数据,显著优于现有基于结果奖励的生成式奖励模型。

Details Motivation: 传统的强化学习依赖二元偏好标签作为奖励信号,容易导致模型猜测正确结果而生成不合理的推理链,从而引入噪声,降低训练效果。 Method: 提出RM-NLHF方法,使用自然语言反馈中的批评内容计算生成批评与人工批评的相似性作为过程奖励;同时设计MetaRM模型,学习从含人工批评的数据中预测过程奖励,并泛化至无标注数据。 Result: 在多个基准测试上,该方法 consistently 优于现有的仅依赖结果奖励的生成式奖励模型,验证了过程奖励的有效性和优越性。 Conclusion: 引入自然语言形式的人类反馈并利用过程奖励能有效缓解二元分类任务中的奖励噪声问题,提升强化学习中奖励模型的训练质量与泛化能力。 Abstract: Reinforcement Learning with Verifiable reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). Typically in pairwise rewarding tasks, GRMs generate reasoning chains ending with critiques and preference labels, and RLVR then relies on the correctness of the preference labels as the training reward. However, in this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. Consequently, these spurious successes introduce substantial noise into the reward signal, thereby impairing the effectiveness of reinforcement learning. To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent in binary tasks. Specifically, we compute the similarity between GRM-generated and human critiques as the training reward, which provides more accurate reward signals than outcome-only supervision. Additionally, considering that human critiques are difficult to scale up, we introduce Meta Reward Model (MetaRM) which learns to predict process reward from datasets with human critiques and then generalizes to data without human critiques. Experiments on multiple benchmarks demonstrate that our method consistently outperforms state-of-the-art GRMs trained with outcome-only reward, confirming the superiority of integrating natural language over binary human feedback as supervision.

[112] Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models

Linhao Zhong,Linyu Wu,Bozhen Fang,Tianjian Feng,Chenchen Jing,Wen Wang,Jiaheng Zhang,Hao Chen,Chunhua Shen

Main category: cs.CL

TL;DR: 本文提出EvoToken-DLM,一种基于扩散的语言模型,通过软化token分布和连续轨迹监督实现可修订的渐进式解码,提升了DLM的性能。

Details Motivation: 现有扩散语言模型多依赖硬掩码和离散标记,限制了早期决策的修正并未能充分利用中间概率表示。 Method: 引入EvoToken-DLM,使用演化的软token分布替代硬二值掩码,并采用连续轨迹监督对齐训练目标与迭代概率更新。 Result: 在多个基准测试上实验表明,EvoToken-DLM性能优于现有的扩散和掩码DLM基线方法。 Conclusion: EvoToken-DLM通过软化表示和连续监督实现了更灵活、高效的语言生成,为扩散语言模型提供了新方向。 Abstract: Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: https://aim-uofa.github.io/EvoTokenDLM.

[113] TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Tianyu Liu,Qitan Lv,Yuhao Shen,Xiao Sun,Xiaoyan Sun

Main category: cs.CL

TL;DR: 提出TALON,一种无需训练、基于预算的自适应树扩展框架,用于提升大模型推理中树结构推测解码的效率,在多种模型和数据集上显著优于现有方法。

Details Motivation: 现有树结构推测解码方法采用固定宽度和深度的草案树,无法根据上下文难度动态调整结构,导致在简单或困难token上的生成效率不高。 Method: 提出TALON框架,通过预算驱动的迭代方式构建草案树,采用混合扩展策略自适应分配每层节点预算,形成‘深而窄’或‘浅而宽’的树结构以适配不同确定性程度的上下文。 Result: 在5个模型和6个数据集上实验表明,TALON持续优于最先进的EAGLE-3方法,相较于自回归解码最高速度提升达5.16倍。 Conclusion: TALON通过自适应树扩展机制有效优化了推测解码中的探索与生成平衡,显著提升了LLM推理速度,且无需额外训练,具有良好的通用性和实用性。 Abstract: Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to early stop on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.

[114] Semantic Compression of LLM Instructions via Symbolic Metalanguages

Ernst van Gassen

Main category: cs.CL

TL;DR: MetaGlyph是一种使用数学符号压缩提示的符号语言,能够在不额外训练的情况下被模型理解,显著减少token使用量,并在不同规模和类型的模型中表现出差异化的性能。

Details Motivation: 探索是否可以利用模型已从训练数据中学到的数学符号作为指令捷径,从而实现提示压缩,降低部署成本与资源消耗。 Method: 设计一种名为MetaGlyph的符号语言,用如$\in$和$\Rightarrow$等常见数学符号替代自然语言指令,并在多个开源与闭源、不同参数规模的模型上测试其语义等价性与解析成功率。 Result: MetaGlyph实现了62-81%的token缩减;大型模型表现良好,例如Kimi K2在选择任务上达到100%准确率,GPT-5.2在成员操作上有91.3%保真度;而中型开源模型(7B-12B)接近零保真度,显示性能随规模呈U型分布。 Conclusion: 数学符号可作为有效的指令压缩手段,尤其在大规模模型上效果显著,表明足够大的模型规模能克服指令微调带来的偏差,具备实际部署价值。 Abstract: We introduce MetaGlyph, a symbolic language for compressing prompts by encoding instructions as mathematical symbols rather than prose. Unlike systems requiring explicit decoding rules, MetaGlyph uses symbols like $\in$ (membership) and $\Rightarrow$ (implication) that models already understand from their training data. We test whether these symbols work as ''instruction shortcuts'' that models can interpret without additional teaching. We evaluate eight models across two dimensions relevant to practitioners: scale (3B-1T parameters) and accessibility (open-source for local deployment vs. proprietary APIs). MetaGlyph achieves 62-81% token reduction across all task types. For API-based deployments, this translates directly to cost savings; for local deployments, it reduces latency and memory pressure. Results vary by model. Gemini 2.5 Flash achieves 75% semantic equivalence between symbolic and prose instructions on selection tasks, with 49.9% membership operator fidelity. Kimi K2 reaches 98.1% fidelity for implication ($\Rightarrow$) and achieves perfect (100%) accuracy on selection tasks with symbolic prompts. GPT-5.2 Chat shows the highest membership fidelity observed (91.3%), though with variable parse success across task types. Claude Haiku 4.5 achieves 100% parse success with 26% membership fidelity. Among mid-sized models, Qwen 2.5 7B shows 62% equivalence on extraction tasks. Mid-sized open-source models (7B-12B) show near-zero operator fidelity, suggesting a U-shaped relationship where sufficient scale overcomes instruction-tuning biases.

[115] Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing

Minerva Suvanto,Andrea McGlinchey,Mattias Wahde,Peter J Barclay

Main category: cs.CL

TL;DR: 该研究探讨了如何区分人类创作的小说片段与LLM生成的类似文本,发现机器学习模型在仅使用单个词特征的情况下仍能达到0.93–0.98的准确率,而人类判断接近随机。通过可解释的线性分类器分析,揭示了LLM文本的五个关键识别特征:同义词多样性更高、时间漂移、美式用语、外语使用和口语表达,表明此类检测具有鲁棒性。

Details Motivation: 区分人类创作与LLM生成的创意小说文本,防止AI生成内容被误认为人类作品。 Method: 使用多种机器学习模型进行二分类任务,重点采用高准确率(0.98)的线性分类器进行可解释性分析,提取关键特征如单个词(unigram)等。 Result: 机器学习模型在未见过的测试集上达到0.93–0.98的准确率;识别出LLM生成文本的关键特征包括更高的同义词多样性、时间漂移、美式用语、外语使用和口语化表达。 Conclusion: 基于多类语言特征组合的检测方法对识别LLM生成文本具有高度准确性与鲁棒性,难以被轻易规避,有助于防范AI文本冒充人类创作的风险。 Abstract: We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.

[116] Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng,Wangding Zeng,Damai Dai,Qinyu Chen,Bingxuan Wang,Zhenda Xie,Kezhao Huang,Xingkai Yu,Zhewen Hao,Yukun Li,Han Zhang,Huishuai Zhang,Dongyan Zhao,Wenfeng Liang

Main category: cs.CL

TL;DR: 本文提出了条件内存(conditional memory)作为稀疏性计算的新维度,并通过Engram模块实现,结合Mixture-of-Experts(MoE),在保持相同参数和FLOPs的情况下显著提升模型在知识检索、推理、代码和数学等任务上的性能。

Details Motivation: Transformer缺乏原生的知识查找机制,导致其低效地通过计算模拟检索过程;而MoE虽扩展了容量,但仍需补充新的稀疏性机制以更高效利用计算资源。 Method: 提出Engram模块,将经典的N-gram嵌入现代化,支持O(1)查找操作;引入“稀疏性分配”问题,发现神经计算(MoE)与静态内存(Engram)之间的U型缩放律,并据此指导Engram的扩展。 Result: 27B参数的Engram在多项任务上优于同等参数和FLOPs的MoE模型:MMLU +3.4,CMMLU +4.0,BBH +5.0,ARC-Challenge +3.7,HumanEval +3.0,MATH +2.4;长上下文检索(Multi-Query NIAH)从84.2提升至97.0;早期层负担减轻,注意力更聚焦全局上下文。 Conclusion: 条件内存是一种不可或缺的建模原语,能够有效分离局部依赖与复杂推理,提升模型效率与性能,为下一代稀疏模型提供基础设施支持。 Abstract: While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains~(HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.

[117] GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

Farzad Shami,Subhrasankha Dey,Nico Van de Weghe,Henrikki Tenkanen

Main category: cs.CL

TL;DR: 本文提出GROKE,一种基于OSM数据的无视觉、无需训练的分层LLM框架,用于评估导航指令。通过结构化空间信息表示和图导航,显著降低导航错误率,实现可扩展且可解释的指令评估。

Details Motivation: 现有导航指令评估方法依赖视觉模拟器或无法准确反映指令功能性,存在计算成本高、感知误差干扰等问题,亟需一种不依赖视觉且能有效衡量指令实用性的评估方式。 Method: 提出GROKE框架,利用OpenStreetMap数据构建拓扑图,采用分层架构结合子指令规划与图推理,使用结构化JSON和文本格式表示空间信息,通过LLM实现无视觉、无需训练的指令评估。 Result: 在Map2Seq数据集上,相比启发式与采样基线,导航错误率降低68.5%,执行成功率、轨迹保真度和决策模式验证了方法的有效性。 Conclusion: GROKE提供了一种可扩展、可解释且无需视觉输入的导航指令评估新范式,能够有效利用公开地图数据进行功能导向的指令质量评估。 Abstract: The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research. Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination. Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment. This paper introduces GROKE(Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free training-free hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations. Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. The agent's execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies. Code and data are available at https://anonymous.4open.science/r/groke.

[118] Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Ziheng Li,Liu Kang,Feng Xiao,Luxi Xing,Qingyi Si,Zhuoran Li,Weikang Gong,Deqing Yang,Yanghua Xiao,Hongcheng Guo

Main category: cs.CL

TL;DR: 提出了一种细粒度信用分配机制OAR,用于改进无批评者的强化学习框架GRPO,在数学推理任务中显著提升大语言模型的推理能力。

Details Motivation: 标准GRPO采用粗粒度信用分配,将组级奖励均匀传播到序列中每个token,忽略了不同推理步骤贡献的差异性,限制了模型对关键推理步骤的学习效率。 Method: 提出Outcome-grounded Advantage Reshaping (OAR),包含两种策略:OAR-P通过反事实token扰动估计结果敏感性,提供高保真归因信号;OAR-G利用输入梯度作为敏感性代理,单次反向传播即可近似影响信号。结合保守的双层优势重塑机制,在保持总优势质量的同时增强关键token、抑制非关键token。 Result: 在多个数学推理基准上实验表明,OAR-P达到性能上限,OAR-G以可忽略的计算开销实现相近性能提升,两者均显著优于强GRPO基线。 Conclusion: OAR通过细粒度信用分配有效提升了无批评者强化学习框架下的大语言模型推理能力,为高效、低开销的推理优化提供了可行方案。 Abstract: Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.

[119] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

Wen Luo,Guangyue Peng,Wei Li,Shaohang Wei,Feifan Song,Liang Wang,Nan Yang,Xingxing Zhang,Jing Jin,Furu Wei,Houfeng Wang

Main category: cs.CL

TL;DR: 本文研究了大语言模型(LLM)生成虚假信息(幻觉)的问题,揭示其内部真实性信号来源于两种独立的信息路径:问题锚定路径和答案锚定路径,并通过实验验证这两种机制与知识边界的关系,进而提出提升幻觉检测性能的应用。

Details Motivation: 尽管大语言模型具备强大能力,但常生成幻觉,限制其可靠性;理解其内部真实性信号的来源有助于构建更可信的生成系统。 Method: 通过注意力屏蔽(attention knockout)和词元修补(token patching)方法分离并验证问题锚定与答案锚定两条信息路径,并分析其特性及其与知识边界的关联。 Result: 发现两种路径分别依赖问题-答案信息流和生成答案自身的内部证据;两者与模型的知识边界密切相关,且模型内部表示能区分这两种机制。 Conclusion: 大语言模型内部的真实性信号源自两种可分离的机制,该发现为提升幻觉检测和构建自我感知的生成系统提供了新方向。 Abstract: Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these insightful findings, two applications are proposed to enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.

[120] SAD: A Large-Scale Strategic Argumentative Dialogue Dataset

Yongkang Liu,Jiayang Yu,Mingyang Wang,Yiqun Zhang,Ercong Nie,Shi Feng,Daling Wang,Kaisong Song,Hinrich Schütze

Main category: cs.CL

TL;DR: 本文提出了一个大规模的战略性论证对话数据集SAD,包含392,822个样本,旨在支持对多轮论证对话的深入建模。每个话语标注了五种策略类型,要求模型根据对话历史、立场和目标策略生成合适的论点。

Details Motivation: 现有论证语料库多集中于单轮、非交互设置,难以反映真实场景中多轮、策略性论证对话的特点,因此需要构建更贴近实际的多轮对话数据集。 Method: 基于论证理论构建SAD数据集,对每条话语标注五种论证策略(允许多标签),并设计任务要求模型根据对话历史、立场和指定策略生成论点,最后对多种预训练生成模型进行基准测试。 Result: 发布了包含392,822个样本的SAD数据集,实验证明模型在生成策略性论证方面仍有挑战,同时分析揭示了论证策略使用的模式。 Conclusion: SAD为研究多轮、策略性论证对话提供了重要资源,推动了论证生成模型向更贴近人类真实辩论行为的方向发展。 Abstract: Argumentation generation has attracted substantial research interest due to its central role in human reasoning and decision-making. However, most existing argumentative corpora focus on non-interactive, single-turn settings, either generating arguments from a given topic or refuting an existing argument. In practice, however, argumentation is often realized as multi-turn dialogue, where speakers defend their stances and employ diverse argumentative strategies to strengthen persuasiveness. To support deeper modeling of argumentation dialogue, we present the first large-scale \textbf{S}trategic \textbf{A}rgumentative \textbf{D}ialogue dataset, SAD, consisting of 392,822 examples. Grounded in argumentation theories, we annotate each utterance with five strategy types, allowing multiple strategies per utterance. Unlike prior datasets, SAD requires models to generate contextually appropriate arguments conditioned on the dialogue history, a specified stance on the topic, and targeted argumentation strategies. We further benchmark a range of pretrained generative models on SAD and present in-depth analysis of strategy usage patterns in argumentation.

[121] KALE: Enhancing Knowledge Manipulation in Large Language Models via Knowledge-aware Learning

Qitan Lv,Tianyu Liu,Qiaosheng Zhang,Xingcheng Xu,Chaochao Lu

Main category: cs.CL

TL;DR: 提出KALE框架,利用知识图谱生成高质量推理链,提升大语言模型的知识操作能力,在八个基准测试中显著提高准确率。

Details Motivation: 现有监督微调方法存在已知但错误的现象,即模型虽具备相关知识却无法正确回答问题,需改进知识操作能力。 Method: 提出KALE框架,包括基于知识图谱的多跳推理路径提取的数据合成方法(KI)和通过KL散度最小化实现的推理引导微调范式(KA)。 Result: 在八个基准测试和六种大语言模型上验证了KALE的有效性,准确率最高提升11.72%,平均提升4.18%。 Conclusion: KALE能有效增强大语言模型的知识回忆、推理与迁移能力,显著改善知识操作性能。 Abstract: Despite the impressive performance of large language models (LLMs) pretrained on vast knowledge corpora, advancing their knowledge manipulation-the ability to effectively recall, reason, and transfer relevant knowledge-remains challenging. Existing methods mainly leverage Supervised Fine-Tuning (SFT) on labeled datasets to enhance LLMs' knowledge manipulation ability. However, we observe that SFT models still exhibit the known&incorrect phenomenon, where they explicitly possess relevant knowledge for a given question but fail to leverage it for correct answers. To address this challenge, we propose KALE (Knowledge-Aware LEarning)-a post-training framework that leverages knowledge graphs (KGs) to generate high-quality rationales and enhance LLMs' knowledge manipulation ability. Specifically, KALE first introduces a Knowledge-Induced (KI) data synthesis method that efficiently extracts multi-hop reasoning paths from KGs to generate high-quality rationales for question-answer pairs. Then, KALE employs a Knowledge-Aware (KA) fine-tuning paradigm that enhances knowledge manipulation by internalizing rationale-guided reasoning through minimizing the KL divergence between predictions with and without rationales. Extensive experiments on eight popular benchmarks across six different LLMs demonstrate the effectiveness of KALE, achieving accuracy improvements of up to 11.72% and an average of 4.18%.

[122] Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

Dongryeol Lee,Yerin Hwang,Taegwan Kang,Minwoo Lee,Younhyung Chae,Kyomin Jung

Main category: cs.CL

TL;DR: 本文研究了大语言模型(LLM)作为自动评判者在参考条件问答评估中的可靠性问题,发现当参考答案与模型自身知识冲突时,评估结果会显著失真。

Details Motivation: 探究LLM在参考条件评估任务中因参数知识与给定参考冲突而导致评分不可靠的问题。 Method: 提出一种可控的替换参考问答框架,通过用错误实体替换正确参考答案来诱发参考-信念冲突,并测试多种裁判模型的表现。 Result: 实验表明,在参考被替换后,各类裁判模型的评分可靠性急剧下降,且这种问题源于模型过度依赖自身参数知识而忽视给定参考。常见基于提示的缓解策略效果有限。 Conclusion: 揭示了LLM作为评判者的根本局限性,强调需要设计更强约束的参考遵循机制以提升评估保真度。 Abstract: While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

[123] High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning

Yongkang Liu,Xing Li,Mengjie Zhao,Shanru Zhang,Zijing Wang,Qian Li,Shi Feng,Feiliang Ren,Daling Wang,Hinrich Schütze

Main category: cs.CL

TL;DR: 本文提出了SMoA,一种高秩结构化调制适配器,通过在多个子空间中选择性放大或抑制原始权重的重要特征,在减少可训练参数的同时保持更高秩,从而提升模型表示能力和性能。

Details Motivation: 随着模型参数增加,低秩适应(LoRA)虽节省资源但受限于表示能力;需要一种在更少参数下保持高秩以提升性能的方法。 Method: 冻结原始预训练权重,引入结构化调制机制,在多个子空间中动态调节特征重要性,实现高秩模拟且参数量少。 Result: 在10个任务上超越LoRA及其变体,消融实验验证了子空间机制和调制策略的有效性。 Conclusion: SMoA通过结构化调制和子空间机制,在更低参数成本下实现了更强的表示能力和更优性能,是参数高效微调的一个有效新方向。 Abstract: As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity when compared to full parameter fine-tuning. We present \textbf{SMoA}, a high-rank \textbf{S}tructured \textbf{MO}dulation \textbf{A}dapter that uses fewer trainable parameters while maintaining a higher rank, thereby improving the model's representational capacity and offering improved performance potential. The core idea is to freeze the original pretrained weights and selectively amplify or suppress important features of the original weights across multiple subspaces. The subspace mechanism provides an efficient way to increase the capacity and complexity of a model. We conduct both theoretical analyses and empirical studies on various tasks. Experiment results show that SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness.

[124] Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions

Yongqi Li,Hao Lang,Tieyun Qian,Yongbin Li

Main category: cs.CL

TL;DR: 本文提出了一种基于潜在动作空间的视觉-语言模型强化学习微调方法,通过学习紧凑的潜在动作空间来应对大规模文本标记空间的挑战。

Details Motivation: 现有的多模态对话代理在使用强化学习微调时难以有效处理巨大的文本标记空间,且配对的图像-文本数据稀缺限制了潜在空间的学习。 Method: 采用从观察中学习的机制构建潜在动作空间的码本,并利用未来观测估计当前潜在动作;结合配对图像-文本数据和纯文本数据,通过跨模态投影器将文本嵌入转换为图像-文本嵌入,并在大规模纯文本数据上使用循环一致性损失进行训练以增强鲁棒性。 Result: 该方法在两个对话任务上、多种强化学习算法下均优于现有基线方法。 Conclusion: 通过构建紧凑且覆盖广泛的潜在动作空间,所提方法有效提升了多模态对话代理在强化学习微调中的性能。 Abstract: Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.

[125] Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Ngoc Trinh Hung Nguyen,Alonso Silva,Laith Zumot,Liubov Tupikina,Armen Aghasaryan,Mehwish Alam

Main category: cs.CL

TL;DR: 提出一种结合自然生成与结构化生成的方法,通过触发词在自由推理和结构化输出间切换,既保持了语言模型的表达能力,又确保了输出的可解析性,在多个任务上准确率最高提升27%,仅增加10-20个额外token。

Details Motivation: 自然生成虽能保留模型的丰富推理能力,但输出难以解析;结构化生成虽保证输出格式统一,却可能限制模型推理。需要一种方法兼顾两者的优点。 Method: 允许大语言模型先进行自由的自然语言推理,当生成特定触发词时,切换为结构化生成(如JSON格式),从而结合自然生成的表达力与结构化生成的可靠性。 Result: 在多个分类与推理数据集上验证了方法的有效性,相比纯自然生成,准确率最高提升27%,且仅引入10-20个额外token的开销。 Conclusion: 该方法成功融合了自然生成与结构化生成的优势,在保持模型推理能力的同时确保输出可解析,具有较高的实用价值和广泛的应用潜力。 Abstract: Natural generation allows Language Models (LMs) to produce free-form responses with rich reasoning, but the lack of guaranteed structure makes outputs difficult to parse or verify. Structured generation, or constrained decoding, addresses this drawback by producing content in standardized formats such as JSON, ensuring consistency and guaranteed-parsable outputs, but it can inadvertently restrict the model's reasoning capabilities. In this work, we propose a simple approach that combines the advantages of both natural and structured generation. By allowing LLMs to reason freely until specific trigger tokens are generated, and then switching to structured generation, our method preserves the expressive power of natural language reasoning while ensuring the reliability of structured outputs. We further evaluate our approach on several datasets, covering both classification and reasoning tasks, to demonstrate its effectiveness, achieving a substantial gain of up to 27% in accuracy compared to natural generation, while requiring only a small overhead of 10-20 extra tokens.

[126] From RAG to Agentic RAG for Faithful Islamic Question Answering

Gagan Bhatia,Hamdy Mubarak,Mustafa Jarrar,George Mikros,Fadi Zaraket,Mahmoud Alhirthani,Mutaz Al-Khatib,Logan Cochrane,Kareem Darwish,Rashid Yahiaoui,Firoj Alam

Main category: cs.CL

TL;DR: 本文提出了ISLAMICFAITHQA,一个包含3810个双语(阿拉伯语/英语)项目的生成式基准,用于评估大型语言模型在伊斯兰教问答中的幻觉和 abstention 行为。作者还开发了一套基于文本的建模工具,包括训练数据、偏好样本和可兰经检索语料库,并提出了一种基于代理的RAG框架,通过结构化工具调用实现迭代证据搜索和答案修订。实验表明,检索增强能提高准确性,而代理RAG相比标准RAG有显著提升,在小型模型上也实现了最先进的性能。

Details Motivation: 现有的多选题或机器阅读理解评估方式无法有效衡量大型语言模型在伊斯兰教问答中产生的自由形式幻觉和缺乏证据时未能适当 abstention 的问题,而这些错误可能带来严重的宗教后果,因此需要更贴近实际应用场景的评估基准和建模方法。 Method: 构建了一个双语生成式基准ISLAMICFAITHQA,具有原子级单一黄金答案;收集了25K阿拉伯语文本支持的SFT推理对和5K双语偏好样本用于奖励引导对齐;构建了一个约6k条经文级别的可兰经检索语料库;并设计了一种代理式RAG框架,通过结构化工具调用实现迭代检索与答案修正。 Result: 实验结果显示,检索增强能提升模型正确性,而代理RAG框架相较标准RAG带来最大性能增益,在阿拉伯语和英语上均实现最先进表现,即使使用较小的模型(如Qwen3 4B)也能取得良好效果,且具备更强的跨语言鲁棒性。 Conclusion: 本文提出的ISLAMICFAITHQA基准和代理式RAG框架有效提升了大型语言模型在伊斯兰教问答任务中的准确性和可靠性,强调了接地生成和适当 abstention 的重要性,为宗教敏感领域的可信AI应用提供了可行路径和技术资源。 Abstract: LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed a light on this aspect we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally developed an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur'an retrieval corpus of $\sim$6k atomic verses (ayat). Building on these resources, we develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.

[127] A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models

Jiaqi Qiao,Xiujuan Xu,Xinran Li,Yu Liu

Main category: cs.CL

TL;DR: 本文提出了一种名为EGMF的统一框架,结合专家引导的多模态融合与大语言模型,用于离散情感识别和连续情感分析。该方法通过三个专业网络和分层动态门控机制实现上下文感知的特征选择,并利用伪标记注入和基于提示的条件化将增强的多模态表示与大语言模型集成,支持单个生成框架处理分类和回归任务。实验结果表明,在双语基准上性能优于现有方法,展现出跨语言鲁棒性。

Details Motivation: 为了有效整合文本、音频和视觉模态以进行离散情感识别和连续情感分析,需要一种能够捕捉细微情感差异、跨模态关系及长距离依赖的统一框架。 Method: 提出了EGMF框架,包含三个专门的专家网络:细粒度局部专家用于捕捉细微情感变化,语义相关性专家用于建模跨模态关系,全局上下文专家用于捕获长距离依赖;通过分层动态门控自适应集成;使用伪标记注入和基于提示的条件化将多模态表示与大语言模型结合,并采用LoRA微调提高计算效率。 Result: 在MELD、CHERMA、MOSEI和SIMS-V2等双语基准上的实验显示,EGMF在情感识别和情感分析任务上均优于现有最先进方法,尤其表现出优异的跨语言鲁棒性,揭示了中英文多模态情感表达中的普遍模式。 Conclusion: EGMF是一种有效的统一多模态情感理解框架,能够通过专家引导的融合策略和大语言模型的生成能力,同时处理分类与回归任务,并在多种语言环境下保持高性能,具有广泛的应用潜力。 Abstract: Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies--adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.

[128] ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents

Huhai Zou,Tianhao Sun,Chuanjiang He,Yu Tian,Zhenyang Li,Li Jin,Nayu Liu,Jiang Zhong,Kaiwen Wei

Main category: cs.CL

TL;DR: 提出ES-Mem框架,基于事件分割理论解决对话代理中记忆粒度僵化和检索平面化的问题,通过动态事件分割和分层记忆结构提升语义连贯性与上下文定位精度。

Details Motivation: 现有记忆机制存在记忆粒度僵化和检索仅依赖表层语义相似性的问题,难以保持对话的连贯性和精准定位 episodic 上下文。 Method: 受事件分割理论启发,设计ES-Mem框架,包含动态事件分割模块以划分语义连贯的事件边界,以及分层记忆架构利用边界语义锚定特定记忆以实现精确上下文定位。 Result: 在两个记忆基准测试上,ES-Mem均优于基线方法;其事件分割模块在对话分割数据集上也表现出良好的泛化能力。 Conclusion: ES-Mem通过动态事件划分与分层记忆结构,有效提升了对话代理在长期交互中的记忆组织与检索能力。 Abstract: Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limitations: (1) rigid memory granularity often disrupts semantic integrity, resulting in fragmented and incoherent memory units; (2) prevalent flat retrieval paradigms rely solely on surface-level semantic similarity, neglecting the structural cues of discourse required to navigate and locate specific episodic contexts. To mitigate these limitations, drawing inspiration from Event Segmentation Theory, we propose ES-Mem, a framework incorporating two core components: (1) a dynamic event segmentation module that partitions long-term interactions into semantically coherent events with distinct boundaries; (2) a hierarchical memory architecture that constructs multi-layered memories and leverages boundary semantics to anchor specific episodic memory for precise context localization. Evaluations on two memory benchmarks demonstrate that ES-Mem yields consistent performance gains over baseline methods. Furthermore, the proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.

[129] Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Bingyang Ye,Shan Chen,Jingxuan Tu,Chen Liu,Zidi Xiong,Samuel Schmidgall,Danielle S. Bitterman

Main category: cs.CL

TL;DR: PoT是一个半可验证的基准框架,用于评估大语言模型对科研想法判断的质量,通过将科学想法判断与后续可观测信号(如引用)关联,在无需大量专家标注的情况下实现可扩展的评估。

Details Motivation: 缺乏可扩展的方法来评估大语言模型在科研想法判断上的质量,需要一个能与未来真实结果对照的评估框架。 Method: 提出PoT框架,利用时间分割的预截止快照数据,在离线沙箱中让模型预测截止后的结果(如引用量、研究议程变化),并通过未来实际信号进行验证;同时比较工具使用型智能体与非智能体基线在不同提示和预算下的表现。 Result: 在30,000多个实例和四个领域中发现,相较于非智能体基线,更高的交互预算通常提升智能体性能,而工具使用的收益则高度依赖任务类型。 Conclusion: PoT通过结合时间分割和离线沙箱支持对面向未来的科学想法判断任务进行可扩展的代理评估,并揭示了交互预算和工具使用对性能的不同影响。 Abstract: Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.

[130] Integrating Machine-Generated Short Descriptions into the Wikipedia Android App: A Pilot Deployment of Descartes

Marija Šakota,Dmitry Brant,Cooltey Feng,Shay Nowick,Amal Ramadan,Robin Schoenbaechler,Joseph Seddon,Jazmin Tanner,Isaac Johnson,Robert West

Main category: cs.CL

TL;DR: Descartes是一个多语言短描述生成模型,已在维基百科Android应用中进行试点部署,结果显示其生成的描述质量接近人工撰写,有助于编辑减少内容差距。

Details Motivation: 解决维基百科不同语言和主题之间短描述覆盖不均的问题。 Method: 在维基百科Android应用中试点部署Descartes模型,向编辑提供由模型生成的短描述建议,并收集12种语言下超过3900篇文章和375名编辑的反馈数据。 Result: 90%被接受的Descartes生成描述质量评分达到至少3/5分,平均质量与人工撰写相当;编辑既直接采用也修改使用建议,回退和报告率低;试点还揭示了延迟、语种特定差距及敏感话题防护等实际问题。 Conclusion: Descartes生成的短描述可有效支持编辑填补内容空白,但需配备适当的技术、设计和社区保护措施。 Abstract: Short descriptions are a key part of the Wikipedia user experience, but their coverage remains uneven across languages and topics. In previous work, we introduced Descartes, a multilingual model for generating short descriptions. In this report, we present the results of a pilot deployment of Descartes in the Wikipedia Android app, where editors were offered suggestions based on outputs from Descartes while editing short descriptions. The experiment spanned 12 languages, with over 3,900 articles and 375 editors participating. Overall, 90% of accepted Descartes descriptions were rated at least 3 out of 5 in quality, and their average ratings were comparable to human-written ones. Editors adopted machine suggestions both directly and with modifications, while the rate of reverts and reports remained low. The pilot also revealed practical considerations for deployment, including latency, language-specific gaps, and the need for safeguards around sensitive topics. These results indicate that Descartes's short descriptions can support editors in reducing content gaps, provided that technical, design, and community guardrails are in place.

[131] PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Zijing Wang,Yongkang Liu,Mingyang Wang,Ercong Nie,Deyuan Chen,Zhengjie Zhao,Shi Feng,Daling Wang,Xiaocui Yang,Yifei Zhang,Hinrich Schütze

Main category: cs.CL

TL;DR: 提出一种无需训练的框架,通过层向视觉令牌掩码和平台引导的模型融合方法,缓解多模态大语言模型在指令微调过程中文本推理能力下降的问题。

Details Motivation: 多模态指令微调会意外削弱大语言模型的文本推理能力,从而影响整体多模态性能,本文旨在解决这一退化问题。 Method: 通过层向视觉令牌掩码分析MLLM的三阶段模式(早期模态分离、中期模态对齐、晚期模态退化),提出平台引导的模型融合方法,选择性地注入基础语言模型参数以保留语言推理能力。 Result: 在五个MLLM上、九个基准上的实验表明该方法有效;注意力分析显示融合后注意力更聚焦于任务相关的视觉区域。 Conclusion: 所提出的训练免费框架能有效缓解多模态指令微调中的语言推理退化问题,提升模型的多模态理解与推理能力。 Abstract: Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text's reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in multimodal large language models: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is on https://github.com/wzj1718/PlaM.

Jing Yang,Nils Feldhus,Salar Mohtaj,Leonhard Hennig,Qianli Wang,Eleni Metheniti,Sherzod Hakimov,Charlott Jakob,Veronika Solopova,Konrad Rieck,David Schlangen,Sebastian Möller,Vera Schmitt

Main category: cs.CL

TL;DR: 本文系统回顾了过去六年14,171篇NLG论文的评估方法演变,揭示了不同任务在使用自动指标、LLM-as-a-judge和人类评估上的显著差异,并指出当前评估实践中缺乏实证支持与验证的问题。

Details Motivation: 尽管自然语言生成(NLG)取得进展,其评估仍具挑战性;人类判断虽为金标准,但新指标和LLM-as-a-judge方法不断涌现,亟需系统梳理评估实践的现状与问题。 Method: 采用自动信息抽取方法,从ACL、EMNLP、NAACL和INLG四大会议近六年的论文中提取元数据,分析不同NLG任务中评估方法的使用趋势,包括指标类型、LLM-as-a-judge和人工评估的应用情况。 Result: 发现三大问题:(1) 任务间评估方法分化明显(如对话生成转向LLM-as-a-judge,机器翻译依赖n-gram指标,问答系统减少人工评估);(2) 指标惯性强,通用指标如BLEU、ROUGE被广泛使用但缺乏实证依据;(3) LLM-as-a-judge与人类评估关注信号不同,二者相关性低,且仅少数论文进行显式对比验证。 Conclusion: 呼吁提升NLG评估的严谨性,建议根据具体任务选择合适评估方法,并加强LLM-as-a-judge与人类判断的一致性验证,推动更具判别力的评估实践。 Abstract: Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods are proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ and human evaluation). With extracted metadata from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers comparing the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.

[133] Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Rei Taniguchi,Yuyang Dong,Makoto Onizuka,Chuan Xiao

Main category: cs.CL

TL;DR: 本文提出了一种名为ASL的训练-free方法,用于大语言模型推理中的KV缓存减少。ASL自适应地选择在prefilling阶段进行token选择的层,通过利用注意力分数排序的token秩的方差,在不同任务中平衡性能,并满足用户指定的KV预算要求。实验表明,ASL在保持解码速度和KV缓存减少的同时,在准确性上优于现有的层间token选择方法。

Details Motivation: 现有基于预定义层的层间token剪枝方法在不同任务上的准确率波动较大,尤其在较难任务(如KV检索)上表现不佳,缺乏灵活性。 Method: 提出ASL方法,利用注意力分数排序后token秩的方差,在prefilling阶段动态决定最优的token选择层,支持一次性选择并传播到深层,并可与SnapKV等现有方法结合使用。 Result: 在InfiniteBench、RULER和NIAH基准测试中,ASL在相同KV缓存压缩比下,准确率优于当前最先进的层间token选择方法,同时保持了解码速度。 Conclusion: ASL通过自适应选择token保留层,提升了KV缓存压缩方法在多样化任务下的鲁棒性和准确性,是一种高效、灵活且即插即用的训练-free优化方案。 Abstract: Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.

[134] Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task

Nick Ferguson,Alan Bundy,Kwabena Nuamah

Main category: cs.CL

TL;DR: 本文提出了一种区分元级推理和对象级推理的新方法,并设计了一个基于地缘政治指标的问答任务来评估大语言模型(LLM)的推理能力,发现LLM在元级推理上表现良好但存在任务理解缺陷,且数值计算能力较差。

Details Motivation: 当前对大语言模型“推理”能力的关注缺乏清晰定义,作者希望通过区分不同层次的推理过程,更系统地评估模型的推理表现。 Method: 设计一个需要分解步骤、数据检索和数学运算的新型问答任务,通过分析模型选择工具的行为来评估其元级推理能力,并引入‘关键动作’作为评价标准。 Result: 实验发现LLM在元级推理方面表现良好但存在任务理解问题;n-shot提示对准确率影响小;错误信息通常不损害性能;进一步验证了LLM数值计算能力差的问题。 Conclusion: 该研究为评估LLM的推理能力提供了更细粒度的方法,揭示了模型在推理过程中的优势与局限,结果可部分推广到其他任务领域,但也存在适用范围限制。 Abstract: Recent advancements in Large Language Models (LLMs) are increasingly focused on "reasoning" ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (denoting the process of reasoning about intermediate steps required to solve a task) from object-level reasoning (which concerns the low-level execution of the aforementioned steps.) We design a novel question answering task, which is based around the values of geopolitical indicators for various countries over various years. Questions require breaking down into intermediate steps, retrieval of data, and mathematical operations over that data. The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task contains 'essential actions' against which we can compare the tool call output of LLMs to infer the strength of reasoning ability. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy; error messages encountered do not often deteriorate performance; and provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation and limitation of our findings to other task domains.

[135] Emotional Support Evaluation Framework via Controllable and Diverse Seeker Simulator

Chaewon Heo,Cheyon Jin,Yohan Jo

Main category: cs.CL

TL;DR: 提出了一种基于心理和语言特征的可控求助者模拟器,用于更真实地评估情感支持聊天机器人。

Details Motivation: 现有求助者模拟器无法捕捉真实用户的行为多样性,且缺乏对特定用户画像的可控性。 Method: 基于Reddit真实对话,采用混合专家(MoE)架构训练包含九种心理和语言特征的模拟器模型。 Result: 该模拟器在行为多样性和画像匹配度上优于现有方法,并揭示了主流支持模型 previously obscured 的性能下降问题。 Conclusion: 所提出的框架能更真实、严格地评估情感支持聊天机器人,提升评测的可靠性。 Abstract: As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.

[136] Is Agentic RAG worth it? An experimental comparison of RAG approaches

Pietro Ferrazzi,Milica Cvjeticanin,Alessio Piraccini,Davide Giannuzzi

Main category: cs.CL

TL;DR: 本文比较了增强型RAG与基于大语言模型自反能力的“代理”RAG在不同场景下的性能与成本权衡,通过实证研究为实际应用中的RAG系统设计提供选择指导。

Details Motivation: 基本RAG系统存在检索噪声、查询范围外处理不当、查询-文档匹配弱以及生成器成本高等问题,促使研究者提出增强型RAG和新兴的“代理”RAG,但二者适用条件尚不明确。 Method: 对增强型RAG和代理型RAG在多个场景和维度上进行广泛的实证驱动评估,分析其在性能、成本和适用性方面的差异。 Result: 揭示了两种RAG范式之间的实际权衡,发现代理RAG在灵活性和自适应性上更优,但成本较高;增强型RAG在特定任务中性能稳定且成本较低。 Conclusion: 选择RAG架构应根据具体应用场景权衡性能需求与成本约束,代理RAG适合复杂动态任务,而增强型RAG更适合资源受限或任务明确的场景。 Abstract: Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic implementations exhibit several limitations, including noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query-document matching, and variability or cost associated with the generator. These shortcomings have motivated the development of "Enhanced" RAG, where dedicated modules are introduced to address specific weaknesses in the workflow. More recently, the growing self-reflective capabilities of Large Language Models (LLMs) have enabled a new paradigm, which we refer to as "Agentic" RAG. In this approach, the LLM orchestrates the entire process-deciding which actions to perform, when to perform them, and whether to iterate-thereby reducing reliance on fixed, manually engineered modules. Despite the rapid adoption of both paradigms, it remains unclear which approach is preferable under which conditions. In this work, we conduct an extensive, empirically driven evaluation of Enhanced and Agentic RAG across multiple scenarios and dimensions. Our results provide practical insights into the trade-offs between the two paradigms, offering guidance on selecting the most effective RAG design for real-world applications, considering both costs and performance.

[137] Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents

Aryan Mishra,Akash Anil

Main category: cs.CL

TL;DR: 本文提出了一种结合知识图谱(KG)与大语言模型(LLM)的框架,用于提升金融文档中的数值推理能力。通过从文档中自动提取结构化信息,该方法在FinQA数据集上使执行准确率相对提升了约12%。

Details Motivation: 大语言模型在处理金融文本时难以准确解析和计算其中的数字,尤其是在面对非结构化和半结构化数据时存在瓶颈。因此需要引入结构化信息来增强其数值推理能力。 Method: 提出一个利用知识图谱(KG)增强LLM数值推理的框架,KG通过文档内在模式自动提取,并与LLM(Llama 3.1 8B Instruct)结合,在FinQA数据集上进行评估。 Result: 在FinQA数据集上的实验表明,该框架相比基础LLM将执行准确率提高了约12%。 Conclusion: 结合从金融文档中提取的知识图谱能有效提升大语言模型在数值推理任务中的表现,验证了利用结构化信息增强LLM的重要性。 Abstract: Numerical reasoning is an important task in the analysis of financial documents. It helps in understanding and performing numerical predictions with logical conclusions for the given query seeking answers from financial texts. Recently, Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems with the capability of logical reasoning. As documents related to finance often consist of long and complex financial contexts, LLMs appear well-suited for building high-quality automated financial question-answering systems. However, LLMs often face challenges in accurately processing the various numbers within financial reports. Extracting numerical data from unstructured text and semi-structured tables, and reliably performing accurate calculations, remains a significant bottleneck for numerical reasoning in most state-of-the-art LLMs. Recent studies have shown that structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs along with logical explanations. Thus, it is an important requirement to consider inherent structured information in financial reports while using LLMs for various financial analytics. This paper proposes a framework to incorporate structured information using KGs along with LLM predictions for numerical reasoning tasks. The KGs are extracted using a proposed schema inherently from the document under processing. We evaluated our proposed framework over the benchmark data FinQA, using an open-source LLM, namely Llama 3.1 8B Instruct. We observed that the proposed framework improved execution accuracy by approximately 12% relative to the vanilla LLM.

[138] Contrastive Learning with Narrative Twins for Modeling Story Salience

Igor Sterner,Alex Lascarides,Frank Keller

Main category: cs.CL

TL;DR: 提出了一种基于对比学习的叙事显著性建模框架,利用“叙事双生”故事进行训练,并在多种操作下评估句子显著性。

Details Motivation: 需要识别对故事情节发展最关键的事件,以更好理解叙事结构。 Method: 设计对比学习框架,使用具有相同情节但表层不同的叙事双生故事,训练模型区分原故事、其双生版本和不同情节的干扰项;通过故事嵌入评估四种推断显著性的操作(删除、移位、干扰、摘要)。 Result: 在ROCStories和Wikipedia剧情摘要上的实验表明,对比学习得到的故事嵌入优于掩码语言模型基线,其中摘要操作最能可靠识别显著句子;若无现成双生故事,可用随机dropout生成,干扰项可通过提示大模型或长文本的不同段落获得。 Conclusion: 对比学习结合叙事双生体能有效建模叙事显著性,摘要是最稳定的显著性判断操作,且方法在数据受限时仍具可行性。 Abstract: Understanding narratives requires identifying which events are most salient for a story's progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.

[139] Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection

Mariana Costa,Alberlucia Rafael Soarez,Daniel Kim,Camila Ferreira

Main category: cs.CL

TL;DR: 提出了一种名为MyGO Poly-Reflective Chain-of-Thought (PR-CoT) 的新方法,通过多视角自省(如逻辑一致性、信息完整性、偏见/伦理和替代方案)提升大语言模型的推理能力,在无需模型重训练的情况下显著优于传统CoT和现有反思方法。

Details Motivation: 现有单一维度的反思方法在提升大语言模型推理的一致性、准确性和自我纠正能力方面效果有限,尤其在复杂或伦理敏感任务中表现不足。 Method: 在初始思维链(CoT)之后,引导LLM从多个预定义角度(逻辑一致性、信息完整性、偏见/伦理、替代解决方案)进行自我评估,完全通过提示工程实现多视角反思过程。 Result: 在算术、常识推理、伦理决策和逻辑谜题等多个任务上实验表明,PR-CoT显著优于传统CoT和现有反思方法,尤其在伦理决策等细微领域表现突出;消融研究和人工评估验证了各反思视角的有效性。 Conclusion: PR-CoT通过结构化的多视角反思范式,有效提升了大语言模型推理的可靠性与准确性,是一种无需微调、仅靠提示即可增强模型自我修正能力的新路径。 Abstract: While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection methods offer insufficient improvements. We propose MyGO Poly-Reflective Chain-of-Thought (PR-CoT), a novel methodology employing structured multi-perspective reflection. After initial CoT, PR-CoT guides the LLM to self-assess its reasoning across multiple predefined angles: logical consistency, information completeness, biases/ethics, and alternative solutions. Implemented purely via prompt engineering, this process refines the initial CoT into a more robust and accurate final answer without model retraining. Experiments across arithmetic, commonsense, ethical decision-making, and logical puzzles, using GPT-three point five and GPT-four models, demonstrate PR-CoT's superior performance. It significantly outperforms traditional CoT and existing reflection methods in logical consistency and error correction, with notable gains in nuanced domains like ethical decision-making. Ablation studies, human evaluations, and qualitative analyses further validate the contribution of each reflection perspective and the overall efficacy of our poly-reflective paradigm in fostering more reliable LLM reasoning.

[140] Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning

Wei Fang,James Glass

Main category: cs.CL

TL;DR: 本文提出了一种名为TOOLQP的轻量级框架,通过迭代查询规划来改进大规模动态工具库中的检索性能,显著提升了LLM代理在复杂请求下的零样本泛化能力和执行效果。

Details Motivation: 标准的单次稠密检索器在处理复杂请求时表现不佳,主要由于用户目标与技术文档之间存在语义鸿沟,且固定大小的嵌入难以建模组合性工具调用。 Method: TOOLQP将检索建模为迭代查询规划过程,将指令分解为子任务,并动态生成查询与检索器交互;使用合成查询轨迹进行训练,并通过可验证奖励的强化学习(RLVR)优化。 Result: 实验表明TOOLQP在多种检索器下均表现出优异的零样本泛化能力、鲁棒性,并在下游代理执行中实现显著提升,达到当前最优性能。 Conclusion: TOOLQP有效弥合了用户意图与工具文档之间的语义差距,克服了传统嵌入在组合性任务中的局限,为LLM代理的工具检索提供了高效可行的解决方案。 Abstract: LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.

[141] Kinship Data Benchmark for Multi-hop Reasoning

Tianda Sun,Dimitar Kazakov

Main category: cs.CL

TL;DR: 本文提出了KinshipQA,一个用于评估大语言模型多跳推理能力的基准测试,通过生成大规模、文化特定的家谱数据来系统控制任务难度和文化假设,并揭示不同模型在多跳推理上的表现差异。

Details Motivation: 为了更准确地评估大语言模型在多跳推理方面的能力,尤其是在涉及亲属关系推理时的文化特异性和复杂性。 Method: 提出了一种生成式流水线,能够按需生成大规模、现实且文化特定的家谱数据,并从中构建需要隐式关系链推理的文本推断任务,使用六种最先进的大语言模型进行零样本评估。 Result: 实验结果表明,KinshipQA能够产生广泛的结果分布,并揭示出不同模型及文化设置下多跳推理的系统性差异。 Conclusion: KinshipQA是一个有效的基准测试工具,可用于系统研究大语言模型在亲属关系推理中的多跳推理能力及其受文化因素影响的程度。 Abstract: Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.

[142] Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues

Shaz Furniturewala,Gerard Christopher Yeo,Kokil Jaidka

Main category: cs.CL

TL;DR: 该研究探讨了大语言模型(LLM)作为学习伙伴时,如何通过交互特征影响用户在社会政治议题上的知识获取与信心变化,发现解释的丰富性通过促进反思和认知参与来提升信心与知识,但效果受用户政治效能感调节,强调人-AI系统设计需匹配用户的参与状态。

Details Motivation: 理解LLM作为学习伙伴时,其交互动态如何影响用户的学习与参与,尤其是在复杂的社会政治话题中,以揭示解释机制与学习成效之间的关系。 Method: 基于397个人类-LLM关于社会政治议题的对话数据,分析语言与互动特征,采用中介分析(检验解释丰富性通过反思洞察与认知参与对信心和知识的影响)和调节分析(考察政治效能感的调节作用)。 Result: LLM解释的丰富性通过促进用户的反思洞察部分提升信心,而对知识增益的影响完全通过认知参与实现;这些效应受政治效能感调节:高效能用户在处理不确定性时更易获得信心,且能通过延长互动获得更多知识,尤其是善于反思的用户。 Conclusion: 从LLM中学习是一种交互成就,而非优质解释的必然结果;有效的人-AI交互系统设计应使LLM的解释行为与用户的参与状态相协调。 Abstract: Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users' learning and engagement are understudied. We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users' reflective insight, whereas its effect on knowledge gain operates entirely through users' cognitive engagement. Moderation analyses show that these effects are highly conditional and vary by political efficacy. Confidence gains depend on how high-efficacy users experience and resolve uncertainty. Knowledge gains depend on high-efficacy users' ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations. The findings underscore the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in designing Human-AI interactive systems.

[143] The Confidence Trap: Gender Bias and Predictive Certainty in LLMs

Ahmed Sabir,Markus Kängsepp,Rajesh Sharma

Main category: cs.CL

TL;DR: 该研究探讨了大语言模型(LLM)的置信度校准与性别偏见之间的关系,提出了一种新的校准指标Gender-ECE,并发现Gemma-2在性别偏见基准上校准最差。

Details Motivation: 随着LLM在敏感领域的广泛应用,其公平性和偏见问题日益受到关注,需要评估模型置信度是否能反映潜在的偏见。 Method: 通过分析六个最先进的LLM在性别化代词解析任务中的概率置信度校准情况,并引入新的校准指标Gender-ECE,结合人工标注的偏见判断进行评估。 Result: Gemma-2模型在性别偏见校准方面表现最差;Gender-ECE能够有效衡量性别差异。 Conclusion: 基于置信度的校准指标可以揭示LLM中的公平性差距,建议在伦理部署中采用公平性感知的校准评估方法。 Abstract: The increased use of Large Language Models (LLMs) in sensitive domains leads to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate if calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models, Gemma-2 demonstrates the worst calibration according to the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs' confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.

[144] Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Manar Ali,Judith Sieker,Sina Zarrieß,Hendrik Buschmeier

Main category: cs.CL

TL;DR: 本文探讨了语言模型在对话中作为听者角色时识别和表达不确定性的能力,使用指代游戏作为测试平台,发现模型在简单任务中仍难以将内部不确定性转化为适当的澄清行为。

Details Motivation: 研究语言模型是否能像人类一样在交流中主动请求澄清以维持相互理解,特别是在面对不确定性时的表现。 Method: 通过对比三个视觉-语言模型在基线指代消解任务和被指示在不确定时请求澄清的实验任务中的表现,采用指代游戏作为受控且自包含的测试环境。 Result: 实验结果表明,即使在简单的任务中,模型也常常难以识别自身的不确定性并采取适当的澄清行为。 Conclusion: 指代游戏是评估(视觉和)语言模型交互质量的有效测试平台,当前模型在主动沟通不确定性方面仍有不足。 Abstract: In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

cs.CV [Back]

[145] HyperTopo-Adapters: Geometry- and Topology-Aware Segmentation of Leaf Lesions on Frozen Encoders

Chimdi Walter Ndubuisi,Toni Kazic

Main category: cs.CV

TL;DR: 本文提出HyperTopo-Adapters,一种轻量级、参数高效的新结构,通过在超球面+欧几里得+球面(H+E+S)乘积流形上嵌入特征,结合拓扑先验(如持续同调和可微替代损失),提升叶片病斑分割中的边界与拓扑结构准确性,在Kaggle数据集上显著降低洞误差9%,同时保持Dice/IoU竞争力,并提供开源可复现训练评估套件。

Details Motivation: 标准像素级损失在欧氏隐空间中对叶病变分割的拓扑错误(如小合并、分裂、假洞)惩罚不足,而这些错误可能具有生物学意义,因此需要引入更强的拓扑敏感先验来改善分割质量。 Method: 设计HyperTopo-Adapters作为冻结视觉编码器之上的轻量头部,将特征嵌入H+E+S乘积流形以分别建模层次结构、局部细节和全局闭合;引入基于持续同调(PH)的距离用于评估与选择,以及结合软欧拉特征匹配与总变差正则化的可微拓扑损失;采用分阶段warm-up策略、每样本结构感知指标评估,以及基于top-K Dice中最小PD距离的检查点选择规则。 Result: 在Kaggle叶病变数据集(N=2,940)上,模型显著降低Δβ₁洞误差达9%,Boundary-F1、Betti误差和PD距离等拓扑与边界指标一致提升,同时保持Dice/IoU不下降;消融实验验证了曲率学习、潜在维度、对比温度等组件的有效性,且在不同编码器(ResNet-50, DeepLabV3, DINOv2/v3)、分辨率和PH权重下表现稳健。 Conclusion: 通过引入几何与拓扑先验到分割头的设计中,可在不增加主干复杂度的前提下有效提升叶病变分割的拓扑保真度;该工作为构建更强的拓扑保持架构提供了可解释、可复现的诊断工具与开源框架。 Abstract: Leaf-lesion segmentation is topology-sensitive: small merges, splits, or false holes can be biologically meaningful descriptors of biochemical pathways, yet they are weakly penalized by standard pixel-wise losses in Euclidean latents. I explore HyperTopo-Adapters, a lightweight, parameter-efficient head trained on top of a frozen vision encoder, which embeds features on a product manifold -- hyperbolic + Euclidean + spherical (H + E + S) -- to encourage hierarchical separation (H), local linear detail (E), and global closure (S). A topology prior complements Dice/BCE in two forms: (i) persistent-homology (PH) distance for evaluation and selection, and (ii) a differentiable surrogate that combines a soft Euler-characteristic match with total variation regularization for stable training. I introduce warm-ups for both the hyperbolic contrastive term and the topology prior, per-sample evaluation of structure-aware metrics (Boundary-F1, Betti errors, PD distance), and a min-PD within top-K Dice rule for checkpoint selection. On a Kaggle leaf-lesion dataset (N=2,940), early results show consistent gains in boundary and topology metrics (reducing Delta beta_1 hole error by 9%) while Dice/IoU remain competitive. The study is diagnostic by design: I report controlled ablations (curvature learning, latent dimensions, contrastive temperature, surrogate settings), and ongoing tests varying encoder strength (ResNet-50, DeepLabV3, DINOv2/v3), input resolution, PH weight, and partial unfreezing of late blocks. The contribution is an open, reproducible train/eval suite (available at https://github.com/ChimdiWalter/HyperTopo-Adapters) that isolates geometric/topological priors and surfaces failure modes to guide stronger, topology-preserving architectures.

[146] OptFormer: Optical Flow-Guided Attention and Phase Space Reconstruction for SST Forecasting

Yin Wang,Chunlin Gong,Zhuozhen Xu,Lehan Zhang,Xiang Wu

Main category: cs.CV

TL;DR: 提出了一种名为OptFormer的新模型,结合相空间重构和基于光流的运动感知注意力机制,用于海表温度预测,在NOAA数据集上表现出优于现有方法的性能。

Details Motivation: 海表温度预测因具有非线性时空动态特性和长预测周期而具有挑战性。 Method: 提出OptFormer模型,采用编码器-解码器结构,结合相空间重构与基于光流引导的运动感知注意力机制,利用帧间运动线索捕捉空间场的相对变化。 Result: 在NOAA SST数据集上的实验表明,OptFormer在1:1训练-预测设置下显著优于现有基线方法,具备更高的准确性和鲁棒性。 Conclusion: OptFormer能有效捕捉海表温度的长期时空依赖关系,提升了复杂气候变量的预测能力。 Abstract: Sea Surface Temperature (SST) prediction plays a vital role in climate modeling and disaster forecasting. However, it remains challenging due to its nonlinear spatiotemporal dynamics and extended prediction horizons. To address this, we propose OptFormer, a novel encoder-decoder model that integrates phase-space reconstruction with a motion-aware attention mechanism guided by optical flow. Unlike conventional attention, our approach leverages inter-frame motion cues to highlight relative changes in the spatial field, allowing the model to focus on dynamic regions and capture long-range temporal dependencies more effectively. Experiments on NOAA SST datasets across multiple spatial scales demonstrate that OptFormer achieves superior performance under a 1:1 training-to-prediction setting, significantly outperforming existing baselines in accuracy and robustness.

[147] Semantic Event Graphs for Long-Form Video Question Answering

Aradhya Dixit,Tianxi Liang

Main category: cs.CV

TL;DR: 提出语义事件图(SEG)作为视频与语言之间的轻量级符号接口,用于高效长视频问答,显著降低token使用并保持高性能。

Details Motivation: 现代视觉-语言模型在处理小时级视频时面临计算和token预算的挑战,现有方法在时间覆盖与成本之间权衡,难以有效支持长视频问答。 Method: 通过YOLOv11检测和跟踪对象,将接近模式转化为人-物交互事件,构建时间场景图(TSG);在推理时使用查询感知剪枝模块提取相关子图,将其文本化后输入Gemini 2.5 Flash生成答案。 Result: 在五个YouTube视频和120个自动生成的问题上,SEG以仅3.47k token/查询实现65.0%准确率,相比40.39k token的全日志基线节省91.4% token,而仅依赖最后30秒的短上下文基线准确率仅为2.5%。 Conclusion: 符号化的时间图可作为现成视觉-语言模型的有效即插即用记忆层,在保持长程推理能力的同时大幅提升长视频问答的token和成本效率。 Abstract: Long-form video question answering remains challenging for modern vision-language models, which struggle to reason over hour-scale footage without exceeding practical token and compute budgets. Existing systems typically downsample frames or feed dense visual embeddings to large-context language models, trading off temporal coverage against cost. We propose Semantic Event Graphs (SEG), a lightweight symbolic interface between video and language that replaces raw frames with compact temporal interaction logs. Our pipeline detects and tracks objects with YOLOv11, converts proximity patterns into START/END human-object events, and organizes them into a Temporal Scene Graph (TSG). At inference time, a query-aware pruning module identifies anchor entities and lexically relevant events, returning only a small subgraph which is verbalized and passed to Gemini 2.5 Flash for answer generation. On five YouTube videos (300-500 interactions each) and 120 automatically generated long-horizon questions, SEG achieves 65.0% accuracy using only 3.47k tokens per query, closely matching a full-log baseline (62.5% at 40.39k tokens) while reducing token usage by 91.4%. A short-context baseline restricted to the last 30 seconds collapses to 2.5% accuracy, underscoring the need for explicit temporal memory. These results show that symbolic temporal graphs can serve as an effective, plug-and-play memory layer for off-the-shelf vision-language models, preserving long-range reasoning ability while making long-form video question answering substantially more token- and cost-efficient. Code, logs, and event-extraction tools will be released for reproducibility.

[148] COVR:Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

Canming Xia,Peixi Peng,Guang Tan,Zhan Su,Haoran Xu,Zhenxian Liu,Luntong Li

Main category: cs.CV

TL;DR: 本文提出了一种名为COVR的协同优化框架,用于在视觉强化学习中实现视觉-语言模型(VLM)与强化学习(RL)策略的相互增强,通过利用RL生成的数据来改进VLM,并用增强后的VLM指导策略学习,从而提高样本效率和性能。

Details Motivation: 视觉强化学习由于高维观测在复杂任务中存在样本效率低下的问题。现有工作主要关注从VLM向RL的知识蒸馏,忽视了RL交互数据反过来提升VLM的潜力。因此,需要一种双向协同的学习机制以充分利用两者的优势。 Method: 提出COVR框架:1)利用RL生成的交互数据对VLM进行微调,提升其与目标任务一致的语义推理能力;2)使用增强后的VLM生成动作先验来指导RL策略学习;3)设计探索驱动的动态过滤模块,基于探索程度自适应保留有价值的探索样本;4)引入回报感知的自适应损失权重模块,利用RL的回报信号量化采样动作不一致性以提升训练稳定性;5)采用渐进式微调策略降低资源消耗。 Result: 大量实验表明,COVR在多个具有挑战性的视觉控制任务上实现了优于现有方法的性能,显著提升了样本效率和最终表现。 Conclusion: COVR实现了VLM与RL策略之间的有效协同优化,不仅提升了RL的样本效率,也增强了VLM在特定任务上的语义理解能力,验证了双向学习框架在视觉强化学习中的有效性。 Abstract: Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables the mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to enhance the semantic reasoning ability consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves the stability of training by quantifying the inconsistency of sampling actions via return signals of RL. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.

[149] Low-Back Pain Physical Rehabilitation by Movement Analysis in Clinical Trial

Sao Mai Nguyen

Main category: cs.CV

TL;DR: 本文介绍了Keraal数据集,一个在临床环境中收集的用于智能辅导系统进行康复训练的数据集,旨在解决运动评估、错误识别、空间定位和时间定位四个挑战。

Details Motivation: 为了支持智能辅导系统在物理康复中的开发与评估,需要真实临床环境下患者进行康复锻炼的数据。 Method: 提出并构建了Keraal数据集,包含临床患者执行低背痛康复运动的数据,并使用当前先进的人体动作分析算法进行基准测试。 Result: 该数据集能够有效支持康复过程中的运动监测任务,包括运动质量评估、错误检测以及时空层面的动作定位。 Conclusion: Keraal数据集为智能康复系统提供了重要的数据基础,有助于推动个性化康复训练的发展。 Abstract: To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we propose a medical dataset of clinical patients carrying out low back-pain rehabilitation exercises and benchmark on state of the art human movement analysis algorithms. This dataset is valuable because it includes rehabilitation motions in a clinical setting with patients in their rehabilitation program. This paper introduces the Keraal dataset, a clinically collected dataset to enable intelligent tutoring systems (ITS) for rehabilitation. It addresses four challenges in exercise monitoring: motion assessment, error recognition, spatial localization, temporal localization

[150] Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Kaiyuan Deng,Bo Hui,Gen Li,Jie Ji,Minghai Qin,Geng Yuan,Xiaolong Ma

Main category: cs.CV

TL;DR: 本文提出了一种名为Forget It All (FIA)的无训练多概念遗忘框架,通过利用模型稀疏性实现对预训练文本到图像扩散模型中多个敏感概念的有效移除,同时保持生成质量。

Details Motivation: 由于现有方法在多概念遗忘场景下存在效果差、生成质量下降和超参数敏感等问题,本文旨在提出一种更可靠、易于应用的多概念机器遗忘方法。 Method: FIA引入对比概念显著性来量化权重连接对目标概念的贡献,结合时空信息识别概念敏感神经元,并构建统一的多概念掩码,保留通用生成相关的无关神经元,剪枝特定神经元以实现概念遗忘。该方法无需再训练且仅需极少超参数调整。 Result: 在三个不同的遗忘任务上实验表明,FIA在遗忘有效性、语义保真度和图像质量方面均优于现有方法,且具有良好的跨数据集鲁棒性和可扩展性。 Conclusion: FIA为文本到图像模型中的多概念遗忘提供了一个高效、即插即用的解决方案,推动了机器遗忘技术在实际应用中的落地。 Abstract: The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery learned from massive training corpora. As a practical solution, machine unlearning aims to selectively erase unwanted concepts from a pre-trained model without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle in real-world scenarios that require removing multiple concepts, since extending them to this setting is both non-trivial and problematic, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. In this paper, we take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept-Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept-Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires only minimal hyperparameter tuning for new tasks, thereby promoting a plug-and-play paradigm. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining semantic fidelity and image quality.

[151] What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Dasol Choi,Guijin Son,Hanwool Lee,Minhyuk Kim,Hyunwoo Ko,Teabin Lim,Ahn Eungyeol,Jungwhan Kim,Seunghyeok Hong,Youngsook Song

Main category: cs.CV

TL;DR: HAERAE-Vision 是一个基于真实韩国在线社区的视觉问答新基准,揭示了当前视觉语言模型在处理非正式、信息不完整的自然查询时表现不佳,主要挑战源于用户提问的隐含性而非模型能力本身。

Details Motivation: 现有视觉-语言基准多使用结构良好、明确的问题,无法反映真实场景中用户常以模糊、省略方式提问的现象,导致模型评估与实际应用之间存在差距。 Method: 构建 HAERAE-Vision 基准,包含从 86K 候选者中筛选出的 653 个真实视觉问题,并为每个问题提供显式重写版本,共形成 1,306 个查询变体;评估 39 个视觉语言模型的表现,并分析显式化和网络检索对性能的影响。 Result: 最先进的模型(如 GPT-5、Gemini 2.5 Pro)在原始问题上准确率低于 50%;仅通过将问题显式化即可提升 8 到 22 个百分点,小模型受益更大;即使启用网络搜索,未明确的问题仍不如显式问题表现好。 Conclusion: 当前视觉语言模型的主要瓶颈之一是处理用户自然表达中的信息缺失,而非理解图像或语言本身;应重视查询显式化和上下文补全能力以缩小评测与现实应用之间的鸿沟。 Abstract: Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stem from natural query under-specification instead of model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.

[152] B-FIRE: Binning-Free Diffusion Implicit Neural Representation for Hyper-Accelerated Motion-Resolved MRI

Di Xu,Hengjie Liu,Yang Yang,Mary Feng,Jin Ning,Xin Miao,Jessica E. Scholey,Alexandra E. Hotca-cho,William C. Chen,Michael Ohliger,Martina Descovich,Huiming Dong,Wensha Yang,Ke Sheng

Main category: cs.CV

TL;DR: 提出B-FIRE,一种无需分箱的扩散隐式神经表示框架,用于超加速动态4D MRI重建,能更准确地反映腹部瞬时三维解剖结构。

Details Motivation: 现有4DMRI方法在平均呼吸相位下会产生模糊和失真的动态信息,难以捕捉瞬时运动;需新方法以重建高度欠采样的非笛卡尔k空间数据。 Method: 提出B-FIRE框架,采用CNN-INR编码器-解码器结构,结合扩散优化与综合损失函数,强制图像域保真和频域感知约束;训练时使用分箱图像对,推理时直接处理无分箱的欠采样数据。 Result: 在T1加权StarVIBE肝脏MRI数据集上验证,加速度从RV8到RV1;相比NuFFT、GRASP-CS和展开CNN方法,B-FIRE在重建保真度、运动轨迹一致性和推理延迟方面表现更优。 Conclusion: B-FIRE实现了高质量、高加速的动态4DMRI重建,无需呼吸相位分箱,有效恢复瞬时解剖结构,具有临床应用潜力。 Abstract: Accelerated dynamic volumetric magnetic resonance imaging (4DMRI) is essential for applications relying on motion resolution. Existing 4DMRI produces acceptable artifacts of averaged breathing phases, which can blur and misrepresent instantaneous dynamic information. Recovery of such information requires a new paradigm to reconstruct extremely undersampled non-Cartesian k-space data. We propose B-FIRE, a binning-free diffusion implicit neural representation framework for hyper-accelerated MR reconstruction capable of reflecting instantaneous 3D abdominal anatomy. B-FIRE employs a CNN-INR encoder-decoder backbone optimized using diffusion with a comprehensive loss that enforces image-domain fidelity and frequency-aware constraints. Motion binned image pairs were used as training references, while inference was performed on binning-free undersampled data. Experiments were conducted on a T1-weighted StarVIBE liver MRI cohort, with accelerations ranging from 8 spokes per frame (RV8) to RV1. B-FIRE was compared against direct NuFFT, GRASP-CS, and an unrolled CNN method. Reconstruction fidelity, motion trajectory consistency, and inference latency were evaluated.

[153] Analyzing the Structure of Handwritten Digits: A Comparative Study of PCA, Factor Analysis, and UMAP

Jyotiraditya Gupta

Main category: cs.CV

TL;DR: 本文通过PCA、FA和UMAP三种降维技术分析MNIST手写数字数据集的潜在结构,揭示其内在维度、共享变异和非线性几何特性。

Details Motivation: 探索手写数字图像在高维像素空间中所表现出的强几何与统计结构,理解其低维流形组织方式。 Method: 采用主成分分析(PCA)、因子分析(FA)和统一流形逼近与投影(UMAP)三种互补的降维方法,分别研究全局方差方向、可解释的笔画基元以及非线性流形结构。 Result: PCA能用少量主成分实现高保真重建;FA分解出对应笔画、环路和对称性的可解释潜在特征;UMAP揭示了数字类别间的平滑风格过渡。 Conclusion: 手写数字数据位于一个结构化的低维流形上,不同统计框架揭示了该结构的互补方面。 Abstract: Handwritten digit images lie in a high-dimensional pixel space but exhibit strong geometric and statistical structure. This paper investigates the latent organization of handwritten digits in the MNIST dataset using three complementary dimensionality reduction techniques: Principal Component Analysis (PCA), Factor Analysis (FA), and Uniform Manifold Approximation and Projection (UMAP). Rather than focusing on classification accuracy, we study how each method characterizes intrinsic dimensionality, shared variation, and nonlinear geometry. PCA reveals dominant global variance directions and enables high-fidelity reconstructions using a small number of components. FA decomposes digits into interpretable latent handwriting primitives corresponding to strokes, loops, and symmetry. UMAP uncovers nonlinear manifolds that reflect smooth stylistic transitions between digit classes. Together, these results demonstrate that handwritten digits occupy a structured low-dimensional manifold and that different statistical frameworks expose complementary aspects of this structure.

[154] Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding

Zhiyong Ma,Zhenpeng Li,Yuanjie Shi,Zhengping Li,Jiahao Chen,Qingyuan Chuai

Main category: cs.CV

TL;DR: 提出了一种无需训练的文本到图像上下文学习框架TBDN,通过Hint Instruction和Query Contrastive Decoding机制有效缓解了合规性失败和先验主导的幻觉问题,实现了先进性能和强泛化能力。

Details Motivation: 现有文本到图像上下文学习方法存在合规性失败与先验主导幻觉的恶性循环,且依赖定制化训练,缺乏灵活性并增加部署成本。 Method: 提出TBDN框架,包含Hint Instruction(HI)和Query Contrastive Decoding(QCD)两种闭环机制:HI通过轻量级提示工程注入任务感知归纳偏置,增强模型对上下文映射规则的遵循;QCD通过对比完整输入与省略查询的解码分布,调整语言模型输出,抑制幻觉生成。 Result: TBDN在CoBSAT和Text-to-Image Fast Mini-ImageNet上达到SOTA性能,具有跨模型结构、提示设计和超参数的良好泛化性,并在Dreambench++上表现出良好的概念保持和提示遵循能力。 Conclusion: TBDN通过两个互补机制打破了文本到图像上下文学习中的关键瓶颈,提供了一个简单、高效且可靠的无需训练解决方案。 Abstract: Text-to-Image In-Context Learning (T2I-ICL) enables customized image synthesis via interleaved text-image examples but faces two mutually reinforcing bottlenecks, compliance failure and prior-dominated hallucination, that form a vicious cycle degrading generation quality. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. To address these challenges effectively, we propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms: Hint Instruction (HI) and Query Contrastive Decoding (QCD). HI injects task-aware inductive bias via lightweight prompt engineering to anchor models on contextual mapping rules, thereby mitigating compliance failure. QCD adjusts the decoding distributions of language models by contrasting full-input and query-omitted distributions, suppressing prior-dominated hallucination. TBDN achieves State-of-the-Art performance on CoBSAT and Text-to-Image Fast Mini-ImageNet, with robust generalization across model backbones, prompt designs, and hyperparameters. It also maintains promising performance in concept preservation and prompt following on Dreambench++. By breaking the two bottlenecks, TBDN establishes a simple yet effective framework for efficient and reliable T2I-ICL.

[155] TIR-Flow: Active Video Search and Reasoning with Frozen VLMs

Hongbo Jin,Siyi Xie,Jiayu Ding,Kuanwei Lin,Ge Li

Main category: cs.CV

TL;DR: 本文提出TIR-Flow框架,通过主动视频搜索与推理提升冻结视频语言模型的复杂时序推理能力,无需额外数据或参数更新,在七个基准上显著超越现有方法。

Details Motivation: 现有视频语言模型在推理能力上受限于被动处理范式,依赖大规模数据工程却未能激发模型内在智能以实现动态视觉探索。 Method: 提出TIR-Flow框架,包含三个协同模块:HDD将复杂问题分解为可验证子任务,HAP主动引导视觉注意力获取高分辨率证据,EBA维护持久化工作区积累线索进行逻辑推理,实现无需训练的主动感知与推理。 Result: 在七个基准测试上平均性能提升5.9%,Egoschema上最高提升10.5%,验证了主动感知对长时序视频推理的有效性。 Conclusion: 赋予冻结VLM类System-2的主动感知能力是解决长时序视频复杂推理的可扩展路径。 Abstract: While Large Video-Language Models (Video-LLMs) have achieved remarkable progress in perception, their reasoning capabilities remain a bottleneck. Existing solutions typically resort to a heavy "data engineering" paradigm-synthesizing large-scale Chain-of-Thought (CoT) datasets followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). This pipeline primarily optimizes probability sampling efficiency and aligns output distributions, but fails to activate the intrinsic intelligence required for dynamic visual exploration. In this work, we propose TIR-Flow, a novel framework that shifts the paradigm from passive processing to active video searching and reasoning without additional data or parameter updating. Concretely, our framework operates through three synergistic modules: HDD decomposes complex queries into a set of verifiable sub-tasks; HAP actively directs visual attention to gather high-resolution evidence for hypothesis validation; EBA maintains a persistent workspace to accumulate and update the discovered clues for logical reasoning. Extensive experiments on seven benchmarks demonstrate that TIR-Flow significantly outperforms recent strong baselines, delivering an average performance boost of 5.9%, with gains reaching 10.5% on Egoschema. Our analysis confirms that empowering frozen VLMs with System-2-like active perception is a scalable path toward solving long-horizon video reasoning.

[156] A Unified Attention U-Net Framework for Cross-Modality Tumor Segmentation in MRI and CT

Nishan Rai,Pushpa R. Dahal

Main category: cs.CV

TL;DR: 提出了一种统一的Attention U-Net架构,联合训练MRI和CT数据集,实现跨模态肿瘤分割,无需模态特定编码器或域适应方法。

Details Motivation: 探索单一模型在不同成像模态(如MRI和CT)和解剖部位下的通用性,减少对模态特异性设计的依赖。 Method: 采用注意力门控跳跃连接、模态一致预处理和模态感知的Focal Tversky损失函数,构建统一的Attention U-Net架构,并在BraTS 2021和LIDC-IDRI数据集上联合训练。 Result: 该模型在Dice系数、IoU和AUC指标上于两种模态数据均表现出具有竞争力的性能。 Conclusion: 验证了单一Attention U-Net在跨模态肿瘤分割中的可行性与鲁棒性,为后续研究提供了可复现的基线模型。 Abstract: This study presents a unified Attention U-Net architecture trained jointly on MRI (BraTS 2021) and CT (LIDC-IDRI) datasets to investigate the generalizability of a single model across diverse imaging modalities and anatomical sites. Our proposed pipeline incorporates modality-harmonized preprocessing, attention-gated skip connections, and a modality-aware Focal Tversky loss function. To the best of our knowledge, this study is among the first to evaluate a single Attention U-Net trained simultaneously on separate MRI (BraTS) and CT (LIDC-IDRI) tumor datasets, without relying on modality-specific encoders or domain adaptation. The unified model demonstrates competitive performance in terms of Dice coefficient, IoU, and AUC on both domains, thereby establishing a robust and reproducible baseline for future research in cross-modality tumor segmentation.

[157] How Does India Cook Biryani?

Shubham Goel,Farzana S,C V Rishi,Aditya Arun,C V Jawahar

Main category: cs.CV

TL;DR: 本文提出了首个大规模、精心策划的印度香饭烹饪视频数据集,包含12种不同地区风格的120个高质量YouTube视频,并设计了一个基于视觉-语言模型的多阶段框架,用于细粒度分割和对齐烹饪步骤,进而自动识别和解释区域差异。

Details Motivation: 现有视频理解方法难以捕捉烹饪视频中细粒度、多模态且具文化背景的差异,因此需要一种能够系统分析印度香饭区域制作差异的计算方法。 Method: 构建了一个包含120个视频的数据集,覆盖12种地区风格;采用多阶段框架,利用先进的视觉-语言模型(VLMs)将视频分割为细粒度步骤,并与音频转录和标准食谱文本对齐;在此基础上开发了视频比较流程,并构建了多层次推理的问答基准进行评估。 Result: 成功实现了对不同地区香饭制作过程的自动比较,生成可解释的差异分析;在零样本和微调设置下评估了多种最先进模型的表现,结合人工验证提升了精度;发布了数据、代码和项目网站。 Conclusion: 该工作为视觉-语言模型在结构化、多模态推理任务上的评估提供了新测试平台,并开辟了通过烹饪视频进行文化遗产计算分析的新方向。 Abstract: Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to study such culinary variations using computational tools systematically. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework leveraging recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos. We release all data, code, and the project website at https://farzanashaju.github.io/how-does-india-cook-biryani/.

[158] QwenStyle: Content-Preserving Style Transfer with Qwen-Image-Edit

Shiwen Zhang,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: 本文提出了一种基于Qwen-Image-Edit的内容保持风格迁移模型QwenStyle,通过课程持续学习框架在混合高质量与噪声数据上训练,实现了在风格相似性、内容一致性和美学质量上的SOTA性能。

Details Motivation: 解决Diffusion Transformers在内容和风格特征耦合下难以实现内容保持的风格迁移问题。 Method: 构建高质量特定风格数据集并合成大规模风格图像三元组,采用课程持续学习框架训练QwenStyle模型。 Result: QwenStyle V1在风格相似性、内容一致性和美学质量三个核心指标上达到最先进水平。 Conclusion: 所提方法有效解耦内容与风格,实现了高保真内容保持与灵活风格定制的统一。 Abstract: Content-Preserving Style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to its internal entangled content and style features. In this technical report, we propose the first content-preserving style transfer model trained on Qwen-Image-Edit, which activates Qwen-Image-Edit's strong content preservation and style customization capability. We collected and filtered high quality data of limited specific styles and synthesized triplets with thousands categories of style images in-the-wild. We introduce the Curriculum Continual Learning framework to train QwenStyle with such mixture of clean and noisy triplets, which enables QwenStyle to generalize to unseen styles without degradation of the precise content preservation capability. Our QwenStyle V1 achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

[159] Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

Tayyab Rehman,Giovanni De Gasperis,Aly Shmahell

Main category: cs.CV

TL;DR: 提出一种级联多智能体框架,用于动态视觉环境中的智能异常检测,兼顾实时性、语义可解释性和计算效率。

Details Motivation: 传统方法在实时性、语义解释和低层次异常检测之间存在割裂,难以同时满足高效与可解释的视觉异常检测需求。 Method: 构建一个级联多智能体框架,结合重建过滤、对象级评估和高层语义推理;采用自适应触发机制和发布-订阅通信结构,实现异步协调与可扩展部署。 Result: 在大规模监控数据上验证,相比直接使用视觉语言模型推断延迟降低三倍,保持高感知保真度(PSNR = 38.3 dB, SSIM = 0.965)和一致的语义标注。 Conclusion: 该框架通过早期退出机制、自适应多智能体推理和可解释归因,为可扩展的智能视觉监控提供了高效、节能且可复现的基础。 Abstract: Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.

[160] When Imbalance Comes Twice: Active Learning under Simulated Class Imbalance and Label Shift in Binary Semantic Segmentation

Julien Combes,Alexandre Derville,Jean-François Coeurjolly

Main category: cs.CV

TL;DR: 本研究通过模拟实验探讨了类别不平衡和标签偏移对主动学习算法的影响,使用两个开源数据集构建人工数据集,并比较了随机采样、基于熵的选择和核心集选择三种策略。结果表明,即使在高度不平衡的数据集中,主动学习(尤其是基于熵和核心集的方法)仍然有效,但在强标签偏移下效率会下降。

Details Motivation: 由于机器视觉中存在大量无缺陷图像且存储有限,导致类别不平衡和潜在的标签偏移问题,影响主动学习性能,因此需要研究这两种不平衡如何影响主动学习算法的效果。 Method: 基于两个开源数据集设计模拟实验,人工控制类别不平衡和标签偏移的程度,比较三种主动学习策略:随机采样、基于熵的选择和核心集选择。 Result: 实验显示,熵基和核心集选择在高度不平衡数据中仍保持高效;但在强标签偏移情况下,所有主动学习方法的效率均有所下降。 Conclusion: 主动学习在面对严重类别不平衡时依然有效,但标签偏移会显著影响其性能,需在实际应用中加以考虑。 Abstract: The aim of Active Learning is to select the most informative samples from an unlabelled set of data. This is useful in cases where the amount of data is large and labelling is expensive, such as in machine vision or medical imaging. Two particularities of machine vision are first, that most of the images produced are free of defects, and second, that the amount of images produced is so big that we cannot store all acquired images. This results, on the one hand, in a strong class imbalance in defect distribution and, on the other hand, in a potential label shift caused by limited storage. To understand how these two forms of imbalance affect active learning algorithms, we propose a simulation study based on two open-source datasets. We artificially create datasets for which we control the levels of class imbalance and label shift. Three standard active learning selection strategies are compared: random sampling, entropy-based selection, and core-set selection. We demonstrate that active learning strategies, and in particular the entropy-based and core-set selections, remain interesting and efficient even for highly imbalanced datasets. We also illustrate and measure the loss of efficiency that occurs in the situation a strong label shift.

[161] Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architectur

Yani Meziani

Main category: cs.CV

TL;DR: Akasha 2提出了一种融合H-SSD与VL-JEPA的多模态架构,通过引入物理启发的归纳偏置,在视频预测、视觉合成速度和能效方面显著优于现有方法。

Details Motivation: 现有模型在时空一致性、推理效率和能量守恒方面存在局限,难以满足移动设备上的实时多模态任务需求。 Method: 结合哈密顿状态空间对偶(H-SSD)与视觉-语言联合嵌入预测架构(VL-JEPA),采用Mamba-3选择性状态空间模型和稀疏哈密顿专家混合(SMoE-HE),并引入哈密顿流匹配(HFM)与持久化3D高斯点阵进行视觉生成。 Result: 实现了FVD为287的最先进视频预测性能,视觉合成速度比扩散模型快4倍,推理速度较Transformer基线提升3-18倍,且在长时程中保持能量守恒,延迟低于50ms。 Conclusion: 将物理规律(如守恒律)融入神经网络架构可显著提升模型的效率、稳定性和时空一致性,为未来世界模型提供了新的设计范式。 Abstract: We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.

[162] Two-step Authentication: Multi-biometric System Using Voice and Facial Recognition

Kuan Wei Chen,Ting Yi Lin,Wen Ren Yang,Aryan Kesarwani,Riya Singh

Main category: cs.CV

TL;DR: 提出了一种基于面部和语音的两步认证系统,利用常见设备的摄像头和麦克风实现低成本、高准确性的身份验证。

Details Motivation: 为了在普通设备上实现安全且计算成本低的双因素身份验证,结合面部与语音生物特征以提升鲁棒性。 Method: 采用两步认证流程:首先使用MTCNN定位人脸,通过剪枝的VGG-16模型进行人脸识别;随后仅对匹配的身份执行基于CNN的语音验证。模型分别在增强的小规模人脸数据集和LibriSpeech语音数据集上训练。 Result: 人脸识别在5人数据集上达到95.1%准确率,语音识别在test-clean上达到98.9%准确率和3.456% EER。 Conclusion: 该系统在资源受限设备上实现了高效、准确的多模态身份认证,具备良好的实用性和扩展性。 Abstract: We present a cost-effective two-step authentication system that integrates face identification and speaker verification using only a camera and microphone available on common devices. The pipeline first performs face recognition to identify a candidate user from a small enrolled group, then performs voice recognition only against the matched identity to reduce computation and improve robustness. For face recognition, a pruned VGG-16 based classifier is trained on an augmented dataset of 924 images from five subjects, with faces localized by MTCNN; it achieves 95.1% accuracy. For voice recognition, a CNN speaker-verification model trained on LibriSpeech (train-other-360) attains 98.9% accuracy and 3.456% EER on test-clean. Source code and trained models are available at https://github.com/NCUE-EE-AIAL/Two-step-Authentication-Multi-biometric-System.

[163] SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization

Xinghao Wang,Changtao Miao,Dianmo Sheng,Tao Gong,Qi Chu,Nenghai Yu,Quanchen Zou,Deyue Zhang,Xiangzheng Zhang

Main category: cs.CV

TL;DR: 提出了一种名为SAPL的新方法,用于在弱监督下定位恶意图像篡改区域,利用边界感知的提示学习和对比学习,在CLIP框架中有效突出篡改边缘,显著优于现有方法。

Details Motivation: 现有弱监督方法依赖图像级标签且忽略对定位至关重要的局部边缘线索,导致定位精度不足;同时像素级标注成本高昂,限制了模型训练。 Method: 提出语义无关提示学习(SAPL),包含两个模块:1)边缘感知上下文提示学习(ECPL),通过注意力机制利用增强边缘特征生成文本提示,使CLIP关注篡改边缘;2)分层边缘对比学习(HECL),在视觉空间中对比真实与篡改边缘块以增强区分性。最终基于相似性图预测篡改区域。 Result: 在多个公开基准上实验表明,SAPL显著优于现有方法,实现了最先进的篡改区域定位性能。 Conclusion: SAPL通过显式建模篡改边界线索,在弱监督设置下有效提升了图像篡改定位精度,为基于CLIP的视觉任务提供了新的提示学习范式。 Abstract: Malicious image manipulation threatens public safety and requires efficient localization methods. Existing approaches depend on costly pixel-level annotations which make training expensive. Existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIPs multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL) to exploit edge information in both textual and visual spaces. The proposed ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantic-irrelevant information into text features, to guide CLIP focusing on manipulation edges. The proposed HECL extract genuine and manipulated edge patches, and utilize contrastive learning to boost the discrimination between genuine edge patches and manipulated edge patches. Finally, we predict the manipulated regions from the similarity map after processing. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.

[164] Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

Miao Pan,Wangjie Gan,Jintao Chen,Wenqi Zhang,Bing Sun,Jianwei Yin,Xuhong Zhang

Main category: cs.CV

TL;DR: 本文系统分析了多模态大语言模型(MLLMs)在强化学习训练中产生幻觉的三个根本原因,并提出一个包含视觉定位增强、探索多样性优化和样本干扰缓解的综合框架,显著降低了幻觉率并提升了推理准确性。

Details Motivation: MLLMs在实际部署中受限于幻觉问题,尤其在强化学习优化过程中更为严重,亟需系统性分析其成因并提出有效解决方案。 Method: 提出三模块框架:1)通过规划与描述阶段增强视觉定位,使用基于质量的奖励确保初始锚定准确;2)根据奖励分布的均值和方差对样本分类,优先高方差样本以提升探索多样性;3)通过分组样本对并应用InfoNCE损失调节NTK相似性,缓解样本间的破坏性冲突。 Result: 实验结果表明,所提方法显著降低了MLLMs的幻觉率,并有效提高了推理准确率。 Conclusion: 通过改进视觉锚定、优化探索策略和调控样本间梯度交互,可有效缓解MLLMs在强化学习训练中的幻觉问题,提升模型稳定性和可靠性。 Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework comprising three core modules. First, we enhance visual localization by introducing dedicated planning and captioning stages before the reasoning phase, employing a quality-based caption reward to ensure accurate initial anchoring. Second, to improve exploration, we categorize samples based on the mean and variance of their reward distributions, prioritizing samples with high variance to focus the model on diverse and informative data. Finally, to mitigate sample interference, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss to push overly similar pairs apart and pull dissimilar ones closer, thereby guiding gradient interactions toward a balanced range. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.

[165] Synthetic FMCW Radar Range Azimuth Maps Augmentation with Generative Diffusion Model

Zhaoze Wang,Changxu Zhang,Tai Fei,Christopher Grimm,Yi Jin,Claas Tebruegge,Ernst Warsitz,Markus Gardill

Main category: cs.CV

TL;DR: 提出一种基于生成扩散模型的条件生成框架,用于合成逼真的汽车雷达Range-Azimuth图,通过置信图进行条件控制,并引入几何感知和时序一致性优化,显著提升雷达数据质量和下游任务性能。

Details Motivation: 由于标注良好的汽车雷达数据集稀缺且多样性低,限制了深度学习在环境感知中的性能,因此需要有效的雷达数据生成方法来增强训练数据。 Method: 采用生成扩散模型,以置信图为条件输入(每个通道对应一个语义类别并编码高斯分布的标注位置),并引入几何感知条件和时序一致性正则化以适应雷达信号特性。 Result: 在ROD2021数据集上,峰值信噪比(PSNR)比基线方法提高3.6 dB;结合真实与合成数据训练后,平均精度mAP提升4.15%。 Conclusion: 该生成框架能生成物理合理且多样化的雷达频谱数据,有效提升下游感知任务的模型泛化能力。 Abstract: The scarcity and low diversity of well-annotated automotive radar datasets often limit the performance of deep-learning-based environmental perception. To overcome these challenges, we propose a conditional generative framework for synthesizing realistic Frequency-Modulated Continuous-Wave radar Range-Azimuth Maps. Our approach leverages a generative diffusion model to generate radar data for multiple object categories, including pedestrians, cars, and cyclists. Specifically, conditioning is achieved via Confidence Maps, where each channel represents a semantic class and encodes Gaussian-distributed annotations at target locations. To address radar-specific characteristics, we incorporate Geometry Aware Conditioning and Temporal Consistency Regularization into the generative process. Experiments on the ROD2021 dataset demonstrate that signal reconstruction quality improves by \SI{3.6}{dB} in Peak Signal-to-Noise Ratio over baseline methods, while training with a combination of real and synthetic datasets improves overall mean Average Precision by 4.15% compared with conventional image-processing-based augmentation. These results indicate that our generative framework not only produces physically plausible and diverse radar spectrum but also substantially improves model generalization in downstream tasks.

[166] A survey of facial recognition techniques

Aya Kaysan Bahjat

Main category: cs.CV

TL;DR: 本文综述了人脸识别领域的关键技术与挑战,涵盖光照、年龄、姿态、遮挡和表情等因素,并评估了多种主流方法及常用人脸数据库。

Details Motivation: 人脸识别在多媒体内容快速增长的背景下成为重要研究方向,但面部特征复杂且受多种因素影响,亟需系统性综述以梳理现有技术并推动发展。 Method: 通过文献综述方式,分析了包括HMM、PCA、SVM、ANN、Eigenfaces等多种人脸识别方法,并结合JAFEE、Yale、LFW等主流人脸数据库进行实验结果对比。 Result: 总结了当前先进的人脸识别技术及其在不同条件下的表现,提供了对各方法优劣的分析,并展示了部分实验结果。 Conclusion: 该研究为 facial recognition 领域提供了全面的技术回顾与发展现状分析,有助于指导未来研究方向与实际应用设计。 Abstract: As multimedia content is quickly growing, the field of facial recognition has become one of the major research fields, particularly in the recent years. The most problematic area to researchers in image processing and computer vision is the human face which is a complex object with myriads of distinctive features that can be used to identify the face. The survey of this survey is particularly focused on most challenging facial characteristics, including differences in the light, ageing, variation in poses, partial occlusion, and facial expression and presents methodological solutions. The factors, therefore, are inevitable in the creation of effective facial recognition mechanisms used on facial images. This paper reviews the most sophisticated methods of facial detection which are Hidden Markov Models, Principal Component Analysis (PCA), Elastic Cluster Plot Matching, Support Vector Machine (SVM), Gabor Waves, Artificial Neural Networks (ANN), Eigenfaces, Independent Component Analysis (ICA), and 3D Morphable Model. Alongside the works mentioned above, we have also analyzed the images of a number of facial databases, namely JAFEE, FEI, Yale, LFW, AT&T (then called ORL), and AR (created by Martinez and Benavente), to analyze the results. However, this survey is aimed at giving a thorough literature review of face recognition, and its applications, and some experimental results are provided at the end after a detailed discussion.

[167] EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

Stevenson Pather,Niels Martignène,Arnaud Bugnet,Fouad Boutaleb,Fabien D'Hondt,Deise Santana Maia

Main category: cs.CV

TL;DR: EyeTheia是一个轻量级、开源的基于网络摄像头的视线估计深度学习管道,适用于浏览器实验平台和实际认知与临床研究,具有良好的实时性能和可扩展性。

Details Motivation: 为了在低成本、易部署的条件下实现可靠的视线追踪,满足认知科学和临床研究中对可重复、可扩展实验工具的需求。 Method: 结合MediaPipe的关键点提取与受iTracker启发的卷积神经网络,并探索两种策略:基于移动数据预训练模型迁移和从头训练桌面数据模型,辅以用户特异性微调。 Result: 在MPIIFaceGaze数据集上验证显示两种策略在标定前表现相当,用户微调显著降低误差;在Dot-Probe任务中与SeeSo SDK有良好一致性,但时间变异性较高。 Conclusion: EyeTheia提供了一个透明、可扩展且低成本的视线追踪解决方案,适合大规模实验与临床应用,代码与模型已公开。 Abstract: We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.

[168] NAS-GS: Noise-Aware Sonar Gaussian Splatting

Shida Xu,Jingqi Jiang,Jonatan Scharff Willners,Sen Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为NAS-GS的噪声感知声呐高斯点阵化框架,用于解决水下声呐图像在3D重建和新视角合成中的挑战。

Details Motivation: 声呐图像具有复杂噪声模式和缺乏高程信息等特点,给3D重建和新视角合成带来困难。 Method: 提出了双向点阵化技术和基于高斯混合模型(GMM)的噪声建模方法,以更准确地模拟声呐成像中的强度累积和透射率计算,并捕捉复杂的噪声模式。 Result: 在仿真和真实的大规模离岸声呐场景中均实现了最先进的性能,显著提升了渲染速度和重建精度。 Conclusion: NAS-GS框架有效解决了声呐图像中的噪声和重建问题,在多种应用中展现出优越的性能。 Abstract: Underwater sonar imaging plays a crucial role in various applications, including autonomous navigation in murky water, marine archaeology, and environmental monitoring. However, the unique characteristics of sonar images, such as complex noise patterns and the lack of elevation information, pose significant challenges for 3D reconstruction and novel view synthesis. In this paper, we present NAS-GS, a novel Noise-Aware Sonar Gaussian Splatting framework specifically designed to address these challenges. Our approach introduces a Two-Ways Splatting technique that accurately models the dual directions for intensity accumulation and transmittance calculation inherent in sonar imaging, significantly improving rendering speed without sacrificing quality. Moreover, we propose a Gaussian Mixture Model (GMM) based noise model that captures complex sonar noise patterns, including side-lobes, speckle, and multi-path noise. This model enhances the realism of synthesized images while preventing 3D Gaussian overfitting to noise, thereby improving reconstruction accuracy. We demonstrate state-of-the-art performance on both simulated and real-world large-scale offshore sonar scenarios, achieving superior results in novel view synthesis and 3D reconstruction.

[169] Perception Test 2025: Challenge Summary and a Unified VQA Extension

Joseph Heyward,Nikhil Pathasarathy,Tyler Zhu,Aravindh Mahendran,João Carreira,Dima Damen,Andrew Zisserman,Viorica Pătrăucean

Main category: cs.CV

TL;DR: Perception Test 2025挑战旨在通过统一的多模态感知任务评估视频模型性能,强调使用统一接口处理多样化任务,暴露了现有模型在统一框架下的局限性。

Details Motivation: 推动多模态模型向更通用、统一的架构发展,减少对任务特定设计的依赖,真实反映模型在复杂感知任务中的泛化能力。 Method: 将多个传统感知任务(如目标跟踪、动作定位等)整合为统一的基准测试,新增多项选择式视频问答形式,并要求参赛者使用统一模型而非专用流水线。 Result: 五个整合赛道展示了当前SOTA模型在统一接口下表现不佳,凸显了实现真正通用多模态感知的挑战;部分赛道仍开放提交。 Conclusion: 统一化的任务设计对现有模型构成更大挑战,揭示了通往通用视觉-语言模型的关键瓶颈与未来研究方向。 Abstract: The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure the progress in multimodal perception. This year, the workshop featured 2 guest tracks as well: KiVA (an image understanding challenge) and Physic-IQ (a video generation challenge). In this report, we summarise the results from the main Perception Test challenge, detailing both the existing tasks as well as novel additions to the benchmark. In this iteration, we placed an emphasis on task unification, as this poses a more challenging test for current SOTA multimodal models. The challenge included five consolidated tracks: unified video QA, unified object and point tracking, unified action and sound localisation, grounded video QA, and hour-long video QA, alongside an analysis and interpretability track that is still open for submissions. Notably, the unified video QA track introduced a novel subset that reformulates traditional perception tasks (such as point tracking and temporal action localisation) as multiple-choice video QA questions that video-language models can natively tackle. The unified object and point tracking merged the original object tracking and point tracking tasks, whereas the unified action and sound localisation merged the original temporal action localisation and temporal sound localisation tracks. Accordingly, we required competitors to use unified approaches rather than engineered pipelines with task-specific models. By proposing such a unified challenge, Perception Test 2025 highlights the significant difficulties existing models face when tackling diverse perception tasks through unified interfaces.

[170] VideoWeave: A Data-Centric Approach for Efficient Video Understanding

Zane Durante,Silky Singh,Arpandeep Khatua,Shobhit Agarwal,Reuben Tan,Yong Jae Lee,Jianfeng Gao,Ehsan Adeli,Li Fei-Fei

Main category: cs.CV

TL;DR: VideoWeave是一种通过拼接短片段视频构建合成长上下文训练样本的方法,以提升视频-语言模型的数据效率,无需修改模型结构或优化目标。

Details Motivation: 训练视频-语言模型因处理长序列的高成本和标注长视频数据稀缺而受限,需要更高效的数据利用方法。 Method: 将现有数据集中带字幕的短视频片段通过随机或视觉聚类方式拼接,并进行字幕增强,重构训练数据以增加时间多样性。 Result: 在相同计算资源下,使用VideoWeave训练的模型在视频问答任务中表现优于传统微调方法。 Conclusion: 重新组织训练数据是一种简单且可扩展的提升视频-语言模型性能的有效途径。 Abstract: Training video-language models is often prohibitively expensive due to the high cost of processing long frame sequences and the limited availability of annotated long videos. We present VideoWeave, a simple yet effective approach to improve data efficiency by constructing synthetic long-context training samples that splice together short, captioned videos from existing datasets. Rather than modifying model architectures or optimization objectives, VideoWeave reorganizes available video-text pairs to expand temporal diversity within fixed compute. We systematically study how different data composition strategies like random versus visually clustered splicing and caption enrichment affect downstream performance on downstream video question answering. Under identical compute constraints, models trained with VideoWeave achieve higher accuracy than conventional video finetuning. Our results highlight that reorganizing training data, rather than altering architectures, may offer a simple and scalable path for training video-language models. We link our code for all experiments here.

[171] Object-WIPER : Training-Free Object and Associated Effect Removal in Videos

Saksham Singh Kushwaha,Sayan Nag,Yapeng Tian,Kuldeep Kulkarni

Main category: cs.CV

TL;DR: 本文提出了Object-WIPER,一种无需训练即可从视频中移除动态物体及其视觉影响并进行语义一致、时间连贯修复的框架。

Details Motivation: 现有方法在去除动态物体时难以同时保证语义一致性与时间连贯性,且缺乏合适的评估指标。 Method: 利用预训练的文本到视频扩散变换器(DiT),通过用户提供的对象掩码和查询标记,结合视觉-文本交叉注意力与自注意力机制生成中间效果掩码,并融合得到最终前景掩码;通过视频逆向获得结构化噪声,重新初始化掩码区域的令牌,在去噪过程中保持背景令牌不变以确保场景保真度。 Result: 在DAVIS和新构建的真实世界基准WIPER-Bench上实验表明,该方法在新提出的评价指标下优于基于训练和无需训练的基线方法,实现了干净的去除和时间稳定的重建。 Conclusion: Object-WIPER有效实现了无需训练的动态物体去除与高质量视频修复,同时提出的新评估指标有助于推动该领域发展。 Abstract: In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.

[172] Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification

Ahmed Abdelkawy,Ahmed Elsayed,Asem Ali,Aly Farag,Thomas Tretter,Michael McIntyre

Main category: cs.CV

TL;DR: 提出一种基于视频的三阶段框架来测量学生参与度,利用视觉-语言模型和大语言模型结合上下文进行少样本学习。

Details Motivation: 现有方法需要大量标注数据且忽略课堂上下文,同时受限于隐私问题导致数据难以共享。 Method: 首先对视觉-语言模型进行少样本微调用于学生动作识别;然后使用滑动时间窗口将视频分段并预测每段动作类别;最后利用大语言模型结合动作序列与课堂上下文判断学生是否参与。 Result: 实验结果表明该方法在识别学生参与度方面有效。 Conclusion: 所提出的框架能有效利用少量样本和上下文信息提升学生参与度识别性能,适用于隐私敏感场景。 Abstract: Understanding student behavior in the classroom is essential to improve both pedagogical quality and student engagement. Existing methods for predicting student engagement typically require substantial annotated data to model the diversity of student behaviors, yet privacy concerns often restrict researchers to their own proprietary datasets. Moreover, the classroom context, represented in peers' actions, is ignored. To address the aforementioned limitation, we propose a novel three-stage framework for video-based student engagement measurement. First, we explore the few-shot adaptation of the vision-language model for student action recognition, which is fine-tuned to distinguish among action categories with a few training samples. Second, to handle continuous and unpredictable student actions, we utilize the sliding temporal window technique to divide each student's 2-minute-long video into non-overlapping segments. Each segment is assigned an action category via the fine-tuned VLM model, generating a sequence of action predictions. Finally, we leverage the large language model to classify this entire sequence of actions, together with the classroom context, as belonging to an engaged or disengaged student. The experimental results demonstrate the effectiveness of the proposed approach in identifying student engagement.

[173] GlobalPaint: Spatiotemporal Coherent Video Outpainting with Global Feature Guidance

Yueming Pan,Ruoyu Feng,Jianmin Bao,Chong Luo,Nanning Zheng

Main category: cs.CV

TL;DR: 提出GlobalPaint,一种基于扩散模型的时空连贯视频外绘框架,采用分层流程和增强的时空模块提升重建质量和运动自然性。

Details Motivation: 视频外绘需要在空间合理性和时间连贯性之间取得平衡,尤其是当相机或物体运动导致外绘内容随时间显现时,现有方法难以保持长期时间一致性。 Method: 采用分层流程:先外绘关键帧,再通过基于完成边界的插值模型补全中间帧;模型层面引入增强的时空模块(3D窗口注意力)和全局特征引导(从OpenCLIP提取全局token)。 Result: 在基准数据集上验证了更高的重建质量和更自然的运动表现,优于先前方法。 Conclusion: GlobalPaint通过分层处理和全局语义引导有效提升了视频外绘的时空一致性与视觉质量。 Abstract: Video outpainting extends a video beyond its original boundaries by synthesizing missing border content. Compared with image outpainting, it requires not only per-frame spatial plausibility but also long-range temporal coherence, especially when outpainted content becomes visible across time under camera or object motion. We propose GlobalPaint, a diffusion-based framework for spatiotemporal coherent video outpainting. Our approach adopts a hierarchical pipeline that first outpaints key frames and then completes intermediate frames via an interpolation model conditioned on the completed boundaries, reducing error accumulation in sequential processing. At the model level, we augment a pretrained image inpainting backbone with (i) an Enhanced Spatial-Temporal module featuring 3D windowed attention for stronger spatiotemporal interaction, and (ii) global feature guidance that distills OpenCLIP features from observed regions across all frames into compact global tokens using a dedicated extractor. Comprehensive evaluations on benchmark datasets demonstrate improved reconstruction quality and more natural motion compared to prior methods. Our demo page is https://yuemingpan.github.io/GlobalPaint/

[174] WHU-PCPR: A cross-platform heterogeneous point cloud dataset for place recognition in complex urban scenes

Xianghong Zou,Jianping Li,Yandi Yang,Weitong Wu,Yuan Wang,Qiegen Liu,Zhen Dong

Main category: cs.CV

TL;DR: 本文提出了WHU-PCPR,一个用于点云地点识别的跨平台异构大规模数据集,涵盖不同传感器、场景和长时间跨度的真实数据,并基于该数据集对现有方法进行了评估与分析。

Details Motivation: 现有点云地点识别数据集在场景、平台和传感器多样性方面不足,限制了相关研究的发展,因此需要构建更具代表性的数据集。 Method: 采集来自车载移动激光扫描系统(MLS)和便携式头戴激光扫描系统(PLS)的异构点云数据,覆盖城市与校园道路的长期变化场景,轨迹总长82.3公里,时间跨度60个月,并建立公开基准测试。 Result: 发布了WHU-PCPR数据集,包含跨平台、多传感器、大范围和长期变化的真实点云数据,并对多种代表性PCPR方法进行了系统评估。 Conclusion: WHU-PCPR为点云地点识别提供了更丰富和挑战性的数据支持,有助于推动该领域在实际应用场景下的发展。 Abstract: Point Cloud-based Place Recognition (PCPR) demonstrates considerable potential in applications such as autonomous driving, robot localization and navigation, and map update. In practical applications, point clouds used for place recognition are often acquired from different platforms and LiDARs across varying scene. However, existing PCPR datasets lack diversity in scenes, platforms, and sensors, which limits the effective development of related research. To address this gap, we establish WHU-PCPR, a cross-platform heterogeneous point cloud dataset designed for place recognition. The dataset differentiates itself from existing datasets through its distinctive characteristics: 1) cross-platform heterogeneous point clouds: collected from survey-grade vehicle-mounted Mobile Laser Scanning (MLS) systems and low-cost Portable helmet-mounted Laser Scanning (PLS) systems, each equipped with distinct mechanical and solid-state LiDAR sensors. 2) Complex localization scenes: encompassing real-time and long-term changes in both urban and campus road scenes. 3) Large-scale spatial coverage: featuring 82.3 km of trajectory over a 60-month period and an unrepeated route of approximately 30 km. Based on WHU-PCPR, we conduct extensive evaluation and in-depth analysis of several representative PCPR methods, and provide a concise discussion of key challenges and future research directions. The dataset and benchmark code are available at https://github.com/zouxianghong/WHU-PCPR.

[175] How to Build Robust, Scalable Models for GSV-Based Indicators in Neighborhood Research

Xiaoya Tang,Xiaohe Yue,Heran Mane,Dapeng Li,Quynh Nguyen,Tolga Tasdizen

Main category: cs.CV

TL;DR: 本文通过实证分析探讨了如何为标签和数据量有限的应用场景选择和调整基础视觉模型,并利用无监督训练提升在Google街景等非传统图像数据上的泛化能力。

Details Motivation: 由于健康研究中对街区环境的大规模系统性表征需求增加,而现有视觉模型在不同领域(如ImageNet到GSV)间泛化能力不确定,因此需要探索适用于小规模、少标签实际数据集的模型选择与适应策略。 Method: 通过在有标注的小数据集上进行评估,并结合大规模无标注数据使用无监督适应方法,对多种基础模型进行定量与可视化比较分析。 Result: 展示了无监督适应前后模型性能的变化,明确了不同模型在特定条件下的表现差异,并揭示了训练规模和策略对下游任务性能的影响。 Conclusion: 为社会健康等应用领域提供了实用指导,说明在资源受限情况下如何有效选择和优化视觉模型以适应特定数据分布。 Abstract: A substantial body of health research demonstrates a strong link between neighborhood environments and health outcomes. Recently, there has been increasing interest in leveraging advances in computer vision to enable large-scale, systematic characterization of neighborhood built environments. However, the generalizability of vision models across fundamentally different domains remains uncertain, for example, transferring knowledge from ImageNet to the distinct visual characteristics of Google Street View (GSV) imagery. In applied fields such as social health research, several critical questions arise: which models are most appropriate, whether to adopt unsupervised training strategies, what training scale is feasible under computational constraints, and how much such strategies benefit downstream performance. These decisions are often costly and require specialized expertise. In this paper, we answer these questions through empirical analysis and provide practical insights into how to select and adapt foundation models for datasets with limited size and labels, while leveraging larger, unlabeled datasets through unsupervised training. Our study includes comprehensive quantitative and visual analyses comparing model performance before and after unsupervised adaptation.

[176] Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs

Weihao Hong,Zhiyuan Jiang,Bingyu Shen,Xinlei Guan,Yangyi Feng,Meng Xu,Boyang Li

Main category: cs.CV

TL;DR: 本文研究了不同提示强度如何影响视觉-语言模型中的幻觉行为,提出了Ghost-100数据集和5级提示强度框架,发现幻觉率不随提示强度单调增加,且当前安全对齐在识别语义敌意上优于结构胁迫。

Details Motivation: 现有研究主要关注对象存在与否的幻觉,缺乏对提示措辞和结构约束如何系统性引发幻觉的分析。 Method: 提出Ghost-100合成场景数据集和5级提示强度框架,评估三种开源VLM在不同提示压力下的幻觉行为。 Result: 三种模型的幻觉率不随提示强度单调上升,在高强度下出现不同程度下降,表明安全对齐对语义敌意比结构胁迫更有效。 Conclusion: 当前VLM的安全对齐机制在应对结构性合规压力方面存在局限,需进一步改进以应对复杂提示胁迫。 Abstract: Vision-Language Models (VLMs) are increasingly used in safety-critical applications that require reliable visual grounding. However, these models often hallucinate details that are not present in the image to satisfy user prompts. While recent datasets and benchmarks have been introduced to evaluate systematic hallucinations in VLMs, many hallucination behaviors remain insufficiently characterized. In particular, prior work primarily focuses on object presence or absence, leaving it unclear how prompt phrasing and structural constraints can systematically induce hallucinations. In this paper, we investigate how different forms of prompt pressure influence hallucination behavior. We introduce Ghost-100, a procedurally generated dataset of synthetic scenes in which key visual details are deliberately removed, enabling controlled analysis of absence-based hallucinations. Using a structured 5-Level Prompt Intensity Framework, we vary prompts from neutral queries to toxic demands and rigid formatting constraints. We evaluate three representative open-weight VLMs: MiniCPM-V 2.6-8B, Qwen2-VL-7B, and Qwen3-VL-8B. Across all three models, hallucination rates do not increase monotonically with prompt intensity. All models exhibit reductions at higher intensity levels at different thresholds, though not all show sustained reduction under maximum coercion. These results suggest that current safety alignment is more effective at detecting semantic hostility than structural coercion, revealing model-specific limitations in handling compliance pressure. Our dataset is available at: https://github.com/bli1/tone-matters

[177] On the Adversarial Robustness of 3D Large Vision-Language Models

Chao Liu,Ngai-Man Cheung

Main category: cs.CV

TL;DR: 本文首次系统研究了基于点云的3D视觉语言模型(3D VLMs)的对抗鲁棒性,提出了视觉攻击和字幕攻击两种策略,发现3D VLMs在非目标攻击下存在显著脆弱性,但在目标攻击下比2D模型更具韧性。

Details Motivation: 尽管3D VLMs在理解任务中表现出色,但其对抗鲁棒性尚未被充分探索,尤其是在引入3D视觉输入后是否会影响模型安全性仍不清楚。 Method: 提出两种互补的攻击策略:视觉攻击(扰动3D编码器和投影器生成的视觉token特征)评估视觉-语言对齐鲁棒性;字幕攻击(直接操纵输出token序列)评估端到端系统鲁棒性。每种攻击包含非目标和目标变体。 Result: 实验表明,3D VLMs在非目标攻击下表现出显著的对抗脆弱性,但在目标攻击下相比2D VLMs更具抵抗力。 Conclusion: 3D VLMs存在潜在安全风险,需加强其对抗鲁棒性,特别是在安全关键应用中的部署前应进行充分评估与加固。 Abstract: 3D Vision-Language Models (VLMs), such as PointLLM and GPT4Point, have shown strong reasoning and generalization abilities in 3D understanding tasks. However, their adversarial robustness remains largely unexplored. Prior work in 2D VLMs has shown that the integration of visual inputs significantly increases vulnerability to adversarial attacks, making these models easier to manipulate into generating toxic or misleading outputs. In this paper, we investigate whether incorporating 3D vision similarly compromises the robustness of 3D VLMs. To this end, we present the first systematic study of adversarial robustness in point-based 3D VLMs. We propose two complementary attack strategies: \textit{Vision Attack}, which perturbs the visual token features produced by the 3D encoder and projector to assess the robustness of vision-language alignment; and \textit{Caption Attack}, which directly manipulates output token sequences to evaluate end-to-end system robustness. Each attack includes both untargeted and targeted variants to measure general vulnerability and susceptibility to controlled manipulation. Our experiments reveal that 3D VLMs exhibit significant adversarial vulnerabilities under untargeted attacks, while demonstrating greater resilience against targeted attacks aimed at forcing specific harmful outputs, compared to their 2D counterparts. These findings highlight the importance of improving the adversarial robustness of 3D VLMs, especially as they are deployed in safety-critical applications.

[178] SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning

Chenxu Dang,Jie Wang,Guang Li,Zhiwen Hou,Zihan You,Hangjun Ye,Jie Ma,Long Chen,Yan Wang

Main category: cs.CV

TL;DR: 本文提出了SparseOccVLA,一种结合视觉语言模型与稀疏语义占据的统一视觉-语言-动作模型,用于提升自动驾驶中的场景理解、占据预测与轨迹规划能力。

Details Motivation: 现有视觉语言模型在细粒度空间推理上不足,而语义占据虽精细但难以高效融合到语言模型中,缺乏统一框架整合二者优势。 Method: 提出SparseOccVLA,包含轻量级稀疏占据编码器生成紧凑的占据查询,作为视觉与语言之间的桥梁;通过LLM对齐并推理这些查询以实现统一场景理解与未来占据预测;引入LLM引导的Anchor-Diffusion Planner,支持解耦锚点评分与去噪,并融合跨模态轨迹条件。 Result: 在OmniDrive-nuScenes上CIDEr指标相对提升7%,Occ3D-nuScenes上mIoU提高0.5,在nuScenes上实现最先进的开环规划性能。 Conclusion: SparseOccVLA有效桥接了视觉语言模型与语义占据表示,实现了更高效、统一且高性能的自动驾驶多任务理解与决策框架。 Abstract: In autonomous driving, Vision Language Models (VLMs) excel at high-level reasoning , whereas semantic occupancy provides fine-grained details. Despite significant progress in individual fields, there is still no method that can effectively integrate both paradigms. Conventional VLMs struggle with token explosion and limited spatiotemporal reasoning, while semantic occupancy provides a unified, explicit spatial representation but is too dense to integrate efficiently with VLMs. To address these challenges and bridge the gap between VLMs and occupancy, we propose SparseOccVLA, a novel vision-language-action model that unifies scene understanding, occupancy forecasting, and trajectory planning powered by sparse occupancy queries. Starting with a lightweight Sparse Occupancy Encoder, SparseOccVLA generates compact yet highly informative sparse occupancy queries that serve as the single bridge between vision and language. These queries are aligned into the language space and reasoned by the LLM for unified scene understanding and future occupancy forecasting. Furthermore, we introduce an LLM-guided Anchor-Diffusion Planner featuring decoupled anchor scoring and denoising, as well as cross-model trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr over the state-of-the-art on OmniDrive-nuScenes, a 0.5 increase in mIoU score on Occ3D-nuScenes, and sets state-of-the-art open-loop planning metric on nuScenes benchmark, demonstrating its strong holistic capability.

[179] VVTRec: Radio Interferometric Reconstruction through Visual and Textual Modality Enrichment

Kai Cheng,Ruoqi Wang,Qiong Luo

Main category: cs.CV

TL;DR: 提出了一种名为VVTRec的多模态射电干涉数据重建方法,通过可见性引导的视觉和文本模态增强来改善图像重建质量。

Details Motivation: 现有方法仅考虑稀疏可见性数据的单一模态,导致重建图像中仍存在伪影且相关性建模不足。 Method: 将稀疏可见性数据转换为图像形式和文本形式特征,并利用视觉-语言模型(VLMs)实现无需额外训练的性能提升。 Result: 实验表明,VVTRec能有效利用多模态信息增强成像结果,同时未引入过多计算开销。 Conclusion: VVTRec通过多模态信息融合提升了射电天文成像的质量和结构完整性。 Abstract: Radio astronomy is an indispensable discipline for observing distant celestial objects. Measurements of wave signals from radio telescopes, called visibility, need to be transformed into images for astronomical observations. These dirty images blend information from real sources and artifacts. Therefore, astronomers usually perform reconstruction before imaging to obtain cleaner images. Existing methods consider only a single modality of sparse visibility data, resulting in images with remaining artifacts and insufficient modeling of correlation. To enhance the extraction of visibility information and emphasize output quality in the image domain, we propose VVTRec, a multimodal radio interferometric data reconstruction method with visibility-guided visual and textual modality enrichment. In our VVTRec, sparse visibility is transformed into image-form and text-form features to obtain enhancements in terms of spatial and semantic information, improving the structural integrity and accuracy of images. Also, we leverage Vision-Language Models (VLMs) to achieve additional training-free performance improvements. VVTRec enables sparse visibility, as a foreign modality unseen by VLMs, to accurately extract pre-trained knowledge as a supplement. Our experiments demonstrate that VVTRec effectively enhances imaging results by exploiting multimodal information without introducing excessive computational overhead.

[180] SRFlow: A Dataset and Regularization Model for High-Resolution Facial Optical Flow via Splatting Rasterization

JiaLin Zhang,Dong Li

Main category: cs.CV

TL;DR: 本文提出了一个高分辨率面部光流数据集SRFlow和一种新的面部光流模型SRFlowNet,通过定制的正则化损失和高斯溅射光栅化引导,显著提升了面部运动分析和微表情识别的性能。

Details Motivation: 由于缺乏高分辨率的面部光流数据集,面部运动分析领域的发展受到限制。 Method: 提出SRFlow数据集和SRFlowNet模型,利用掩码和梯度(通过差分或Sobel算子计算)来约束光流预测,抑制无纹理或重复模式区域的高频噪声和大尺度误差。 Result: 在SRFlow数据集上训练使多种光流模型的端点误差(EPE)最多降低42%;结合SRFlow数据集,SRFlowNet在三个微表情数据集的组合上F1分数最多提高48%。 Conclusion: SRFlow数据集和SRFlowNet模型有效推动了高分辨率面部光流估计和微表情识别的发展。 Abstract: Facial optical flow supports a wide range of tasks in facial motion analysis. However, the lack of high-resolution facial optical flow datasets has hindered progress in this area. In this paper, we introduce Splatting Rasterization Flow (SRFlow), a high-resolution facial optical flow dataset, and Splatting Rasterization Guided FlowNet (SRFlowNet), a facial optical flow model with tailored regularization losses. These losses constrain flow predictions using masks and gradients computed via difference or Sobel operator. This effectively suppresses high-frequency noise and large-scale errors in texture-less or repetitive-pattern regions, enabling SRFlowNet to be the first model explicitly capable of capturing high-resolution skin motion guided by Gaussian splatting rasterization. Experiments show that training with the SRFlow dataset improves facial optical flow estimation across various optical flow models, reducing end-point error (EPE) by up to 42% (from 0.5081 to 0.2953). Furthermore, when coupled with the SRFlow dataset, SRFlowNet achieves up to a 48% improvement in F1-score (from 0.4733 to 0.6947) on a composite of three micro-expression datasets. These results demonstrate the value of advancing both facial optical flow estimation and micro-expression recognition.

[181] Learning Domain Agnostic Latent Embeddings of 3D Faces for Zero-shot Animal Expression Transfer

Yue Wang,Lawrence Amadi,Xiang Gao,Yazheng Chen,Yuanpeng Liu,Ning Lu,Xianfeng Gu

Main category: cs.CV

TL;DR: 提出了一种零样本框架,用于将人类面部表情迁移到3D动物面部网格,仅使用人类表情数据训练即可实现跨物种表情迁移。

Details Motivation: 由于缺乏动物面部表情数据,现有方法难以实现从人类到动物的面部表情迁移,本文旨在解决这一跨物种表达迁移问题。 Method: 结合内在几何描述符(HKS/WKS)与一种网格无关的潜在嵌入,解耦面部身份与表情;利用ID潜在空间捕捉物种无关的面部结构,表达潜在空间编码跨人类与动物通用的变形模式,并通过Jacobian、顶点位置和拉普拉斯损失保证几何一致性。 Result: 实验表明该方法在无需动物表情数据的情况下实现了合理的跨物种表情迁移,有效缩小了人类与动物面部形状之间的几何差异。 Conclusion: 所提出的零样本框架能够成功地将人类面部表情迁移到3D动物面部网格,具有良好的泛化能力和应用潜力。 Abstract: We present a zero-shot framework for transferring human facial expressions to 3D animal face meshes. Our method combines intrinsic geometric descriptors (HKS/WKS) with a mesh-agnostic latent embedding that disentangles facial identity and expression. The ID latent space captures species-independent facial structure, while the expression latent space encodes deformation patterns that generalize across humans and animals. Trained only with human expression pairs, the model learns the embeddings, decoupling, and recoupling of cross-identity expressions, enabling expression transfer without requiring animal expression data. To enforce geometric consistency, we employ Jacobian loss together with vertex-position and Laplacian losses. Experiments show that our approach achieves plausible cross-species expression transfer, effectively narrowing the geometric gap between human and animal facial shapes.

[182] 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence

Hao Tang,Ting Huang,Zeyu Zhang

Main category: cs.CV

TL;DR: 提出3D CoCa v2,一种可泛化的3D图像描述框架,结合对比视觉-语言学习与测试时搜索(TTS),在室内和室外3D场景中实现更强的鲁棒性和零样本OOD泛化能力。

Details Motivation: 现有3D图像描述模型在点云稀疏不规则、跨环境泛化能力差(尤其是分布外泛化)以及对象定位弱等方面存在挑战。 Method: 基于冻结的CLIP语义先验、空间感知的3D场景编码器和多模态解码器,联合优化对比学习和描述生成目标,并在推理时引入无需参数更新的测试时搜索(TTS)以提升多样性与鲁棒性。 Result: 在ScanRefer和Nr3D上分别比3D CoCa提升+1.50和+1.61 CIDEr@0.5IoU,在TOD3Cap上的零样本OOD评估中提升+3.8 CIDEr@0.25。 Conclusion: 3D CoCa v2通过统一对比学习与生成框架并结合TTS策略,显著提升了3D图像描述的性能与跨环境泛化能力。 Abstract: Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at https://github.com/AIGeeksGroup/3DCoCav2.

[183] Bridging Robustness and Efficiency: Real-Time Low-Light Enhancement via Attention U-Net GAN

Yash Thesia,Meera Suthar

Main category: cs.CV

TL;DR: 提出了一种轻量级的注意力U-Net GAN模型,用于低光照图像增强,在保持0.06秒快速推理的同时实现生成式级别的纹理恢复性能。

Details Motivation: 现有低光照图像增强方法在生成质量与推理速度之间存在权衡:扩散模型质量高但慢,CNN模型快但易过平滑。缺乏兼具高质量纹理恢复和边缘部署速度的模型。 Method: 提出一种混合注意力U-Net GAN,通过引入注意力门机制的轻量U-Net主干,并在条件对抗框架下训练,以单次前向传播逼近生成模型的高频细节恢复能力。 Result: 在SID数据集上达到0.112的LPIPS分数,显著优于高效基线方法(如SID、EnlightenGAN),推理延迟仅0.06秒,相比潜在扩散模型提速40倍。 Conclusion: 无需复杂的迭代采样,所提方法可在边缘可部署的速度下实现接近生成模型的纹理恢复效果,填补了高效高质量低光增强的空白。 Abstract: Recent advancements in Low-Light Image Enhancement (LLIE) have focused heavily on Diffusion Probabilistic Models, which achieve high perceptual quality but suffer from significant computational latency (often exceeding 2-4 seconds per image). Conversely, traditional CNN-based baselines offer real-time inference but struggle with "over-smoothing," failing to recover fine structural details in extreme low-light conditions. This creates a practical gap in the literature: the lack of a model that provides generative-level texture recovery at edge-deployable speeds. In this paper, we address this trade-off by proposing a hybrid Attention U-Net GAN. We demonstrate that the heavy iterative sampling of diffusion models is not strictly necessary for texture recovery. Instead, by integrating Attention Gates into a lightweight U-Net backbone and training within a conditional adversarial framework, we can approximate the high-frequency fidelity of generative models in a single forward pass. Extensive experiments on the SID dataset show that our method achieves a best-in-class LPIPS score of 0.112 among efficient models, significantly outperforming efficient baselines (SID, EnlightenGAN) while maintaining an inference latency of 0.06s. This represents a 40x speedup over latent diffusion models, making our approach suitable for near real-time applications.

[184] BabyVision: Visual Reasoning Beyond Language

Liang Chen,Weichu Xie,Yiyan Liang,Hongfeng He,Hans Zhao,Zhibo Yang,Zhiqi Huang,Haoning Wu,Haoyu Lu,Y. charles,Yiping Bao,Yuantao Fan,Guopeng Li,Haiyang Shen,Xuanzhong Chen,Wendong Xu,Shuzheng Si,Zefan Cai,Wenhao Chai,Ziqi Huang,Fangfu Liu,Tianyu Liu,Baobao Chang,Xiaobo Hu,Kaiyuan Chen,Yixin Ren,Yang Liu,Yuan Gong,Kuan Li

Main category: cs.CV

TL;DR: 本文提出了BabyVision基准,用于评估多模态大语言模型(MLLMs)在不依赖语言知识情况下的核心视觉能力,发现当前MLLMs在基础视觉任务上远逊于人类,揭示其缺乏基本视觉原语。

Details Motivation: 现有MLLMs过度依赖语言先验,视觉理解脆弱,即使3岁儿童能轻松完成的基础视觉任务也常失败,需系统评估其核心视觉能力差距。 Method: 构建BabyVision基准,包含4大类22个子类共388项任务,独立于语言知识;提出BabyVision-Gen和自动评估工具包,评估生成模型的视觉推理能力。 Result: 实验和人类评估显示,顶级MLLMs表现显著低于人类基线,Gemini3-Pro-Preview得分为49.7,远低于成人平均分94.1和6岁儿童水平。 Conclusion: 尽管在知识密集型任务中表现出色,当前MLLMs仍缺乏基本视觉能力,BabyVision为实现类人视觉感知与推理提供了重要进展方向。 Abstract: While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess core visual abilities independent of linguistic knowledge for MLLMs. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.

[185] Toward Generalizable Deblurring: Leveraging Massive Blur Priors with Linear Attention for Real-World Scenarios

Yuanting Gao,Shuo Cao,Xiaohui Li,Yuandong Pu,Yihao Liu,Kai Zhang

Main category: cs.CV

TL;DR: 本文提出了一种新的图像去模糊方法GLOWDeblur,通过引入模糊模式预训练(BPP)和运动与语义引导(MoSeG),提升了模型在真实场景中的泛化能力。

Details Motivation: 现有去模糊方法在训练数据外表现差,主要由于数据集在真实性和多样性之间的权衡以及算法设计的局限性。 Method: 提出模糊模式预训练(BPP)以获取多样化的模糊先验,并结合运动与语义引导(MoSeG)增强严重退化下的恢复效果;构建轻量级扩散模型GLOWDeblur,融合卷积预重建与域对齐模块。 Result: 在六个基准和两个真实数据集上实验表明,所提方法显著提升泛化性能,且模型轻量便于实际应用。 Conclusion: 模糊先验的多样性是实现鲁棒泛化的关键,GLOWDeblur通过有效利用先验和轻量设计实现了高性能与实用性。 Abstract: Image deblurring has advanced rapidly with deep learning, yet most methods exhibit poor generalization beyond their training datasets, with performance dropping significantly in real-world scenarios. Our analysis shows this limitation stems from two factors: datasets face an inherent trade-off between realism and coverage of diverse blur patterns, and algorithmic designs remain restrictive, as pixel-wise losses drive models toward local detail recovery while overlooking structural and semantic consistency, whereas diffusion-based approaches, though perceptually strong, still fail to generalize when trained on narrow datasets with simplistic strategies. Through systematic investigation, we identify blur pattern diversity as the decisive factor for robust generalization and propose Blur Pattern Pretraining (BPP), which acquires blur priors from simulation datasets and transfers them through joint fine-tuning on real data. We further introduce Motion and Semantic Guidance (MoSeG) to strengthen blur priors under severe degradation, and integrate it into GLOWDeblur, a Generalizable reaL-wOrld lightWeight Deblur model that combines convolution-based pre-reconstruction & domain alignment module with a lightweight diffusion backbone. Extensive experiments on six widely-used benchmarks and two real-world datasets validate our approach, confirming the importance of blur priors for robust generalization and demonstrating that the lightweight design of GLOWDeblur ensures practicality in real-world applications. The project page is available at https://vegdog007.github.io/GLOWDeblur_Website/.

[186] Towards Egocentric 3D Hand Pose Estimation in Unseen Domains

Wiktor Mucha,Michael Wray,Martin Kampel

Main category: cs.CV

TL;DR: 本文提出V-HPOT方法,通过虚拟相机空间中的关键点深度归一化和自监督测试时优化,提升跨域第一视角图像中3D手部姿态估计的泛化能力。

Details Motivation: 现有方法在训练和测试域一致时表现良好,但在新环境中因训练数据有限和对特定相机内参过拟合而泛化能力差,尤其在深度感知方面存在挑战。 Method: 提出V-HPOT:1)在由焦距和图像尺寸归一化的虚拟相机空间中估计关键点z坐标,实现相机无关的深度预测;2)设计自监督测试时优化策略,利用3D一致性损失在推理过程中调整模型,适应目标域特征,无需真实标注。 Result: 在H2O数据集上平均姿态误差减少71%,AssemblyHands数据集上减少41%;相比现有方法,V-HPOT在所有单阶段方法中表现最优,且性能接近两阶段方法,但所需数据量仅为后者的1/3.5到1/14。 Conclusion: V-HPOT通过相机内参不变性建模和测试时自适应优化,显著提升了3D手部姿态估计在跨域场景下的鲁棒性和准确性,同时具备高效的数据利用率。 Abstract: We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain. However, they struggle to generalise to new environments due to limited training data and depth perception -- overfitting to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the model to adapt to target domain characteristics without requiring ground truth annotations. V-HPOT significantly improves 3D hand pose estimation performance in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing approximately x3.5 to x14 less data.

[187] LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models

Pan Liao,Feng Yang,Di Wu,Jinwen Yu,Yuhua Zhu,Wenhui Zhao

Main category: cs.CV

TL;DR: 本文提出了LLMTrack,一个端到端的语义多目标跟踪(SMOT)框架,通过结合Grounding DINO和多模态大模型LLaVA-OneVision,实现对物体“在哪里”和“为什么”的联合理解。

Details Motivation: 传统多目标跟踪系统虽精于定位与关联,但缺乏对物体行为语义(如‘是什么’和‘为什么’)的理解能力,限制了其在高层认知任务中的应用。 Method: 提出LLMTrack框架,采用仿生设计思想,将定位(Grounding DINO)与理解(LLaVA-OneVision)分离;引入时空融合模块聚合实例交互特征与视频上下文,并通过三阶段训练策略(视觉对齐、时序微调、LoRA语义注入)适应跟踪任务。 Result: 在BenSMOT基准上实验表明,LLMTrack在实例描述、交互识别和视频摘要等语义任务上显著优于现有方法,同时保持稳定的跟踪性能。 Conclusion: LLMTrack有效 bridged 几何感知与认知推理之间的鸿沟,为多目标跟踪系统赋予了语义理解与行为推理的能力,推动MOT向更高层次的认知智能发展。 Abstract: Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering \textit{where} and \textit{who}. However, they often function as autistic observers, capable of tracing geometric paths but blind to the semantic \textit{what} and \textit{why} behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose \textbf{LLMTrack}, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, utilizing Grounding DINO as the eyes and the LLaVA-OneVision multimodal large model as the brain. We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level contexts, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy, Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.

[188] ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Fangxu Yu,Ziyao Lu,Liqiang Niu,Fandong Meng,Jie Zhou

Main category: cs.CV

TL;DR: 提出ArrowGEV,一种基于强化学习的框架,通过建模事件的时间方向性提升视觉-语言模型在视频事件定位和时间理解上的性能。

Details Motivation: 现有方法主要在正向视频中关联事件与时间戳,忽略了事件内在的时间结构和方向性,限制了模型的鲁棒性和泛化能力。 Method: 受物理学中‘时间之箭’启发,将事件分为时间敏感型和时间不敏感型;对前者引入区分正向与反向视频的奖励机制,对后者则强制保持两个方向上的一致性。 Result: 实验表明,ArrowGEV不仅提升了事件定位精度和时间方向识别能力,还增强了整体视频理解和推理能力。 Conclusion: 显式建模时间方向性有助于提升VLM在视频事件 grounding 及时序理解方面的表现,为未来视频理解模型设计提供了新方向。 Abstract: Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.

[189] QCaption: Video Captioning and Q&A through Fusion of Large Multimodal Models

Jiale Wang,Gee Wah Ng,Lee Onn Mak,Randall Cher,Ng Ding Hei Ryan,Davis Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为QCaption的新型视频描述和问答流程,通过融合关键帧提取、大型多模态模型(LMM)和大型语言模型(LLM),在视频分析任务中实现了显著性能提升。

Details Motivation: 为了提升现有视频描述与问答系统的性能,并实现可在本地部署的自包含解决方案。 Method: 结合关键帧提取、大型多模态模型(LMM)进行图文分析,以及大型语言模型(LLM)进行文本分析,构建端到端的视频理解流程。 Result: 实验结果显示,QCaption在视频描述和问答任务上分别比现有方法提升了44.2%和48.9%,并通过消融研究验证了LLM在融合中的作用。 Conclusion: 模型融合方法能有效提升视频分析能力,QCaption展示了该策略在实际应用中的潜力。 Abstract: This paper introduces QCaption, a novel video captioning and Q&A pipeline that enhances video analytics by fusing three models: key frame extraction, a Large Multimodal Model (LMM) for image-text analysis, and a Large Language Model (LLM) for text analysis. This approach enables integrated analysis of text, images, and video, achieving performance improvements over existing video captioning and Q&A models; all while remaining fully self-contained, adept for on-premises deployment. Experimental results using QCaption demonstrated up to 44.2% and 48.9% improvements in video captioning and Q&A tasks, respectively. Ablation studies were also performed to assess the role of LLM on the fusion on the results. Moreover, the paper proposes and evaluates additional video captioning approaches, benchmarking them against QCaption and existing methodologies. QCaption demonstrate the potential of adopting a model fusion approach in advancing video analytics.

[190] APEX: Learning Adaptive Priorities for Multi-Objective Alignment in Vision-Language Generation

Dongliang Chen,Xinlin Zhuang,Junjie Xu,Luojian Xie,Zehui Wang,Jiaxi Zhuang,Haolin Yang,Liang Dou,Xiao He,Xingjiao Wu,Ying Qian

Main category: cs.CV

TL;DR: 本文提出了一种名为APEX的多目标对齐方法,用于解决文本到图像生成中因异构奖励导致的优化不平衡问题。通过双阶段自适应归一化和基于学习潜力、冲突惩罚与进展需求的动态优先级调度,APEX在Stable Diffusion 3.5上实现了更优的帕累托权衡。

Details Motivation: 固定权重的静态线性标量化在处理异构奖励时容易导致优化失衡,例如模型过度拟合高方差、高响应性的目标(如OCR),而忽视感知类目标。本文旨在解决这一优化偏差问题。 Method: 提出APEX框架,包含双阶段自适应归一化以稳定异构奖励,并设计P^3自适应优先级机制,结合学习潜力、冲突惩罚和进展需求来动态调度各目标的优化顺序。 Result: 在Stable Diffusion 3.5上的实验表明,APEX在四个异构目标上均取得平衡提升:PickScore提升+1.31,DeQA提升+0.35,Aesthetics提升+0.53,同时保持了有竞争力的OCR准确率,并有效缓解了多目标对齐中的不稳定性。 Conclusion: APEX通过自适应归一化和动态优先级调度,有效解决了多目标对齐中的方差劫持和梯度冲突问题,提升了文本到图像生成中多目标优化的稳定性与均衡性。 Abstract: Multi-objective alignment for text-to-image generation is commonly implemented via static linear scalarization, but fixed weights often fail under heterogeneous rewards, leading to optimization imbalance where models overfit high-variance, high-responsiveness objectives (e.g., OCR) while under-optimizing perceptual goals. We identify two mechanistic causes: variance hijacking, where reward dispersion induces implicit reweighting that dominates the normalized training signal, and gradient conflicts, where competing objectives produce opposing update directions and trigger seesaw-like oscillations. We propose APEX (Adaptive Priority-based Efficient X-objective Alignment), which stabilizes heterogeneous rewards with Dual-Stage Adaptive Normalization and dynamically schedules objectives via P^3 Adaptive Priorities that combine learning potential, conflict penalty, and progress need. On Stable Diffusion 3.5, APEX achieves improved Pareto trade-offs across four heterogeneous objectives, with balanced gains of +1.31 PickScore, +0.35 DeQA, and +0.53 Aesthetics while maintaining competitive OCR accuracy, mitigating the instability of multi-objective alignment.

[191] Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration

Yingying Deng,Xiangyu He,Fan Tang,Weiming Dong,Xucheng Yin

Main category: cs.CV

TL;DR: 本文提出了一种无需训练的风格化图像生成框架,通过将参考风格图像与掩码目标图像拼接,利用预训练的ReFlow修复模型结合多模态注意力融合实现文本引导下的高保真风格迁移。

Details Motivation: 现有方法在风格化图像生成中依赖重训练或复杂的反演过程,导致内容完整性受损、风格保真度低以及语义与风格之间的权衡不理想。 Method: 将风格引导合成重构为上下文学习任务,通过文本语义提示,将参考风格图像与掩码目标图像拼接,利用预训练的ReFlow修复模型进行多模态注意力融合,并引入动态语义-风格集成(DSSI)机制来重新加权文本语义与视觉风格标记的注意力。 Result: 实验表明该方法在保持高风格保真度的同时实现了更好的语义-风格平衡,输出图像质量更高,且避免了传统方法中的伪影问题。 Conclusion: 所提方法提供了一种简单而强大的无需训练的风格化图像生成方案,有效解决了多模态注意力融合中的不平衡和噪声敏感问题。 Abstract: Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization with visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by textual semantic prompts, our method concatenates a reference style image with a masked target image, leveraging a pretrained ReFlow-based inpainting model to seamlessly integrate semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in multimodal attention fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic and style visual tokens, effectively resolving guidance conflicts and enhancing output coherence. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.

[192] Boosting Overlapping Organoid Instance Segmentation Using Pseudo-Label Unmixing and Synthesis-Assisted Learning

Gui Huang,Kangyuan Zheng,Xuan Cai,Jiaqi Wang,Jianjia Zhang,Kaida Ning,Wenbo Wei,Yujuan Zhu,Jiong Zhang,Mengting Liu

Main category: cs.CV

TL;DR: 本文提出了一种用于类器官实例分割的伪标签解混(PLU)方法,解决了半监督学习中重叠区域伪标签噪声导致的训练偏差问题,并结合基于轮廓的图像合成和实例级增强,在仅使用10%标注数据的情况下达到了与全监督方法相当的性能。

Details Motivation: 由于高质量标注数据稀缺且显微图像中普遍存在类器官重叠,现有实例分割方法受限;传统半监督学习在处理重叠区域时因伪标签噪声而产生偏差,亟需一种能有效应对重叠实例的分割框架。 Method: 提出伪标签解混(PLU)机制,识别并修正重叠区域的错误伪标签,通过实例分解重建准确标签;采用基于轮廓的图像合成策略高效生成包含重叠情况的合成图像;在伪标签上进行实例级增强以提升合成数据效果,并实现增强感知的训练流程。 Result: 在两个类器官数据集上实验表明,本方法仅用10%标注数据即达到与全监督模型相当的性能,显著优于现有半监督方法,取得当前最优结果;消融实验证实了PLU、基于轮廓的合成和增强感知训练各自的贡献。 Conclusion: 本文通过在伪标签和图像合成两个层面解决重叠问题,推动了可扩展、低标签依赖的类器官分析技术发展,为精准医学中的高通量应用提供了新可能。 Abstract: Organoids, sophisticated in vitro models of human tissues, are crucial for medical research due to their ability to simulate organ functions and assess drug responses accurately. Accurate organoid instance segmentation is critical for quantifying their dynamic behaviors, yet remains profoundly limited by high-quality annotated datasets and pervasive overlap in microscopy imaging. While semi-supervised learning (SSL) offers a solution to alleviate reliance on scarce labeled data, conventional SSL frameworks suffer from biases induced by noisy pseudo-labels, particularly in overlapping regions. Synthesis-assisted SSL (SA-SSL) has been proposed for mitigating training biases in semi-supervised semantic segmentation. We present the first adaptation of SA-SSL to organoid instance segmentation and reveal that SA-SSL struggles to disentangle intertwined organoids, often misrepresenting overlapping instances as a single entity. To overcome this, we propose Pseudo-Label Unmixing (PLU), which identifies erroneous pseudo-labels for overlapping instances and then regenerates organoid labels through instance decomposition. For image synthesis, we apply a contour-based approach to synthesize organoid instances efficiently, particularly for overlapping cases. Instance-level augmentations (IA) on pseudo-labels before image synthesis further enhances the effect of synthetic data (SD). Rigorous experiments on two organoid datasets demonstrate our method's effectiveness, achieving performance comparable to fully supervised models using only 10% labeled data, and state-of-the-art results. Ablation studies validate the contributions of PLU, contour-based synthesis, and augmentation-aware training. By addressing overlap at both pseudo-label and synthesis levels, our work advances scalable, label-efficient organoid analysis, unlocking new potential for high-throughput applications in precision medicine.

[193] eSkiTB: A Synthetic Event-based Dataset for Tracking Skiers

Krishna Vinod,Joseph Raj Vishal,Kaustav Chanda,Prithvi Jai Ramesh,Yezhou Yang,Bharatesh Chakravarthi

Main category: cs.CV

TL;DR: 本文提出了eSkiTB,首个基于事件相机的滑雪跟踪数据集,通过直接视频到事件转换生成,实现了RGB与事件模态间的等信息对比。实验表明,基于事件的跟踪(SDTrack)在复杂广播场景中显著优于传统RGB方法(STARK),尤其在静态遮挡下IoU提升20.0点,验证了事件相机在冬季运动跟踪中的潜力。

Details Motivation: 由于运动模糊、静态叠加和杂乱背景,从RGB广播画面中跟踪滑雪运动员具有挑战性;缺乏针对冬季运动的可控事件相机跟踪基准。 Method: 提出eSkiTB,一个使用直接视频到事件转换(无需神经插值)生成的合成事件型滑雪跟踪数据集,并在该数据集上使用SDTrack(脉冲Transformer)与STARK(RGB Transformer)进行基准测试。 Result: 在以静态叠加为主的场景中,基于事件的跟踪达到0.685 IoU,比RGB方法高出20.0点;在整个数据集上,SDTrack平均IoU为0.711,显示出事件数据对视觉拥挤环境中弹道运动跟踪的有效性。 Conclusion: eSkiTB建立了冬季运动中事件驱动跟踪的第一个受控基准,证明了事件相机在处理广播级视觉干扰时的鲁棒性,展示了其在体育跟踪中的应用前景。 Abstract: Tracking skiers in RGB broadcast footage is challenging due to motion blur, static overlays, and clutter that obscure the fast-moving athlete. Event cameras, with their asynchronous contrast sensing, offer natural robustness to such artifacts, yet a controlled benchmark for winter-sport tracking has been missing. We introduce event SkiTB (eSkiTB), a synthetic event-based ski tracking dataset generated from SkiTB using direct video-to-event conversion without neural interpolation, enabling an iso-informational comparison between RGB and event modalities. Benchmarking SDTrack (spiking transformer) against STARK (RGB transformer), we find that event-based tracking is substantially resilient to broadcast clutter in scenes dominated by static overlays, achieving 0.685 IoU, outperforming RGB by +20.0 points. Across the dataset, SDTrack attains a mean IoU of 0.711, demonstrating that temporal contrast is a reliable cue for tracking ballistic motion in visually congested environments. eSkiTB establishes the first controlled setting for event-based tracking in winter sports and highlights the promise of event cameras for ski tracking. The dataset and code will be released at https://github.com/eventbasedvision/eSkiTB.

[194] Quantification and Classification of Carbon Nanotubes in Electron Micrographs using Vision Foundation Models

Sanjay Pradeep,Chen Wang,Matthew M. Dahm,Jeff D. Eldredge,Candace S. J. Tsai

Main category: cs.CV

TL;DR: 本研究提出了一种基于视觉基础模型的统一框架,用于自动量化和分类电子显微镜图像中的碳纳米管(CNT),结合SAM模型进行高精度分割,并利用DINOv2提取粒子区域特征,在少量训练数据下实现了95.5%的分类准确率,显著提升了纳米材料分析的效率与可重复性。

Details Motivation: 现有碳纳米管形态表征方法依赖耗时且主观的手动分割,难以满足高通量毒理学和暴露评估的需求。 Method: 采用Segment Anything Model(SAM)实现交互式高精度分割,并设计了一个新的分类流程:利用分割掩码限制DINOv2视觉Transformer仅从粒子区域提取特征,抑制背景噪声。 Result: 在1800张TEM图像的数据集上,该框架对四种碳纳米管形态的分类准确率达到95.5%,远超当前基线方法,并能在混合样本中正确识别共存的不同粒子类型。 Conclusion: 将零样本分割与自监督特征学习相结合,可将传统劳动密集型的纳米材料分析转变为可扩展、数据驱动的高效流程。 Abstract: Accurate characterization of carbon nanotube morphologies in electron microscopy images is vital for exposure assessment and toxicological studies, yet current workflows rely on slow, subjective manual segmentation. This work presents a unified framework leveraging vision foundation models to automate the quantification and classification of CNTs in electron microscopy images. First, we introduce an interactive quantification tool built on the Segment Anything Model (SAM) that segments particles with near-perfect accuracy using minimal user input. Second, we propose a novel classification pipeline that utilizes these segmentation masks to spatially constrain a DINOv2 vision transformer, extracting features exclusively from particle regions while suppressing background noise. Evaluated on a dataset of 1,800 TEM images, this architecture achieves 95.5% accuracy in distinguishing between four different CNT morphologies, significantly outperforming the current baseline despite using a fraction of the training data. Crucially, this instance-level processing allows the framework to resolve mixed samples, correctly classifying distinct particle types co-existing within a single field of view. These results demonstrate that integrating zero-shot segmentation with self-supervised feature learning enables high-throughput, reproducible nanomaterial analysis, transforming a labor-intensive bottleneck into a scalable, data-driven process.

[195] When Humans Judge Irises: Pupil Size Normalization as an Aid and Synthetic Irises as a Challenge

Mahsa Mitcheff,Adam Czajka

Main category: cs.CV

TL;DR: 本研究探讨了人类在虹膜验证中的表现,特别是在不同瞳孔大小和真实/合成虹膜图像对比下的识别准确性,发现基于自动编码器的瞳孔归一化可显著提升准确率,且人类对同眼合成虹膜图像的误判率较高。

Details Motivation: 在法医应用中,虹膜识别结果常需人工确认,尤其是在样本质量差或存在呈现攻击的情况下,因此需要研究人类在不同条件下的虹膜验证表现。 Method: 通过两个受控实验:(a) 在有无瞳孔大小对齐(线性/非线性)条件下评估人类验证性能;(b) 使用合成生成的真实与冒名虹膜图像对进行测试,并结合现代自动编码器模型进行瞳孔归一化处理。 Result: 基于自动编码器的瞳孔大小归一化显著提高了验证准确率;参与者能有效判断真实和合成虹膜是否来自同眼,但在对比真实虹膜与高质量同眼合成图像时准确率下降。 Conclusion: 瞳孔大小对齐对涉及人工参与的虹膜匹配至关重要;尽管生成模型质量高,同眼合成虹膜图像仍更易被人类误判为不同眼。 Abstract: Iris recognition is a mature biometric technology offering remarkable precision and speed, and allowing for large-scale deployments to populations exceeding a billion enrolled users (e.g., AADHAAR in India). However, in forensic applications, a human expert may be needed to review and confirm a positive identification before an iris matching result can be presented as evidence in court, especially in cases where processed samples are degraded (e.g., in post-mortem cases) or where there is a need to judge whether the sample is authentic, rather than a result of a presentation attack. This paper presents a study that examines human performance in iris verification in two controlled scenarios: (a) under varying pupil sizes, with and without a linear/nonlinear alignment of the pupil size between compared images, and (b) when both genuine and impostor iris image pairs are synthetically generated. The results demonstrate that pupil size normalization carried out by a modern autoencoder-based identity-preserving image-to-image translation model significantly improves verification accuracy. Participants were also able to determine whether iris pairs corresponded to the same or different eyes when both images were either authentic or synthetic. However, accuracy declined when subjects were comparing authentic irises against high-quality, same-eye synthetic counterparts. These findings (a) demonstrate the importance of pupil-size alignment for iris matching tasks in which humans are involved, and (b) indicate that despite the high fidelity of modern generative models, same-eye synthetic iris images are more often judged by humans as different-eye images, compared to same-eye authentic image pairs. We offer data and human judgments along with this paper to allow full replicability of this study and future works.

[196] Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models

Shaonan Liu,Guo Yu,Xiaoling Luo,Shiyi Zheng,Wenting Chen,Jie Liu,Linlin Shen

Main category: cs.CV

TL;DR: MedGaze-Bench是首个利用临床医生视线作为认知光标来评估医学多模态大语言模型在手术、急诊模拟和诊断解释中临床意图理解能力的基准,提出三维临床意图框架并引入Trap QA机制以检测幻觉和盲从问题。

Details Motivation: 现有医学多模态大语言模型缺乏对自我视角下临床意图的理解评估,且当前基准无法有效衡量这一关键能力。 Method: 构建MedGaze-Bench基准,利用医生眼动数据定义认知光标,设计三维临床意图框架(空间、时间、规范意图)进行评估,并引入Trap QA机制检测模型可靠性。 Result: 实验表明当前MLLMs因过度依赖全局特征而在精细目标识别、因果推理和安全协议遵循方面表现不佳,易产生幻觉和盲目顺从错误指令。 Conclusion: MedGaze-Bench有效揭示了现有医学大模型在真实临床场景中理解临床意图的不足,强调需增强局部感知、时序因果推理与合规性判断能力。 Abstract: Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address these challenges, we introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework evaluating: (1) Spatial Intent: discriminating precise targets amid visual noise, (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning, and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms to stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal current MLLMs struggle with egocentric intent due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.

[197] The Normalized Difference Layer: A Differentiable Spectral Index Formulation for Deep Learning

Ali Lotfi,Adam Carter,Mohammad Meysami,Thuan Ha,Kwabena Nketia,Steve Shirtliffe

Main category: cs.CV

TL;DR: 提出了一种可微分的归一化差异层(Normalized Difference Layer),通过从数据中学习波段系数,将传统归一化指数融入深度学习架构,保持光照不变性和输出有界性的同时减少模型参数并提升抗噪能力。

Details Motivation: 传统归一化差值指数在遥感中广泛应用,但其系数固定为1,限制了在特定学习任务中的适应性,因此需要一种可学习、可微分的方法来提升灵活性和性能。 Method: 设计了一个可微分的归一化差异神经网络层,采用softplus重参数化确保系数为正且分母有界,并推导了前向和反向传播算法,支持端到端训练;扩展后可处理带符号输入,便于堆叠于深层网络中。 Result: 使用该层的模型在分类准确率上与标准MLP相当,但参数减少约75%;在10%乘性噪声下,准确率仅下降0.17%,远优于MLP的3.03%;学习到的系数模式在不同网络深度间具有一致性。 Conclusion: 归一化差异层成功结合了传统遥感指数的优点与深度学习的可学习性,在减少参数量和增强鲁棒性方面表现优异,具备在遥感及其他领域推广的潜力。 Abstract: Normalized difference indices have been a staple in remote sensing for decades. They stay reliable under lighting changes produce bounded values and connect well to biophysical signals. Even so, they are usually treated as a fixed pre processing step with coefficients set to one, which limits how well they can adapt to a specific learning task. In this study, we introduce the Normalized Difference Layer that is a differentiable neural network module. The proposed method keeps the classical idea but learns the band coefficients from data. We present a complete mathematical framework for integrating this layer into deep learning architectures that uses softplus reparameterization to ensure positive coefficients and bounded denominators. We describe forward and backward pass algorithms enabling end to end training through backpropagation. This approach preserves the key benefits of normalized differences, namely illumination invariance and outputs bounded to $[-1,1]$ while allowing gradient descent to discover task specific band weightings. We extend the method to work with signed inputs, so the layer can be stacked inside larger architectures. Experiments show that models using this layer reach similar classification accuracy to standard multilayer perceptrons while using about 75\% fewer parameters. They also handle multiplicative noise well, at 10\% noise accuracy drops only 0.17\% versus 3.03\% for baseline MLPs. The learned coefficient patterns stay consistent across different depths.

[198] CliffordNet: All You Need is Geometric Algebra

Zhongping Ji

Main category: cs.CV

TL;DR: 本文提出了基于几何代数的Clifford代数网络(CliffordNet),通过Clifford几何积统一特征混合机制,在保持线性复杂度的同时实现了高效表示学习,无需传统FFN模块,并在显著减少参数量的情况下达到先进性能。

Details Motivation: 挑战当前主流视觉模型依赖堆叠启发式模块(如注意力与卷积+FFN)的设计范式,回归数学原理,探索由代数完备性驱动的统一交互机制。 Method: 基于Clifford几何积(包含内积与外积)构建统一的特征交互机制,采用稀疏滚动实现严格线性复杂度O(N),完全摒弃标准FFN模块。 Result: CliffordNet在CIFAR-100上以仅1.4M参数的Nano版本达到76.41%准确率,性能匹配参数多8倍的ResNet-18;Base版本达到78.05%,创下小型模型新SOTA。 Conclusion: 全局理解可源于严格的局部代数完备交互,几何本身可能足以构建高效视觉 backbone,提示‘几何即所需’的新方向。 Abstract: Modern computer vision architectures, from CNNs to Transformers, predominantly rely on the stacking of heuristic modules: spatial mixers (Attention/Conv) followed by channel mixers (FFNs). In this work, we challenge this paradigm by returning to mathematical first principles. We propose the \textbf{Clifford Algebra Network (CAN)}, also referred to as CliffordNet, a vision backbone grounded purely in Geometric Algebra. Instead of engineering separate modules for mixing and memory, we derive a unified interaction mechanism based on the \textbf{Clifford Geometric Product} ($uv = u \cdot v + u \wedge v$). This operation ensures algebraic completeness regarding the Geometric Product by simultaneously capturing feature coherence (via the generalized inner product) and structural variation (via the exterior wedge product). Implemented via an efficient sparse rolling mechanism with \textbf{strict linear complexity $\mathcal{O}(N)$}, our model reveals a surprising emergent property: the geometric interaction is so representationally dense that standard Feed-Forward Networks (FFNs) become redundant. Empirically, CliffordNet establishes a new Pareto frontier: our \textbf{Nano} variant achieves \textbf{76.41\%} accuracy on CIFAR-100 with only \textbf{1.4M} parameters, effectively matching the heavy-weight ResNet-18 (11.2M) with \textbf{$8\times$ fewer parameters}, while our \textbf{Base} variant sets a new SOTA for tiny models at \textbf{78.05\%}. Our results suggest that global understanding can emerge solely from rigorous, algebraically complete local interactions, potentially signaling a shift where \textit{geometry is all you need}. Code is available at https://github.com/ParaMind2025/CAN.

[199] SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

Jiwen Zhang,Zejun Li,Siyuan Wang,Xiangyu Shi,Zhongyu Wei,Qi Wu

Main category: cs.CV

TL;DR: 本文提出了一种名为SpatialNav的零样本视觉语言导航(VLN)代理,通过构建空间场景图(SSG)显式捕捉环境中的全局空间结构和语义信息,从而显著提升导航性能。

Details Motivation: 现有的零样本VLN代理主要依赖局部观测进行导航,缺乏对环境的全局理解,导致探索效率低且性能较差。因此,需要一种能够显式建模全局空间结构的方法来改善零样本设置下的导航效果。 Method: 作者在允许代理在任务执行前充分探索环境的前提下,构建了空间场景图(SSG),并基于SSG设计了SpatialNav代理。该代理融合了以代理为中心的空间地图、指南针对齐的视觉表示以及远程对象定位策略,以实现高效导航。 Result: 在离散和连续环境中的大量实验表明,SpatialNav显著优于现有的零样本VLN方法,并明显缩小了与最先进的学习型方法之间的性能差距。 Conclusion: 研究表明,显式的全局空间表征对于可泛化的视觉语言导航至关重要,为零样本VLN提供了新的解决思路。 Abstract: Although learning-based vision-and-language navigation (VLN) agents can learn spatial knowledge implicitly from large-scale training data, zero-shot VLN agents lack this process, relying primarily on local observations for navigation, which leads to inefficient exploration and a significant performance gap. To deal with the problem, we consider a zero-shot VLN setting that agents are allowed to fully explore the environment before task execution. Then, we construct the Spatial Scene Graph (SSG) to explicitly capture global spatial structure and semantics in the explored environment. Based on the SSG, we introduce SpatialNav, a zero-shot VLN agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Comprehensive experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods. Such results highlight the importance of global spatial representations for generalizable navigation.

[200] SARA: Scene-Aware Reconstruction Accelerator

Jee Won Lee,Hansol Lim,Minhyeok Im,Dohyeon Lee,Jongseong Brad Choi

Main category: cs.CV

TL;DR: SARA是一种用于运动结构恢复的几何驱动图像对选择模块,通过优先考虑重建信息量(重叠与视差的乘积)来减少匹配对数量,在大幅加速的同时提升姿态估计精度。

Details Motivation: 传统SfM流程基于视觉相似性选择图像对,导致计算开销大且可能选择信息量低的配对;SARA旨在引入几何先验,提升匹配效率与重建质量。 Method: 提出一种轻量级预匹配阶段,利用互近邻和RANSAC估计重叠与视差,计算重建信息量得分,并构建信息加权生成树(IWST),辅以闭环、长基线锚点和弱视角增强的定向边进行优化。 Result: 相比穷举匹配,SARA在现代学习型检测器上平均减少98%的匹配对(从30,848降至580),实现最高50倍加速,旋转误差降低46.5±5.5%,平移误差降低12.5±6.5%,同时保持3D高斯溅射和SVRaster重建指标在基线水平的±3%内。 Conclusion: SARA通过几何优先的信息驱动策略,将SfM的匹配复杂度从二次降至准线性,在显著提速的同时提升重建精度,适用于高效大规模三维重建。 Abstract: We present SARA (Scene-Aware Reconstruction Accelerator), a geometry-driven pair selection module for Structure-from-Motion (SfM). Unlike conventional pipelines that select pairs based on visual similarity alone, SARA introduces geometry-first pair selection by scoring reconstruction informativeness - the product of overlap and parallax - before expensive matching. A lightweight pre-matching stage uses mutual nearest neighbors and RANSAC to estimate these cues, then constructs an Information-Weighted Spanning Tree (IWST) augmented with targeted edges for loop closure, long-baseline anchors, and weak-view reinforcement. Compared to exhaustive matching, SARA reduces rotation errors by 46.5+-5.5% and translation errors by 12.5+-6.5% across modern learned detectors, while achieving at most 50x speedup through 98% pair reduction (from 30,848 to 580 pairs). This reduces matching complexity from quadratic to quasi-linear, maintaining within +-3% of baseline reconstruction metrics for 3D Gaussian Splatting and SVRaster.

[201] Enhancing Low-resolution Image Representation Through Normalizing Flows

Chenglong Bao,Tongyao Pang,Zuowei Shen,Dihan Zheng,Yihang Zou

Main category: cs.CV

TL;DR: 本文提出了一种名为LR2Flow的非线性框架,用于学习低分辨率图像表示,结合小波紧框架块与归一化流,并在多种图像处理任务中验证了其有效性。

Details Motivation: 保留低频信息的同时尽可能保持可重建性和视觉内容完整性,以应对低分辨率图像表示中的挑战。 Method: 将小波紧框架块与标准化流结合,构建可逆神经网络,在小波域内进行低分辨率表示学习。 Result: 理论分析表明在小波紧框架域设计可逆网络是必要的;实验结果显示该方法在图像重缩放、压缩和去噪任务中具有良好的重建性能和鲁棒性。 Conclusion: LR2Flow能够有效学习兼具压缩效率和高重建质量的低分辨率图像表示,适用于多种图像处理应用。 Abstract: Low-resolution image representation is a special form of sparse representation that retains only low-frequency information while discarding high-frequency components. This property reduces storage and transmission costs and benefits various image processing tasks. However, a key challenge is to preserve essential visual content while maintaining the ability to accurately reconstruct the original images. This work proposes LR2Flow, a nonlinear framework that learns low-resolution image representations by integrating wavelet tight frame blocks with normalizing flows. We conduct a reconstruction error analysis of the proposed network, which demonstrates the necessity of designing invertible neural networks in the wavelet tight frame domain. Experimental results on various tasks, including image rescaling, compression, and denoising, demonstrate the effectiveness of the learned representations and the robustness of the proposed framework.

[202] OSCAR: Optical-aware Semantic Control for Aleatoric Refinement in Sar-to-Optical Translation

Hyunseo Lee,Sang Min Kim,Ho Kyung Shin,Taeheon Kim,Woo-Jeoung Nam

Main category: cs.CV

TL;DR: 提出了一种新的SAR-to-Optical翻译框架,通过跨模态语义对齐、语义引导生成和不确定性感知目标,显著提升了合成光学图像的质量和语义一致性。

Details Motivation: 现有方法受限于SAR数据的斑点噪声和几何畸变,导致语义误判、纹理模糊和结构幻觉,难以实现高质量的SAR到光学图像转换。 Method: 提出了三个核心技术:1)跨模态语义对齐,利用光学教师模型向SAR学生模型蒸馏语义先验;2)语义引导的ControlNet,结合文本和视觉提示进行全局与局部控制;3)不确定性感知损失函数,动态调整重建重点以抑制噪声影响。 Result: 实验表明该方法在感知质量和语义一致性方面优于现有最先进方法。 Conclusion: 所提框架有效缓解了SAR图像固有噪声和畸变带来的问题,实现了更真实、语义一致的光学图像生成。 Abstract: Synthetic Aperture Radar (SAR) provides robust all-weather imaging capabilities; however, translating SAR observations into photo-realistic optical images remains a fundamentally ill-posed problem. Current approaches are often hindered by the inherent speckle noise and geometric distortions of SAR data, which frequently result in semantic misinterpretation, ambiguous texture synthesis, and structural hallucinations. To address these limitations, a novel SAR-to-Optical (S2O) translation framework is proposed, integrating three core technical contributions: (i) Cross-Modal Semantic Alignment, which establishes an Optical-Aware SAR Encoder by distilling robust semantic priors from an Optical Teacher into a SAR Student (ii) Semantically-Grounded Generative Guidance, realized by a Semantically-Grounded ControlNet that integrates class-aware text prompts for global context with hierarchical visual prompts for local spatial guidance; and (iii) an Uncertainty-Aware Objective, which explicitly models aleatoric uncertainty to dynamically modulate the reconstruction focus, effectively mitigating artifacts caused by speckle-induced ambiguity. Extensive experiments demonstrate that the proposed method achieves superior perceptual quality and semantic consistency compared to state-of-the-art approaches.

[203] PRISM: Color-Stratified Point Cloud Sampling

Hansol Lim,Minhyeok Im,Jongseong Brad Choi

Main category: cs.CV

TL;DR: PRISM是一种基于颜色引导的分层采样方法,用于RGB-LiDAR点云,通过在颜色空间中分层并限制每色bin的样本数,保留纹理丰富区域,减少均匀区域采样。

Details Motivation: 传统下采样方法强调空间均匀性,忽略颜色信息;而实际场景中独特特征通常具有丰富的色彩变化,因此需利用颜色多样性指导采样。 Method: 将RGB颜色空间作为分层域,每个颜色bin设置最大容量k,按颜色多样性分配采样密度,实现对高色变区域的保留和同质区域的压缩。 Result: 生成更稀疏但保留关键视觉特征的点云,在3D重建任务中优于传统方法。 Conclusion: PRISM通过从空间驱动转向视觉复杂度驱动的采样策略,有效提升RGB-LiDAR点云下采样的质量与实用性。 Abstract: We present PRISM, a novel color-guided stratified sampling method for RGB-LiDAR point clouds. Our approach is motivated by the observation that unique scene features often exhibit chromatic diversity while repetitive, redundant features are homogeneous in color. Conventional downsampling methods (Random Sampling, Voxel Grid, Normal Space Sampling) enforce spatial uniformity while ignoring this photometric content. In contrast, PRISM allocates sampling density proportional to chormatic diversity. By treating RGB color space as the stratification domain and imposing a maximum capacity k per color bin, the method preserves texture-rich regions with high color variation while substantially reducing visually homogeneous surfaces. This shifts the sampling space from spatial coverage to visual complexity to produce sparser point clouds that retain essential features for 3D reconstruction tasks.

[204] Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models

Junyan Lin,Junlong Tong,Hao Wu,Jialiang Zhang,Jinming Liu,Xin Jin,Xiaoyu Shen

Main category: cs.CV

TL;DR: 提出一种并行流式框架,通过放松位置编码的全局连续性约束,实现多模态大模型在实时视频理解中的输入-输出并行,显著降低延迟。

Details Motivation: 现有的多模态大语言模型在实时视频理解中受限于标准位置编码的位置连续性约束,导致感知与生成必须串行进行,限制了实时交互性能。 Method: 提出三种设计:Overlapped、Group-Decoupled 和 Gap-Isolated,通过放松位置编码的连续性约束,实现感知与生成的并行化,其中Group-Decoupled在效率与性能间取得最佳平衡。 Result: 实验表明该框架在均衡负载下可实现最高达2倍的加速,同时保持高流畅性和准确性。 Conclusion: 所提出的并行流式框架为实现“边看边说”型实时系统提供了有效且原则性的解决方案。 Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: https://github.com/EIT-NLP/Speak-While-Watching.

[205] MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

Mengmeng Zhang,Xiaoping Wu,Hao Luo,Fan Wang,Yisheng Lv

Main category: cs.CV

TL;DR: 本文提出了MedGround,一个自动化管道,用于将分割资源转换为高质量的医学指代表达定位数据,并发布了包含35K样本的数据集,显著提升了视觉-语言模型在医学图像中的定位能力。

Details Motivation: 现有的视觉-语言模型在生成临床叙述时缺乏对视觉内容的有效定位,主要原因是缺少大规模、高质量的临床指代表达定位配对数据。 Method: 提出MedGround管道,利用专家分割掩码作为空间锚点,提取形状和空间线索,生成形态和位置相关的自然临床问题,并通过多阶段验证系统确保数据质量。 Result: 构建了MedGround-35K数据集,实验表明使用该数据训练的模型在指代表达定位任务上性能显著提升,增强了多对象语义消歧能力和在未见场景下的泛化性。 Conclusion: MedGround是一种可扩展的、数据驱动的方法,能有效将医学语言锚定到可验证的视觉证据上,推动医学VLM的发展。 Abstract: Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.

[206] MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

Changli Wu,Haodong Wang,Jiayi Ji,Yutian Yao,Chunsai Du,Jihua Kang,Yanwei Fu,Liujuan Cao

Main category: cs.CV

TL;DR: 本文提出了多视角3D指代表达分割(MV-3DRES)任务,并设计了MVGGT框架以实现从稀疏图像中直接进行场景结构恢复和目标分割,同时引入PVSO优化策略解决前景梯度稀释问题,在新构建的MVRefer基准上实现了高效准确的性能。

Details Motivation: 现有3D指代表达分割方法依赖密集高质量点云,而实际设备如机器人和手机仅能获取稀疏多视角图像且对延迟敏感,因此需要一种能在稀疏输入下高效准确完成任务的新方法。 Method: 提出MVGGT模型,采用双分支架构将语言信息融入稀疏视角几何推理;引入PVSO优化策略以缓解稀疏3D信号导致的前景梯度稀释问题;构建MVRefer基准用于标准化评估。 Result: MVGGT在MVRefer基准上实现了高精度和快速推理,显著优于现有方法,成为首个强基线模型。 Conclusion: 本文推动了多视角稀疏输入下的3D指代表达分割研究,提出的MVGGT与PVSO有效解决了几何质量差、梯度弱化等问题,为实际应用提供了高效解决方案。 Abstract: Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at https://mvggt.github.io.

[207] Unsupervised Domain Adaptation with SAM-RefiSeR for Enhanced Brain Tumor Segmentation

Dillan Imans,Phuoc-Nguyen Bui,Duc-Tai Le,Hyunseung Choo

Main category: cs.CV

TL;DR: 提出了一种基于SAM-RefiSeR的无监督域自适应方法,用于提升脑肿瘤分割性能。

Details Motivation: 解决医学图像中不同数据域之间的分布差异问题,以提高跨域脑肿瘤分割的准确性。 Method: 利用SAM(Segment Anything Model)进行初始分割,并引入RefiSeR模块对分割结果进行精细化修正,结合无监督域自适应策略减少域间差异。 Result: 在多个公开脑肿瘤数据集上验证了该方法的有效性,显著优于现有无监督域自适应方法。 Conclusion: SAM-RefiSeR通过有效的域自适应和分割优化,提升了跨域脑肿瘤图像的分割精度。 Abstract: Unsupervised Domain Adaptation with SAM-RefiSeR for Enhanced Brain Tumor Segmentation

[208] MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation

Xinhang Liu,Jiawei Shi,Zheng Dang,Yuchao Dai

Main category: cs.CV

TL;DR: 提出MixRI,一种轻量级网络,用于在RGB图像中解决基于CAD的新型物体位姿估计问题,无需微调即可应用于新物体,具有低内存需求和快速推理时间。

Details Motivation: 现有方法通常需要大量参考图像和庞大的网络参数,难以满足实际应用中对高效性和低资源消耗的需求。 Method: 设计了一种轻量级网络,通过多视图信息直接匹配查询图像与参考图像之间的点,并采用参考图像融合策略减少所需参考图像数量。 Result: 在BOP挑战的七个核心数据集上实验表明,尽管使用更少的参考图像和更小的网络参数,该方法仍能达到与其他先进方法相当的性能。 Conclusion: MixRI在减少参考图像数量、降低内存占用和加速推理方面表现优异,适用于真实场景中的新型物体位姿估计任务。 Abstract: We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have large network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thus decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Though with fewer reference images, experiments on seven core datasets in the BOP challenge show that our method achieves comparable results with other methods that require more reference images and larger network parameters.

[209] CLIMP: Contrastive Language-Image Mamba Pretraining

Nimrod Shabtay,Itamar Zimerman,Eli Schwartz,Raja Giryes

Main category: cs.CV

TL;DR: CLIMP是首个完全基于Mamba的视觉-语言对比模型,用Mamba替代了视觉和文本编码器,克服了CLIP中ViT的注意力机制对伪相关的敏感性和分辨率扩展问题,在跨模态检索、分布外鲁棒性和效率方面均显著优于CLIP。

Details Motivation: 现有的CLIP模型依赖Vision Transformer,其注意力机制容易捕捉到图像中的伪相关,并且计算复杂度随分辨率二次增长,限制了其在高分辨率和鲁棒性场景下的应用。 Method: 提出CLIMP,将视觉和文本编码器均替换为Mamba架构。使用VMamba捕捉视觉空间归纳偏置,利用Mamba的序列建模能力处理图像块序列,并采用自回归文本编码器支持更长上下文和密集描述检索。 Result: CLIMP在ImageNet-O上超越CLIP-ViT-B达7.5%,在16倍训练分辨率下检索准确率提升最高达6.6%,同时内存消耗减少5倍,FLOPs减少1.8倍,并支持可变输入分辨率而无需位置编码插值。 Conclusion: Mamba架构在视觉-语言学习中展现出优于Transformer的潜力,具备更强的鲁棒性、更高的效率和更好的扩展性,是CLIP架构的有力替代方案。 Abstract: Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers whose attention mechanism is susceptible to spurious correlations, and scales quadratically with resolution. To address these limitations, We present CLIMP, the first fully Mamba-based contrastive vision-language model that replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language, with VMamba capturing visual spatial inductive biases, reducing reliance on spurious correlations and producing an embedding space favorable for cross-modal retrieval and out-of-distribution robustness-surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x training resolution while using 5x less memory and 1.8x fewer FLOPs. The autoregressive text encoder further overcomes CLIP's fixed context limitation, enabling dense captioning retrieval. Our findings suggest that Mamba exhibits advantageous properties for vision-language learning, making it a compelling alternative to Transformer-based CLIP.

[210] UDPNet: Unleashing Depth-based Priors for Robust Image Dehazing

Zengyuan Zuo,Junjun Jiang,Gang Wu,Xianming Liu

Main category: cs.CV

TL;DR: 本文提出了一种基于深度先验的通用图像去雾框架UDPNet,利用大规模预训练深度估计模型DepthAnything V2提供的深度信息,通过深度引导注意力模块和深度先验融合模块提升现有去雾模型性能,在多个数据集上显著超越现有方法。

Details Motivation: 现有去雾方法多依赖RGB单模态特征,忽视场景深度与雾分布的内在关联;联合优化深度估计与去雾的方法也因未能充分利用精确深度信息而导致性能受限。 Method: 提出UDPNet框架,包含两个核心模块:深度引导注意力模块(DGAM)通过轻量级通道注意力自适应调制特征,深度先验融合模块(DPFM)采用双滑窗多头交叉注意力机制实现多尺度深度图特征的分层融合。 Result: 在SOTS、Haze4K和NHR等主流去雾数据集上取得领先性能,PSNR分别提升0.85 dB、1.19 dB和1.79 dB,尤其在真实场景下表现出更强的鲁棒性和泛化能力。 Conclusion: UDPNet有效整合深度先验信息,实现了跨合成与真实场景的高效去雾,建立了深度感知去雾的新标杆,并具有良好的计算效率和应用潜力。 Abstract: Image dehazing has witnessed significant advancements with the development of deep learning models. However, a few methods predominantly focus on single-modal RGB features, neglecting the inherent correlation between scene depth and haze distribution. Even those that jointly optimize depth estimation and image dehazing often suffer from suboptimal performance due to inadequate utilization of accurate depth information. In this paper, we present UDPNet, a general framework that leverages depth-based priors from large-scale pretrained depth estimation model DepthAnything V2 to boost existing image dehazing models. Specifically, our architecture comprises two typical components: the Depth-Guided Attention Module (DGAM) adaptively modulates features via lightweight depth-guided channel attention, and the Depth Prior Fusion Module (DPFM) enables hierarchical fusion of multi-scale depth map features by dual sliding-window multi-head cross-attention mechanism. These modules ensure both computational efficiency and effective integration of depth priors. Moreover, the intrinsic robustness of depth priors empowers the network to dynamically adapt to varying haze densities, illumination conditions, and domain gaps across synthetic and real-world data. Extensive experimental results demonstrate the effectiveness of our UDPNet, outperforming the state-of-the-art methods on popular dehazing datasets, such as 0.85 dB PSNR improvement on the SOTS dataset, 1.19 dB on the Haze4K dataset and 1.79 dB PSNR on the NHR dataset. Our proposed solution establishes a new benchmark for depth-aware dehazing across various scenarios. Pretrained models and codes will be released at our project https://github.com/Harbinzzy/UDPNet.

[211] RenderFlow: Single-Step Neural Rendering via Flow Matching

Shenghao Zhang,Runtao Liu,Christopher Schroers,Yang Zhang

Main category: cs.CV

TL;DR: 本文提出了一种基于流匹配的端到端、确定性单步神经渲染框架RenderFlow,结合稀疏关键帧引导模块,实现了接近实时的光栅化渲染质量,并兼顾生成模型的效率与传统物理渲染的精度。

Details Motivation: 现有基于扩散模型的神经渲染方法存在迭代过程耗时长、生成结果随机性强导致物理不准确和时序不一致的问题。 Method: 提出RenderFlow框架,采用流匹配范式实现单步确定性渲染,并设计稀疏关键帧引导模块以提升渲染质量和泛化能力;还引入轻量级适配器模块用于逆向渲染任务(如本征分解)。 Result: 该方法显著加速了渲染过程,实现了接近实时的性能,同时提升了物理合理性和视觉质量,在正向渲染和逆向渲染任务中均表现出色。 Conclusion: RenderFlow在保持高渲染质量的同时,有效结合了生成模型的效率与物理渲染的精确性,为神经渲染提供了高效且可扩展的新范式。 Abstract: Conventional physically based rendering (PBR) pipelines generate photorealistic images through computationally intensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic, single-step neural rendering framework, RenderFlow, built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.

[212] Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

Haodong Chen,Qiang Huang,Jiaqi Zhao,Qiuping Jiang,Xiaojun Chang,Jun Yu

Main category: cs.CV

TL;DR: 提出一种仅基于面部的反事实评估范式(face-only counterfactual evaluation)以消除视觉混淆,构建FOCUS数据集和REFLECT基准,评估VLM中的社会偏见。

Details Motivation: 现有方法难以准确归因由人口统计特征(如种族、性别)引发的社会偏见,因真实图像中这些因素常与背景、服装等视觉变量混杂。 Method: 从真实照片出发,仅编辑与种族和性别相关的面部属性生成反事实图像,保持其他视觉因素不变;构建FOCUS数据集(480张图像,覆盖6种职业、10个群体),并提出REFLECT基准,包含三种决策任务:二选一、多选社会经济推断和薪资推荐。 Result: 在5个先进VLM上的实验表明,即使在严格视觉控制下,人口统计差异仍存在,且不同任务形式下偏差表现差异显著。 Conclusion: 需采用受控的反事实审计方法,并强调任务设计在评估多模态模型社会偏见中的关键作用。 Abstract: Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a \textbf{face-only counterfactual evaluation paradigm} that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct \textbf{FOCUS}, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose \textbf{REFLECT}, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.

[213] Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu,Xiaomin Yu,Zhuoyue Chang,Zhe Huang,Shuo Zhang,Heng Lian,Kunyi Wang,Rui Xu,Sen Hu,Jianheng Hou,Hao Peng,Chengwei Qin,Xiaobin Hu,Hong Peng,Ronghao Chen,Huacan Wang

Main category: cs.CV

TL;DR: 本文提出了首个面向视频深度研究的基准VideoDR,用于评估在开放网络环境下结合跨帧视觉线索提取、交互式网页检索和多跳推理的视频问答模型。

Details Motivation: 现实中的视频问答往往需要结合视频内的局部视觉线索与分布于开放网络中的信息,现有模型难以完成这种多步骤、跨模态的复杂推理,因此需要新的基准来推动研究。 Method: 构建了名为VideoDR的新基准,包含高质量人工标注的样本,覆盖六个语义领域,要求模型进行视频条件下的开放域问答,涉及视觉锚点提取、迭代检索和基于多源证据的多跳推理,并在Workflow与Agentic两种范式下评估多种多模态大模型。 Result: 实验表明,Agentic范式并不总是优于Workflow,其性能优势依赖于模型在长检索链中保持初始视频锚点的能力;分析发现目标漂移和长视野一致性是主要瓶颈。 Conclusion: VideoDR为研究开放网络环境下的视频智能体提供了系统性基准,并揭示了下一代视频深度研究智能体面临的核心挑战。 Abstract: In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

[214] SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

Yuhang Su,Mei Wang,Yaoyao Zhong,Guozhang Li,Shixing Li,Yihan Feng,Hua Huang

Main category: cs.CV

TL;DR: SketchJudge是一个新的基准,用于评估多模态大语言模型在评分手绘STEM图表中的表现,揭示了当前模型在处理非结构化和模糊草图时的不足。

Details Motivation: 现有的多模态大语言模型在处理人类生成的草图时存在困难,尤其是在需要诊断手绘图表错误的视觉评分任务中。 Method: 提出了SketchJudge,包含1,015个来自几何、物理、图表和流程图四个领域的手绘学生回答,并评估了先进MLLM的表现。 Result: 实验表明,即使是最先进的MLLM也远落后于人类,在符号性和噪声上下文中暴露出视觉-语言对齐的脆弱性。 Conclusion: SketchJudge有效揭示了当前MLLM在复杂草图理解与错误诊断方面的局限性,为未来研究提供了方向。 Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.

[215] Unified Personalized Understanding, Generating and Editing

Yu Zhong,Tianwei Lin,Ruike Zhu,Yuqian Yuan,Haoyu Zheng,Liang Liang,Wenqiao Zhang,Feifei Shao,Haoyuan Li,Wanggui He,Hao Jiang,Yueting Zhuang

Main category: cs.CV

TL;DR: 本文提出OmniPersona,一种端到端的统一多模态大模型个性化框架,通过结构解耦的概念令牌和显式知识回放机制,实现个性化的理解、生成与图像编辑,有效减少任务干扰并提升一致性。

Details Motivation: 现有统一多模态模型在个性化建模上存在“一刀切”问题,当前个性化方法依赖外部检索或多阶段训练,导致效率低、任务干扰严重,难以实现一致且可控的用户特定概念建模。 Method: 提出OmniPersona框架,引入结构解耦的概念令牌为不同任务分配专用子空间以减少干扰,并设计显式知识回放机制跨任务传播个性化属性知识;同时构建统一架构支持理解、生成与图像编辑。此外提出OmniPBench评测基准,扩展UnifyBench以包含个性化编辑与跨任务评估。 Result: 实验表明OmniPersona在多种个性化任务中表现出竞争性且鲁棒的性能,在理解、生成与编辑任务间保持良好一致性,显著优于现有方法。 Conclusion: OmniPersona首次实现了在统一架构下对多模态模型的端到端个性化,通过结构解耦与知识回放机制有效解决了跨任务干扰问题,为可控、统一的个性化研究提供了强有力基线。 Abstract: Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a ``one-size-fits-all'' paradigm and struggle to model user-specific concepts (e.g., generate a photo of \texttt{}) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present \textbf{OmniPersona}, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose \textbf{\texttt{OmniPBench}}, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

[216] Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Jie Zhu,Yiyang Su,Xiaoming Liu

Main category: cs.CV

TL;DR: 本文研究了多模态大语言模型在细粒度视觉分类(FGVC)任务中使用思维链(CoT)推理时性能下降的问题,发现推理长度增加会导致准确率降低,提出“思考成本”概念,并设计了ReFine-RFT框架通过多奖励优化和限制推理长度来提升性能。

Details Motivation: 尽管CoT在数学和编程等任务中有效,但在视觉感知任务中反而可能损害性能,本文旨在系统分析其原因并提出改进方法。 Method: 通过零样本评估和多种训练范式,分析CoT在FGVC中的影响;提出\alg方法进行多奖励归一化,并构建ReFine-RFT框架结合集成奖励与长度约束。 Result: 实验表明,推理长度是导致性能下降的关键因素;所提方法在多个FGVC基准上达到最先进水平。 Conclusion: CoT在视觉任务中的“思考成本”不可忽视,合理控制推理长度并引入密集反馈可显著提升细粒度分类性能。 Abstract: Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by the reasoning length, in which longer textual reasoning consistently lowers classification accuracy. We term this phenomenon the ``Cost of Thinking''. Building on this finding, we make two key contributions: (1) \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Code and models are available at \href{https://github.com/jiezhu23/ReFine-RFT}{Project Link}.

[217] Spatial Multi-Task Learning for Breast Cancer Molecular Subtype Prediction from Single-Phase DCE-MRI

Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu

Main category: cs.CV

TL;DR: 提出了一种基于单相DCE-MRI的多任务学习框架,用于非侵入性预测乳腺癌分子亚型,结合多尺度空间注意力和感兴趣区域加权模块,显著优于传统方法。

Details Motivation: 准确的分子分型对乳腺癌个体化治疗至关重要,但传统免疫组化依赖有创活检且易受取样偏差影响,而常规DCE-MRI仅使用单相增强图像以减少扫描时间和造影剂剂量,限制了其在分子分型中的应用。 Method: 提出一种空间多任务学习框架,通过深度特征提取网络结合多尺度空间注意力机制捕捉肿瘤内外特征,并引入感兴趣区域加权模块突出肿瘤核心、边缘及周围组织;采用多任务学习共享表征并分支预测ER、PR、HER2状态和Ki-67增殖指数。 Result: 在960例数据上验证,内部测试AUC分别为ER 0.893、PR 0.824、HER2 0.857,Ki-67回归平均绝对误差为8.2%,显著优于放射组学和单任务深度学习基线方法。 Conclusion: 该方法实现了基于临床常规单相DCE-MRI的准确、无创乳腺癌分子亚型预测,具有良好的临床应用前景。 Abstract: Accurate molecular subtype classification is essential for personalized breast cancer treatment, yet conventional immunohistochemical analysis relies on invasive biopsies and is prone to sampling bias. Although dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) enables non-invasive tumor characterization, clinical workflows typically acquire only single-phase post-contrast images to reduce scan time and contrast agent dose. In this study, we propose a spatial multi-task learning framework for breast cancer molecular subtype prediction from clinically practical single-phase DCE-MRI. The framework simultaneously predicts estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) status, and the Ki-67 proliferation index -- biomarkers that collectively define molecular subtypes. The architecture integrates a deep feature extraction network with multi-scale spatial attention to capture intratumoral and peritumoral characteristics, together with a region-of-interest weighting module that emphasizes the tumor core, rim, and surrounding tissue. Multi-task learning exploits biological correlations among biomarkers through shared representations with task-specific prediction branches. Experiments on a dataset of 960 cases (886 internal cases split 7:1:2 for training/validation/testing, and 74 external cases evaluated via five-fold cross-validation) demonstrate that the proposed method achieves an AUC of 0.893, 0.824, and 0.857 for ER, PR, and HER2 classification, respectively, and a mean absolute error of 8.2\% for Ki-67 regression, significantly outperforming radiomics and single-task deep learning baselines. These results indicate the feasibility of accurate, non-invasive molecular subtype prediction using standard imaging protocols.

[218] Adversarial Attacks on Medical Hyperspectral Imaging Exploiting Spectral-Spatial Dependencies and Multiscale Features

Yunrui Gu,Zhenzhe Gao,Cong Kong,Zhaoxia Yin

Main category: cs.CV

TL;DR: 提出了一种针对医学高光谱成像的靶向对抗攻击框架,揭示了模型对局部像素依赖和多尺度信息的脆弱性。

Details Motivation: 深度学习在医学高光谱成像中表现出色,但易受对抗攻击,需探究其脆弱性的根本原因。 Method: 设计了两种攻击方法:局部像素依赖攻击和多尺度信息攻击,分别利用空间相关性和跨层次特征扰动。 Result: 在Brain和MDC数据集上实验表明,所提方法显著降低分类性能,尤其在肿瘤区域,且攻击难以察觉。 Conclusion: 医学高光谱成像模型存在结构相关的安全漏洞,需发展更鲁棒、结构感知的防御机制以保障临床应用安全。 Abstract: Medical hyperspectral imaging (HSI) enables accurate disease diagnosis by capturing rich spectral-spatial tissue information, but recent advances in deep learning have exposed its vulnerability to adversarial attacks. In this work, we identify two fundamental causes of this fragility: the reliance on local pixel dependencies for preserving tissue structure and the dependence on multiscale spectral-spatial representations for hierarchical feature encoding. Building on these insights, we propose a targeted adversarial attack framework for medical HSI, consisting of a Local Pixel Dependency Attack that exploits spatial correlations among neighboring pixels, and a Multiscale Information Attack that perturbs features across hierarchical spectral-spatial scales. Experiments on the Brain and MDC datasets demonstrate that our attacks significantly degrade classification performance, especially in tumor regions, while remaining visually imperceptible. Compared with existing methods, our approach reveals the unique vulnerabilities of medical HSI models and underscores the need for robust, structure-aware defenses in clinical applications.

[219] Billboard in Focus: Estimating Driver Gaze Duration from a Single Image

Carlos Pizarroso,Zuzana Berger Haladová,Zuzana Černeková,Viktor Kocur

Main category: cs.CV

TL;DR: 提出了一种全自动管道,用于检测路侧广告牌并估计驾驶员注视时长,通过YOLO模型和DINOv2特征分类器实现,无需人工标注或眼动追踪设备。

Details Motivation: 评估广告牌对驾驶者注意力的影响,减少因广告牌导致的驾驶分心和事故风险。 Method: 采用两阶段方法:第一阶段使用在Mapillary Vistas上训练并在BillboardLamac上微调的YOLO模型进行广告牌检测;第二阶段利用检测到的边界框位置和DINOv2特征构建分类器来估计驾驶员注视时长。 Result: 广告牌检测达到94% mAP@50,在BillboardLamac数据集上单帧注视时长估计准确率达68.1%,并通过Google Street View图像验证了结果。 Conclusion: 该管道能有效评估广告牌的相关性及其对驾驶员注意力的影响,具有无需依赖眼动设备和人工标注的优势,具备实际应用潜力。 Abstract: Roadside billboards represent a central element of outdoor advertising, yet their presence may contribute to driver distraction and accident risk. This study introduces a fully automated pipeline for billboard detection and driver gaze duration estimation, aiming to evaluate billboard relevance without reliance on manual annotations or eye-tracking devices. Our pipeline operates in two stages: (1) a YOLO-based object detection model trained on Mapillary Vistas and fine-tuned on BillboardLamac images achieved 94% mAP@50 in the billboard detection task (2) a classifier based on the detected bounding box positions and DINOv2 features. The proposed pipeline enables estimation of billboard driver gaze duration from individual frames. We show that our method is able to achieve 68.1% accuracy on BillboardLamac when considering individual frames. These results are further validated using images collected from Google Street View.

[220] Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression

Yuliang Cai,Dongqiangzi Ye,Zitian Chen,Chongruo Wu

Main category: cs.CV

TL;DR: 提出了一种高效的视觉语言模型框架SRC-Pipeline,用于自动驾驶中的视觉问答任务,通过压缩早期帧的token并保留近期帧的完整patch token,在减少66%计算量的同时保持性能,提升实时性。

Details Motivation: 现有视觉语言模型在自动驾驶视觉问答中计算开销大、延迟高,难以满足实时性和安全性需求。 Method: 设计SRC-Pipeline框架,将早期帧的密集patch token压缩为少量高层token,同时保留最近帧的完整token以维持感知精度。 Result: 在自动驾驶视频问答任务上实现了66%的FLOPs降低,同时保持与现有模型相当的性能。 Conclusion: SRC-Pipeline显著提升了VLM在自动驾驶VQA任务中的推理效率,使其更适用于实时、安全关键的应用场景。 Abstract: Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for fast latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose an efficient VLM framework for autonomous driving VQA tasks, SRC-Pipeline. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.

[221] 3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

Peiyuan Jing,Yue Tang,Chun-Wun Cheng,Zhenxuan Zhang,Liutao Yang,Thiago V. Lima,Klaus Strobel,Antoine Leimgruber,Angelica Aviles-Rivero,Guang Yang,Javier Montoya

Main category: cs.CV

TL;DR: 提出了一种基于小波条件的ControlNet(WCC-Net)用于全3D低剂量PET图像去噪,通过引入频率域结构先验提升解剖结构一致性。

Details Motivation: 低剂量PET成像噪声高,影响图像质量与诊断可靠性;现有扩散模型难以在低信噪比下保持解剖结构一致性。 Method: 提出WCC-Net,利用小波表示作为频率域结构先验,通过轻量控制分支注入预训练扩散模型,实现3D PET体积去噪中结构与噪声的解耦。 Result: 在内部1/20剂量测试集上,相比强扩散基线PSNR提升+1.21 dB,SSIM提升+0.008,同时降低GMSD和NMAE;并在未见剂量水平(1/50和1/4)上表现出良好泛化能力。 Conclusion: WCC-Net能有效结合结构先验与扩散生成能力,在全3D低剂量PET去噪中实现更优的图像质量和解剖一致性。 Abstract: Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.

[222] MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

Meng Lu,Yuxing Lu,Yuchen Zhuang,Megan Mullins,Yang Xie,Guanghua Xiao,Charles Fleming,Wenqi Shi,Xuan Wang

Main category: cs.CV

TL;DR: 本文提出了MedVistaGym,一个可扩展的交互式训练环境,用于促进医学图像分析中的工具集成视觉推理,并通过轨迹采样和端到端强化学习训练MedVistaGym-R1,在六个医学VQA基准上显著超越基线模型。

Details Motivation: 现有的视觉语言模型在处理医学图像时缺乏多步推理能力,且依赖静态视觉嵌入和单次推理,无法有效整合工具进行动态分析。 Method: 提出MedVistaGym训练环境,支持VLM动态选择、调用和协调工具,定位图像区域,并融合子图像证据进行交错的多模态推理;采用轨迹采样和端到端强化学习进行代理式训练。 Result: MedVistaGym-R1-8B在六个医学VQA基准上比同类工具增强模型高出19.10%至24.21%。 Conclusion: 结构化的代理训练(而非仅工具访问)是实现有效工具集成医学图像推理的关键。 Abstract: Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.

[223] Few-shot Class-Incremental Learning via Generative Co-Memory Regularization

Kexin Bao,Yong Li,Dan Zeng,Shiming Ge

Main category: cs.CV

TL;DR: 本文提出了一种生成式协同记忆正则化方法,用于解决少样本类增量学习(FSCIL)中的灾难性遗忘和过拟合问题,通过构建表示记忆和权重记忆来联合正则化增量学习过程,显著提升了性能。

Details Motivation: FSCIL面临的关键挑战是在极少量新样本下进行学习时,模型容易对旧类发生灾难性遗忘并对新类过拟合,因此需要增强模型的表示能力和适应能力。 Method: 提出生成式协同记忆正则化方法:首先利用基于掩码自编码器(MAE)和分类器的生成域自适应微调,在基础类别上微调预训练生成编码器;然后构建两类类级别记忆——表示记忆(存储每类均值特征)和权重记忆(存储分类器权重);在增量学习阶段,通过联合优化分类与协同记忆正则化来动态训练分类器,并增量更新记忆。 Result: 在多个主流基准上的大量实验表明,该方法显著优于现有最先进方法,有效缓解了灾难性遗忘和对新类的过拟合,提升了识别准确率。 Conclusion: 所提出的生成式协同记忆正则化方法通过双重记忆机制有效增强了模型在少样本增量学习下的泛化能力,为FSCIL提供了一个高效且鲁棒的解决方案。 Abstract: Few-shot class-incremental learning (FSCIL) aims to incrementally learn models from a small amount of novel data, which requires strong representation and adaptation ability of models learned under few-example supervision to avoid catastrophic forgetting on old classes and overfitting to novel classes. This work proposes a generative co-memory regularization approach to facilitate FSCIL. In the approach, the base learning leverages generative domain adaptation finetuning to finetune a pretrained generative encoder on a few examples of base classes by jointly incorporating a masked autoencoder (MAE) decoder for feature reconstruction and a fully-connected classifier for feature classification, which enables the model to efficiently capture general and adaptable representations. Using the finetuned encoder and learned classifier, we construct two class-wise memories: representation memory for storing the mean features for each class, and weight memory for storing the classifier weights. After that, the memory-regularized incremental learning is performed to train the classifier dynamically on the examples of few-shot classes in each incremental session by simultaneously optimizing feature classification and co-memory regularization. The memories are updated in a class-incremental manner and they collaboratively regularize the incremental learning. In this way, the learned models improve recognition accuracy, while mitigating catastrophic forgetting over old classes and overfitting to novel classes. Extensive experiments on popular benchmarks clearly demonstrate that our approach outperforms the state-of-the-arts.

[224] Motion Focus Recognition in Fast-Moving Egocentric Video

Daniel Hong,James Tribble,Hao Wang,Chaoyi Zhou,Ashish Bastola,Siyu Huang,Abolfazl Razi

Main category: cs.CV

TL;DR: 提出了一种实时的自我中心视频运动焦点识别方法,通过利用基础模型进行相机姿态估计,并引入系统级优化实现高效可扩展的推理。

Details Motivation: 现有自我中心数据集主要关注动作识别,忽视了体育和快速运动场景中运动分析的重要性。 Method: 利用基础模型进行相机姿态估计,结合滑动批处理推理策略进行系统级优化。 Result: 在自建数据集上实现了实时性能和可控的内存消耗。 Conclusion: 该方法使以运动为中心的分析在边缘设备上部署成为可能,为现有自我中心研究提供了补充视角。 Abstract: From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the inherent role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject's locomotion intention from any egocentric video. Our approach leverages the foundation model for camera pose estimation and introduces system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies on sports and fast-movement activities.

[225] Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification

Shu Shen,C. L. Philip Chen,Tong Zhang

Main category: cs.CV

TL;DR: 本文提出了一种新的测试时自适应分层协同去噪网络(TAHCD),用于在低质量多模态数据上实现可靠的表示学习,通过全局与实例级的噪声去除机制及测试时协同增强策略,显著提升了模型的鲁棒性、适应性和泛化能力。

Details Motivation: 现有方法难以有效去除异质多模态噪声,且在面对未见过的噪声时适应性和泛化能力有限,因此需要一种更可靠、更具适应性的多模态去噪方法。 Method: 提出TAHCD,结合自适应稳定子空间对齐和样本自适应置信度对齐,在全局和实例级别联合去除模态特有和跨模态噪声;引入测试时协同增强机制,无监督地根据输入样本噪声自适应更新模型。 Result: 在多个基准实验中,TAHCD在分类性能、鲁棒性和泛化性方面均优于现有的最先进方法。 Conclusion: TAHCD能有效应对多模态异质噪声,具备良好的测试时适应能力和泛化性能,为低质量多模态数据上的可靠学习提供了有效解决方案。 Abstract: Reliable learning on low-quality multimodal data is a widely concerning issue, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leads existing methods to suffer from two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces the Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. They account for noise at both global and instance levels and enable jointly removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively enhancing the joint removal process of modality-specific and cross-modality noise across global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.

[226] DIVER: Dynamic Iterative Visual Evidence Reasoning for Multimodal Fake News Detection

Weilin Zhou,Zonghao Ying,Chunlei Meng,Jiahui Liu,Hengyang Zhou,Quanchen Zou,Deyue Zhang,Dongdong Yang,Xiangzheng Zhang

Main category: cs.CV

TL;DR: 提出DIVER框架,通过动态迭代的视觉证据推理实现多模态假新闻检测,提升性能并减少延迟。

Details Motivation: 现有方法依赖静态融合或大语言模型,存在计算冗余和因视觉基础薄弱导致的幻觉风险。 Method: DIVER首先基于文本分析建立强基线,利用模态内一致性过滤不可靠或幻觉声明;仅在文本证据不足时引入视觉信息,并通过跨模态对齐验证自适应决定是否进行细粒度视觉检查,选择性调用OCR和密集描述等工具,通过不确定性感知融合迭代聚合证据。 Result: 在Weibo、Weibo21和GossipCop数据集上实验表明,DIVER平均超越现有最先进方法2.72%,同时减少4.12秒延迟。 Conclusion: DIVER通过渐进式、证据驱动的推理机制,在保证准确性的同时提升了多模态假新闻检测的效率与可靠性。 Abstract: Multimodal fake news detection is crucial for mitigating adversarial misinformation. Existing methods, relying on static fusion or LLMs, face computational redundancy and hallucination risks due to weak visual foundations. To address this, we propose DIVER (Dynamic Iterative Visual Evidence Reasoning), a framework grounded in a progressive, evidence-driven reasoning paradigm. DIVER first establishes a strong text-based baseline through language analysis, leveraging intra-modal consistency to filter unreliable or hallucinated claims. Only when textual evidence is insufficient does the framework introduce visual information, where inter-modal alignment verification adaptively determines whether deeper visual inspection is necessary. For samples exhibiting significant cross-modal semantic discrepancies, DIVER selectively invokes fine-grained visual tools (e.g., OCR and dense captioning) to extract task-relevant evidence, which is iteratively aggregated via uncertainty-aware fusion to refine multimodal reasoning. Experiments on Weibo, Weibo21, and GossipCop demonstrate that DIVER outperforms state-of-the-art baselines by an average of 2.72\%, while optimizing inference efficiency with a reduced latency of 4.12 s.

[227] ShowUI-Aloha: Human-Taught GUI Agent

Yichun Zhang,Xiangwu Guo,Yauhong Goh,Jessica Hu,Zhiheng Chen,Xin Wang,Difei Gao,Mike Zheng Shou

Main category: cs.CV

TL;DR: 提出ShowUI-Aloha框架,将真实环境中的非结构化屏幕操作记录转化为结构化、可执行的GUI任务,实现从人类行为观察中训练通用GUI智能体。

Details Motivation: 现有GUI自动化面临高质量训练数据稀缺的问题,尤其是人类操作记录通常冗长、无结构且缺乏标注,难以被智能体有效学习。 Method: 构建包含四个组件的管道:记录器捕获屏幕视频和精确交互;学习器将原始交互与视觉上下文转化为自然语言描述;规划器基于上下文推理生成高层动作计划并维护任务状态;执行器在操作系统层面精准执行动作并具备安全检查与反馈机制。 Result: 实现了对真实世界人类操作数据的高效采集与结构化解析,能够生成可操作的任务序列,并支持智能体通过观察人类行为进行学习。 Conclusion: ShowUI-Aloha为构建能从人类演示中学习的通用GUI智能体提供了可扩展的解决方案,推动了基于真实交互数据的自动化代理发展。 Abstract: Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and lack annotations, making them difficult for agents to learn from.To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework includes four key components: A recorder that captures screen video along with precise user interactions like mouse clicks, keystrokes, and scrolls. A learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural language captions. A planner that reads the parsed demonstrations, maintains task states, and dynamically formulates the next high-level action plan based on contextual reasoning. An executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that can learn effectively from simply observing humans.

[228] SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model

Yu Guo,Zhiqiang Lao,Xiyun Song,Yubin Zhou,Heather Yu

Main category: cs.CV

TL;DR: 提出了一种基于物理真实感的合成数据生成框架,并利用大多少模态模型结合LoRA微调方法,显著提升了单图像去反射性能。

Details Motivation: 现有去反射数据集在合成数据的物理真实感或真实图像的数量上存在不足,限制了单图像去反射技术的发展。 Method: 通过路径追踪3D玻璃模型与真实背景图像合成具有多样玻璃属性和相机设置的逼真反射场景;将图像层拼接为复合输入,进行联合标注并使用任务特定的LoRA对大多少模态模型进行微调。 Result: 所提方法在反射去除与分离任务上优于现有最先进方法,表现出更强的泛化能力和去反射效果。 Conclusion: 结合高真实感合成数据与高效微调的大多少模态模型是解决单图像去反射问题的有效途径。 Abstract: Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.

[229] SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

Jeongjun Choi,Yeonsoo Park,H. Jin Kim

Main category: cs.CV

TL;DR: SceneNAT是一种单阶段、非自回归的Transformer模型,通过掩码建模从自然语言指令生成完整的3D室内场景,在语义和空间布局准确性上优于现有方法,且计算成本更低。

Details Motivation: 现有3D场景生成方法多为自回归或扩散模型,推理效率低且难以捕捉对象间复杂关系,需要更高效且结构感知能力强的模型。 Method: 提出SceneNAT,采用掩码非自回归Transformer架构,对语义和空间属性进行离散化表示,并在属性级和实例级应用掩码策略;引入可学习的关系查询和三元组预测器以增强关系推理。 Result: 在3D-FRONT数据集上,SceneNAT在语义合规性和空间排列精度上均优于最先进的自回归和扩散模型基线,同时显著降低计算开销。 Conclusion: SceneNAT通过非自回归并行解码与显式关系建模,实现了高效、准确的文本到3D场景生成,为未来交互式3D内容创作提供了可行方案。 Abstract: We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.

[230] VENUS: Visual Editing with Noise Inversion Using Scene Graphs

Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran

Main category: cs.CV

TL;DR: 本文提出了VENUS,一种无需训练的场景图引导图像编辑框架,通过分离提示条件策略和噪声反演技术,在保持背景保真度的同时提升语义一致性,显著优于现有方法。

Details Motivation: 现有基于文本或场景图的图像编辑方法在背景保留与语义一致性之间难以平衡,且多数需要模型微调,计算成本高、扩展性差。 Method: 提出VENUS框架,采用分割提示条件策略将编辑对象与其背景分离,并结合噪声反演技术保持未编辑区域的保真度;利用多模态大语言模型提取的场景图与扩散模型结合,无需任何训练。 Result: 在PIE-Bench上PSNR从22.45提升至24.80,SSIM从0.79升到0.84,LPIPS从0.100降至0.070,CLIP相似性达24.97;在EditVal上DINO得分为0.87,单张图像处理时间由6-10分钟缩短至20-30秒。 Conclusion: VENUS实现了高效、高质量的图像编辑,无需微调,在背景保留、语义一致性和运行速度方面均显著优于现有方法,适用于多种编辑范式。 Abstract: State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To this end, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while simultaneously leveraging noise inversion to preserve fidelity in unedited regions. Moreover, our proposed approach integrates scene graphs extracted from multimodal large language models with diffusion backbones, without requiring any additional training. Empirically, VENUS substantially improves both background preservation and semantic alignment on PIE-Bench, increasing PSNR from 22.45 to 24.80, SSIM from 0.79 to 0.84, and reducing LPIPS from 0.100 to 0.070 relative to the state-of-the-art scene graph editing model (SGEdit). In addition, VENUS enhances semantic consistency as measured by CLIP similarity (24.97 vs. 24.19). On EditVal, VENUS achieves the highest fidelity with a 0.87 DINO score and, crucially, reduces per-image runtime from 6-10 minutes to only 20-30 seconds. Beyond scene graph-based editing, VENUS also surpasses strong text-based editing baselines such as LEDIT++ and P2P+DirInv, thereby demonstrating consistent improvements across both paradigms.

[231] Language-Grounded Multi-Domain Image Translation via Semantic Difference Guidance

Jongwon Ryu,Joonhyung Park,Jaeho Han,Yeong-Seok Kim,Hye-rin Kim,Sunjae Yoon,Junyeong Kim

Main category: cs.CV

TL;DR: 本文提出了LACE(Language-grounded Attribute Controllable Translation)框架,用于多域图像到图像的翻译,能够将自然语言提示中的语义差异映射为具体的视觉变换,同时保持结构一致性并实现细粒度属性控制。

Details Motivation: 现有方法在多域图像翻译中难以同时保持结构完整性并提供细粒度、属性特定的控制,尤其是在处理多个域时表现不佳。 Method: LACE包含两个核心组件:(1) GLIP-Adapter,融合全局语义与局部结构特征以保持结构一致性;(2) 多域控制引导机制,将源和目标提示之间的语义差异显式映射为每个属性的翻译向量,实现语言语义与视觉变化的对齐。 Result: 在CelebA(Dialog)和BDD100K数据集上的实验表明,LACE在视觉保真度、结构保持性和可解释的领域特定控制方面优于先前方法。 Conclusion: LACE是一种连接语言语义与可控视觉翻译的跨模态内容生成框架,支持组合式多域控制,并允许对每个属性独立调节变换强度。 Abstract: Multi-domain image-to-image translation re quires grounding semantic differences ex pressed in natural language prompts into corresponding visual transformations, while preserving unrelated structural and seman tic content. Existing methods struggle to maintain structural integrity and provide fine grained, attribute-specific control, especially when multiple domains are involved. We propose LACE (Language-grounded Attribute Controllable Translation), built on two compo nents: (1) a GLIP-Adapter that fuses global semantics with local structural features to pre serve consistency, and (2) a Multi-Domain Control Guidance mechanism that explicitly grounds the semantic delta between source and target prompts into per-attribute translation vec tors, aligning linguistic semantics with domain level visual changes. Together, these modules enable compositional multi-domain control with independent strength modulation for each attribute. Experiments on CelebA(Dialog) and BDD100K demonstrate that LACE achieves high visual fidelity, structural preservation, and interpretable domain-specific control, surpass ing prior baselines. This positions LACE as a cross-modal content generation framework bridging language semantics and controllable visual translation.

[232] Universal Adversarial Purification with DDIM Metric Loss for Stable Diffusion

Li Zheng,Liangbin Xie,Jiantao Zhou,He YiMin

Main category: cs.CV

TL;DR: 提出通用扩散对抗净化(UDAP)框架,利用DDIM反演中干净与对抗图像的重建差异,有效去除Stable Diffusion中的对抗噪声,并引入动态迭代调整策略提升效率。

Details Motivation: 现有对抗净化方法主要针对分类任务,无法应对专门针对Stable Diffusion中VAE编码器、UNet去噪器或两者的对抗攻击,缺乏对扩散模型安全性的系统防御方案。 Method: 利用DDIM反演过程中清洁图像和对抗图像在重建行为上的差异,设计基于DDIM度量损失的优化目标,并结合动态调整优化迭代次数的策略,以自适应方式提升净化效率与效果。 Result: UDAP在多种对抗攻击下均表现出强鲁棒性,包括针对VAE的PID、针对UNet的Anti-DreamBooth、混合攻击MIST及其增强版本Anti-DF和MetaCloak,同时在不同SD版本和文本提示下具有良好的泛化能力。 Conclusion: UDAP为Stable Diffusion提供了有效且通用的对抗净化解决方案,能够针对性地抵御多类型对抗攻击,兼具高效性与实用性,具有广泛的实际应用前景。 Abstract: Stable Diffusion (SD) often produces degraded outputs when the training dataset contains adversarial noise. Adversarial purification offers a promising solution by removing adversarial noise from contaminated data. However, existing purification methods are primarily designed for classification tasks and fail to address SD-specific adversarial strategies, such as attacks targeting the VAE encoder, UNet denoiser, or both. To address the gap in SD security, we propose Universal Diffusion Adversarial Purification (UDAP), a novel framework tailored for defending adversarial attacks targeting SD models. UDAP leverages the distinct reconstruction behaviors of clean and adversarial images during Denoising Diffusion Implicit Models (DDIM) inversion to optimize the purification process. By minimizing the DDIM metric loss, UDAP can effectively remove adversarial noise. Additionally, we introduce a dynamic epoch adjustment strategy that adapts optimization iterations based on reconstruction errors, significantly improving efficiency without sacrificing purification quality. Experiments demonstrate UDAP's robustness against diverse adversarial methods, including PID (VAE-targeted), Anti-DreamBooth (UNet-targeted), MIST (hybrid), and robustness-enhanced variants like Anti-Diffusion (Anti-DF) and MetaCloak. UDAP also generalizes well across SD versions and text prompts, showcasing its practical applicability in real-world scenarios.

[233] From Landslide Conditioning Factors to Satellite Embeddings: Evaluating the Utilisation of Google AlphaEarth for Landslide Susceptibility Mapping using Deep Learning

Yusen Cheng,Qinfeng Zhu,Lei Fan

Main category: cs.CV

TL;DR: 本研究评估了Google AlphaEarth(AE)嵌入作为滑坡易发性制图(LSM)替代预测因子的潜力,发现其在多个区域和深度学习模型中均优于传统滑坡致灾因子(LCFs),尤其使用完整的64波段AE表示时效果最佳。

Details Motivation: 传统滑坡易发性制图依赖于滑坡致灾因子(LCFs),但其可用性、异质性和预处理不确定性限制了制图可靠性;因此需要探索更统一且信息丰富的替代预测变量。 Method: 采用两种AE表示(主成分和完整的64波段)与传统LCFs进行对比,在台湾南投、香港和意大利艾米利亚-罗马涅三个地区应用三种深度学习模型(CNN1D、CNN2D和Vision Transformer),并通过ROC-AUC、F1分数、误差统计和空间模式分析评估性能。 Result: 基于AE的模型在所有区域和模型中均优于LCFs,F1分数提高4%–15%,AUC提升0.04–0.11,误差分布更稳定,且易发性图与实际滑坡位置的空间对应性更强;其中64波段AE表现最优,且在南投和艾米利亚的效果优于香港,表明AE与滑坡数据的时间对齐程度影响结果。 Conclusion: AE嵌入具有成为标准化、高信息量滑坡易发性制图替代输入的强大潜力,可减少对传统LCFs的依赖并提升制图可靠性。 Abstract: Data-driven landslide susceptibility mapping (LSM) typically relies on landslide conditioning factors (LCFs), whose availability, heterogeneity, and preprocessing-related uncertainties can constrain mapping reliability. Recently, Google AlphaEarth (AE) embeddings, derived from multi-source geospatial observations, have emerged as a unified representation of Earth surface conditions. This study evaluated the potential of AE embeddings as alternative predictors for LSM. Two AE representations, including retained principal components and the full set of 64 embedding bands, were systematically compared with conventional LCFs across three study areas (Nantou County, Taiwan; Hong Kong; and part of Emilia-Romagna, Italy) using three deep learning models (CNN1D, CNN2D, and Vision Transformer). Performance was assessed using multiple evaluation metrics, ROC-AUC analysis, error statistics, and spatial pattern assessment. Results showed that AE-based models consistently outperformed LCFs across all regions and models, yielding higher F1-scores, AUC values, and more stable error distributions. Such improvement was most pronounced when using the full 64-band AE representation, with F1-score improvements of approximately 4% to 15% and AUC increased ranging from 0.04 to 0.11, depending on the study area and model. AE-based susceptibility maps also exhibited clearer spatial correspondence with observed landslide occurrences and enhanced sensitivity to localised landslide-prone conditions. Performance improvements were more evident in Nantou and Emilia than in Hong Kong, revealing that closer temporal alignment between AE embeddings and landslide inventories may lead to more effective LSM outcomes. These findings highlight the strong potential of AE embeddings as a standardised and information-rich alternative to conventional LCFs for LSM.

[234] PALUM: Part-based Attention Learning for Unified Motion Retargeting

Siqi Liu,Maoyu Wang,Bo Dai,Cewu Lu

Main category: cs.CV

TL;DR: 本文提出了一种名为PALUM的新方法,用于在不同骨骼结构的角色之间进行动作重定向。该方法通过将关节划分为语义身体部位,并利用注意力机制捕捉时空关系,学习跨多样化骨架拓扑的通用运动表示。结合循环一致性机制,PALUM能够在保持动作真实感和语义保真度的同时,有效处理未见过的骨骼-动作组合。

Details Motivation: 在源角色与目标角色具有显著不同的骨骼结构时,传统动作重定向方法难以保持原始动作的语义和质量,因此需要一种能够适应多样骨骼拓扑并保持动作语义一致性的新方法。 Method: PALUM将关节按语义身体部分进行划分,使用注意力机制建模各部分之间的时空依赖关系,从而学习与具体骨骼无关的运动表示;同时结合目标骨架的结构信息进行动作迁移,并引入循环一致性机制以确保重建动作与原始语义一致。 Result: 实验表明,PALUM在处理多种不同骨骼结构的动作重定向任务中表现优越,能有效保持动作的真实性和语义完整性,且对未曾见过的骨架-动作组合具备良好的泛化能力。 Conclusion: PALUM通过语义分区和注意力机制实现了跨不同骨架结构的高质量动作重定向,结合循环一致性训练策略,显著提升了动作语义保持能力和泛化性能,为复杂角色间的动画迁移提供了有效解决方案。 Abstract: Retargeting motion between characters with different skeleton structures is a fundamental challenge in computer animation. When source and target characters have vastly different bone arrangements, maintaining the original motion's semantics and quality becomes increasingly difficult. We present PALUM, a novel approach that learns common motion representations across diverse skeleton topologies by partitioning joints into semantic body parts and applying attention mechanisms to capture spatio-temporal relationships. Our method transfers motion to target skeletons by leveraging these skeleton-agnostic representations alongside target-specific structural information. To ensure robust learning and preserve motion fidelity, we introduce a cycle consistency mechanism that maintains semantic coherence throughout the retargeting process. Extensive experiments demonstrate superior performance in handling diverse skeletal structures while maintaining motion realism and semantic fidelity, even when generalizing to previously unseen skeleton-motion combinations. We will make our implementation publicly available to support future research.

[235] GenDet: Painting Colored Bounding Boxes on Images via Diffusion Model for Object Detection

Chen Min,Chengyang Li,Fanjie Kong,Qi Zhu,Dawei Zhao,Liang Xiao

Main category: cs.CV

TL;DR: GenDet将目标检测重新定义为图像生成任务,基于Stable Diffusion构建条件生成框架,在保留生成模型灵活性的同时实现有竞争力的检测精度。

Details Motivation: 传统目标检测方法多为判别式模型,难以统一生成与理解任务;GenDet旨在通过生成式建模弥合生成模型与判别任务之间的鸿沟。 Method: 提出GenDet框架,以Stable Diffusion为基础,将输入图像作为条件,在潜在空间中通过语义约束生成带类别标注的边界框,实现对位置和类别的精确控制。 Result: 实验表明GenDet在检测精度上可与传统判别式检测器相媲美,同时保持生成模型的灵活性。 Conclusion: GenDet成功将生成模型应用于目标检测,为构建统一的视觉理解系统提供了新思路。 Abstract: This paper presents GenDet, a novel framework that redefines object detection as an image generation task. In contrast to traditional approaches, GenDet adopts a pioneering approach by leveraging generative modeling: it conditions on the input image and directly generates bounding boxes with semantic annotations in the original image space. GenDet establishes a conditional generation architecture built upon the large-scale pre-trained Stable Diffusion model, formulating the detection task as semantic constraints within the latent space. It enables precise control over bounding box positions and category attributes, while preserving the flexibility of the generative model. This novel methodology effectively bridges the gap between generative models and discriminative tasks, providing a fresh perspective for constructing unified visual understanding systems. Systematic experiments demonstrate that GenDet achieves competitive accuracy compared to discriminative detectors, while retaining the flexibility characteristic of generative methods.

[236] Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

Yuanyang Yin,Yufan Deng,Shenghai Yuan,Kaipeng Zhang,Xiao Yang,Feng Zhao

Main category: cs.CV

TL;DR: 本文提出了一种用于图像到视频生成的聚焦引导(Focal Guidance, FG)方法,以解决现有模型在遵循文本指令方面的不足,通过细粒度语义引导和注意力缓存机制提升文本遵循能力,并构建了一个新的评测基准验证其有效性。

Details Motivation: 现有的图像到视频生成模型在保持视觉一致性的同时,未能有效融合文本引导,导致对文本指令的遵循较弱,尤其是在某些语义响应较弱的网络层中存在条件隔离现象。 Method: 提出Focal Guidance(FG),包含两个机制:1)细粒度语义引导(FSG),利用CLIP检测参考帧中的关键区域并作为锚点引导语义弱层;2)注意力缓存,将语义响应强的层的注意力图传递给语义弱层,注入显式语义信号。 Result: 在Wan2.1-I2V上总分提升至0.7250(+3.97%),MMDiT-based HunyuanVideo-I2V提升至0.5571(+7.44%),验证了方法的有效性和泛化性。 Conclusion: Focal Guidance有效缓解了条件隔离问题,增强了语义弱层对文本指令的响应能力,显著提升了图像到视频生成模型的文本遵循性能。 Abstract: The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97\%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44\%).

[237] VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Jiapeng Shi,Junke Wang,Zuyao You,Bo He,Zuxuan Wu

Main category: cs.CV

TL;DR: 本文提出了VideoLoom,一个统一的视频大语言模型,用于联合空间-时间理解,并构建了LoomData-8.7k数据集和LoomBench评测基准,在多个任务上取得先进性能。

Details Motivation: 为了提升视频大模型在细粒度时空定位方面的能力,需要更高质量、兼具时间标注和空间定位的数据集与评测方法。 Method: 构建了一个包含8.7k样本的人类中心视频数据集LoomData-8.7k,提出VideoLoom模型进行联合空间-时间建模,并设计了综合评测基准LoomBench。 Result: VideoLoom在多个基准上表现优异,如ReVOS上63.1 J&F,Charades-STA上48.3 R1@0.7;LoomBench支持对视频大模型的全面评估。 Conclusion: 所提出的模型、数据集和评测基准共同推动了视频大模型在联合空间-时间理解上的发展,建立了多模态智能的新标准。 Abstract: This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

[238] A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

Qi Zheng,Shuliang Liu,Yu Huang,Sihang Jia,Jungang Li,Lyuhao Chen,Junhao Chen,Hanqian Li,Aiwei Liu,Yibo Yan,Xuming Hu

Main category: cs.CV

TL;DR: 本文提出了一种名为VISA-Mark的新型视觉语义自适应水印框架,通过动态视觉证据权重引导词汇表分区和logits扰动,在保持视觉保真度的同时实现高效、鲁棒的内容溯源。

Details Motivation: 现有LVLM水印方法存在视觉无关标记破坏视觉定位或因拒绝采样导致推理延迟的问题,亟需一种兼顾视觉保真与检测性能的解决方案。 Method: 采用轻量级前缀调谐器提取视觉证据权重,基于该权重自适应划分词汇表并对logits进行扰动,将水印强度集中在视觉支持的标记上。 Result: 在Chair-I指标上视觉一致性提升7.8%,检测准确率达96.88% AUC,抗攻击能力达99.3%,且不牺牲推理效率。 Conclusion: VISA-Mark通过视觉对齐的自适应水印机制,在保证生成质量的同时实现了高性能检测与鲁棒性,为多模态水印设定了新标准。 Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (Chair-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multimodal watermarking.

[239] Inference-Time Scaling for Visual AutoRegressive modeling by Searching Representative Samples

Weidong Tang,Xinyan Wan,Siyu Li,Xiumei Wang

Main category: cs.CV

TL;DR: VAR-Scaling是首个用于向量量化视觉自回归模型(VAR)的推理时扩展框架,通过核密度估计将离散采样空间映射到准连续特征空间,并提出密度自适应混合采样策略,显著提升生成质量。

Details Motivation: 推理时扩展在大模型中提升了生成质量,但在基于离散潜在空间的VAR模型中尚未探索,主要挑战在于无法进行连续路径搜索。 Method: 引入VAR-Scaling框架,利用核密度估计(KDE)将离散采样空间转化为准连续特征空间,提出Top-k和Random-k结合的密度自适应混合采样策略,以平衡生成质量与多样性。 Result: 在类条件生成和文本到图像任务中,VAR-Scaling显著提升了生成样本的质量与推理效率,验证了其在离散潜在空间中实现有效推理时扩展的可行性。 Conclusion: VAR-Scaling成功解决了VQ模型中离散空间对推理时扩展的限制,为自回归视觉生成模型提供了新的优化路径。 Abstract: While inference-time scaling has significantly enhanced generative quality in large language and diffusion models, its application to vector-quantized (VQ) visual autoregressive modeling (VAR) remains unexplored. We introduce VAR-Scaling, the first general framework for inference-time scaling in VAR, addressing the critical challenge of discrete latent spaces that prohibit continuous path search. We find that VAR scales exhibit two distinct pattern types: general patterns and specific patterns, where later-stage specific patterns conditionally optimize early-stage general patterns. To overcome the discrete latent space barrier in VQ models, we map sampling spaces to quasi-continuous feature spaces via kernel density estimation (KDE), where high-density samples approximate stable, high-quality solutions. This transformation enables effective navigation of sampling distributions. We propose a density-adaptive hybrid sampling strategy: Top-k sampling focuses on high-density regions to preserve quality near distribution modes, while Random-k sampling explores low-density areas to maintain diversity and prevent premature convergence. Consequently, VAR-Scaling optimizes sample fidelity at critical scales to enhance output quality. Experiments in class-conditional and text-to-image evaluations demonstrate significant improvements in inference process. The code is available at https://github.com/WD7ang/VAR-Scaling.

[240] Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

Jianghao Yin,Qingbin Li,Kun Sun,Cheng Ding,Jie Wang,Qin Chen,Jie Zhou,Nan Wang,Changqing Li,Pei Wu,Jian Xu,Zheming Yang,Liang He

Main category: cs.CV

TL;DR: 本文提出了一种受人类认知启发的元动作框架CINEMA,用于解决多图像推理中信息分散和图像间关系复杂的问题,通过分解为五个元动作并结合检索增强与强化学习策略,在多图像、视频和单图像任务上均取得优异性能。

Details Motivation: 现有MLLM在单图理解上表现良好,但在多图推理中表现下降,主要由于图像间关系复杂且关键信息分散,缺乏类似人类逐步推理的认知机制。 Method: 提出CINEMA框架,将多图推理分解为Global、Focus、Hint、Think、Answer五个元动作;采用基于检索的树采样生成高质量冷启动轨迹,并在强化学习中使用多样性保持策略和DAPO两阶段训练。 Result: 构建了包含57k冷启动和58k强化学习样本的数据集,在MUIR和MVMath基准上超越GPT-4o,并在视频理解任务上优于专用视频模型。 Conclusion: CINEMA通过模拟人类认知过程有效提升了多模态大模型在多图像及视频推理中的性能,展现出良好的通用性和优越性。 Abstract: While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer which explicitly modeling the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.

[241] Revisiting the Ordering of Channel and Spatial Attention: A Comprehensive Study on Sequential and Parallel Designs

Zhongming Liu,Bingbing Jiang

Main category: cs.CV

TL;DR: 本文系统比较了通道注意力和空间注意力的融合策略,提出了一个包含18种拓扑结构的统一评估框架,并发现了数据规模与方法性能之间的耦合规律,为不同场景下的注意力模块设计提供了指导原则。

Details Motivation: 现有研究在通道注意力与空间注意力融合策略上的选择缺乏系统性分析和统一原则,导致设计决策多依赖经验。因此需要一个统一框架来系统评估不同融合方式在不同数据规模下的表现。 Method: 构建了一个涵盖顺序、并行、多尺度和残差四类共18种拓扑结构的统一评估框架,在两个视觉和九个医学数据集上进行实验,分析不同结构的性能表现。 Result: 发现了“数据规模-方法-性能”的耦合规律:小样本任务中‘通道-多尺度空间’级联结构最优;中等规模任务中可学习的并行融合结构表现最好;大规模任务中带动态门控的并行结构最佳。此外,‘空间-通道’顺序对细粒度分类更稳定有效,残差连接有助于缓解梯度消失问题。 Conclusion: 应根据具体应用场景(如数据规模、任务类型)选择合适的注意力融合结构,并据此提出未来注意力模块设计的指南。 Abstract: Attention mechanisms have become a core component of deep learning models, with Channel Attention and Spatial Attention being the two most representative architectures. Current research on their fusion strategies primarily bifurcates into sequential and parallel paradigms, yet the selection process remains largely empirical, lacking systematic analysis and unified principles. We systematically compare channel-spatial attention combinations under a unified framework, building an evaluation suite of 18 topologies across four classes: sequential, parallel, multi-scale, and residual. Across two vision and nine medical datasets, we uncover a "data scale-method-performance" coupling law: (1) in few-shot tasks, the "Channel-Multi-scale Spatial" cascaded structure achieves optimal performance; (2) in medium-scale tasks, parallel learnable fusion architectures demonstrate superior results; (3) in large-scale tasks, parallel structures with dynamic gating yield the best performance. Additionally, experiments indicate that the "Spatial-Channel" order is more stable and effective for fine-grained classification, while residual connections mitigate vanishing gradient problems across varying data scales. We thus propose scenario-based guidelines for building future attention modules. Code is open-sourced at https://github.com/DWlzm.

[242] OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

Tessa Pulli,Jean-Baptiste Weibel,Peter Hönig,Matthias Hirschmanner,Markus Vincze,Andreas Holzinger

Main category: cs.CV

TL;DR: 本文提出了一种无需训练的开放集CAD检索方法OSCAR,通过语言提示和单张图像从无标签3D数据库中检索匹配的对象模型,用于6D姿态估计。

Details Motivation: 现有的零样本物体姿态估计器依赖CAD模型,但实际部署中难以获取且对象集合不断变化,导致难以准确识别目标实例模型。 Method: OSCAR在预处理阶段生成数据库模型的多视角渲染并使用图像描述模型进行标注;推理时利用GroundedSAM检测输入图像中的目标物体,并计算感兴趣区域与数据库文本描述的多模态嵌入,结合CLIP进行文本过滤和DINOv2进行图像细化的两阶段检索。 Result: 实验表明,OSCAR在MI3DOR跨域3D模型检索基准上优于所有现有方法,在YCB-V数据集上实现90.48%的平均精度,并可有效支持MegaPose完成姿态估计,效果优于基于重建的方法。 Conclusion: OSCAR能够高效地从无标签3D数据库中检索最相似物体模型,适用于开放场景下的6D物体姿态估计,具备良好的实用性和扩展性。 Abstract: 6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such context, modern zero-shot object pose estimators were developed to not require object-specific training but only rely on CAD models. Such models are hard to obtain once deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce an Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation if the exact instance is not available and show that OSCAR achieves an average precision of 90.48\% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be utilized for pose estimation using Megapose achieving better results than a reconstruction-based approach.

[243] Reconstruction Guided Few-shot Network For Remote Sensing Image Classification

Mohit Jaiswal,Naman Jain,Shivani Pathak,Mainak Singha,Nikunja Bihari Kar,Ankit Jha,Biplab Banerjee

Main category: cs.CV

TL;DR: 提出了一种用于遥感图像少样本分类的重建引导网络RGFS-Net,通过引入掩码图像重建任务提升特征学习和分类性能。

Details Motivation: 遥感图像少样本分类面临标注样本少和地物类型差异大的挑战,现有方法在泛化性和特征表达上存在不足。 Method: 提出RGFS-Net,结合掩码图像重建辅助任务,通过遮挡部分输入并重建,促进语义丰富的特征学习,增强空间理解与类别区分能力。 Result: 在EuroSAT和PatternNet数据集的1-shot和5-shot实验中,方法 consistently 优于现有基线。 Conclusion: 该方法简单有效,兼容标准骨干网络,为遥感图像少样本分类提供了鲁棒解决方案。 Abstract: Few-shot remote sensing image classification is challenging due to limited labeled samples and high variability in land-cover types. We propose a reconstruction-guided few-shot network (RGFS-Net) that enhances generalization to unseen classes while preserving consistency for seen categories. Our method incorporates a masked image reconstruction task, where parts of the input are occluded and reconstructed to encourage semantically rich feature learning. This auxiliary task strengthens spatial understanding and improves class discrimination under low-data settings. We evaluated the efficacy of EuroSAT and PatternNet datasets under 1-shot and 5-shot protocols, our approach consistently outperforms existing baselines. The proposed method is simple, effective, and compatible with standard backbones, offering a robust solution for few-shot remote sensing classification. Codes are available at https://github.com/stark0908/RGFS.

[244] PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis

Jiao Xu,Junwei Liu,Jiangwei Lao,Qi Zhu,Yunpeng Zhao,Congyun Jin,Shinan Liu,Zhihong Lu,Lihe Zhang,Xin Chen,Jian Wang,Ping Wang

Main category: cs.CV

TL;DR: PulseMind是一个新的多模态诊断模型家族,包含MediScope数据集、PulseMind基准和基于比较的强化策略优化(CRPO)训练框架,旨在提升真实临床诊断中的多轮交互式医疗AI性能。

Details Motivation: 现有医疗多模态模型局限于单一影像分析,无法满足真实临床诊疗中多轮、多源异构信息整合与上下文理解的需求,因此需要构建更贴近实际临床场景的多模态诊断模型。 Method: 提出PulseMind模型家族:1)构建包含9.8万次真实多轮问诊和60万张医学图像的MediScope数据集;2)设计涵盖主动性、准确性、有用性和语言质量的四维多轮诊断评估基准PulseMind Benchmark;3)开发基于相对偏好信号的CRPO训练框架,利用多维度比较提供稳定且符合人类偏好的训练指导。 Result: 实验表明,PulseMind在自建的多轮诊断基准和多个公开医学基准上均取得了具有竞争力的表现,验证了其在复杂临床交互环境下的有效性。 Conclusion: PulseMind通过系统性数据构建、综合评估基准和对齐人类偏好的训练方法,推动了多模态医疗模型向真实临床应用场景的落地迈进。 Abstract: Recent advances in medical multi-modal models focus on specialized image analysis like dermatology, pathology, or radiology. However, they do not fully capture the complexity of real-world clinical diagnostics, which involve heterogeneous inputs and require ongoing contextual understanding during patient-physician interactions. To bridge this gap, we introduce PulseMind, a new family of multi-modal diagnostic models that integrates a systematically curated dataset, a comprehensive evaluation benchmark, and a tailored training framework. Specifically, we first construct a diagnostic dataset, MediScope, which comprises 98,000 real-world multi-turn consultations and 601,500 medical images, spanning over 10 major clinical departments and more than 200 sub-specialties. Then, to better reflect the requirements of real-world clinical diagnosis, we develop the PulseMind Benchmark, a multi-turn diagnostic consultation benchmark with a four-dimensional evaluation protocol comprising proactiveness, accuracy, usefulness, and language quality. Finally, we design a training framework tailored for multi-modal clinical diagnostics, centered around a core component named Comparison-based Reinforcement Policy Optimization (CRPO). Compared to absolute score rewards, CRPO uses relative preference signals from multi-dimensional com-parisons to provide stable and human-aligned training guidance. Extensive experiments demonstrate that PulseMind achieves competitive performance on both the diagnostic consultation benchmark and public medical benchmarks.

[245] Seeing Right but Saying Wrong: Inter- and Intra-Layer Refinement in MLLMs without Training

Shezheng Song,Shasha Li,Jie Yu

Main category: cs.CV

TL;DR: 本文提出了一种名为DualPD的双视角解码优化策略,用于提升多模态大语言模型(MLLMs)在视觉-语言任务中的表现,通过无需训练的方法增强视觉理解能力。

Details Motivation: 尽管MLLMs在深层能关注到正确的视觉区域,但早期层的噪声注意力常导致最终预测错误,出现‘看对说错’的问题。 Method: DualPD包含两个模块:(1) 层间注意力引导的对比logits模块,通过比较注意力变化最大的层之间的输出logits来捕捉正确答案信念的演化;(2) 头级信息过滤模块,抑制关注无关区域的低贡献注意力头,提升每层的注意力质量。 Result: 在LLaVA和Qwen-VL模型族上的多个多模态基准测试中,DualPD在无需训练的情况下一致提升了准确率。 Conclusion: DualPD有效缓解了MLLMs中注意力与输出不一致的问题,具有良好的通用性和实用性。 Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a variety of vision-language tasks. However, their internal reasoning often exhibits a critical inconsistency: although deeper layers may attend to the correct visual regions, final predictions are frequently misled by noisy attention from earlier layers. This results in a disconnect between what the model internally understands and what it ultimately expresses, a phenomenon we describe as seeing it right but saying it wrong. To address this issue, we propose DualPD, a dual-perspective decoding refinement strategy that enhances the visual understanding without any additional training. DualPD consists of two components. (1) The layer-wise attention-guided contrastive logits module captures how the belief in the correct answer evolves by comparing output logits between layers that exhibit the largest attention shift. (2) The head-wise information filtering module suppresses low-contribution attention heads that focus on irrelevant regions, thereby improving attention quality within each layer. Experiments conducted on both the LLaVA and Qwen-VL model families across multiple multimodal benchmarks demonstrate that DualPD consistently improves accuracy without training, confirming its effectiveness and generalizability. The code will be released upon publication.

[246] HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

Haoxuan Li,Mengyan Li,Junjun Zheng

Main category: cs.CV

TL;DR: 本文提出了E-HVC数据集和HiVid-Narrator框架,用于生成电商视频的层次化叙述,通过双粒度标注和模态压缩提升叙事质量与效率。

Details Motivation: 现有方法难以同时捕捉电商视频中的细粒度视觉细节并生成连贯的高层次叙述,且视频节奏快、信息密度高,导致模型处理困难。 Method: 构建具有时间链思维和章节摘要的E-HVC数据集,采用分阶段生成方式,并提出SPA-Compressor模型,利用ASR语义线索压缩多模态token,实现高效训练与高质量叙述生成。 Result: HiVid-Narrator在减少输入token的同时,显著提升了生成叙述的时间对齐性和事实准确性,优于现有方法。 Conclusion: 通过层次化建模和模态压缩,可有效解决电商视频叙述生成中的效率与质量矛盾,为实际应用提供可行方案。 Abstract: Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and Chapter Summary that compose them into concise, story-centric summaries. Rather than directly prompting chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens compared to existing methods.

[247] Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation

Jiao Xu,Xin Chen,Lihe Zhang

Main category: cs.CV

TL;DR: 提出了一种用于半监督3D血管分割的动态协同网络DiCo,通过动态交换教师-学生角色、多视角融合和对抗监督提升性能,在三个基准上达到最先进水平。

Details Motivation: 传统均值教师方法中教师与学生角色固定,但在复杂3D血管数据下教师模型未必始终优于学生,导致认知偏差限制性能。 Method: 设计动态协同网络DiCo,允许教师与学生模型动态互换角色;引入多视图融合模块以捕捉输入的多角度信息;采用对抗监督约束无标签数据中分割血管的形状,并将3D体数据投影到2D视图以缓解标签不一致问题。 Result: 在三个3D血管分割基准上实现了新的最先进性能。 Conclusion: DiCo通过动态协作机制和多视角分析有效提升了半监督3D血管分割的表现,具有较强的实用性和推广潜力。 Abstract: In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. Conventional mean teacher (MT) methods typically employ a static approach, where the roles of the teacher and student models are fixed. However, due to the complexity of 3D vessel data, the teacher model may not always outperform the student model, leading to cognitive biases that can limit performance. To address this issue, we propose a dynamic collaborative network that allows the two models to dynamically switch their teacher-student roles. Additionally, we introduce a multi-view integration module to capture various perspectives of the inputs, mirroring the way doctors conduct medical analysis. We also incorporate adversarial supervision to constrain the shape of the segmented vessels in unlabeled data. In this process, the 3D volume is projected into 2D views to mitigate the impact of label inconsistencies. Experiments demonstrate that our DiCo method sets new state-of-the-art performance on three 3D vessel segmentation benchmarks. The code repository address is https://github.com/xujiaommcome/DiCo

[248] Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Efficient Diffusion Transformers

Guantao Chen,Shikang Zheng,Yuqi Lin,Linfeng Zhang

Main category: cs.CV

TL;DR: 提出SVD-Cache,一种基于子空间感知的特征缓存框架,通过SVD分解分离主成分和残差子空间,实现高效、近无损的扩散Transformer推理加速。

Details Motivation: 现有特征缓存方法对所有特征组件一视同仁,未能利用DiT特征空间中不同子空间的时间演化差异,导致预测效率与精度受限。 Method: 通过奇异值分解(SVD)将扩散特征分解为主子空间和残差子空间:对主导的低秩成分采用指数移动平均(EMA)进行预测,对残差子空间直接重用。 Result: 在FLUX和HunyuanVideo等模型上实现5.55倍加速,且保持生成质量近似无损,并兼容蒸馏、量化和稀疏注意力等加速技术。 Conclusion: SVD-Cache通过挖掘DiT特征空间的结构特性,实现了高效、通用且兼容性强的推理加速方法,为扩散Transformer的实际应用提供了有效解决方案。 Abstract: Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless across diverse models and methods, including 5.55$\times$ speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is in supplementary material and will be released on Github.

[249] SDHSI-Net: Learning Better Representations for Hyperspectral Images via Self-Distillation

Prachet Dev Singh,Shyamsundar Paramasivam,Sneha Barman,Mainak Singha,Ankit Jha,Girish Mishra,Biplab Banerjee

Main category: cs.CV

TL;DR: 本文提出了一种将自蒸馏(Self-distillation, SD)应用于高光谱图像(HSI)分类的方法,通过利用模型自身早期输出作为软目标,提升特征空间的类内紧凑性和类间可分性,在两个基准数据集上显著提高了分类精度和鲁棒性。

Details Motivation: 高光谱图像分类面临高维光谱特征和标注数据有限的挑战,传统深度学习模型易过拟合并计算成本高,因此需要一种无需外部教师网络即可提升性能的方法。 Method: 采用自蒸馏策略,将网络中间层的输出作为软目标监督最终预测,强制中间与最终预测之间的一致性,从而优化特征学习过程。 Result: 在两个基准HSI数据集上的实验表明,所提方法显著提升了分类准确率和模型鲁棒性,且无需额外教师网络。 Conclusion: 自蒸馏是一种有效且实用的策略,适用于高光谱图像的谱-空联合学习,为小样本条件下的HSI分类提供了新思路。 Abstract: Hyperspectral image (HSI) classification presents unique challenges due to its high spectral dimensionality and limited labeled data. Traditional deep learning models often suffer from overfitting and high computational costs. Self-distillation (SD), a variant of knowledge distillation where a network learns from its own predictions, has recently emerged as a promising strategy to enhance model performance without requiring external teacher networks. In this work, we explore the application of SD to HSI by treating earlier outputs as soft targets, thereby enforcing consistency between intermediate and final predictions. This process improves intra-class compactness and inter-class separability in the learned feature space. Our approach is validated on two benchmark HSI datasets and demonstrates significant improvements in classification accuracy and robustness, highlighting the effectiveness of SD for spectral-spatial learning. Codes are available at https://github.com/Prachet-Dev-Singh/SDHSI.

[250] PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion

Mahdi Chamseddine,Didier Stricker,Jason Rambach

Main category: cs.CV

TL;DR: PanoSAMic提出了一种针对全景图像的语义分割模型,通过改进预训练的SAM编码器并引入多模态融合模块,在多个数据集上实现了最先进的性能。

Details Motivation: 现有图像基础模型主要在透视图像上训练,未针对球形图像进行优化,导致在处理全景图像时存在局限性。 Method: 改进预训练的Segment Anything (SAM) 编码器以输出多阶段特征,设计新的时空模态融合模块来自适应选择不同区域的最佳模态和特征,并采用具有球面注意力和双视图融合的语义解码器以应对全景图像的畸变和边缘不连续问题。 Result: 在Stanford2D3DS数据集的RGB、RGB-D和RGB-D-N模态以及Matterport3D数据集的RGB和RGB-D模态上均达到最先进(SotA)水平。 Conclusion: PanoSAMic有效提升了全景图像的语义分割性能,展示了结合预训练模型与多模态融合策略在球形图像理解中的潜力。 Abstract: Existing image foundation models are not optimized for spherical images having been trained primarily on perspective images. PanoSAMic integrates the pre-trained Segment Anything (SAM) encoder to make use of its extensive training and integrate it into a semantic segmentation model for panoramic images using multiple modalities. We modify the SAM encoder to output multi-stage features and introduce a novel spatio-modal fusion module that allows the model to select the relevant modalities and best features from each modality for different areas of the input. Furthermore, our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state-of-the-art (SotA) results on Stanford2D3DS for RGB, RGB-D, and RGB-D-N modalities and on Matterport3D for RGB and RGB-D modalities. https://github.com/dfki-av/PanoSAMic

[251] Improving Video Question Answering through query-based frame selection

Himanshu Patil,Geo Jolly,Ramana Raja Buddala,Ganesh Ramakrishnan,Rohit Saluja

Main category: cs.CV

TL;DR: 提出基于子模互信息(SMI)的查询式帧选择方法,提升视频问答(VideoQA)中关键帧选取的相关性和互补性,相较均匀采样精度最高提升4%。

Details Motivation: 现有视频问答模型多采用均匀帧采样,难以捕捉与问题相关的关键帧和上下文信息,导致语义缺失。 Method: 引入基于子模互信息(SMI)函数的查询相关帧选择机制,替代传统的均匀采样,选择与问题语义更相关的帧。 Result: 在MVBench数据集上使用Video-LLaVA和LLaVA-NeXT模型验证,查询式采样相比均匀采样最高提升4%准确率,定性分析显示所选帧更贴合问题内容。 Conclusion: 基于查询的帧选择能有效提升视频问答性能,具有广泛适用性,尤其适用于仅依赖部分关键帧的任务。 Abstract: Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames by uniformly sampling the video. However, this process does not pick important frames or capture the context of the video. We present a novel query-based selection of frames relevant to the questions based on the submodular mutual Information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, namely Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to \textbf{4\%} was observed when using query-based frame selection over uniform sampling. Qualitative analysis further highlights that query-based selection, using SMI functions, consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.

[252] From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution

Shikang Zheng,Guantao Chen,Lixuan He,Jiacheng Liu,Yuqi Lin,Chang Zou,Linfeng Zhang

Main category: cs.CV

TL;DR: Fresco是一种动态分辨率框架,通过渐进式上采样在保持生成质量的同时显著加速扩散Transformer的采样过程,实现了跨阶段的一致性,并与其他加速技术正交,可组合达到更高加速比。

Details Motivation: 现有的动态分辨率采样方法在分辨率切换时使用启发式重加噪,破坏了跨阶段一致性,并因全图盲目上采样导致累积误差和伪影。 Method: 提出Fresco框架,统一各阶段的重加噪与全局结构,采用渐进式上采样策略,仅对未收敛区域进行高分辨率细化,确保所有阶段共同指向同一最终目标。 Result: 在FLUX上实现10倍加速,在HunyuanVideo上实现5倍加速,结合蒸馏模型可达22倍加速,且保持接近无损的生成质量。 Conclusion: Fresco有效解决了现有动态分辨率方法中的不一致性和冗余计算问题,为扩散Transformer提供了高效、通用且可组合的加速方案。 Abstract: Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling. Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors, and visible artifacts. Therefore, we propose \textbf{Fresco}, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10$\times$ speedup on FLUX, and 5$\times$ on HunyuanVideo, while remaining orthogonal to distillation, quantization and feature caching, reaching 22$\times$ speedup when combined with distilled models. Our code is in supplementary material and will be released on Github.

[253] FocalOrder: Focal Preference Optimization for Reading Order Detection

Fuyuan Liu,Dianyu Yu,He Ren,Nayu Liu,Xiaomian Kang,Delai Qiu,Fa Zhang,Genpeng Zhen,Shengping Liu,Jiaen Liang,Wei Huang,Yining Wang,Junnan Zhu

Main category: cs.CV

TL;DR: 本文提出FocalOrder框架,通过焦点偏好优化(FPO)解决文档阅读顺序检测中的位置差异问题,即模型在中间复杂区域性能下降的问题。

Details Motivation: 现有方法假设文档各区域难度分布均匀,忽视了实际中存在位置差异现象,导致模型难以学习复杂中间区域的阅读顺序。 Method: 提出FocalOrder框架,采用基于指数移动平均机制的自适应难度发现,动态识别难学习的转换,并引入难度校准的成对排序目标以保持全局逻辑一致性。 Result: 在OmniDocBench v1.0和Comp-HRDoc上实现了新的SOTA结果,小模型表现优于专用基线和大规模通用视觉语言模型。 Conclusion: 根据文档内在结构模糊性调整优化过程对掌握复杂文档结构至关重要。 Abstract: Reading order detection is the foundation of document understanding. Most existing methods rely on uniform supervision, implicitly assuming a constant difficulty distribution across layout regions. In this work, we challenge this assumption by revealing a critical flaw: \textbf{Positional Disparity}, a phenomenon where models demonstrate mastery over the deterministic start and end regions but suffer a performance collapse in the complex intermediate sections. This degradation arises because standard training allows the massive volume of easy patterns to drown out the learning signals from difficult layouts. To address this, we propose \textbf{FocalOrder}, a framework driven by \textbf{Focal Preference Optimization (FPO)}. Specifically, FocalOrder employs adaptive difficulty discovery with exponential moving average mechanism to dynamically pinpoint hard-to-learn transitions, while introducing a difficulty-calibrated pairwise ranking objective to enforce global logical consistency. Extensive experiments demonstrate that FocalOrder establishes new state-of-the-art results on OmniDocBench v1.0 and Comp-HRDoc. Our compact model not only outperforms competitive specialized baselines but also significantly surpasses large-scale general VLMs. These results demonstrate that aligning the optimization with intrinsic structural ambiguity of documents is critical for mastering complex document structures.

[254] Anatomy Aware Cascade Network: Bridging Epistemic Uncertainty and Geometric Manifold for 3D Tooth Segmentation

Bing Yu,Liu Shi,Haitao Wang,Deran Qi,Xiang Cai,Wei Zhong,Qiegen Liu

Main category: cs.CV

TL;DR: 提出了一种名为AACNet的级联网络,用于从锥束CT(CBCT)图像中实现高精度的三维牙齿分割,通过引入边界优化和解剖注意力机制,在低对比度和边界模糊的情况下显著提升了分割性能。

Details Motivation: 由于CBCT图像中存在低对比度和牙弓间边界不清晰的问题,导致牙齿分割中的粘连伪影难以处理,影响数字牙科工作流的准确性。 Method: 提出了Anatomy Aware Cascade Network(AACNet),采用由粗到精的框架,包含两个关键模块:基于熵的门控机制的Ambiguity Gated Boundary Refiner(AGBR)用于修正高不确定性区域的特征;Signed Distance Map guided Anatomical Attention(SDMAA)利用符号距离图引入几何约束,保持拓扑一致性。 Result: 在125个CBCT数据上实验显示,AACNet达到90.17%的Dice系数和3.63mm的HD95指标,显著优于现有方法,并在外部队列上表现出良好的泛化能力(HD95为2.19mm)。 Conclusion: AACNet能有效解决CBCT图像中因低对比度和边界模糊导致的牙齿分割难题,具备良好的临床应用潜力,适用于手术规划等下游任务。 Abstract: Accurate three-dimensional (3D) tooth segmentation from Cone-Beam Computed Tomography (CBCT) is a prerequisite for digital dental workflows. However, achieving high-fidelity segmentation remains challenging due to adhesion artifacts in naturally occluded scans, which are caused by low contrast and indistinct inter-arch boundaries. To address these limitations, we propose the Anatomy Aware Cascade Network (AACNet), a coarse-to-fine framework designed to resolve boundary ambiguity while maintaining global structural consistency. Specifically, we introduce two mechanisms: the Ambiguity Gated Boundary Refiner (AGBR) and the Signed Distance Map guided Anatomical Attention (SDMAA). The AGBR employs an entropy based gating mechanism to perform targeted feature rectification in high uncertainty transition zones. Meanwhile, the SDMAA integrates implicit geometric constraints via signed distance map to enforce topological consistency, preventing the loss of spatial details associated with standard pooling. Experimental results on a dataset of 125 CBCT volumes demonstrate that AACNet achieves a Dice Similarity Coefficient of 90.17 \% and a 95\% Hausdorff Distance of 3.63 mm, significantly outperforming state-of-the-art methods. Furthermore, the model exhibits strong generalization on an external dataset with an HD95 of 2.19 mm, validating its reliability for downstream clinical applications such as surgical planning. Code for AACNet is available at https://github.com/shiliu0114/AACNet.

[255] Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization

Fangyu Lin,Yingdong Hu,Zhening Liu,Yufan Zhuang,Zehong Lin,Jun Zhang

Main category: cs.CV

TL;DR: Mon3tr提出了一种基于单目RGB相机的沉浸式全息远程呈现框架,首次将3D高斯点阵化(3DGS)参数化人体建模引入该领域,通过离线重建与在线轻量推理相结合,实现低带宽、低延迟、高保真的实时全身全息通信。

Details Motivation: 现有沉浸式远程呈现系统依赖多摄像头和高带宽体积流,难以在移动设备上实现实时性能,限制了广泛应用。 Method: Mon3tr采用分阶段策略:离线阶段利用多视角数据重建用户专属的3DGS参数化人像;在线阶段通过单目相机实时捕捉动作与表情,驱动模型,并以<0.2 Mbps的极低带宽传输特征,在接收端使用轻量级3DGS属性形变网络动态修正并渲染人像。 Result: 系统在Meta Quest 3等设备上实现约60 FPS的渲染速度,端到端延迟约80 ms,新姿态下PSNR > 28 dB,相比点云流带宽降低超1000倍。 Conclusion: Mon3tr实现了高效、低成本、高质量的单目3D远程呈现,显著降低了硬件与网络要求,推动了AR/VR中远程协作的实用化发展。 Abstract: Immersive telepresence aims to transform human interaction in AR/VR applications by enabling lifelike full-body holographic representations for enhanced remote collaboration. However, existing systems rely on hardware-intensive multi-camera setups and demand high bandwidth for volumetric streaming, limiting their real-time performance on mobile devices. To overcome these challenges, we propose Mon3tr, a novel Monocular 3D telepresence framework that integrates 3D Gaussian splatting (3DGS) based parametric human modeling into telepresence for the first time. Mon3tr adopts an amortized computation strategy, dividing the process into a one-time offline multi-view reconstruction phase to build a user-specific avatar and a monocular online inference phase during live telepresence sessions. A single monocular RGB camera is used to capture body motions and facial expressions in real time to drive the 3DGS-based parametric human model, significantly reducing system complexity and cost. The extracted motion and appearance features are transmitted at < 0.2 Mbps over WebRTC's data channel, allowing robust adaptation to network fluctuations. On the receiver side, e.g., Meta Quest 3, we develop a lightweight 3DGS attribute deformation network to dynamically generate corrective 3DGS attribute adjustments on the pre-built avatar, synthesizing photorealistic motion and appearance at ~ 60 FPS. Extensive experiments demonstrate the state-of-the-art performance of our method, achieving a PSNR of > 28 dB for novel poses, an end-to-end latency of ~ 80 ms, and > 1000x bandwidth reduction compared to point-cloud streaming, while supporting real-time operation from monocular inputs across diverse scenarios. Our demos can be found at https://mon3tr3d.github.io.

[256] ViewMorpher3D: A 3D-aware Diffusion Framework for Multi-Camera Novel View Synthesis in Autonomous Driving

Farhad G. Zanjani,Hong Cai,Amirhossein Habibian

Main category: cs.CV

TL;DR: 提出ViewMorpher3D,一种基于图像扩散模型的多视图图像增强框架,用于提升自动驾驶场景中的渲染图像真实感和多视角一致性。

Details Motivation: 现有3D重建技术(如高斯泼溅)在稀疏观测或外推视角下易产生伪影,影响自动驾驶仿真器的感知与决策性能。 Method: 利用图像扩散模型,结合相机位姿、3D几何先验及时间/空间相关的参考视图,联合优化多视角渲染结果,增强细节并抑制伪影。 Result: 在真实驾驶数据集上显著提升图像质量指标,有效减少伪影并保持几何保真度。 Conclusion: ViewMorpher3D能有效提升多视图渲染质量与跨视角一致性,适用于多种传感器配置,增强了自动驾驶闭环仿真的真实性与可靠性。 Abstract: Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.

[257] BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation

Ahmad AlMughrabi,Guillermo Rivo,Carlos Jiménez-Farfán,Umair Haroon,Farid Al-Areqi,Hyunjun Jung,Benjamin Busam,Ricardo Marques,Petia Radeva

Main category: cs.CV

TL;DR: 本文提出了BenchSeg,一个用于食品图像分割的新多视角视频数据集和基准测试,包含55个菜肴场景和25,284帧精细标注数据,支持360°自由摄像机运动。作者评估了20种最先进的分割模型,并引入视频记忆模块提升跨视角泛化能力,其中SeTR-MLA+XMem2组合表现最佳,比现有方法(如FoodMem)mAP提升约2.63%。研究推动了饮食分析中的食品分割与跟踪发展,数据集和模型已公开。

Details Motivation: 现有的食品图像分割方法受限于多视角数据的缺乏,且在新视角下泛化性能差,难以准确估计食物体积和营养成分。 Method: 构建了一个新的多视角食品视频分割数据集BenchSeg,包含55个菜肴场景和25,284帧标注图像,覆盖360°自由拍摄视角;在此基础上评估了20种主流分割模型(包括SAM、Transformer、CNN和多模态大模型),并结合视频记忆模块(如XMem2)增强时序一致性。 Result: 实验表明,传统图像分割模型在新视角下性能显著下降,而引入记忆机制的模型能更好保持跨帧一致性;最佳模型SeTR-MLA+XMem2在BenchSeg上比FoodMem等先前方法mAP提升约2.63%。 Conclusion: BenchSeg为多视角食品分割提供了有价值的基准,验证了记忆增强架构在视频级食品分割中的有效性,推动了膳食分析中食物体积与营养估算的研究进展。 Abstract: Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables & Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset and evaluate them (alone and combined with video-memory modules) on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best model based on a combination of SeTR-MLA+XMem2 outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. We release BenchSeg to foster future research. The project page including the dataset annotations and the food segmentation models can be found at https://amughrabi.github.io/benchseg.

[258] Robust Multicentre Detection and Classification of Colorectal Liver Metastases on CT: Application of Foundation Models

Shruti Atul Mali,Zohaib Salahuddin,Yumeng Zhang,Andre Aichert,Xian Zhong,Henry C. Woodruff,Maciej Bobowicz,Katrine Riklund,Juozas Kupčinskas,Lorenzo Faggioni,Roberto Francischello,Razvan L Miclea,Philippe Lambin

Main category: cs.CV

TL;DR: 本研究开发了一种基于基础模型的人工智能管道,用于在增强CT上进行结直肠癌肝转移(CRLM)的患者级分类和病灶级检测,结合不确定性量化与可解释性,表现出良好的跨中心泛化能力和临床应用潜力。

Details Motivation: CRLM是癌症相关死亡的主要原因,但在多中心环境下CT检测仍具挑战性,现有方法在泛化性和可解释性方面存在不足。 Method: 基于EuCanImage和TCIA的CT数据,采用表现最佳的预训练模型UMedPT,结合MLP分类头和FCOS检测头进行微调,并集成不确定性量化与Grad-CAM可解释性分析。 Result: 分类模型在联合测试集上AUC达0.90,外部验证集敏感度为0.85;排除20%最不确定样本后AUC提升至0.91;检测模型整体检出率为69.1%,且随病灶增大显著提升;Grad-CAM可定位高置信度病灶区域。 Conclusion: 基于基础模型的AI管道可有效支持多中心环境下CRLM的稳健、可解释检测与分类,具有良好的临床转化前景。 Abstract: Colorectal liver metastases (CRLM) are a major cause of cancer-related mortality, and reliable detection on CT remains challenging in multi-centre settings. We developed a foundation model-based AI pipeline for patient-level classification and lesion-level detection of CRLM on contrast-enhanced CT, integrating uncertainty quantification and explainability. CT data from the EuCanImage consortium (n=2437) and an external TCIA cohort (n=197) were used. Among several pretrained models, UMedPT achieved the best performance and was fine-tuned with an MLP head for classification and an FCOS-based head for lesion detection. The classification model achieved an AUC of 0.90 and a sensitivity of 0.82 on the combined test set, with a sensitivity of 0.85 on the external cohort. Excluding the most uncertain 20 percent of cases improved AUC to 0.91 and balanced accuracy to 0.86. Decision curve analysis showed clinical benefit for threshold probabilities between 0.30 and 0.40. The detection model identified 69.1 percent of lesions overall, increasing from 30 percent to 98 percent across lesion size quartiles. Grad-CAM highlighted lesion-corresponding regions in high-confidence cases. These results demonstrate that foundation model-based pipelines can support robust and interpretable CRLM detection and classification across heterogeneous CT data.

[259] Diffusion in SPAD Signals

Lior Dvir,Nadav Torem,Yoav Y. Schechner

Main category: cs.CV

TL;DR: 本文推导了在单光子雪崩二极管(SPAD)中给定固定光子通量下的原始信号似然性,并基于信号的时序特性提出了用于解决逆问题的评分函数,重点利用扩散模型表达图像先验。

Details Motivation: 由于SPAD中的检测事件时间与光子通量之间存在非线性且天然随机的关系,传统方法难以有效建模,因此需要建立准确的似然模型和评分函数以支持逆问题求解。 Method: 推导SPAD原始信号的似然函数及其对应的评分函数,并结合扩散模型来表达图像先验,从而处理低光子计数和高光子计数情况下的信号恢复问题。 Result: 成功构建了基于SPAD信号时序信息的评分函数,验证了不同光子计数水平下该方法的效果,表明利用检测事件的时间信息可提升逆问题求解性能。 Conclusion: 所提出的评分函数为基于SPAD信号的逆问题提供了关键工具,结合扩散模型能够有效利用事件时序信息,在不同光照条件下实现更优的成像重建。 Abstract: We derive the likelihood of a raw signal in a single photon avalanche diode (SPAD), given a fixed photon flux. The raw signal comprises timing of detection events, which are nonlinearly related to the flux. Moreover, they are naturally stochastic. We then derive a score function of the signal. This is a key for solving inverse problems based on SPAD signals. We focus on deriving solutions involving a diffusion model, to express image priors. We demonstrate the effect of low or high photon counts, and the consequence of exploiting timing of detection events.

[260] UIKA: Fast Universal Head Avatar from Pose-Free Images

Zijian Wu,Boyao Zhou,Liangxiao Hu,Hongyu Liu,Yuan Sun,Xuan Wang,Xun Cao,Yujun Shen,Hao Zhu

Main category: cs.CV

TL;DR: 提出了一种名为UIKA的前馈可动画高斯头部模型,能够从任意数量的未定姿态输入(包括单张图像、多视角捕捉和智能手机视频)生成头像。

Details Motivation: 传统方法需要专业设备和长时间优化,限制了其应用。本文旨在通过改进模型表示、网络设计和数据准备,实现更高效、通用的头像建模方法。 Method: 引入UV引导的头像建模策略,通过像素级面部对应估计将屏幕空间像素颜色重投影到与姿态和表情无关的UV空间;设计可学习的UV标记,并在屏幕和UV空间应用注意力机制;使用聚合的UV信息解码出标准高斯属性;构建大规模合成数据集进行训练。 Result: 在单目和多视图设置下,该方法显著优于现有方法。 Conclusion: UIKA通过创新的UV引导建模和注意力机制,在无需复杂设备和长时间优化的情况下,实现了高质量、可动画的高斯头部模型生成,具有广泛的应用前景。 Abstract: We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings. Project page: https://zijian-wu.github.io/uika-page/

[261] PARL: Position-Aware Relation Learning Network for Document Layout Analysis

Fuyuan Liu,Dianyu Yu,He Ren,Nayu Liu,Xiaomian Kang,Delai Qiu,Fa Zhang,Genpeng Zhen,Shengping Liu,Jiaen Liang,Wei Huang,Yining Wang,Junnan Zhu

Main category: cs.CV

TL;DR: 本文提出了一种无需OCR的纯视觉文档布局分析方法PARL,通过建模位置敏感性和关系结构,在多个数据集上实现了最先进的性能,且参数更少、效率更高。

Details Motivation: 现有方法依赖高质量OCR,导致文本识别错误传播和计算开销大,影响鲁棒性和实用性。作者主张有效的布局分析应基于对文档内在视觉结构的理解,而非文本-视觉融合。 Method: 提出PARL(Position-Aware Relation Learning Network),包括双向空间位置引导的可变形注意力模块,将布局元素间的位置依赖显式嵌入视觉特征;并设计图优化分类器(GRC),通过动态构建的布局图建模上下文关系以优化预测。 Result: PARL在DocLayNet上为纯视觉方法建立了新基准,并在M6Doc上超越了强大的多模态模型,同时模型参数仅为65M,约为大型多模态模型(256M)的四分之一。 Conclusion: 复杂的视觉结构建模比多模态融合更高效、更鲁棒,PARL证明了无需OCR的纯视觉方法在文档布局分析中的优越性。 Abstract: Document layout analysis aims to detect and categorize structural elements (e.g., titles, tables, figures) in scanned or digital documents. Popular methods often rely on high-quality Optical Character Recognition (OCR) to merge visual features with extracted text. This dependency introduces two major drawbacks: propagation of text recognition errors and substantial computational overhead, limiting the robustness and practical applicability of multimodal approaches. In contrast to the prevailing multimodal trend, we argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. To this end, we propose PARL (Position-Aware Relation Learning Network), a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Specifically, we first introduce a Bidirectional Spatial Position-Guided Deformable Attention module to embed explicit positional dependencies among layout elements directly into visual features. Second, we design a Graph Refinement Classifier (GRC) to refine predictions by modeling contextual relationships through a dynamically constructed layout graph. Extensive experiments show PARL achieves state-of-the-art results. It establishes a new benchmark for vision-only methods on DocLayNet and, notably, surpasses even strong multimodal models on M6Doc. Crucially, PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models (256M), demonstrating that sophisticated visual structure modeling can be both more efficient and robust than multimodal fusion.

[262] GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

Zhankai Ye,Bofan Li,Yukai Jin,Shuoqiu Li,Wei Wang,Yanfu Zhang,Shangqian Gao,Xin Liu

Main category: cs.CV

TL;DR: 本文提出了一种新的框架,通过在运动码本和大语言模型嵌入空间中显式施加正交性,实现运动与语义的几何对齐,从而提升大语言模型对运动的细粒度推理能力。

Details Motivation: 现有方法将运动量化与语义嵌入学习解耦,仅通过令牌ID关联,导致运动空间的内在几何结构与嵌入空间不一致,限制了大语言模型对运动的精细理解与推理能力。 Method: 采用基于Gumbel-Softmax的解码器-only量化器进行可微训练,并设计稀疏投影将运动码映射到大语言模型嵌入空间,同时保持正交性;通过两阶段正交正则化策略,在 tokenizer 训练和大语言模型微调中维持几何对齐。 Result: 在HumanML3D数据集上实验表明,该方法比当前最优方法性能提升20%。 Conclusion: 统一的几何基础能有效增强大语言模型对运动的细粒度推理能力,显式维护运动码本与嵌入空间的正交对齐是关键。 Abstract: Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.

[263] StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation

Yuze He,Yanning Zhou,Wang Zhao,Jingwen Ye,Zhongkai Wu,Ran Yi,Yong-Jin Liu

Main category: cs.CV

TL;DR: StdGEN++ 是一种生成高保真、语义分解的3D角色的新系统,通过双分支语义感知模型和视频扩散纹理分解实现结构化输出,支持工业级应用。

Details Motivation: 现有3D生成方法通常产生缺乏结构灵活性的单一体网格,难以满足游戏和动画工业流程的需求。 Method: 提出 Dual-Branch S-LRM 模型联合重建几何、颜色和组件语义;采用混合隐式场的语义表面提取方法,并结合由粗到精的采样策略;引入基于视频扩散的纹理分解模块以分离面部语义层。 Result: 在几何精度和语义解耦方面显著优于现有方法,支持高分辨率网格生成并降低内存占用。 Conclusion: StdGEN++ 能生成结构独立的高质量3D角色,支持非破坏性编辑、物理合规动画和视线追踪,适用于自动化角色资产生产。 Abstract: We present StdGEN++, a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs. Existing 3D generative methods often produce monolithic meshes that lack the structural flexibility required by industrial pipelines in gaming and animation. Addressing this gap, StdGEN++ is built upon a Dual-branch Semantic-aware Large Reconstruction Model (Dual-Branch S-LRM), which jointly reconstructs geometry, color, and per-component semantics in a feed-forward manner. To achieve production-level fidelity, we introduce a novel semantic surface extraction formalism compatible with hybrid implicit fields. This mechanism is accelerated by a coarse-to-fine proposal scheme, which significantly reduces memory footprint and enables high-resolution mesh generation. Furthermore, we propose a video-diffusion-based texture decomposition module that disentangles appearance into editable layers (e.g., separated iris and skin), resolving semantic confusion in facial regions. Experiments demonstrate that StdGEN++ achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement. Crucially, the resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking, making it a robust solution for automated character asset production.

[264] Variational Contrastive Learning for Skeleton-based Action Recognition

Dang Dinh Nguyen,Decky Aspandi Latif,Titus Zaharia

Main category: cs.CV

TL;DR: 提出了一种变分对比学习框架,将概率潜在建模与对比自监督学习结合,用于骨架动作识别,提升了表示的泛化性和语义意义。

Details Motivation: 现有对比学习方法在骨架动作识别中难以捕捉人类运动的变异性与不确定性。 Method: 提出变分对比学习框架,融合概率潜在变量模型与对比学习,学习结构化且语义丰富的表征。 Result: 在三个主流数据集上实验表明,该方法在少标签场景下显著优于现有方法,特征更关注关键骨骼关节且更具运动相关性。 Conclusion: 所提方法能有效提升自监督骨架动作表征的鲁棒性与泛化能力,尤其适用于低标注数据场景。 Abstract: In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most of contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that the features provided by our method are more relevant given the motion and sample characteristics, with more focus on important skeleton joints, when compared to the other methods.

[265] Advancing Multinational License Plate Recognition Through Synthetic and Real Data Fusion: A Comprehensive Evaluation

Rayson Laroca,Valter Estevam,Gladston J. P. Moreira,Rodrigo Minetto,David Menotti

Main category: cs.CV

TL;DR: 本文研究了真实与合成数据结合对车牌识别性能的提升,通过16种OCR模型在12个公开数据集上的实验,发现合成数据显著提高识别效果,尤其是多种生成方法联合使用时具有协同增益,并能在数据有限时保持高性能,同时分析了模型在准确率与速度间的权衡。

Details Motivation: 现有基于合成图像改进车牌识别的研究仍存在局限,缺乏对不同类型合成数据融合策略及其在跨数据集场景下性能的系统评估。 Method: 采用模板生成、字符置换和生成对抗网络(GAN)三种方式生成合成数据,结合真实数据训练16种OCR模型,在12个公共数据集上进行基准测试,评估其在跨数据集和同数据集场景下的表现。 Result: 大量引入合成数据显著提升了模型在同数据集和跨数据集场景下的性能;三种合成方法结合使用表现出协同效应,性能超越现有最先进方法和商业系统;即使仅使用少量真实训练数据,合成数据也能维持高水平识别效果;实验还识别出在准确率与速度之间表现最优的模型。 Conclusion: 合成数据是提升车牌识别性能的关键,尤其在数据稀缺场景下具有重要价值;多种合成方法的融合策略可实现更鲁棒和高效的端到端识别,为未来研究和应用提供了有效路径。 Abstract: Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in each intra-dataset and cross-dataset settings.

[266] Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation

Nicolas Sereyjol-Garros,Ellington Kirby,Victor Besnier,Nermin Samet

Main category: cs.CV

TL;DR: R3DPA是首个利用图像预训练先验进行LiDAR点云生成的方法,通过自监督3D表示和知识迁移实现最先进的LiDAR场景合成性能。

Details Motivation: 由于3D LiDAR数据稀缺,而机器人任务(如自动驾驶)需要大量数据,现有生成方法受限于数据规模,难以达到理想效果。 Method: 提出R3DPA方法:(i) 将生成模型的中间特征与自监督3D特征对齐;(ii) 从大规模图像预训练生成模型迁移知识到LiDAR生成;(iii) 在推理时实现无条件模型下的点云控制,如物体修复和场景混合。 Result: 在KITTI-360基准上达到最先进性能,显著提升生成质量并支持灵活的点云编辑。 Conclusion: R3DPA有效解决了LiDAR数据稀缺问题,通过融合图像预训练先验和自监督3D表示,实现了高质量、可控的LiDAR场景生成。 Abstract: LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark R3DPA achieves state of the art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.

[267] Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

Siwen Jiao,Tianxiong Lv,Kangan Qian,Chenxu Zhao,Xiuyuan Zhu,Tianlun Li,Xiaolong Cheng,Jinyu Li,Zhihao Liao,Yang Cai

Main category: cs.CV

TL;DR: 提出了一种新的平滑数值奖励激活(SNRA)算子和绝对保持的GRPO(AP-GRPO)框架,以提升视觉语言模型在3D场景理解中的精确数值预测能力。

Details Motivation: 传统基于相对排序的强化学习方法在利用3D物理约束提供的可验证信号方面存在奖励稀疏和梯度不稳定的问题,导致数据利用率低下。 Method: 引入SNRA算子,使用动态参数化的Sigmoid函数将原始反馈转换为密集、连续的奖励流;同时,AP-GRPO整合了绝对标量梯度来减少传统相对排序机制中的数值信息丢失。 Result: 构建了包含5万项可验证3D子任务的数据集Numerical3D-50k,实验结果显示AP-GRPO在不修改模型架构的情况下,达到了与大规模监督方法相当的性能,同时具有更高的数据效率。 Conclusion: AP-GRPO框架有效激活了视觉语言模型中潜在的3D推理能力,解决了现有强化学习方法在3D场景理解中的关键瓶颈。 Abstract: Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.

[268] Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

Jakob Paul Zimmermann,Georg Loho

Main category: cs.CV

TL;DR: 本文提出利用单调性提升神经网络可解释性的两种新方法:通过分解ReLU网络为两个单调凸部分来改进显著性检测(SplitCAM和SplitLRP),并在ImageNet-S上取得领先性能;以及将模型训练为两个单调网络之差以实现强自解释性。

Details Motivation: 尽管单调性有助于提升神经网络的可解释性,但并非所有函数都能被单调网络良好逼近,因此需要在不完全依赖单调逼近的前提下仍能利用单调性来增强解释性。 Method: 1) 改进ReLU网络分解为两个单调凸部分的方法,克服权重爆炸问题;2) 提出SplitCAM和SplitLRP两种基于该分解的显著性方法;3) 设计一种将模型训练为两个单调网络之差的架构以实现自解释性。 Result: SplitCAM和SplitLRP在VGG16和ResNet18上的ImageNet-S数据集上,在所有Quantus显著性指标类别中均优于现有方法;训练为两个单调网络之差的模型展现出强自解释性。 Conclusion: 即使不能全局单调逼近,仍可通过结构化利用单调性(如网络分解或差分结构)显著提升神经网络的可解释性与显著性映射性能。 Abstract: It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network. We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods -- SplitCAM and SplitLRP -- improve on state of the art results on both VGG16 and Resnet18 networks on ImageNet-S across all Quantus saliency metric categories. Second, we exhibit that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.

[269] FMAC: a Fair Fiducial Marker Accuracy Comparison Software

Guillaume J. Laurent,Patrick Sandoz

Main category: cs.CV

TL;DR: 提出了一种基于高保真合成图像的基准方法,用于公平比较基于标志物的姿态估计精度,并通过开源代码实现了物理真实的渲染和评估。

Details Motivation: 为了在6自由度空间中对基于标志物的姿态估计精度进行公平且深入的比较,需要高保真、可控制的测试环境。 Method: 采用低差异采样策略生成覆盖6自由度的合成图像,使用基于物理的光线追踪算法渲染图像,并结合相机标定参数、畸变、散焦和衍射模糊等真实成像效应;通过36组自由度与姿态误差的相关性分析进行评估。 Result: 成功构建了高保真合成图像数据集,揭示了不同常见标志物在姿态估计中的优缺点,并验证了渲染算法的真实性。 Conclusion: 该方法为姿态估计提供了可重复、高精度的评估基准,开源工具促进了未来研究的公平比较。 Abstract: This paper presents a method for carrying fair comparisons of the accuracy of pose estimation using fiducial markers. These comparisons rely on large sets of high-fidelity synthetic images enabling deep exploration of the 6 degrees of freedom. A low-discrepancy sampling of the space allows to check the correlations between each degree of freedom and the pose errors by plotting the 36 pairs of combinations. The images are rendered using a physically based ray tracing code that has been specifically developed to use the standard calibration coefficients of any camera directly. The software reproduces image distortions, defocus and diffraction blur. Furthermore, sub-pixel sampling is applied to sharp edges to enhance the fidelity of the rendered image. After introducing the rendering algorithm and its experimental validation, the paper proposes a method for evaluating the pose accuracy. This method is applied to well-known markers, revealing their strengths and weaknesses for pose estimation. The code is open source and available on GitHub.

[270] Evaluating the encoding competence of visual language models using uncommon actions

Chen Ling,Nai Ding

Main category: cs.CV

TL;DR: 本文提出了UAIT数据集,用于评估视觉语言模型在非常识动作场景中的语义理解能力,揭示了现有模型在语义推理上的不足,并展示了定向微调的改进潜力。

Details Motivation: 现有数据集多关注常见视觉场景,难以检验模型对非常识但语法合理的图像-文本对的深层语义理解能力,因此需要构建更具挑战性的基准来诊断VLM的推理缺陷。 Method: 通过大语言模型、少样本提示和文生图技术,设计半自动化流程构建包含非常识动作的高质量图像-文本样本,并为每个样本设计多项选择题以测试细粒度推理能力。 Result: 实验表明,当前最先进的视觉语言模型在UAIT上的表现远低于人类,尤其在区分语法正确性与语义合理性方面存在显著缺陷;但轻量级模型经微调后准确率可提升,显示定向适应的有效性。 Conclusion: UAIT为评估VLM的深层语义理解提供了有效工具,揭示了模型在代理-受事关系和物理可行性判断上的弱点,指明了未来构建更具鲁棒性和真实推理能力模型的发展方向。 Abstract: We propose UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-common sense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even the lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.

[271] On the application of the Wasserstein metric to 2D curves classification

Agnieszka Kaliszewska,Monika Syga

Main category: cs.CV

TL;DR: 提出了一种基于Wasserstein距离变体的方法,通过离散概率测度强调2D曲线特定片段的重要性,用于考古数据中的2D曲线聚类分析。

Details Motivation: 希望在分类2D曲线时能够重点关注某些指定的片段(如考古曲线中的关键部分),传统方法无法灵活处理局部重要性。 Method: 引入多种Wasserstein距离的变体,结合反映曲线片段重要性的离散概率测度,以增强对局部特征的关注。 Result: 在来自考古学领域的2D曲线数据上进行了聚类实验,验证了该方法在聚焦特定片段上的有效性。 Conclusion: 所提出的Wasserstein距离变体能有效提升对2D曲线局部片段的关注,在实际考古数据的聚类任务中表现出良好性能。 Abstract: In this work we analyse a number of variants of the Wasserstein distance which allow to focus the classification on the prescribed parts (fragments) of classified 2D curves. These variants are based on the use of a number of discrete probability measures which reflect the importance of given fragments of curves. The performance of this approach is tested through a series of experiments related to the clustering analysis of 2D curves performed on data coming from the field of archaeology.

[272] Video Evidence to Reasoning Efficient Video Understanding via Explicit Evidence Grounding

Yanxiang Huang,Guohua Gao,Zhaoyang Wei,Jianyuan Ni

Main category: cs.CV

TL;DR: 本文提出了一种名为“证据链”(CoE)的新框架,旨在解决大型视觉-语言模型在视频推理中面临的计算成本高与幻觉风险之间的两难问题。该框架通过解耦感知定位与推理效率,并引入轻量级的证据定位模块和基于强化学习优化的证据锚定协议,显著提升了视频理解的准确性和可靠性。

Details Motivation: 大型视觉-语言模型在视频推理时面临计算开销大与易产生幻觉之间的矛盾,需要一种既能高效推理又能有效接地的方法。 Method: 提出CoE框架,包含两个核心组件:(1) 查询引导的轻量级证据定位模块(EGM),用于提取高保真度的紧凑视觉证据;(2) 基于强化学习优化的证据锚定协议,通过复合奖励机制强制模型在推理过程中引用已识别的时间锚点。此外构建了大规模双标注数据集CoE-Instruct用于训练与评估。 Result: 在五个基准测试(如Video-MME、MVBench、VSI-Bench)上实验表明,CoE增强的模型显著优于现有方法,在准确性上达到新的SOTA水平。 Conclusion: CoE是一种强大且实用的视频理解范式,能够有效平衡推理效率与感知接地,减少幻觉,提升模型可靠性。 Abstract: Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.

[273] Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training

Lingchen Sun,Rongyuan Wu,Zhengqiang Zhang,Ruibin Li,Yujing Sun,Shuaizheng Liu,Lei Zhang

Main category: cs.CV

TL;DR: 本文提出了一种名为Self-Transcendence的方法,通过仅使用DiT模型内部特征监督来加速其训练收敛,无需依赖外部预训练模型。

Details Motivation: 现有的引导扩散模型方法(如REPA)依赖于外部语义特征(如DINO),引入了额外的依赖并降低了灵活性。本文旨在探索DiT是否可以自我引导训练,以提高灵活性和适用性。 Method: 首先在初期阶段将DiT浅层特征与预训练VAE的潜在表示对齐(约40个epoch),然后对中间特征应用无分类器引导,增强其判别能力和语义表达能力。这些丰富的内部特征被用作监督信号来指导新的DiT训练过程。 Result: 该方法显著提升了生成质量和收敛速度,在没有外部预训练模型的情况下甚至超过了REPA的表现。同时,相比其他自包含方法具有更好的性能。 Conclusion: Self-Transcendence是一种简单而有效的方法,能够利用DiT自身的内部特征实现快速收敛,无需外部模型依赖,具有更高的灵活性和广泛的应用潜力。 Abstract: Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide the training of themselves, and propose \textbf{Self-Transcendence}, a simple yet effective method that achieves fast convergence using internal feature supervision only. It is found that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide a new DiT training. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at https://github.com/csslc/Self-Transcendence.

[274] Vision-Language Model for Accurate Crater Detection

Patrick Bauer,Marius Schwinning,Florian Renk,Andreas Weinmann,Hichem Snoussi

Main category: cs.CV

TL;DR: 本文提出了一种基于OWLv2视觉Transformer模型的深度学习方法,用于在复杂月球成像条件下实现可靠的陨石坑检测,结合低秩适应微调策略和CIoU与对比损失的组合损失函数,在IMPACT数据集上取得了94.0%的最大召回率和73.1%的最大精度。

Details Motivation: 由于月球表面存在大量不同尺寸和形状的陨石坑,且成像条件复杂(如光照变化和崎岖地形),安全着陆面临挑战,因此需要高可靠性的自动陨石坑检测算法。 Method: 采用基于Vision Transformer的OWLv2模型,使用IMPACT项目的手动标注高分辨率月球影像数据进行微调,引入低秩适应(LoRA)实现参数高效微调,并优化包含CIoU定位损失和对比分类损失的组合损失函数。 Result: 在测试集上实现了最高94.0%的召回率和73.1%的精确率,视觉效果良好,能够在复杂成像条件下稳定检测陨石坑。 Conclusion: 该方法显著提升了月球图像中陨石坑检测的可靠性,为未来月球探测任务中的安全着陆和陨石坑分析提供了有力支持。 Abstract: The European Space Agency (ESA), driven by its ambitions on planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast amount of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer, that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset fom the IMPACT project, that provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.

[275] Exchange Is All You Need for Remote Sensing Change Detection

Sijun Dong,Siming Fu,Kaiyu Li,Xiangyong Cao,Xiaoliang Meng,Bo Du

Main category: cs.CV

TL;DR: 本文提出了一种名为SEED的简化遥感变化检测范式,通过无参数的特征交换机制替代传统的显式差分操作,在多个基准和主干网络上实现了与现有最先进方法相媲美或更优的性能。

Details Motivation: 现有的变化检测方法通常依赖复杂的显式差分模块进行双时相特征融合,可能导致信息损失;本文旨在探索更简洁、高效且理论上更优的特征融合方式。 Method: 提出SEED框架,采用权重共享的Siamese编码器-解码器结构,用无参数的特征交换机制替代传统的加减或拼接操作;理论证明该机制在像素一致性假设下可保持互信息和贝叶斯最优风险;进一步提出SEG2CD,将语义分割模型转化为高效的变化检测器。 Result: 在五个数据集(SYSU-CD、LEVIR-CD、PX-CLCD、WaterCD、CDD)和三种主干网络(SwinT、EfficientNet、ResNet)上验证了SEED的有效性,性能达到或超过现有方法;实验证明标准分割模型经特征交换机制改造后即可成为有竞争力的变化检测器。 Conclusion: 简单的特征交换机制足以实现高性能的遥感变化检测,SEED提供了一个鲁棒、统一且可解释的新范式,挑战了复杂差分设计的必要性。 Abstract: Remote sensing change detection fundamentally relies on the effective fusion and discrimination of bi-temporal features. Prevailing paradigms typically utilize Siamese encoders bridged by explicit difference computation modules, such as subtraction or concatenation, to identify changes. In this work, we challenge this complexity with SEED (Siamese Encoder-Exchange-Decoder), a streamlined paradigm that replaces explicit differencing with parameter-free feature exchange. By sharing weights across both Siamese encoders and decoders, SEED effectively operates as a single parameter set model. Theoretically, we formalize feature exchange as an orthogonal permutation operator and prove that, under pixel consistency, this mechanism preserves mutual information and Bayes optimal risk, whereas common arithmetic fusion methods often introduce information loss. Extensive experiments across five benchmarks, including SYSU-CD, LEVIR-CD, PX-CLCD, WaterCD, and CDD, and three backbones, namely SwinT, EfficientNet, and ResNet, demonstrate that SEED matches or surpasses state of the art methods despite its simplicity. Furthermore, we reveal that standard semantic segmentation models can be transformed into competitive change detectors solely by inserting this exchange mechanism, referred to as SEG2CD. The proposed paradigm offers a robust, unified, and interpretable framework for change detection, demonstrating that simple feature exchange is sufficient for high performance information fusion. Code and full training and evaluation protocols will be released at https://github.com/dyzy41/open-rscd.

[276] More Images, More Problems? A Controlled Analysis of VLM Failure Modes

Anurag Das,Adrian Bulat,Alberto Baldrati,Ioannis Maniadis Metaxas,Bernt Schiele,Georgios Tzimiropoulos,Brais Martinez

Main category: cs.CV

TL;DR: 本文提出了MIMIC基准,用于评估大视觉语言模型(LVLMs)在多图像理解中的能力,并揭示了其在跨图像信息聚合和多概念追踪上的缺陷。为此,作者提出了数据生成策略和注意力掩码优化方法,显著提升了多图像任务性能。

Details Motivation: 尽管LVLMs在单图像任务中表现出色,但其在多图像理解与推理方面的能力尚不明确,且现有基准缺乏对核心弱点的系统分析。因此需要一个更全面的评估基准与改进方法。 Method: 提出MIMIC基准进行诊断实验;设计一种基于单图像标注合成多图像训练样本的数据生成策略;分析逐层注意力模式并引入针对多图像输入的注意力掩码机制。 Result: 通过实验验证,新方法显著改善了跨图像信息聚合能力,在多个现有基准上超越了先前最优模型的表现。 Conclusion: LVLMs在处理多图像任务时存在根本性缺陷,而结合针对性数据构造与注意力机制优化可有效提升其多图像理解能力,为未来研究提供了方向。 Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments substantially improved cross-image aggregation, while also enhancing performance on existing multi-image benchmarks, outperforming prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.

[277] MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Kewei Zhang,Ye Huang,Yufan Deng,Jincheng Yu,Junsong Chen,Huan Ling,Enze Xie,Daquan Zhou

Main category: cs.CV

TL;DR: 本文提出了一种新的多头线性注意力机制(MHLA),以解决现有线性注意力方法中的全局上下文坍缩问题,在保持线性复杂度的同时显著提升性能。

Details Motivation: Transformer的自注意力机制具有二次复杂度,限制了其在大规模应用中的使用;虽然线性注意力提供了更高效的替代方案,但常导致性能下降,尤其是出现全局上下文坍缩的问题。 Method: 提出Multi-Head Linear Attention (MHLA),通过在token维度上划分多个头来分别计算注意力,从而保留表示多样性,并维持线性计算复杂度。 Result: MHLA在多个领域均取得显著提升:ImageNet分类提升3.6%,NLP任务提升6.3%,图像生成提升12.6%,视频生成提升41%,且保持相同的时间复杂度。 Conclusion: MHLA有效解决了线性注意力中的全局上下文坍缩问题,在不增加复杂度的前提下恢复了softmax注意力的大部分表达能力,是高效注意力机制的一个有力改进。 Abstract: While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6\% improvement on ImageNet classification, a 6.3\% gain on NLP, a 12.6\% improvement on image generation, and a 41\% enhancement on video generation under the same time complexity.

[278] Tuning-free Visual Effect Transfer across Videos

Maxwell Jones,Rameen Abdal,Or Patashnik,Ruslan Salakhutdinov,Sergey Tulyakov,Jun-Yan Zhu,Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: 本文提出RefVFX,一种前馈式框架,可将参考视频中的复杂时序效果迁移到目标视频或图像上,解决了现有方法难以处理动态光照、角色变换等时序效应的问题。

Details Motivation: 现有视频编辑方法在基于文本提示或关键帧条件下难以准确描述和生成复杂的动态时序效果(如动态光照变化、角色转换),因此需要一种能够有效迁移视觉特效的框架。 Method: 构建了一个大规模三元组数据集(参考效果视频、输入视频/图像、输出结果),并通过自动化流水线生成高质量配对视频;结合LoRA适配器提取图像到视频的效果,并通过程序化组合生成代码化时序效果;基于最新的文本到视频骨干网络训练参考条件模型。 Result: 实验表明,RefVFX在视觉一致性、时序连贯性方面表现优异,能泛化至未见的效果类别,在定量指标和人类偏好测试中均优于仅使用提示的基线方法。 Conclusion: RefVFX为无需微调的跨视频视觉效果迁移提供了有效解决方案,推动了参考驱动视频编辑的发展。 Abstract: We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video's existing motion and appearance. % To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website $\href{https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/}{at\ this\ URL}$.