cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

Ashish Kattamuri,Ishita Prasad,Meetu Malhotra,Arpita Vats,Rahul Raja,Albert Lie

Main category: cs.CL

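TL;DR: Proposes a Group Relative Policy Optimization (GRPO) framework augmented with a multilingual contrastive reward signal that improves both execution accuracy and semantic accuracy of cross-lingual Text-to-SQL systems; with only 3,000 training examples, a 3B model surpasses a zero-shot 8B model in execution accuracy.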

Details

Motivation: Existing Text-to-SQL methods focus too narrowly on producing executable queries and overlook semantic alignment, and their performance drops noticeably in non-English settings.

Method: Introduces a multilingual contrastive reward signal based on semantic similarity into the GRPO framework to strengthen the semantic alignment between generated SQL and user intent.

Result: On the seven-language MultiSpider dataset, GRPO fine-tuning brings LLaMA-3-3B to up to 87.4% execution accuracy (+26 pp over zero-shot), and the contrastive reward raises average semantic accuracy to 59.14% (+6.85 pp); the fine-tuned 3B model beats a zero-shot 8B model on execution accuracy.

Conclusion: Targeted semantic alignment through contrastive rewards can substantially improve cross-lingual Text-to-SQL performance with a small model and little training data.

Abstract: Current Text-to-SQL methods focus on, and are evaluated only on, executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
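
As a rough illustration of the idea in entry [1], the sketch below folds a semantic-similarity term into a reward and then standardizes rewards within a sampled group, which is the group-relative advantage GRPO uses. The reward weights, the `combined_reward` helper, and the toy embeddings are assumptions made for illustration; the abstract does not give the exact reward formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def combined_reward(executes, sql_emb, question_emb, w_exec=1.0, w_sem=1.0):
    """Hypothetical reward: execution success plus a semantic-similarity term."""
    return w_exec * float(executes) + w_sem * cosine(sql_emb, question_emb)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of three sampled SQL candidates for one question
# (execution flag, made-up embedding of the candidate).
question = [1.0, 0.0]
candidates = [(True, [1.0, 0.0]), (True, [0.0, 1.0]), (False, [1.0, 1.0])]
rewards = [combined_reward(ok, emb, question) for ok, emb in candidates]
advantages = group_relative_advantages(rewards)
```

In this toy group the candidate that both executes and aligns with the question gets the largest advantage, so the policy update would favor it over candidates that merely execute.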

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

Ratna Kandala,Akshata Kishore Moharir,Divya Arvinda Nayak

Main category: cs.CL

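TL;DR: Proposes a Generative Operational Framework that uses large language models to turn the technical outputs of explainable AI into clinically relevant, actionable narratives, addressing the translation gap between technical transparency and practical use in mental health screening.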

Details

Motivation: Current XAI methods such as SHAP and LIME produce technically faithful feature importance scores but lack clinical relevance and are hard for patients and clinicians to understand, which hinders adoption in real clinical settings.

Method: Proposes the Generative Operational Framework, which uses an LLM as the central translation engine and retrieval-augmented generation (RAG) to bring in clinical guidelines, turning the technical outputs of multiple XAI tools into human-readable, evidence-backed clinical narratives.

Result: The framework combines XAI outputs with clinical knowledge to produce explanations tailored to different stakeholders, addressing workflow integration, bias mitigation, and communication.

Conclusion: The Generative Operational Framework offers a workable path for closing the lab-to-clinic gap of XAI in mental health screening, moving AI from isolated data points toward integrated, actionable, and trustworthy clinical decision support.

Abstract: Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park,Hyejin Park,Hyeseon Ahn,Yo-Sub Han

Main category: cs.CL

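TL;DR: Proposes STELA, a publicly verifiable watermarking framework that modulates watermark strength according to linguistic indeterminacy modeled with part-of-speech n-grams, improving detection robustness while preserving text quality.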

Details

Motivation: Existing watermarking methods rely on signals from the model's output distribution (such as token-level entropy), which limits public verification because detection needs access to model logits. A scheme is needed that does not depend on model internals while still balancing text quality and detection robustness.

Method: STELA adjusts watermark strength to the syntactic freedom of the context: it weakens the signal where grammar is tightly constrained to preserve quality and strengthens it where the language is more flexible to improve detectability. The detector works without model logits, which makes detection publicly verifiable.

Result: Experiments on typologically diverse languages (English, Chinese, and Korean) show that STELA surpasses prior methods in detection robustness without harming text quality.

Conclusion: By exploiting linguistic structure, STELA provides an efficient, publicly verifiable watermarking mechanism and a useful tool for building a trustworthy AI ecosystem.

Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.

[4] Users as Annotators: LLM Preference Learning from Comparison Mode

Zhongze Cai,Xiaocheng Li

Main category: cs.CL

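TL;DR: Proposes a way to align LLMs using preference data that users produce while interacting with the models: the asymmetry between responses generated by different models is used to infer each user's annotation quality, and an EM algorithm estimates a per-user quality factor for filtering out low-quality data.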

Details

Motivation: Conventional preference data depends on professional human annotators, which is costly and limited in coverage. Preference labels produced by users in everyday interactions are better matched to their own queries but lack quality control, so a method is needed to assess and filter user annotation quality.

Method: Proposes an asymmetric response comparison built on a user behavior model: the two responses come from two different models or two versions of the same model, and an expectation-maximization (EM) algorithm estimates a latent quality factor for each user, which is then used to filter the user's annotations.

Result: Experiments show the method captures user behavior and, in a downstream task, improves the quality of the preference data used for LLM alignment.

Conclusion: By modeling user behavior and annotation quality, the method makes effective use of preference data from non-professional users, offering a low-cost, high-quality data source for LLM alignment.

Abstract: Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.

[5] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

Chao Han,Yijuan Liang,Zihao Xuan,Daokuan Wu,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

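TL;DR: Proposes "informed routing", a new paradigm that improves LLM inference efficiency by assessing token recoverability and adding a Lightweight Feature Forecaster (LFF), substantially reducing computation while preserving performance.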

Details

Motivation: Existing dynamic token-level computation allocation methods rely on greedy routing, which tends to cause irreversible information loss and suboptimal token selection, limiting how efficiently large models can be deployed in practice.

Method: Proposes informed routing, which makes routing decisions from both a token's importance and its recoverability, and designs the Lightweight Feature Forecaster (LFF) to estimate a unit's output in advance, enabling a flexible execute-or-approximate policy.

Result: Validated on language modeling and reasoning tasks, the method achieves state-of-the-art efficiency-performance trade-offs; even without final LoRA fine-tuning it matches or surpasses strong baselines that require full fine-tuning, while cutting training time by over 50%.

Conclusion: Through forward-looking routing, informed routing markedly improves the balance between inference efficiency and quality in large models, offering a new direction for efficient deployment.

Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing, a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

[6] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

Minsik Choi,Hyegang Son,Changhoon Kim,Young Geun Kim

Main category: cs.CL

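TL;DR: Proposes HIES, a new pruning criterion that combines head importance scores with attention entropy and substantially improves the quality and stability of compressed models.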

Details

Motivation: Existing gradient-based Head Importance Score (HIS) methods capture only the gradient-driven contribution and ignore the diversity of attention patterns, which limits pruning quality.

Method: Introduces HIES (Head Importance-Entropy Score), which combines HIS with attention entropy to assess each attention head's contribution more fully.

Result: HIES-based pruning yields up to 15.2% better model quality and 2.04x better stability than HIS-only methods.

Conclusion: HIES supports substantial model compression while preserving both accuracy and stability, outperforming HIS alone.

Abstract: Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics (multiple layers and attention heads) introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for their interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
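
To make the idea in entry [6] concrete, here is a minimal sketch of blending a per-head importance score with attention entropy. The abstract does not state how HIES combines the two signals, so the min-max normalization, the `alpha` weight, the direction of the entropy term, and the toy numbers are all assumptions rather than the authors' formula.

```python
import math

def attention_entropy(attn_rows):
    """Mean Shannon entropy (nats) over the attention distributions of one head."""
    def row_entropy(row):
        return -sum(p * math.log(p) for p in row if p > 0)
    return sum(row_entropy(r) for r in attn_rows) / len(attn_rows)

def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def combined_scores(importance, entropy, alpha=0.5):
    """Hypothetical blend of normalized importance and normalized entropy."""
    imp_n, ent_n = min_max(importance), min_max(entropy)
    return [alpha * i + (1 - alpha) * e for i, e in zip(imp_n, ent_n)]

heads_attn = [
    [[0.5, 0.5], [0.5, 0.5]],   # diffuse head: high entropy
    [[1.0, 0.0], [0.0, 1.0]],   # sharply peaked head: zero entropy
]
importance = [0.2, 0.8]          # made-up gradient-based importance scores
entropy = [attention_entropy(h) for h in heads_attn]
scores = combined_scores(importance, entropy)
# Heads with the lowest combined score would be pruned first.
prune_order = sorted(range(len(scores)), key=scores.__getitem__)
```

With equal weighting, the low-importance but diffuse head and the high-importance but peaked head end up tied, which shows how the entropy term can change a ranking that importance alone would decide.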

[7] ConDABench: Interactive Evaluation of Language Models for Data Analysis

Avik Dutta,Priyanshu Gupta,Hosein Hasanbeig,Rahul Pratap Singh,Harshit Nigam,Sumit Gulwani,Arjun Radhakrishna,Gustavo Soares,Ashish Tiwari

Main category: cs.CL

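TL;DR: ConDABench is a new framework for generating and evaluating interactive data analysis tasks; a multi-agent workflow derives realistic problems from public datasets, enabling systematic evaluation of large language models on complex, long-horizon interactive tasks.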

Details

Motivation: Existing benchmarks for LLM data analysis do not reflect the under-specified goals and unclean data of real-world scenarios and lack support for interactivity, so a new benchmark is needed that captures the need for user interaction.

Method: Proposes the ConDABench framework, which includes a multi-agent workflow that generates 1,420 conversational data analysis problems from articles, plus an evaluation harness for assessing external tools.

Result: Evaluation shows that newer LLMs solve more instances but do not necessarily do better on tasks that require sustained, long-form engagement.

Conclusion: ConDABench gives model builders a way to measure progress toward truly collaborative models that can complete complex interactive tasks.

Abstract: Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

[8] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

Debarun Bhattacharjya,Balaji Ganesan,Junkyu Lee,Radu Marinescu,Katsiaryna Mirylenka,Michael Glass,Xiao Shou

Main category: cs.CL

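TL;DR: Studies uncertainty quantification (UQ) for LLM-generated outputs, proposing a black-box UQ framework based on output consistency with similarity-based aggregation and new confidence estimation techniques, and shows better calibration than baselines across several tasks.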

Details

Motivation: Making AI systems more trustworthy requires a reliable estimate of how uncertain an LLM is about its generated output, especially through black-box methods that do not depend on model internals and are therefore more practical and adaptable.

Method: Proposes a high-level, non-verbalized similarity-based aggregation framework that uses the consistency between a generated output and other sampled generations as a proxy for correctness, and within it designs new confidence estimation models trained on small training sets.

Result: Experiments on question answering, summarization, and text-to-SQL show that the proposed similarity-based UQ methods produce better calibrated confidences than baselines.

Conclusion: Black-box uncertainty quantification based on output consistency is effective; the proposed framework and techniques improve the reliability of confidence estimates across a range of complex generative tasks and suit real-world AI systems.

Abstract: When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, and introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
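
The consistency-as-confidence idea in entry [8] can be sketched as follows: each sampled generation is scored by its average similarity to the other samples. The token-set Jaccard similarity and plain averaging are stand-ins chosen for illustration; the paper's framework covers other similarity measures and learned aggregators that are not reproduced here.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_confidence(samples, similarity=jaccard):
    """Confidence of each sample = mean similarity to the other samples.

    Assumes at least two samples.
    """
    scores = []
    for i, s in enumerate(samples):
        others = [similarity(s, t) for j, t in enumerate(samples) if j != i]
        scores.append(sum(others) / len(others))
    return scores

samples = ["Paris", "paris", "Lyon"]
confidences = consistency_confidence(samples)
```

Here the two agreeing answers receive a confidence of 0.5 each and the outlier receives 0.0, so only black-box access to sampled outputs is needed.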

[9] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

Weibin Cai,Reza Zafarani

Main category: cs.CL

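TL;DR: Proposes a culture-aware framework that builds individual hate subspaces to address biased training labels and cross-cultural differences in interpretation; experiments show an average gain of 1.05% over the state of the art across all metrics.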

Details

Motivation: Existing hate speech detection methods overlook a real-world complexity: training labels are biased, and people from different cultural backgrounds interpret what counts as hate differently.

Method: Proposes a culture-aware framework that models combinations of cultural attributes to alleviate data sparsity and uses label propagation to capture the distinctive features of each combination, from which individual hate subspaces are constructed.

Result: Experiments show the method outperforms the state of the art by 1.05% on average across all metrics.

Conclusion: The proposed culture-aware framework handles the challenges posed by cultural differences and improves hate speech detection performance.

Abstract: Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals' hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, we construct individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05% on average across all metrics.

[10] Meronymic Ontology Extraction via Large Language Models

Dekai Zhang,Simone Conia,Antonio Rago

Main category: cs.CL

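TL;DR: Proposes a fully automated method that uses large language models (LLMs) to extract product ontologies, specifically meronymies, from raw review text; it outperforms a BERT-based baseline under LLM-as-a-judge evaluation.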

Details

Motivation: Building ontologies by hand is time-consuming, expensive, and laborious, and existing automated methods leave room for improvement, so a more efficient and accurate automated extraction method is needed.

Method: Uses LLMs in an end-to-end, fully automated pipeline that extracts product meronymy relations directly from raw user reviews to form a product ontology.

Result: Under LLM-as-a-judge evaluation, the ontologies produced by the method surpass an existing BERT-based baseline, showing the potential of LLMs for ontology extraction.

Conclusion: LLMs can support fully automated product ontology extraction and lay the groundwork for ontology construction in broader domains.

Abstract: Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluated using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

[11] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Yutao Wu,Xiao Liu,Yinghui Li,Yifeng Gao,Yifan Ding,Jiale Ding,Xiang Zheng,Xingjun Ma

Main category: cs.CL

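TL;DR: Proposes ADMIT, an adversarial multi-injection technique for knowledge poisoning attacks on retrieval-augmented generation (RAG) systems; it flips fact-checking decisions at an extremely low poisoning rate and reaches an average attack success rate of 86% across multiple retrievers and LLMs.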

Details

Motivation: Prior knowledge poisoning work does not fully account for real-world settings where credible evidence dominates the retrieval pool. This paper examines how to mount an effective black-box poisoning attack, without access to the target model, when the retrieved context contains authentic supporting or refuting evidence.

Method: Proposes ADMIT, a few-shot, semantically aligned multi-passage injection strategy that manipulates fact-checking outcomes by injecting carefully crafted adversarial content into the knowledge base, without access to the target LLM or retriever.

Result: Across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, ADMIT reaches an average attack success rate of 86% at a poisoning rate of only 0.93 × 10⁻⁶, improves on prior state-of-the-art attacks by 11.2%, and remains robust in the presence of strong counter-evidence.

Conclusion: ADMIT exposes serious vulnerabilities in real-world RAG-based fact-checking systems, which are highly susceptible to stealthy and efficient black-box knowledge poisoning and need stronger defenses.

Abstract: Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

[12] Serialized EHR make for good text representations

Zhirong Chou,Quan Qin,Shi Li

Main category: cs.CL

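TL;DR: SerialBEHRT is a SciBERT-based healthcare foundation model with additional pretraining on structured electronic health record (EHR) sequences; it captures temporal and contextual relationships among clinical events and outperforms existing methods on antibiotic susceptibility prediction.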

Details

Motivation: Existing healthcare foundation models struggle to reconcile the tabular, event-based nature of EHRs with the sequential priors of natural language models, which makes long-range dependencies across patient encounters hard to model.

Method: Proposes SerialBEHRT, which converts EHR data into structured temporal sequences and applies domain-specific pretraining on top of SciBERT to better model temporal and contextual relationships among clinical events.

Result: On antibiotic susceptibility prediction, SerialBEHRT achieves better and more consistent performance than current state-of-the-art EHR representation methods.

Conclusion: Temporal serialization during pretraining matters for healthcare foundation models; SerialBEHRT's structurally aligned design improves the representation of EHR data.

Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

[13] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang,Nasib Ullah,Erik Schultheis,Rohit Babbar

Main category: cs.CL

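TL;DR: Proposes DynaSpec, a context-dependent dynamic shortlisting mechanism that speeds up speculative decoding in LLM inference and is more efficient and robust than fixed vocabulary-subset methods.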

Details

Motivation: In speculative decoding, the parameters in the small drafter's output head grow with vocabulary size and become a latency bottleneck. Fixed vocabulary-subset methods are corpus-dependent and suppress rare tokens, which hurts generalization and performance.

Method: Proposes DynaSpec, which uses lightweight, coarse-grained meta-classifiers to route contexts to a small number of token clusters; the union of the top-k clusters forms the drafter's shortlist, while verification keeps the full vocabulary. Draft encoding and meta shortlisting run in parallel so that the meta-classifier finishes early.

Result: On standard speculative-decoding benchmarks, DynaSpec consistently improves mean accepted length over fixed-shortlist baselines, and context-dependent selection allows smaller shortlists without lowering acceptance.

Conclusion: DynaSpec is an efficient, robust, and generalizable way to accelerate speculative decoding: a dynamic, context-aware vocabulary shortlist speeds up drafting while verification stays exact.

Abstract: Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
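
The shortlist construction described in entry [13] (union of the top-k clusters chosen by a meta-classifier) can be sketched in a few lines. The cluster contents, the scores standing in for meta-classifier outputs, and the function name `dynamic_shortlist` are hypothetical; how the clusters are built and how the classifier is trained is not covered here.

```python
def dynamic_shortlist(cluster_scores, clusters, k):
    """Union of token ids from the k highest-scoring clusters."""
    ranked = sorted(range(len(cluster_scores)),
                    key=lambda c: cluster_scores[c], reverse=True)
    shortlist = set()
    for c in ranked[:k]:
        shortlist.update(clusters[c])
    return shortlist

# Toy vocabulary of 9 token ids partitioned into 3 clusters.
clusters = [{0, 1, 2}, {3, 4}, {5, 6, 7, 8}]
scores = [0.1, 0.7, 0.2]   # stand-in for meta-classifier outputs for one context
shortlist = dynamic_shortlist(scores, clusters, k=2)
```

The drafter would then compute logits only over `shortlist`, while the target model still verifies against the full vocabulary, which is what keeps the procedure exact.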

[14] On-device System of Compositional Multi-tasking in Large Language Models

Ondrej Bohdal,Konstantinos Theodosiadis,Asterios Mpatziakas,Dimitris Filippidis,Iro Spyrou,Christos Zonios,Anastasios Drosou,Dimosthenis Ioannidis,Kyeng-Hun Lee,Jijoong Moon,Hyeonmok Ko,Mete Ozay,Umberto Michieli

Main category: cs.CL

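TL;DR: Proposes an efficient approach to the compositional task of summarization plus translation: a learnable projection layer is added on top of the adapters, giving good performance while keeping computation low.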

Details

Motivation: Existing parameter-efficient fine-tuning methods struggle with complex compositional tasks, such as producing a translated summary of a long conversation, where several tasks must be executed at once.

Method: Adds a learnable projection layer on top of the combined summarization and translation Low-Rank Adapters (LoRA) to integrate the two tasks effectively with reduced computational overhead.

Result: Experiments show the method performs well and is fast in both cloud-based and on-device implementations, making it suitable for resource-constrained settings.

Conclusion: The framework improves compositional multi-tasking while remaining efficient, and is suited to deployment on mobile devices and similar real-world applications.

Abstract: Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

[15] Language steering in latent space to mitigate unintended code-switching

Andrey Goncharov,Nikolai Kondusov,Alexey Zaytsev

Main category: cs.CL

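TL;DR: Proposes a PCA-based latent-space language steering method that reduces unintended code-switching in multilingual LLMs while preserving semantics at negligible computational cost.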

Details

Motivation: Multilingual LLMs often code-switch unintentionally, which reduces reliability in downstream tasks, so the language identity of generated text needs to be controlled.

Method: Identifies language directions by running PCA on parallel translations and, at inference time, steers token embeddings along these directions to control language identity.

Result: A single principal component yields 95-99% language classification accuracy, next-token distributional divergence drops by up to 42% on Qwen2.5 and Llama-3.2 models, and language representations are nearly linearly separable in the final layers.

Conclusion: This lightweight method suppresses code-switching efficiently and needs only a small amount of parallel data for calibration, making it practical and easy to extend.

Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
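
A minimal sketch of the steering idea in entry [15], assuming toy two-dimensional hidden states: the first principal component of representations from parallel sentences is taken as the language direction, and a vector is shifted along it. The power-iteration PCA, the `steer` helper, the step size `alpha`, and the data are illustrative assumptions; the paper's actual layers, scaling, and calibration procedure are not reproduced.

```python
import math

def first_principal_component(rows, iters=100):
    """Top principal direction of `rows` via power iteration on the covariance."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def steer(hidden, direction, alpha):
    """Shift a hidden vector along the language direction by `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy hidden states for the same two sentences in two languages:
# axis 0 carries language identity, axis 1 carries content.
lang_a = [[1.0, 0.1], [1.0, -0.1]]
lang_b = [[-1.0, 0.1], [-1.0, -0.1]]
direction = first_principal_component(lang_a + lang_b)
steered = steer(lang_b[0], direction, alpha=2.0)
```

In this toy setup the steered vector moves to the language-A side of the axis while its content coordinate stays essentially unchanged, which mirrors the goal of changing language identity without changing meaning.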

[16] Revisiting the UID Hypothesis in LLM Reasoning Traces

Minju Gwak,Guijin Son,Jaehyung Kim

Main category: cs.CL

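TL;DR: Introduces entropy-based metrics of information flow and finds that successful LLM reasoning on math problems has non-uniform information density, the opposite of the uniform information density pattern in human communication.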

Details

Motivation: Inspired by the Uniform Information Density (UID) hypothesis from psycholinguistics, the paper analyzes how information flows through LLM reasoning chains and challenges current assumptions about machine reasoning.

Method: Introduces entropy-based metrics and analyzes the information flow of LLM reasoning traces on three mathematical reasoning benchmarks.

Result: Reasoning traces of correct solutions show globally non-uniform information density with uneven swings, in stark contrast to the stable pattern humans follow under UID.

Conclusion: Successful machine reasoning may depend on non-uniform allocation of information; this challenges the assumption that human UID principles carry over directly to model interpretability and suggests new directions for interpretable and adaptive reasoning models.

Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics -- which posits that humans communicate by maintaining a stable flow of information -- we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
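
One simple way to quantify how uniform the information flow of a reasoning trace is, in the spirit of entry [16], is the variance of per-step entropy. This is an illustrative metric chosen for the sketch, not necessarily one of the paper's metrics, and the two-outcome distributions are toy data.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def uniformity_variance(step_dists):
    """Variance of per-step entropy; values near 0 mean a uniform information flow."""
    ents = [entropy(d) for d in step_dists]
    mean = sum(ents) / len(ents)
    return sum((e - mean) ** 2 for e in ents) / len(ents)

flat_trace = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]     # same uncertainty at every step
spiky_trace = [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5]]    # one fully certain step
flat_var = uniformity_variance(flat_trace)
spiky_var = uniformity_variance(spiky_trace)
```

Under this metric the spiky trace scores higher than the flat one, which is the kind of uneven swing the abstract associates with correct solutions.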

[17] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

Sicheng Lyu,Yu Gu,Xinyu Wang,Jerry Huang,Sitao Luan,Yufei Cui,Xiao-Wen Chang,Peng Lu

Main category: cs.CL

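TL;DR: Proposes EvoEdit, an editing strategy that uses sequential null-space alignment to mitigate catastrophic interference during continual knowledge updates in LLMs, giving stable and efficient editing that matches or beats existing methods on real-world benchmarks with up to 3.53x speedup.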

Details

Motivation: Existing model editing methods suffer from catastrophic interference in sequential editing, where new edits damage earlier updates, so a method is needed that preserves both previously edited and original knowledge.

Method: Proposes EvoEdit, which performs sequential null-space alignment for each incoming edit so that the change does not affect original or previously modified knowledge representations, keeping outputs on preserved knowledge invariant.

Result: On real-world sequential knowledge-editing benchmarks, EvoEdit performs better than or comparably to state-of-the-art locate-then-edit techniques, with up to 3.53x speedup.

Conclusion: EvoEdit provides a simple yet effective solution, with strong theoretical guarantees, for editing LLMs in dynamically evolving information settings, and underscores the need for more principled editing approaches.

Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
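
The general null-space idea behind entry [17] can be illustrated with a generic projection: a weight update is stripped of any component along the keys of knowledge to be preserved, so the layer's output on those keys does not change. This is a textbook Gram-Schmidt projection under assumed toy vectors, not EvoEdit's actual algorithm; the helper names and the idea of growing the basis after each edit are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add_to_basis(basis, key, tol=1e-12):
    """Gram-Schmidt: extend an orthonormal basis with the new part of `key`."""
    residual = list(key)
    for b in basis:
        coeff = dot(residual, b)
        residual = [r - coeff * x for r, x in zip(residual, b)]
    norm = dot(residual, residual) ** 0.5
    if norm > tol:
        basis.append([r / norm for r in residual])
    return basis

def project_to_null_space(update_rows, basis):
    """Remove from each row of the update any component along protected keys."""
    projected = []
    for row in update_rows:
        out = list(row)
        for b in basis:
            coeff = dot(out, b)
            out = [o - coeff * x for o, x in zip(out, b)]
        projected.append(out)
    return projected

preserved_key = [1.0, 0.0, 0.0]                    # key of knowledge to keep intact
basis = add_to_basis([], preserved_key)
raw_update = [[0.5, 0.2, 0.0], [0.3, 0.0, 0.1]]    # rows of a hypothetical delta-W
safe_update = project_to_null_space(raw_update, basis)
# Output change on the preserved key is (delta-W) @ key, computed row by row.
change_on_preserved = [dot(row, preserved_key) for row in safe_update]
```

In a sequential setting, one would call `add_to_basis` with the key of each applied edit so that later updates are also projected away from earlier edits.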

[18] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

Peter Banyas,Shristi Sharma,Alistair Simmons,Atharva Vispute

Main category: cs.CL

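TL;DR: This paper introduces ConsistencyAI, an independent benchmark for evaluating whether large language models (LLMs) give factually consistent answers to users with different personas. Experiments show significant variation across topics and providers, with some lightweight models performing poorly.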

Details

Motivation: Detect whether LLMs give inconsistent factual information depending on users' demographic characteristics, to ensure fair and reliable outputs. Method: Queries 19 LLMs for 5 facts on each of 15 topics, repeating each query 100 times with prompt context from a different persona; responses are converted to sentence embeddings, cross-persona cosine similarity is computed, and a weighted average yields the consistency score. Result: Consistency scores range from 0.7896 to 0.9065 with a mean of 0.8656; xAI's Grok-3 is the most consistent and lightweight models rank lowest; the job market is the least consistent topic and G7 world leaders the most consistent, while vaccines and the Israeli-Palestinian conflict vary by provider. Conclusion: LLM factual consistency depends on both provider and topic; persona-invariant prompting strategies should be promoted to improve fairness. Abstract: Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
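The scoring step described above (embeddings, cross-persona cosine similarity, weighted average) can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how the weights are chosen, so the sketch defaults to uniform weights over persona pairs, and the toy two-dimensional "embeddings" stand in for real sentence embeddings. The function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def consistency_score(embeddings, weights=None):
    """Weighted average of pairwise cosine similarity across personas.
    `weights` (one per persona pair) is an assumption; uniform by default."""
    pairs = list(combinations(range(len(embeddings)), 2))
    if weights is None:
        weights = [1.0] * len(pairs)
    sims = [cosine(embeddings[i], embeddings[j]) for i, j in pairs]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)

# Toy example: three personas, two of which received the same answer.
persona_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = consistency_score(persona_embeddings)
```

With these toy vectors only one of the three persona pairs agrees, so the score is one third; a model that answered every persona identically would score 1.0.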

[19] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Fabian Wenz,Omar Bouattour,Devin Yang,Justin Choi,Cecil Gregg,Nesime Tatbul,Çağatay Demiralp

Main category: cs.CL

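TL;DR: This paper presents BenchPress, a system that combines human experts with large language models (LLMs) to accelerate the construction of domain-specific text-to-SQL benchmarks, substantially reducing manual annotation time and cost.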

Details

Motivation: Text-to-SQL research relies mostly on public datasets and performs poorly in real enterprise settings; building private enterprise benchmarks such as Beaver requires extensive manual annotation of SQL logs, which is slow and expensive, so a more efficient annotation method is needed. Method: BenchPress uses retrieval-augmented generation (RAG) and LLMs to draft multiple natural language descriptions for each SQL query, which human experts then select, rank, or edit, yielding human-machine collaborative annotation. Result: Experiments on enterprise SQL logs show that LLM assistance substantially reduces the time and effort needed to create high-quality benchmarks while improving annotation accuracy and benchmark reliability. Conclusion: By combining LLM generation with human verification, BenchPress supports rapid construction of domain-specific text-to-SQL benchmarks, improves the robustness of model evaluation, and is open-sourced for research and practice. Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. 
By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

[20] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

Mamadou K. Keita,Christopher Homan,Sebastien Diarra

Main category: cs.CL

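TL;DR: Proposes Rule-to-Tag (R2T), a hybrid framework that integrates linguistic rules into a neural network's training objective and introduces an adaptive loss function, enabling principled learning (PrL) on unlabeled text. It reaches 98.2% accuracy on Zarma POS tagging and, used as pre-training for NER, markedly improves few-shot performance.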

Details

Motivation: Address the scarcity of labeled data for low-resource languages by exploring a training paradigm that relies on explicit task constraints rather than large labeled datasets. Method: The R2T framework embeds multi-tiered linguistic rules in the training objective and uses an adaptive loss with a regularization term that handles out-of-vocabulary words with principled uncertainty; training uses unlabeled data, following the principled learning (PrL) paradigm. Result: On Zarma POS tagging, R2T-BiLSTM reaches 98.2% accuracy using only unlabeled text, outperforming AfriBERTa fine-tuned on 300 labeled sentences; on NER, R2T pre-training plus 50 labeled sentences beats a baseline trained on 300 labeled sentences. Conclusion: R2T offers an effective principled-learning path for low-resource language processing, achieving high performance with little or no labeled data and showing the potential of combining rules with neural methods. Abstract: We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network's training objective. R2T's novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperforms a baseline trained on 300.

[21] Harnessing Consistency for Robust Test-Time LLM Ensemble

Zhichen Zeng,Qi Yu,Xiao Lin,Ruizhong Qiu,Xuying Ning,Tianxin Wei,Yuchen Yan,Jingrui He,Hanghang Tong

Main category: cs.CL

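TL;DR: Proposes CoRE, a plug-and-play method that exploits model consistency to make LLM ensembles more robust, addressing inconsistency at both the token level and the model level.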

Details

Motivation: Different LLMs have different strengths and weaknesses; ensembling can combine them, but little work has examined ensemble robustness to erroneous signals such as tokenization mismatches and differing model expertise. Method: CoRE models consistency at two levels: at the token level, a low-pass filter downweights highly inconsistent tokens; at the model level, models with high self-confidence and small divergence from the others receive higher weight. Result: Experiments across diverse benchmarks, model combinations, and ensemble strategies show that CoRE consistently improves ensemble performance and robustness. Conclusion: CoRE is a general and effective technique for improving the stability and accuracy of LLM ensembles in heterogeneous settings. Abstract: Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.

[22] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

A H M Rezaul Karim,Ozlem Uzuner

Main category: cs.CL

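TL;DR: The MasonNLP system uses a retrieval-augmented generation (RAG) framework built on a general-purpose large language model and ranked 3rd in the MEDIQA-WV 2025 wound-care visual question answering task, showing that lightweight RAG with a general-purpose LLM is an effective baseline for multimodal clinical NLP.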

Details

Motivation: Improve answer quality and structured attribute generation in medical VQA for wound care, in support of clinical decision-making. Method: A general-domain, instruction-tuned LLM is combined with a RAG framework that incorporates textual and visual examples from in-domain data. Result: The system ranked 3rd in MEDIQA-WV 2025 with an average score of 41.37%, performing well on dBLEU, ROUGE, BERTScore, and LLM-based metrics. Conclusion: Lightweight RAG with a general-purpose LLM needs no extra training or complex re-ranking; simple indexing and fusion of a few relevant exemplars is enough to improve multimodal clinical NLP performance, making it a simple and effective baseline. Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.

Shivanshu Kumar,Gopalakrishnan Srinivasan

Main category: cs.CL

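TL;DR: Proposes ShishuLM, an efficient language model architecture that reduces parameter count and KV cache requirements, markedly lowering memory use and latency while maintaining performance.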

Details

Motivation: Transformers perform well but carry high memory and compute overhead and contain architectural redundancy, motivating more efficient designs. Method: Drawing on AI interpretability and inference-time layer pruning research, the approach exploits the roughly linear behavior of normalization plus attention in moderate-context scenarios to replace entire transformer blocks with MLPs. Result: Evaluated on two small language models of different scales, ShishuLM reduces memory requirements by up to 25% and improves training and inference latency by up to 40%. Conclusion: ShishuLM provides a feasible path and practical insights for building more efficient architectures from a small-language-model pre-training standpoint. Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.

[24] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Chenyu Zhang,Sharifa Alghowinem,Cynthia Breazeal

Main category: cs.CL

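TL;DR: This study presents the first ensemble-LLM framework for large-scale affect sensing, analyzing 16,986 turns of AI tutoring dialogue to characterize students' affective dynamics. Positive affect dominates but is easily disrupted, and neutral states often act as emotional turning points, pointing to a responsible path for affective support from generative AI in education.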

Details

Motivation: Although prior work has examined the impact of LLMs in education, their effect on students' affective states during tutoring remains unclear; understanding affective dynamics in LLM-mediated teaching is needed for responsible use of generative AI in education. Method: An ensemble of three frontier LLMs (Gemini, GPT-4o, Claude) produces zero-shot affect annotations for 16,986 conversational turns between PyTutor (an LLM-powered AI tutor) and 261 undergraduates over two semesters, extracting valence, arousal, and learning-helpfulness ratings plus free-text emotion labels; results are fused via rank-weighted pooling and plurality consensus across models to produce robust emotion profiles. Result: Students typically show mildly positive affect and moderate arousal; confusion and curiosity are common, and frustration, though rarer, can still derail progress; emotional states are short-lived, with positive ones lasting slightly longer but remaining fragile; negative emotions often resolve quickly or even turn positive, and neutral states frequently act as upward turning points, suggesting moments for intervention. Conclusion: The ensemble-LLM framework captures fine-grained affective changes during learning, reveals key features of affective flow in AI tutoring, and highlights the value of real-time affect sensing and intervention in educational AI systems. Abstract: While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. 
Emotional states are short-lived--positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
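The fusion step named in the abstract (rank-weighted intra-model pooling followed by plurality consensus across models) might look roughly like the sketch below. The abstract does not give the weighting scheme, so the `1 / rank` weights, the tie-breaking behavior, and the example labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def pool_within_model(ranked_labels):
    """Rank-weighted pooling for one model: the label at rank r adds 1/r
    to that label's score (the 1/r weighting is an assumption)."""
    scores = defaultdict(float)
    for rank, label in enumerate(ranked_labels, start=1):
        scores[label] += 1.0 / rank
    return max(scores, key=scores.get)

def plurality_consensus(per_model_labels):
    """Pick the label chosen by the most models; ties go to the label seen first."""
    return Counter(per_model_labels).most_common(1)[0][0]

# Hypothetical ranked emotion labels from three annotator models for one turn.
annotations = {
    "gemini": ["confused", "curious", "confused"],
    "gpt-4o": ["curious", "confused"],
    "claude": ["confused", "frustrated"],
}
pooled = {model: pool_within_model(labels) for model, labels in annotations.items()}
turn_label = plurality_consensus(list(pooled.values()))
```

Here two of the three models settle on "confused" after pooling, so that label is assigned to the turn.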

[25] Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee,Seungyeon Kim,Nojun Kwak

Main category: cs.CL

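TL;DR: Proposes Template Infilling (TI), a new conditional generation method combined with Dynamic Segment Allocation (DSA), which significantly outperforms baselines on mathematical reasoning and code generation tasks.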

Details

Motivation: Inference strategies for diffusion language models are limited to prefix prompting and lack structured control over generation, which restricts their potential. Method: First generate a structural template for the target response, then fill in the masked segments; Dynamic Segment Allocation (DSA) adaptively adjusts segment lengths based on generation confidence. Result: Improves over the baseline by 17.01 percentage points on mathematical reasoning and code generation benchmarks, and yields effective speedups in multi-token generation while maintaining quality. Conclusion: Template Infilling gives diffusion language models a more flexible and efficient generation scheme and broadens their applicability to complex tasks. Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs' generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

[26] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Elwin Huaman,Wendi Huaman,Jorge Luis Huaman,Ninfa Quispe

Main category: cs.CL

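TL;DR: This paper examines the integration of Quechua languages into the Common Voice platform to address data scarcity for under-resourced languages in speech technology. A case study of Puno Quechua presents the results of language onboarding and corpus building, and the paper proposes a research agenda covering technical, ethical, and data sovereignty issues.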

Details

Motivation: Under-resourced languages such as Quechua lack speech data, which limits their use and development in speech technology and calls for open, community-driven datasets. Method: Uses Common Voice to collect Quechua speech data, focusing on Puno Quechua (qxp) as a case study, with collection and validation of both read and spontaneous speech. Result: Common Voice now hosts 191.1 hours of Quechua speech (86% validated), of which Puno Quechua contributes 12 hours (77% validated). Conclusion: Common Voice has strong potential to advance speech technology for under-resourced languages; technical challenges, community engagement, and indigenous data sovereignty need attention. Abstract: Under-resourced languages, such as the Quechua languages, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.

[27] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Johann Pignat,Milena Vucetic,Christophe Gaudet-Blavignac,Jamil Zaghir,Amandine Stettler,Fanny Amrein,Jonatan Bonjour,Jean-Philippe Goldman,Olivier Michielin,Christian Lovis,Mina Bjelogrlic

Main category: cs.CL

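TL;DR: FRACCO is an expert-annotated corpus of 1301 synthetic French clinical cases that supports research on named entity recognition and concept normalisation in French oncology.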

Details

Motivation: Annotated datasets for French oncology are scarce, which limits NLP tool development, so a high-quality annotated French clinical text resource is needed. Method: French clinical cases were produced by translating the Spanish CANTEMIST corpus, and entity spans were annotated by two domain experts; morphology, topography, and histologic differentiation terms were normalised to ICD-O through automated matching combined with manual validation, with an additional layer of composite expression-level normalisations. Result: The final dataset contains 71127 ICD-O normalisations, covering 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 expressions), and 2043 unique composite expressions (from 11144 expressions). Conclusion: FRACCO provides a reliable reference dataset for named entity recognition and concept normalisation in French oncology texts and supports progress in French clinical NLP. Abstract: Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.

[28] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

Filipe Laitenberger,Dawid Kopiczko,Cees G. M. Snoek,Yuki M. Asano

Main category: cs.CL

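TL;DR: Proposes GateSkip, a residual-stream gating mechanism that lets decoder-only language models skip layers for unimportant tokens, saving compute while retaining high accuracy.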

Details

Motivation: Reduce inference compute without significantly hurting performance, especially on long-form reasoning. Method: Each Attention/MLP branch gets a sigmoid-linear gate that condenses the branch output before it re-enters the residual stream; at inference, tokens are ranked by gate value and low-importance tokens are skipped under a per-layer budget. Result: On long-form reasoning, saves up to 15% of compute while retaining over 90% of baseline accuracy; on instruction-tuned models, matches baseline quality near 50% savings and shows accuracy gains at full compute. The gates also give insight into transformer information flow. Conclusion: GateSkip is a stable, differentiable, lightweight method that adds dynamic layer skipping on top of pretrained models, balances efficiency and performance, and combines easily with other compression techniques such as quantization and pruning. Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
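The token-skipping rule can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it assumes a scalar gate per token computed from the token's incoming hidden state, a hypothetical `branch` function standing in for the Attention/MLP computation, and a `budget` expressed as the fraction of tokens processed per layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_value(hidden, w, b):
    """Sigmoid-linear gate for one token (scalar gate is an assumption)."""
    return sigmoid(sum(h * wi for h, wi in zip(hidden, w)) + b)

def gated_layer(tokens, branch, w, b, budget):
    """Run `branch` only on the top `budget` fraction of tokens by gate value.
    Skipped tokens pass through the residual stream unchanged."""
    gates = [gate_value(t, w, b) for t in tokens]
    keep = max(1, math.ceil(budget * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: gates[i], reverse=True)
    kept = set(order[:keep])
    out = []
    for i, t in enumerate(tokens):
        if i in kept:
            # Residual update, scaled by the gate before re-entering the stream.
            out.append([x + gates[i] * y for x, y in zip(t, branch(t))])
        else:
            out.append(list(t))
    return out, kept

# Toy hidden states for four tokens and a stand-in branch computation.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
outputs, kept = gated_layer(tokens, lambda t: [1.0, 1.0], w=[1.0, 0.0], b=0.0, budget=0.5)
```

With a budget of 0.5, only the two tokens with the largest gate values are processed; the other two keep their hidden states and cost no branch compute in this layer.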

[29] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Jimin Lim,Arjun Damerla,Arthur Jiang,Nam Le

Main category: cs.CL

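TL;DR: This paper presents a new benchmark for evaluating how well LLMs make sequential decisions under uncertainty in multi-armed bandit environments with text-only feedback. Qwen3-4B performs best at selecting the optimal arm, with a best-arm selection rate of 89.2%, outperforming the other LLMs and traditional decision-making algorithms.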

Details

Motivation: Explore whether LLMs can perform probabilistic reasoning and decision-making through natural language alone, without numerical cues. Method: A multi-armed bandit environment with purely textual feedback such as "you earned a token" requires the LLM to infer the latent reward structure and act on it; performance is compared with Thompson Sampling, Epsilon Greedy, UCB, and random choice. Result: Most LLMs underperform the classical algorithms, but Qwen3-4B achieves a best-arm selection rate of 89.2%, significantly outperforming the other models and the baselines. Conclusion: Probabilistic reasoning can emerge from language alone, and the benchmark offers a new direction for evaluating decision-making in naturalistic, non-numeric contexts. Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, "you earned a token", without access to numerical cues or explicit probabilities, requiring the model to infer latent reward structures purely from linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compared it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
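For context, two of the classical baselines named above can be sketched against a text-feedback bandit. This is a hedged illustration, not the benchmark's code: the no-reward wording, the arm probabilities, the horizon, and the definition of best-arm selection rate as the fraction of pulls on the best arm are all assumptions.

```python
import math
import random

REWARD_TEXT = "you earned a token"      # feedback string quoted in the abstract
NO_REWARD_TEXT = "you earned nothing"   # hypothetical wording for the no-reward case

def pull(arm_probs, arm, rng):
    """Return purely textual feedback for pulling one Bernoulli arm."""
    return REWARD_TEXT if rng.random() < arm_probs[arm] else NO_REWARD_TEXT

def thompson(wins, losses, t, rng):
    """Thompson Sampling: sample each arm's Beta posterior, play the best sample."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return samples.index(max(samples))

def ucb1(wins, losses, t, rng):
    """UCB1: play each arm once, then maximize mean plus exploration bonus."""
    for arm, (w, l) in enumerate(zip(wins, losses)):
        if w + l == 0:
            return arm
    scores = [w / (w + l) + math.sqrt(2 * math.log(t) / (w + l))
              for w, l in zip(wins, losses)]
    return scores.index(max(scores))

def best_arm_rate(policy, arm_probs, horizon=500, seed=0):
    """Fraction of pulls spent on the best arm (assumed metric definition)."""
    rng = random.Random(seed)
    wins = [0] * len(arm_probs)
    losses = [0] * len(arm_probs)
    best = arm_probs.index(max(arm_probs))
    best_pulls = 0
    for t in range(1, horizon + 1):
        arm = policy(wins, losses, t, rng)
        if pull(arm_probs, arm, rng) == REWARD_TEXT:
            wins[arm] += 1
        else:
            losses[arm] += 1
        best_pulls += arm == best
    return best_pulls / horizon

rates = {p.__name__: best_arm_rate(p, [0.2, 0.5, 0.8]) for p in (thompson, ucb1)}
```

Both policies only ever see the feedback string, which they convert to win and loss counts; an LLM in the benchmark has to perform the equivalent bookkeeping implicitly from the dialogue history.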

[30] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Alexandre Galashov,Matt Jones,Rosemary Ke,Yuan Cao,Vaishnavh Nagarajan,Michael C. Mozer

Main category: cs.CL

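TL;DR: This paper proposes a class of supervised training objectives called Catch Your Breath (CYB) that let a language model dynamically and autonomously allocate compute steps to each input token. The model requests extra compute through <don't know> outputs and <pause> tokens, and is trained within a sequential-decision framework with a time cost to use pauses judiciously. Experiments show that the CYB model matches baseline performance with only one third of the training data.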

Details

Motivation: Conventional language models spend a fixed amount of compute on each token and cannot adapt to input complexity; the goal is for the model to learn to request more compute when uncertain, improving efficiency and accuracy. Method: Output-token selection is framed as a sequential-decision problem with a time cost, with three CYB loss variants: CYB-AP (anytime prediction, with accuracy discounted over time), CYB-VA (a variational approach over a distribution of stopping times), and CYB-DP (a penalty based on a computational budget); the variants are compared through fine-tuning. Result: The CYB model needs only one third of the baseline's training data to reach the same performance, and half as much as a model with pauses trained with a cross-entropy loss; it adapts pausing to token complexity, often pausing after plural nouns, never pausing after the first token of contracted words, and behaving variably on ambiguous tokens such as "won". Conclusion: CYB losses effectively teach the model to request extra compute when needed, enabling dynamic allocation of compute and improving data efficiency and adaptation to input complexity. Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don't know> output. If the model is granted a delay, a specialized <pause> token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use <don't know> outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as Catch Your Breath losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. 
For example, it often pauses after plural nouns like "patients" and "challenges" but never pauses after the first token of contracted words like "wasn" and "didn", and it shows high variability for ambiguous tokens like "won", which could function as either a verb or part of a contraction.

[31] PAGE: Prompt Augmentation for text Generation Enhancement

Mauro Jose Pacchiotti,Luciana Ballejos,Mariel Ale

Main category: cs.CL

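TL;DR: Proposes PAGE, a framework that improves the performance and controllability of natural language generation models using lightweight auxiliary modules (such as classifiers or extractors), without requiring additional generative models.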

Details

Motivation: Natural language generation models can perform poorly on specific tasks or requirements, and adapting them usually demands large amounts of additional data, so a simpler, adaptable enhancement method is needed. Method: PAGE uses lightweight auxiliary modules to draw inferences from the input text and uses their outputs to construct an enriched input, improving generation quality and controllability. Result: In a proof of concept in requirements engineering, an auxiliary module with a classifier improved the quality of generated software requirements. Conclusion: PAGE offers a simple, modular architecture for generation enhancement that adapts easily to different tasks and does not depend on complex auxiliary generative models. Abstract: In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.
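The flow described above (auxiliary module infers something from the input, and the inference is folded into an enriched prompt) can be sketched as follows. This is a toy illustration under stated assumptions: the keyword rule stands in for a trained lightweight classifier, and the label set, prompt wording, and function names are hypothetical rather than taken from the paper.

```python
def classify_requirement(text):
    """Toy stand-in for a lightweight classifier (hypothetical keyword rule)."""
    nonfunctional_cues = ("performance", "security", "latency", "availability")
    lowered = text.lower()
    if any(cue in lowered for cue in nonfunctional_cues):
        return "non-functional"
    return "functional"

def build_enriched_prompt(user_input):
    """Combine the auxiliary module's inference with the original input."""
    label = classify_requirement(user_input)
    return (
        f"Requirement type: {label}\n"
        "Task: rewrite the following as a clear software requirement.\n"
        f"Input: {user_input}"
    )

prompt_a = build_enriched_prompt("The system must keep latency under 200 ms")
prompt_b = build_enriched_prompt("Users can export reports")
```

The enriched prompt would then be passed to whatever generation model is in use; only the auxiliary module needs to change when the task changes, which is the modularity the abstract emphasizes.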

Bolei Ma,Yong Cao,Indira Sen,Anna-Carolina Haensch,Frauke Kreuter,Barbara Plank,Daniel Hershcovich

Main category: cs.CL

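TL;DR: This position paper argues that LLM-based social simulation should use open-ended free-form text to capture viewpoints, topics, and reasoning more realistically, improving measurement validity, reducing researcher bias, and enhancing methodological utility.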

Details

Motivation: Current LLM social simulations are mostly restricted to multiple-choice or short-answer formats, overlooking LLMs' generative capabilities; the authors argue that such closed designs cannot adequately reflect the complexity of real social phenomena. Method: Draws on decades of survey-methodology research and recent NLP advances to argue for the benefits of open-endedness in LLM social simulation. Result: Open-ended generation better captures expressive diversity and individual differences, supports exploration of unanticipated views, improves measurement and experimental design, aids pretesting, and reduces researcher-imposed directive bias. Conclusion: New practices and evaluation frameworks should leverage rather than constrain LLMs' open-ended generation, fostering synergy between NLP and social science. Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.

[33] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Ariel Kamen

Main category: cs.CL

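TL;DR: This study compares ten state-of-the-art large language models on text classification under the IAB 2.2 hierarchical taxonomy. Despite growing model scale, performance on classic metrics is only moderate, and hallucination and category inflation are widespread; a multi-model ensemble substantially improves accuracy and eliminates hallucinations.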

Details

Motivation: Assess the actual performance of current LLMs on structured text categorization and explore how to overcome their limits in accuracy and reliability. Method: Ten LLMs are evaluated consistently on 8,660 human-annotated samples with identical zero-shot prompts, using metrics that include hallucination ratio, inflation ratio, and categorization cost; an ensemble method is also proposed in which multiple LLMs act as independent experts. Result: The models average 34% accuracy, 42% precision, 45% recall, and 41% F1; they frequently overproduce categories (high hallucination and inflation ratios); Gemini 1.5/2.0 Flash and GPT 20B/120B offer the best cost-to-performance balance, and GPT 120B has the lowest hallucination ratio; the ensemble substantially improves accuracy and completely eliminates hallucinations. Conclusion: Scaling or architectural improvements alone do not ensure better categorization accuracy; coordinated orchestration of multiple models may be more effective than a single very large model and may be the key path toward matching or surpassing human-expert performance. Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. 
These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.

[34] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma,Andrei Cojocaru,Neel Kolhe,Bradley Louie,Robin Said Sharif,Haihan Zhang,Vincent Zhuang,Matei Zaharia,Sewon Min

Main category: cs.CL

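TL;DR: This paper presents ProofBench, the first expert-annotated dataset of fine-grained math proof ratings, and ProofGrader, an evaluator built on it that scores LLM-generated proofs on a 0-7 scale, significantly outperforming baselines and approaching human-level performance in a practical selection task.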

Details

Motivation: Evaluation of LLM mathematical reasoning has focused on tasks with verifiable answers; there is no reliable, fine-grained way to evaluate natural language proof generation, so a systematic evaluation framework is needed. Method: Proposes a systematic methodology for developing and validating fine-grained evaluators; builds ProofBench, with 145 competition problems and 435 LLM-generated solutions with expert ratings; explores key axes of the evaluator design space and combines a strong reasoning backbone, reference solutions and marking schemes, and ensembling to build ProofGrader. Result: ProofGrader achieves a mean absolute error (MAE) of 0.926 against expert scores, significantly outperforming baselines; in best-of-n selection (n=16) it reaches an average score of 4.14, closing 78% of the gap between a naive binary evaluator and the human oracle. Conclusion: The work fills the gap in reliable fine-grained evaluation of LLM-generated proofs; ProofBench and ProofGrader provide infrastructure and practical tools for developing and evaluating mathematical reasoning systems. Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. 
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
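The two headline numbers can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the toy score lists are assumptions, not part of ProofBench; only the values 4.14, 2.48, and 4.62 come from the abstract.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator scores and expert scores."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)


def gap_closed(score, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by `score`."""
    return (score - baseline) / (oracle - baseline)


# Toy 0-7 scores (made up for illustration): |7-6| + |3-3| + |0-1| + |5-7| = 4, over 4 proofs.
toy_mae = mean_absolute_error([7, 3, 0, 5], [6, 3, 1, 7])

# Values reported in the abstract: ProofGrader 4.14, binary evaluator 2.48, human oracle 4.62.
closed = gap_closed(4.14, 2.48, 4.62)
print(f"toy MAE = {toy_mae:.2f}, gap closed = {closed:.1%}")
```

The gap computation is (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, which rounds to the 78% stated in the abstract.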

[35] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Fali Wang,Jihai Chen,Shuhua Yang,Ali Al-Lawati,Linli Tang,Hui Liu,Suhang Wang

Main category: cs.CL

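TL;DR: This paper systematically surveys research on collaboration between small language models (SLMs) and large language models (LLMs), proposes a taxonomy organized around four objectives (performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness), summarizes representative methods and design paradigms, and outlines open challenges and future directions toward efficient, secure, and scalable collaboration.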

Details

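Motivation: LLMs are powerful but suffer from high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns, whereas SLMs are lighter and more efficient; collaborative frameworks that combine the strengths of both have become an active research topic. Method: Proposes a taxonomy of four collaboration objectives and systematically reviews existing methods and design paradigms for performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Result: Summarizes the key technical routes and representative methods of SLM-LLM collaboration and identifies the progress and limitations of current research. Conclusion: SLM-LLM collaboration has the potential to address efficiency, privacy, and reliability problems; future work should push toward efficient, secure, and scalable use in real-world settings. Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs' specialization and efficiency with LLMs' generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.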

[36] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

Zhaoyang Shang,Sibo Wei,Jianbin Guo,Rui Zhou,Lifeng Dong,Yin Luo

Main category: cs.CL

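TL;DR: Proposes THTB, a cognitive science-inspired framework that selects high-quality instruction data by combining intrinsic and extrinsic hardness scores, substantially reducing the data needed for fine-tuning while improving domain adaptation.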

Details

Motivation: Existing data selection methods rely too heavily on LLMs' internal knowledge, are weakly interpretable, and generalize poorly, which makes efficient adaptation to specialized domains difficult. Method: Proposes the THTB framework, which combines quality filtering with intrinsic and extrinsic hardness scoring to prioritize higher-level cognitive instructions, providing interpretable and quantifiable criteria for data selection and annotation guidance. Result: Models trained on only 5% of the data outperform full-dataset training; in vertical domains, a model trained on 2% of the data surpasses models trained on much larger datasets, and generalization is better than with LLM-only selection. Conclusion: THTB improves SFT efficiency and domain adaptation and offers an interpretable, quantitative approach to data selection and annotation. Abstract: Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available at https://github.com/DYJG-research/THTB.

[37] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi

Main category: cs.CL

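TL;DR: This paper proposes a comprehensive hierarchical taxonomy of jailbreak attacks on large language models covering 50 strategies, analyzes the prevalence and success rates of different attack types through red teaming, evaluates taxonomy-based automatic detection, and releases a new Italian dataset of multi-turn adversarial dialogues.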

Details

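Motivation: Existing defenses typically focus on single-turn attacks, lack cross-lingual coverage, and rely on limited taxonomies that either fail to capture the diversity of jailbreak strategies or emphasize risk categories rather than specific techniques. A systematic framework is needed to understand how effective jailbreak techniques are. Method: A structured red-teaming challenge is organized to collect multilingual, multi-turn adversarial dialogue data; on this basis, a hierarchical taxonomy of 50 jailbreak strategies is built and grouped into seven broad families (e.g., impersonation, persuasion, privilege escalation); the data are analyzed empirically, a popular LLM's jailbreak detection ability is benchmarked, and taxonomy-guided prompting is explored as a way to improve automatic detection. Result: Presents a comprehensive taxonomy of 50 strategies across seven families; finds that some strategies (such as cognitive overload and goal conflict) are more prevalent and effective in practice; shows that taxonomy-guided prompting can improve jailbreak detection; and releases a new dataset of 1364 multi-turn Italian adversarial dialogues that supports the study of gradually emerging malicious intent. Conclusion: The study systematically extends the understanding of LLM jailbreak techniques; the taxonomy helps identify and defend against diverse attacks, the new dataset supports safety research in multilingual and multi-turn settings, and the results underline the need for more robust defenses. Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.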

[38] Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges

Misam Abbas

Main category: cs.CL

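TL;DR: This paper studies authorship attribution in the era of large language models (LLMs), comparing fixed style embeddings with an instruction-tuned LLM judge (GPT-4o) on an open dataset spanning six domains; each method has its own strengths, which highlights the multidimensional nature of attribution and the need for hybrid strategies.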

Details

Motivation: As LLM-generated text approaches the quality of human writing, distinguishing machine from human authors is increasingly difficult, so effective attribution mechanisms are needed. Method: Uses fixed style embeddings as a baseline and an instruction-tuned GPT-4o as an LLM judge, and compares them on the Human AI Parallel Corpus. Result: Style embeddings are more accurate on GPT-generated text (82% vs. 68%); the LLM judge is slightly better on LLaMA-generated text (85% vs. 81%), but the difference is not significant; the LLM judge performs better on fiction and academic text, while style embeddings lead on spoken transcripts and scripted dialogue. Conclusion: Authorship attribution is a multidimensional problem in which different methods perform differently across genres; future work should combine several strategies to improve overall performance. Abstract: Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms, fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o), on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82% vs. 68%). The LLM judge is slightly better than the Style Embeddings on LLaMA continuations (85% vs. 81%), but the results are not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.

[39] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder,Clément Dumas,Stewart Slocum,Helena Casademunt,Cameron Holmes,Robert West,Neel Nanda

Main category: cs.CL

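TL;DR: This paper studies the activation biases produced by finetuning large language models (LLMs) on narrow domains, and finds that these biases can be identified with model-diffing methods and used to infer what the model was finetuned on. The biases appear to reflect overfitting, and mixing in pretraining data mitigates them. The paper warns about the limits of using narrowly finetuned models as proxies for general finetuning and calls for more realistic research on model diffing, safety, and interpretability.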

Details

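Motivation: Narrow finetuning is widely used to adapt LLMs and to study models with unusual properties, but it can produce strong, readable biases inside the model. The authors aim to characterize these biases and their implications for AI safety and interpretability research, and to flag potential problems in the current research paradigm. Method: Uses model diffing to compare activations before and after finetuning on the first few tokens of random text, identifying and visualizing the bias; steers the model by adding the activation difference to its activations, producing text similar to the finetuning data; and designs an LLM-based interpretability agent, evaluating how it performs with and without access to the bias. Experiments cover several architectures (Gemma, LLaMA, Qwen) and scales (1B-32B) on tasks including false facts and subliminal learning. Result: Narrow finetuning introduces strong, interpretable biases in model activations, and steering generates content similar to the finetuning data; the bias-informed interpretability agent significantly outperforms baselines; mixing in pretraining data weakens the bias, although residual risks remain; results are consistent across architectures and scales. Conclusion: Narrow finetuning leaves clear traces of the training objective in LLMs. This offers a new tool for understanding finetuning and a warning for AI safety and interpretability research: studies that rely on narrowly finetuned models may not be representative of real settings, and more realistic research paradigms are needed. Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing, the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. 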
Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
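The activation-difference analysis described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes activations have already been extracted from a base and a finetuned model at one layer and stored as nested lists (texts x tokens x hidden dimensions), and the names `mean_activation_difference`, `steer`, and `first_k` are hypothetical.

```python
def mean_activation_difference(base_acts, finetuned_acts, first_k):
    """Average (finetuned - base) activation over the first `first_k` tokens of each text.

    Both inputs are lists over texts of lists over tokens of hidden-state vectors.
    """
    d_model = len(base_acts[0][0])
    total = [0.0] * d_model
    count = 0
    for base_text, ft_text in zip(base_acts, finetuned_acts):
        for base_tok, ft_tok in zip(base_text[:first_k], ft_text[:first_k]):
            for i in range(d_model):
                total[i] += ft_tok[i] - base_tok[i]
            count += 1
    return [t / count for t in total]


def steer(activation, direction, scale=1.0):
    """Add the (scaled) difference vector to one activation vector."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy data (assumption): one text, three tokens, two hidden dimensions,
# where finetuning shifted every activation by the same bias [0.5, -1.0].
base = [[[0.0, 1.0], [2.0, 3.0], [9.0, 9.0]]]
finetuned = [[[0.5, 0.0], [2.5, 2.0], [9.5, 8.0]]]
bias = mean_activation_difference(base, finetuned, first_k=2)
steered = steer([1.0, 1.0], bias, scale=2.0)
```

In practice the activations would come from hooks on a real model, and the choice of layer and steering scale would need tuning; both are left out of this sketch.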

[40] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen,John Le,Thai T. Vu,Willy Susilo,Heath Cooper

Main category: cs.CL

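TL;DR: RAID is a framework that generates adversarial suffixes through continuous embedding optimization and refusal-aware regularization, bypassing LLM safety mechanisms while keeping the output fluent.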

Details

Motivation: LLMs perform strongly but remain vulnerable to jailbreak attacks, so systematic methods are needed to probe their safety weaknesses. Method: Relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that induces restricted responses, adds a refusal-aware regularizer to steer away from refusal directions, and preserves semantic coherence; critic-guided decoding then maps the embeddings back to tokens. Result: Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than existing white-box and black-box baselines. Conclusion: Embedding-space regularization is important for understanding and mitigating LLM jailbreak vulnerabilities. Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

[41] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

Nicole Smith-Vaniz,Harper Lyon,Lorraine Steigner,Ben Armstrong,Nicholas Mattei

Main category: cs.CL

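TL;DR: This paper uses Moral Foundations Theory (MFT) to systematically analyze how large language models (LLMs) respond on political and moral questions, examining their inherent ideological leanings and their ability to represent the views of different groups in role-play.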

Details

Motivation: As LLMs are widely used as advice givers in critical areas such as medicine, law, and personal relationships, their potential biases on political and moral issues have drawn attention. Existing work lacks a direct assessment of LLM moral leanings and a comparison with empirical human data, so a systematic analysis of LLM moral and political positions is needed. Method: Quantifies LLM responses along the five MFT dimensions (Harm, Fairness, Ingroup Loyalty, Authority, Purity) and compares them directly with data from existing human moral judgment studies; explicit prompting and demographic-based role-play experiments test how accurately LLMs express different ideological positions. Result: LLM default responses show particular ideological leanings, and under different prompting conditions the models can simulate various political positions to varying degrees; in role-play, LLMs reproduce the moral judgments of different groups with some accuracy but remain affected by training data bias. Conclusion: LLM outputs on moral and political questions show a quantifiable ideological dependency that reflects both inherent bias and simulation ability. The results underline the need for caution about the political and cultural sensitivity of LLM-generated content in high-stakes applications. Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how LLMs respond, and what they say, in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLMs with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in LLM responses or connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. 
We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

[42] Schema for In-Context Learning

Pan Chen,Shaohong Chen,Mark Wang,Shi Xuan Leong,Priscilla Fung,Varinia Bernales,Alan Aspuru-Guzik

Main category: cs.CL

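TL;DR: This paper proposes Schema Activated In-Context Learning (SA-ICL), which draws on schema theory from cognitive science to extract abstract reasoning structures ("schemas") from examples and use them to strengthen LLM reasoning, significantly improving performance on chemistry and physics questions while reducing reliance on large numbers of demonstrations.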

Details

Motivation: Traditional in-context learning lacks an explicit mechanism for knowledge retrieval and transfer at the abstraction level. Inspired by how humans use existing mental frameworks (schemas) to understand new information, the authors aim to build a framework that explicitly models and uses reasoning schemas to improve model reasoning and interpretability. Method: SA-ICL extracts key reasoning steps and their relationships from high-quality examples, builds a lightweight, structured schema template, and uses that schema to augment the model's reasoning on new questions. Result: Experiments show that a range of LLMs struggle to form and use schema-based representations implicitly, but improve significantly with explicit schema guidance from SA-ICL, by up to 36.19%, with less reliance on the number of demonstrations and better interpretability of the reasoning process. Conclusion: SA-ICL unifies several in-context learning strategies and opens a new path for improving human-like reasoning in LLMs. Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model's reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

[43] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Yuanchen Wu,Saurabh Verma,Justin Lee,Fangzhou Xiong,Poppy Zhang,Amel Awadelkarim,Xu Chen,Yubai Yuan,Shawndra Hill

Main category: cs.CL

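TL;DR: Proposes PDO, a label-free prompt optimization framework that uses pairwise preference feedback from an LLM judge to optimize prompts efficiently without labels.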

Details

Motivation: LLMs are sensitive to their input prompts, and traditional automatic prompt optimization depends on high-quality labeled data that is costly and slow to collect. Method: Formulates prompt optimization as a dueling-bandit problem, uses Double Thompson Sampling (D-TS) to select informative prompt pairs to compare, expands the candidate pool through mutation guided by top-performing prompts, and relies on an LLM judge for preference feedback. Result: Experiments on BIG-bench Hard and MS MARCO show that PDO outperforms baselines both with and without partial labels; ablation studies confirm the effectiveness of D-TS and prompt mutation. Conclusion: PDO is an efficient label-free prompt optimization framework that improves prompt performance while reducing manual annotation. Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
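To make the dueling-bandit framing concrete, here is a deliberately simplified sketch of Thompson-sampling pair selection from pairwise win counts. It is not the paper's PDO implementation and omits parts of full Double Thompson Sampling (such as confidence-bound pruning) as well as prompt mutation. The `judge` callable stands in for an LLM judge and is an assumption; here it is a toy rule.

```python
import random


def select_pair(wins, rng):
    """Pick two prompts to duel using Beta-posterior samples of pairwise win rates.

    wins[i][j] counts how often prompt i was preferred over prompt j.
    """
    n = len(wins)
    theta = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # First prompt: most sampled pairwise victories (Copeland score), random tie-break.
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i) for i in range(n)]
    first = max(range(n), key=lambda i: (copeland[i], rng.random()))
    # Second prompt: fresh samples of each challenger's chance to beat the first.
    challengers = [j for j in range(n) if j != first]
    second = max(
        challengers,
        key=lambda j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1),
    )
    return first, second


def run_duels(num_prompts, judge, rounds, seed=0):
    """Run `rounds` duels; judge(a, b) returns True when prompt a is preferred."""
    rng = random.Random(seed)
    wins = [[0] * num_prompts for _ in range(num_prompts)]
    for _ in range(rounds):
        a, b = select_pair(wins, rng)
        if judge(a, b):
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


# Toy judge (assumption): the prompt with the higher index always wins.
wins = run_duels(3, lambda a, b: a > b, rounds=50, seed=1)
```

A real judge would be noisy, which is where sampling from the posterior rather than taking the empirical win rate helps: uncertain pairs keep getting compared until the evidence is clear.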

[44] Interpreting the Latent Structure of Operator Precedence in Language Models

Dharunish Yugeswardeenoo,Harshil Nukala,Cole Blondin,Sean O Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

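TL;DR: This study investigates whether large language models (such as LLaMA 3.2-3B) encode operator precedence in their internal representations when handling arithmetic. Using interpretability techniques, it finds that intermediate results are present in the residual stream and that the model encodes precedence linearly, and it proposes partial embedding swap as a new way to modify precedence.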

Details

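Motivation: LLMs reason well but still struggle with arithmetic. Existing work focuses mostly on outputs or prompting strategies and offers little structural understanding of how models carry out arithmetic internally, so this paper asks whether models encode operator precedence in their internal representations. Method: Uses the open-source instruction-tuned LLaMA 3.2-3B model and a dataset of arithmetic expressions with three operands and two operators and varying parenthesis placement; applies interpretability techniques including logit lens, linear classification probes, and UMAP visualization to trace intermediate results in the residual stream and their relation to operator precedence. Result: Intermediate results are present in the residual stream, particularly after MLP blocks; the model linearly encodes precedence in operator embeddings after the attention layer; the proposed partial embedding swap modifies operator precedence by exchanging high-impact dimensions. Conclusion: LLMs do encode operator precedence in their internal representations, their arithmetic computation can be traced in the residual stream, and partial embedding swap offers a new intervention for controlling model behavior. Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving open the question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings after the attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.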
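A dataset of the kind described (three operands, two operators, varying parenthesis placement) might be generated as sketched below. The operator set, the operand range, and the record fields are assumptions, since the abstract does not specify them; the `intermediate` field records the sub-result a probe would look for in the residual stream.

```python
import itertools
import operator

# Assumed operator set; the abstract does not list the operators used.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
PRECEDENCE = {"+": 1, "-": 1, "*": 2}


def build_examples(a, b, c, op1, op2):
    """Return left-grouped, right-grouped, and unparenthesized variants."""
    f1, f2 = OPS[op1], OPS[op2]
    left_first = {
        "text": f"({a} {op1} {b}) {op2} {c}",
        "intermediate": f1(a, b),
        "result": f2(f1(a, b), c),
    }
    right_first = {
        "text": f"{a} {op1} ({b} {op2} {c})",
        "intermediate": f2(b, c),
        "result": f1(a, f2(b, c)),
    }
    # Without parentheses, standard precedence applies; ties evaluate left to right.
    natural = right_first if PRECEDENCE[op2] > PRECEDENCE[op1] else left_first
    plain = dict(natural, text=f"{a} {op1} {b} {op2} {c}")
    return [left_first, right_first, plain]


dataset = [
    example
    for a, b, c in itertools.product(range(1, 4), repeat=3)
    for op1, op2 in itertools.product(OPS, repeat=2)
    for example in build_examples(a, b, c, op1, op2)
]
```

For `2 + 3 * 4`, the unparenthesized variant carries the intermediate value 12 (from `3 * 4`), while `(2 + 3) * 4` carries 5, which is the contrast a precedence probe would rely on.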

[45] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo,Jiapu Wang,Gongqing Wu,Zhongyuan Wang,Jichen Zhang,Shirui Pan,Xindong Wu

Main category: cs.CL

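TL;DR: This paper proposes a Knowledge Reasoning Language Model (KRLM) that unifies LLM knowledge with knowledge graph (KG) context through a Knowledge Reasoning Language (KRL) instruction format, a KRL tokenizer, a KRL attention layer, and a structure-aware next-entity predictor. It mitigates LLM knowledge distortion and generative hallucination and shows clear advantages on 25 real-world inductive KG reasoning datasets.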

Details

Motivation: When existing LLM-based knowledge graph reasoning methods handle unknown entities and relations in open-domain KGs, the LLM's intrinsic knowledge can be overshadowed by sparse KG context, causing knowledge distortion, and generative hallucinations are hard to constrain, which undermines the credibility of reasoning results. Method: Proposes the Knowledge Reasoning Language Model (KRLM), which (1) designs a KRL instruction format and KRL tokenizer to align LLM knowledge with KG representations; (2) introduces a KRL attention layer that coordinates intrinsic LLM knowledge with KG context through a dynamic knowledge memory mechanism; and (3) designs a structure-aware next-entity predictor that strictly constrains reasoning results to a trustworthy knowledge domain. Result: On 25 real-world inductive KG reasoning datasets, KRLM significantly outperforms existing methods in both zero-shot reasoning and fine-tuning scenarios. Conclusion: By coordinating LLM knowledge with KG context in a unified way, KRLM addresses knowledge distortion and generative hallucination, improves the accuracy and credibility of inductive KG reasoning, and has broad application prospects. Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. 
Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source code is available at https://anonymous.4open.science/r/KRLM-EA36.

[46] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Jingru Lin,Chen Zhang,Stephen Y. Liu,Haizhou Li

Main category: cs.CL

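TL;DR: Proposes RAGCap-Bench, a benchmark for fine-grained evaluation of intermediate task capabilities in agentic RAG systems. It analyzes outputs of existing systems to identify common tasks and core capability requirements, builds a taxonomy of typical errors to design targeted evaluation questions, and shows experimentally that "slow-thinking" models with stronger intermediate capabilities achieve better end-to-end performance.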

Details

Motivation: Existing agentic RAG systems perform poorly on complex multi-hop questions, their intermediate reasoning capabilities are underexplored, and there is no fine-grained benchmark for intermediate tasks. Method: Analyzes outputs of state-of-the-art systems to identify common tasks and the core capabilities they require, builds a taxonomy of LLM errors, and on that basis designs RAGCap-Bench for fine-grained evaluation of intermediate tasks in agentic RAG workflows. Result: Experiments show that the better a "slow-thinking" model performs on RAGCap, the better its end-to-end performance, supporting the benchmark's validity and the importance of intermediate capabilities. Conclusion: RAGCap-Bench can evaluate and encourage the development of key intermediate capabilities in agentic RAG systems, and strengthening these capabilities helps overall system performance. Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

[47] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

María Victoria Carro,Denise Alejandra Mester,Facundo Nieto,Oscar Agustín Stanchi,Guido Ernesto Bergman,Mario Alejandro Leiva,Eitan Sprejer,Luca Nicolás Forziati Gangi,Francisca Gauna Selasco,Juan Gustavo Corvalán,Gerardo I. Simari,María Vanina Martinez

Main category: cs.CL

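TL;DR: This study examines how large language models behave in debates on subjective questions. Models tend to align with the judge persona rather than hold to their own prior beliefs, and sequential debate significantly favors the second speaker; models are more persuasive when defending their own beliefs, yet arguments that go against their beliefs are rated as higher quality in pairwise comparison.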

Details

Motivation: Existing debate experiments mostly use datasets with ground truth and overlook the subjective dimension of lying, namely believing that the defended claim is false. This paper investigates the consistency of LLM beliefs on subjective questions and the persuasion strategies models adopt in debate. Method: Measures the models' prior beliefs before the experiments, designs judge personas that conflict with those beliefs, and compares, under sequential and simultaneous debate protocols, whether models align with the judge (a sycophantic strategy) or keep their original position; the persuasiveness and quality of their arguments are then evaluated. Result: Models tend to align with the judge persona rather than their prior beliefs; sequential debate gives the second debater a significant advantage; models are more persuasive when defending their prior beliefs; yet arguments that go against their beliefs are rated as higher quality in pairwise comparison. Conclusion: The results reveal sycophancy and protocol bias in LLM debate, suggesting that human judges should watch for such behavior in order to provide higher-quality training signals; the findings contribute to building more aligned AI systems and to understanding persuasion dynamics in human-AI interaction. Abstract: The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. 
These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.

[48] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Shrey Pandit,Xuan-Phi Nguyen,Yifei Ming,Austin Xu,Jiayu Wang,Caiming Xiong,Shafiq Joty

Main category: cs.CL

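TL;DR: Proposes a two-pronged data synthesis pipeline that generates high-quality question-answer pairs by progressively increasing task complexity and evaluates them through distillation from strong web agents; experiments show that the resulting dataset is more diverse and yields better performance than existing datasets.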

Details

Motivation: Existing instruction-tuning datasets lack fine-grained control over difficulty and quality and therefore struggle to support long-horizon reasoning; in addition, data effects and training effects are often conflated, which makes it hard to assess the data itself. Method: Designs a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a baseline agent fails; the baseline agent attempts the questions, validates factuality, checks for alternative answers, and enforces filtering; a controlled training setup based on distillation from strong web agents is used to evaluate the data. Result: Although the dataset is smaller, the web agents trained on it perform better on multiple benchmarks; the data has twice the diversity in tool-use actions of existing datasets and avoids repetitive tool-calling behavior. Conclusion: The proposed synthesis method produces high-quality, high-complexity training data that significantly improves web agents' reasoning and interaction on complex online tasks. Abstract: Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
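The "increase complexity until the baseline agent fails" idea can be sketched as a simple control loop. This is only an illustration of the stopping rule: `harden` and `agent_solves` are hypothetical callables standing in for the paper's question-rewriting step and baseline web agent, and the factuality and alternative-answer checks are omitted.

```python
def escalate(seed_question, harden, agent_solves, max_steps=5):
    """Harden a question until the baseline agent fails or the budget runs out.

    Returns (question, steps_taken) on success, or (None, max_steps) when the
    agent still solves the question after `max_steps` hardening rounds.
    """
    question = seed_question
    for step in range(max_steps + 1):
        if not agent_solves(question):
            return question, step
        if step < max_steps:
            question = harden(question)
    return None, max_steps


# Toy stand-ins (assumptions): a question is just an integer difficulty level,
# hardening adds one level, and the agent solves anything below level 3.
kept, steps = escalate(0, harden=lambda q: q + 1, agent_solves=lambda q: q < 3)
dropped, _ = escalate(0, harden=lambda q: q + 1, agent_solves=lambda q: q < 3, max_steps=2)
```

Questions that never defeat the agent within the budget are returned as `None`, so a caller can filter them out rather than keep items that are too easy.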

[49] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

Ivan Lee,Taylor Berg-Kirkpatrick

Main category: cs.CL

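TL;DR: This study challenges the view that readability is what enables small language models (SLMs) to generate coherent text. It finds that statistical simplicity (such as n-gram diversity) predicts learning efficiency and coherence better than readability, and that complex adult-level text can even lead to faster development of coherence.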

Details

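Motivation: Recent work suggests that training very small language models on child-directed corpora lets them generate coherent text, and attributes this to the high readability of the corpora. This paper questions that explanation and asks whether readability or other statistical properties are responsible. Method: Constructs synthetic datasets with matched structure but varied readability, trains small language models on them, compares learning efficiency and generation coherence across readability levels, and uses measures such as n-gram diversity to quantify statistical simplicity. Result: Readability does not predict coherence or learning efficiency; models trained on complex, adult-level text perform comparably or better and develop coherence faster; statistical simplicity measures such as n-gram diversity are better predictors of learnability. Conclusion: The emergence of language model capabilities should not be explained by simple analogy with human cognitive development but by empirical analysis; statistical simplicity, not readability, is the key factor supporting capability development in small models. Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.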
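One common way to measure n-gram diversity is the ratio of distinct n-grams to total n-grams in a token sequence. The sketch below uses that definition as an assumption; the paper may define or normalize the measure differently, and the whitespace tokenization and example sentence are purely illustrative.

```python
def ngram_diversity(tokens, n):
    """Distinct n-grams divided by total n-grams; lower means more repetitive text."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    distinct = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(distinct) / total


# Whitespace tokenization is an assumption made for this toy example.
tokens = "the cat sat on the mat the cat sat".split()
unigram_diversity = ngram_diversity(tokens, 1)  # 5 distinct words out of 9
bigram_diversity = ngram_diversity(tokens, 2)   # 6 distinct bigrams out of 8
```

Under this definition, a corpus that reuses the same word sequences scores lower, which corresponds to the "statistically simpler" end of the scale discussed in the abstract.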

[50] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Yuanhao Li,Keyuan Lai,Tianqi Wang,Qihao Liu,Jiawei Ma,Yuan-Chao Hu

Main category: cs.CL

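TL;DR: This paper proposes Element2Vec, which uses language models to generate global and local (attribute-highlighted) vector representations of chemical elements from Wikipedia text, and pairs them with a self-attention-based test-time training method to improve the accuracy and interpretability of material property prediction.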

Details

Motivation: Accurate property data for elements is crucial for materials design, but many properties are hard to measure directly; traditional methods struggle to model complex relationships, and existing AI methods suffer from hallucinations and a lack of interpretability. Method: Text is parsed from Wikipedia pages, and language models generate a general-purpose Global embedding and a set of attribute-highlighted Local vectors for each element; a test-time training method based on self-attention is designed to reduce regression error. Result: The method represents the complex relationships among the 118 elements, mitigates the computational challenges caused by text-distribution discrepancy and data sparsity, and reduces the prediction error of vanilla regression. Conclusion: Element2Vec offers a new representation-learning route for AI-driven discovery in materials science, improving the accuracy and interpretability of property prediction. Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges also arise from 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by vanilla regression. We hope this work could pave the way for advancing AI-driven discovery in materials science.

[51] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Peng Kuang,Yanli Wang,Xiaoyu Han,Yaowenqi Liu,Kaidi Xu,Haohan Wang

Main category: cs.CL

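TL;DR: Proposes a theoretically grounded weighted aggregation method that calibrates the weighting function for each LLM-PRM pair to improve test-time scaling efficiency, surpassing vanilla weighted majority voting while using only 21.3% of the computation.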

Details

Motivation: In existing test-time scaling, simple majority voting sometimes outperforms selection based on process reward models (PRMs), which suggests that PRM verification signals need to be used more effectively. Method: A theoretical framework shows that the optimal strategy is a weighted aggregation of responses; efficient pre-computation methods calibrate the weighting function for each LLM-PRM pair, and negative weights turn out to play an important role. Result: Experiments across 5 LLMs and 7 PRMs show that the method significantly improves TTS efficiency, surpassing vanilla weighted majority voting while using only 21.3% of the computation. Conclusion: A more intelligent aggregation strategy is a more promising path to performance gains than simply scaling test-time computation. Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
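
To make the idea of weighted aggregation concrete, here is a minimal sketch that compares three ways of weighting sampled answers by their PRM scores. The weighting functions are hypothetical stand-ins: the abstract does not give the paper's calibrated functions, and the `score - 0.5` rule is only an illustrative way to let low-scored responses count against their answer.

```python
from collections import defaultdict


def weighted_vote(answers, prm_scores, weight_fn):
    """Sum weight_fn(score) per distinct answer and return the top answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)


def majority_weight(score):
    """Plain majority voting: every sample counts once, PRM score ignored."""
    return 1.0


def vanilla_weight(score):
    """Vanilla weighted majority voting: weight equals the raw PRM score."""
    return score


def shifted_weight(score):
    """Hypothetical calibrated weight: scores below 0.5 become negative votes."""
    return score - 0.5


# Three lukewarm samples agree on "A"; one confident sample says "B".
answers = ["A", "A", "A", "B"]
prm_scores = [0.4, 0.4, 0.4, 0.9]

by_majority = weighted_vote(answers, prm_scores, majority_weight)
by_vanilla = weighted_vote(answers, prm_scores, vanilla_weight)
by_shifted = weighted_vote(answers, prm_scores, shifted_weight)
```

In this toy case majority voting and vanilla weighting both pick "A", while the weighting that allows negative votes picks "B", which shows how the choice of weighting function alone can change the selected answer.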

[52] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Ye Yuan,Mohammad Amin Shabani,Siqi Liu

Main category: cs.CL

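TL;DR: This paper proposes FACTS, a fast, accurate, and privacy-compliant table summarization method that generates offline templates (SQL queries and Jinja2 templates) to produce reusable, efficient natural language summaries, and outperforms baseline methods on several benchmarks.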

Details

Motivation: Existing table-to-text models require costly fine-tuning and struggle with complex reasoning; prompt-based LLM methods run into token limits and efficiency problems and expose sensitive data; and approaches that rely on decomposition, planning, or manual templates lack robustness and scalability. Method: The FACTS framework uses an agentic workflow to generate offline templates consisting of SQL queries and Jinja2 templates, relying only on the table schema so that reusable natural language summaries can be produced without exposing sensitive data. Result: Experiments on several widely used benchmarks show that FACTS consistently outperforms baseline methods while being fast, accurate, and privacy-compliant. Conclusion: FACTS is a practical and scalable solution for real-world query-focused table summarization. Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
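
The following sketch illustrates why an offline template can be reused across tables that share a schema. It is not the FACTS implementation: the table, query, and template text are invented for illustration, and the standard library's `string.Template` stands in for Jinja2 so the example has no third-party dependencies.

```python
import sqlite3
from string import Template

# Hypothetical offline template for the user query
# "Which region has the highest total sales?".
# Only the schema (sales: region TEXT, amount INTEGER) is needed to write it.
OFFLINE_TEMPLATE = {
    "sql": (
        "SELECT region, SUM(amount) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC LIMIT 1"
    ),
    # string.Template is a stand-in for a Jinja2 template.
    "text": Template("The top region is $region with total sales of $total."),
}


def make_table(rows):
    """Build an in-memory table that follows the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def render_summary(conn, template):
    """Run the template's SQL locally and fill the text template with the result."""
    region, total = conn.execute(template["sql"]).fetchone()
    return template["text"].substitute(region=region, total=total)


# The same template is rendered against two different tables with the same schema.
summary_a = render_summary(
    make_table([("North", 10), ("South", 25), ("North", 5)]), OFFLINE_TEMPLATE
)
summary_b = render_summary(make_table([("East", 7), ("West", 3)]), OFFLINE_TEMPLATE)
```

Because the SQL runs locally and the template was written from the schema alone, no row values need to leave the machine, which is the privacy property the abstract emphasizes.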

[53] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Daniel Adu Worae,Spyridon Mastorakis

Main category: cs.CL

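TL;DR: Proposes an LLM-powered AI agent framework that converts raw packets from IoT network traffic into structured, semantically enriched representations for efficient and holistic interactive analysis.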

Details

Motivation: IoT networks generate large volumes of diverse traffic; traditional isolated threat detection cannot interpret cross-layer behavior and context effectively, so more intelligent analysis is needed. Method: The framework combines feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering, with an LLM-guided AI agent reasoning over the indexed traffic artifacts. Result: Experiments on multiple IoT captures show that hybrid retrieval (lexical plus semantic search with reranking) substantially improves BLEU, ROUGE, METEOR, and BERTScore over dense-only retrieval; system resource overhead is low. Conclusion: The framework achieves holistic, efficient, and accurate interpretation of IoT traffic with human-readable results, making it suitable for practical deployment. Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
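
The abstract does not say how the lexical and semantic result lists are combined. As one plausible illustration, the sketch below merges two ranked lists with reciprocal rank fusion (RRF), a widely used fusion rule that is swapped in here as an assumption; the document identifiers and the constant `k=60` are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) in every list where it appears;
    documents are returned by descending fused score, ties broken by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one question over indexed traffic artifacts.
lexical_hits = ["flow_12", "pkt_3", "flow_7"]   # e.g. from keyword search
semantic_hits = ["pkt_3", "flow_9", "flow_12"]  # e.g. from dense embeddings

fused = reciprocal_rank_fusion([lexical_hits, semantic_hits])
```

Artifacts found by both retrievers rise to the top of the fused list, and a reranker could then reorder only this short candidate set.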

[54] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

Congying Liu,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui

Main category: cs.CL

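TL;DR: This paper proposes BioMedSearch, an LLM-based multi-source biomedical information retrieval framework that integrates literature, protein databases, and web search to improve accuracy on complex biomedical questions.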

Details

Motivation: LLM-generated biomedical content often lacks scientific rigor and tends to fabricate protein functions and interactions that deviate from real data, so a method is needed that can access authoritative databases and fuse multi-source information to improve answer accuracy. Method: BioMedSearch integrates literature retrieval, protein databases, and web search through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering to answer complex biomedical questions. Result: On BioMedMCQs, a multi-level dataset of 3,000 questions, BioMedSearch improves average accuracy over baseline models at all three reasoning levels: from 59.1% to 91.9% at Level 1, from 47.0% to 81.0% at Level 2, and from 36.3% to 73.4% at Level 3. Conclusion: BioMedSearch improves the accuracy and reliability of LLMs on complex biomedical question answering and shows the potential of multi-source information fusion for domain-specific QA systems. Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database access, and web search to support accurate and efficient handling of complex biomedical queries. Through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search

[55] LLMs Can Get "Brain Rot"!

Shuo Xing,Junyuan Hong,Yifan Wang,Runjin Chen,Zhenyu Zhang,Ananth Grama,Zhengzhong Tu,Zhangyang Wang

Main category: cs.CL

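TL;DR: This paper proposes the LLM Brain Rot Hypothesis: continual exposure to low-quality web text causes cognitive decline in large language models. Controlled experiments on real Twitter/X corpora show that continual pre-training on junk data significantly degrades reasoning, long-context understanding, and safety, and inflates "dark traits". The study finds that data quality is a causal driver of capability decay, argues that data curation should be treated as a training-time safety problem, and recommends routine "cognitive health checks" for deployed LLMs.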

Details

Motivation: As LLMs are continually pre-trained on massive web text, the low-quality, highly sensational content in that data may damage their cognitive abilities, yet causal evidence on how data quality affects model capability is lacking. This paper tests the Brain Rot hypothesis through controlled experiments and examines the lasting effects of data quality on model performance. Method: Two orthogonal operationalizations of data quality are defined, M1 (engagement degree) and M2 (semantic quality), and junk and control datasets with matched scale and training operations are built from real Twitter/X corpora. Four LLMs are continually pre-trained, and the analysis covers performance changes across tasks, error types, recovery attempts, and dose-response relationships. Result: Continual training on junk data causes non-trivial declines (Hedges' g > 0.3) on reasoning, long-context understanding, and safety, and inflates "dark traits". Dose-response experiments show performance falling as the junk ratio rises (e.g., ARC-Challenge drops from 74.9 to 57.2). Error analysis identifies thought-skipping as the primary failure, and later instruction tuning or clean-data training repairs the damage only partially. Tweet popularity is a better indicator of the Brain Rot effect than tweet length. Conclusion: Data quality is a causal driver of LLM capability decay, and continued use of low-quality data induces persistent representational drift akin to cognitive decline. Data curation is therefore a training-time safety problem and not only a performance issue, and routine "cognitive health checks" are needed to keep models reliable. Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. In contrast to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' $g>0.3$) on reasoning, long-context understanding, and safety, and inflates "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain of Thought drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0\%$ to $100\%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a training-time safety problem and motivating routine "cognitive health checks" for deployed LLMs.
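
The effect-size threshold quoted above (Hedges' g > 0.3) can be reproduced on any two groups of scores with the standard formula: Cohen's d computed with a pooled standard deviation, multiplied by the usual small-sample correction. The sketch below uses invented benchmark scores; how the paper groups runs and which direction it treats as positive are not stated in the abstract and are assumptions here.

```python
import math
from statistics import mean, variance


def hedges_g(treated, control):
    """Hedges' g: bias-corrected standardized mean difference (treated - control)."""
    n1, n2 = len(treated), len(control)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    cohens_d = (mean(treated) - mean(control)) / math.sqrt(pooled_var)
    # Common approximation of the small-sample correction factor J.
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return cohens_d * correction


# Hypothetical accuracy scores for models trained on junk vs. control data.
junk_scores = [60, 62, 58, 61]
control_scores = [70, 72, 69, 73]
g = hedges_g(junk_scores, control_scores)  # negative: junk group scores lower
```

With only a handful of runs per condition the correction factor matters: here it shrinks the raw standardized difference by roughly 13%.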

[56] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

Siying Liu,Shisheng Zhang,Indu Bala

Main category: cs.CL

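TL;DR: This study examines socio-demographic bias in LLM drug-safety prediction. It finds that models more often assign higher adverse-event risk to disadvantaged groups, and it identifies two modes of bias, explicit and implicit.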

Details

Motivation: Socio-demographic attributes are clinically irrelevant to adverse drug events, yet it is unclear whether LLMs incorporate them into predictions, which could undermine fairness and reliability in real medical settings. Method: Using data from the US FDA Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, the study tests two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across personas defined by attributes such as education, marital status, and insurance, and analyzes the effect of three user roles (general practitioner, specialist, patient). Result: Disadvantaged groups (e.g., low education, unstable housing) are systematically assigned higher predicted adverse-event likelihoods; two modes of bias are identified: explicit bias (reasoning that directly references persona attributes) and implicit bias (inconsistent predictions without explicit mention of the attributes). Conclusion: LLMs pose serious fairness risks in pharmacovigilance applications, and fairness-aware evaluation protocols and mitigation strategies are urgently needed before clinical deployment. Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.

[57] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Kenan Alkiek,David Jurgens,Vinod Vydiswaran

Main category: cs.CL

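TL;DR: Proposes instruction intervention at inference time, which lets small language models substantially improve multi-step reasoning on medical, legal, and math tasks without any fine-tuning.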

Details

Motivation: Small language models (SLMs) are attractive for their efficiency, low cost, and privacy benefits, but they perform poorly on tasks that need multi-step reasoning or domain knowledge; this paper addresses that limitation. Method: An Instruction Corpus is built by grouping similar questions and generating structured reasoning steps with GPT-5; at inference time, the SLM retrieves the most relevant instructions and follows their steps, which gives it structured guidance for reasoning. Result: Gains of 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA; concise instructions outperform longer ones, and the effect depends on model family and intrinsic reasoning ability. Conclusion: Instruction retrieval is an effective way to strengthen complex reasoning in small language models without additional training, and suits resource-constrained settings. Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
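
A minimal sketch of the retrieve-then-follow loop described above, under stated assumptions: the two-entry corpus is invented, bag-of-words cosine similarity stands in for whatever retriever the paper uses, and the prompt layout is hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately simple encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical Instruction Corpus: one representative question per group,
# plus the reasoning steps that a stronger model would have written offline.
INSTRUCTION_CORPUS = [
    {
        "question": "what is the speed of a train given distance and time",
        "steps": ["Identify distance and time.", "Divide distance by time."],
    },
    {
        "question": "which drug treats a bacterial infection in a patient",
        "steps": ["Identify the pathogen.", "Match it to a drug class."],
    },
]


def retrieve_instruction(question, corpus):
    """Return the corpus entry whose representative question is most similar."""
    query = bag_of_words(question)
    return max(corpus, key=lambda e: cosine(query, bag_of_words(e["question"])))


def build_prompt(question, corpus):
    """Prepend the retrieved steps to the question for the small model."""
    entry = retrieve_instruction(question, corpus)
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(entry["steps"], start=1))
    return f"Follow these steps:\n{steps}\n\nQuestion: {question}"


prompt = build_prompt(
    "a train travels 120 km in 2 hours what is its speed", INSTRUCTION_CORPUS
)
```

The small model receives a procedure to follow instead of a text passage to read, which is the distinction from retrieval-augmented generation drawn in the abstract.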

[58] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Fengbin Zhu,Xiang Yao Ng,Ziyang Liu,Chang Liu,Xianwei Zeng,Chao Wang,Tianhui Tan,Xuan Yao,Pengyang Shao,Min Xu,Zixuan Wang,Jing Wang,Xin Lin,Junfeng Li,Jingxian Zhu,Yang Zhang,Wenjie Wang,Fuli Feng,Richang Hong,Huanbo Luan,Ke-Wei Huang,Tat-Seng Chua

Main category: cs.CL

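TL;DR: This paper proposes HisRubric, a new evaluation framework, and builds the FinDeepResearch benchmark to systematically evaluate Deep Research (DR) agents on corporate financial analysis across markets and languages.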

Details

Motivation: Existing literature lacks a systematic evaluation of DR agents' capabilities in critical research analysis, so a rigorous framework is needed to measure their performance on complex financial analysis tasks. Method: The HisRubric framework uses a hierarchical analytical structure and a fine-grained grading rubric that mirror a professional analyst's workflow; on top of it, the FinDeepResearch benchmark covers 64 listed companies from 8 financial markets across 4 languages, with 15,808 grading items in total, and 16 representative methods are evaluated. Result: The experiments reveal the strengths and limitations of the different approaches (DR agents, LLMs with both deep reasoning and search, and LLMs with deep reasoning only) across capabilities, financial markets, and languages. Conclusion: HisRubric and FinDeepResearch provide effective tools for evaluating DR agents, and the results offer useful insights for the future development of DR agents and LLMs in financial research. Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR agents' capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents' capabilities in corporate financial analysis. This framework mirrors the professional analyst's workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.

[59] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon

Main category: cs.CL

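TL;DR: This study compares AI models (ChatGPT, Claude, Gemini) with professional writers at emulating the styles of 50 award-winning authors. Experts strongly disfavored AI text produced by in-context prompting on both stylistic fidelity and writing quality, but after author-specific fine-tuning the result reversed: the AI output was preferred over the human writers and was rarely detected as AI-generated. Fine-tuning also cut costs sharply, and the study offers empirical evidence relevant to the fourth factor of copyright fair use.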

Details

Motivation: To examine whether AI can produce high-quality writing that emulates the styles of copyrighted authors, and to assess what that implies for the market-effect question in copyright law. Method: In a preregistered study, MFA-trained writers and three frontier AI models (ChatGPT, Claude, Gemini) wrote excerpts of up to 450 words emulating the styles of 50 award-winning authors; 159 expert and lay readers judged them in blind pairwise evaluations; in-context prompting was compared with ChatGPT fine-tuned on individual authors, and AI detectors and mediation analysis were used to investigate the differences. Result: Experts strongly disfavored in-context AI text on stylistic fidelity (OR=0.16) and writing quality (OR=0.13), but after fine-tuning on an author's works the result reversed and experts preferred the fine-tuned output (style OR=8.16, quality OR=1.87), with lay readers showing similar shifts; only 3% of fine-tuned outputs were flagged as AI-generated, against 97% for in-context prompting; mediation analysis shows that fine-tuning removes AI stylistic quirks such as cliche density. Conclusion: Author-specific fine-tuning yields non-verbatim AI text that readers prefer to expert human writing and that is rarely identified as AI-generated, at a far lower cost; this is empirical evidence directly relevant to the fourth fair-use factor, the effect on the potential market for or value of the source works. Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear whether these models can generate high quality literary text while emulating authors' styles. To answer this, we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing excerpts of up to 450 words emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate vs. 97% for in-context prompting) by the best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.

[60] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Zhen Yang,Mingyang Zhang,Feng Chen,Ganggui Ding,Liang Hou,Xin Tao,Pengfei Wan,Ying-Cong Chen

Main category: cs.CL

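TL;DR: Proposes Minimal Test-Time Intervention (MTI), a training-free framework that intervenes selectively at high-uncertainty positions during inference, improving the reasoning accuracy and stability of large models across a range of tasks while remaining efficient.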

Details

Motivation: Existing inference-time methods often trade efficiency for performance and have not exploited the fact that reasoning uncertainty is highly localized. Method: MTI combines selective classifier-free guidance (Selective CFG), applied only at high-entropy positions, with lightweight negative-prompt guidance that reuses the main model's KV cache to approximate unconditional decoding efficiently. Result: Consistent gains on general, coding, and STEM tasks, e.g., an average improvement of 1.35% on eight benchmarks for Qwen3-8B-Base and 5% on AIME2024 for Qwen3-32B-Reasoning. Conclusion: By minimizing the number of intervened positions, MTI improves reasoning performance with almost no extra compute, which confirms the value of exploiting the locality of reasoning uncertainty. Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
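
The selective part of the method can be sketched per decoding step: compute the entropy of the conditional next-token distribution and apply classifier-free guidance only when it exceeds a threshold. The standard CFG combination `uncond + scale * (cond - uncond)` is used; the threshold, the guidance scale, and the toy logits are assumptions, and the KV-cache reuse for the negative-prompt branch is not modelled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_cfg_step(cond_logits, uncond_logits, threshold=1.0, scale=1.5):
    """Apply CFG only at high-entropy positions.

    Returns (logits_to_sample_from, whether_guidance_was_applied).
    threshold and scale are illustrative values, not the paper's settings.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits), False  # confident position: leave untouched
    guided = [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    return guided, True


# Confident position: one token dominates, so no intervention happens.
confident, applied_confident = selective_cfg_step(
    [10.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0]
)

# Uncertain position: uniform conditional logits, so guidance is applied.
uncertain, applied_uncertain = selective_cfg_step(
    [1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0]
)
```

Because the second forward pass is needed only at the few positions where guidance fires, the overhead stays close to that of plain decoding.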

[61] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

Kin Kwan Leung,Mouloud Belbahri,Yi Sui,Alex Labach,Xueying Zhang,Stephen Rose,Jesse C. Cresswell

Main category: cs.CL

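TL;DR: Proposes a taxonomy of the error types that occur in real-world retrieval-augmented generation (RAG) systems, with examples and remedies for each, and releases a dataset annotated by error type along with an auto-evaluation method aligned with the taxonomy.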

Details

Motivation: Real-world RAG systems are complex, so erroneous outputs have many potential causes; understanding these errors is crucial for robust deployment. Method: The paper proposes a new taxonomy of RAG error types, curates a dataset of erroneous responses annotated by error type, and develops an auto-evaluation method aligned with the taxonomy. Result: A detailed RAG error taxonomy, an annotated dataset, and an auto-evaluation method that can be used in practice to track and address errors during development. Conclusion: The work helps practitioners understand and mitigate errors in RAG systems and improves the reliability and maintainability of LLM-based question-answering systems. Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.

[62] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Lukas Gienapp,Christopher Schröder,Stefan Schweter,Christopher Akiki,Ferdinand Schlatt,Arden Zimmermann,Phillipe Genêt,Martin Potthast

Main category: cs.CL

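TL;DR: This paper introduces the German Commons, the largest collection of openly licensed German text to date, with 154.56 billion tokens from 41 sources spanning legal, scientific, cultural, and other domains. It aims to address the scarcity of openly licensed training data for non-English languages.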

Details

Motivation: LLM development relies on large-scale training corpora, but most contain data of unclear licensing status, which limits the development of truly open models; the problem is worse for non-English languages. Method: Data is systematically collected from 41 sources with verifiable licensing across seven domains and processed with a pipeline of comprehensive quality filtering, deduplication, and text formatting fixes to build a high-quality German text collection. Result: A high-quality German text dataset of 154.56 billion tokens under open licenses (at least CC-BY-SA 4.0 or equivalent), with every subset legally usable for model training and redistribution. Conclusion: The German Commons fills a critical gap in openly licensed German pretraining data, supports the development of truly open German language models, and comes with reproducible and extensible construction code. Abstract: Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.

[63] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Shehenaz Hossain,Haithem Afli

Main category: cs.CL

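TL;DR: This paper proposes CRaFT, an explanation-based multilingual evaluation framework for assessing how large language models reason across cultural contexts. Unlike traditional methods that rely on answer accuracy alone, CRaFT scores model explanations with four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. Three models (GPT, DeepSeek, and FANAR) are evaluated on 50 culturally grounded questions in Arabic, Bengali, and Spanish. Language has a significant effect on cultural reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains stable. GPT adapts better across languages but is less consistent, while FANAR shows stable but rigid reasoning. The findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing.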

Details

Motivation: Existing evaluation methods rely too heavily on answer correctness and do not reflect how well a model understands cultural context, so a finer-grained framework is needed to show how LLMs reason about culture in multilingual settings. Method: The CRaFT framework evaluates model-generated explanations with four interpretable metrics (Cultural Fluency, Deviation, Consistency, Linguistic Adaptation). Fifty culturally grounded questions from the World Values Survey are translated into Arabic, Bengali, and Spanish, and more than 2,100 answer-explanation pairs from GPT, DeepSeek, and FANAR are analyzed. Result: Language significantly affects cultural reasoning: Arabic reduces cultural fluency, Bengali enhances it, and Spanish is largely stable. GPT adapts better across languages but is less consistent; FANAR reasons more stably but with less flexibility. Conclusion: Cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating and improving cultural adaptation in multilingual settings and supports the development of more culturally sensitive AI systems. Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

[64] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

César Guerra-Solano,Zhuochun Li,Xiang Lorraine Li

Main category: cs.CL

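TL;DR: This paper proposes GlobalGroup, a new task for evaluating the abstract reasoning of large language models across several languages. It finds that English modalities lead to better performance and that open- and closed-source models differ in performance.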

Details

Motivation: Prior work mostly studies reasoning tasks where knowledge or strategies can ensure success and has largely overlooked linguistic bias in abstract reasoning that does not follow formulaic patterns, so a cross-lingual abstract reasoning task is needed. Method: Inspired by the New York Times game Connections, the GlobalGroup benchmark covers English, Spanish, Chinese, Hindi, and Arabic, each in a native-language version and an English translation, and introduces game difficulty measurements to control the comparison. Result: In this abstract reasoning task, English modalities largely lead to better performance, and there are performance disparities between open- and closed-source models. Conclusion: LLMs show language-related performance bias in abstract reasoning, with an advantage for English; future model development should pay more attention to multilingual fairness and balanced abstract reasoning ability. Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance in this abstract reasoning task, and we observe performance disparities between open- and closed-source models.

[65] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

George Flint,Kaustubh Kislay

Main category: cs.CL

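TL;DR: This study uses a distributional approach to quantify phonosemantic iconicity at scale in six diverse languages. It discovers new interpretable phonosemantic alignments and crosslinguistic patterns, and it finds support for some previously hypothesized alignments.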

Details

Motivation: To examine how far systematic relationships between phonetics and semantics (iconicity) show up in large-scale quantitative analysis, for both previously identified and unidentified phenomena. Method: A distributional approach is applied to six languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil), using a suite of statistical measures to analyze the alignment of morphemes' phonetic and semantic similarity spaces. Result: The analysis uncovers an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns; of five previously hypothesized alignments, some are supported and others show mixed results. Conclusion: Systematic associations between phonetics and semantics exist and can be identified at scale in multiple languages, which suggests that iconicity plays a broader role in language. Abstract: Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large-scale quantitative investigations, both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.

[66] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Haziq Mohammad Khalid,Athikash Jeyaganthan,Timothy Do,Yicheng Fu,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

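TL;DR: ERGO is an entropy-based generation optimization method that monitors a language model's uncertainty in multi-turn conversations (measured by Shannon entropy) and dynamically triggers prompt consolidation, significantly improving performance, accuracy, and reliability.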

Details

Motivation: LLMs degrade significantly in multi-turn conversations where information is presented incrementally, which hampers practical use. The authors aim to address this and improve usability in real interaction settings. Method: ERGO continuously quantifies model uncertainty via the Shannon entropy of the next-token probability distribution and triggers adaptive prompt consolidation when a sharp rise in entropy is detected, dynamically realigning the conversational context. Result: On multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over baselines, a 24.7% increase in aptitude (peak performance capability), and a 35.3% decrease in unreliability. Conclusion: Modeling and responding to uncertainty as a first-class signal, rather than simply eliminating it, can improve the accuracy and reliability of conversational AI. Abstract: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real-world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first-class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty-aware interventions can improve both accuracy and reliability in conversational AI.
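The entropy signal described in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: averaging per-token entropy over a turn, the function names, and the `threshold` value of 0.5 nats are all assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_turn_entropy(token_distributions):
    """Average entropy over the distributions of the tokens generated in one turn.

    Averaging per turn is an assumption of this sketch.
    """
    if not token_distributions:
        raise ValueError("a turn needs at least one token distribution")
    entropies = [shannon_entropy(dist) for dist in token_distributions]
    return sum(entropies) / len(entropies)


def should_consolidate(prev_entropy, curr_entropy, threshold=0.5):
    """Hypothetical trigger: fire when entropy rises by more than `threshold` nats."""
    return (curr_entropy - prev_entropy) > threshold


# Toy distributions over a 4-token vocabulary (illustrative values only).
confident_turn = [[0.97, 0.01, 0.01, 0.01]]
uncertain_turn = [[0.25, 0.25, 0.25, 0.25]]
spike = should_consolidate(
    mean_turn_entropy(confident_turn), mean_turn_entropy(uncertain_turn)
)
```

When `spike` is true, a system in the spirit of ERGO would rewrite the accumulated conversation into a single consolidated prompt; that rewriting step is outside this sketch.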

[67] DROID: Dual Representation for Out-of-Scope Intent Detection

Wael Rashwan,Hossam M. Zawbaa,Sourav Dutta,Haytham Assem

Main category: cs.CL

TL;DR: Proposes the DROID framework, which achieves robust out-of-scope intent detection with dual encoders and simple calibration.

Details

Motivation: Existing methods rely on strong distributional assumptions or auxiliary calibration modules and struggle to detect out-of-scope intents effectively. Method: DROID combines the Universal Sentence Encoder (USE) with a domain-adapted Transformer-based Denoising Autoencoder (TSDAE), fuses the two representations, classifies them with a lightweight branched classifier, and adds synthetic and open-domain outlier augmentation to improve boundary learning. Result: It significantly outperforms existing methods on multiple intent benchmarks, improving macro-F1 by 6-15% for known intents and 8-20% for OOS intents, with particularly strong results in low-resource settings. Conclusion: Dual-encoder representations combined with simple calibration can deliver efficient, scalable, and reliable out-of-scope detection for neural dialogue systems. Abstract: Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders -- the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6--15% for known and 8--20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.

[68] Toward Cybersecurity-Expert Small Language Models

Matan Levi,Daniel Ohayon,Ariel Blobstein,Ravid Sagi,Ian Molloy,Yair Allouche

Main category: cs.CL

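TL;DR: CyberPal 2.0 is a family of small language models for cybersecurity (4B-20B parameters). Trained on a high-quality, task-oriented chain-of-thought dataset, the models match or surpass existing large models on multiple cybersecurity tasks while offering deployment advantages.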

Details

Motivation: LLM adoption in cybersecurity lags because high-quality, domain-specific models and training data are lacking, so efficient small language models built specifically for this domain are needed. Method: The authors propose the CyberPal 2.0 model family and use the SecKnowledge 2.0 data enrichment and formatting pipeline to generate a security instruction dataset rich in expert guidance and multi-step reasoning, enabling high-quality chain-of-thought training. Result: CyberPal 2.0 outperforms its baselines on multiple cybersecurity benchmarks; on threat intelligence tasks it ranks second only to Sec-Gemini v1; on threat-investigation tasks the 20B model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1 to rank first, with the 4B model ranking second. Conclusion: At a fraction of the size of mainstream large models, CyberPal 2.0 achieves leading performance on cybersecurity tasks, showing the potential of specialized small models in the security domain. Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.

[69] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

Darko Sasanski,Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov

Main category: cs.CL

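TL;DR: This paper presents the first systematic construction of a Macedonian recipe dataset, using web scraping and structured parsing to handle heterogeneous ingredient descriptions, and analyzes the distinctive ingredient combinations of Macedonian cuisine.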

Details

Motivation: Macedonian recipes are under-represented in existing digital research, and no high-quality dataset is available to support computational gastronomy studies. Method: The authors scrape Macedonian recipe websites, apply structured parsing, normalize ingredient units, quantities, and descriptors, and analyze ingredient frequency and co-occurrence patterns with Pointwise Mutual Information (PMI) and Lift score. Result: They build the first Macedonian recipe dataset and reveal ingredient pairings characteristic of Macedonian cuisine, such as frequent co-occurrence of particular cheeses and grains. Conclusion: The dataset is a new resource for studying food culture in underrepresented languages and helps advance computational gastronomy worldwide. Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.
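For readers unfamiliar with the two co-occurrence measures named in this entry, the sketch below computes them from recipe-level counts using the standard definitions: Lift(a, b) = P(a, b) / (P(a) P(b)) and PMI(a, b) = log2 Lift(a, b). The toy recipes and the function name `cooccurrence_scores` are illustrative assumptions, not data or code from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(recipes):
    """Return {(a, b): {"lift": ..., "pmi": ...}} for ingredient pairs that co-occur.

    Probabilities are estimated as the fraction of recipes containing an
    ingredient (or a pair of ingredients).
    """
    n = len(recipes)
    item_counts = Counter()
    pair_counts = Counter()
    for ingredients in recipes:
        unique = sorted(set(ingredients))  # count each ingredient once per recipe
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), count in pair_counts.items():
        lift = (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        scores[(a, b)] = {"lift": lift, "pmi": math.log2(lift)}
    return scores


# Toy, made-up recipes (not from the dataset).
toy_recipes = [
    ["cheese", "flour"],
    ["cheese", "flour", "egg"],
    ["egg", "milk"],
    ["milk", "sugar"],
]
toy_scores = cooccurrence_scores(toy_recipes)
```

A Lift above 1 (PMI above 0) means the pair appears together more often than independence would predict, which is the sense in which a combination is "distinctive".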

[70] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Zhichao Wang,Andy Wong,Ruslan Belkin

Main category: cs.CL

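TL;DR: Proposes RLSR, which uses a reinforcement learning framework to replace or complement SFT, with cosine similarity in semantic embedding space as the reward signal, improving instruction following in large models and clearly outperforming conventional SFT on AlpacaEval.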

Details

Motivation: To make full use of SFT datasets within a reinforcement learning framework and improve the base model's instruction following, while overcoming the respective limitations of SFT and RFT. Method: RLSR uses a reinforcement learning framework in which the reward is the cosine similarity, in semantic embedding space, between generated responses and human-labeled responses; it can replace SFT or be combined with it. Result: On Qwen-7B (INFINITY), RLSR (SB) reaches an AlpacaEval win rate of 26.34%, exceeding SFT's 21.01%; the SFT + RLSR combination reaches 30.73%, further improving downstream task performance. Conclusion: RLSR uses SFT datasets more effectively, improves instruction following through reinforcement learning, and works even better when combined with SFT, offering a new direction for model fine-tuning. Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model's instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks. For example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT's 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
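The reward described in this entry, cosine similarity between a generated response and the human-labeled response in embedding space, can be sketched as follows. The embedding vectors are toy values; which embedding model RLSR uses, and how rewards feed the policy update, are not specified here, so `similarity_rewards` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_rewards(candidate_embeddings, reference_embedding):
    """One reward per sampled response: similarity to the human-labeled reference."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


# Toy 2-d "embeddings" for three sampled responses and one reference.
rewards = similarity_rewards([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], [1.0, 0.0])
```

In this toy case the first candidate points in the same direction as the reference and receives the highest reward, while the orthogonal third candidate receives zero.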

Bingsheng Yao,Bo Sun,Yuanzhe Dong,Yuxuan Lu,Dakuo Wang

Main category: cs.CL

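TL;DR: This paper proposes the Dynamic Persona Refinement Framework (DPRF), which iteratively identifies and corrects cognitive divergences between behaviors generated by LLM role-playing agents and real human behaviors, improving behavioral alignment, and validates its effectiveness across multiple scenarios.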

Details

Motivation: Existing LLM role-playing agents suffer from low persona fidelity because they use manually created persona profiles that are not validated for consistency with the target individual's behavior. Method: DPRF iteratively identifies cognitive divergences between generated behaviors and real human behaviors through free-form or theory-grounded structured analysis, and refines the persona profile to reduce those divergences. Result: Across five LLMs and four behavior-prediction scenarios (formal debates, social media posts with mental health issues, public interviews, and movie reviews), DPRF consistently and considerably improves behavioral alignment and generalizes across models and scenarios. Conclusion: DPRF provides a robust method for building high-fidelity persona profiles and strengthens the validity of downstream applications such as user simulation, social studies, and personalized AI. Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

[72] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Beomseok Kang,Jiwon Song,Jae-Joon Kim

Main category: cs.CL

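TL;DR: Proposes the LiteStage framework, which accelerates multi-stage reasoning through stage-aware layer skipping and an online confidence-based early exit, delivering a clear speedup over existing methods at a small accuracy cost.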

Details

Motivation: Existing adaptive acceleration techniques struggle to balance efficiency and accuracy in multi-stage reasoning because skip sensitivity varies across stages and redundant output is generated. Method: LiteStage combines a stage-aware offline search for layer budgets with an online confidence-based generation early exit, dynamically optimizing the layer-skipping strategy. Result: Experiments on the OBQA, CSQA, and StrategyQA benchmarks show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss. Conclusion: LiteStage improves the efficiency of small language models in multi-stage reasoning, lowering latency while keeping accuracy high. Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

[73] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani

Main category: cs.CL

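TL;DR: This paper proposes Flip-Flop Consistency (F^2C), an unsupervised training method that uses a consensus cross-entropy and a representation alignment loss to improve LLM robustness and consistency across prompt variations.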

Details

Motivation: LLMs often give inconsistent answers to different phrasings of the same question, which undermines their reliability, so robustness to prompt perturbations needs to improve. Method: F^2C has two components: Consensus Cross-Entropy (CCE), which uses a majority vote to create a hard pseudo-label, and a representation alignment loss that pulls lower-confidence and non-majority predictions toward the high-confidence majority. Result: Across 11 datasets spanning four NLP tasks, F^2C raises agreement by 11.62% on average, improves F1 by 8.94%, and reduces performance variance across formats by 3.29%; it also generalizes well to out-of-domain data and unseen prompt formats. Conclusion: F^2C is an effective unsupervised method that markedly improves LLM output consistency, performance, and generalization under prompt perturbations. Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
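The first component in this entry, a majority-vote pseudo-label followed by a cross-entropy against it, can be sketched in a few lines. This is an illustrative reading of the abstract, not the released code: the tie-breaking rule, the unweighted mean over variations, and all function names are assumptions, and the representation alignment loss is omitted.

```python
import math
from collections import Counter


def argmax(row):
    """Index of the largest probability in one prediction row."""
    return max(range(len(row)), key=row.__getitem__)


def majority_pseudo_label(predictions):
    """Hard pseudo-label: the class predicted by most prompt variations.

    Ties resolve to the class seen first (an assumption of this sketch).
    """
    return Counter(predictions).most_common(1)[0][0]


def consensus_cross_entropy(prob_rows, pseudo_label):
    """Mean negative log-likelihood of the pseudo-label over all prompt variations."""
    return -sum(math.log(row[pseudo_label]) for row in prob_rows) / len(prob_rows)


# Class probabilities from three phrasings of the same question (toy values).
variation_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
label = majority_pseudo_label([argmax(row) for row in variation_probs])
loss = consensus_cross_entropy(variation_probs, label)
```

Here two of three variations vote for class 0, so the third variation, which preferred class 1, contributes the largest share of the loss and is pushed toward the consensus.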

[74] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

Jihao Zhao,Zhiyuan Ji,Simin Niu,Hanyu Wang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

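TL;DR: This paper proposes MoM, a framework that simulates the human cognitive process of reading to turn the passive text chunking of conventional RAG into proactive document memory extraction, improving the comprehension and reasoning of both large and small models on multi-domain documents.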

Details

Motivation: Conventional RAG relies on passively retrieved text chunks, which limits the depth of knowledge internalization and reasoning and fails to reflect the cognitive process of human reading. Method: The MoM framework has an LLM simulate domain experts to generate logical outlines of documents, which guide structured chunking and core content extraction; it uses multi-path sampling and multi-perspective evaluation, with metrics for chunk clarity and extraction completeness, to select the best document memories; it adds a reverse reasoning strategy that deduces refined expert thinking paths from high-quality outcomes for training small models; and it builds a three-layer document memory retrieval mechanism supported by a probabilistic modeling argument. Result: Experiments across three domains show that MoM resolves the text chunking problems of existing RAG systems, gives LLMs semantically complete document memories, and markedly improves the proactive reading and comprehension of small language models. Conclusion: MoM moves from passive retrieval to proactive understanding, advances human-like reading ability in small models, and offers a new path toward more intelligent, human-centric text processing systems. Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

[75] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Rahul Nadkarni,Yanai Elazar,Hila Gonen,Noah A. Smith

Main category: cs.CL

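TL;DR: Proposes an experimental recipe that intervenes on training data and retrains models to study the relationship between data and language model behavior.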

Details

Motivation: To understand how training data shapes language model behavior, especially the acquisition of factual knowledge. Method: The recipe consists of selecting evaluation items, matching relevant documents, modifying those documents, retraining, and measuring the effects. Cooccurrence statistics and information retrieval methods are used to identify documents that might contribute to knowledge learning. Result: The results confirm the link between cooccurrence statistics and model behavior, but show that existing methods do not fully explain a model's ability to answer knowledge questions. Conclusion: The work provides a reusable experimental framework that helps researchers further explore how training data affects model behavior. Abstract: We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., "rewriting history" -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.

[76] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Yilun Zheng,Dan Yang,Jie Li,Lin Shang,Lihui Chen,Jiahao Xu,Sitao Luan

Main category: cs.CL

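TL;DR: This paper proposes DEG-RAG, a framework that denoises LLM-generated knowledge graphs through entity resolution and triple reflection, substantially improving graph-based retrieval-augmented generation systems.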

Details

Motivation: Existing graph-based RAG systems rely on LLMs to build knowledge graphs automatically, which often yields noisy graphs with redundant entities and unreliable relations, degrades retrieval and generation, and lacks an effective denoising method. Method: DEG-RAG has two core steps, entity resolution (eliminating redundant entities) and triple reflection (removing erroneous relations), and systematic experiments evaluate different entity resolution strategies. Result: The method greatly reduces knowledge graph size while improving question answering performance across several popular graph-based RAG variants. Conclusion: DEG-RAG addresses the noise problem in LLM-generated knowledge graphs and, according to the authors, is the first comprehensive study of entity resolution in LLM-generated KGs, giving graph-based RAG systems an efficient and practical denoising approach. Abstract: Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
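To make the entity resolution stages named in this entry concrete (blocking, similarity, merging), here is a deliberately simple sketch. It swaps in string similarity from Python's `difflib` in place of the embedding-based choices the paper evaluates; the first-character blocking rule, the 0.85 threshold, and both function names are assumptions for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def resolve_entities(names, threshold=0.85):
    """Map each entity name to a canonical representative.

    Blocking: only names sharing a lowercase first character are compared.
    Similarity: difflib ratio on lowercased strings (a stand-in for embeddings).
    Merging: a name joins the first representative it matches, else starts its own.
    """
    blocks = defaultdict(list)
    for name in names:
        blocks[name.strip().lower()[:1]].append(name)

    canonical = {}
    for block in blocks.values():
        representatives = []
        for name in block:
            for rep in representatives:
                if SequenceMatcher(None, name.lower(), rep.lower()).ratio() >= threshold:
                    canonical[name] = rep
                    break
            else:  # no representative was similar enough
                representatives.append(name)
                canonical[name] = name
    return canonical


def merge_triples(triples, canonical):
    """Rewrite triples onto canonical entities and drop resulting duplicates."""
    return sorted({(canonical[h], r, canonical[t]) for h, r, t in triples})


# Toy graph with one redundant entity spelling.
entity_names = ["Barack Obama", "barack obama", "Hawaii", "Berlin"]
raw_triples = [
    ("Barack Obama", "born_in", "Hawaii"),
    ("barack obama", "born_in", "Hawaii"),
]
mapping = resolve_entities(entity_names)
clean_triples = merge_triples(raw_triples, mapping)
```

Blocking keeps the number of pairwise comparisons manageable on large graphs; the trade-off is that duplicates falling into different blocks are never compared, which is one reason the choice of blocking strategy matters.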

[77] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

Lifu Tu,Yingbo Zhou,Semih Yavuz

Main category: cs.CL

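TL;DR: Proposes a compact multilingual embedding model that, by tuning training data scale, negative sampling strategy, and task diversity, matches or even exceeds large models on retrieval tasks.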

Details

Motivation: Small multilingual models do well on most multilingual tasks but usually lag behind larger models on retrieval. The study explores how better training strategies can raise small-model retrieval performance. Method: The authors study how training data scale, negative sampling strategy, and data diversity affect multilingual embeddings, with a focus on adding hard negatives and on the importance of task diversity. Result: Gains from adding more training data saturate quickly, whereas hard negatives and greater task diversity clearly improve retrieval; the outcome is a compact model of about 300M parameters. Conclusion: The 300M-scale multilingual model matches or even surpasses current strong 7B models on retrieval, showing the value of optimizing small models for a specific task. Abstract: Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau, indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

[78] Qwen3Guard Technical Report

Haiquan Zhao,Chenhan Yuan,Fei Huang,Xiaomeng Hu,Yichang Zhang,An Yang,Bowen Yu,Dayiheng Liu,Jingren Zhou,Junyang Lin,Baosong Yang,Chen Cheng,Jialong Tang,Jiandong Jiang,Jianwei Zhang,Jijie Xu,Ming Yan,Minmin Sun,Pei Zhang,Pengjun Xie,Qiaoyu Tang,Qin Zhu,Rong Zhang,Shibin Wu,Shuo Zhang,Tao He,Tianyi Tang,Tingyu Xia,Wei Liao,Weizhou Shen,Wenbiao Yin,Wenmeng Zhou,Wenyuan Yu,Xiaobin Wang,Xiaodong Deng,Xiaodong Xu,Xinyu Zhang,Yang Liu,Yeqiu Li,Yi Zhang,Yong Jiang,Yu Wan,Yuxin Zhou

Main category: cs.CL

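TL;DR: This paper presents Qwen3Guard, a multilingual safety guardrail model with generative and streaming variants that supports fine-grained tri-class judgments and real-time safety monitoring, targeting the safety needs of global LLM deployments.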

Details

Motivation: Existing guardrail models output only binary labels and must wait for the complete output before checking it, which makes it hard to accommodate safety policies that differ across domains or to intervene in real time. Method: Two specialized variants are proposed: Generative Qwen3Guard treats safety classification as an instruction-following task and produces fine-grained tri-class judgments (safe, controversial, unsafe); Stream Qwen3Guard adds a token-level classification head for real-time safety monitoring during incremental text generation. The models come in 0.6B, 4B, and 8B sizes and support up to 119 languages. Result: On English, Chinese, and multilingual benchmarks, Qwen3Guard reaches state-of-the-art performance on both prompt and response safety classification, with low latency and good scalability. Conclusion: Qwen3Guard addresses the flexibility and real-time limitations of conventional safety models and offers a comprehensive, efficient, and customizable safety solution for LLMs; all models are released as open source under the Apache 2.0 license. Abstract: As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.

[79] PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Md Mahadi Hasan Nahid,Davood Rafiei

Main category: cs.CL

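TL;DR: Proposes an LLM-based agentic retrieval system in which three specialized agents (a question analyzer, a selector, and an adder) retrieve evidence with high precision and recall for multi-hop question answering.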

Details

Motivation: Retrieving relevant evidence is essential in multi-hop question answering, but existing methods struggle to achieve precision and recall at the same time. Method: The system is a three-stage agent pipeline with a Question Analyzer, a Selector, and an Adder; within an iterative loop the Selector targets precision and the Adder targets recall, improving evidence retrieval. Result: On the four benchmarks HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG, the approach outperforms strong baselines, achieving higher retrieval accuracy with less distracting content. Conclusion: The agentic retrieval system balances precision and recall, improves multi-hop QA performance, and reduces reliance on irrelevant information. Abstract: Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrate that our approach consistently outperforms strong baselines.

[80] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL

Md Mahadi Hasan Nahid,Davood Rafiei,Weiwei Zhang,Yong Zhang

Main category: cs.CL

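TL;DR: This paper proposes a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone task; two complementary strategies and several augmentation techniques markedly improve schema recall and reduce false positives in Text-to-SQL systems.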

Details

Motivation: Schema linking is critical in Text-to-SQL systems, yet existing methods often neglect the retrieval of relevant schema elements, which leads to hallucinations and execution failures in generated SQL. Method: A bidirectional schema retrieval framework combines table-first and column-first retrieval strategies and adds question decomposition, keyword extraction, and keyphrase extraction. Result: Evaluations on benchmarks such as BIRD and Spider show markedly higher schema recall and fewer false positives; SQL generation approaches oracle performance without query refinement, narrowing the gap between full-schema and perfect-schema settings by 50%. Conclusion: Schema linking is a key lever for improving Text-to-SQL accuracy and efficiency, and the method offers a useful paradigm for follow-up research. Abstract: Schema linking -- the process of aligning natural language questions with database schema elements -- is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.

[81] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Ziye Xia,Sergei S. Ospichev

Main category: cs.CL

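TL;DR: This paper proposes a prompt engineering-based key concept path analysis method that uses small language models and a knowledge graph constraint mechanism to extract key concepts from academic papers precisely and identify innovation points.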

Details

Motivation: Existing paper databases do little to mine the relational networks between key concepts, which makes it hard for researchers to keep up with the latest work. Method: Building on the open-source OpenAlex knowledge graph, the authors analyze data from nearly 8,000 open-access papers from Novosibirsk State University and combine prompt engineering with fine-tuned Qwen and DeepSeek small language models to build a knowledge graph-constrained agent for key concept path analysis. Result: The approach achieves high-precision key concept extraction and innovation point identification, and reveals a strong correlation between the distribution patterns of key concept paths and both innovation points and rare paths; the models are publicly available on Hugging Face. Conclusion: The method uncovers latent conceptual links in academic literature, improves the depth and efficiency of academic analysis, and offers a new route to discovering research trends and identifying innovation. Abstract: In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to track the latest research findings and methodologies in a timely and comprehensive manner. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex open-source knowledge graph. By analyzing data from nearly 8,000 open-access papers from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

[82] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mahbub E Sobhani,Md. Faiyaz Abdullah Sayeedi,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda

Main category: cs.CL

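TL;DR: This paper introduces MathMist, a parallel multilingual mathematical reasoning benchmark with over 21K aligned question-answer pairs in seven languages, used to evaluate LLM mathematical problem solving in multilingual settings; it shows that existing models degrade markedly in low-resource languages.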

Details

Motivation: Existing mathematical reasoning benchmarks focus on English or a few high-resource languages and do not assess multilingual and cross-lingual mathematical reasoning comprehensively, so a benchmark with broader language coverage is needed. Method: The authors build MathMist, a parallel multilingual mathematical reasoning dataset covering seven languages (balanced across high-, medium-, and low-resource settings) with over 21K aligned question-answer pairs, and systematically evaluate a range of LLMs under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Result: Current LLMs lack consistency and interpretability in multilingual mathematical reasoning, with pronounced degradation in low-resource languages that chain-of-thought and code-switching strategies do not effectively mitigate. Conclusion: MathMist provides a new benchmark for multilingual mathematical reasoning, exposes the limits of existing models in cross-lingual mathematical understanding, and underscores the need for better low-resource language support and research on multilingual reasoning mechanisms. Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist

[83] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Sathyanarayanan Ramamoorthy,Vishwa Shah,Simran Khanuja,Zaid Sheikh,Shan Jie,Ann Chia,Shearman Chua,Graham Neubig

Main category: cs.CL

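TL;DR: This paper introduces MERLIN, a new testbed for multilingual multimodal entity linking. It includes a dataset of BBC news titles paired with images in five languages and shows that adding visual information improves entity linking accuracy, especially when the textual context is ambiguous or insufficient.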

Details

Motivation: To improve entity linking accuracy in multilingual settings, particularly when textual information is insufficient or ambiguous, by exploring the potential of multimodal (text plus image) methods. Method: The authors build a dataset of BBC news titles and corresponding images in five languages, covering over 7,000 named entity mentions linked to 2,500 unique Wikidata entities, and run multimodal entity linking experiments with multilingual models such as LLaMa-2 and Aya-23. Result: Adding visual data markedly improves entity linking accuracy, most noticeably for models with weaker multilingual abilities. Conclusion: Fusing multimodal information helps multilingual entity linking, especially for low-resource languages and ambiguous cases, and MERLIN offers a useful benchmark for future work. Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin

[84] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Marwa Abdulhai,Ryan Cheng,Aryansh Shrivastava,Natasha Jaques,Yarin Gal,Sergey Levine

Main category: cs.CL

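TL;DR: This paper studies deceptive outputs from large language models (LLMs) in dialogue and proposes a new "belief misalignment" metric to quantify deception. Existing models naturally deceive in roughly 26% of dialogue turns even under benign prompts, and models trained with RLHF still deceive at a rate of 43%. The authors also propose a multi-turn reinforcement learning fine-tuning method that reduces deceptive behavior by 77.6%.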

Details

Motivation: LLMs deployed in real applications may, intentionally or not, produce misleading, false, or manipulative content, which poses safety risks; their deceptive behavior in dialogue therefore needs to be measured and mitigated effectively. Method: The authors propose the belief misalignment metric to quantify deception in LLM dialogue, evaluate four distinct dialogue scenarios with five established metrics plus the new one, and build a multi-turn reinforcement learning framework for fine-tuning models to reduce deception. Result: Belief misalignment correlates more closely with human judgments than any of the existing metrics tested; eight mainstream LLMs show deception in about 26% of dialogue turns on average without malicious prompting, and can increase deceptiveness by as much as 31% relative to baselines when prompted to deceive; RLHF-trained models still deceive at a rate of 43%; the proposed RL method reduces deception by 77.6% compared with other instruction-tuned models. Conclusion: Deception by LLMs is a widespread and serious real-world problem that existing safety training such as RLHF does not fully suppress; evaluation and training should account for dialogue history across multiple turns, and the proposed metric and fine-tuning method identify and reduce deceptive outputs more effectively. Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitate moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

[85] A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease

Yangyang Li

Main category: cs.CL

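TL;DR: Proposes an early Alzheimer's disease detection method based on hybrid word embeddings and hyperparameter optimization, reaching 91% accuracy and 97% AUC and outperforming existing models.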

Details

Motivation: Early detection of Alzheimer's disease (AD) enables timely treatment and eases the burden on health care, and changes in language capability are an important early diagnostic signal. Method: Doc2Vec and ELMo are combined into a hybrid word embedding used to compute sentence perplexity scores that capture semantics and fluency, with linguistic features added to enrich the representation; logistic regression is then applied, with hyperparameters tuned throughout the pipeline (e.g., regularization parameter, learning rate, and vector sizes). Result: The model reaches 91% accuracy and 97% AUC in distinguishing early AD from healthy subjects, with small standard deviations indicating stability, and outperforms the best existing NLP model (88% accuracy). Conclusion: The method is accurate and stable and can serve for large-scale AD screening and as a complementary diagnostic aid for doctors. Abstract: Early detection of Alzheimer's Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating the financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, I create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, I input an embedded feature vector into logistic regression and fine-tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. To my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms that the proposed method is accurate and stable. This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.

[86] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Perapard Ngokpol,Kun Kerdthaisong,Pasin Buakhaw,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot

Main category: cs.CL

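TL;DR: This paper introduces Beyond One World, a benchmark for evaluating how consistently and accurately LLMs role-play different versions of a character (such as Marvel and DC superheroes). It comprises two tasks, Canon Events (recall of pivotal life events) and Moral Dilemmas (responses to ethically charged scenarios), and adds a Think-Act Matching metric that measures agreement between a model's reasoning and its actions. Experiments show that current models still have significant shortcomings in cross-version generalization, factual accuracy, and thinking-acting consistency.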

Details

Motivation: To study how LLMs perform when role-playing characters that exist in several official canon versions, and to examine their difficulty in preserving each version's distinct history, values, and moral code. Method: The authors build the Beyond One World benchmark with 30 iconic heroes and 90 canon-specific versions, design the Canon Events and Moral Dilemmas tasks, score responses under a framework that separates "thinking" from "acting", and propose the Think-Act Matching metric to quantify the consistency between internal reasoning and outward decisions. Result: (1) Chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) transfer across versions of the same character remains poor; (3) models usually do well at either thinking or acting, but rarely both. Conclusion: Beyond One World exposes key gaps in multiversal consistency and reasoning-action alignment in current role-playing LLMs and offers a challenging evaluation standard for future research. Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

[87] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

Ziad Elshaer,Essam A. Rashed

Main category: cs.CL

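TL;DR: Proposes a confidence-driven multi-model framework that improves medical question answering through adaptive routing and collaborative reasoning, without fine-tuning, making it efficient and accessible.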

Details

Motivation: High-performing medical LLMs usually require fine-tuning with substantial compute, which limits their use by resource-constrained institutions, so an efficient alternative that needs no fine-tuning is required. Method: A two-stage architecture: a confidence detection module first assesses the primary model's certainty, and an adaptive routing mechanism then sends low-confidence questions to helper models with complementary knowledge for collaborative reasoning. Result: The method is validated on three medical benchmarks (MedQA, MedMCQA, and PubMedQA), reaching 95.0% on PubMedQA and 78.0% on MedMCQA and substantially outperforming single-model approaches and uniform reasoning strategies. Conclusion: Confidence-aware multi-model collaboration is a practical, computationally efficient route that helps bring advanced medical AI to resource-limited settings. Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism directs low-confidence queries to helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
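
The two-stage idea in this entry (check the primary model's confidence, route only low-confidence questions to helpers) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the `threshold` value, and the use of a simple majority vote to stand in for "collaborative reasoning" are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a question to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route_and_answer(
    question: str,
    primary: Model,
    helpers: List[Model],
    threshold: float = 0.8,  # assumed cut-off, not a value from the paper
) -> str:
    """Use the primary answer when it is confident, otherwise pool helper votes."""
    answer, confidence = primary(question)
    if confidence >= threshold:
        return answer  # high confidence: no helper calls are made
    # Low confidence: collect the primary answer plus one answer per helper.
    votes = [answer] + [helper(question)[0] for helper in helpers]
    # Majority vote (an assumed stand-in for collaborative reasoning);
    # ties resolve to the answer that was seen first.
    return Counter(votes).most_common(1)[0][0]
```

The point of the design is that helper models are only called for the subset of questions where the primary model is unsure, which is where the compute savings over uniform multi-model reasoning would come from.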

[88] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

Anyun Zhuo,Xuefei Ning,Ningyuan Li,Yu Wang,Pinyan Lu

Main category: cs.CL

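TL;DR: This paper studies the robustness of modern LLMs to frequent, structured character-level perturbations and proposes a method (\nameshort{}) that inserts invisible Unicode control characters to deter LLM misuse, for example in online exams. Although this noise severely disrupts tokenization and lowers the signal-to-noise ratio, many LLMs still perform notably well. Through multi-dimensional experiments, the paper examines the mechanisms behind this robustness, distinguishes explicit from implicit denoising hypotheses, and highlights the misuse risks and reliability questions raised by the low-level robustness of LLMs.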

Details

Motivation: To deter misuse of LLMs (such as cheating in online exams) while investigating how they remain robust under heavy character-level noise. Method: The authors propose \nameshort{}, which inserts an invisible Unicode control character after each input character as noise, evaluate many LLMs across model, task, and noise configurations, and analyze how tokenization is handled as well as explicit versus implicit denoising mechanisms. Result: Although the noise severely disrupts text structure and the signal-to-noise ratio, many LLMs retain strong performance; the study suggests that LLMs have a degree of character-level robustness that may rely on implicit denoising rather than explicit recovery of the original input. Conclusion: Modern LLMs are unexpectedly robust to character-level noise, which both motivates new ways to deter misuse (such as \nameshort{}) and exposes potential security risks and reliability challenges in real deployments. Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and hypotheses about implicit versus explicit denoising of character-level noise. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
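
The perturbation described in this entry, one invisible character inserted after every input character, can be illustrated with a short sketch. The choice of U+200B (zero-width space) is an assumption for the example; the abstract does not say which characters the authors actually use. The `explicit_denoise` helper shows what the "explicit denoising" hypothesis would amount to if a model literally stripped the noise before reasoning.

```python
# Assumption: U+200B (zero-width space) stands in for the invisible character;
# the actual character set used in the paper is not given in the abstract.
INVISIBLE = "\u200b"


def perturb(text: str, noise: str = INVISIBLE) -> str:
    """Insert one invisible character after every character of the input."""
    return "".join(ch + noise for ch in text)


def explicit_denoise(text: str, noise: str = INVISIBLE) -> str:
    """Recover the original text by removing every occurrence of the noise character."""
    return text.replace(noise, "")
```

The perturbed string renders the same as the original in most viewers but doubles in length, which is why a subword tokenizer is likely to fragment it into far more tokens than the clean text.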

[89] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

Joseph E. Trujillo-Falcon,Monica L. Bozeman,Liam E. Llewellyn,Samuel T. Halvorson,Meryl Mizell,Stuti Deshpande,Bob Manning,Todd Fagin

Main category: cs.CL

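TL;DR: This paper describes an AI-based automated translation system developed by the U.S. National Weather Service (NWS) to give people who do not speak English at home accurate, timely, and culturally relevant weather information. It initially supports Spanish, Chinese, Vietnamese, and other languages, and combines GIS mapping with ethical AI practices to move toward a national warning system that reaches all Americans.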

Details

Motivation: To better serve the 68.8 million people in the U.S. who do not speak English at home, the NWS aims to overcome language barriers and make severe weather warnings and weather information more accessible, in support of a Weather-Ready Nation. Method: The NWS partners with LILT and uses its patented training process, built on large language models (LLMs) and neural machine translation (NMT), to develop a scalable automated translation tool; the work draws on best practices in multilingual risk communication and uses GIS mapping to analyze regional language needs and prioritize resources. Result: A website with experimental multilingual weather products has been developed, covering warnings, 7-day forecasts, and educational campaigns; manual translation time is significantly reduced and operational efficiency improved. Conclusion: Through AI-driven automated translation, cultural adaptation, and ethical AI design, the system brings the NWS closer to an equitable national warning system that reaches everyone. Abstract: To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

[90] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Mykolas Sveistrys,Richard Kunert

Main category: cs.CL

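TL;DR: This paper defines "pluri-hop" questions, aggregation queries over many documents in recurring report data, formalizes their three defining criteria, and builds the multilingual diagnostic dataset PluriHopWIND. Experiments show that existing RAG methods perform poorly, so the authors propose the PluriHopRAG architecture, which markedly improves performance through query decomposition and early filtering.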

Details

Motivation: Many real-world questions (for example over medical records or compliance filings) require aggregation across all documents and are sensitive to any omission, and conventional single-hop or multi-hop QA cannot handle such pluri-hop questions, which need exhaustive retrieval. Method: The authors propose the PluriHopRAG architecture, which decomposes a query into document-level subquestions and uses a cross-encoder to filter out irrelevant documents before LLM reasoning, following a "check all documents individually, filter cheaply" strategy. Result: On the new PluriHopWIND dataset, no traditional or variant RAG method exceeds 40% statement-wise F1; PluriHopRAG achieves relative F1 improvements of 18-52% over the baseline, depending on the base LLM. Conclusion: With exhaustive retrieval and early filtering, PluriHopRAG clearly outperforms traditional top-k retrieval and shows a workable path for pluri-hop questions over highly repetitive, distractor-rich document collections. Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
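
The "check all documents individually, filter cheaply" strategy in this entry can be sketched as a loop that visits every document but only spends an expensive call on those that pass a cheap relevance filter. This is an illustrative sketch under stated assumptions: `relevance` is a hypothetical stand-in for a cross-encoder score, `ask_llm` is a hypothetical stand-in for per-document LLM reasoning, and `min_score` is an arbitrary threshold; none of these names come from the paper.

```python
from typing import Callable, Dict


def pluri_hop_answer(
    subquestion: str,
    documents: Dict[str, str],
    relevance: Callable[[str, str], float],  # hypothetical stand-in for a cross-encoder
    ask_llm: Callable[[str, str], str],      # hypothetical stand-in for LLM reasoning
    min_score: float = 0.5,                  # assumed threshold, not from the paper
) -> Dict[str, str]:
    """Visit every document, but call the costly step only for documents that pass the filter."""
    answers: Dict[str, str] = {}
    for doc_id, text in documents.items():
        if relevance(subquestion, text) < min_score:
            continue  # cheap rejection before any costly reasoning
        answers[doc_id] = ask_llm(subquestion, text)
    return answers
```

Unlike top-k retrieval, nothing here caps the number of documents that reach the reasoning step, which matters when a single missed report changes the aggregate answer.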

[91] Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis

Jun Li,Qun Zhao

Main category: cs.CL

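TL;DR: This study builds a high-quality annotated dataset from Reddit that includes users' posting history and comments, labeled with a four-class framework based on the C-SSRS scale, and examines how comment-tree information affects the identification and prediction of users' suicide risk levels. Experiments show that adding comment-tree data significantly improves LLM performance on suicide risk detection, offering a new angle for early intervention.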

Details

Motivation: Existing work mostly detects suicidal tendencies in individual social media posts and rarely analyzes users' longitudinal, sequential interaction structures such as comment trees, making it hard to capture how suicide risk evolves. This paper investigates how comment-tree information improves the identification and prediction of user suicide risk. Method: The authors build a high-quality annotated Reddit dataset that combines users' posting history with comment-tree structure, labeled with a four-class framework based on the Columbia Suicide Severity Rating Scale (C-SSRS), and use statistical analysis and LLM experiments to assess the role of comment-tree information in risk classification and prediction. Result: Statistical analysis and LLM experiments show that including comment-tree information significantly improves both the discrimination and the prediction of user suicide risk levels. Conclusion: The interactive history carried by comment trees helps identify the evolution of a user's suicide risk more precisely; the approach offers an effective way to improve detection of at-risk individuals and is valuable for early suicide intervention. Abstract: Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user's evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users' suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users' posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with experimental results from Large Language Models (LLMs), demonstrates that incorporating comment-tree data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers a novel insight into enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.

[92] Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

Shiyao Ding,Takayuki Ito

Main category: cs.CL

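TL;DR: Proposes the "Your Next Token Prediction" (YNTP) task and builds a multilingual benchmark from conversations between people and MBTI-based NPCs to model users' personal language styles.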

Details

Motivation: LLMs do well at general text generation but still fall short at imitating how individuals actually communicate (for example in emails or social messages), and real user data is hard to obtain because of privacy concerns. Method: The authors design the YNTP task and build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, in which users interact for five days with psychologically grounded NPCs based on MBTI personality dimensions, capturing everyday communication patterns. Result: The first YNTP benchmark is established, supporting the modeling of users' language styles, and prompt-based and fine-tuning-based personalization methods are evaluated. Conclusion: The study provides a new task, dataset, and evaluation basis for user-aligned language modeling and advances personalized language generation. Abstract: Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users' internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: https://github.com/AnonymousHub4Submissions/your-next-token-prediction-dataset-100

[93] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin

Main category: cs.CL

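TL;DR: Proposes the MedTrust-Guided Iterative RAG framework, which reduces hallucinations in biomedical question answering and improves factual consistency through citation-aware reasoning, iterative retrieval-verification, and the MedTrust-Align module.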

Details

Motivation: Existing RAG-based biomedical QA systems hallucinate because of post-retrieval noise and insufficient evidence verification, which undermines answer reliability. Method: Three innovations are introduced: (1) citation-aware reasoning, which requires generated content to be grounded in retrieved documents and uses Negative Knowledge Assertions when evidence is insufficient; (2) an iterative retrieval-verification process that refines queries through Medical Gap Analysis; (3) the MedTrust-Align module, which combines positive examples with hallucination-aware negative samples and uses Direct Preference Optimization to reinforce citation-grounded reasoning and suppress hallucination. Result: Experiments on MedMCQA, MedQA, and MMLU-Med show that the method outperforms baselines across multiple model architectures, improving average accuracy by 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B. Conclusion: MedTrust-Guided Iterative RAG improves the factual consistency and reliability of biomedical QA systems and substantially reduces hallucination. Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

[94] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu

Main category: cs.CL

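TL;DR: Proposes a self-supervised reinforcement learning framework that needs no external supervision: it derives reward signals and pseudo-labels directly from instructions, addresses the sparse-reward problem in multi-constraint instruction following, and performs strongly on datasets across several domains.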

Details

Motivation: Language models struggle to follow multi-constraint instructions, and existing reinforcement learning methods depend on external supervision and sparse reward signals, which limits practical use. Method: The authors propose a label-free self-supervised RL framework that uses constraint decomposition and efficient constraint-wise binary classification to extract reward signals from instructions and generate pseudo-labels for reward model training. Result: The method yields strong improvements on 3 in-domain and 5 out-of-domain datasets, with especially good results on agentic and multi-turn instruction following. Conclusion: The method effectively mitigates reward sparsity in multi-constraint instruction following, removes the dependence on external supervision, and shows good generalization and application potential. Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
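
To see why decomposing an instruction into constraints helps with sparse rewards, the sketch below contrasts an all-or-nothing reward with a constraint-wise one. It is only an illustration of the general idea: in the paper the per-constraint judgment comes from a reward model trained on pseudo-labels, whereas here hand-written predicates (`Check`) stand in for that classifier, and both function names are hypothetical.

```python
from typing import Callable, List

# Hypothetical stand-in for a learned binary classifier: one predicate per constraint.
Check = Callable[[str], bool]


def sparse_reward(response: str, checks: List[Check]) -> float:
    """All-or-nothing: 1.0 only if every constraint is satisfied."""
    return 1.0 if all(check(response) for check in checks) else 0.0


def constraint_wise_reward(response: str, checks: List[Check]) -> float:
    """Fraction of constraints satisfied, giving partial credit for partial compliance."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)
```

With three constraints, a response that meets only one gets 0.0 under the sparse scheme but about 0.33 under the constraint-wise scheme, so the policy receives a learning signal long before it satisfies everything at once.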

[95] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang,Ce Zhang,Jun-Yu Ma,Jianshu Zhang,Hongru Wang,Yi Chen,Boyang Xue,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu,Kam-Fai Wong

Main category: cs.CL

TL;DR: Proposes the Explore to Evolve paradigm and builds the verifiable WebAggregatorQA dataset to improve the information aggregation ability of web agents.

Details

Motivation: Existing open-source deep research agents focus mostly on information seeking and neglect information aggregation, which limits their ability to support in-depth research. Method: An agent collects grounded information through proactive online exploration of the real web, then self-evolves an aggregation program drawing on 12 high-level logical operation types to synthesize verifiable QA pairs; the resulting WebAggregatorQA dataset is used for supervised fine-tuning on top of the SmolAgents framework. Result: WebAggregatorQA contains 10K samples spanning 50K websites and 11 domains; WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and approaches Claude-3.7-sonnet; on the new benchmark mainstream models score poorly, exposing a weakness in information aggregation. Conclusion: Information aggregation is a key bottleneck for current web agents; the WebAggregator models substantially improve this ability, and the new benchmark reveals the shortcomings of existing models on the task. Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

[96] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

Reid T. Johnson,Michelle D. Pain,Jordan D. West

Main category: cs.CL

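TL;DR: This paper presents the Natural Language Tools (NLT) framework, which replaces programmatic JSON tool calling in LLMs with natural language outputs and significantly improves tool calling accuracy and stability.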

Details

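Motivation: To address the task interference and format constraints of conventional JSON tool calling and improve LLM tool calling performance in real applications. Method: Tool selection is decoupled from response generation, and natural language output replaces JSON-formatted tool calls, avoiding format errors and interference between tasks. Result: Across 10 models and 6,400 trials, NLT improves tool calling accuracy by 18.4 percentage points on average and reduces output variance by 70%, and it is robust in the customer service and mental health domains; open-weight models gain the most, even surpassing flagship closed-weight models. Conclusion: The NLT framework effectively improves LLM tool calling, works for models without native tool-calling support, and has implications for model training in both the reinforcement learning and supervised fine-tuning stages. Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.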
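
One way to picture "natural language outputs instead of JSON tool calls" is a selector step in which the model writes a plain line per tool, and a lightweight parser turns those lines into a list of tools to run. The sketch below is hypothetical: the `name - YES/NO` line format and the `parse_tool_selection` helper are assumptions made for illustration, since the abstract does not specify the output format NLT actually uses.

```python
import re
from typing import List


def parse_tool_selection(model_output: str, tool_names: List[str]) -> List[str]:
    """Return the tools marked YES in the model's free-text output, in registration order.

    Assumed (hypothetical) format: one line per tool, e.g. 'lookup_order - YES'.
    """
    selected = []
    for name in tool_names:
        # Match '<tool name> - YES' or '<tool name>: NO' at the start of a line.
        pattern = rf"^\s*{re.escape(name)}\s*[-:]\s*(YES|NO)\b"
        match = re.search(pattern, model_output, flags=re.IGNORECASE | re.MULTILINE)
        if match and match.group(1).upper() == "YES":
            selected.append(name)
    return selected
```

Because the parser only looks for tool names followed by a yes/no marker, surrounding free-text reasoning in the output is ignored rather than causing a parse failure, which is the kind of format brittleness that strict JSON decoding can suffer from.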

[97] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Haolin Li,Haipeng Zhang,Mang Li,Yaohua Wang,Lijie Wen,Yu Zhang,Biqing Huang

Main category: cs.CL

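TL;DR: This paper proposes the LiRA framework, whose two modules, Arca and LaSR, improve LLMs' cross-lingual representation, retrieval, and reasoning for low-resource languages, and releases a multilingual product retrieval dataset covering Southeast Asian and South Asian languages.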

Details

Motivation: Because of limited training data, machine-translation noise, and unstable cross-lingual alignment, current LLMs perform substantially worse on low-resource languages than on high-resource ones, so their cross-lingual robustness needs to improve. Method: The LiRA framework comprises Arca (anchor-based alignment with multi-agent collaborative encoding) and LaSR (a language-aware lightweight reasoning head with consistency regularization), jointly optimizing cross-lingual representation, retrieval, and reasoning. Result: On low-resource cross-lingual retrieval, semantic similarity, and reasoning tasks, LiRA shows consistent gains and robustness under few-shot and noise-amplified settings, and ablations confirm the contribution of each module. Conclusion: LiRA strengthens LLMs' cross-lingual understanding and multi-task robustness in low-resource languages and offers a workable architecture for future multilingual AI systems. Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

[98] Efficient Seq2seq Coreference Resolution Using Entity Representations

Matt Grenander,Shay B. Cohen,Mark Steedman

Main category: cs.CL

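TL;DR: Proposes a compressed representation that makes seq2seq coreference models more efficient in incremental settings: it extracts and reorganizes entity-level tokens and discards most other input tokens, achieving substantial compression while keeping performance high.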

Details

Motivation: Existing seq2seq coreference models are inefficient and inflexible in incremental settings such as dialogue. Method: Entity-level tokens are extracted and reorganized while most non-entity input tokens are discarded, giving a compressed representation that improves efficiency. Result: On OntoNotes, the model is only 0.6 CoNLL F1 points below a full-prefix incremental baseline at a compression ratio of 1.8; on LitBank it surpasses the previous state of the art. Conclusion: Discarding a large share of tokens in seq2seq coreference resolvers is a feasible strategy for efficient incremental coreference resolution. Abstract: Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must be processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model scores just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it surpasses state-of-the-art performance. Our results indicate that discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.

[99] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang

Main category: cs.CL

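TL;DR: Compares contrastive learning (CL) and supervised fine-tuning (SFT) for LLM-based reranking, finds that SFT outperforms CL on universal multimodal retrieval (UMR) because it weights updates more strongly, and reports new state-of-the-art results on the MRB benchmark.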

Details

Motivation: To determine which training objective (CL or SFT) is better suited to LLM-based reranking and to analyze the underlying mechanism. Method: Decomposes the training objectives into two components, weight and direction, presents a unified framework for analyzing them, and compares CL and SFT on UMR through probing experiments. Result: SFT provides a substantially stronger weighting scheme than CL, while neither has a clear advantage in update direction; overall, SFT shows a consistent advantage for LLM reranking and sets a new state of the art on the MRB benchmark. Conclusion: SFT is better suited than CL as a training objective for LLM-based reranking, mainly because of its more effective weighting scheme. Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of the model updates, and direction, which guides those updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.

[100] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Kyubyung Chae,Gihoon Kim,Gyuseong Lee,Taesup Kim,Jaejin Lee,Heejin Kim

Main category: cs.CL

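TL;DR: Presents a new dataset and analytic framework for evaluating the socio-cultural alignment, technical robustness, and safety of sovereign large language models (LLMs). Experiments show that although sovereign LLMs help support low-resource languages, they do not always serve their target users as well as claimed and may neglect key quality attributes such as safety.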

Details

Motivation: As sovereign LLMs develop, there is a pressing need to assess whether they truly fit specific socio-cultural contexts and remain safe and technically robust, but suitable evaluation frameworks and datasets are lacking. Method: Constructs a new dataset and introduces an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness and safety. Result: Sovereign LLMs help with low-resource languages but are uneven in socio-cultural alignment and safety; some models fail to serve their target users well, and safety risks can be underestimated. Conclusion: Advancing sovereign LLMs requires more comprehensive, practically grounded evaluation criteria that cover socio-cultural fit, safety, and technical robustness. Abstract: Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users' socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

[101] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

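TL;DR: Studies how multimodal generative models handle English dialect text input and finds that performance drops sharply when dialect words are used. The authors build a large-scale benchmark covering six common English dialects and propose an encoder-based mitigation strategy that improves dialect performance while preserving Standard American English performance.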

Details

Motivation: Multimodal generative models often receive dialect input in practice, yet how well they understand different English dialects is unclear. Making these models more inclusive and usable requires systematically evaluating dialect performance and developing effective mitigation. Method: Builds a large-scale benchmark covering six English dialects, collecting and verifying more than 4,200 dialect prompts; evaluates 17 image and video generative models; and proposes an encoder-based mitigation strategy that teaches the model to recognize dialect features without harming Standard American English performance. Result: State-of-the-art multimodal generative models degrade by 32.26% to 48.17% when a single dialect word appears in the prompt; conventional fine-tuning and prompt rewriting yield limited gains (< 7%) and may hurt Standard English performance; the proposed method raises the performance of models such as Stable Diffusion 1.5 on five dialects to roughly Standard English level (+34.4%) at almost no cost to Standard English performance. Conclusion: Existing generative models degrade noticeably on dialect input, and the proposed encoder-level mitigation strategy improves dialect support while preserving Standard English performance, which helps move multimodal generation toward fairer and more inclusive models. Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

[102] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying,Yunwen Li,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Xeron Du,Tianyu Zheng,Yichi Zhang,Letian Ni,Yuyang Cheng,Qiguang Chen,Jingzhe Ding,Shengda Long,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Ge Zhang,Wenhao Huang,Wanxiang Che,Chenghua Lin

Main category: cs.CL

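TL;DR: Introduces the WritingPreferenceBench dataset and shows that current preference learning methods degrade when objective quality signals are absent, indicating that existing RLHF methods mainly learn to detect objective errors rather than subjective quality preferences.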

Details

Motivation: Existing preference learning methods perform well on standard benchmarks but degrade significantly once objective quality signals are removed, and they struggle to capture subjective writing quality (e.g., creativity, style, emotional resonance). Method: Builds WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs across 8 creative writing genres, and compares sequence-based reward models, zero-shot language model judges, and generative reward models on subjective preference judgments. Result: Sequence-based reward models reach only 52.7% accuracy and zero-shot language model judges 53.9%, whereas generative reward models reach 81.8%; performance varies widely across genres, and model scale does not change the overall trend. Conclusion: Current RLHF methods mainly rely on detecting objective errors and fail to model subjective preferences; generative reward models do better through explicit reasoning chains, and future preference modeling may need intermediate reasoning representations rather than direct classification. Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

[103] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Kedi Chen,Zhikai Lei,Xu Guo,Xuecheng Wu,Siyuan Zeng,Jianghao Yin,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang

Main category: cs.CL

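TL;DR: Proposes CodeSeq, a synthetic post-training dataset that turns number sequences into algorithmic problems to strengthen the inductive reasoning of large language models, and introduces a new reward mechanism to improve learning.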

Details

Motivation: Existing inductive reasoning research focuses mainly on superficial regularities, pays little attention to complex internal patterns, and provides neither precise thinking processes nor difficulty control. Method: Builds the CodeSeq dataset, which packages number sequences into algorithmic problems and defines a general term generation (GTG) task; generates supervised fine-tuning data by reflecting on failed test cases and applying iterative corrections; and applies reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on solvability and the success rate of self-directed case generation. Result: Models trained with CodeSeq improve on various reasoning tasks while preserving OOD performance. Conclusion: CodeSeq improves LLMs' inductive reasoning while preserving their generalization to out-of-distribution data. Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, has attracted increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but provide neither precise thinking processes nor difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models' OOD performance.

[104] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Qing Yang,Zhenghao Liu,Junxin Wang,Yangfan Du,Pengcheng Huang,Tong Xiao

Main category: cs.CL

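TL;DR: Proposes RLAIF-SPA, a reinforcement learning framework based on AI feedback that improves emotional expressiveness and naturalness in text-to-speech synthesis by optimizing for semantic accuracy and prosodic-emotional alignment.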

Details

Motivation: Existing emotional speech synthesis methods rely on costly emotion annotations or indirect optimization objectives and struggle to ensure both semantic accuracy and emotional expressiveness, which yields emotionally flat speech. Method: Proposes the RLAIF-SPA framework, which uses automatic speech recognition (ASR) and a large language model (LLM) as AI feedback to judge semantic accuracy and prosodic-emotional alignment, respectively; aligns prosodic labels along four fine-grained dimensions (Structure, Emotion, Speed, and Tone); and directly optimizes emotional expressiveness and intelligibility through reinforcement learning. Result: On LibriSpeech, RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in word error rate (WER), a 9.1% increase in SIM-O, and more than a 10% improvement in human evaluation. Conclusion: RLAIF-SPA improves the expressiveness and naturalness of emotional speech synthesis through AI feedback, without relying on human emotion annotations, and has strong application potential. Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
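
For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the usual word-level definition (edit distance between reference and hypothesis, divided by the reference length). It assumes plain whitespace tokenization and no text normalization; the function name `word_error_rate` is illustrative and is not taken from the paper, whose exact evaluation pipeline is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: tokens are obtained by whitespace splitting, with no
    case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

In an ASR-as-judge setup like the one described, the reference would be the input text and the hypothesis the ASR transcript of the synthesized audio; a lower value indicates more intelligible speech.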

[105] Intent Clustering with Shared Pseudo-Labels

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

Main category: cs.CL

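TL;DR: Proposes a training-free, label-free intent clustering method that uses lightweight open-source LLMs to generate pseudo-labels and clusters texts via multi-label classification, offering low cost, high transparency, and good practical applicability.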

Details

Motivation: Existing methods rely on expensive commercial LLMs and require the number of clusters to be known in advance, which limits transparency and real-world applicability. Method: Uses a lightweight open-source LLM to generate pseudo-labels for each text, performs multi-label classification over the pseudo-label set, and computes text similarity from the degree of label overlap to form clusters. Result: Experiments on four benchmark datasets show performance comparable to or better than recent baselines, with efficient computation and stable behavior across models and datasets. Conclusion: The method is simple, efficient, and interpretable, suits low-resource scenarios, and offers a practical, open solution for intent clustering. Abstract: In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to or better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
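
The hypothesis that texts in the same cluster share more pseudo-labels can be illustrated with a small sketch. This is not the authors' pipeline: it assumes each text already has a set of pseudo-labels (the LLM labelling and multi-label classification steps are omitted), measures overlap with Jaccard similarity, and groups texts by linking any pair above a hypothetical `threshold`. The abstract instead describes encoding the label assignments into embeddings, so the grouping rule here is a stand-in chosen only because it needs no preset number of clusters.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of labels two texts have in common (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_label_overlap(labels: list[set[str]], threshold: float = 0.5) -> list[int]:
    """Assign a cluster id to each text by linking pairs with enough label overlap.

    Uses a small union-find structure, so clusters are the connected
    components of the 'overlap >= threshold' graph.
    """
    parent = list(range(len(labels)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if jaccard(labels[i], labels[j]) >= threshold:
                parent[find(j)] = find(i)

    # Renumber component roots as consecutive cluster ids in order of appearance.
    ids: dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(labels))]
```

With pseudo-label sets such as `{"refund", "billing"}` and `{"billing", "refund", "invoice"}`, the overlap is 2/3, so the two texts fall into the same group at the default threshold, while a text labelled `{"password", "login"}` stays separate.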

[106] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng

Main category: cs.CL

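TL;DR: Proposes a unified, verifiable "nugget-as-rubric" paradigm for reward modeling in search-augmented large language models, treating atomic information points as structured evaluation criteria. It adds an automatic rubric construction pipeline and an efficient generative verifier, Search-Gen-V, which achieves high accuracy, scalability, and robustness across workloads.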

Details

Motivation: Existing reward modeling for search-augmented LLMs has limitations: rule-based rewards are fragile to variations in expression and hard to apply to long-form tasks, while generative rewards are more robust but verifiable, stable rewards for long-form tasks in dynamic corpora are hard to design and computationally expensive. Method: Proposes the "nugget-as-rubric" paradigm, which treats atomic information points as evaluation criteria; designs an automatic rubric construction pipeline based on query rewriting for long-form tasks, extracting rubrics from both static corpora and dynamic web content; and introduces Search-Gen-V, an efficient 4B-parameter generative verifier trained via distillation and a two-stage strategy. Result: Search-Gen-V achieves strong verification accuracy across workloads and is scalable, robust, and efficient, supporting verifiable reward construction for search-augmented LLMs. Conclusion: The "nugget-as-rubric" paradigm offers a unified, verifiable, and efficient approach to reward modeling for search-augmented LLMs that covers both short-form and long-form tasks, and Search-Gen-V demonstrates strong performance as its implementation. Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning with tailored reward signals has emerged as a viable technique to enhance LLMs on tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.

[107] Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures

Xinyue Ma,Pol Pastells,Mireia Farrús,Mariona Taulé

Main category: cs.CL

TL;DR: 本文提出了一种通过微调机器翻译模型来学习汉语“被”字被动句的负面语义韵的方法，并构建了英汉双语数据集进行验证，结果表明模型能更准确地在不利语境中使用“被”字句，且该知识可在多语言模型中跨语言迁移。

Details

Motivation: Current machine translation models cannot handle differences in semantic prosody that arise from collocation, in particular the consistently negative semantic prosody of Chinese BEI passives, which leads to inaccurate translations; models therefore need to learn this linguistic property. Method: Focuses on Chinese BEI passives, builds a dataset of English-Chinese sentence pairs that demonstrates their negative semantic prosody, and fine-tunes OPUS-MT, NLLB-600M, and mBART50 on it to improve correct use of BEI passives in translation. Result: The fine-tuned models favour BEI passives when translating unfavourable content and avoid them for neutral or favourable content; NLLB-600M also transfers this knowledge to Spanish-Chinese translation. Conclusion: Fine-tuning on a dataset targeted at a specific structure lets machine translation models learn semantic prosody and produce translations that better match target-language usage, and this ability can transfer across languages in multilingual models. Abstract: Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better on using BEI passives for translating unfavourable content and avoid using them for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.

[108] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms

Xingmeng Zhao,Dan Schumacher,Veronica Rammouz,Anthony Rios

Main category: cs.CL

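TL;DR: Proposes a human-centered framework that uses generated user stories and multi-agent discussions to help people think more comprehensively about potential risks and benefits before AI deployment.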

Details

Motivation: Rapid, low-barrier AI development can introduce risks of bias, privacy violations, and unequal access, and existing methods rely heavily on automatic detection, which weakens human understanding of how harms arise and who they affect. Method: Designs a human-centered framework based on user story generation and multi-agent discussion, and evaluates through a user study how well it broadens risk awareness. Result: Participants who read the stories identified a broader range of harm types, with responses distributed more evenly; those who did not read stories focused mainly on privacy and well-being (58.3%). Conclusion: Storytelling helps participants think more comprehensively and creatively about AI's impact on users and broadens their awareness of potential harms and benefits. Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.

[109] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia,Zhihan Zhang,Ignacio Cases,Zheyuan Liu,Meng Jiang,Peng Qi

Main category: cs.CL

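TL;DR: Proposes AutoRubric-R1V, a framework that combines reinforcement learning with automatically generated rubrics to provide process supervision for multimodal LLM reasoning, improving reasoning faithfulness and performance.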

Details

Motivation: Existing reinforcement learning methods reward only final-answer correctness, which leads to spurious reasoning and leaves the reasoning process without effective supervision. Method: Uses a self-aggregation method to distill consistent reasoning checkpoints from successful trajectories, automatically constructs rubric-based generative rewards, and optimizes them jointly with outcome rewards. Result: Achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness. Conclusion: AutoRubric-R1V improves the reasoning quality and trustworthiness of multimodal LLMs through process-level supervision that requires no human annotation. Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

[110] Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

Manar Abdelatty,Maryam Nouh,Jacob K. Rosenstein,Sherief Reda

Main category: cs.CL

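TL;DR: Pluto is a benchmark framework for evaluating the efficiency of LLM-generated Verilog code, with 114 problems, self-checking testbenches, and multiple Pareto-optimal reference implementations. Experiments show that state-of-the-art LLMs do well on functional correctness (78.3% pass@1) but still lag well behind expert designs in area, delay, and power efficiency (eff@1 of 63.8%, 65.9%, and 64.0%, respectively), underscoring the need for efficiency-aware hardware evaluation frameworks.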

Details

Motivation: Existing benchmarks focus mainly on functional correctness, lack comprehensive evaluation of synthesis efficiency (area, delay, power), and often lack optimized baselines and verification testbenches, so they cannot measure how well LLMs actually perform on hardware design tasks. Method: Proposes the Pluto benchmark framework, with 114 design problems that include self-checking testbenches and multiple Pareto-optimal reference implementations, and evaluates LLM-generated Verilog along two dimensions: functional correctness (pass@k) and synthesis efficiency (eff@k). Result: State-of-the-art LLMs reach 78.3% functional correctness (pass@1) but do worse on synthesis efficiency: 63.8% area efficiency, 65.9% delay efficiency, and 64.0% power efficiency (eff@1). Conclusion: Functional correctness alone is not enough to drive efficient hardware design; evaluation frameworks such as Pluto that cover both function and efficiency are needed to advance hardware-focused LLM research. Abstract: Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.
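
As background for the pass@k figure above, here is a minimal sketch of the unbiased pass@k estimator that is commonly used in code-generation benchmarks: given n sampled solutions of which c pass the tests, it estimates the chance that at least one of k randomly chosen samples passes. Whether Pluto uses exactly this estimator is an assumption, and the abstract does not define eff@k, so no formula for it is attempted here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: number of generated samples for one problem
    c: number of those samples that pass all tests
    k: budget of samples considered (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the plain pass fraction c / n; a benchmark-level score is typically the mean of this value over all problems.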

[111] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Yunwen Li,Shuangshuang Ying,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Tianyu Zheng,Xeron Du,Qiguang Chen,Jiajun Shi,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Stephen Huang,Wanxiang Che,Chenghua Lin,Eli Zhang

Main category: cs.CL

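TL;DR: Presents COIG-Writer, a Chinese creative writing dataset that includes thought processes, shows that creative writing ability depends on the interaction between logical structure and linguistic grounding, and finds that this ability is culturally bound with no cross-lingual transfer.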

Details

Motivation: LLMs write creatively less well in non-English contexts, where high-quality training data and process supervision are scarce, so the authors set out to build a dataset that captures the creative thought process and improves model performance. Method: Builds COIG-Writer, a Chinese creative writing dataset of 1,665 triplets obtained by systematically reverse-engineering high-quality texts, each with a reverse-engineered prompt, detailed creative reasoning, and the final text; experiments analyze the effects of process supervision, data ratio, cross-lingual transfer, and lexical diversity on creative writing. Result: Process supervision must be combined with general data (at least a 1:12 ratio) to improve performance stably; creative ability is strongly culturally bound, with an 89.26 percentage-point gap between Chinese and English performance; and lexical diversity correlates negatively with creative quality (the TTR paradox). Conclusion: Creative writing excellence arises from the interplay of logical scaffolding and linguistic grounding; adding linguistic diversity or relying on cross-lingual transfer cannot make up for missing logical reasoning, and process supervision is key to improving non-English creative writing. Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data: a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance, and below this threshold the win rate progressively degrades (from 62.75% down to 35.78%); (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance); and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
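
The "TTR paradox" refers to the type-token ratio, a standard lexical diversity measure: the number of distinct tokens divided by the total number of tokens. A minimal sketch follows, assuming the text has already been tokenized (for Chinese this would require a word segmenter, which is not shown); the function name is illustrative and the paper's exact tokenization is not stated in the abstract.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens; higher means more lexical variety.

    Assumption: `tokens` is a pre-tokenized, non-empty sequence.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)
```

Note that raw TTR tends to fall as texts get longer, so comparisons are usually made between samples of similar length.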

[112] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Hwiyeol Jo,Joosung Lee,Jaehone Lee,Sang-Woo Lee,Joonsuk Park,Kang Min Yoo

Main category: cs.CL

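TL;DR: Proposes a framework called "Answer Regeneration" that uses an additional model inference to make answer extraction for reasoning models more robust and to improve performance.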

Details

Motivation: The performance and answer distributions of reasoning models are highly sensitive to the answer extraction algorithm, which undermines the reliability of evaluation. Method: Introduces the Answer Regeneration framework, which runs an additional inference step that feeds the prior input and output back to the model with the prompt "Answer:" and takes the final answer from the regenerated output. Result: The method shows stronger robustness and better performance on math problems and open-ended question answering, and does not depend on specific extraction rules. Conclusion: Answer Regeneration offers a more reliable and general approach to evaluating reasoning models. Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on the probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
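
The two-pass flow described above can be sketched roughly as follows. This is one reading of the abstract, not the authors' code: `generate` is a hypothetical callable standing in for any text-generation backend, and the exact prompt layout (question, then reasoning, then a trailing "Answer:") is an assumption.

```python
from typing import Callable


def regenerate_answer(question: str, generate: Callable[[str], str]) -> str:
    """Run the model twice: once to reason, once to restate the final answer.

    `generate` is a hypothetical function mapping a prompt to model text.
    """
    # First pass: the model reasons freely about the question.
    reasoning = generate(question)

    # Second pass: feed back the original input and the model's own output,
    # ending with the cue "Answer:" so the continuation is the answer itself.
    followup_prompt = f"{question}\n{reasoning}\nAnswer:"
    return generate(followup_prompt).strip()
```

The appeal of this design is that the evaluator no longer needs a hand-written regular expression per output format; the cost is one extra inference per question.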

[113] Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models

Guinan Su,Yanwu Yang,Li Shen,Lu Yin,Shiwei Liu,Jonas Geiping

Main category: cs.CL

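TL;DR: Proposes a data-free, online test-time adaptation framework for MoE models that optimizes routing decisions through self-supervision on the input context, improving performance on reasoning tasks while maintaining computational efficiency.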

Details

Motivation: Existing test-time adaptation methods mainly target dense models and depend on external data, which makes them hard to apply to MoE architectures, while MoE models often route poorly in deployment because of distribution shift. Method: Designs a data-free, online two-phase framework: during prefill and at regular intervals afterwards, routing decisions are optimized by self-supervision on the already generated sequence, using lightweight additive vectors that update only the router logits in selected layers; text is then generated as normal with the modified router kept until the next adaptation. Result: Achieves a 5.5% improvement on HumanEval with OLMoE and a 6% average gain on DeepSeek-V2-Lite when combined with self-consistency, while remaining robust to context shifts. Conclusion: The method improves MoE models on challenging reasoning tasks, is plug-and-play, and combines naturally with existing test-time scaling techniques. Abstract: Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaptation. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
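
To make the "additive vectors on router logits" idea concrete, the toy sketch below shows only how adding a per-expert offset can change which experts a top-k router selects. It is an illustration under simplifying assumptions, not the paper's method: the self-supervised optimization that would produce the offset is omitted, and the names `top_k_experts` and `delta` are hypothetical.

```python
def top_k_experts(router_logits: list[float], delta: list[float], k: int) -> list[int]:
    """Return the indices of the k experts with the highest adjusted logits.

    router_logits: the frozen router's scores for one token
    delta: an additive per-expert offset (learned at test time in the paper;
           supplied by hand here)
    """
    if len(router_logits) != len(delta):
        raise ValueError("router_logits and delta must have the same length")
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    adjusted = [logit + offset for logit, offset in zip(router_logits, delta)]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:k])
```

With logits `[2.0, 1.0, 0.5, 0.1]` and k = 2, a zero offset selects experts 0 and 1, while an offset of `[0, 0, 1.0, 0]` swaps expert 1 for expert 2. Because only the small offset vector changes, the expert weights themselves stay untouched, which is consistent with the efficiency argument in the abstract.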

[114] Midtraining Bridges Pretraining and Posttraining Distributions

Emmy Liu,Graham Neubig,Chenyan Xiong

Main category: cs.CL

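TL;DR: Presents the first systematic study of the "midtraining" phase in language model pretraining, finding that it is most effective in the math and code domains, reduces the syntactic gap between pretraining and posttraining data, and outperforms continued pretraining with less forgetting.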

Details

Motivation: Midtraining is widely used, but there is little scientific understanding of how it works; this paper investigates its effectiveness and the reasons for it through controlled experiments. Method: Pretrains language models from scratch and applies supervised fine-tuning in different domains, analyzing the effect of midtraining through controlled experiments that include ablations on its starting time and data mixture weights. Result: Midtraining works best in the math and code domains, where it lowers in-domain validation loss more effectively and reduces forgetting of pretraining knowledge; starting time matters more than mixture weights, and introducing specialized data earlier works better. Conclusion: Midtraining is an effective domain adaptation technique that, compared with continued pretraining, improves performance by reducing forgetting. Abstract: Recently, many language models have been pretrained with a "midtraining" phase, in which higher quality, often instruction-formatted data, is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised finetuning datasets in different domains. We find that when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss and pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights, with earlier introduction of specialized data yielding greater in-domain benefits as well as better preserving general language modeling. These findings establish midtraining as a domain adaptation technique that, compared to continued pretraining, yields better performance through reduced forgetting.

[115] From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Erwei Wang,Samuel Bayliss,Andra Bisca,Zachary Blair,Sangeeta Chowdhary,Kristof Denolf,Jeff Fifield,Brandon Freiberger,Erika Hunhoff,Phil James-Roxby,Jack Lo,Joseph Melber,Stephen Neuendorffer,Eddie Richter,Andre Rosti,Javier Setoain,Gagandeep Singh,Endri Taka,Pranathi Vasireddy,Zhewen Yu,Niansong Zhang,Jinming Zhuang

Main category: cs.CL

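TL;DR: MLIR-AIR is an open-source compiler stack built on MLIR. Through its AIR dialect, it provides fine-grained control over compute and data on modern spatial architectures such as AMD NPUs, enabling efficient management of parallelism, locality, and synchronization.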

Details

Motivation: General-purpose compilers abstract away parallelism, locality, and synchronization, which makes it hard to exploit the performance potential of modern spatial architectures; compiler infrastructure is needed that can explicitly orchestrate compute and data movement. Method: Build the MLIR-AIR compiler stack and define the AIR dialect, which represents asynchronous and hierarchical operations; AIR primitives are used for spatial scheduling, distributing computation, and overlapping communication with computation. Result: Matrix multiplication reaches up to 78.7% compute efficiency, with performance close to hand-optimized low-level implementations; for the LLaMA 2 multi-head attention block, a fused implementation takes about 150 lines of code and maps efficiently to spatial hardware. Conclusion: MLIR-AIR transforms high-level control flow into spatial programs that make efficient use of an NPU's compute resources and memory hierarchy, achieving high performance through compiler-managed scheduling. Abstract: General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware.
MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.

[116] Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Xujun Peng,Anoop Kumar,Jingyu Wu,Parker Glenn,Daben Liu

Main category: cs.CL

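TL;DR: Proposes a new approach that combines synthetic data generation, triplet loss, and layer-wise model merging to substantially improve the consistency of LLM outputs in RAG systems.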

Details

Motivation: LLMs produce inconsistent outputs for semantically equivalent inputs, and both consistency-focused training data and effective fine-tuning techniques are lacking. Method: Systematically generate synthetic data, use triplet loss to improve embeddings, and merge models layer by layer, using consistency-aware weights derived from intermediate layer activations to integrate knowledge from specialized models. Result: The merged model improves response similarity by about 47.5% over the baseline, a significant gain in output consistency. Conclusion: The approach offers a practical and effective way to improve the reliability of industrial RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.

[117] Predicting Task Performance with Context-aware Scaling Laws

Kyle Montgomery,David Park,Jianhong Tu,Michael Bendersky,Beliz Gunel,Dawn Song,Chenguang Wang

Main category: cs.CL

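TL;DR: Proposes an interpretable framework that jointly models the effect of training compute and context on downstream task performance, and validates it on several tasks.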

Details

Motivation: Existing scaling laws fail to capture the effect of context on downstream task performance, so a new framework is needed to understand the relationship between training compute and context utilization. Method: Build an interpretable model of downstream performance as a function of training compute and the provided context, and fit it empirically on extended-context variants of Llama-2. Result: The framework accurately models in-distribution downstream performance on 65,500 instances spanning three tasks, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context grows. Conclusion: Training compute and context utilization jointly shape downstream performance, and the framework offers guidance for designing more efficient long-context LLMs. Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.

[118] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin

Main category: cs.CL

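TL;DR: Using 553 real-world semi-structured interviews, this study evaluates several machine learning models for screening depression, anxiety, and PTSD. LLM-based models (such as GPT-4.1 Mini and Meta LLaMA) and LoRA fine-tuned RoBERTa models exceed 80% accuracy, reaching 89% accuracy and 98% recall for PTSD detection, which suggests AI models could improve the accessibility and accuracy of early mental health diagnosis.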

Details

Motivation: Because of subjective assessments, limited clinical resources, and stigma, depression, anxiety, and PTSD are often underdiagnosed or misdiagnosed; in primary care, providers misidentify depression or anxiety in over 60% of cases, so scalable, accessible, and context-aware screening tools are urgently needed. Method: Using 553 semi-structured interviews paired with ground-truth diagnoses, the study compares zero-shot prompting of GPT-4.1 Mini and Meta LLaMA with RoBERTa models fine-tuned via LoRA, and analyzes how context length affects performance. Result: The models exceed 80% accuracy across diagnostic categories, with the best results on PTSD (up to 89% accuracy and 98% recall); shorter, focused context segments improve recall; lower-rank LoRA configurations (e.g., rank 8 and 16) maintain competitive performance. Conclusion: LLM-based screening substantially improves on traditional self-report tools and is efficient and low-barrier, particularly for low-resource or high-stigma settings, offering a feasible path to integrating machine learning into real clinical workflows. Abstract: Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semi-structured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and Meta LLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics.
Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

[119] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Wenkai Yang,Weijie Liu,Ruobing Xie,Yiju Guo,Lulu Wu,Saiyong Yang,Yankai Lin

Main category: cs.CL

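TL;DR: Proposes LaSeR, an algorithm that uses a last-token self-rewarding score to jointly optimize an LLM's reasoning and self-verification capabilities, improving both efficiency and performance.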

Details

Motivation: Verification signals are unavailable at test time, and existing approaches that train self-verification alongside reinforcement learning require two separate prompt templates to generate solutions and verifications in sequence, which is inefficient. Method: The authors derive a closed-form solution to the self-verification objective, showing that a solution's true reasoning reward equals its last-token self-rewarding score, and add an MSE loss to the RLVR loss that aligns this score with verifier-based rewards, so both capabilities are optimized jointly. Result: Experiments show that LaSeR improves reasoning performance and gives the model strong self-rewarding capability at the cost of only one additional token inference, which also boosts inference-time scaling performance. Conclusion: With a simple last-token self-rewarding mechanism, LaSeR unifies reasoning and self-verification at low computational cost while improving the model's reasoning performance. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance.
Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
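
The last-token self-rewarding score described in this abstract can be sketched numerically as follows. This is a minimal reading of the abstract's wording (log-probability of a pre-specified token minus a pre-calculated constant, scaled by the KL coefficient), not the authors' code; the function names and all numeric values are hypothetical.

```python
def last_token_self_reward(logprob_token, ref_constant, kl_coef):
    """Score = KL coefficient * (log-prob of the pre-specified token - constant)."""
    return kl_coef * (logprob_token - ref_constant)


def mse_alignment_loss(self_rewards, verifier_rewards):
    """Mean squared error between self-rewarding scores and verifier rewards."""
    pairs = list(zip(self_rewards, verifier_rewards))
    return sum((s - v) ** 2 for s, v in pairs) / len(pairs)


# Hypothetical values: the KL coefficient and the pre-calculated constant.
kl_coef = 0.1
ref_constant = -23.0

# Hypothetical log-probabilities of the pre-specified token, read at the
# last token of two sampled solutions.
logprobs = [-13.0, -23.0]

# Verifier-based rewards for the same two solutions (1 = correct, 0 = wrong).
verifier_rewards = [1.0, 0.0]

scores = [last_token_self_reward(lp, ref_constant, kl_coef) for lp in logprobs]
loss = mse_alignment_loss(scores, verifier_rewards)
print(scores, loss)
```

In training, a loss of this kind would be added to the RLVR loss; the weighting between the two terms is not given in the abstract and is left out here.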

[120] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Yuxing Lu,Xukai Zhao,J. Ben Tamo,Micky C. Nnamdi,Rui Peng,Shuang Zeng,Xingyu Hu,Jinzhuo Wang,May D. Wang

Main category: cs.CL

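TL;DR: This paper introduces MetaBench, the first benchmark for evaluating large language models in metabolomics, covering five key capabilities: knowledge, understanding, grounding, reasoning, and research. An evaluation of 25 open- and closed-source LLMs finds that models do well on text generation tasks but still struggle with cross-database identifier grounding and with long-tail metabolites that have sparse annotations.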

Details

Motivation: Although LLMs perform well on general text, their ability in scientific domains that require deep, interconnected knowledge, such as metabolomics, remains largely uncharacterized. Metabolomics poses unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases, so a dedicated benchmark is needed to measure LLM performance systematically. Method: MetaBench is curated from authoritative public resources and covers five evaluation dimensions: knowledge, understanding, grounding, reasoning, and research. It is used to evaluate 25 open- and closed-source LLMs, with some experiments adding retrieval augmentation. Result: Current LLMs do well on text generation tasks but poorly on cross-database identifier grounding, where retrieval augmentation helps only to a limited extent; performance also drops on long-tail metabolites with sparse annotations, and models differ markedly across capabilities. Conclusion: MetaBench provides infrastructure for developing and evaluating metabolomics AI systems, supports systematic progress toward reliable computational tools, and exposes the strengths and limits of current LLMs in a specialized scientific domain. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

[121] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Guoqing Wang,Sunhao Dai,Guangze Ye,Zeyu Gan,Wei Yao,Yong Deng,Xiaofeng Wu,Zhenzhe Ying

Main category: cs.CL

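TL;DR: This paper proposes Information Gain-based Policy Optimization (IGPO), a reinforcement learning framework for multi-turn agent training that derives dense intrinsic rewards from the model's own belief updates, addressing the reward sparsity, advantage collapse, and weak credit assignment of outcome-based reward methods.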

Details

Motivation: RL-trained LLM agents rely on sparse outcome rewards in multi-turn interaction, which leads to advantage collapse and poor credit assignment, especially on long-horizon tasks; a denser, finer-grained reward mechanism is needed to improve training efficiency and performance. Method: IGPO models each interaction turn as an incremental step in acquiring information about the ground truth and defines the turn-level reward as the marginal increase in the policy's probability of producing the correct answer. These intrinsic rewards come from the model's own belief updates and are combined with the outcome reward to form dense reward trajectories. Result: On in-domain and out-of-domain benchmarks, IGPO consistently outperforms strong baselines, with higher accuracy and better sample efficiency, particularly in multi-turn scenarios. Conclusion: By introducing intrinsic rewards based on information gain, IGPO mitigates reward sparsity in multi-turn agent training and offers an efficient, scalable RL training framework for LLM agents. Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories.
Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
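
The turn-level reward defined in this abstract (the marginal increase in the policy's probability of producing the correct answer) can be sketched as a difference of consecutive probabilities. The probabilities below are made-up values, and the function name is hypothetical; how the authors estimate these probabilities and how they combine the result with the outcome reward is not specified in the abstract.

```python
def information_gain_rewards(answer_probs):
    """Turn-level rewards as marginal gains in correct-answer probability.

    answer_probs[0] is the probability before any turn; answer_probs[t] is the
    probability after turn t. Returns one reward per turn.
    """
    return [after - before for before, after in zip(answer_probs, answer_probs[1:])]


# Hypothetical probabilities of the ground-truth answer across three turns.
probs = [0.10, 0.25, 0.20, 0.60]
gains = information_gain_rewards(probs)

# A turn that lowers the probability (here the second) gets a negative reward,
# so every turn receives a signal even if the final answer is wrong.
print(gains)
```

The rewards telescope: their sum equals the total change in the correct-answer probability from the first to the last entry, so credit is split across turns rather than assigned only at the end.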

[122] LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Yiming Wang,Da Yin,Yuedong Cui,Ruichen Zheng,Zhiqian Li,Zongyu Lin,Di Wu,Xueqing Wu,Chenchen Ye,Yu Zhou,Kai-Wei Chang

Main category: cs.CL

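TL;DR: This paper presents UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale, together with UI-Simulator-Grow for more data-efficient scaling. In experiments on WebArena and AndroidWorld, it rivals or surpasses open-source agents trained on real UIs.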

Details

Motivation: Collecting large-scale, diverse UI trajectories from real-world tasks is prohibitively expensive in human annotation, infrastructure, and engineering effort, so an efficient alternative is needed for training digital agents. Method: UI-Simulator combines a digital world simulator that generates diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality, diverse training trajectories. UI-Simulator-Grow further prioritizes high-impact tasks and synthesizes informative trajectory variants to scale more efficiently. Result: On WebArena and AndroidWorld, agents trained with UI-Simulator rival or surpass open-source agents trained on real UIs and are significantly more robust, despite using weaker teacher models; UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model. Conclusion: UI-Simulator and UI-Simulator-Grow show the potential of targeted synthesis and scaling to improve digital agents efficiently, offering a feasible way to reduce dependence on collected training data. Abstract: Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in terms of human annotation, infrastructure, and engineering. To this end, we introduce UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose UI-Simulator-Grow, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizing informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of the targeted synthesis scaling paradigm to continuously and efficiently enhance digital agents.

[123] TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Yinxi Li,Yuntian Deng,Pengyu Nie

Main category: cs.CL

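TL;DR: This paper introduces TokDrift, a framework for evaluating problems caused by inconsistent tokenization in code LLMs. It finds that even minor formatting changes can substantially alter model behavior, and argues that future code LLMs need grammar-aware tokenization.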

Details

Motivation: Code LLMs rely on statistically driven subword tokenizers such as BPE, which do not guarantee that semantically identical code is tokenized consistently; this may hurt reliability, so the impact of inconsistent tokenization needs to be studied. Method: TokDrift applies semantic-preserving rewrite rules to create code variants that differ only in tokenization, evaluates their effect on nine code LLMs, and uses layer-wise analysis to locate the source of the problem. Result: Even minor formatting changes, such as whitespace or identifier naming, cause substantial shifts in model output; the issue originates in early embeddings, where subword segmentation does not align with grammar token boundaries. Conclusion: Misalignment between tokenization and grammar is a hidden obstacle to reliable code understanding and generation, and future code LLMs should adopt grammar-aware tokenization. Abstract: Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
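
To make the grammar side of this mismatch concrete, the sketch below uses Python's standard `tokenize` module to show that two snippets differing only in whitespace yield the same grammar-level tokens, even though their raw text differs. It does not run a subword tokenizer or reproduce TokDrift's rewrite rules; the helper name and the example snippets are illustrative assumptions.

```python
import io
import tokenize

# Token types that describe layout rather than program content.
LAYOUT_TYPES = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def grammar_tokens(source):
    """Return the lexer-level token strings of Python source, without layout tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in LAYOUT_TYPES]


compact = "total=price*qty\n"
spaced = "total = price * qty\n"

# The grammar sees the same five tokens in both variants.
same_grammar = grammar_tokens(compact) == grammar_tokens(spaced)

# The raw strings differ, so a tokenizer driven by character statistics is
# free to segment them differently.
same_text = compact == spaced

print(grammar_tokens(compact), same_grammar, same_text)
```

A variant pair like this is the kind of input where, according to the abstract, model behavior can shift even though the program is unchanged.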

[124] Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri,Mukul Ranjan,Zhiqiang Shen

Main category: cs.CL

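TL;DR: This paper proposes Elastic-Cache, a training-free, architecture-agnostic strategy that adaptively recomputes key-value caches in diffusion large language models, substantially reducing decoding latency while preserving generation quality.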

Details

Motivation: Existing methods recompute QKV for all tokens at every step and layer, which is highly redundant because KV states change little, especially in shallow layers; a more efficient method is needed to reduce this redundancy and speed up decoding. Method: Elastic-Cache builds on three observations: MASK tokens can be cached block-wise; KV dynamics increase with depth; and the most-attended token shows the smallest KV drift. It combines an attention-aware drift test (deciding when to refresh) with a depth-aware refresh schedule (deciding from which layer to refresh) to update the cache adaptively and layer by layer. Result: On LLaDA models, it achieves speedups of 8.7x on GSM8K (256 tokens), 45.1x on longer sequences, and 4.8x on HumanEval while maintaining higher accuracy than the baseline, and significantly higher throughput (6.8x on GSM8K) than existing confidence-based approaches. Conclusion: Elastic-Cache balances inference efficiency and generation quality for diffusion LLMs, supporting their practical deployment. Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality.
Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
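
One way to picture the "when to refresh" decision is the toy drift test below: find the most-attended token, compare its cached vector with a freshly computed one, and trigger a refresh if they have drifted apart. The abstract does not state the drift metric or threshold, so the cosine similarity, the 0.9 threshold, and every name and value here are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_refresh(attn_weights, cached, current, threshold=0.9):
    """Test drift on the most-attended token only.

    If even the token with the smallest expected drift has moved, the cache
    for the other tokens is assumed to be stale as well.
    """
    idx = max(range(len(attn_weights)), key=lambda i: attn_weights[i])
    return cosine(cached[idx], current[idx]) < threshold


# Hypothetical attention weights over three tokens; token 1 is most attended.
attn = [0.1, 0.7, 0.2]
cached_kv = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

stable_kv = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # token 1 unchanged
drifted_kv = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # token 1 rotated

keep_cache = should_refresh(attn, cached_kv, stable_kv)
refresh = should_refresh(attn, cached_kv, drifted_kv)
print(keep_cache, refresh)
```

Checking a single token keeps the test cheap, which fits the abstract's framing of the most-attended token as a conservative lower bound on cache change.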

cs.CV [Back]

[125] MultiFoodChat: A potential new paradigm for intelligent food quality inspection

Yue Hu,Guohang Zhuang

Main category: cs.CV

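TL;DR: Proposes MultiFoodChat, a zero-shot food recognition framework based on multi-agent dialogue reasoning that combines vision-language models and large language models to classify food accurately without additional training.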

Details

Motivation: Existing supervised models depend on large labeled datasets and generalize poorly to unseen food categories, which limits their practical use. Method: A dialogue-driven multi-agent reasoning framework in which an Object Perception Token (OPT) captures fine-grained visual attributes and an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues; VLMs and LLMs collaborate through multi-round visual-textual dialogues for zero-shot recognition. Result: Experiments on multiple public food datasets show better recognition accuracy and interpretability than existing unsupervised and few-shot methods. Conclusion: MultiFoodChat offers a new paradigm for understanding complex food scenes, with potential for intelligent food quality inspection and analysis. Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[126] Post-surgical Endometriosis Segmentation in Laparoscopic Videos

Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein

Main category: cs.CV

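TL;DR: This paper describes a system that assists gynecologists in diagnosing endometriosis by segmenting dark endometrial implants, a common visual appearance of the condition, in laparoscopic surgery videos, and by using colored overlays and a detection summary to make video browsing more efficient.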

Details

Motivation: Endometriosis has a manifold visual appearance inside the body, which makes it hard to identify and easy for non-specialists to misjudge, so an assistive tool is needed to improve identification accuracy and efficiency. Method: A system is trained to segment dark endometrial implants in laparoscopic videos, annotate implant regions with multi-colored overlays, and generate a detection summary for fast video browsing. Result: The system analyzes laparoscopic surgery videos, annotates dark endometrial implants, and provides a visual summary that helps physicians browse and understand the video. Conclusion: The system gives gynecologists a visual aid that may help with identifying endometriosis and has potential for clinical use. Abstract: Endometriosis is a common women's condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.

[127] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania

Main category: cs.CV

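TL;DR: This paper studies combining traditional vision models such as YOLO with vision-language models (VLMs) such as LLaVA, ChatGPT, and Gemini to improve aircraft detection and scene understanding in remote sensing images, with the largest gains when labeled data is scarce and image quality is poor.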

Details

Motivation: Traditional vision models depend on large amounts of labeled data and struggle to understand context in complex environments, while generalist VLMs remain underexplored in remote sensing, so methods that combine the strengths of both are worth exploring. Method: Combine the YOLO object detector with VLMs such as LLaVA, ChatGPT, and Gemini, using the VLMs' semantic understanding to add contextual awareness, and evaluate on labeled and unlabeled data as well as degraded images. Result: For aircraft detection and counting, MAE improves by 48.46% on average across models, and CLIPScore improves by 6.17%, with notable gains in challenging conditions. Conclusion: Combining traditional vision models with VLMs improves the accuracy and robustness of remote sensing image analysis, especially for few-shot learning and the low-quality images common in practice. Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision-language models (VLMs) offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios, which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.

[128] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo

Main category: cs.CV

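TL;DR: This study develops and validates an AI-based deep learning model to improve detection of cribriform morphology in prostate cancer. The model performs well in internal and external validation and achieves higher agreement than the participating expert pathologists.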

Details

Motivation: Cribriform morphology is an important histological feature indicating poor prognosis in prostate cancer, but it is underreported and subject to poor agreement between pathologists, so more accurate and standardized detection is needed. Method: A deep learning model with an EfficientNetV2-S encoder and multiple instance learning performs end-to-end whole-slide classification. It was trained on 640 digitised prostate core needle biopsies from three cohorts, annotated by three expert uropathologists with known high concordance, validated on internal and external independent cohorts, and compared against nine expert uropathologists. Result: AUC was 0.97 (kappa 0.81) in internal validation and 0.90 (kappa 0.55) in external validation; on 88 slides, the model's average agreement (kappa 0.66) exceeded that of all nine pathologists (kappa 0.35 to 0.62). Conclusion: The AI model matches or exceeds pathologist-level performance in detecting cribriform morphology and could improve diagnostic reliability, standardise reporting, and support treatment decisions. Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64).
In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
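
Since the results above are reported as Cohen's kappa, a short worked example may help in reading them. Kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The labels below are invented for illustration and are unrelated to the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Undefined (division by zero) when chance agreement is exactly 1, for
    example when both raters assign a single identical label to every item.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented labels for ten slides: 1 = cribriform present, 0 = absent.
model = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
pathologist = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

# Observed agreement is 8/10 and chance agreement is 0.5,
# so kappa is (0.8 - 0.5) / (1 - 0.5), about 0.6.
kappa = cohens_kappa(model, pathologist)
print(round(kappa, 3))
```

Read this way, a kappa of 0.66 for the model against a range of 0.35 to 0.62 for individual pathologists means the model's chance-corrected agreement was higher than that of any single rater in the comparison.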

[129] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng

Main category: cs.CV

TL;DR: 提出NAPPure框架，用于应对非加性对抗扰动，通过似然最大化分离干净图像和扰动参数，显著提升图像分类模型的鲁棒性。

Details

Motivation: 现有对抗净化方法主要针对加性扰动，对现实中的非加性扰动（如模糊、遮挡、畸变）效果有限，因此需要更通用的净化框架。 Method: 建立对抗图像的生成过程，并通过最大似然估计解耦出干净图像和扰动参数，从而实现对非加性扰动的处理。 Result: 在GTSRB和CIFAR-10数据集上的实验表明，NAPPure显著提升了分类模型对非加性对抗扰动的鲁棒性。 Conclusion: NAPPure框架有效扩展了对抗净化方法的适用范围，能够有效应对多种非加性对抗扰动，增强了模型的防御能力。 Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.

[130] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny

Main category: cs.CV

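TL;DR: This paper proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework that improves long video understanding in large video language models, using a structured semantic graph and an intermediate reasoning step to improve retrieval accuracy and cross-clip information aggregation.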

Details

Motivation: Limited context windows and the difficulty of retaining long-term sequential information make long videos hard for LVLMs; applying retrieval-augmented generation (RAG) directly disrupts temporal dependencies and introduces irrelevant information, which hurts reasoning accuracy. Method: Vgent (1) represents a video as a structured graph that preserves semantic relationships across clips to improve retrieval, and (2) adds an intermediate reasoning step that uses structured verification to reduce retrieval noise and explicitly aggregate relevant information across clips. Result: Evaluated with several open-source LVLMs on three long-video understanding benchmarks, Vgent improves over base models by 3.0% to 5.4% on MLVU and outperforms state-of-the-art video RAG methods by 8.6%. Conclusion: With graph-structured representation and intermediate reasoning, Vgent improves LVLM performance on long video understanding and addresses the temporal disruption and noise that conventional RAG introduces for video. Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond the context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long videos faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0\%\sim 5.4\%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6\%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[131] Synchronization of Multiple Videos

Avihai Naaman,Ron Shapira Weber,Oren Freifeld

Main category: cs.CV

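TL;DR: Proposes Temporal Prototype Learning (TPL), a prototype-based temporal alignment framework for synchronizing multiple videos from different scenes or from generative AI; it builds a shared, compact one-dimensional representation to improve alignment accuracy, efficiency, and robustness.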

Details

Motivation: Synchronizing multi-camera videos of the same scene usually needs only a simple time shift, but videos from different scenes or from generative AI differ in subjects, backgrounds, and nonlinear timing, which makes synchronization complex and leaves it without effective methods. Method: TPL extracts high-dimensional embeddings with pretrained models and builds a shared, compact 1D prototype sequence that serves as temporal anchors; learning one unified prototype sequence avoids exhaustive pairwise matching when aligning multiple videos. Result: Experiments show that TPL improves synchronization accuracy, efficiency, and robustness across several datasets, and that it is the first approach to address synchronization across multiple generative AI videos. Conclusion: TPL is a general, efficient multi-video synchronization framework that aligns videos robustly across different scenes and generated content, with broad application potential. Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/

[132] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito

Main category: cs.CV

TL;DR: Proposes a zero-shot method that generates high-fidelity, identity-preserving 3D avatars from unstructured phone photos.

Details

Motivation: Existing single-view methods suffer from geometric inconsistencies and identity distortion, and models trained on synthetic data fail to capture high-frequency details such as skin wrinkles and fine hair, which limits realism. Method: The "Capture, Canonicalize, Splat" pipeline first uses a generative canonicalization module to convert multiple unstructured views into a standardized representation, then uses a transformer model trained on a large-scale dataset of Gaussian splatting avatars of real people to generate the 3D avatar. Result: From a few unstructured phone photos, the method produces static quarter-body 3D avatars with compelling realism and robust identity preservation. Conclusion: The method generates high-quality 3D avatars without fine-tuning and targets the realism and identity-preservation weaknesses of existing techniques. Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[133] cubic: CUDA-accelerated 3D Bioimage Computing

Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O'Meara

Main category: cs.CV

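TL;DR: cubic is an open-source Python library that extends the SciPy and scikit-image APIs with GPU acceleration from CuPy and RAPIDS cuCIM for efficient, scalable analysis of 2D and 3D bioimages. Its device-agnostic API accelerates preprocessing, deconvolution, segmentation, and feature extraction while integrating with the modern Python scientific computing ecosystem.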

Details

Motivation: Existing bioimage analysis tools are limited in scalability, efficiency, GPU support, 3D processing capability, and interoperability with modern computational workflows, so they struggle with the growing volume of large 2D and 3D microscopy data; an efficient, integrated, and compatible solution is needed. Method: cubic uses a device-agnostic API that dispatches each operation according to where the data reside (CPU or GPU), and it provides GPU-accelerated alternatives to SciPy and scikit-image routines based on CuPy and RAPIDS cuCIM. Result: cubic shows substantial speedups in benchmarks of individual operations and reproduces existing deconvolution and segmentation pipelines, running much faster while maintaining algorithmic fidelity. Conclusion: cubic provides a foundation for scalable, reproducible bioimage analysis that integrates with the Python scientific computing ecosystem and supports both interactive exploration and automated high-throughput analysis. Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programming interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic's API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity.
These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github.com/alxndrkalinin/cubic
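
The device-agnostic dispatch described in this abstract can be illustrated with a minimal sketch. It does not use cubic's actual API: `get_ndimage` and `gaussian_smooth` are hypothetical helper names introduced here, and CuPy is treated as an optional dependency so that the example also runs on a CPU-only machine.

```python
import numpy as np
from scipy import ndimage as cpu_ndimage

try:  # CuPy is optional; without it everything runs on the CPU.
    import cupy as cp
    from cupyx.scipy import ndimage as gpu_ndimage
except ImportError:
    cp = None
    gpu_ndimage = None


def get_ndimage(image):
    """Return the ndimage module matching where `image` lives (hypothetical helper)."""
    if cp is not None and isinstance(image, cp.ndarray):
        return gpu_ndimage
    return cpu_ndimage


def gaussian_smooth(image, sigma=1.0):
    """Smooth a 2D or 3D image on whichever device already holds the data."""
    return get_ndimage(image).gaussian_filter(image, sigma=sigma)


volume = np.random.default_rng(0).random((8, 32, 32))  # small 3D stack on the CPU
smoothed = gaussian_smooth(volume, sigma=1.5)           # stays a NumPy array
```

Passing a CuPy array to the same function would route the call to `cupyx.scipy.ndimage` instead, so the calling code does not change when the data move to the GPU.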

[134] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu

Main category: cs.CV

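TL;DR: Proposes a framework that uses 4D Gaussian Splatting and video relighting to give video diffusion models multi-view character consistency and 3D camera control; it also supports multi-subject generation, scene customization, and control over motion and layout, improving video generation for virtual production.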

Details

Motivation: In virtual production, existing video generation models struggle to keep a character consistent across views while offering precise 3D camera control, and they lack flexible support for lighting, multi-subject composition, and scene customization. Method: A customization data pipeline re-renders volumetric capture performances with diverse camera trajectories via 4D Gaussian Splatting (4DGS) and adds lighting variation with a video relighting model; open-source video diffusion models are fine-tuned on this data, and multi-subject generation is supported through either joint training or noise blending. Result: Experiments show improvements in video quality, personalization accuracy, camera control, and lighting adaptability, with support for multi-subject composition, real-life video customization, and control over motion and spatial layout. Conclusion: The framework strengthens multi-view consistency, camera control, and customization in video diffusion models for virtual production, advancing the use of generative models in complex production settings. Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

[135] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama

Main category: cs.CV

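TL;DR: This paper proposes jointly modeling the Big Five and HEXACO personality traits to automatically recognize apparent personality traits from multimodal human behavior.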

Details

Motivation: Prior work focuses mostly on the Big Five and overlooks HEXACO, which can assess traits such as Honesty-Humility; the relationship between the two when modeled by machine learning is also unclear. Method: The Big Five and HEXACO traits are recognized simultaneously through joint optimization, with experiments on a self-introduction video dataset. Result: Experiments show that the proposed method can effectively recognize both the Big Five and HEXACO traits. Conclusion: Jointly modeling the Big Five and HEXACO helps improve the understanding of multimodal human behavior and offers a new direction for personality recognition. Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO, which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[136] LOTA: Bit-Planes Guided AI-Generated Image Detection

Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui

Main category: cs.CV

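TL;DR: This paper proposes a bit-plane-based noisy image generation and detection method for efficiently distinguishing AI-generated images from real ones.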

Details

Motivation: Existing AI-generated image detectors based on reconstruction error are computationally expensive and fail to capture the intrinsic noise features of the raw image. Method: Noise features are extracted with bit-plane image processing and several image normalization strategies; a maximum gradient patch selection step amplifies the noise signal; lightweight classification heads (a noise-based classifier and a noise-guided classifier) perform the final classification. Result: The method reaches 98.9% average accuracy on the GenImage benchmark, 11.9% above existing methods, generalizes well across generators, and extracts errors at the millisecond level, nearly a hundred times faster. Conclusion: The method clearly improves on existing techniques in accuracy, speed, and generalization, offering an efficient and practical solution for AI-generated image detection. Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (an 11.9% gain) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
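
The two preprocessing ideas in this abstract, keeping the low bit planes as a noise image and then picking the patch with the strongest gradients, can be sketched as follows. This is an illustration under assumptions, not the LOTA implementation: the function names, the number of planes, the patch size, and the use of only horizontal and vertical gradients (the paper uses multi-directional gradients) are all choices made here.

```python
import numpy as np


def low_bit_planes(gray, num_planes=2):
    """Keep only the lowest `num_planes` bits of an 8-bit image, scaled to [0, 1]."""
    mask = (1 << num_planes) - 1
    return (gray & mask).astype(np.float32) / mask


def max_gradient_patch(noise, patch=32):
    """Return the top-left corner of the non-overlapping patch with the largest gradient score."""
    gy, gx = np.gradient(noise)          # vertical and horizontal differences
    score = np.abs(gx) + np.abs(gy)
    best, corner = -1.0, (0, 0)
    for r in range(0, noise.shape[0] - patch + 1, patch):
        for c in range(0, noise.shape[1] - patch + 1, patch):
            s = float(score[r:r + patch, c:c + patch].sum())
            if s > best:
                best, corner = s, (r, c)
    return corner


rng = np.random.default_rng(0)
image = np.full((64, 64), 128, dtype=np.uint8)   # flat image: low bits are all zero
image[32:, 32:] |= rng.integers(0, 4, size=(32, 32), dtype=np.uint8)  # noisy quadrant
noise = low_bit_planes(image)
corner = max_gradient_patch(noise)               # selects the noisy bottom-right quadrant
```

The selected patch would then be fed to a classifier; that stage is omitted because the abstract does not describe its architecture.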

[137] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu

Main category: cs.CV

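TL;DR: Proposes PIA, a multimodal audio-visual framework for detecting deepfakes produced by generative models; combining speech, facial dynamics, and identity features improves the detection of subtle forgery traces.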

Details

Motivation: Conventional deepfake detectors struggle with high-quality forgeries produced by GANs, diffusion models, and other advanced generators, and they tend to miss small temporal inconsistencies. Method: The Phoneme-Temporal and Identity-Dynamic Analysis (PIA) framework fuses phoneme sequences, lip geometry data, and advanced facial identity embeddings for multimodal analysis. Result: The method improves the detection of modern deepfake videos over conventional approaches by capturing subtle inconsistencies across modalities. Conclusion: By integrating language, facial dynamics, and identity cues, PIA detects deepfakes from advanced generative models more effectively and offers a practical approach to multimodal detection. Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[138] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

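TL;DR: This paper proposes event interval modulation (EIM), a modulation scheme designed for event-based optical camera communication (OCC) that carries information in the time intervals between events to raise the transmission rate. Indoor experiments achieve 28 kbps over 10 meters and 8.4 kbps over 50 meters, a new bit-rate benchmark for event-based OCC systems.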

Details

Motivation: Conventional OCC systems built on frame-based cameras suffer from low bit rates and high processing load, and existing event-based OCC systems do not fully exploit the unique characteristics of the event-based vision sensor (EVS) because they lack a modulation method designed for it. Method: The paper proposes the EIM scheme, builds a theoretical model of EIM, tunes the EVS parameters to suit EIM, determines the maximum usable modulation order experimentally, and runs transmission experiments. Result: Transmission succeeds at 28 kbps over 10 meters and 8.4 kbps over 50 meters, well above existing event-based OCC systems. Conclusion: EIM makes effective use of the asynchronous operation and high dynamic range of the EVS and markedly raises the transmission rate of event-based OCC, pointing to a new direction for high-speed, low-latency visible light communication. Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC systems utilizing an event-based vision sensor (EVS) as the receiver have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied; however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
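
The core idea of carrying information in the intervals between events can be shown with a toy encoder and decoder. The abstract does not give the actual symbol-to-interval mapping, so the linear mapping below and the parameters `base_us` and `step_us` are assumptions made purely for illustration.

```python
def eim_encode(symbols, base_us=100, step_us=20):
    """Turn symbols into pulse times; symbol k is sent as an interval of base_us + k * step_us."""
    times, t = [0], 0
    for s in symbols:
        t += base_us + s * step_us
        times.append(t)
    return times


def eim_decode(event_times, base_us=100, step_us=20):
    """Recover symbols from the intervals between consecutive event timestamps."""
    symbols = []
    for prev, cur in zip(event_times, event_times[1:]):
        symbols.append(round((cur - prev - base_us) / step_us))
    return symbols


sent = [0, 3, 1, 2]                      # four symbols of a modulation order 4 alphabet
pulse_times = eim_encode(sent)           # ideal transmitter pulse times in microseconds
jittered = [0, 103, 258, 384, 519]       # the same pulses with a few microseconds of jitter
received = eim_decode(jittered)
```

With a step of 20 microseconds, timing jitter below half a step still rounds to the right symbol, which hints at why the usable modulation order depends on the timing precision of the sensor.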

[139] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang

Main category: cs.CV

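TL;DR: Proposes MACE, a mixture-of-experts accelerated coordinate encoding method for efficient localization and high-quality rendering in large-scale scenes; a gating network combined with an auxiliary-loss-free load balancing strategy lowers cost while keeping precision high.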

Details

Motivation: Existing scene coordinate regression methods are limited by the capacity of a single network when scaled to large scenes, and their computational cost makes efficient localization and high-quality rendering hard to achieve. Method: A mixture-of-experts structure uses a gating network to implicitly classify inputs and select a sub-network, activating only one sub-network per inference, and an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy improves localization accuracy. Result: Experiments on the Cambridge test set show high-quality rendering after only 10 minutes of training, with markedly lower computational cost and higher localization accuracy. Conclusion: MACE offers an efficient, accurate, and low-cost solution for localization and rendering in large-scale scenes. Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MoE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy in large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
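
The single-expert routing described here, where a gating network picks one sub-network and only that one runs, can be sketched with random linear layers. This is a generic top-1 mixture-of-experts illustration, not the MACE architecture: the layer shapes and the name `route_and_predict` are assumptions, and the ALF-LB load balancing step is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, FEAT_DIM = 4, 16

gate_w = rng.normal(size=(FEAT_DIM, NUM_EXPERTS))        # gating network: one linear layer
expert_w = rng.normal(size=(NUM_EXPERTS, FEAT_DIM, 3))   # each expert regresses a 3D scene coordinate


def route_and_predict(feature):
    """Pick the single highest-scoring expert and evaluate only that expert."""
    logits = feature @ gate_w
    expert_id = int(np.argmax(logits))
    return expert_id, feature @ expert_w[expert_id]


feature = rng.normal(size=FEAT_DIM)       # stand-in for an image patch feature
expert_id, coord = route_and_predict(feature)
```

Because only one expert is evaluated per input, inference cost stays close to that of a single small network even as the number of experts grows with scene size.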

[140] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng

Main category: cs.CV

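TL;DR: This paper proposes IPRO, a reinforcement-learning-based video diffusion framework that improves facial identity consistency in image-to-video generation, performing especially well when the face is small and expressions and motion vary widely.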

Details

Motivation: Existing image-to-video models struggle to keep identity consistent when a person's expression and motion change substantially, and the problem is worse when the face occupies a small part of the image; because humans are highly sensitive to identity changes, this challenge needs to be addressed. Method: Identity-Preserving Reward-guided Optimization (IPRO) uses a face identity scorer as the reward signal and optimizes the diffusion model directly with reinforcement learning; the reward is backpropagated through the last steps of the sampling chain for richer gradient feedback, a new facial scoring mechanism treats faces from ground-truth videos as a multi-angle feature pool to improve generalization, and KL-divergence regularization stabilizes training. Result: Extensive experiments on the Wan 2.2 I2V model and an in-house I2V model show that IPRO markedly improves identity consistency in generated videos without changing the model architecture or adding auxiliary modules, and that it trains efficiently and converges quickly. Conclusion: IPRO is an effective, general, and efficient solution to identity preservation in image-to-video generation, with good practicality and extensibility. Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization.
A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.
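
One way to read the facial feature pool idea is that a generated frame is scored against several reference embeddings taken from different angles rather than against a single reference. The sketch below takes the maximum cosine similarity over the pool; the abstract does not state how the pool is aggregated, so the max rule and the name `pool_identity_reward` are assumptions, and the tiny 2D vectors stand in for real face embeddings.

```python
import numpy as np


def pool_identity_reward(frame_emb, pool_embs):
    """Score a generated face embedding by its best cosine match in a reference pool."""
    frame = frame_emb / np.linalg.norm(frame_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    return float(np.max(pool @ frame))


# Toy pool with two reference "views" of the same identity.
pool = np.array([[1.0, 0.0], [0.0, 1.0]])
profile_like = np.array([0.0, 2.0])   # matches the second reference exactly
in_between = np.array([1.0, 1.0])     # halfway between both references

reward_profile = pool_identity_reward(profile_like, pool)
reward_between = pool_identity_reward(in_between, pool)
```

A single frontal reference would penalize the profile-like frame even though it shows the same person, which is the generalization problem the pool is meant to reduce.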

[141] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang

Main category: cs.CV

TL;DR: This paper proposes Identity-GRPO, a human-feedback-driven optimization framework that improves identity consistency in multi-human video generation.

Details

Motivation: Existing methods have difficulty preserving the identities of several people during dynamic interactions and cannot keep multiple characters consistent throughout a video. Method: A large-scale preference dataset is built and used to train a video reward model; a GRPO variant tailored for multi-human consistency then optimizes generators such as VACE and Phantom. Result: Experiments show that Identity-GRPO improves human consistency metrics by up to 18.9% over baseline methods, and ablation studies examine the effect of annotation quality and design choices. Conclusion: Identity-GRPO improves identity consistency in multi-human video generation and offers a workable path for combining reinforcement learning with personalized video generation. Abstract: While methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
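
GRPO scores each sample relative to the other samples generated for the same prompt. The sketch below shows that standard group-relative normalization only; the paper's multi-human variant is not described in the abstract and is not reproduced here, and the reward values are made-up numbers standing in for the output of a video reward model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical identity-consistency rewards for three videos from the same prompt.
rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Videos scoring above the group mean get a positive advantage and are reinforced, while those below it are discouraged, so no separate value network is needed.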

[142] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia

Main category: cs.CV

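TL;DR: This paper proposes MatchAttention, an attention mechanism that dynamically matches relative positions for efficient, high-resolution cross-view matching. Combined with BilinearSoftmax and the hierarchical decoder MatchDecoder, it reaches state-of-the-art results on several benchmarks with low computational complexity and real-time speed.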

Details

Motivation: Existing cross-attention has quadratic complexity and no explicit matching constraints, which makes cross-view matching of high-resolution images difficult. Method: MatchAttention uses BilinearSoftmax for continuous, differentiable sliding-window attention sampling and iteratively updates relative positions through residual connections in the feature channels; MatchDecoder is built around MatchAttention, and gated cross-MatchAttention plus a consistency-constrained loss handle occlusions. Result: The method ranks first in average error on the Middlebury benchmark and needs only 29 ms for KITTI-resolution inference; MatchStereo-T processes 4K images in 0.1 seconds with 3 GB of GPU memory, and the models reach state-of-the-art performance on KITTI, ETH3D, and Spring. Conclusion: The method delivers accurate, low-complexity, real-time high-resolution cross-view matching, making it practical for real applications. Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets.
The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.

[143] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

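TL;DR: Proposes a robust demodulation scheme for optical camera communication with an event camera, combining OOK with toggle demodulation and a digital phase-locked loop; outdoor experiments achieve a BER below 10^-3 at 200 m and 60 kbps and at 400 m and 30 kbps.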

Details

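Motivation: Conventional optical camera communication is prone to interference outdoors, so a more robust demodulation method is needed to keep the bit error rate low at longer distances and higher data rates. Method: An event camera is combined with OOK modulation, and a toggle demodulation strategy plus a digital phase-locked loop (DPLL) provide clock synchronization and stable signal recovery. Result: In outdoor tests at 200 m with 60 kbps and at 400 m with 30 kbps, the system communicates reliably with a bit error rate below 10^-3. Conclusion: The scheme markedly improves the reliability of optical camera communication at long range and higher data rates, making it suitable for outdoor environments. Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.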

[144] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang

Main category: cs.CV

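TL;DR: This paper proposes GauSSmart, a hybrid method that combines 2D foundation models with 3D Gaussian Splatting reconstruction to improve scene reconstruction quality in sparsely covered regions.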

Details

Motivation: Gaussian Splatting performs well on large-scale datasets but struggles to capture detail and maintain realism in sparsely covered regions, mainly because the 3D training data are sparse. Method: GauSSmart integrates 2D computer vision techniques, including convex filtering and semantic feature supervision from foundation models such as DINO, and uses 2D segmentation priors and high-dimensional feature embeddings to guide the densification and refinement of Gaussian splats. Result: Validated on three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of scenes. Conclusion: Hybrid 2D-3D approaches show significant potential; combining 2D foundation models with 3D reconstruction pipelines can overcome the limitations of either approach alone. Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[145] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le

Main category: cs.CV

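TL;DR: Proposes a causal-inference-based framework that leverages semantic features while mitigating confounders; transformation strategies designed around the front-door principle yield gains of up to 7% on CAMELYON17 and a private dataset.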

Details

Motivation: Domain shift in histopathology, caused by differences in acquisition processes or data sources, hampers generalization, and existing methods focus on statistical correlations while overlooking causal relationships. Method: Within a causal inference framework, the front-door principle is implemented through transformation strategies that explicitly incorporate mediators and observed tissue slides to reduce the influence of confounders. Result: On CAMELYON17 and a private histopathology dataset, the method shows consistent gains across unseen domains, up to 7%, and outperforms existing baselines. Conclusion: Causal inference can be a powerful tool for handling domain shift in histopathology image analysis. Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.

[146] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim

Main category: cs.CV

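TL;DR: Proposes a training-free tri-layer contrastive decoding method with watermarking that reduces hallucinations in large vision-language models and produces better visually grounded responses.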

Details

Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are prone to hallucination, relying heavily on a single modality or on memorized training data without proper visual grounding. Method: A three-step procedure selects a mature layer and an amateur layer, identifies a pivot layer with a watermark-related question that tests visual grounding, and applies tri-layer contrastive decoding to produce the final output. Result: Experiments on public benchmarks such as POPE, MME, and AMBER show state-of-the-art performance in reducing LVLM hallucinations. Conclusion: The training-free method effectively mitigates hallucination in LVLMs and generates more reliable, more visually grounded responses. Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.

[147] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Shivangi Yadav,Arun Ross

Main category: cs.CV

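TL;DR: Proposes MID-StyleGAN, a framework that combines diffusion models and GANs to generate multi-domain synthetic ocular images, easing data scarcity in presentation attack detection and markedly improving detection performance.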

Details

Motivation: Because presentation attack (PA) samples are difficult to construct and image, iris presentation attack detection lacks sufficient datasets for training and evaluation. Method: MID-StyleGAN fuses diffusion models with generative adversarial networks and uses a multi-domain architecture to translate between bonafide ocular images and several attack domains (such as printed eyes and cosmetic contact lenses), with an adaptive loss function that maintains domain consistency. Result: The generated images are of higher quality than those of existing methods and markedly improve PAD performance: on the LivDet2020 dataset, the true detect rate at 1% false detect rate rises from 93.41% to 98.72%. Conclusion: MID-StyleGAN generates diverse, realistic multi-domain ocular images and provides a scalable data augmentation solution for iris presentation attack detection. Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
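
The headline metric, true detect rate at a fixed false detect rate, can be computed from detector scores as sketched below. This assumes a higher score means "more likely an attack" and sets the threshold from a quantile of the bonafide scores; the function name `tdr_at_fdr` and the toy scores are illustrative, not taken from the paper or from the LivDet protocol.

```python
import numpy as np


def tdr_at_fdr(pa_scores, bonafide_scores, fdr=0.01):
    """Fraction of attacks detected when at most `fdr` of bonafide samples are flagged."""
    # Threshold chosen so that roughly `fdr` of bonafide scores lie above it.
    threshold = np.quantile(bonafide_scores, 1.0 - fdr)
    return float(np.mean(np.asarray(pa_scores) > threshold))


bonafide = np.linspace(0.0, 1.0, 101)        # toy bonafide scores, evenly spread
attacks = np.array([0.5, 0.995, 1.5, 2.0])   # toy attack scores; one falls below the threshold
tdr = tdr_at_fdr(attacks, bonafide, fdr=0.01)
```

Reporting the rate at a fixed 1% false detect rate keeps comparisons fair across detectors whose raw scores are on different scales.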

[148] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin

Main category: cs.CV

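TL;DR: Proposes VaCo, a method that improves the representations of multimodal large language models (MLLMs) through vision-centric activation and coordination across multiple vision foundation models.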

Details

Motivation: Mainstream MLLMs are supervised only by next-token prediction over text tokens, neglecting vision-centric information that is essential for analytical ability. Method: Introduces visual discriminative alignment, combining learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) to activate specific visual signals under the supervision of multiple vision foundation models (VFMs); a Token Gateway Mask (TGM) coordinates representation conflicts across VFMs. Result: Extensive experiments show that VaCo significantly improves the performance of different MLLMs on a range of benchmarks, demonstrating strong visual comprehension. Conclusion: VaCo effectively strengthens the visual understanding of MLLMs, unifying the optimization of textual and visual outputs by integrating information from multiple VFMs. Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
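
One way to picture a mask that "restricts the information flow among multiple groups" of queries is a block-structured attention mask. The sketch below is a guess at the general idea, not the paper's TGM: the token layout, the function name `token_gateway_mask`, and the rule that context tokens do not read query tokens are all assumptions.

```python
def token_gateway_mask(num_context, group_sizes):
    """Boolean attention mask; mask[i][j] is True if token i may attend to token j.

    Assumed layout: [context tokens | group 0 queries | group 1 queries | ...].
    """
    # group_id is None for shared context tokens, else the query group index.
    group_id = [None] * num_context
    for g, size in enumerate(group_sizes):
        group_id.extend([g] * size)
    total = len(group_id)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if group_id[j] is None:
                # Every token may read the shared image/text context.
                mask[i][j] = True
            elif group_id[i] == group_id[j]:
                # Queries see their own group but never another group.
                mask[i][j] = True
    return mask


mask = token_gateway_mask(num_context=2, group_sizes=[2, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each query group (for example, one per supervising VFM) forms its own block, so signals tied to one VFM cannot leak into queries tied to another.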

[149] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration

Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy

Main category: cs.CV

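TL;DR: Proposes a self-supervised RGB-D point cloud registration method based on cycle-consistent keypoints and a new pose block; it outperforms prior self-supervised methods on ScanNet and 3DMatch and even surpasses some supervised ones.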

Details

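Motivation: How to make effective use of large amounts of unlabeled RGB-D data for geometric scene understanding and improve self-supervised point cloud registration. Method: Introduces cycle-consistent keypoints as salient points to enforce spatial coherence during matching, and designs a new pose block that combines a GRU with transformation synchronization to blend historical and multi-view information. Result: Surpasses previous self-supervised registration methods on ScanNet and 3DMatch, outperforms some older supervised methods on certain metrics, and the components improve existing methods when integrated into them. Conclusion: The method makes effective use of unlabeled RGB-D data and significantly improves self-supervised registration accuracy through keypoint consistency and temporal, multi-view fusion. Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration methods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.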
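
A simple form of cycle consistency is the mutual nearest-neighbour check: a keypoint in frame A is kept only if matching it to frame B and back returns to the same keypoint. The sketch below illustrates that idea on toy 2-D descriptors; the function names are made up for this example, and the paper's actual keypoint selection may differ.

```python
def nearest(query, candidates):
    """Index of the candidate descriptor closest to `query` (squared L2)."""
    dists = [sum((q - c) ** 2 for q, c in zip(query, cand)) for cand in candidates]
    return dists.index(min(dists))


def cycle_consistent_matches(desc_a, desc_b):
    """Keep pairs (i, j) where A[i] -> B[j] and B[j] -> A[i] agree."""
    matches = []
    for i, d in enumerate(desc_a):
        j = nearest(d, desc_b)                 # forward step: A -> B
        if nearest(desc_b[j], desc_a) == i:    # backward step: B -> A
            matches.append((i, j))
    return matches


desc_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.3, 0.0)]
desc_b = [(0.1, 0.0), (5.0, 5.1), (0.9, 0.2)]
print(cycle_consistent_matches(desc_a, desc_b))
```

The fourth descriptor in `desc_a` maps to the first in `desc_b`, but the return trip lands on a different point, so it is discarded as not cycle-consistent.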

[150] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

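TL;DR: Proposes SPR (Spatial Preference Rewarding), which strengthens the fine-grained spatial understanding of multimodal large language models (MLLMs) by rewarding detailed responses with precise object localization.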

Details

Motivation: Existing MLLMs fall short in fine-grained spatial perception (such as generating region descriptions and precisely localizing objects), and their actual outputs receive no direct supervision, so they struggle to meet users' needs for fine-grained spatial understanding. Method: SPR takes randomly selected image regions and MLLM-generated descriptions, introduces a semantic score and a localization score to evaluate text and localization quality, and pairs the best-scored refined description with the lowest-scored initial description for direct preference optimization, improving fine-grained alignment with the visual input. Result: Extensive experiments on standard referring and grounding benchmarks show that SPR effectively improves MLLM spatial understanding with minimal training overhead. Conclusion: By introducing a fine-grained preference reward, SPR markedly strengthens the fine-grained spatial perception of MLLMs without large amounts of annotated data, suggesting a new direction for improving spatial reasoning in multimodal models. Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with SPR, a Spatial Preference Rewarding approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
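
The pairing step described above (best-scored refinement as the preferred response, lowest-scored initial description as the rejected one) might look roughly like the following. How SPR actually combines its semantic and localization scores is not stated in the abstract; the equal-weight sum, the tuple layout, and the function name are assumptions for illustration.

```python
def build_preference_pair(initial, refined, weight=0.5):
    """Build one DPO-style pair from scored descriptions.

    Each candidate is (text, semantic_score, localization_score).
    Assumption: the two scores are merged by a weighted sum.
    """
    def combined(candidate):
        _, semantic, localization = candidate
        return weight * semantic + (1 - weight) * localization

    chosen = max(refined, key=combined)     # best-scored refinement
    rejected = min(initial, key=combined)   # lowest-scored initial description
    return {"chosen": chosen[0], "rejected": rejected[0]}


initial = [("a dog", 0.4, 0.2), ("a brown dog near a door", 0.7, 0.5)]
refined = [("a brown dog at [12, 40, 96, 180]", 0.8, 0.9),
           ("a dog at [0, 0, 50, 50]", 0.5, 0.3)]
print(build_preference_pair(initial, refined))
```

The resulting dictionary has the chosen/rejected shape that preference-optimization trainers commonly expect, though the exact field names here are illustrative.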

[151] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee

Main category: cs.CV

TL;DR: Proposes DOS, which modifies CLIP text embeddings to reduce object neglect and object mixing in multi-object image generation.

Details

Motivation: Existing text-to-image models are prone to object neglect or object mixing on prompts with multiple objects, especially when objects have similar shapes or textures or when background biases differ strongly. Method: Motivated by two key observations about CLIP embeddings, proposes DOS (Directional Object Separation), which modifies three types of CLIP text embeddings to make objects more distinguishable. Result: Experiments show that DOS consistently raises the success rate of multi-object image generation across several benchmarks and reduces object mixing; in human evaluations it receives 26.24%-43.04% more votes than four competing methods. Conclusion: DOS is a practical and effective way to improve multi-object image generation. Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[152] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights

Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan

Main category: cs.CV

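TL;DR: Proposes an efficient dual-resolution bi-directional Mamba model (DRBD-Mamba) for brain tumor segmentation that maintains high accuracy while substantially improving computational efficiency, with robustness validated under multiple conditions.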

Details

Motivation: Existing Mamba models for brain tumor segmentation incur high computational overhead, and their robustness across different data partitions is largely unexplored, leaving reliable evaluation lacking. Method: Proposes DRBD-Mamba, which uses a space-filling curve to reduce the cost of multi-axial scans, a gated fusion module to integrate forward and reverse contexts, and a quantization block to improve robustness; also constructs five systematic BraTS2023 folds for comprehensive evaluation. Result: On the 20% test set, Dice improves by 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor; on the newly proposed five folds, average Dice improves by 0.86% for tumor core and 1.45% for enhancing tumor, with a 15-fold gain in computational efficiency. Conclusion: DRBD-Mamba delivers efficient, accurate, and robust brain tumor segmentation, outperforms existing methods, and shows potential for clinical use. Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. 
Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains a 15-fold improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
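
To see how a space-filling curve can preserve locality during 3D-to-1D mapping, here is a minimal sketch using the Z-order (Morton) curve. The abstract does not say which curve DRBD-Mamba uses, so Morton order is a stand-in chosen for brevity, and both function names are hypothetical.

```python
def morton_index(x, y, z, bits=4):
    """Interleave the bits of (x, y, z) into one Z-order index.

    Stand-in curve for illustration; supports coordinates below 2**bits.
    """
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def flatten_along_curve(size):
    """Order all voxels of a size^3 grid by their position on the curve."""
    voxels = [(x, y, z) for x in range(size) for y in range(size) for z in range(size)]
    return sorted(voxels, key=lambda v: morton_index(*v))


# The first eight voxels on the curve form one compact 2x2x2 block,
# so neighbours in 3D tend to stay close in the 1D sequence.
print(flatten_along_curve(4)[:8])
```

A plain row-major flattening would instead place voxels that differ only in `x` a full `size * size` steps apart, which is one reason models scan along several axes.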

[153] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Brandon Hill,Kma Solaiman

Main category: cs.CV

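TL;DR: Presents BoardVision, a framework for detecting assembly-level motherboard defects. Based on a comparison of YOLOv7 and Faster R-CNN, it proposes a lightweight ensemble, CTV Voter, to balance precision and recall, evaluates robustness under realistic perturbations, and releases a deployable GUI inspection tool.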

Details

Motivation: Assembly-level motherboard defect detection is critical in high-volume electronics manufacturing, but prior work has focused on bare-board or trace-level defects and lacks a systematic study of defects on fully assembled boards. Method: Proposes the BoardVision framework, benchmarks YOLOv7 and Faster R-CNN, and designs CTV Voter, a lightweight ensemble based on confidence-temporal voting, to improve detection performance. Result: Achieves a better balance between precision and recall, evaluates robustness under perturbations such as brightness, sharpness, and rotation changes, and delivers a deployable GUI inspection tool. Conclusion: The work helps move computer vision from benchmark results toward practical quality assurance in motherboard assembly. Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.

[154] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis

Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning

Main category: cs.CV

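TL;DR: Proposes DCMIL, a progressive representation learning model that efficiently processes whole slide images (WSIs) for cancer prognosis prediction without dense annotations and performs well across multiple cancer types.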

Details

Motivation: Existing methods are limited by the computational bottleneck of gigapixel inputs and the scarcity of dense manual annotations, and they often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Method: Proposes dual-curriculum contrastive multi-instance learning (DCMIL), which follows an easy-to-hard progressive learning strategy and transforms gigapixel WSIs directly into prognosis predictions without dense annotations. Result: Experiments on 12 cancer types (5,954 patients, 12.54 million tiles) show that DCMIL outperforms standard WSI-based prognostic models, identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimates, and captures morphological differences between normal and tumor tissue. Conclusion: DCMIL is an efficient WSI analysis framework that needs no dense annotations, performs strongly in cancer prognosis prediction, and has the potential to generate new biological insights. Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic models for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All code has been made publicly accessible at https://github.com/tuuuc/DCMIL.

[155] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu

Main category: cs.CV

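TL;DR: Proposes a unified neural video compression framework that combines intra and inter coding, effectively handling disocclusion, new content, and error propagation, and clearly outperforming DCVC-RT.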

Details

Motivation: Existing neural video compression schemes handle disocclusion, new content, and interframe error accumulation poorly, and can be improved by borrowing ideas from classic video coding. Method: Introduces an intra coding tool, designs a single model that adaptively performs intra/inter coding, and adopts a simultaneous two-frame compression design that exploits interframe redundancy in both directions. Result: Achieves an average 10.7% BD-rate reduction over DCVC-RT, with more stable bitrate and quality, while retaining real-time encoding and decoding. Conclusion: The framework effectively addresses key shortcomings of NVC and brings clear gains in both compression efficiency and stability. Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performance. Code and models will be released.

[156] Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob,Weijia Shao,Gjergji Kasneci

Main category: cs.CV

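TL;DR: Proposes a minimally distorted universal adversarial attack with nuclear norm regularization for video object detection, optimized efficiently with an adaptive optimistic exponentiated gradient method, improving attack effectiveness while remaining highly stealthy.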

Details

Motivation: Deep learning models for video object detection are vulnerable to universal adversarial perturbations, and existing attacks suffer from poorly structured perturbations and inefficient optimization. Method: Uses nuclear norm regularization to concentrate the perturbation in background regions, combined with an adaptive optimistic exponentiated gradient method for efficient optimization, yielding a minimally distorted universal attack. Result: The attack is more effective than low-rank projected gradient descent and Frank-Wolfe based attacks while remaining visually stealthy. Conclusion: The method improves the effectiveness and practicality of adversarial attacks on video object detectors and highlights the importance of structured perturbation design and optimization strategy. Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

[157] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi

Main category: cs.CV

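TL;DR: This review summarizes 49 studies from 2018 to 2025 on unsupervised deep generative models for anomaly detection in neuroimaging, covering autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. These models detect focal lesions in brain MRI well and produce interpretable pseudo-healthy images; future work needs stronger anatomy-aware modelling and clinical validation.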

Details

Motivation: Fully supervised methods depend on large voxel-level annotated datasets and are limited to known pathologies, while annotated data are scarce, especially for rare diseases; this calls for methods that can detect anomalies after training on healthy data alone. Method: A PRISMA-guided scoping review that systematically surveys studies using unsupervised deep generative models (autoencoders, VAEs, GANs, and diffusion models) for anomaly detection and segmentation in brain MRI and CT, comparing architectural choices and performance metrics. Result: 49 studies were included; generative models perform well on large focal lesions and are making progress on subtler abnormalities; they can produce interpretable pseudo-healthy reconstructions and support semi-supervised learning and cross-disease deviation mapping. Conclusion: Unsupervised generative models are a promising approach to anomaly detection in neuroimaging, especially when annotated data are scarce; future work should develop anatomy-aware models, foundation models, task-appropriate evaluation metrics, and rigorous clinical validation to enable clinical use. Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. 
Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.

[158] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Thomas Katraouras,Dimitrios Rafailidis

Main category: cs.CV

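TL;DR: Proposes MIR-L, a compression method for multi-task image restoration models whose iterative pruning strategy maintains or even exceeds existing performance at high sparsity, performing deraining, dehazing, and denoising with only 10% of the parameters retained.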

Details

Motivation: Lossy operations applied by online social networks often degrade image quality and hurt user experience; existing multi-task image restoration models are powerful but have too many parameters to be computationally efficient, so a lightweight approach is needed. Method: Proposes MIR-L, which uses an iterative pruning strategy: over multiple rounds it removes low-magnitude weights and resets the remaining weights to their original initialization, uncovering highly sparse subnetworks ("winning tickets") inside the overparameterized model. Result: Experiments on deraining, dehazing, and denoising show that MIR-L maintains high performance with only 10% of the trainable parameters, matching or exceeding state-of-the-art models. Conclusion: Through effective iterative pruning, MIR-L compresses multi-task image restoration models, greatly reducing parameter count while preserving strong performance and offering a practical route to deploying lightweight restoration models. Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
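
The prune-and-reset loop described above can be sketched on a flat list of weights. This is a toy illustration of iterative magnitude pruning with rewinding to initialization, not the MIR-L code: `train_fn` is a placeholder for real training, and the per-round pruning fraction is an assumed hyperparameter.

```python
def prune_round(weights, mask, fraction):
    """Mask out the lowest-magnitude `fraction` of the still-active weights."""
    active = sorted((abs(w), i) for i, w in enumerate(weights) if mask[i])
    n_prune = int(len(active) * fraction)
    new_mask = list(mask)
    for _, i in active[:n_prune]:
        new_mask[i] = False
    return new_mask


def iterative_pruning(init_weights, train_fn, rounds, fraction):
    """Repeat: rewind to initialization, train, prune by magnitude."""
    mask = [True] * len(init_weights)
    for _ in range(rounds):
        # Rewind: surviving weights restart from their ORIGINAL initial values.
        start = [w if m else 0.0 for w, m in zip(init_weights, mask)]
        trained = train_fn(start, mask)
        mask = prune_round(trained, mask, fraction)
    return mask


def toy_train(weights, mask):
    """Placeholder for real training: just scales the active weights."""
    return [2.0 * w if m else 0.0 for w, m in zip(weights, mask)]


init = [0.1, -0.5, 0.3, 0.9, -0.05, 0.7, 0.2, -0.4]
final_mask = iterative_pruning(init, toy_train, rounds=2, fraction=0.5)
print(final_mask, sum(final_mask) / len(init))
```

Pruning a fixed fraction per round compounds geometrically: removing 20% of the remaining weights each round, for instance, leaves roughly 10% after about ten rounds.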

[159] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data

Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman

Main category: cs.CV

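TL;DR: Using Sentinel-2 satellite image time series with CNN-LSTM models, this study automatically detects seasonal grazing on fields with high recall and can substantially improve the efficiency of land-use compliance inspections.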

Details

Motivation: Grazing affects agricultural production and biodiversity, but scalable monitoring is lacking, so an efficient and reliable way to identify grazed areas is needed. Method: Using multi-temporal Sentinel-2 L2A reflectance data for polygon-defined field boundaries, trains an ensemble of CNN-LSTM models on April-October imagery for binary prediction (grazed / not grazed). Result: The model reaches an average F1 score of 77% across five validation splits, with 90% recall on grazed pastures; when only 4% of sites can be inspected, prioritising fields the model predicts as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. Conclusion: Freely available, coarse-resolution satellite data combined with deep learning can effectively guide conservation-oriented land-use compliance inspections and shows good operational promise. Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
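
The "times more confirmed sites than random inspection" figure is a lift: hits among the top-ranked fields divided by the hits expected from inspecting the same number of fields at random. The sketch below shows that arithmetic on toy data; the function name, the score convention (higher = more likely non-grazed), and the data are assumptions, not the study's evaluation code.

```python
def inspection_lift(nongrazed_scores, is_nongrazed, budget_fraction):
    """Lift of model-prioritised inspection over random inspection.

    nongrazed_scores: model score per field (assumed: higher = more likely non-grazed).
    is_nongrazed: ground truth per field, 1 if non-grazed else 0.
    """
    n = len(nongrazed_scores)
    k = max(1, int(n * budget_fraction))  # number of fields we can visit
    ranked = sorted(range(n), key=lambda i: nongrazed_scores[i], reverse=True)
    hits = sum(is_nongrazed[i] for i in ranked[:k])
    # Random inspection finds non-grazed fields at the base rate.
    expected_random = k * sum(is_nongrazed) / n
    return hits / expected_random


# 20 toy fields, 5 truly non-grazed (indices 0-4); the model ranks 4 of them on top.
truth = [1] * 5 + [0] * 15
scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.8] + [0.2] * 14
print(inspection_lift(scores, truth, budget_fraction=0.25))
```

Here 5 inspections find 4 non-grazed fields where random inspection would be expected to find 1.25, a lift of about 3.2. Lift is bounded above by the inverse of the base rate, so a large value such as the reported 17.2 is only possible when non-grazed fields are rare.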

[160] Vision Mamba for Permeability Prediction of Porous Media

Ali Kashefi,Tapan Mukerji

Main category: cs.CV

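TL;DR: Proposes, for the first time, using Vision Mamba as the backbone for predicting the permeability of three-dimensional porous media; it offers advantages over ViTs and CNNs in computational efficiency, memory footprint, and parameter count, which experiments confirm.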

Details

Motivation: Because Vision Mamba scales linearly and offers fewer parameters and higher efficiency than ViTs and CNNs in image classification, the authors explore its potential for predicting the permeability of three-dimensional porous media. Method: Builds a model with Vision Mamba as the backbone, compares it with ViT and CNN models across multiple aspects of permeability prediction, and runs an ablation study to assess how each component affects performance. Result: Experiments show that Vision Mamba maintains high accuracy while substantially reducing compute and memory consumption, outperforming the ViT and CNN models. Conclusion: Vision Mamba performs strongly on permeability prediction for three-dimensional porous media and could serve as an alternative backbone in large vision models. Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus it can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.

[161] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt

Main category: cs.CV

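TL;DR: SurgScan is a YOLOv8-based AI system for real-time detection of surgical instrument defects. Trained on 102,876 images, it reaches 99.3% accuracy with inference times of 4.2-5.8 ms and supports industrial deployment.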

Details

Motivation: Quality control for surgical instruments traditionally relies on manual inspection, which is error-prone and inconsistent and puts sterility, mechanical integrity, and patient safety at risk, so an automated, high-accuracy defect detection solution is needed. Method: Proposes the SurgScan framework, which uses a YOLOv8 model trained on a high-resolution dataset of 102,876 images covering 11 instrument types and five major defect categories, with contrast-enhanced preprocessing to improve detection. Result: SurgScan reaches 99.3% accuracy with inference times of 4.2-5.8 ms per image, outperforming existing CNN models; statistical analysis shows that contrast enhancement significantly improves detection. Conclusion: SurgScan offers a scalable, cost-effective AI solution for automated quality control in medical device manufacturing, reducing reliance on manual inspection, complying with ISO 13485 and FDA standards, and helping raise manufacturing quality. Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.

[162] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao

Main category: cs.CV

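TL;DR: Proposes a prompt-aware noise projection method that applies text-conditioned refinement to the initial noise before denoising, improving alignment in text-to-image generation without modifying the pretrained model and at low inference cost.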

Details

Motivation: The noise distributions seen in training and inference differ (training noise is prompt-dependent, whereas inference draws prompt-agnostic Gaussian noise), which leads to generated images that align poorly with the prompt. Method: Proposes a noise projector that, at inference, maps the initial noise to a prompt-aware counterpart conditioned on the prompt embedding; feedback on generated images from a vision-language model is distilled into a reward model, and the projector is trained with a quasi-direct preference optimization. Result: The method clearly improves text-image alignment across diverse prompts, needs no reference images or handcrafted priors, and requires only a single forward pass at inference. Conclusion: Prompt-aware noise projection effectively mitigates the training-inference mismatch, improving generation quality and consistency while keeping inference cost low. Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[163] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

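TL;DR: PaddleOCR-VL is a resource-efficient document parsing model. Its core, PaddleOCR-VL-0.9B, combines a dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model, supports 109 languages, and recognizes complex elements such as text, tables, formulas, and charts. It reaches SOTA performance on multiple benchmarks with fast inference, making it suitable for practical deployment.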

Details

Motivation: Efficient, accurate, multilingual document parsing, especially handling complex elements (tables, formulas, charts, and so on) with minimal resource consumption, requires a vision-language model that is both high-performing and lightweight. Method: Proposes PaddleOCR-VL, whose compact vision-language model PaddleOCR-VL-0.9B fuses a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model, supports multilingual input, and is optimized for both page-level and element-level tasks. Result: On public and in-house benchmarks, PaddleOCR-VL reaches SOTA performance in page-level document parsing and element-level recognition, significantly outperforms existing solutions, competes with top-tier VLMs, and keeps inference fast and resource use low. Conclusion: PaddleOCR-VL is an efficient, capable, and practical document parsing solution that balances performance and resource consumption and suits large-scale deployment in real-world scenarios. Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[164] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang

Main category: cs.CV

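TL;DR: Proposes DentVFM, the first family of vision foundation models for dentistry. Built on large-scale self-supervised learning, it generalizes well across tasks and modalities on multi-modal dental imaging. The paper also introduces a new benchmark, DentBench, pushing dental AI toward efficiency, scalability, and low label dependence.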

Details

Motivation: Existing dental AI systems are limited by single-modality focus, task-specific design, and heavy reliance on labeled data, which hinders generalization across diverse clinical scenarios; the lack of dedicated evaluation benchmarks also holds back dental intelligence. Method: Proposes DentVFM, 2D and 3D vision foundation models based on the Vision Transformer architecture, trained with self-supervised learning on DentVista, a large dataset of approximately 1.6 million multi-center, multi-modal dental images; also builds DentBench, a comprehensive benchmark covering eight dental subspecialties and a range of diseases and imaging modalities. Result: DentVFM generalizes strongly across tasks such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation, significantly outperforming supervised, self-supervised, and weakly supervised baselines; it also enables cross-modality diagnostics and in some situations is more reliable than experienced dentists. Conclusion: DentVFM sets a new paradigm for dental AI with good scalability, adaptability, and label efficiency, and could improve intelligent dental healthcare and help address global shortfalls in oral healthcare resources. Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. 
Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[165] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi

Main category: cs.CV

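TL;DR: Proposes PL-SE-ADA, a new domain adaptation framework for medical images that performs domain harmonization and interpretable representation learning on brain MR images. By separating domain-invariant and domain-specific features, it preserves disease-relevant information while achieving good reconstruction and classification performance and high interpretability.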

Details

Motivation: Medical images exhibit domain shift across imaging devices, which degrades machine learning performance, and existing methods lack interpretability, limiting clinical use. Method: Uses two encoders (f_E and f_SE) to extract domain-invariant (z_u) and domain-specific (z_d) features, combines an image reconstruction loss with adversarial training, and reconstructs the image additively to improve interpretability and information retention. Result: Matches or outperforms existing methods on image reconstruction, disease classification, and domain recognition, while supporting visualization of both domain-independent features and domain-specific components. Conclusion: PL-SE-ADA achieves effective domain harmonization while preserving disease-relevant information, and provides a representation learning framework that is interpretable end to end, with good application potential. Abstract: Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability, an essential requirement in medical applications, leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder $f_D$ to reconstruct the image, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.
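
The additive reconstruction described in this abstract can be written out as follows. This is a sketch under stated assumptions, not the paper's exact objective: the abstract only says the input is reconstructed by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, so the use of a single shared decoder $f_D$ for both terms and the choice of an $L_1$ reconstruction norm are assumptions made here for illustration.

```latex
\begin{aligned}
\boldsymbol{z_u} &= f_E(\boldsymbol{x}), \qquad \boldsymbol{z_d} = f_{SE}(\boldsymbol{x}), \\
\hat{\boldsymbol{x}} &= f_D(\boldsymbol{z_u}) + f_D(\boldsymbol{z_d}), \\
\mathcal{L}_{\mathrm{rec}} &= \lVert \boldsymbol{x} - \hat{\boldsymbol{x}} \rVert_1 .
\end{aligned}
```

Because the two terms add in image space, each one can be rendered separately, which is one plausible reading of how the domain-independent and domain-specific components become visualizable.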

[166] Exploring Image Representation with Decoupled Classical Visual Descriptors

Chenyuan Qu,Hao Chen,Jianbo Jiao

Main category: cs.CV

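TL;DR: Proposes VisualSplit, a framework that decomposes images into classical visual descriptors (such as edges, colour, and intensity distribution) to combine traditional visual features with modern deep learning. It yields interpretable visual representations and shows effectiveness and controllability in tasks such as image generation and editing.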

Details

Motivation: Deep learning has made remarkable progress in image understanding, but its internal representations are usually hard to interpret, whereas classical visual descriptors are readily understandable to humans. The paper aims to bridge this gap and asks whether modern learning methods can benefit from these classical cues. Method: Proposes the VisualSplit framework, which explicitly decomposes images into decoupled classical visual descriptors and treats them as independent but complementary components of visual knowledge. A reconstruction-based pre-training strategy lets the model capture the essence of each descriptor while keeping it interpretable. Result: VisualSplit shows effective attribute control in several advanced visual tasks, such as image generation and editing, going beyond conventional classification and segmentation and supporting the approach's value for visual understanding. Conclusion: By explicitly decomposing visual attributes, VisualSplit offers a learning paradigm that is both interpretable and high-performing, and demonstrates the potential of combining classical visual cues with modern deep learning. Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.

Ziqi Jiang,Yanghao Wang,Long Chen

Main category: cs.CV

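TL;DR: Proposes Flow Matching Alignment (FMA), a model-agnostic multi-step adjustment method that learns a cross-modal velocity field to achieve more precise and robust feature alignment, addressing the limitation that existing parameter-efficient fine-tuning methods perform only a one-step adjustment.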

Details

Motivation: Existing parameter-efficient fine-tuning (PEFT) methods perform only a one-step adjustment and struggle to disentangle highly entangled multi-modal features on complex or difficult datasets, so a stronger alignment mechanism is needed. Method: Proposes Flow Matching Alignment (FMA), which uses a fixed coupling strategy to guarantee category correspondence, a noise augmentation strategy to alleviate data scarcity, and an early-stopping solver to improve efficiency and accuracy. Result: FMA yields significant performance gains across multiple benchmarks and backbones, particularly on challenging datasets. Conclusion: As the first model-agnostic multi-step adjustment method, FMA achieves finer and more robust cross-modal alignment than conventional one-step PEFT methods. Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
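
The idea of multi-step adjustment with an early-stopping solver can be illustrated with a generic Euler integrator over a velocity field. This is a minimal sketch, not the authors' solver: the function name `integrate_with_early_stop`, the uniform step size, and the fixed `stop_step` rule are all assumptions introduced here, and the velocity field is passed in as an arbitrary callable rather than a learned network.

```python
from typing import Callable, List, Optional

Vector = List[float]
# A velocity field maps (current feature, time in [0, 1]) to a velocity vector.
VelocityField = Callable[[Vector, float], Vector]


def integrate_with_early_stop(
    x0: Vector,
    velocity: VelocityField,
    num_steps: int = 10,
    stop_step: Optional[int] = None,
) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 towards t = 1.

    If `stop_step` is given, only the first `stop_step` of the `num_steps`
    updates are applied (hypothetical early-stopping rule for illustration).
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    steps = num_steps if stop_step is None else max(0, min(stop_step, num_steps))
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(steps):
        v = velocity(x, k * dt)
        # One small rectification step along the current velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a constant velocity pointing from a source feature to its target, running all steps reaches the target, while stopping halfway leaves the feature at the midpoint; a one-step method corresponds to `num_steps=1`.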

[168] Consistent text-to-image generation via scene de-contextualization

Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu

Main category: cs.CV

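TL;DR: Proposes SDeC, a training-free prompt embedding editing method that addresses identity shift in text-to-image generation by de-contextualizing the scene context to improve identity consistency.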

Details

Motivation: Existing methods for identity preservation in text-to-image generation usually assume that all target scenes are known in advance, which is unrealistic in practice. Method: Proposes Scene De-Contextualization (SDeC), which quantifies SVD directional stability to adaptively re-weight eigenvalues and suppress the scene-identity correlation in the prompt embedding. Result: Experiments show that SDeC significantly improves identity preservation while maintaining scene diversity, without requiring prior knowledge of all target scenes. Conclusion: SDeC is an efficient, flexible, and general training-free method suited to improving identity consistency in real-world text-to-image generation. Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[169] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang

Main category: cs.CV

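TL;DR: Proposes a proactive question-answering task for egocentric streaming video, aiming for an AI that can actively understand, anticipate, and respond in time to events in everyday human settings. The authors build the ESTP-Bench benchmark and the ESTP-F1 metric, and present a complete technical framework comprising a data engine, a multi-stage training strategy, and a dynamic compression technique.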

Details

Motivation: To let AI perceive and respond proactively in real human environments, going beyond passive observation toward human-like situational understanding and decision-making. Method: Proposes a three-part technical pipeline: a data engine that generates training data, a multi-stage training strategy that improves model understanding, and a dynamic compression technique that keeps real-time synchronized inference efficient. Also builds the ESTP-Bench benchmark and the ESTP-F1 evaluation metric. Result: The proposed model outperforms multiple baselines on several online and offline benchmarks and effectively satisfies the three properties of Proactive Coherence, Just-in-Time Responsiveness, and Synchronized Efficiency. Conclusion: The work advances proactive interaction from an egocentric viewpoint and provides a workable framework for future intelligent assistants that must understand and respond in real time in dynamic environments. Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/

[170] BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

Junyi Wu,Jiaming Xu,Jinhao Li,Yongkang Zhou,Jiayi Pan,Xingyang Li,Guohao Dai

Main category: cs.CV

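TL;DR: Proposes BalanceGS, an algorithm-system co-design for optimizing 3D Gaussian Splatting training. Through density control, adaptive sampling, and memory access reordering, it achieves a 1.44x training speedup while maintaining reconstruction quality.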

Details

Motivation: The conventional 3D Gaussian Splatting (3DGS) training pipeline suffers from three efficiency problems that limit training speed: skewed Gaussian density allocation, imbalanced computation workload, and fragmented memory access. Method: (1) At the algorithm level, a heuristic workload-sensitive Gaussian density control that automatically balances point distributions between dense and sparse regions; (2) at the system level, a similarity-based Gaussian sampling and merging mechanism for dynamic, adaptive workload distribution; (3) at the mapping level, a reordering-based memory access strategy that restructures RGB storage to support batch loading in shared memory. Result: Achieves a 1.44x training speedup over conventional 3DGS on an NVIDIA A100 GPU with negligible loss in reconstruction quality. Conclusion: Through algorithm-system co-optimization, BalanceGS significantly improves 3DGS training efficiency and offers a practical route to efficient 3D reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, the algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on an NVIDIA A100 GPU with negligible quality degradation.

[171] CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification

Dongwook Lee,Sol Han,Jinwhan Kim

Main category: cs.CV

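TL;DR: Proposes CALM-Net, a curvature-aware multi-branch neural network for vehicle re-identification from LiDAR point clouds, which improves performance by fusing edge convolution, point attention, and a curvature embedding.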

Details

Motivation: To address the challenge of learning discriminative and complementary features from 3D point clouds in order to distinguish between vehicles. Method: Uses a multi-branch architecture that combines edge convolution, point attention, and a curvature embedding to capture local surface variation and contextual information. Result: On the nuScenes dataset, mean re-identification accuracy improves by approximately 1.97 percentage points over the strongest baseline. Conclusion: Incorporating curvature information and multi-branch feature learning effectively improves LiDAR point cloud-based vehicle re-identification. Abstract: This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97 percentage points compared with the strongest baseline in our study. The results confirm the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.

[172] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky,Shimon Malnick,Shai Avidan

Main category: cs.CV

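TL;DR: Proposes a new vision-language framework for pixel-level keypoint localization, consisting of a Point Descriptor that generates free-form keypoint descriptions and a Point Localizer that regresses precise coordinates. The authors also build the LlamaPointInPart dataset and a new evaluation protocol, and outperform baseline models.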

Details

Motivation: Existing vision-language models are limited to object-level or region-level cross-modal understanding and cannot comprehend pixel-level keypoints through natural language; this paper aims to fill that gap. Method: Proposes a two-component framework: a Point Descriptor that generates context-rich, free-form, coarse-to-fine keypoint descriptions, and a Point Localizer that regresses those descriptions to precise pixel coordinates. Builds LlamaPointInPart, a dataset of 20K+ samples synthesized with multiple vision-language models. Optimizes the Descriptor on AP-10K with GRPO, using the frozen Localizer as a reward model to improve localization accuracy. Establishes a new evaluation protocol that measures performance by the distance between predicted and ground truth points. Result: On LlamaPointInPart, the proposed method significantly outperforms baseline models; the framework works in both directions, supporting keypoint-guided understanding as well as language-guided precise localization. Conclusion: The framework achieves natural-language-driven pixel-level keypoint localization and, through the interplay of free-form description and precise regression, advances fine-grained cross-modal understanding in vision-language models. Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.
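
The evaluation protocol above scores a description by how close the localizer's predicted point lands to the ground truth point. A minimal sketch of such a distance-based score follows; the abstract does not specify the exact metric, so the Euclidean distance, the optional normalization factor `norm`, and the thresholded hit rate in `fraction_within` are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


def localization_error(pred: Point, gt: Point, norm: float = 1.0) -> float:
    """Euclidean distance between predicted and ground truth points.

    `norm` is an assumed normalization factor (for example an image or
    bounding-box size) so that errors are comparable across images.
    """
    if norm <= 0:
        raise ValueError("norm must be positive")
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) / norm


def fraction_within(preds: List[Point], gts: List[Point],
                    threshold: float, norm: float = 1.0) -> float:
    """Share of predictions whose normalized error is at most `threshold`."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    hits = sum(localization_error(p, g, norm) <= threshold
               for p, g in zip(preds, gts))
    return hits / len(preds)
```

Scoring the predicted point rather than the text sidesteps the problem that many different free-form descriptions can refer to the same pixel.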

[173] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Zhifei Chen,Tianshuo Xu,Leyi Wu,Luozhou Wang,Dongyu Yan,Zihan You,Wenting Luo,Guo Zhang,Yingcong Chen

Main category: cs.CV

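TL;DR: STANCE is a new image-to-video generation framework that introduces Instance Cues and Dense RoPE to address the lack of coherent object motion and interaction in video generation.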

Details

Motivation: Existing video generation methods struggle to keep object motion and interactions coherent, mainly because human-provided motion hints lose effective information after encoding, and because optimizing appearance and motion in a single head favors texture over temporal consistency. Method: Proposes the STANCE framework with two components: (1) Instance Cues, which turn sparse, user-editable hints into a dense 2.5D motion field; and (2) Dense RoPE, which tags a small set of motion tokens with spatially addressable rotary embeddings to preserve the salience of the hints. It also jointly predicts RGB and an auxiliary map (such as segmentation or depth) to separate structure from appearance optimization. Result: STANCE improves temporal coherence and structural stability without per-frame trajectory scripts, reduces depth ambiguity compared with 2D arrow inputs, and remains easy to use. Conclusion: With two simple components, STANCE effectively improves motion coherence and user control in video generation, offering a more stable and practical solution for image-to-video generation. Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB $+$ auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
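
One possible reading of the Instance Cues construction ("averaging per-instance flow and augmenting with monocular depth over the instance mask") is sketched below. This is an illustration under assumptions, not the paper's implementation: it assumes the mean 2D flow over the mask is broadcast to every masked pixel, per-pixel depth is attached as a third channel, and pixels outside the mask are zero. The function name `instance_cue` and the nested-list layout are hypothetical.

```python
from typing import List, Tuple

Flow = Tuple[float, float]          # (dx, dy) per pixel
Cue = Tuple[float, float, float]    # (dx, dy, depth) per pixel


def instance_cue(mask: List[List[bool]],
                 flow: List[List[Flow]],
                 depth: List[List[float]]) -> List[List[Cue]]:
    """Build a dense 2.5D motion field for one instance.

    Every pixel inside the mask receives the instance's mean flow plus its
    own depth value; pixels outside the mask are set to zero.
    """
    inside = [(y, x) for y, row in enumerate(mask)
              for x, flag in enumerate(row) if flag]
    if not inside:
        raise ValueError("instance mask is empty")
    mean_dx = sum(flow[y][x][0] for y, x in inside) / len(inside)
    mean_dy = sum(flow[y][x][1] for y, x in inside) / len(inside)
    field = [[(0.0, 0.0, 0.0) for _ in row] for row in mask]
    for y, x in inside:
        field[y][x] = (mean_dx, mean_dy, depth[y][x])
    return field
```

Averaging over the mask turns a few sparse hints into one consistent motion per object, while the depth channel carries the camera-relative information that a flat 2D arrow lacks.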

[174] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

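TL;DR: Proposes a hierarchical re-classification system that combines SpeciesNet with CLIP embeddings and metric learning to refine coarse animal classification results to species level, achieving 96.5% accuracy on a LILA BC dataset and raising 64.9% of detections to species-level identification.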

Details

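Motivation: Existing animal classification models such as SpeciesNet use conservative rollup strategies and often label animals at high taxonomic levels rather than at species level, which limits applications such as ecological monitoring; a finer-grained species-level method is therefore needed. Method: Builds a five-stage hierarchical re-classification pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) that fuses SpeciesNet EfficientNetV2-M predictions with CLIP embeddings for fine-grained classification. Result: Evaluated on the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections), the system recovers 761 bird detections from "blank" and "animal" labels and re-classifies 456 detections originally labeled animal, mammal, or blank with 96.5% accuracy, with 64.9% reaching species-level identification. Conclusion: The method effectively improves the fine-grained recognition ability of existing classification models and substantially raises the share of species-level labels, making it suitable for biodiversity monitoring and other applications that need precise species information. Abstract: State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent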

[175] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

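TL;DR: Evaluates unsupervised, zero-shot methods for sorting wildlife images with self-supervised vision transformers, developed and tested within the Animal Detect platform. DINOv2 with UMAP and GMM reaches 88.6% accuracy on 5 species, 1D ordering reaches 95.2% coherence for fish, and the approach has been deployed in production to speed up annotation.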

Details

Motivation: Existing classifiers cannot cover the many unseen species in camera trap datasets, so zero-shot methods that organize large volumes of wildlife imagery without labels are needed. Method: Compares three architectures (CLIP, DINOv2, MegaDescriptor) combined with PCA or UMAP dimensionality reduction and DBSCAN or GMM clustering, and explores continuous 1D similarity ordering via t-SNE. Result: On a 5-species test set with labels used only for evaluation, DINOv2 + UMAP + GMM achieves 88.6% accuracy (macro-F1 = 0.874); 1D ordering reaches 88.2% coherence for mammals and birds and 95.2% for fish. Conclusion: DINOv2 with UMAP and GMM performs best, and continuous similarity ordering has been deployed in production, speeding up manual annotation for biodiversity monitoring. Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.

[176] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong,Jiaqi Gu,Qi Yang,Lubin Fan,Yue Wu,Ying Wang,Kun Ding,Shiming Xiang,Jieping Ye

Main category: cs.CV

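TL;DR: Proposes Wiki-PRF, a three-stage method for knowledge-based visual question answering (KB-VQA) whose processing, retrieval, and filtering stages improve multimodal query quality and the relevance of retrieved results. Combined with a vision-language model trained via reinforcement learning, it achieves state-of-the-art performance on E-VQA and InfoSeek.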

Details

Motivation: Existing retrieval-augmented generation (RAG) methods for KB-VQA suffer from poor multimodal query quality and low relevance of retrieved results, so models need to make better joint use of visual and textual information. Method: Proposes Wiki-PRF with three stages: a processing stage that dynamically invokes visual tools to extract precise multimodal information, a retrieval stage that fuses visual and text features for knowledge retrieval, and a filtering stage that filters results for relevance and concentrates them. The vision-language model is trained with reinforcement learning using answer accuracy and format consistency as rewards. Result: On the E-VQA and InfoSeek benchmarks, the method reports significant improvements (36.0 and 42.8 respectively), reaching state-of-the-art performance. Conclusion: Wiki-PRF effectively improves multimodal retrieval quality and answer accuracy in KB-VQA, and the reinforcement learning framework strengthens the model's reasoning, tool invocation, and filtering of irrelevant content. Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF

[177] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding,Keisuke Fujii,Toru Tamaki

Main category: cs.CV

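TL;DR: Proposes the Shot2Tactic-Caption framework and the first badminton tactic captioning dataset for generating video captions at multiple semantic and temporal scales, describing both individual shots and how tactics are dynamically executed.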

Details

Motivation: Existing methods struggle to capture both individual actions in badminton matches and how tactics evolve over time, and lack the multi-scale captioning ability needed for tactical understanding. Method: Proposes a dual-branch framework combining a visual encoder, a spatio-temporal Transformer encoder, and a decoder, and introduces a Tactic Unit Detector and a shot-wise prompt-guided mechanism to jointly generate shot-level and tactic-level captions. Result: The framework's effectiveness is validated on the authors' Shot2Tactic-Caption dataset; experiments show that the ResNet50-based spatio-temporal encoder performs best and that the shot-wise prompt mechanism improves the coherence and accuracy of tactic captions. Conclusion: The method effectively generates multi-level tactical captions for badminton matches, in particular capturing tactical executions that are interrupted and later resumed, and offers a new solution for sports video understanding. Abstract: Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

[178] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Natan Bagrov,Eugene Khvedchenia,Borys Tymchenko,Shay Aharon,Lior Kadoch,Tomer Keren,Ofri Masad,Yonatan Geifman,Ran Zilberstein,Tuomas Rintamaki,Matthieu Le,Andrew Tao

Main category: cs.CV

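TL;DR: Proposes Efficient Video Sampling (EVS), which removes image patches that stay static across consecutive frames to cut redundant tokens in vision-language models, speeding up inference and supporting longer input sequences.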

Details

Motivation: Existing vision-language models handling long videos are constrained by the quadratic compute cost of dense frame sequences and by token budgets, leading to insufficient context and high latency. Method: Introduces EVS, which identifies and prunes temporally static patches that remain unchanged between consecutive frames while preserving positional information. It requires no architectural changes or retraining and can be applied directly at inference time. An additional uptraining phase with stochastic pruning rates makes the model robust to different compression levels. Result: EVS substantially reduces token count while maintaining semantic fidelity, lowers large language model time-to-first-token (TTFT) by up to 4x, and supports longer input sequences. Experiments show improved efficiency-accuracy trade-offs across settings. Conclusion: EVS is a simple plug-and-play method that addresses the scalability bottleneck of vision-language models on long videos, delivering faster inference and longer context with almost no accuracy loss. Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
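
The core idea of pruning temporally static patches while keeping positional identity can be sketched in a few lines. This is a toy illustration, not the EVS implementation: the mean-absolute-difference test, the tolerance `tol`, and the function names `static_patch_mask` and `prune_video_tokens` are assumptions made here, and patches are represented as plain lists of floats rather than model embeddings.

```python
from typing import List, Tuple

Patch = List[float]    # feature vector (or flattened pixels) of one patch
Frame = List[Patch]    # all patches of one frame, in a fixed spatial order


def static_patch_mask(prev: Frame, curr: Frame, tol: float = 1e-3) -> List[bool]:
    """Mark patches of `curr` that are (nearly) unchanged relative to `prev`."""
    mask = []
    for p, c in zip(prev, curr):
        # Mean absolute difference of the same spatial patch in two frames.
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        mask.append(diff <= tol)
    return mask


def prune_video_tokens(frames: List[Frame],
                       tol: float = 1e-3) -> List[Tuple[int, int, Patch]]:
    """Keep every patch of the first frame, then only patches that changed.

    Each kept token is returned with its original (frame_index, patch_index),
    so its position is preserved even though neighbours were dropped.
    """
    if not frames:
        return []
    kept = [(0, i, patch) for i, patch in enumerate(frames[0])]
    for t in range(1, len(frames)):
        mask = static_patch_mask(frames[t - 1], frames[t], tol)
        for i, (patch, is_static) in enumerate(zip(frames[t], mask)):
            if not is_static:
                kept.append((t, i, patch))
    return kept
```

In a mostly static scene the token count after the first frame scales with the amount of motion rather than with the number of frames, which is where the latency savings would come from.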

[179] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui,Johannes Schusterbauer,Timy Phan,Felix Krause,Josh Susskind,Miguel Angel Bautista,Björn Ommer

Main category: cs.CV

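TL;DR: RepTok is a generative modeling framework built on self-supervised vision transformers that represents an image with a single continuous latent token, enabling efficient, faithful image reconstruction and strong results on image generation tasks.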

Details

Motivation: To build compact, efficient generative models from self-supervised learning (SSL) representations and overcome the redundancy and high training cost of conventional 2D latent spaces. Method: Starting from a pre-trained SSL encoder, fine-tunes only the semantic token embedding and trains it jointly with a generative decoder under a flow matching objective; adds a cosine-similarity loss to preserve the favorable geometry of the original SSL space. Result: RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Conclusion: Fine-tuned SSL representations can serve as compact and effective latent spaces for efficient generative modeling, and RepTok points toward simpler architectures and lower compute costs. Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
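
A cosine-similarity regularizer of the kind mentioned above can be sketched as follows. The abstract does not give the exact formula, so the common form `1 - cos(adapted, original)` is an assumption, as are the function names; in practice this term would be added, with some weight, to the flow matching loss.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_regularizer(adapted_token: Sequence[float],
                       original_token: Sequence[float]) -> float:
    """Assumed regularizer: 0 when the adapted token points the same way as
    the frozen SSL token, growing up to 2 when it points the opposite way."""
    return 1.0 - cosine_similarity(adapted_token, original_token)
```

The penalty depends only on direction, so the fine-tuned token can change its norm to carry reconstruction detail while staying angularly close to the original SSL embedding.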

[180] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Jihyun Yu,Yoojin Oh,Wonho Bae,Mingyu Kim,Junhyug Noh

Main category: cs.CV

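TL;DR: SteeringTTA is an inference-only framework that needs no model updates or source data. It uses Feynman-Kac steering to guide diffusion-based input adaptation with pseudo-label-driven rewards and consistently outperforms the baseline on ImageNet-C.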

Details

Motivation: Existing diffusion-based, input-level test-time adaptation methods rely on gradient guidance, which limits exploration and generalization across distortion types, so a more flexible and efficient input adaptation mechanism is needed. Method: Proposes SteeringTTA, which applies Feynman-Kac steering to guide input adjustment during the diffusion process. It maintains multiple particle trajectories and combines cumulative top-K probabilities with an entropy schedule to balance exploration and confidence, using pseudo-labels as the reward signal. Result: On the ImageNet-C benchmark, SteeringTTA consistently outperforms the baseline without any model updates or source data, showing stronger robustness and generalization. Conclusion: SteeringTTA offers an effective input-level solution for test-time adaptation, improving classifier performance under distribution shift through a gradient-free, inference-time control mechanism. Abstract: Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve the robustness of classifiers to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-labels. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.

[181] In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Xinyao Liao,Xianfang Zeng,Ziye Song,Zhoujie Fu,Gang Yu,Guosheng Lin

Main category: cs.CV

TL;DR: Proposes a low-cost pretraining strategy that uses in-context learning from unpaired video clips to improve instruction-based video editing.

Details

Motivation: Building large-scale paired video editing datasets is costly and complex, which has limited the extension of existing methods to instruction-based video editing. Method: Pretrains with in-context learning on unpaired video clips: a foundation video generation model is first pretrained on approximately 1M real video clips and then fine-tuned on fewer than 150k curated editing pairs. Result: The method surpasses existing approaches in instruction alignment and visual fidelity, with a 12% improvement in instruction following and a 15% improvement in editing quality. Conclusion: The proposed pretraining strategy gives the model general editing capabilities, and fine-tuning on a small amount of high-quality paired data then substantially improves instruction-based video editing. Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend to more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

[182] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg,Rob van Gastel,Melda Yeghaian,Sander Dalm,Faysal Boughorbel,Marcel van Gerven

Main category: cs.CV

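TL;DR: Integrates Decorrelated Backpropagation (DBP) into Masked Autoencoder (MAE) pre-training to speed up convergence of vision transformers (ViTs), cutting compute cost and carbon emissions while improving downstream performance.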

Details

Motivation: MAE performs well in low-label regimes but is expensive to train, which makes it hard to use in resource-constrained industrial settings, so a more efficient optimization method is needed. Method: Applies DBP selectively to the MAE encoder, reducing input correlations at each layer to accelerate convergence while keeping training stable. Result: With ImageNet-1K pre-training and ADE20K fine-tuning, DBP-MAE reduces the wall-clock time to reach baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points; similar gains are observed on proprietary industrial data. Conclusion: DBP reduces the time and energy cost of large-scale ViT pre-training while improving downstream performance, which makes it practical for industrial use. Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.

[183] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)

Weikang Yu,Vincent Nwazelibe,Xianping Ma,Xiaokang Zhang,Richard Gloaguen,Xiao Xiang Zhu,Pedram Ghamisi

Main category: cs.CV

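TL;DR: EuroMineNet is a multitemporal Sentinel-2 benchmark dataset covering 133 mining sites across the European Union; it supports mining footprint mapping and change detection and advances the use of GeoAI for sustainable land management.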

Details

Motivation: Existing mining monitoring datasets are limited in temporal depth or geographic scope and cannot support long-term, large-area environmental governance. Method: Builds EuroMineNet, a multitemporal dataset with expert-verified annual annotations from 2015 to 2024, and proposes the CA-TIoU metric for evaluating multitemporal mining footprint mapping, alongside a cross-temporal change detection task. Result: Benchmarking 20 deep learning models shows that GeoAI methods identify long-term environmental changes effectively but still struggle to capture short-term dynamics. Conclusion: By supporting explainable, temporally consistent mining monitoring, EuroMineNet contributes to sustainable land management and environmental resilience and promotes GeoAI for social and environmental good. Abstract: Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the code and datasets, in line with FAIR and open science principles, at https://github.com/EricYu97/EuroMineNet.

[184] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Sami Azam,Asif Karim,Jemima Beissbarth,Amanda Leach

Main category: cs.CV

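TL;DR: Proposes WeCKD, described as the first weakly-supervised chained knowledge distillation network, which passes and refines knowledge progressively along a chain of models, yielding substantial gains in data-limited settings and strong generalization.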

Details

Motivation: Conventional knowledge distillation depends on a strong teacher model or large labeled datasets, which limits its effectiveness in real low-data settings, and it suffers from knowledge degradation and inefficient supervision. Method: Builds a chained, weakly-supervised knowledge distillation framework (WeCKD) in which each model learns from its predecessor and refines the knowledge before passing it to the next model; each model is trained on only a subset of the data. Result: On four otoscopic imaging datasets the method matches or surpasses existing supervised methods, it generalizes well to two other medical imaging modalities, and cumulative accuracy gains reach up to +23% over a single backbone. Conclusion: Chained knowledge transfer reduces the dependence of conventional distillation on strong teachers and large datasets and improves feature learning and data efficiency, giving WeCKD potential for real-world medical use. Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these issues, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.

[185] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang,Yuanfan Guo,Rolandos Alexandros Potamias,Jiankang Deng,Hang Xu,Chao Ma

Main category: cs.CV

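TL;DR: Proposes VTimeCoT, a training-free framework for video temporal grounding and reasoning that adds progress-bar visual tools and a cross-modal visuotemporal chain of thought (CoT), substantially improving multimodal large language models on video understanding tasks.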

Details

Motivation: Video question answering systems built on multimodal large language models are notably weak at video temporal grounding and reasoning, which makes it hard to support complex real-world video understanding. Method: Proposes the VTimeCoT framework with two new visual tools, a plug-and-play progress bar integration tool and a high-efficiency highlighting tool, plus a visuotemporal CoT that reasons across video and text. Result: On both Qwen2VL-7B and GPT4o baselines, the method yields significant gains in video temporal grounding and reasoning-based question answering, with a compositional and interpretable reasoning process. Conclusion: VTimeCoT gives multimodal large models effective external visual tools and a reasoning mechanism that strengthen video temporal understanding without any training, and it is broadly applicable. Abstract: In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io

[186] Leveraging Learned Image Prior for 3D Gaussian Compression

Seungjoo Shin,Jaesik Park,Sunghyun Cho

Main category: cs.CV

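TL;DR: Proposes a 3D Gaussian Splatting compression framework based on learned image priors; by restoring compression-induced quality degradation in image space, it improves rate-distortion performance and rendering quality while keeping storage overhead low.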

Details

Motivation: Existing 3DGS compression methods lack learned priors, which limits further improvement of the rate-distortion trade-off. Method: Builds a restoration network that models compression artifacts in image space and takes coarse rendering residuals as side information to strengthen restoration; the restored images then supervise refinement of the compressed Gaussians, and the framework is compatible with existing compression methods. Result: Experiments validate the method, which achieves higher rendering quality and better rate-distortion performance than state-of-the-art 3DGS compression methods while using less storage. Conclusion: The framework combines learned priors with existing compression techniques, clearly improves the rate-distortion performance of 3DGS compression, and is broadly applicable. Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we feed coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.

[187] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery

Caleb Robinson,Kimberly T. Goetz,Christin B. Khan,Meredith Sackett,Kathleen Leonard,Rahul Dodhia,Juan M. Lavista Ferres

Main category: cs.CV

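TL;DR: Proposes a semi-automated method that uses statistical anomaly detection to surface possible whale locations in high-resolution satellite imagery, paired with a fast expert labeling interface; it reaches 90.3% to 96.4% recall while reducing the area to inspect by up to 99.8%, without relying on labeled training data.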

Details

Motivation: Traditional whale population surveys are expensive and hard to scale, and automated detection is held back by the lack of annotated imagery, variable image quality, and the cost of processing large remote sensing archives. Method: Uses a statistical anomaly detection method to flag spatial outliers ("interesting points") and pairs it with a web-based labeling interface that lets experts annotate these candidates quickly. Result: On three benchmark scenes with known whale annotations, recall ranges from 90.3% to 96.4%, and the area requiring expert inspection shrinks by up to 99.8% (from over 1,000 sq km to less than 2 sq km). Conclusion: The method does not depend on labeled training data, offers a scalable first step toward large-scale marine mammal monitoring from satellite imagery, and is open source. Abstract: Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. "interesting points". We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% -- from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
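The abstract does not say which statistic flags the spatial outliers, so the following is only a generic illustration of label-free anomaly flagging, not the pipeline in the linked repository. It assumes a single-band image given as a list of rows and uses a robust z-score (median and median absolute deviation); the function name and the threshold are hypothetical.

```python
from statistics import median

def flag_interesting_points(image, threshold=6.0):
    """Return (row, col) positions whose robust z-score exceeds the threshold."""
    values = [v for row in image for v in row]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    # 1.4826 rescales the MAD to match a standard deviation under normality;
    # fall back to 1.0 when the scene is perfectly flat (MAD of zero).
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [
        (r, c)
        for r, row in enumerate(image)
        for c, v in enumerate(row)
        if abs(v - center) / scale > threshold
    ]
```

On a mostly uniform water scene, only the few flagged positions would be passed to an expert for review, which is how a detector of this kind can shrink the inspected area so sharply without any training labels.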

[188] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig

Main category: cs.CV

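TL;DR: Presents the first systematic evaluation of deep video camera movement classification models on historical archival film. Five standard video classification models are tested on the HISTORIAN dataset (expert-annotated World War II footage); Video Swin Transformer performs best at 80.25% accuracy, showing strong convergence despite limited training data. The study highlights both the challenges and the potential of applying existing models to low-quality historical video and suggests that future work combine diverse input modalities and temporal architectures.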

Details

Motivation: Existing camera movement classification methods perform well on modern datasets, but their generalization to historical footage has not been explored. Historical film is often low quality and noisy, so applying existing models directly may work poorly and their suitability needs systematic evaluation. Method: Summarizes representative camera movement classification methods and datasets, comparing model designs and label definitions, then evaluates five standard video classification models on the HISTORIAN dataset of expert-annotated World War II footage. Result: On HISTORIAN, Video Swin Transformer performs best with 80.25% accuracy and converges well even with limited training data; current models adapt partially to low-quality historical video, with room for improvement. Conclusion: Existing deep video classification models transfer to archival film to some extent, with Video Swin Transformer standing out; low-quality video remains challenging, and future work should combine multimodal inputs with temporal modeling architectures. Abstract: Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.

[189] Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection

Dingzhou Xie,Rushi Lan,Cheng Pang,Enhao Ning,Jiahao Zeng,Wei Zheng

Main category: cs.CV

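TL;DR: Proposes a Cross-Layer Feature Self-Attention Module (CFSAM) that models local and global dependencies within multi-scale feature maps and substantially improves SSD300 detection performance on PASCAL VOC and COCO.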

Details

Motivation: Most existing methods refine single-layer features or fuse dual-layer features and overlook the rich cross-layer dependencies among multi-scale representations, so they capture too little context to detect objects with large scale variation. Method: Designs CFSAM with a convolutional local feature extractor, a Transformer-based global modeling unit, and a feature fusion mechanism, modeling local and global dependencies across multi-scale feature maps as a whole. Result: With CFSAM integrated into SSD300, mAP reaches 78.6% on PASCAL VOC (up 3.1 points from 75.5%) and 52.1% on COCO (up 9.0 points from 43.1%), with faster training convergence and little computational overhead. Conclusion: Explicit cross-layer attention modeling is important for multi-scale object detection, and CFSAM strengthens feature representations by exploiting cross-layer dependencies. Abstract: Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.

[190] Free-Grained Hierarchical Recognition

Seulki Park,Zilin Wang,Stella X. Yu

Main category: cs.CV

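TL;DR: Introduces the ImageNet-F benchmark and free-grain learning methods to handle the inconsistent label granularity found in real-world image classification.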

Details

Motivation: Existing hierarchical image classification methods usually assume complete fine-grained annotations, but in practice the granularity of supervision is inconsistent, so methods closer to real conditions are needed. Method: Introduces the ImageNet-F benchmark and uses CLIP to simulate mixed-granularity labels; proposes a free-grain learning framework that adds semantic guidance through pseudo-attributes from vision-language models and visual guidance through semi-supervised learning. Result: The proposed methods substantially improve hierarchical image classification under mixed supervision. Conclusion: With a more realistic benchmark and new methods, the work advances hierarchical image classification under real-world constraints. Abstract: Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.

[191] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla,Matteo Pennisi,Sarinda Samarasinghe,Giovanni Bellitto,Simone Palazzo,Daniela Giordano,Mubarak Shah,Concetto Spampinato

Main category: cs.CV

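TL;DR: DEXTER is a data-free framework that uses diffusion models and large language models to generate global textual explanations of visual classifiers, revealing their decision patterns and biases.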

Details

Motivation: Making machine learning models more transparent and trustworthy requires explaining the behavior of visual classifiers without access to training data or ground-truth labels. Method: Optimizes text prompts to generate class-conditional images that strongly activate a target classifier, then uses these synthetic samples to prompt a large language model for natural language reports describing decision patterns and biases. Result: Experiments on ImageNet, Waterbirds, CelebA, and FairFaces show that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting, and a user study confirms that its outputs are accurate and interpretable. Conclusion: DEXTER provides data-free, natural language global explanations of visual classifiers and applies to activation maximization, slice discovery and debiasing, and bias explanation. Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks (activation maximization, slice discovery and debiasing, and bias explanation), each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

[192] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement

Xu Wu,Zhihui Lai,Xianxu Hou,Jie Zhou,Ya-nan Zhang,Linlin Shen

Main category: cs.CV

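TL;DR: Proposes LightQANet, which combines quantized and adaptive feature learning through a Light Quantization Module and a Light-Aware Prompt Module to make low-light image enhancement more robust and consistent.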

Details

Motivation: Pixel-level information is severely degraded in low light, so existing low-light enhancement methods struggle to extract reliable feature representations, leading to poor texture restoration, color inconsistency, and artifacts. Method: Designs a Light Quantization Module (LQM) that explicitly extracts and quantifies illumination-related factors to strengthen light-invariant feature representations, and a Light-Aware Prompt Module (LAPM) that encodes illumination priors into learnable prompts to guide feature learning dynamically. Result: Experiments on multiple low-light datasets show state-of-the-art qualitative and quantitative results, especially under complex and varying lighting. Conclusion: By combining static quantization with dynamic adaptation in feature learning, LightQANet improves the quality and robustness of low-light image enhancement across diverse lighting conditions. Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifacts. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.

[193] Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

Giuseppe Lorenzo Catalano,Agata Marta Soccini

Main category: cs.CV

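TL;DR: Proposes a Martian surface reconstruction method based on an unconditional diffusion model trained on NASA HiRISE data; it fills missing terrain data better than traditional interpolation and inpainting techniques.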

Details

Motivation: Martian terrain data often contain missing values because of acquisition and transmission constraints, and existing interpolation methods struggle to preserve geometric coherence, so a more reliable reconstruction method is needed. Method: Trains an unconditional diffusion model on 12000 Martian heightmaps derived from NASA's HiRISE survey, using a non-homogeneous rescaling strategy to capture terrain features at multiple scales. Result: On 1000 evaluation samples, the method improves on Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm by 4-15% in RMSE and 29-81% in LPIPS. Conclusion: The diffusion model reconstructs the Martian surface more accurately and with greater perceptual similarity, and suits planetary terrain void filling where complete training data are unavailable. Abstract: Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enable the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
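To make the comparison concrete, here is a minimal sketch of one of the baselines named above, Inverse Distance Weighting, together with the RMSE used to score a filled void. It is a textbook formulation rather than the authors' evaluation code, and it assumes a tiny heightmap stored as a list of rows with `None` marking missing cells and at least one known cell; the function names and the `power` default are illustrative.

```python
import math

def idw_fill(heightmap, power=2.0):
    """Fill None cells with an inverse-distance-weighted mean of all known cells."""
    known = [
        (r, c, v)
        for r, row in enumerate(heightmap)
        for c, v in enumerate(row)
        if v is not None
    ]
    filled = [list(row) for row in heightmap]
    for r, row in enumerate(heightmap):
        for c, v in enumerate(row):
            if v is None:
                # Closer known cells get larger weights: 1 / distance**power.
                weights = [
                    (1.0 / math.hypot(r - kr, c - kc) ** power, kv)
                    for kr, kc, kv in known
                ]
                total = sum(w for w, _ in weights)
                filled[r][c] = sum(w * kv for w, kv in weights) / total
    return filled

def rmse(pred, truth, cells):
    """Root-mean-square error over the given (row, col) cells only."""
    squared = [(pred[r][c] - truth[r][c]) ** 2 for r, c in cells]
    return math.sqrt(sum(squared) / len(squared))
```

Scoring only the cells that were masked out, as `rmse` does here, keeps the untouched known cells from diluting the error, which is the usual convention when comparing void-filling methods.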

[194] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks

Zhang Nengbo,Hann Woei Ho,Ye Zhou

Main category: cs.CV

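TL;DR: Proposes a motion-based visual communication framework, inspired by the honeybee waggle dance, for low-power and robust communication within Micro Air Vehicle (MAV) swarms in challenging environments.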

Details

Motivation: Conventional radio communication cannot reliably serve MAV swarms under spectrum congestion, jamming, and high power consumption, so a more efficient alternative is needed. Method: Event cameras capture the information that MAVs convey through specific flight patterns (vertical, horizontal, left-to-up-to-right, left-to-down-to-right); an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) recognize the actions, and a decoding algorithm interprets the motion sequences. Result: Experiments show that the framework decodes MAV motion signals accurately with low power consumption and high robustness. Conclusion: The visual communication framework offers an efficient, energy-saving communication option for MAV swarms in constrained environments. Abstract: Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols ("start", "end", "1", "0"). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework's effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
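The final decoding step can be pictured as a small state machine over recognized primitives. The sketch below is an assumption-laden illustration, not the paper's algorithm: it presumes the four primitives map to the four symbols in the order listed in the abstract, it takes already-classified primitive labels as input (segmentation and SNN recognition are out of scope), and the label strings and function name are hypothetical.

```python
# Assumed one-to-one mapping, following the order given in the abstract.
CODEBOOK = {
    "vertical": "start",
    "horizontal": "end",
    "left-up-right": "1",
    "left-down-right": "0",
}

def decode_message(primitives):
    """Return the bit string framed by start/end symbols, or None if no frame closes."""
    bits = []
    receiving = False
    for primitive in primitives:
        symbol = CODEBOOK.get(primitive)
        if symbol is None:
            continue  # skip motions that are not in the codebook
        if symbol == "start":
            bits = []  # a new start symbol resets any partial frame
            receiving = True
        elif symbol == "end":
            if receiving:
                return "".join(bits)
        elif receiving:
            bits.append(symbol)
    return None
```

For example, the sequence vertical, left-up-right, left-down-right, left-up-right, horizontal would decode to the bits 101, which a receiver could then interpret as a quantized heading or distance under whatever payload convention the swarm agrees on.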

[195] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi,Youngsun Lim,Jaeyo Shin,Hyunjung Shim

Main category: cs.CV

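TL;DR: Proposes the CoT-PL framework, which adds structured visual chain-of-thought (CoT) reasoning and contrastive background learning (CBL) to improve pseudo-label quality in open-vocabulary object detection, with large gains over prior methods in crowded or occluded scenes.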

Details

Motivation: Existing methods generate pseudo-labels by direct image-text matching and ignore the intermediate reasoning needed for semantically complex scenes, which limits robustness in crowded or occluded scenes. Method: Proposes CoT-PL, which decomposes object understanding into three steps (region perception, zero-shot category recognition, and background grounding) and introduces contrastive background learning (CBL) to disentangle object and background features. Result: For novel classes, AP50 improves by +7.7 on open-vocabulary COCO and mask AP by +2.9 on LVIS; in crowded and occluded scenes, novel-class pseudo-label quality improves by 103.4% and 168.4% respectively in relative terms. Conclusion: Through structured reasoning and contrastive background learning, CoT-PL improves open-vocabulary object detection in complex scenes and sets a new state of the art. Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.

[196] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images

Usama Sajjad,Abdul Rehman Akbar,Ziyu Su,Deborah Knight,Wendy L. Frankel,Metin N. Gurcan,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

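TL;DR: Develops PRISM, a new interpretable AI model for colorectal cancer prognosis that integrates spatial morphological features with a continuous spectrum of phenotypic variation; trained on 8.74 million histological images, it clearly outperforms existing methods and foundation models.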

Details

Motivation: Foundation models in computational pathology are mostly task-agnostic and may overlook organ-specific morphological patterns that matter for predicting tumor behavior, therapeutic response, and patient outcomes. Method: Proposes PRISM, which uses a continuous variability spectrum within each morphology to characterize phenotypic diversity and reflect the incremental evolution of malignant transformation; it is trained on 8.74 million histological images from 424 patients with stage III colorectal cancer. Result: PRISM performs well on five-year overall survival prediction (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34), outperforming CRC-specific methods by 15% and AI foundation models by about 23%; it is robust across sexes (AUC difference of 0.02) and stable across clinicopathological subgroups and chemotherapy regimens (accuracy fluctuation of only 1.44%). Conclusion: By capturing the continuous evolution of colorectal cancer morphology, PRISM delivers more accurate and robust prognosis prediction, shows clinical potential, and remains stable across treatment groups. Abstract: Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task agnostic methodologies that can overlook organ-specific crucial morphological patterns that represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity, reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.

[197] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks

Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Szymon Płotka,Jieneng Chen,Qi Chen,Zheren Zhu,Jakub Prządo,Ibrahim E. Hamacı,Sezgin Er,Yuhan Wang,Ashwin Kumar,Bjoern Menze,Jarosław B. Ćwikła,Yuyin Zhou,Akshay S. Chaudhari,Curtis P. Langlotz,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

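TL;DR: Proposes R-Super, which trains AI for tumor segmentation from the text of medical reports, greatly reducing dependence on manually drawn tumor masks and surpassing radiologists in detecting five of seven tumor types.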

Details

Motivation: Training conventional AI models requires large numbers of hand-drawn tumor masks that are slow and expensive to produce, while clinical CT scans usually come with rich descriptive medical reports; the study aims to use this underutilized text to make AI training more efficient and scalable. Method: Proposes the R-Super framework, which aligns tumor descriptions in medical reports (such as size, number, and appearance) with image regions to train tumor segmentation models under report-based weak supervision, optionally combined with a small number of masks. Result: Models trained on 101,654 reports perform comparably to models trained on 723 hand-drawn masks; combining reports and masks improves sensitivity by +13% and specificity by +8%, and enables tumor segmentation in organs such as the spleen, gallbladder, and prostate for which no public masks or AI models previously existed. Conclusion: R-Super shows that large-scale manual tumor masks are not a prerequisite for AI training and offers a scalable, accessible path to early detection across many tumor types. Abstract: Early tumor detection saves lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super

[198] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning

Ji Cao,Yu Wang,Tongya Zheng,Zujie Ren,Canghong Jin,Gang Chen,Mingli Song

Main category: cs.CV

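TL;DR: Proposes PRTraj, a framework that combines environment perception with route choice modeling to improve trajectory representation learning.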

Details

Motivation: Existing trajectory representation learning methods overlook how the external environment and internal route choice behavior shape trajectories. Method: Designs the PRTraj framework with an Environment Perception Module and a Route Choice Encoder; the former captures multi-granularity environmental semantics from POI distributions, and the latter models sequences of road segment transitions to express route choice behavior. Result: Experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj, which also shows strong data efficiency and stable performance in few-shot scenarios. Conclusion: By combining environment perception with route choice modeling, PRTraj significantly improves the quality of trajectory representations and their downstream usefulness. Abstract: Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment Perception and explicit Route choice modeling for effective Trajectory representation learning, dubbed PRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.

[199] FraQAT: Quantization Aware Training with Fractional bits

Luca Morreale,Alberto Gil C. P. Ramos,Malcolm Chadwick,Mehid Noroozi,Ruchika Chavhan,Abhinav Mehrotra,Sourav Bhattacharya

Main category: cs.CV

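TL;DR: Proposes a new fractional-bits quantization method (FraQAT) that progressively lowers model precision and exploits fractional bits during optimization, preserving generation quality while enabling efficient computation and memory savings.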

Details

Motivation: Large-capacity generative models are hard to deploy on smartphones because of limited on-device memory and compute, and conventional aggressive quantization improves efficiency at the cost of generation quality. Method: Proposes FraQAT, which progressively reduces parameter precision from 32 bits to 4 bits and exploits fractional bits during optimization to maintain high generation quality. Result: FraQAT is validated on several diffusion models (SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell), achieves 4-7% lower FiD than standard QAT, and Sana is deployed and run on the Snapdragon 8 Elite HTP of a Samsung S25U. Conclusion: FraQAT keeps generation quality high while substantially lowering model precision, offering a practical route to deploying large models efficiently on mobile devices. Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (FraQAT) approach. The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FraQAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4-7% lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
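
The idea of stepping precision down through fractional bit widths can be illustrated with a small sketch. The abstract does not specify the quantizer or the schedule, so everything here is an assumption: `fake_quantize` is a hypothetical uniform min-max quantizer that rounds the level count `2**bits` to an integer, and `bit_schedule` is a hypothetical linear schedule from 32 to 4 bits.

```python
def fake_quantize(values, bits):
    """Snap values onto a uniform grid with about 2**bits levels.

    `bits` may be fractional; the level count is rounded to an integer.
    This is an illustrative min-max quantizer, not the paper's method.
    """
    levels = max(2, round(2 ** bits))
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def bit_schedule(start_bits, end_bits, num_steps):
    """Linearly interpolated (generally fractional) bit widths."""
    if num_steps == 1:
        return [float(end_bits)]
    span = end_bits - start_bits
    return [start_bits + span * i / (num_steps - 1) for i in range(num_steps)]


if __name__ == "__main__":
    weights = [0.0, 0.3, 1.0]
    for bits in bit_schedule(32, 4, 5):
        print(bits, fake_quantize(weights, bits))
```

In a quantization-aware training loop, each scheduled bit width would be held for some number of optimization steps before moving to the next; how long, and how gradients pass through the rounding, is left open here.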

[200] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Qi Chen,Xinze Zhou,Chen Liu,Hao Chen,Wenxuan Li,Zekun Jiang,Ziyan Huang,Yuxuan Zhao,Dexin Yu,Junjun He,Yefeng Zheng,Ling Shao,Alan Yuille,Zongwei Zhou

Main category: cs.CV

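TL;DR: Synthetic data can markedly improve the training efficiency of AI tumor segmentation models: 500 real scans suffice to match the performance obtained with 1,500 real scans. AbdomenAtlas 2.0 is a large-scale, multi-organ public CT dataset with 10,135 per-voxel annotated scans that clearly outperforms existing public datasets.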

Details

Motivation: The lack of large, per-voxel annotated tumor datasets limits AI for tumor segmentation, and annotating real data is costly and slow, so more efficient use of data and larger public datasets are needed. Method: Based on the finding, from the proprietary JHH dataset, that synthetic data accelerates performance gains, the authors build AbdomenAtlas 2.0, a large-scale, multi-organ abdominal CT dataset with 10,135 manually per-voxel annotated tumor scans and 5,893 control scans, annotated by 23 expert radiologists. Result: With synthetic data, 500 real scans match the performance of 1,500 real scans; AbdomenAtlas 2.0 yields a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests. Conclusion: Synthetic data improves data efficiency, and AbdomenAtlas 2.0, one of the largest multi-organ annotated tumor datasets to date, provides a solid foundation for training high-performing tumor segmentation models. Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
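
For readers unfamiliar with the DSC figures quoted above, the Dice similarity coefficient of two binary masks A and B is 2|A∩B| / (|A| + |B|). The sketch below computes it on flattened voxel labels; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not taken from the paper's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks.

    `pred` and `target` are equal-length sequences of 0/1 voxel labels.
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_coefficient(predicted, reference))
```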

[201] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li

Main category: cs.CV

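TL;DR: Proposes QDepth-VLA, a framework that strengthens the spatial perception and reasoning of Vision-Language-Action models through an auxiliary depth prediction task.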

Details

Motivation: Existing VLA models lack the ability to understand and reason over essential 3D structure, which makes fine-grained manipulation tasks difficult. Method: Designs a dedicated depth expert that predicts the quantized latent tokens of depth maps produced by a VQ-VAE encoder, so that the model learns depth-aware representations. Result: Experiments on simulation benchmarks and real-world tasks show that QDepth-VLA has strong spatial reasoning and competitive manipulation performance. Conclusion: By introducing depth-aware representations, QDepth-VLA improves the spatial understanding and control of VLA models in fine-grained manipulation tasks. Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
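
The "quantized latent tokens" that the depth expert predicts come from the vector-quantization step of a VQ-VAE, which maps each continuous latent vector to the index of its nearest codebook entry. A minimal sketch of that lookup follows; the function name, the toy codebook, and the use of squared Euclidean distance are illustrative assumptions, and the paper's actual encoder and codebook are not described in the abstract.

```python
def quantize_to_tokens(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Uses squared Euclidean distance; ties resolve to the lowest index.
    """
    tokens = []
    for z in latents:
        distances = [
            sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


if __name__ == "__main__":
    # Hypothetical 2-D codebook with three entries.
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
    print(quantize_to_tokens(latents, codebook))
```

The resulting integer tokens are discrete targets, which is what lets an auxiliary head be trained with a classification-style objective over depth tokens.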

[202] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Meiqi Wu,Jiashu Zhu,Xiaokun Feng,Chubin Chen,Chen Zhu,Bingze Song,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang

Main category: cs.CV

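TL;DR: Proposes ImagerySearch, a prompt-guided adaptive test-time search strategy that improves video generation models on imaginative scenarios, and releases the LDT-Bench benchmark.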

Details

Motivation: Existing video generation models perform well on realistic scenarios but poorly on imaginative ones that involve long-distance semantic relationships, and existing test-time scaling methods lack adaptability. Method: Proposes ImagerySearch, which uses the prompt to dynamically adjust the inference search space and reward function so as to better capture long-distance semantic relationships; also builds LDT-Bench, containing 2,839 concept pairs, to evaluate creative video generation. Result: ImagerySearch clearly outperforms existing video generation baselines and test-time scaling methods on LDT-Bench and is competitive on VBench, supporting its generality and effectiveness. Conclusion: ImagerySearch's adaptive search strategy improves generation quality in imaginative scenarios, and LDT-Bench provides a valuable evaluation tool for future research. Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.

[203] A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation

Harsha Kotla,Arun Kumar Rajasekaran,Hannah Rana

Main category: cs.CV

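TL;DR: Proposes a deep learning framework that classifies skin lesions, quantifies their ABCD features, and simulates their evolution to make early melanoma detection more interpretable.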

Details

Motivation: Existing deep learning models for skin lesion analysis are mostly black boxes that lack interpretability and are hard to relate to clinical criteria. The ABCDE rule is widely used, yet most models do not explicitly explain its individual features. Method: Designs a deep learning framework that performs lesion classification and ABCD feature quantification together, and visualizes feature trajectories in latent space as lesions progress from benign to malignant, thereby simulating the E (Evolving) feature. Result: Experiments on the HAM10000 dataset show a classification accuracy of about 89% and a melanoma AUC of 0.96; among the ABCD features, asymmetry, color variation, and diameter are predicted well, while border irregularity is harder to model. Conclusion: The framework ties machine-learning diagnoses to clinically relevant criteria, improves understanding of skin cancer progression, and advances interpretable AI in medical imaging. Abstract: Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.

Mihai-Cristian Pîrvu,Marius Leordeanu

Main category: cs.CV

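TL;DR: This paper proposes a framework that fuses multiple visual modalities in a self-supervised way, combining multi-modal data through pre-trained expert models and an automated data pipeline. Using PHG-MAE, a model designed for multi-modal data, it achieves performance comparable to large models (~300M parameters) with a very small parameter count (<1M), and demonstrates real-time semantic segmentation and other tasks (such as depth estimation) on commodity hardware.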

Details

Motivation: The real world is inherently multi-modal, but traditional machine learning models are mostly unimodal or bi-modal and struggle to understand complex scenes fully. Perceiving and understanding the world more faithfully requires integrating more independent modalities, ideally without human annotation. Method: Uses pre-trained expert models and procedural combinations of them on top of raw videos in a fully automated data pipeline that fuses multiple visual modalities; uses PHG-MAE, a model designed for multi-modal data, and compresses it to a low parameter count (<1M) through efficient distillation. Result: With few parameters, PHG-MAE achieves results competitive with models of about 300M parameters, and is deployed for real-time semantic segmentation from handheld devices or webcams on commodity hardware; the same framework also supports other off-the-shelf models, such as DPT for near real-time depth estimation. Conclusion: The paper demonstrates that fusing multiple visual modalities in a self-supervised way is effective and feasible; the proposed framework is efficient, scalable, and suited to resource-constrained applications, moving multi-modal learning closer to the real world. Abstract: The real world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, which was efficiently distilled into a low-parameter (<1M) model, can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

[205] Benchmarking Multimodal Large Language Models for Face Recognition

Hatef Otroshi Shahreza,Sébastien Marcel

Main category: cs.CV

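TL;DR: This paper systematically evaluates multimodal large language models (MLLMs) on face recognition and finds that, in the zero-shot setting, MLLMs capture rich semantic information but still lag behind specialized models in high-precision recognition.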

Details

Motivation: To explore the potential of MLLMs for face recognition and compare them fairly with existing models on standard benchmarks. Method: Systematically benchmarks state-of-the-art MLLMs on face recognition datasets including LFW, CALFW, CPLFW, CFP, AgeDB, and RFW. Result: In zero-shot use, MLLMs capture rich semantic cues useful for face-related tasks but fall short of specialized models in high-precision recognition scenarios. Conclusion: The benchmark lays a foundation for MLLM-based face recognition research and offers insights for designing next-generation models with higher accuracy and better generalization. Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.

[206] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Guangyi Han,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

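TL;DR: This paper introduces Free-Form HOI Generation (free-form hand-object interaction generation), which moves beyond fixed grasping patterns and uses fine-grained intent to generate controllable, diverse, and physically plausible hand motions.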

Details

Motivation: Existing hand-object interaction generation methods are limited to fixed grasping patterns and generic intent instructions, and cannot capture the rich variety of everyday interactions. The authors therefore extend HOI to non-grasping actions such as pushing, poking, and rotating, aiming for more natural generation with fine-grained control. Method: Builds WildO2, a large in-the-wild 3D hand-object interaction dataset with 4.4k interactions spanning 92 intents and 610 object categories, and proposes TOUCH, a three-stage generation framework built on a multi-level diffusion model that combines explicit contact modeling with contact consistency and physical constraints to generate hand poses under fine-grained semantic control. Result: Experiments show that the method generates diverse, controllable, and physically plausible hand-object interactions, clearly improving on earlier grasp-only methods in realism and semantic alignment. Conclusion: Free-Form HOI Generation offers a new paradigm for hand-object interaction generation; together with the WildO2 dataset, TOUCH supports the extension from grasping to free-form interaction and enables more natural hand motion simulation in areas such as human-computer interaction and virtual reality. Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.
The project page is at https://guangyid.github.io/hoi123touch.

[207] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data

Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Shizhan Zhu,Daniel Moura,Orly Zvitia

Main category: cs.CV

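TL;DR: This paper presents BADAS, a family of collision prediction models trained on real-world dashcam data and focused on ego-centric evaluation, with clear gains in prediction accuracy and more realistic time-to-accident estimates.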

Details

Motivation: Existing collision prediction methods struggle to distinguish threats to the ego vehicle from random accidents that do not involve it, which causes too many false alerts in deployment. Method: Proposes the BADAS model family, which uses a V-JEPA2 backbone trained end-to-end, and builds the first benchmark designed for ego-centric evaluation; major existing datasets are re-annotated to identify ego involvement and given consensus alert-time labels. Result: Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC, outperforms a conventional forward-collision warning system, and gives more realistic time-to-accident estimates. Conclusion: BADAS addresses false alerts in ego-relevant collision prediction and promotes ego-centric evaluation; the authors release model weights, code, and re-annotated datasets to support research in this area. Abstract: Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar's real-world dashcam collision dataset -- the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.

[208] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention

Keli Liu,Zhendong Wang,Wengang Zhou,Shaodong Xu,Ruixiao Dong,Houqiang Li

Main category: cs.CV

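TL;DR: This paper proposes ScaleWeaver, a high-fidelity, controllable text-to-image generation framework built on visual autoregressive (VAR) models that achieves precise control through parameter-efficient fine-tuning.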

Details

Motivation: Many control mechanisms have been explored for diffusion models, but flexible and precise control in visual autoregressive (VAR) models remains largely unstudied. Method: Proposes the ScaleWeaver framework, whose core is an improved MMDiT block with a new Reference Attention module that drops the unnecessary image→condition attention, lowering compute cost and stabilizing control injection; it also emphasizes parameter reuse and adds a zero-initialized linear projection so that control signals are incorporated without disrupting the base model's generative ability. Result: Experiments show that ScaleWeaver delivers high generation quality and precise control, with better inference efficiency than diffusion-based methods. Conclusion: ScaleWeaver is an efficient and practical solution for controllable text-to-image generation in the visual autoregressive paradigm. Abstract: Text-to-image generation with visual autoregressive (VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
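
Why a zero-initialized projection avoids disrupting the base model can be seen in a toy example: if the control branch is added residually through a projection whose weights start at zero, the output at initialization equals the base features exactly, and the control signal only takes effect as training moves the weights away from zero. The sketch below assumes a residual-add injection and uses hypothetical names (`zero_init_projection`, `inject_control`); the abstract does not say where or how ScaleWeaver applies its projection.

```python
def zero_init_projection(out_dim, in_dim):
    """Weight matrix (out_dim x in_dim) initialized to all zeros."""
    return [[0.0] * in_dim for _ in range(out_dim)]


def inject_control(base_features, control_features, weight):
    """Add a linear projection of the control features to the base features."""
    projected = [
        sum(w * c for w, c in zip(row, control_features)) for row in weight
    ]
    return [b + p for b, p in zip(base_features, projected)]


if __name__ == "__main__":
    base = [0.5, -1.0]
    control = [3.0, 4.0, 5.0]
    weight = zero_init_projection(out_dim=2, in_dim=3)
    # At initialization the control branch contributes nothing.
    print(inject_control(base, control, weight))
    # After a (simulated) weight update, the control signal starts to matter.
    weight[0][0] = 0.1
    print(inject_control(base, control, weight))
```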

[209] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn

Main category: cs.CV

TL;DR: Proposes nlg2choice, a two-stage method that improves zero-shot visual classification in multiple-choice and retrieval settings.

Details

Motivation: In fine-grained visual classification, existing methods struggle with many-option multiple-choice questions and with evaluating free-form responses, and they are hard to extend to retrieval tasks with very large choice sets. Method: Uses a two-stage approach: first ask the multimodal large language model an open-ended question, then use text-only constrained decoding to predict the most likely choice; in retrieval settings, an early stopping strategy improves throughput. Result: On seven fine-grained visual datasets, the method outperforms existing approaches on both classification and retrieval, and its performance holds across different natural-language phrasings of the task. Conclusion: nlg2choice addresses zero-shot visual classification and retrieval with very large choice sets and is practical and robust. Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

[210] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz

Main category: cs.CV

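TL;DR: Proposes a new video anomaly detection framework based on multimodal large language models (MLLMs) that detects anomalies by generating textual descriptions of object activity and interactions, offering good explainability and state-of-the-art performance.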

Details

Motivation: Existing semi-supervised video anomaly detection methods struggle to detect complex anomalies that involve object interactions, and they lack explainability. Method: Queries an MLLM with object pairs at different moments to generate textual descriptions of object activity and interactions from nominal videos, and detects anomalies at test time by comparing against the descriptions from training. Result: Experiments on benchmark datasets show that the method effectively detects complex interaction-based anomalies and reaches state-of-the-art performance on datasets without interaction anomalies. Conclusion: The method improves detection of complex anomalies, provides inherent explainability for video anomaly detection, and can be combined with traditional methods to make them more interpretable. Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

[211] MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos

Gabriel Fiastre,Antoine Yang,Cordelia Schmid

Main category: cs.CV

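TL;DR: Proposes MaskCaptioner, a model trained on synthetic caption datasets that jointly detects, segments, tracks, and captions object trajectories in videos end-to-end, reaching state-of-the-art performance.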

Details

Motivation: Because dense video object captioning is complex and manual annotation is expensive, earlier methods rely on disjoint training strategies that lead to suboptimal performance, so a jointly trained end-to-end method is needed. Method: Uses a state-of-the-art vision-language model to generate synthetic captions for spatio-temporally localized entities, extends the LVIS and LV-VIS datasets into LVISCap and LV-VISCap, and trains MaskCaptioner on them to learn detection, segmentation, tracking, and captioning jointly. Result: MaskCaptioner achieves state-of-the-art dense video object captioning results on three benchmarks: VidSTG, VLN, and BenSMOT. Conclusion: Pretraining on synthetic captions effectively improves end-to-end video object captioning and provides high-quality data and a general framework for future work. Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

[212] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

JoungBin Lee,Jaewoo Jung,Jisang Han,Takuya Narihira,Kazumi Fukuda,Junyoung Seo,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim

Main category: cs.CV

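TL;DR: Proposes 3DScenePrompt, a framework that uses dual spatio-temporal conditioning and a 3D scene memory to generate the next video chunk from long input videos while preserving scene consistency and supporting precise camera control.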

Details

Motivation: Existing methods usually generate video from a single image or a short clip, struggle to keep both motion coherence and scene consistency over long sequences, and lack precise control over the camera viewpoint. Method: Introduces dual spatio-temporal conditioning that uses temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency; builds a 3D scene memory using dynamic SLAM and a dynamic masking strategy that separates static geometry from dynamic elements, yielding a static scene representation that can be projected to any viewpoint as a 3D spatial prompt. Result: Experiments show that the method significantly outperforms existing methods in scene consistency, camera controllability, and generation quality, and achieves long-range spatially coherent generation while preserving computational efficiency and motion realism. Conclusion: Through a 3D scene memory and dual spatio-temporal conditioning, 3DScenePrompt addresses scene consistency and camera control in generation from long input videos, offering a new approach to high-quality, controllable video generation. Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/

[213] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Zhe Li,Weihao Yuan,Weichao Shen,Siyu Zhu,Zilong Dong,Chang Xu

Main category: cs.CV

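TL;DR: Proposes a continuous masked autoregressive motion transformer for whole-body human motion generation driven by multiple modalities (text, speech, music); combining a DiT structure with attention mechanisms, it outperforms existing methods on several tasks.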

Details

Motivation: To address two key problems in whole-body multi-modal human motion generation: designing an effective motion generation mechanism and fusing multi-modal information such as text, speech, and music. Method: Proposes a continuous masked autoregressive motion transformer with causal attention, gated linear attention, and an RMSNorm module; uses a DiT structure to diffuse the conditions, and AdaLN and cross-attention to fuse the text, speech, and music modalities. Result: The method outperforms previous approaches on text-to-motion, speech-to-gesture, and music-to-dance, with better motion generation quality and multi-modal generalization. Conclusion: The method performs strongly in multi-modal human motion generation, balancing generation quality with multi-modal fusion; the code will be made public. Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
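
RMSNorm, one of the modules named above, rescales a feature vector by its root mean square so that inputs with very different magnitudes end up on a comparable scale, which is one plausible reading of how it helps with abnormal movements and heterogeneous modalities. The sketch below shows the standard RMSNorm computation on a single vector; the `eps` value and the optional per-feature `gain` are common defaults assumed here, not details taken from the paper.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Standard RMSNorm on one feature vector.

    Divides each element by sqrt(mean(x_i^2) + eps), then applies an
    optional learned per-feature gain (defaults to all ones).
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)
    return [g * v / rms for v, g in zip(x, gain)]


if __name__ == "__main__":
    small = rms_norm([0.3, 0.4])
    large = rms_norm([300.0, 400.0])
    # Both outputs have (approximately) unit root mean square.
    print(small, large)
```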

[214] RealDPO: Real or Not Real, that is the Preference

Guo Cheng,Danni Yang,Ziqi Huang,Jianlou Si,Chenyang Si,Ziwei Liu

Main category: cs.CV

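TL;DR: Proposes RealDPO, a new alignment paradigm that uses real-world data as positive samples for preference learning; by contrasting real videos with erroneous model outputs under Direct Preference Optimization (DPO) with a tailored loss, it markedly improves motion realism, text alignment, and overall quality in video generation.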

Details

Motivation: Existing video generation models often produce complex motions that lack naturalness, smoothness, and contextual consistency, which limits their practical use; a method that effectively improves motion realism is needed. Method: Proposes RealDPO, which takes real-world videos as positive samples and erroneous model outputs as negative samples, and performs preference learning with a modified DPO framework and a tailored loss function to enable iterative self-correction; also builds the RealAction-5K dataset to support post-training for complex motion synthesis. Result: Experiments show that RealDPO significantly outperforms state-of-the-art models and existing preference optimization methods in video quality, text alignment, and motion realism. Conclusion: By introducing a preference learning paradigm driven by real data, RealDPO improves the plausibility and realism of motion in video generation models and offers a new approach to generating complex dynamic scenes. Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
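
As background for the preference-learning setup, the standard DPO objective for one preferred/rejected pair is -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]). The sketch below implements that generic formula only; it is not RealDPO's tailored loss, which the abstract does not spell out. In the RealDPO setting the "win" sample would correspond to a real video and the "lose" sample to an erroneous model output; the function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for a single preference pair (scalar log-probs).

    Computes -log(sigmoid(margin)) as softplus(-margin), written in a
    numerically stable form.
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win)
        - (policy_logp_lose - ref_logp_lose)
    )
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy favors the preferred sample more than the reference does: lower loss.
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```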

[215] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li

Main category: cs.CV

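TL;DR: This paper proposes MathCanvas, a framework that uses two training stages to give large multimodal models intrinsic Visual Chain-of-Thought (VCoT) capabilities for mathematical problems that depend on diagrams.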

Details

Motivation: Existing VCoT methods are limited by rigid external tools or cannot generate high-quality, well-timed diagrams, so they struggle with complex mathematical problem solving. Method: The first stage, Visual Manipulation, pre-trains the model on a new corpus of 15.2M pairs; the second stage, Strategic Visual-Aided Reasoning, fine-tunes the model on a dataset of 219K interleaved visual-textual reasoning paths. Result: The resulting BAGEL-Canvas model achieves an 86% relative improvement over strong multimodal baselines on MathCanvas-Bench and generalizes well. Conclusion: The work provides a complete toolkit (framework, datasets, and benchmark) to unlock the potential of large multimodal models for complex, human-like visual-aided reasoning. Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

[216] C4D: 4D Made from 3D through Dual Correspondences

Shizun Wang,Zhenxiang Jiang,Xingyi Yang,Xinchao Wang

Main category: cs.CV

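TL;DR: This paper proposes C4D, a framework that extends existing 3D reconstruction methods to 4D recovery of dynamic scenes by introducing two temporal correspondences: short-term optical flow and long-term point tracking. It jointly optimizes per-frame 3D geometry and camera parameters and performs well on several downstream tasks.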

Details

Motivation: Existing pointmap-based 3D reconstruction methods work well on static scenes but fail on dynamic scenes, because moving objects violate multi-view geometric constraints; a new method that can handle dynamic elements is needed. Method: Alongside predicting pointmaps, C4D captures two kinds of correspondence, short-term optical flow and long-term point tracking. It trains a dynamic-aware point tracker to estimate motion masks that separate dynamic objects from the static background, and defines a set of dynamic-scene optimization objectives to recover per-frame 3D structure and camera parameters while lifting 2D trajectories into smooth 3D trajectories. Result: Experiments show that C4D achieves complete 4D recovery and performs strongly on downstream tasks including depth estimation, camera pose estimation, and point tracking. Conclusion: By introducing temporal correspondences and dynamic-aware optimization, C4D addresses the joint recovery of dynamic geometry and camera poses from monocular video and provides a reliable solution for 4D reconstruction of dynamic scenes. Abstract: Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend the existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D

[217] RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion

Thao Nguyen,Jiaqi Ma,Fahad Shahbaz Khan,Souhaib Ben Taieb,Salman Khan

Main category: cs.CV

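TL;DR: Proposes a precipitation nowcasting method that integrates token-wise attention into both the U-Net diffusion model and the spatio-temporal encoder. It captures multi-scale spatial interactions and temporal evolution without an additional latent module and significantly outperforms existing methods on multiple datasets.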

Details

Motivation: Existing diffusion-based models for precipitation nowcasting face scalability problems: latent-space methods require a separately trained autoencoder, which limits generalization, while pixel-space methods are computationally expensive and often lack attention mechanisms, making long-range spatio-temporal dependencies hard to model. Method: The paper proposes a new architecture that integrates token-wise attention directly into the U-Net diffusion model and the spatio-temporal encoder. It natively supports modeling of multi-scale spatial interactions and temporal dynamics, avoids a separate latent module, and strengthens the capture of complex spatio-temporal patterns while keeping computational cost low. Result: Experiments and visual evaluations on multiple datasets show that the method significantly outperforms current state-of-the-art precipitation forecasting methods in local fidelity, generalization, and robustness. Conclusion: By natively integrating attention, the proposed method addresses the scalability and long-range dependency modeling problems of diffusion models in precipitation nowcasting without added complexity, showing strong performance and application potential. Abstract: Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose Token-wise Attention, integrated into not only the U-Net diffusion model but also the spatio-temporal encoder, which dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.

[218] ChangingGrounding: 3D Visual Grounding in Changing Scenes

Miao Hu,Zhiwei Huang,Tai Wang,Jiangmiao Pang,Dahua Lin,Nanning Zheng,Runsen Xu

Main category: cs.CV

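TL;DR: This paper introduces ChangingGrounding, the first benchmark for 3D visual grounding in changing scenes, which emphasizes using memory and active exploration to reduce re-scanning cost. It also proposes a zero-shot method, Mem-ChangingGrounder, that maintains high accuracy while significantly reducing exploration cost.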

Details

Motivation: Existing 3D visual grounding methods depend on a complete, up-to-date point cloud, so real-world changing scenes require frequent re-scanning, which is costly and hinders deployment. A new 3D grounding paradigm is needed that exploits past observations, explores actively, and adapts to change. Method: The authors propose the ChangingGrounding benchmark, which evaluates an agent's ability to use memory, explore selectively, and localize precisely in changing scenes. They also propose Mem-ChangingGrounder, which combines cross-modal retrieval with lightweight multi-view fusion: it identifies the queried object type, retrieves memories, guides exploration, and fuses multi-view scans to produce accurate 3D bounding boxes. Result: On the ChangingGrounding benchmark, Mem-ChangingGrounder achieves the highest localization accuracy on multiple metrics and significantly reduces exploration cost, showing good zero-shot generalization. Conclusion: 3D visual grounding should shift toward a research paradigm centered on memory and active exploration; the ChangingGrounding benchmark and the Mem-ChangingGrounder method offer a new direction for efficient 3D grounding in real-world scenes. Abstract: Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/

[219] WithAnyone: Towards Controllable and ID Consistent Image Generation

Hengyuan Xu,Wei Cheng,Peng Xing,Yixiao Fang,Shuhan Wu,Rui Wang,Xianfang Zeng,Daxin Jiang,Gang Yu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

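TL;DR: This paper proposes WithAnyone, a diffusion-based method for identity-consistent text-to-image generation. By building the large-scale paired dataset MultiID-2M, designing a new benchmark, and introducing a contrastive identity loss, it mitigates the "copy-paste" problem and enables controllable generation with natural variation in pose and expression while keeping identity similarity high.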

Details

Motivation: Because large-scale paired multi-image data is scarce, existing identity-consistent text-to-image methods rely on reconstruction-based training, which leads to a "copy-paste" failure mode in which the reference face is replicated directly without natural variation, hurting controllability and expressiveness. Method: (1) Build MultiID-2M, a large-scale paired multi-identity dataset; (2) propose a new benchmark that quantifies copy-paste artifacts and the trade-off between identity fidelity and variation; (3) design a new training paradigm based on a contrastive identity loss that uses paired data to balance fidelity and diversity; (4) develop the diffusion model WithAnyone on top of these. Result: Experiments show that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains high visual quality; user studies confirm its advantages in identity fidelity and expressiveness. Conclusion: Through a new dataset, evaluation benchmark, and training strategy, WithAnyone addresses the copy-paste problem in identity-consistent generation and achieves controllable face generation that is both high-fidelity and varied. Abstract: Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
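
The abstract above does not spell out the form of the contrastive identity loss. As one plausible reading, the sketch below uses a generic InfoNCE-style objective over face-identity embeddings: the generated face is pulled toward another image of the same person and pushed away from other identities. All names here, the cosine similarity, and the temperature value are assumptions made for illustration, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # small constant avoids division by zero


def contrastive_identity_loss(generated, same_identity, other_identities, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in for the paper's loss).

    generated:        embedding of the generated face
    same_identity:    embedding of a different photo of the same person (positive)
    other_identities: embeddings of other people (negatives)
    """
    logits = [cosine(generated, same_identity) / temperature]
    logits += [cosine(generated, neg) / temperature for neg in other_identities]
    # Cross-entropy with the positive at index 0, using a stable log-sum-exp.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


loss = contrastive_identity_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
```

Using a different photo of the same person as the positive, rather than the reference image itself, is what paired data makes possible; it rewards identity similarity without rewarding pixel-level copying.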

[220] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Shaowei Liu,Chuan Guo,Bing Zhou,Jian Wang

Main category: cs.CV

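TL;DR: Proposes Ponimator, a framework built on close-proximity interactive pose priors that uses two conditional diffusion models to generate interactive motion sequences from images, text, or a single pose, supporting a range of interaction animation tasks.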

Details

Motivation: Humans can infer the context of an interaction and how it evolves from close-proximity interactive poses. Inspired by this, the authors aim to build a framework that models such prior knowledge to generate natural human interaction animation. Method: The framework uses two conditional diffusion models: a pose animator that applies a temporal prior to generate dynamic motion sequences from interactive poses, and a pose generator that applies a spatial prior to produce interactive poses from a single pose, text, or both. The training data comes from close-contact two-person interactions in motion-capture datasets. Result: Ponimator performs well across multiple datasets and applications, supporting image-based interaction animation, reaction animation, and text-to-interaction synthesis, which validates the generality of the pose prior and the effectiveness and robustness of the framework. Conclusion: Modeling priors from close-proximity interactive poses is an effective route to natural human interaction animation, and Ponimator offers a general, flexible way to transfer interaction knowledge from high-quality mocap data to open-world scenarios. Abstract: Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.

[221] Terra: Explorable Native 3D World Model with Point Latents

Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu

Main category: cs.CV

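TL;DR: This paper proposes Terra, a new world model based on a native 3D representation. It jointly models geometry and appearance with a point-to-Gaussian variational autoencoder (P2G-VAE) and a sparse point flow matching network (SPFlow), achieving reconstruction and generation with high 3D consistency on ScanNet v2 indoor scenes.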

Details

Motivation: Most existing world models rely on pixel-aligned representations and neglect the inherent 3D structure of the physical world, which leads to weak 3D consistency and low modeling efficiency. Method: Terra uses P2G-VAE to encode 3D inputs into a latent point representation and decode it into 3D Gaussian primitives that jointly model geometry and appearance. The SPFlow network performs sparse point flow matching in the latent space, denoising positions and features simultaneously. Result: On the ScanNet v2 dataset, Terra achieves state-of-the-art performance in both reconstruction and generation, with high 3D consistency and support for rendering from any viewpoint and progressive environment generation. Conclusion: With a native 3D representation and architecture, Terra delivers efficient and consistent explorable world modeling and advances world models for 3D scene understanding and generation. Abstract: World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
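
For readers unfamiliar with flow matching, the sketch below shows the common linear-path formulation applied to one point latent whose position and features are concatenated into a single vector, so both are denoised together. It is a generic illustration under that assumption, not SPFlow itself: the function names are hypothetical, and the learned network is replaced by an oracle velocity.

```python
def flow_matching_pair(noise, data, t):
    """Training pair for linear-path flow matching at time t in [0, 1].

    Returns the interpolated point x_t and the velocity target (data - noise)
    that a network would be trained to predict at (x_t, t).
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity_target = [d - n for n, d in zip(noise, data)]
    return x_t, velocity_target


def euler_sample(velocity_fn, x, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# One point latent: 3 position coordinates followed by 1 feature channel.
noise = [0.0, 0.0, 0.0, 1.0]
data = [1.0, 2.0, 3.0, -1.0]
_, target = flow_matching_pair(noise, data, 0.5)
# An oracle that always returns the true velocity carries noise to data.
sample = euler_sample(lambda x, t: target, noise)
```

In a real model the oracle would be a network conditioned on the sparse set of points, and the position and feature channels would share one velocity prediction per point.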

[222] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari,Sheng-Yu Wang,Nanxuan Zhao,Yotam Nitzan,Yuheng Li,Krishna Kumar Singh,Richard Zhang,Eli Shechtman,Jun-Yan Zhu,Xun Huang

Main category: cs.CV

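TL;DR: Proposes a new training paradigm for image editing that needs no paired data. It uses gradient feedback from vision-language models together with a distribution matching loss that preserves image fidelity, and in the few-step generation setting it matches models trained on large amounts of supervised data.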

Details

Motivation: Existing image editing models depend on supervised fine-tuning with large-scale input-target pairs, but such data is hard to collect at scale, and synthetic data propagates the artifacts of the pretrained model. Method: The method unrolls a few-step diffusion model during training and uses a vision-language model (VLM) to assess both instruction following and content preservation, providing direct gradients for end-to-end optimization. A distribution matching loss (DMD) keeps the generated images within the image manifold learned by the pretrained model. Result: On standard benchmarks, without any paired data, the method performs on par with diffusion models trained on extensive supervised paired data in the few-step setting, and outperforms RL-based methods such as Flow-GRPO that use the same VLM. Conclusion: The method removes the dependence on paired training data and achieves efficient, high-quality text-driven image editing through VLM feedback and a distribution matching loss. Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

[223] From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Haiwen Diao,Mingxuan Li,Silei Wu,Linjun Dai,Xiaohua Wang,Hanming Deng,Lewei Lu,Dahua Lin,Ziwei Liu

Main category: cs.CV

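TL;DR: This paper proposes NEO, a family of native vision-language models (VLMs) built from first principles, to address the fundamental differences between native and modular VLMs and the accessibility of research on them. NEO aligns pixel and word representations, merges vision and language capabilities, and inherently supports cross-modal encoding, alignment, and reasoning. It develops strong visual perception from only 390M image-text examples while avoiding vision-language conflicts.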

Details

Motivation: Native vision-language models (VLMs) are promising, but the fundamental constraints that set them apart from modular VLMs remain unclear, and the high barrier to entry discourages broad exploration. The challenges need to be clarified and guiding principles established to move the field forward. Method: The paper proposes three design principles: (i) align pixel and word representations in a shared semantic space; (ii) combine the strengths of traditionally separate vision and language modules; (iii) inherently support a range of cross-modal properties. NEO is built on these principles as a dense, monolithic model trained efficiently from scratch. Result: NEO rivals top-tier modular VLMs across diverse real-world scenarios, learns visual perception from scratch with only 390M image-text pairs, and mitigates vision-language conflicts inside the model. Conclusion: NEO lays a foundation for scalable and powerful native VLMs and provides a set of reusable components that foster a low-cost, extensible research ecosystem, advancing the adoption and progress of native VLMs. Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

[224] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer,Yunzhi Zhang,Chen Geng,Jia-Bin Huang,Jiajun Wu

Main category: cs.CV

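TL;DR: Proposes an inference-time method based on coupled diffusion sampling that achieves multi-view consistent image editing through implicit 3D regularization, without optimizing an explicit 3D representation, and is both efficient and general.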

Details

Motivation: Pre-trained 2D image editing models cannot maintain cross-view consistency when editing multi-view images, and existing methods based on explicit 3D representations suffer from slow optimization and instability under sparse views. Method: The paper proposes coupled diffusion sampling, which samples simultaneously from a multi-view image distribution and a 2D edited image distribution during generation and adds a coupling term that enforces multi-view consistency of the results, providing implicit 3D regularization. Result: The method's effectiveness and generality are validated on three different multi-view editing tasks; it applies to a variety of model architectures and remains stable under sparse views. Conclusion: The framework offers an efficient, stable, and general solution for multi-view consistent image editing that avoids a complex 3D optimization process. Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
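
The idea of sampling two trajectories with a coupling term can be pictured with a deliberately simplified, noise-free toy in one dimension. The sketch below is an assumption-laden illustration, not the paper's sampler: each trajectory drifts toward the mode of its own distribution (standing in for a learned denoiser), and a quadratic coupling term pulls the two samples toward each other. The names `multiview_mode`, `edit_mode`, `step`, and `coupling` are all hypothetical.

```python
def coupled_sample(multiview_mode, edit_mode, x0, y0, steps=50, step=0.1, coupling=0.1):
    """Toy coupled sampling with two scalar trajectories (illustrative only).

    x follows a stand-in "multi-view" model whose samples concentrate at
    multiview_mode; y follows a stand-in "2D editing" model concentrated at
    edit_mode. Real samplers would use learned denoisers and inject noise.
    """
    x, y = x0, y0
    for _ in range(steps):
        # Each drift comes from the trajectory's own model, plus a term that
        # pulls it toward the other trajectory.
        dx = step * (multiview_mode - x) - coupling * (x - y)
        dy = step * (edit_mode - y) - coupling * (y - x)
        x, y = x + dx, y + dy
    return x, y


# Without coupling the samples stay 3.0 apart; with coupling they end closer.
independent = coupled_sample(0.0, 3.0, 0.0, 3.0, coupling=0.0)
coupled = coupled_sample(0.0, 3.0, 0.0, 3.0, coupling=0.1)
```

In this toy the coupled samples settle between the two modes, which mirrors the intended trade-off: the edited views stay close to what the editing model wants while being drawn toward what the multi-view model considers consistent.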

Table of Contents

cs.CL [Back]

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

[4] Users as Annotators: LLM Preference Learning from Comparison Mode

[5] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

[6] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

[7] ConDABench: Interactive Evaluation of Language Models for Data Analysis

[8] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

[9] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

[10] Meronymic Ontology Extraction via Large Language Models

[11] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

[12] Serialized EHR make for good text representations

[13] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

[14] On-device System of Compositional Multi-tasking in Large Language Models

[15] Language steering in latent space to mitigate unintended code-switching

[16] Revisiting the UID Hypothesis in LLM Reasoning Traces

[17] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

[18] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

[19] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

[20] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

[21] Harnessing Consistency for Robust Test-Time LLM Ensemble

[22] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

[23] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

[24] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

[25] Unlocking the Potential of Diffusion Language Models through Template Infilling

[26] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

[27] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

[28] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

[29] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

[30] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

[31] PAGE: Prompt Augmentation for text Generation Enhancement

[32] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

[33] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

[34] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

[35] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

[36] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

[37] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

[38] Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges

[39] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

[40] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

[41] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

[42] Schema for In-Context Learning

[43] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

[44] Interpreting the Latent Structure of Operator Precedence in Language Models

[45] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

[46] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

[47] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

[48] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

[49] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

[50] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

[51] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

[52] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

[53] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

[54] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

[55] LLMs Can Get "Brain Rot"!

[56] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

[57] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

[58] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

[59] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

[60] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

[61] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

[62] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

[63] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

[64] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

[65] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

[66] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

[67] DROID: Dual Representation for Out-of-Scope Intent Detection

[68] Toward Cybersecurity-Expert Small Language Models

[69] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

[70] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

[71] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

[72] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

[73] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

[74] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

[75] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

[76] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

[77] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

[78] Qwen3Guard Technical Report