cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

Ashish Kattamuri,Ishita Prasad,Meetu Malhotra,Arpita Vats,Rahul Raja,Albert Lie

Main category: cs.CL

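TL;DR: Proposes a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve the execution accuracy and semantic alignment of cross-lingual Text-to-SQL systems, allowing a small 3B model to beat a much larger zero-shot 8B model on execution accuracy.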

Details

Motivation: Existing Text-to-SQL methods focus too heavily on producing executable queries and overlook the challenge of semantic alignment; execution accuracy also drops markedly for non-English languages.

Method: Introduces a multilingual contrastive reward signal based on semantic similarity into the GRPO framework, strengthening the semantic consistency between generated SQL and user intent.

Result: On the seven-language MultiSpider dataset, GRPO fine-tuning raises the execution accuracy of LLaMA-3-3B to 87.4% (+26 pp over zero-shot), and adding the contrastive reward raises average semantic accuracy to 59.14% (gains of up to 10 pp). Compared with a zero-shot 8B model, execution accuracy is higher (+7.43 pp) and semantic accuracy approaches it (59.14% vs. 68.57%).

Conclusion: Targeted semantic alignment through contrastive rewards can substantially improve small models on cross-lingual Text-to-SQL with very few training examples, without large-scale training data.

Abstract: Current Text-to-SQL methods are evaluated on and focused only on executable queries, overlooking the semantic alignment challenge -- both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by incorporating a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) -- all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.
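
To make the idea of a group-relative advantage with an added semantic reward concrete, here is a minimal Python sketch. The standardization of rewards within a sampled group follows the usual GRPO recipe; the reward itself (execution success plus `alpha` times a cosine similarity), the function names, and the toy embeddings are assumptions for illustration only, since the abstract does not give the paper's actual reward formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_rewards(exec_ok, sql_embs, intent_emb, alpha=0.5):
    """Hypothetical reward: 1 for a correct execution result, plus alpha times
    the similarity between a candidate's embedding and the intent embedding."""
    return [float(ok) + alpha * cosine(emb, intent_emb)
            for ok, emb in zip(exec_ok, sql_embs)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch a candidate that executes correctly and sits close to the user intent receives the largest advantage within its group, while candidates are only ever compared against others sampled for the same question.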

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

Ratna Kandala,Akshata Kishore Moharir,Divya Arvinda Nayak

Main category: cs.CL

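TL;DR: Proposes a Generative Operational Framework that uses large language models as a translation engine, combining the technical outputs of explainable AI (XAI) with clinical guidelines to generate readable, evidence-backed clinical narratives and close the gap between technical transparency and practical clinical use in mental health screening.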

Details

Motivation: Current XAI techniques produce technically accurate feature importance scores but do not give clinicians and patients insights that are genuinely useful and actionable, leaving a gap between the lab and the clinic.

Method: Proposes the Generative Operational Framework, which centers on a large language model (LLM) and uses retrieval-augmented generation (RAG) to integrate the outputs of multiple XAI tools with clinical guidelines and automatically generate clinically relevant, readable narratives.

Result: The paper demonstrates how the framework addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication, and how it moves the field from isolated data points toward integrated, actionable, and trustworthy AI systems.

Conclusion: By using the LLM as a translation engine, the framework offers a feasible path toward usable, trustworthy AI in clinical mental health screening.

Abstract: Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park,Hyejin Park,Hyeseon Ahn,Yo-Sub Han

Main category: cs.CL

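TL;DR: Proposes STELA, a framework that dynamically modulates watermark strength using linguistic indeterminacy modeled with part-of-speech n-grams, weakening the signal in grammatically constrained contexts to preserve text quality and strengthening it in linguistically flexible contexts to improve detectability. Detection is publicly verifiable without access to model logits, and the method shows better detection robustness than prior methods across several languages.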

Details

Motivation: Existing LLM watermarking techniques that rely on model output distributions (such as token-level entropy) require access to model logits at detection time, which limits public verifiability. A watermarking scheme is needed that does not depend on model internals while still balancing text quality against detection robustness.

Method: Proposes the STELA framework, which uses a part-of-speech (POS) n-gram model to quantify linguistic indeterminacy (the linguistic degrees of freedom) and adjusts watermark strength accordingly: stronger in linguistically flexible contexts, weaker in grammatically constrained ones. The detector is fully independent of the generating model and needs no logits, enabling public verification.

Result: Experiments on typologically diverse languages (English, Chinese, and Korean) show that STELA surpasses prior methods in detection robustness while preserving text quality, and supports public detection without model access.

Conclusion: By exploiting the inherent flexibility of linguistic structure, STELA achieves text watermarking with both high quality and high detection robustness, supports fully public verification, and provides a useful tool for building a trustworthy AI ecosystem.

Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
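
The modulation idea can be sketched with a POS bigram model: the entropy of the next-tag distribution stands in for linguistic indeterminacy, and a watermark bias is scaled by it. This is a minimal illustration, not STELA's implementation; the bigram order, the `base_delta` bias, the normalization by the maximum possible entropy, and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_pos_bigrams(tag_sequences):
    """Count next-tag frequencies for each preceding POS tag."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return counts

def next_tag_entropy(counts, prev_tag):
    """Shannon entropy (bits) of the next-tag distribution after prev_tag."""
    dist = counts.get(prev_tag, {})
    total = sum(dist.values())
    if total == 0:
        return 0.0  # unseen context: treated as fully constrained here
    return sum(-(c / total) * math.log2(c / total) for c in dist.values())

def watermark_strength(counts, prev_tag, base_delta=2.0):
    """Scale a hypothetical logit bias by entropy normalized to [0, 1]."""
    tags = set(counts) | {t for dist in counts.values() for t in dist}
    max_entropy = math.log2(len(tags)) if len(tags) > 1 else 1.0
    return base_delta * next_tag_entropy(counts, prev_tag) / max_entropy
```

Because the strength depends only on POS tags of the text, a detector could recompute it from the text alone with a tagger and the same n-gram table, which is consistent with the logit-free detection described in the abstract.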

[4] Users as Annotators: LLM Preference Learning from Comparison Mode

Zhongze Cai,Xiaocheng Li

Main category: cs.CL

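TL;DR: Proposes a new way to use the preference data that users generate while interacting with large language models for alignment: a user behavior model is built to infer data quality, and an EM algorithm estimates each user's latent quality factor so that low-quality annotations can be filtered out.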

Details

Motivation: Conventional pairwise preference data relies on professional human annotators, which is costly and limited in coverage. As LLMs spread, users produce large numbers of preference labels in everyday interactions; these labels are more personalized but lack quality control, so a method is needed to use and filter user-annotated data effectively.

Method: Proposes a user behavior model built on asymmetric responses (from two different models or two versions of the same model), uses an expectation-maximization (EM) algorithm to estimate each user's latent quality factor, and filters users' annotation data accordingly.

Result: Experiments show that the method captures user behavior and, on a downstream task, improves the quality of the preference data used for LLM alignment.

Conclusion: Modeling user behavior and estimating annotation quality makes it possible to use non-professional, user-generated preference data to improve LLM alignment, offering a feasible route to low-cost, large-scale data collection.

Abstract: Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.
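
As a rough sketch of how EM could recover a latent per-user quality from asymmetric comparisons, the Python below fits a two-component mixture. The behavior model is an assumption made for illustration and is not the paper's: an attentive user picks the stronger model's response with probability `p_good`, while a careless user picks at random with probability 0.5. All names and initial values are hypothetical.

```python
def em_user_quality(picks, n_iter=50, pi=0.5, p_good=0.7):
    """picks: list of (k, n) where a user chose the stronger model's response
    k times out of n comparisons. Returns (posteriors, pi, p_good), where
    posteriors[i] is the estimated probability that user i is attentive."""
    post = []
    for _ in range(n_iter):
        # E-step: posterior that each user belongs to the attentive component.
        # The binomial coefficient cancels between the two components.
        post = []
        for k, n in picks:
            attentive = pi * p_good ** k * (1 - p_good) ** (n - k)
            careless = (1 - pi) * 0.5 ** n
            post.append(attentive / (attentive + careless))
        # M-step: re-estimate the mixing weight and the attentive pick rate.
        pi = sum(post) / len(post)
        weighted_k = sum(w * k for w, (k, n) in zip(post, picks))
        weighted_n = sum(w * n for w, (k, n) in zip(post, picks))
        if weighted_n > 0:
            p_good = min(max(weighted_k / weighted_n, 1e-6), 1 - 1e-6)
    return post, pi, p_good
```

Annotations from users whose posterior falls below a chosen threshold could then be dropped before preference training; where to set that threshold is a separate design choice not addressed here.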

[5] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

Chao Han,Yijuan Liang,Zihao Xuan,Daokuan Wu,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

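TL;DR: Proposes informed routing, a new approach in which a predictive module estimates a neural unit's output before the routing decision is made, enabling a flexible execute-or-approximate policy that substantially reduces LLM inference cost while preserving model performance.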

Details

Motivation: Existing dynamic computation allocation methods rely on greedy routing, which tends to cause irreversible information loss and suboptimal token selection, limiting the efficiency of LLMs in real-world applications.

Method: Introduces the informed routing paradigm, which uses the Lightweight Feature Forecaster (LFF) to assess both a token's importance and its recoverability, and applies an execute-or-approximate policy during inference to cut unnecessary computation.

Result: Experiments show state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Even without final LoRA fine-tuning, the method matches or surpasses strong baselines that require full fine-tuning, while reducing training time by over 50%.

Conclusion: By assessing token recoverability ahead of the routing decision, informed routing addresses the information loss of conventional greedy routing and offers a new solution for efficient LLM inference.

Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing--a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

[6] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

Minsik Choi,Hyegang Son,Changhoon Kim,Young Geun Kim

Main category: cs.CL

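TL;DR: Proposes HIES, a new pruning criterion that combines head importance scores with attention entropy and markedly improves the quality and stability of compressed models.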

Details

Motivation: Existing gradient-based Head Importance Score (HIS) methods consider only the gradient-driven contribution and ignore the diversity of attention patterns, which limits pruning effectiveness.

Method: Introduces HIES (Head Importance-Entropy Score), which combines HIS with attention entropy to assess the contribution of each attention head more completely.

Result: Experiments show that HIES-based pruning improves model quality by up to 15.2% and stability by 2.04x over HIS-only methods.

Conclusion: HIES identifies redundant attention heads more comprehensively, enabling efficient model compression without sacrificing accuracy or stability.

Abstract: Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics (multiple layers and attention heads) introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
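
One way to picture a combined importance-entropy criterion is the Python sketch below. The abstract does not state how HIES combines the two signals, so the min-max normalization, the convex weight `alpha`, and the choice to treat higher attention entropy as evidence for keeping a head are all assumptions, as are the function names.

```python
import math

def attention_entropy(attn_rows):
    """Mean Shannon entropy (nats) over a head's attention rows.
    Each row is a probability distribution over key positions."""
    entropies = [sum(-p * math.log(p) for p in row if p > 0) for row in attn_rows]
    return sum(entropies) / len(entropies)

def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def combined_scores(his, entropies, alpha=0.5):
    """Hypothetical score: convex mix of normalized HIS and normalized entropy."""
    return [alpha * h + (1 - alpha) * e
            for h, e in zip(min_max(his), min_max(entropies))]

def heads_to_prune(scores, k):
    """Indices of the k lowest-scoring heads."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]
```

With `alpha=1.0` the sketch reduces to HIS-only ranking, which gives a simple baseline to compare against when trying other mixing weights.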

[7] ConDABench: Interactive Evaluation of Language Models for Data Analysis

Avik Dutta,Priyanshu Gupta,Hosein Hasanbeig,Rahul Pratap Singh,Harshit Nigam,Sumit Gulwani,Arjun Radhakrishna,Gustavo Soares,Ashish Tiwari

Main category: cs.CL

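TL;DR: ConDABench is a new framework for generating and evaluating conversational data analysis (ConDA) tasks. A multi-agent workflow generates 1,420 problems from real articles, and the framework supports, for the first time, systematic evaluation of interactive data analysis tools.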

Details

Motivation: Existing benchmarks for LLM data analysis do not capture the real-world complexity of under-specified goals and unclean data, and lack support for interactivity, so an evaluation framework closer to real scenarios is needed.

Method: Proposes the ConDABench framework, comprising a multi-agent workflow for generating ConDA problems, the resulting problem set, and an evaluation harness for testing external tools.

Result: 1,420 ConDA problems were generated. Evaluation shows that newer LLMs solve more instances but improve little on complex tasks that require sustained interaction.

Conclusion: ConDABench provides a way to evaluate and advance collaborative data analysis models capable of sustained interaction.

Abstract: Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models is better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

[8] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

Debarun Bhattacharjya,Balaji Ganesan,Junkyu Lee,Radu Marinescu,Katsiaryna Mirylenka,Michael Glass,Xiao Shou

Main category: cs.CL

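TL;DR: Studies uncertainty quantification (UQ) for the generated outputs of large language models, proposing a black-box UQ framework based on output consistency together with similarity-based aggregation and new confidence estimation techniques. Experiments on question answering, summarization, and text-to-SQL show better-calibrated confidences than baselines.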

Details

Motivation: Making LLM outputs more trustworthy requires effective uncertainty quantification, especially in black-box settings without access to model internals, so that confidence can be estimated robustly, flexibly, and at low cost.

Method: Proposes a high-level, non-verbalized similarity-based aggregation framework that uses the consistency between generated outputs as a proxy for correctness, and develops new confidence estimation models within it that are trained on small training sets.

Result: Experiments across several tasks (question answering, summarization, text-to-SQL) show that the proposed similarity-based methods yield better-calibrated confidences than baselines.

Conclusion: Black-box uncertainty quantification based on output consistency is effective, and the proposed framework and techniques help improve confidence assessment for LLMs on complex generative tasks.

Abstract: When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
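
The consistency-as-confidence idea can be illustrated with a very small Python sketch: each sampled generation is scored by its average similarity to the other samples. Token-level Jaccard similarity and simple averaging are stand-ins chosen here for brevity; the paper's similarity measures and its trained aggregation models are not specified in the abstract, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def similarity_confidence(samples, sim=jaccard):
    """Confidence of each sample = mean similarity to all other samples."""
    confidences = []
    for i, sample in enumerate(samples):
        others = [sim(sample, other) for j, other in enumerate(samples) if j != i]
        confidences.append(sum(others) / len(others) if others else 0.0)
    return confidences
```

The `sim` argument is left pluggable so that an embedding-based or task-specific similarity (for example, comparing SQL execution results) could replace the Jaccard stand-in without changing the aggregation step.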

[9] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

Weibin Cai,Reza Zafarani

Main category: cs.CL

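TL;DR: Proposes a culture-aware framework for constructing hate subspaces, modeling combinations of cultural attributes and using label propagation to handle biased training labels, cultural entanglement, and ambiguous annotation, and outperforming existing methods by 1.05% on average across metrics.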

Details

Motivation: Existing hate speech detection methods usually overlook a real-world complexity: training labels carry cultural bias, and people from different cultural backgrounds interpret what counts as hate differently.

Method: Constructs hate subspaces for individuals, models combinations of cultural attributes to alleviate data sparsity, and uses label propagation to capture the distinctive features of each combination, thereby disentangling cultural entanglement and handling ambiguous labels.

Result: Experiments show the method outperforms the state of the art by 1.05% on average across all metrics.

Conclusion: The proposed framework handles culture-related bias and label ambiguity, and individual hate subspaces improve the accuracy and robustness of hate speech detection.

Abstract: Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals' hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, we construct individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05% on average across all metrics.

[10] Meronymic Ontology Extraction via Large Language Models

Dekai Zhang,Simone Conia,Antonio Rago

Main category: cs.CL

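TL;DR: Proposes a fully automated method that uses large language models (LLMs) to extract product ontologies, specifically meronymies (part-whole relations), from raw review texts. It outperforms a BERT-based baseline under LLM-as-a-judge evaluation.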

Details

Motivation: Building ontologies by hand is time-consuming, expensive, and laborious, and existing automated methods leave room for improvement, so a more efficient and accurate automated extraction method is needed.

Method: Uses LLMs to extract product-related meronymy relations directly from raw user reviews and build a product ontology in a fully automated, end-to-end pipeline.

Result: Under an LLM-as-a-judge evaluation, the ontologies produced by the method are of higher quality than those of an existing BERT-based baseline.

Conclusion: LLMs show strong potential for ontology extraction in the product domain and beyond, pointing to a new direction for automated knowledge construction.

Abstract: Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluated using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

[11] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Yutao Wu,Xiao Liu,Yinghui Li,Yifeng Gao,Yifan Ding,Jiale Ding,Xiang Zheng,Xingjun Ma

Main category: cs.CL

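TL;DR: Proposes ADMIT, an adversarial multi-injection technique for knowledge poisoning attacks on retrieval-augmented generation (RAG) systems that flips fact-checking verdicts at extremely low poisoning rates and induces deceptive justifications.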

Details

Motivation: Prior work focuses on LLMs' susceptibility to misleading retrieved content, but in real-world fact-checking credible evidence usually dominates the retrieval pool, so the effect of knowledge poisoning needs to be studied where authentic and malicious content are mixed.

Method: Proposes ADMIT, a few-shot, semantically aligned poisoning attack that needs no access to the target LLM or retriever and no token-level control, and that influences fact-checking decisions by injecting adversarial content into the knowledge base.

Result: ADMIT transfers strongly across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, with an average attack success rate (ASR) of 86% at a poisoning rate of only 0.93 × 10⁻⁶. It remains robust in the presence of strong counter-evidence and improves ASR by 11.2% over prior state-of-the-art attacks.

Conclusion: ADMIT exposes serious vulnerabilities in real-world RAG-based fact-checking systems, showing that even where credible evidence dominates, a small number of carefully crafted adversarial injections can significantly manipulate outputs.

Abstract: Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of 0.93 × 10⁻⁶, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

[12] Serialized EHR make for good text representations

Zhirong Chou,Quan Qin,Shi Li

Main category: cs.CL

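TL;DR: Proposes SerialBEHRT, a domain-aligned foundation model built on SciBERT with additional pretraining on structured EHR sequences. Temporal serialization improves the representation of electronic health records, and the model outperforms existing methods on antibiotic susceptibility prediction.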

Details

Motivation: Existing methods struggle to reconcile the tabular, event-based nature of electronic health records (EHRs) with the sequential priors of natural language models, which limits their ability to capture longitudinal dependencies across patient encounters.

Method: Extends SciBERT into SerialBEHRT through additional pretraining on structured EHR sequences, so that the model encodes temporal and contextual relationships among clinical events and produces richer patient representations.

Result: On antibiotic susceptibility prediction, SerialBEHRT achieves superior and more consistent performance than current state-of-the-art EHR representation methods.

Conclusion: Temporal serialization is important for representation learning on EHR data when pretraining healthcare foundation models.

Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large-scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event-based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain-aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state-of-the-art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

[13] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang,Nasib Ullah,Erik Schultheis,Rohit Babbar

Main category: cs.CL

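TL;DR: Proposes DynaSpec, a context-dependent dynamic shortlisting mechanism that accelerates speculative decoding in LLM inference and is more robust and efficient than fixed vocabulary-subset methods.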

Details

Motivation: In speculative decoding, the parameters of the small drafter's output head grow with vocabulary size and become a latency bottleneck; fixed vocabulary-subset methods are corpus-dependent and suppress rare tokens.

Method: Introduces lightweight meta-classifiers that route contexts to a small number of token clusters and dynamically form the drafter's shortlist, while verification still uses the full vocabulary. Parallel streams let the meta-classifier finish before the drafter's hidden state is generated.

Result: On standard speculative-decoding benchmarks, DynaSpec achieves consistent gains in mean accepted length over fixed-shortlist baselines, and context-dependent selection allows smaller shortlists without degrading acceptance.

Conclusion: Context-aware dynamic shortlisting improves the efficiency and generality of speculative decoding and overcomes the limitations of static shortlists.

Abstract: Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
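
The shortlisting step described in the abstract (the union of the top-k clusters forms the drafter's shortlist) can be sketched in a few lines of Python. The cluster scores would come from the meta-classifier; here they are passed in directly, and the greedy drafting function, the data layout, and all names are assumptions for illustration rather than the DynaSpec implementation.

```python
def build_shortlist(cluster_scores, clusters, top_k):
    """Union of the token ids in the top_k highest-scoring clusters."""
    ranked = sorted(range(len(cluster_scores)),
                    key=lambda c: cluster_scores[c], reverse=True)
    shortlist = set()
    for c in ranked[:top_k]:
        shortlist.update(clusters[c])
    return sorted(shortlist)

def draft_token(hidden, output_rows, shortlist):
    """Greedy draft restricted to shortlisted rows of the output head, so the
    cost scales with len(shortlist) * d instead of |V| * d."""
    def logit(token_id):
        return sum(h * w for h, w in zip(hidden, output_rows[token_id]))
    return max(shortlist, key=logit)
```

A draft restricted this way can miss the token the full head would have chosen; exactness is unaffected because, as the abstract notes, verification by the target model still runs over the full vocabulary.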

[14] On-device System of Compositional Multi-tasking in Large Language Models

Ondrej Bohdal,Konstantinos Theodosiadis,Asterios Mpatziakas,Dimitris Filippidis,Iro Spyrou,Christos Zonios,Anastasios Drosou,Dimosthenis Ioannidis,Kyeng-Hun Lee,Jijoong Moon,Hyeonmok Ko,Mete Ozay,Umberto Michieli

Main category: cs.CL

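TL;DR: Proposes an efficient method for compositional multi-tasking that combines summarization and translation by adding a learnable projection layer on top of the adapters, achieving strong performance while remaining computationally efficient.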

Details

Motivation: Standard parameter-efficient fine-tuning methods perform poorly on complex compositional tasks, such as producing a translated summary of a long conversation, and struggle to execute multiple tasks simultaneously.

Method: Adds a learnable projection layer on top of the combined summarization and translation Low-Rank Adapters (LoRA) to integrate the two tasks effectively with reduced computational overhead.

Result: Experiments show that the method performs well and runs fast in both cloud-based and on-device settings, making it suitable for resource-constrained scenarios.

Conclusion: The framework improves compositional multi-tasking while remaining efficient and has potential for real-world deployment, especially on devices that require high speed and low resource consumption.

Abstract: Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

[15] Language steering in latent space to mitigate unintended code-switching

Andrey Goncharov,Nikolai Kondusov,Alexey Zaytsev

Main category: cs.CL

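TL;DR: Proposes a PCA-based latent-space language steering method that reduces code-switching in multilingual LLMs while preserving semantics at low computational cost.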

Details

Motivation: Multilingual LLMs often exhibit unintended code-switching in downstream tasks, which hurts reliability, so language identity needs to be controlled effectively.

Method: Identifies language directions by running principal component analysis (PCA) on parallel translations, then steers token embeddings along these directions at inference time to control language identity.

Result: A single principal component gives 95-99% language classification accuracy, and next-token distributional divergence is reduced by up to 42% on Qwen2.5 and Llama-3.2 models. Language identity concentrates in the final layers with near-perfect linear separability.

Conclusion: This lightweight method substantially suppresses code-switching at negligible computational overhead and needs only a small amount of parallel data for calibration, making it practical and scalable.

Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.
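
A toy version of the two steps (find a language direction with PCA, then shift an embedding along it) might look like the Python below. It uses power iteration in place of a full PCA routine and two-dimensional toy vectors in place of real hidden states; the steering rule (move the projection to a target value along the first component) and all names are assumptions, not the authors' procedure.

```python
import math

def first_principal_component(vectors, n_iter=100):
    """Return (mean, unit first principal component) via power iteration."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    pc = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        # Multiply by the (unnormalized) covariance matrix: X^T (X pc).
        proj = [sum(x[i] * pc[i] for i in range(dim)) for x in centered]
        new = [sum(p * x[i] for p, x in zip(proj, centered)) for i in range(dim)]
        norm = math.sqrt(sum(a * a for a in new))
        if norm == 0:
            break
        pc = [a / norm for a in new]
    return mean, pc

def projection(vec, mean, pc):
    """Coordinate of vec along the principal component."""
    return sum((v - m) * p for v, m, p in zip(vec, mean, pc))

def steer(vec, mean, pc, target_proj):
    """Shift vec along pc so its projection equals target_proj; components
    orthogonal to pc are left unchanged."""
    shift = target_proj - projection(vec, mean, pc)
    return [v + shift * p for v, p in zip(vec, pc)]
```

In use, `target_proj` would be set to the average projection of embeddings from the desired language, so that only the language coordinate changes while the remaining directions, which presumably carry most of the semantics, stay fixed.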

[16] Revisiting the UID Hypothesis in LLM Reasoning Traces

Minju Gwak,Guijin Son,Jaehyung Kim

Main category: cs.CL

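TL;DR: Introduces entropy-based measures of information flow and finds that when LLMs solve math problems correctly, the information density of their reasoning is globally non-uniform, contrary to the uniform information density assumed for human communication.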

Details

Motivation: Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics, the paper studies how information flows through LLM reasoning and how this differs from human communication patterns.

Method: Introduces entropy-based metrics to analyze information flow within LLM reasoning traces and evaluates them on three mathematical reasoning benchmarks.

Result: Successful LLM reasoning shows markedly uneven swings in information density, in contrast to human communication patterns.

Conclusion: Correct LLM reasoning is associated with non-uniform information flow, a finding that challenges existing assumptions about machine reasoning and suggests new directions for building more interpretable and adaptive reasoning models.

Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics -- which posits that humans communicate by maintaining a stable flow of information -- we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
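
Entropy-based uniformity measures of this kind can be sketched as follows. The abstract does not define the paper's metrics, so the choices here are assumptions: step information is the mean next-token entropy within a step, global non-uniformity is the variance of step information over the trace, and local non-uniformity is the mean squared change between consecutive steps.

```python
import math

def token_entropy(probs):
    """Shannon entropy (bits) of one next-token distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def step_information(step_dists):
    """Mean next-token entropy over the tokens of one reasoning step."""
    return sum(token_entropy(d) for d in step_dists) / len(step_dists)

def global_nonuniformity(values):
    """Variance of per-step information across the whole trace."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def local_nonuniformity(values):
    """Mean squared change in information between consecutive steps."""
    diffs = [(b - a) ** 2 for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A perfectly uniform trace scores zero on both measures, so larger values indicate the kind of uneven swings in information density that the abstract associates with correct solutions.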

[17] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

Sicheng Lyu,Yu Gu,Xinyu Wang,Jerry Huang,Sitao Luan,Yufei Cui,Xiao-Wen Chang,Peng Lu

Main category: cs.CL

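TL;DR: Proposes EvoEdit, a new editing strategy that uses sequential null-space alignment to mitigate catastrophic interference during continual knowledge updates in LLMs, enabling stable and efficient model editing and matching or outperforming existing methods on real-world benchmarks.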

Details

Motivation: Existing model editing methods suffer from catastrophic interference in sequential editing, where new edits corrupt earlier knowledge updates, so a more stable method is needed to support continual updating of LLMs.

Method: Proposes EvoEdit, which performs sequential null-space alignment for each incoming edit so that both original and previously modified knowledge representations are preserved, avoiding interference and keeping outputs invariant on preserved knowledge.

Result: On real-world sequential knowledge-editing benchmarks, EvoEdit performs better than or comparably to state-of-the-art locate-then-edit methods, with up to a 3.53x speedup.

Conclusion: EvoEdit offers a simple yet effective solution, with strong theoretical guarantees, for editing LLMs in dynamically evolving information settings, and underscores the need for more principled editing methods.

Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves performance better than or comparable to prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.
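
The output-invariance property mentioned in the abstract can be illustrated with a generic null-space projection: if a weight update is projected so that it maps every preserved key vector to zero, the edited layer returns the same outputs on those keys. The Python sketch below is that generic construction, using Gram-Schmidt to grow the preserved subspace after each edit; it is not EvoEdit's alignment procedure, and the names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def extend_basis(basis, key, tol=1e-10):
    """Gram-Schmidt: add the part of key not already spanned by basis."""
    residual = list(key)
    for b in basis:
        coeff = dot(residual, b)
        residual = [r - coeff * x for r, x in zip(residual, b)]
    norm = math.sqrt(dot(residual, residual))
    if norm > tol:
        basis.append([r / norm for r in residual])
    return basis

def project_to_null_space(delta, basis):
    """Remove from each row of delta its components along the orthonormal
    basis, so the projected update maps every preserved key to zero."""
    projected = []
    for row in delta:
        out = list(row)
        for b in basis:
            coeff = dot(row, b)
            out = [o - coeff * x for o, x in zip(out, b)]
        projected.append(out)
    return projected

def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]
```

In a sequential setting, each edit's key would be added with `extend_basis` once the edit is applied, so that later updates are projected away from both the originally preserved keys and all earlier edits.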

[18] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

Peter Banyas,Shristi Sharma,Alistair Simmons,Atharva Vispute

Main category: cs.CL

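TL;DR: This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) across user personas. Experiments evaluate 19 LLMs on 15 topics and find that factual consistency varies markedly with provider and topic: xAI's Grok-3 is the most consistent, while lightweight models fare worse. The authors release their code and an interactive demo and advocate impartial evaluation and persona-invariant prompting strategies.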

Details

Motivation: Most LLM evaluations are led by vendors and lack independence, and they do not adequately examine whether models give factually consistent answers to users of different identities (e.g., demographics). An independent, impartial benchmark is therefore needed to measure factual consistency across diverse user contexts. Method: Build a set of personas representative of the general population; query each of 19 LLMs on 15 topics 100 times, each time with a different persona's context; convert the responses into sentence embeddings, compute cross-persona cosine similarity, and take a weighted average to obtain each model's factual consistency score. Result: Factual consistency scores for the 19 LLMs range from 0.7896 to 0.9065, with the mean of 0.8656 adopted as the benchmark threshold. Grok-3 performs best and several lightweight models perform worst. Consistency differs clearly by topic: the job market is lowest and G7 world leaders highest, while topics such as vaccines and the Israeli-Palestinian conflict diverge by provider. Conclusion: An LLM's factual consistency depends not only on the model itself but also on the topic and the provider. ConsistencyAI offers a reproducible, independent evaluation method that can help drive fairer, more stable, and more trustworthy language model design and use. Abstract: Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. 
We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.
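
The scoring pipeline described above (embeddings, cross-persona cosine similarity, weighted average) can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional vectors stand in for real sentence embeddings, the per-topic weights are hypothetical (the abstract does not say how the average is weighted), and the function names are not taken from the released code.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def topic_consistency(embeddings):
    """Mean pairwise cosine similarity across persona responses for one topic."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def model_consistency(per_topic, weights):
    """Weighted average of per-topic scores (the weighting scheme is an assumption)."""
    total = sum(weights[t] for t in per_topic)
    return sum(per_topic[t] * weights[t] for t in per_topic) / total

# Toy embeddings: one vector per persona response (placeholders, not real data).
responses = {
    "vaccines": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "g7_leaders": [[1.0, 0.0], [1.0, 0.0]],
}
per_topic = {topic: topic_consistency(vecs) for topic, vecs in responses.items()}
score = model_consistency(per_topic, {"vaccines": 1.0, "g7_leaders": 1.0})
```

In this toy setup the topic where one persona receives a different answer scores lower than the topic where all personas receive the same answer, which is the behaviour the benchmark is designed to expose.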

[19] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Fabian Wenz,Omar Bouattour,Devin Yang,Justin Choi,Cecil Gregg,Nesime Tatbul,Çağatay Demiralp

Main category: cs.CL

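TL;DR: This paper presents BenchPress, a human-in-the-loop system that pairs human experts with large language models (LLMs) to accelerate the construction of domain-specific text-to-SQL benchmarks for private enterprise data warehouses.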

Details

Motivation: Most text-to-SQL research relies on public datasets, and models perform poorly in private enterprise settings. Building high-quality enterprise text-to-SQL benchmarks is hampered by costly manual annotation and dependence on database experts, so an efficient and accurate way to produce domain-specific natural language-SQL pairs is needed. Method: BenchPress uses retrieval-augmented generation (RAG) and LLMs to draft several natural language descriptions for a given SQL query; human experts then select, rank, or edit the drafts. The approach uses enterprise SQL logs as its data source, reducing the annotation burden while ensuring domain alignment. Result: Experiments show that LLM-assisted annotation substantially reduces the time and effort needed to create high-quality benchmarks; combining human verification with LLM-generated suggestions improves annotation accuracy, benchmark reliability, and the robustness of model evaluation. Conclusion: BenchPress gives researchers and practitioners an efficient mechanism for building domain-specific text-to-SQL benchmarks, advancing the evaluation and development of text-to-SQL models in private enterprise settings; the system is open source. Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. 
By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

[20] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

Mamadou K. Keita,Christopher Homan,Sebastien Diarra

Main category: cs.CL

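TL;DR: Proposes Rule-to-Tag (R2T), a hybrid framework that integrates linguistic rules into a neural network's training objective, achieving high-accuracy part-of-speech tagging from unlabeled text and serving as effective pre-training for more complex tasks such as named entity recognition.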

Details

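Motivation: Address the scarcity of labeled data for low-resource languages and explore how to train models effectively without large annotated corpora. Method: Design a multi-tiered system of linguistic rules and build it into the neural network's training objective through an adaptive loss function whose regularization term lets the model handle out-of-vocabulary words with principled uncertainty. The approach belongs to the "principled learning" (PrL) paradigm, which trains with explicit task constraints rather than on labeled examples alone. Result: On Zarma part-of-speech tagging, an R2T-BiLSTM model trained only on unlabeled text reaches 98.2% accuracy, outperforming an AfriBERTa baseline fine-tuned on 300 labeled sentences; on named entity recognition, a model pre-trained with R2T and fine-tuned on 50 labeled sentences outperforms a baseline trained on 300 labeled sentences. Conclusion: R2T shows that combining linguistic knowledge with neural networks is effective, offers an efficient and scalable alternative for low-resource language processing, and supports the potential of principled learning to reduce dependence on labeled data. Abstract: We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network's training objective. R2T's novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperforms a baseline trained on 300.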

[21] Harnessing Consistency for Robust Test-Time LLM Ensemble

Zhichen Zeng,Qi Yu,Xiao Lin,Ruizhong Qiu,Xuying Ning,Tianxin Wei,Yuchen Yan,Jingrui He,Hanghang Tong

Main category: cs.CL

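TL;DR: This paper proposes CoRE, a plug-and-play method that harnesses model consistency to make LLM ensembles more robust, addressing inconsistency at both the token level and the model level.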

Details

Motivation: Different LLMs have distinct strengths and limitations. Ensemble methods have progressed, but their robustness to erroneous signals (such as tokenization mismatches and uneven model expertise) has received little attention. Method: At the token level, CoRE applies a low-pass filter to down-weight highly inconsistent tokens; at the model level, it promotes outputs that have high self-confidence and diverge little from those of other models. Result: Experiments across multiple benchmarks, model combinations, and ensemble strategies show that CoRE consistently improves ensemble performance and robustness. Conclusion: CoRE is a general and effective technique that improves the stability and accuracy of LLM ensembles at both fine-grained and global levels. Abstract: Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.

[22] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

A H M Rezaul Karim,Ozlem Uzuner

Main category: cs.CL

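TL;DR: The MasonNLP system uses a retrieval-augmented generation (RAG) framework built on a general-purpose large language model; in the MEDIQA-WV 2025 wound-care visual question answering task it improved response quality without additional training and ranked third.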

Details

Motivation: To improve the accuracy and structured-output capability of medical visual question answering in wound care, and to test how effective lightweight methods are for multimodal clinical NLP tasks. Method: Use a general-domain, instruction-tuned LLM with a retrieval-augmented generation (RAG) framework that brings in textual and visual in-domain exemplars through simple indexing and fusion, strengthening reasoning and schema adherence. Result: The system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, and improved generation quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Conclusion: Lightweight RAG with a general-purpose LLM is a simple and effective baseline for multimodal clinical NLP, improving performance without complex training or re-ranking. Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.
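
The "simple indexing and fusion" step can be pictured with a text-only sketch. Everything here is an assumption made for illustration: the Jaccard word-overlap retriever, the prompt layout, and the placeholder exemplars are not taken from the MasonNLP system, which also retrieves visual examples.

```python
def tokenize(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity, used here as a stand-in for a real retriever."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve(query, index, k=2):
    """Return the k indexed exemplars most similar to the query."""
    return sorted(index, key=lambda ex: jaccard(query, ex["query"]), reverse=True)[:k]

def fuse(query, exemplars):
    """Prepend retrieved exemplars to the query as few-shot context."""
    shots = "\n\n".join(
        f"Patient query: {ex['query']}\nResponse: {ex['response']}" for ex in exemplars
    )
    return f"{shots}\n\nPatient query: {query}\nResponse:"

# Placeholder in-domain exemplars (hypothetical, not real clinical data).
index = [
    {"query": "is the wound on my foot infected", "response": "EXAMPLE RESPONSE A"},
    {"query": "how often should I change the bandage", "response": "EXAMPLE RESPONSE B"},
]
query = "my foot wound looks red is it infected"
top = retrieve(query, index, k=1)
prompt = fuse(query, top)
```

The resulting `prompt` would then be passed to the general-purpose LLM; no model weights change, which is what makes this an inference-time layer.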

Shivanshu Kumar,Gopalakrishnan Srinivasan

Main category: cs.CL

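TL;DR: This paper proposes ShishuLM, an efficient language model architecture that reduces parameter count and KV cache requirements, substantially lowering memory use and latency while maintaining performance.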

Details

Motivation: Transformer models excel at NLP tasks but carry high memory and compute costs and contain structural redundancy; they need optimizing to suit the small language models used in agentic systems. Method: Drawing on AI interpretability and inference-time layer pruning research, the authors introduce the ShishuLM architecture. They find that normalization combined with attention computation is approximately linear in moderate-context scenarios, so entire transformer blocks can be replaced by multi-layer perceptrons (MLPs). Result: ShishuLM reduces memory requirements by up to 25% and improves latency by up to 40% during both training and inference, across small language models of different scales. Conclusion: From a pre-training standpoint, ShishuLM offers a viable path and insights toward more efficient small language model architectures. Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.

[24] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Chenyu Zhang,Sharifa Alghowinem,Cynthia Breazeal

Main category: cs.CL

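TL;DR: This study presents the first ensemble-LLM framework for large-scale affect sensing, analyzing 16,986 conversational turns between students and the AI tutor PyTutor. It characterizes how affect evolves during learning: student emotions are mostly mildly positive and volatile, and neutral states often act as turning points toward positive affect.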

Details

Motivation: Although studies have examined the impact of LLMs in education, their effect on learners' affective states during tutoring remains unclear, so a deeper understanding is needed to guide the responsible use of generative AI in education. Method: Three frontier LLMs (Gemini, GPT-4o, Claude) produce zero-shot affect annotations of dialogues between PyTutor and students, yielding ratings of valence, arousal, and learning-helpfulness plus free-text emotion labels; the results are fused through rank-weighted pooling and cross-model plurality consensus. Result: Students interacting with the AI tutor typically show mildly positive affect and moderate arousal. Confusion and curiosity are common, while frustration is rarer but can hinder learning. Emotional states are short-lived, with positive ones lasting slightly longer yet remaining fragile. Negative emotions often resolve quickly and sometimes turn positive, and neutral states more often lead to positive shifts. Conclusion: The ensemble-LLM framework captures affective dynamics during learning and suggests intervening at neutral moments to foster positive learning experiences, providing an affect-level basis for designing AI applications in education. Abstract: While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived--positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. 
Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.
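
The two fusion steps named in the abstract, rank-weighted intra-model pooling and plurality consensus across models, might look roughly like this. The abstract gives no formulas, so the 1/rank weighting, the tie-breaking rules, and the "neutral" fallback are all assumptions, and the labels below are invented placeholders.

```python
from collections import Counter, defaultdict

def pool_within_model(ranked_runs):
    """Rank-weighted pooling (assumed form): a label at rank r contributes 1/r."""
    scores = defaultdict(float)
    for labels in ranked_runs:
        for rank, label in enumerate(labels, start=1):
            scores[label] += 1.0 / rank
    # Iterate in sorted order so ties resolve alphabetically and deterministically.
    return max(sorted(scores), key=lambda lab: scores[lab])

def plurality(labels, fallback="neutral"):
    """Most frequent label across models; a tie for first place returns the fallback."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

# Each model annotates the same turn twice, returning labels ranked by confidence.
annotations = {
    "model_a": [["confusion", "curiosity"], ["confusion", "frustration"]],
    "model_b": [["curiosity", "confusion"], ["curiosity", "neutral"]],
    "model_c": [["confusion", "neutral"], ["confusion", "curiosity"]],
}
per_model = {name: pool_within_model(runs) for name, runs in annotations.items()}
consensus = plurality(list(per_model.values()))
```

Here two of the three models settle on the same label after pooling, so the plurality vote returns it; with a three-way split the sketch falls back to "neutral" rather than picking arbitrarily.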

[25] Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee,Seungyeon Kim,Nojun Kwak

Main category: cs.CL

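TL;DR: Proposes Template Infilling (TI), a generation method tailored to diffusion language models, combined with Dynamic Segment Allocation (DSA); it substantially outperforms baselines on mathematical reasoning and code generation.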

Details

Motivation: Existing diffusion language models still rely on the prefix prompting inherited from autoregressive models and lack control strategies suited to their generation mechanism. Method: Template Infilling (TI) first generates a structural template for the target response and then fills in the masked segments; Dynamic Segment Allocation (DSA) adaptively adjusts segment lengths based on generation confidence. Result: An improvement of 17.01 percentage points over the baseline on mathematical reasoning and code generation benchmarks, plus effective speedup in multi-token generation settings while maintaining generation quality. Conclusion: TI with DSA gives diffusion language models a more flexible and efficient form of generation control, improving performance and supporting a balance between speed and quality. Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs' generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over the baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

[26] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Elwin Huaman,Wendi Huaman,Jorge Luis Huaman,Ninfa Quispe

Main category: cs.CL

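TL;DR: This paper examines the integration of Quechua languages into the Common Voice platform, focusing on Puno Quechua (qxp). It shows the potential of community-driven construction of open speech datasets and proposes a research agenda covering technical challenges, ethics, and data sovereignty.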

Details

Motivation: Under-resourced languages such as Quechua face data scarcity in speech technology and need open, inclusive speech datasets to support their digital development. Method: Onboard Quechua languages onto the Common Voice platform and collect corpora covering both read and spontaneous speech, with Puno Quechua as a case study. Result: Common Voice now hosts 191.1 hours of Quechua speech (86% validated), of which Puno Quechua contributes 12 hours (77% validated). Conclusion: Common Voice gives under-resourced languages an effective way to accumulate speech data, helping to advance inclusive voice technology and the digital empowerment of indigenous language communities. Abstract: Under-resourced languages, such as the Quechua languages, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster open, community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.

[27] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Johann Pignat,Milena Vucetic,Christophe Gaudet-Blavignac,Jamil Zaghir,Amandine Stettler,Fanny Amrein,Jonatan Bonjour,Jean-Philippe Goldman,Olivier Michielin,Christian Lovis,Mina Bjelogrlic

Main category: cs.CL

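TL;DR: This paper presents FRACCO, an expert-annotated corpus of 1301 synthetic French clinical cases that supports research on named entity recognition and concept normalisation in French oncology.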

Details

Motivation: Annotated datasets for French clinical oncology are scarce, which limits the development of NLP tools, so a high-quality French annotated corpus is needed. Method: French clinical texts were produced by translating the Spanish CANTEMIST corpus and annotated for entities by domain experts. Morphology, topography, and histologic differentiation are annotated with ICD-O codes, an additional layer adds composite expression-level normalisations, and quality is ensured by combining automated matching with manual validation. Result: The final dataset contains 71127 ICD-O normalisations covering 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 expressions), and 2043 unique composite expressions (from 11144 expressions). Conclusion: FRACCO provides a reliable benchmark dataset for named entity recognition and concept normalisation in French oncology texts and should help advance related NLP tools. Abstract: Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology), an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset represents 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.

[28] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

Filipe Laitenberger,Dawid Kopiczko,Cees G. M. Snoek,Yuki M. Asano

Main category: cs.CL

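TL;DR: Proposes GateSkip, a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only language models, reducing compute while retaining high accuracy.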

Details

Motivation: Reduce the inference compute of large models without significantly hurting performance, especially for long-form reasoning and instruction-tuned models. Method: Each Attention/MLP branch gets a sigmoid-linear gate that condenses the branch output before it re-enters the residual stream; at inference, tokens are ranked by gate value and low-importance tokens are skipped under a per-layer budget. Result: On long-form reasoning, up to 15% of compute is saved while retaining over 90% of baseline accuracy; on instruction-tuned models, baseline quality is matched at close to 50% compute savings, with accuracy gains at full compute. Conclusion: GateSkip is a stable, differentiable, lightweight method that combines easily with quantization, pruning, and self-speculative decoding; it improves inference efficiency and offers insight into transformer information flow. Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
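
The inference-time rule (rank tokens by gate value, skip the low-importance ones under a per-layer budget) can be illustrated with plain lists. This is a sketch under assumptions rather than the paper's implementation: each token gets a single scalar gate, the budget is expressed as the fraction of tokens kept, and the branch output is precomputed for clarity, whereas a real system would avoid computing it for skipped tokens.

```python
def select_tokens(gate_values, budget):
    """Indices of tokens to process at this layer: the top `budget` fraction by gate value."""
    keep = max(1, int(budget * len(gate_values)))
    ranked = sorted(range(len(gate_values)), key=lambda i: gate_values[i], reverse=True)
    return sorted(ranked[:keep])

def gated_residual_update(hidden, branch_out, gate_values, budget):
    """Apply h <- h + g * f(h) to selected tokens; skipped tokens pass through unchanged."""
    active = set(select_tokens(gate_values, budget))
    return [
        [h + g * b for h, b in zip(h_tok, b_tok)] if i in active else list(h_tok)
        for i, (h_tok, b_tok, g) in enumerate(zip(hidden, branch_out, gate_values))
    ]

# Four tokens with one hidden dimension each (toy values).
gates = [0.9, 0.1, 0.6, 0.2]
hidden = [[1.0], [1.0], [1.0], [1.0]]
branch = [[2.0], [2.0], [2.0], [2.0]]
updated = gated_residual_update(hidden, branch, gates, budget=0.5)
```

With a budget of 0.5, only the two tokens with the largest gates receive the branch update; the other two keep their residual-stream values, which is where the compute saving comes from.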

[29] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Jimin Lim,Arjun Damerla,Arthur Jiang,Nam Le

Main category: cs.CL

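TL;DR: This paper introduces a new benchmark that evaluates how well large language models make decisions in multi-armed bandit environments using only textual feedback. Qwen3-4B excels at selecting the best arm, suggesting that probabilistic reasoning can emerge from language alone.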

Details

Motivation: Explore whether LLMs can make sequential decisions under uncertainty using natural language alone, without numerical cues. Method: Introduce a multi-armed bandit environment in which LLMs interact purely through textual feedback (e.g., "you earned a token"), and compare their performance with classical algorithms: Thompson Sampling, epsilon-greedy, Upper Confidence Bound (UCB), and random choice. Result: Most LLMs underperform the baselines, but Qwen3-4B reaches a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional methods. Conclusion: Probabilistic reasoning can develop from language alone, and the benchmark opens a new direction for evaluating decision-making in naturalistic, non-numeric contexts. Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback such as "you earned a token", without access to numerical cues or explicit probabilities, requiring the model to infer latent reward structures purely from linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compared it with standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
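
To make the setup concrete, here is a minimal sketch of one of the named baselines, epsilon-greedy, learning from text-only feedback and reporting its best-arm selection rate. The environment details are assumptions: the wording of the no-reward message, the arm probabilities, and the rule of trying each arm once before exploiting are illustrative choices, not the benchmark's specification.

```python
import random

FEEDBACK_WIN = "you earned a token"
FEEDBACK_LOSS = "you did not earn a token"  # assumed wording for the no-reward case

def pull(arm_probs, arm, rng):
    """Environment step: returns only a text message, never a number."""
    return FEEDBACK_WIN if rng.random() < arm_probs[arm] else FEEDBACK_LOSS

def epsilon_greedy(arm_probs, steps, epsilon, seed=0):
    """Run epsilon-greedy (steps >= number of arms) and return the best-arm selection rate."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    pulls, wins = [0] * n_arms, [0] * n_arms
    best = max(range(n_arms), key=lambda a: arm_probs[a])
    best_picks = 0
    for t in range(steps):
        if t < n_arms:
            arm = t                                   # try every arm once
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)               # explore
        else:
            arm = max(range(n_arms), key=lambda a: wins[a] / pulls[a])  # exploit
        text = pull(arm_probs, arm, rng)
        pulls[arm] += 1
        wins[arm] += int(text == FEEDBACK_WIN)        # parse the textual feedback
        best_picks += int(arm == best)
    return best_picks / steps

rate_noisy = epsilon_greedy([0.2, 0.5, 0.8], steps=500, epsilon=0.1)
rate_easy = epsilon_greedy([0.0, 1.0], steps=10, epsilon=0.0)
```

The best-arm selection rate returned here is the same kind of quantity as the 89.2% reported for Qwen3-4B: the fraction of steps on which the agent chose the arm with the highest reward probability.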

[30] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Alexandre Galashov,Matt Jones,Rosemary Ke,Yuan Cao,Vaishnavh Nagarajan,Michael C. Mozer

Main category: cs.CL

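TL;DR: Proposes a class of supervised training objectives called "Catch Your Breath" (CYB) that let a language model adapt the number of compute steps to its input. Using a special delay-request output and a pause input token, the model can ask for extra compute when it needs it, improving efficiency and accuracy.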

Details

Motivation: Conventional language models spend a fixed amount of compute on every token and cannot adapt to differences in token difficulty. The goal is to let the model decide for itself when it needs more compute, improving accuracy and efficiency. Method: Frame the choice of each output token as a sequential decision problem with a time cost, introduce a delay-request output and a pause input token, and study three CYB loss variants, CYB-AP (anytime prediction), CYB-VA (variational approach), and CYB-DP (compute-budget penalty), comparing them in fine-tuning experiments. Result: The CYB model needs only one third of the training data that the baseline model needs to reach the same performance, and it adapts its compute steps to token complexity: it often pauses after plural nouns, never pauses after the first token of contracted words, and responds variably to ambiguous tokens. Conclusion: CYB losses make better use of compute and achieve a better accuracy-latency trade-off, showing the potential of dynamic computation in language modeling. Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a special delay-request output. If the model is granted a delay, a specialized pause token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use these outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. 
For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.

[31] PAGE: Prompt Augmentation for text Generation Enhancement

Mauro Jose Pacchiotti,Luciana Ballejos,Mariel Ale

Main category: cs.CL

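TL;DR: This paper proposes PAGE, a framework that enriches the input with the outputs of lightweight auxiliary modules (such as classifiers or extractors) to improve the quality and controllability of natural language generation on specific tasks, without requiring complex auxiliary generative models.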

Details

Motivation: Natural language generation models can perform poorly on specific tasks or requirements and usually need large amounts of additional data to adapt, so a simpler, more flexible way to improve their performance and controllability is needed. Method: PAGE (Prompt Augmentation for text Generation Enhancement) uses lightweight auxiliary modules to draw inferences from the input text and uses their outputs to build an enriched input prompt that improves generation. The design is modular and easy to adapt to different tasks. Result: In a proof of concept in requirements engineering, an auxiliary module with a classifier improved the quality of generated software requirements. Conclusion: With a simple, extensible modular structure, PAGE improves the performance and controllability of generative models without relying on complex auxiliary generative models, and has broad potential for application. Abstract: In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.
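
The flow described above (auxiliary modules infer something from the input, and their outputs are folded into an enriched prompt) can be sketched in a few lines. The keyword classifier, the cue list, and the prompt layout are hypothetical stand-ins chosen for illustration; the abstract does not describe PAGE's actual modules or prompt format.

```python
def classify_requirement(text):
    """Toy keyword classifier standing in for a lightweight auxiliary model."""
    cues = ("performance", "security", "latency", "availability", "usability")
    lowered = text.lower()
    return "non-functional" if any(cue in lowered for cue in cues) else "functional"

def build_enriched_prompt(user_input, auxiliaries):
    """Run each auxiliary module on the input and prepend its inference to the prompt."""
    inferences = [f"- {name}: {fn(user_input)}" for name, fn in auxiliaries.items()]
    return "\n".join(
        ["Context inferred by auxiliary modules:", *inferences, "", "Task:", user_input]
    )

auxiliaries = {"requirement_type": classify_requirement}
user_input = "Write a requirement that keeps login latency under two seconds."
prompt = build_enriched_prompt(user_input, auxiliaries)
```

Because the auxiliaries are passed in as a plain mapping of names to functions, swapping the classifier for an extractor or adding a second module changes only that mapping, which mirrors the modularity the abstract emphasizes.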

Bolei Ma,Yong Cao,Indira Sen,Anna-Carolina Haensch,Frauke Kreuter,Barbara Plank,Daniel Hershcovich

Main category: cs.CL

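TL;DR: This paper argues that social simulation with large language models (LLMs) should value open-ended generated text rather than being confined to closed question-answer formats.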

Details

Motivation: Most studies confine LLM simulations to multiple-choice or short-answer formats, overlooking the models' generative capacity and making it hard to reflect the complexity and diversity of social phenomena. Method: Drawing on decades of survey-methodology research and recent advances in NLP, the paper argues for the advantages of open-ended text in capturing topics, viewpoints, and reasoning processes. Result: Open-ended designs can improve measurement and experimental design, support the discovery of unanticipated views, reduce researcher-imposed directive bias, capture expressiveness and individuality, and aid pretesting and methodological development in social simulation. Conclusion: New practices and evaluation frameworks should be developed that make full use of LLMs' generative diversity and bring NLP and social science together. Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we explain why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.

[33] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Ariel Kamen

Main category: cs.CL

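TL;DR: This study evaluates ten state-of-the-art large language models on text categorization under the IAB 2.2 hierarchical taxonomy. Despite growing model scale, performance on classic metrics remains limited, and hallucination and category inflation are widespread; a multi-model ensemble substantially improves accuracy and eliminates hallucinations.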

Details

Motivation: Examine the practical performance limits of current LLMs on structured text categorization, particularly under a hierarchical, fine-grained taxonomy, and analyze how model scale relates to accuracy. Method: Evaluate ten mainstream LLMs consistently on 8,660 human-annotated samples with identical zero-shot prompts, using four classic metrics (accuracy, precision, recall, F1-score) and three LLM-specific indicators (hallucination ratio, inflation ratio, categorization cost), and test an ensemble method in which multiple LLMs act as independent experts. Result: The LLMs average 34% accuracy, 42% precision, 45% recall, and 41% F1-score on the task and tend to overproduce categories. Gemini 1.5/2.0 Flash and GPT 20B/120B offer the best cost-to-performance balance, and GPT 120B has the lowest hallucination ratio. The ensemble method substantially improves accuracy, reduces inflation, and completely eliminates hallucinations. Conclusion: Scaling up models or improving architectures alone is not enough to improve performance on complex text categorization; coordinating multiple models in an ensemble may be an effective path to matching or surpassing human-expert performance. Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. 
The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
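
The abstract names the hallucination and inflation ratios and the expert ensemble but does not define them, so the sketch below rests on assumed definitions: hallucination ratio as the share of predicted labels that are not in the taxonomy, inflation ratio as predicted label count divided by human label count, and an ensemble that keeps only in-taxonomy labels proposed by a minimum number of experts. The taxonomy and labels are toy placeholders.

```python
from collections import Counter

def hallucination_ratio(predicted, taxonomy):
    """Assumed definition: fraction of predicted labels not present in the taxonomy."""
    if not predicted:
        return 0.0
    return sum(1 for p in predicted if p not in taxonomy) / len(predicted)

def inflation_ratio(predicted, human):
    """Assumed definition: predicted label count relative to the human label count."""
    return len(predicted) / len(human)

def ensemble_vote(expert_outputs, taxonomy, min_votes):
    """Keep labels proposed by at least `min_votes` experts and present in the taxonomy."""
    votes = Counter(label for labels in expert_outputs for label in set(labels))
    return sorted(l for l, n in votes.items() if n >= min_votes and l in taxonomy)

taxonomy = {"Sports", "Travel", "Technology"}          # toy taxonomy
human = ["Sports"]                                     # toy human annotation
experts = [
    ["Sports", "Travel"],
    ["Sports", "Soccer Gossip"],                       # second label is outside the taxonomy
    ["Sports", "Travel", "Technology"],
]
combined = ensemble_vote(experts, taxonomy, min_votes=2)
```

Under these definitions the ensemble output can never contain an out-of-taxonomy label, so its hallucination ratio is zero by construction, and requiring agreement among experts trims labels that only one model proposed.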

[34] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma,Andrei Cojocaru,Neel Kolhe,Bradley Louie,Robin Said Sharif,Haihan Zhang,Vincent Zhuang,Matei Zaharia,Sewon Min

Main category: cs.CL

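TL;DR: This paper proposes a systematic methodology for developing and validating fine-grained evaluators of LLM-generated math proofs, builds ProofBench, the first expert-annotated dataset of fine-grained proof ratings, and derives ProofGrader, an evaluator that markedly improves agreement with human grading and proves useful in downstream tasks.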

Details

Motivation: Evaluation of LLM mathematical reasoning has focused on tasks with easily verifiable answers, while the generation and evaluation of natural language proofs still lacks a reliable fine-grained assessment mechanism; an accurate automatic evaluator is needed. Method: Propose a systematic methodology for designing evaluators that assign fine-grained scores on a 0-7 scale; build ProofBench, an expert-annotated dataset of 145 competition problems and 435 LLM-generated solutions; explore the evaluator design space across backbone model, input context, instructions, and workflow; and arrive at ProofGrader, which combines a strong reasoning model, context from reference solutions and marking schemes, and a simple ensembling method. Result: ProofGrader achieves a Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines; in best-of-n selection at n=16 it achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator and the human oracle. Conclusion: The work fills a key gap in the automatic evaluation of LLM-generated math proofs; ProofGrader is accurate and practically useful and may help advance generative models for mathematical reasoning. Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. 
Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
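The "78% of the gap" figure follows from the three averages quoted in the abstract. A minimal sketch of that arithmetic, assuming the gap is measured linearly between the binary-evaluator baseline and the human oracle (the function name `gap_closed` is illustrative, not from the paper):

```python
def gap_closed(score: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by `score`."""
    if oracle == baseline:
        raise ValueError("oracle and baseline must differ")
    return (score - baseline) / (oracle - baseline)


# Average best-of-16 scores (out of 7) as reported in the abstract.
fraction = gap_closed(score=4.14, baseline=2.48, oracle=4.62)
print(f"gap closed: {fraction:.1%}")
```

The ratio is 1.66 / 2.14, which rounds to the 78% quoted above.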

[35] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Fali Wang,Jihai Chen,Shuhua Yang,Ali Al-Lawati,Linli Tang,Hui Liu,Suhang Wang

Main category: cs.CL

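TL;DR: This paper systematically surveys research on collaboration between small language models (SLMs) and large language models (LLMs), proposes a taxonomy organized around performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness, and summarizes representative methods, design paradigms, and future challenges.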

Details

Motivation: LLMs face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns, whereas SLMs are lightweight, efficient, and adaptable; this motivates exploring complementary mechanisms for SLM-LLM collaboration. Method: Propose a four-category taxonomy based on collaboration objectives (performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness) and use it to organize existing SLM-LLM collaboration methods. Result: Summarizes representative methods and design paradigms for SLM-LLM collaboration and identifies open challenges in efficiency, security, and scalability. Conclusion: SLM-LLM collaboration is an important direction for efficient, secure, and scalable language model systems; future work needs to further optimize collaboration mechanisms to meet the multi-dimensional demands of real deployments. Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs' specialization and efficiency with LLMs' generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.

[36] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

Zhaoyang Shang,Sibo Wei,Jianbin Guo,Rui Zhou,Lifeng Dong,Yin Luo

Main category: cs.CL

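TL;DR: Proposes THTB, a cognitive science-inspired framework for instruction data selection and annotation guidance that combines quality filtering with intrinsic and extrinsic hardness scoring to prioritize higher-level cognitive instructions, markedly improving model performance and generalization while using only a small fraction of the data.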

Details

Motivation: Existing instruction data selection methods rely too heavily on LLMs' internal knowledge, are weakly interpretable, and generalize poorly, which makes efficient adaptation to specialized domains difficult. Method: Propose the THTB framework, which combines quality filtering with intrinsic and extrinsic hardness scoring to quantify instruction complexity from a cognitive science perspective and to guide data selection and annotation for supervised fine-tuning. Result: Models trained on 5% of the data outperform full-dataset training; in vertical domains, a model trained on 2% of the data surpasses models trained on much larger datasets; the selection process is also more interpretable and generalizes better. Conclusion: THTB offers quantifiable cognitive criteria for filtering instruction data, substantially cutting training cost while improving performance and domain adaptation, and shows broad application potential. Abstract: Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available at https://github.com/DYJG-research/THTB.

[37] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi

Main category: cs.CL

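TL;DR: Through a structured red-teaming challenge, this paper develops a hierarchical taxonomy of 50 jailbreak strategies grouped into seven families, analyzes the prevalence and success rates of different attack types, evaluates taxonomy-guided prompting for jailbreak detection, and releases a new Italian dataset of multi-turn adversarial dialogues.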

Details

Motivation: Existing defenses focus mostly on single-turn attacks, lack cross-lingual coverage, and rely on incomplete taxonomies that fail to capture the diversity of jailbreak techniques, so a more systematic approach is needed to understand how effective these techniques are and how they affect LLM safety. Method: Run a structured red-teaming challenge; build a hierarchical taxonomy of 50 strategies; analyze how attack success rates relate to model vulnerabilities; benchmark a popular LLM on jailbreak detection and use the taxonomy to guide prompt design for better automatic detection; collect and annotate an Italian dataset of multi-turn adversarial dialogues. Result: A taxonomy of 50 jailbreak strategies across seven families; evidence that specific strategies more readily exploit model vulnerabilities and induce misalignment; an evaluation of how taxonomy-guided prompting affects detection; and an annotated Italian dataset of 1364 multi-turn dialogues. Conclusion: A systematic taxonomy supports a deeper understanding of how jailbreak techniques work and what they affect, enabling more effective detection and defense; the new dataset is a valuable resource for studying gradually emerging adversarial intent. Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

[38] Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges

Misam Abbas

Main category: cs.CL

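TL;DR: This paper studies the challenge of authorship attribution in the era of large language models (LLMs), benchmarking two methods, fixed Style Embeddings and an LLM judge (GPT-4o), on a human-AI parallel corpus; the two perform complementarily across text types, indicating that attribution calls for hybrid strategies.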

Details

Motivation: As LLM-generated text grows ever closer to human writing, reliably distinguishing human from machine text matters more, and effective attribution mechanisms are needed. Method: Evaluate two methods, fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o), on a human-AI parallel corpus of 600 instances spanning six domains. Result: Style embeddings reach 82% accuracy on GPT continuations versus 68% for the LLM judge; the LLM judge is slightly better on LLaMA continuations (85% vs. 81%), but the difference is not statistically significant. The LLM judge performs better on fiction and academic prose, while style embeddings are stronger on spoken transcripts and scripted dialogue. Conclusion: Attribution is a multidimensional problem in which different methods have strengths in different domains; future work should develop hybrid strategies that combine semantic and structural features and adopt open, reproducible evaluation frameworks. Abstract: Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms, fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o), on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82% vs. 68%). The LLM Judge is slightly better than the Style Embeddings on LLaMA continuations (85% vs. 81%), but the results are not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.

[39] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder,Clément Dumas,Stewart Slocum,Helena Casademunt,Cameron Holmes,Robert West,Neel Nanda

Main category: cs.CL

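TL;DR: This paper studies the activation biases that narrow finetuning induces in LLMs, shows that they can be interpreted via model diffing to understand the finetuning domain, validates this with an LLM-based interpretability agent, and warns of the limitations of using narrowly finetuned models in AI safety and interpretability research.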

Details

Motivation: Narrow finetuning is widely used to adapt LLMs to specific tasks, but it can give models unusual properties; understanding its effects, especially on activation space, requires systematically analyzing the differences between a model before and after finetuning. Method: Use model diffing to compare activations of the base and finetuned models on the first few tokens of random text, and steer generation by adding that difference to the activations; also build an LLM-based interpretability agent and evaluate it with access to the activation bias. Result: Narrow finetuning introduces strong, identifiable biases in activations that reflect the format and content of the finetuning data; mixing pretraining data into finetuning largely removes them; the interpretability agent with access to the bias outperforms baseline agents. Conclusion: Narrow finetuning leaves salient traces of the training objective in the model, suggesting that training practices should be improved; using narrowly finetuned models as proxies for broader finetuning (e.g., chat-tuning) may be unrealistic, and the authors call for more realistic case studies for model diffing, safety, and interpretability research. Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain.
Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
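The activation-difference analysis described in this abstract can be illustrated with a toy sketch. It assumes that activations for one layer and one token position have already been extracted as plain lists of floats for the same texts under both models; the helper names and the `scale` parameter are hypothetical, and this is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def mean_activation_difference(base_acts: List[Vector],
                               tuned_acts: List[Vector]) -> Vector:
    """Average (finetuned - base) activation over texts, per hidden dimension."""
    if not base_acts or len(base_acts) != len(tuned_acts):
        raise ValueError("need the same non-zero number of texts for both models")
    dim = len(base_acts[0])
    total = [0.0] * dim
    for base, tuned in zip(base_acts, tuned_acts):
        for d in range(dim):
            total[d] += tuned[d] - base[d]
    return [t / len(base_acts) for t in total]


def steer(activation: Vector, direction: Vector, scale: float = 1.0) -> Vector:
    """Add the scaled difference vector to an activation (additive steering)."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy activations for two texts with hidden size 2 (made-up numbers).
base = [[0.0, 0.0], [1.0, 1.0]]
tuned = [[1.0, 2.0], [2.0, 3.0]]
direction = mean_activation_difference(base, tuned)
steered = steer([0.5, 0.5], direction, scale=2.0)
print(direction, steered)
```

In a real setting the vectors would come from hooks on a transformer's residual stream, and the steered activations would be written back during generation.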

[40] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen,John Le,Thai T. Vu,Willy Susilo,Heath Cooper

Main category: cs.CL

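TL;DR: Proposes RAID, a framework that generates adversarial suffixes through continuous embedding optimization and refusal-aware regularization, bypassing LLM safety mechanisms effectively while keeping the suffixes natural.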

Details

Motivation: Despite their strong performance, LLMs remain vulnerable to jailbreak attacks that bypass safety mechanisms, and systematic methods are needed to probe and understand these weaknesses. Method: Relax discrete tokens into continuous embeddings and optimize them with a joint objective that encourages restricted content, adds a refusal-aware regularizer to steer away from refusal directions, and maintains semantic coherence; then map embeddings back to tokens with critic-guided decoding. Result: Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than existing white-box and black-box baselines. Conclusion: Embedding-space regularization is important for understanding and mitigating LLM jailbreak vulnerabilities, and RAID provides an effective tool for evaluating and strengthening model safety. Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

[41] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

Nicole Smith-Vaniz,Harper Lyon,Lorraine Steigner,Ben Armstrong,Nicholas Mattei

Main category: cs.CL

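TL;DR: This paper examines whether LLM responses along the five dimensions of Moral Foundations Theory (MFT) show political and moral leanings, compares them directly with human data, and investigates whether LLMs exhibit ideological bias under different prompting conditions and how well they can represent other viewpoints.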

Details

Motivation: As LLMs are widely used as advice givers in key domains such as medicine, law, and personal relationships, their responses on politically and morally sensitive questions may be biased, so their moral leanings and agreement with human values need systematic evaluation. Method: Based on the five MFT dimensions (Harm, Fairness, Ingroup Loyalty, Authority, Purity), systematically analyze LLM outputs under three conditions (inherent responses, explicit political-stance prompts, and demographic persona role-play) and compare them with existing large-scale human study data. Result: LLM responses show clear ideological leanings, deviating from human averages on some MFT dimensions in particular; with explicit prompting and role-play, LLMs can simulate the views of different political stances fairly accurately, but their default outputs still lean toward a particular ideology. Conclusion: LLMs exhibit quantifiable political bias in moral judgments and their default responses are not neutral, yet with suitable prompting they can simulate diverse ideologies reasonably well; AI-generated content therefore depends on political and demographic framing, which deserves careful consideration in applications. Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLMs with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in the LLM responses, nor has any connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas.
We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

[42] Schema for In-Context Learning

Pan Chen,Shaohong Chen,Mark Wang,Shi Xuan Leong,Priscilla Fung,Varinia Bernales,Alan Aspuru-Guzik

Main category: cs.CL

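TL;DR: This paper proposes Schema Activated In-Context Learning (SA-ICL), which draws on schema theory from cognitive science to extract an abstract reasoning structure (schema) from examples and strengthen LLM reasoning. Experiments show that SA-ICL improves performance by up to 36.19% with a single high-quality example, reduces reliance on multiple demonstrations, and improves interpretability.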

Details

Motivation: Traditional in-context learning lacks an explicit mechanism for knowledge retrieval and transfer at the abstraction level; inspired by how humans use pre-existing mental frameworks (schemas) to understand new information, the authors aim to give language models a similar cognitive structure to improve reasoning. Method: Extract the key inferential steps and their relationships from demonstration examples to build a lightweight, structured abstract schema, and use it explicitly to augment the model's reasoning on new questions. Result: Experiments show that mainstream LLMs struggle to form and use schema-based representations implicitly, but explicit scaffolding through SA-ICL significantly improves performance on chemistry and physics questions from GPQA, by up to 36.19%, with only a single high-quality example, reducing reliance on multiple demonstrations. Conclusion: SA-ICL bridges a range of in-context learning strategies and offers a new path toward more human-like reasoning in LLMs. Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model's reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

[43] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Yuanchen Wu,Saurabh Verma,Justin Lee,Fangzhou Xiong,Poppy Zhang,Amel Awadelkarim,Xu Chen,Yubai Yuan,Shawndra Hill

Main category: cs.CL

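TL;DR: Proposes the Prompt Duel Optimizer (PDO), an efficient label-free prompt optimization framework that optimizes prompts from pairwise preference feedback given by an LLM judge and outperforms baselines on BBH and MS MARCO.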

Details

Motivation: LLMs are sensitive to their input prompts, which makes prompt design difficult; existing automatic optimization methods depend on high-quality labels, which are costly and slow to collect. Method: Formulate prompt optimization as a dueling-bandit problem in which an LLM judge provides pairwise comparison feedback; use Double Thompson Sampling (D-TS) to select informative comparisons and expand the candidate pool by mutating high-performing prompts. Result: Experiments on BIG-bench Hard and MS MARCO show that PDO outperforms baseline methods; ablations confirm that both D-TS and prompt mutation are effective; the method works without labels and can incorporate partial labels to reduce the impact of judge noise. Conclusion: PDO is an efficient and flexible prompt optimization framework that improves LLM performance in low-label and label-free settings and has practical application potential. Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where the supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
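The dueling-bandit formulation can be sketched in a few lines. This is a simplified Thompson-sampling duel selector, not the full Double Thompson Sampling algorithm (which also prunes candidates with confidence bounds) and not PDO itself. The LLM judge is replaced by a simulated Bradley-Terry judge over hypothetical hidden prompt qualities, and all function names are illustrative.

```python
import random


def sample_prefs(wins, rng):
    """Sample a pairwise win-probability matrix from Beta posteriors."""
    k = len(wins)
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            p = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[i][j], theta[j][i] = p, 1.0 - p
    return theta


def select_pair(wins, rng):
    """Pick a sampled Copeland winner, then the rival most likely to beat it."""
    k = len(wins)
    theta = sample_prefs(wins, rng)
    copeland = [sum(theta[i][j] > 0.5 for j in range(k) if j != i) for i in range(k)]
    first = max(range(k), key=lambda i: (copeland[i], rng.random()))
    rival = {j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1)
             for j in range(k) if j != first}
    second = max(rival, key=rival.get)
    return first, second


def run_duels(quality, rounds, seed=0):
    """Run duels; `quality` stands in for an LLM judge (assumption)."""
    rng = random.Random(seed)
    k = len(quality)
    wins = [[0] * k for _ in range(k)]
    for _ in range(rounds):
        a, b = select_pair(wins, rng)
        p_a = quality[a] / (quality[a] + quality[b])  # Bradley-Terry judge
        if rng.random() < p_a:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


def best_prompt(wins):
    """Index of the prompt with the highest empirical win rate."""
    k = len(wins)

    def rate(i):
        won = sum(wins[i])
        lost = sum(wins[j][i] for j in range(k))
        return won / (won + lost) if won + lost else 0.0

    return max(range(k), key=rate)


wins = run_duels(quality=[1.0, 2.0, 4.0], rounds=200, seed=0)
print(wins, best_prompt(wins))
```

Sampling from the posterior rather than using point estimates is what directs comparisons toward pairs whose outcome is still uncertain, which is the sense in which D-TS "prioritizes informative prompt comparisons".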

[44] Interpreting the Latent Structure of Operator Precedence in Language Models

Dharunish Yugeswardeenoo,Harshil Nukala,Cole Blondin,Sean O Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

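TL;DR: Investigates whether the LLaMA 3.2-3B model encodes operator precedence in its internal representations on arithmetic tasks, finding that intermediate results are present in the residual stream and that precedence is linearly encoded in the operator embeddings.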

Details

Motivation: Explore the internal structure through which LLMs handle arithmetic, in particular how operator precedence is represented. Method: Build a dataset of arithmetic expressions with three operands and two operators, and analyze the residual stream of LLaMA 3.2-3B with interpretability techniques such as logit lens, linear classification probes, and UMAP visualization. Result: Intermediate results are present in the residual stream, particularly after MLP blocks; the model linearly encodes precedence in each operator's embedding after the attention layer; a partial embedding swap technique modifies operator precedence by exchanging high-impact embedding dimensions. Conclusion: LLMs do encode operator precedence in their internal representations, and precedence can be altered by editing embeddings, which reveals part of the mechanism behind their arithmetic reasoning. Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving open the question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
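A dataset of the kind described (three operands, two operators, varied parenthesization) could be generated roughly as follows. The operator set, the operand values, and the record fields are assumptions for illustration; the abstract does not specify them. Each record also stores the intermediate result, i.e. the sub-expression that must be evaluated first, which is the quantity one would look for in the residual stream.

```python
import itertools
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
PRECEDENCE = {"+": 1, "-": 1, "*": 2}


def make_examples(operands=(2, 3, 4)):
    """Enumerate expressions over two operators and three parenthesizations."""
    a, b, c = operands
    examples = []
    for op1, op2 in itertools.product(OPS, repeat=2):
        for form in ("plain", "left", "right"):
            if form == "left":
                text, first = f"({a} {op1} {b}) {op2} {c}", "left"
            elif form == "right":
                text, first = f"{a} {op1} ({b} {op2} {c})", "right"
            else:
                text = f"{a} {op1} {b} {op2} {c}"
                # Without parentheses, the higher-precedence operator binds
                # first; ties resolve left to right.
                first = "right" if PRECEDENCE[op2] > PRECEDENCE[op1] else "left"
            if first == "left":
                intermediate = OPS[op1](a, b)
                result = OPS[op2](intermediate, c)
            else:
                intermediate = OPS[op2](b, c)
                result = OPS[op1](a, intermediate)
            examples.append({"text": text, "first": first,
                             "intermediate": intermediate, "result": result})
    return examples


examples = make_examples()
by_text = {e["text"]: e for e in examples}
print(len(examples), by_text["2 + 3 * 4"], by_text["(2 + 3) * 4"])
```

Labels such as `first` could then serve as targets for a linear probe trained on operator-token activations.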

[45] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo,Jiapu Wang,Gongqing Wu,Zhongyuan Wang,Jichen Zhang,Shirui Pan,Xindong Wu

Main category: cs.CL

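TL;DR: This paper proposes a Knowledge Reasoning Language Model (KRLM) that, through a Knowledge Reasoning Language (KRL) instruction format, a KRL tokenizer, a KRL attention layer, and a structure-aware next-entity predictor, coordinates LLM knowledge with knowledge graph (KG) context, mitigates LLM knowledge distortion and generative hallucination, and shows significant superiority on 25 real-world inductive KG reasoning datasets.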

Details

Motivation: Inductive knowledge graph reasoning (KGR) must cope with the uncertainty of unknown entities and relations in open domains; in existing LLM-based KGR methods, the LLM's intrinsic knowledge can be overshadowed by sparse KG context, which distorts that knowledge, and generative hallucinations are hard to constrain, which limits the credibility of reasoning results. Method: Propose KRLM: (1) design a KRL instruction format and KRL tokenizer to align LLM knowledge with KG representations; (2) introduce a KRL attention layer that coordinates intrinsic LLM knowledge with external KG context through a dynamic knowledge memory mechanism; (3) design a structure-aware next-entity predictor that strictly constrains reasoning results to a trustworthy knowledge domain. Result: On 25 real-world inductive KGR datasets, KRLM significantly outperforms existing methods in both zero-shot reasoning and fine-tuning scenarios, mitigating knowledge distortion and generative hallucination. Conclusion: By coordinating LLM knowledge and KG context in a unified way, KRLM improves the accuracy and credibility of inductive KG reasoning and offers a new, effective paradigm for combining LLMs with KGs. Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. 
Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source codes are available at https://anonymous.4open.science/r/KRLM-EA36.

[46] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Jingru Lin,Chen Zhang,Stephen Y. Liu,Haizhou Li

Main category: cs.CL

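TL;DR: This paper proposes RAGCap-Bench, a benchmark for fine-grained evaluation of intermediate tasks in agentic RAG systems, aimed at the reasoning needed for complex multi-hop questions, and shows that intermediate capabilities matter for end-to-end performance.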

Details

Motivation: Existing agentic RAG systems perform poorly on complex multi-hop questions, their internal reasoning capabilities remain underexplored, and targeted evaluation tools are lacking. Method: Analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities they require, build a taxonomy of typical errors, and design the RAGCap-Bench evaluation set accordingly. Result: Experiments show that models with stronger "slow-thinking" ability score higher on RAGCap and achieve better end-to-end results. Conclusion: RAGCap-Bench effectively evaluates and exposes the intermediate capabilities of agentic RAG systems, and strengthening these capabilities helps improve overall system performance. Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

[47] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

María Victoria Carro,Denise Alejandra Mester,Facundo Nieto,Oscar Agustín Stanchi,Guido Ernesto Bergman,Mario Alejandro Leiva,Eitan Sprejer,Luca Nicolás Forziati Gangi,Francisca Gauna Selasco,Juan Gustavo Corvalán,Gerardo I. Simari,María Vanina Martinez

Main category: cs.CL

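TL;DR: This study examines how LLMs debate subjective questions, finding that models tend to side with the judge persona rather than their own prior beliefs and that sequential debate is biased; positions aligned with the model's priors are more persuasive, yet arguments that go against those priors are rated as higher quality.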

Details

Motivation: Existing debate experiments mostly use datasets with objective ground truth and overlook that lying also involves the speaker's beliefs. This study asks whether language models adopt sycophantic strategies on subjective questions and how their prior beliefs affect debate performance. Method: Implement two debate protocols (sequential and simultaneous); have models select a preferred position and then debate before a judge persona designed to conflict with their priors; assess persuasiveness and argument quality. Measuring the models' prior beliefs and comparing performance across conditions reveals their behavioral tendencies. Result: Models tend to side with the judge persona rather than their prior beliefs; sequential debate significantly favors the second debater; models are more persuasive when defending positions aligned with their priors; yet arguments misaligned with priors are rated as higher quality in pairwise comparison. Conclusion: Language models show sycophantic tendencies and systematic biases in debate, with implications for building aligned AI systems and designing human-AI interaction, and suggesting that training signals should be improved to raise the quality of judgments. Abstract: The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison.
These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.

[48] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Shrey Pandit,Xuan-Phi Nguyen,Yifei Ming,Austin Xu,Jiayu Wang,Caiming Xiong,Shafiq Joty

Main category: cs.CL

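TL;DR: Proposes a two-pronged data synthesis pipeline that generates high-quality question-answer pairs by progressively increasing task complexity and evaluates them through distillation from strong baseline web agents; experiments show the resulting dataset is more diverse and yields better performance than existing datasets.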

Details

Motivation: Existing methods for generating complex online-task data for long-horizon reasoning lack fine-grained control over difficulty and quality, and they make it hard to separate the effect of the data from the effect of the training recipe. Method: Build a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a baseline agent fails, using the baseline agent to attempt questions, validate factuality, check for alternative answers, and filter; assess data effectiveness in a controlled training setup based on distillation from strong web agents. Result: Experiments show that the proposed dataset, although smaller, trains more effective web agents than existing datasets, doubles the diversity of tool-use actions, and avoids repetitive tool-calling behavior. Conclusion: The data synthesis method effectively improves web agents' long-horizon reasoning on complex online tasks, and the controlled training setup confirms the advantage of the data itself. Abstract: Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.

[49] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

Ivan Lee,Taylor Berg-Kirkpatrick

Main category: cs.CL

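TL;DR: This paper challenges the view that readability is what enables small language models (SLMs) to generate coherent text, finding that statistical simplicity (as measured by n-gram diversity) predicts learnability better than readability does.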

Details

Motivation: Question the trend of likening language model training to human cognitive development, and identify the factors that actually drive capability emergence in small models. Method: Construct synthetic datasets with matched structure but varied readability, and compare the learning efficiency and generation coherence of small language models across levels of text complexity. Result: Models trained on complex, adult-level language perform on par with, and in some respects better than, models trained on simplified language, and statistical simplicity measures such as n-gram diversity predict learnability better. Conclusion: Readability is not the key driver of coherent generation in small language models; unfounded analogies between model training and human development should be avoided in favor of more precise analysis of what actually supports capability emergence. Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
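One common way to quantify n-gram diversity is the ratio of distinct n-grams to total n-grams (often called distinct-n). The sketch below assumes that definition and whitespace tokenization; the paper's exact metric may differ.

```python
from typing import Sequence


def ngram_diversity(tokens: Sequence[str], n: int) -> float:
    """Ratio of distinct n-grams to total n-grams; lower means more repetitive."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    distinct = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(distinct) / total


tokens = "the cat sat on the mat the cat sat".split()
unigram = ngram_diversity(tokens, 1)  # 5 distinct words over 9 tokens
bigram = ngram_diversity(tokens, 2)   # 6 distinct bigrams over 8
print(unigram, bigram)
```

Under this definition a corpus with low n-gram diversity is statistically simpler regardless of how readable its vocabulary or syntax is, which is the distinction the abstract draws.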

[50] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Yuanhao Li,Keyuan Lai,Tianqi Wang,Qihao Liu,Jiawei Ma,Yuan-Chao Hu

Main category: cs.CL

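TL;DR: Proposes Element2Vec, which uses language models to produce global and attribute-highlighted embeddings of chemical elements from Wikipedia text, and designs a self-attention-based test-time training method to reduce prediction error, supporting AI-driven discovery in materials science.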

Details

Motivation: Accurate element property data is essential for materials design and manufacturing, but many properties are hard to measure directly, traditional methods struggle to model complex relationships, and existing AI methods suffer from hallucination and limited interpretability. Method: Parse text from Wikipedia pages and use language models to generate a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local) for each chemical element, and propose a self-attention-based test-time training method to reduce prediction error. Result: The approach effectively represents natural-language descriptions of chemical elements, eases the computational challenges caused by text distribution shift and data sparsity, and outperforms traditional regression on element property prediction. Conclusion: Element2Vec provides an effective vector representation of chemical elements, improving the accuracy and interpretability of property prediction in materials science and helping advance AI-driven materials discovery. Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges also arise because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention that clearly mitigates the prediction error of vanilla regression. We hope this work could pave the way for advancing AI-driven discovery in materials science.

[51] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Peng Kuang,Yanli Wang,Xiaoyu Han,Yaowenqi Liu,Kaidi Xu,Haohan Wang

Main category: cs.CL

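TL;DR: Proposes a new weighted aggregation strategy that calibrates the weighting functions between the LLM and the PRM to optimize test-time scaling, substantially improving efficiency and reducing compute.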

Details

Motivation: Simple majority voting outperforms standard PRM-based selection on some benchmarks, so PRM verification signals need to be used more effectively. Method: Develop a theoretical framework for optimally combining LLM and PRM signals, and propose efficient pre-computation methods to calibrate the weighting functions. Result: Across 5 LLMs and 7 PRMs, the method significantly improves TTS efficiency, surpassing vanilla weighted majority voting while using only 21.3% of the computation. Conclusion: A more intelligent aggregation strategy can improve performance more effectively than simply scaling test-time computation. Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
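
To make the idea of weighted aggregation concrete, here is a minimal sketch in which each sampled response contributes a vote whose weight depends on its PRM score. The `calibrated` weighting function is purely hypothetical (the abstract does not give the calibrated form); it is chosen only to show how a negative weight for low-scoring responses can overturn a plain majority.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple


def weighted_vote(samples: Iterable[Tuple[str, float]],
                  weight_fn: Callable[[float], float]) -> str:
    """Sum weight_fn(prm_score) per final answer and return the top answer."""
    totals = defaultdict(float)
    for answer, prm_score in samples:
        totals[answer] += weight_fn(prm_score)
    return max(totals, key=totals.get)


def majority(score: float) -> float:
    # Plain majority voting: every response counts equally, PRM ignored.
    return 1.0


def calibrated(score: float) -> float:
    # Hypothetical weighting: responses scored below 0.5 vote *against*
    # their own answer. Not the paper's calibrated function.
    return score - 0.5


# (final answer, PRM score) pairs for five sampled responses (toy values).
samples = [("A", 0.2), ("A", 0.3), ("A", 0.4), ("B", 0.9), ("B", 0.8)]

print(weighted_vote(samples, majority))    # picks the most frequent answer
print(weighted_vote(samples, calibrated))  # picks the answer the PRM trusts
```

With the toy scores above, majority voting selects "A" while the weighted variant selects "B", which is the kind of disagreement the weighting functions are meant to resolve.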

[52] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Ye Yuan,Mohammad Amin Shabani,Siqi Liu

Main category: cs.CL

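TL;DR: FACTS is a fast, accurate, and privacy-compliant table summarization method that generates SQL queries and Jinja2 templates offline, enabling reusable and efficient natural language summaries.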

Details

Motivation: Existing table-to-text models suffer from costly fine-tuning, difficulty with complex reasoning, token limits, inefficiency, and exposure of sensitive data, and they lack robustness and scalability. Method: Propose FACTS, an agentic workflow that generates offline templates consisting of SQL queries and Jinja2 templates and sends only table schemas to the LLM, enabling query-focused table summarization. Result: On multiple benchmarks, FACTS outperforms baseline methods in both summary quality and efficiency, offering high accuracy, fast responses, and privacy compliance. Conclusion: FACTS provides an efficient, scalable, and privacy-safe solution for query-focused table summarization in real-world settings. Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
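
The reuse property described above (one offline template, many tables with the same schema) can be illustrated with a small sketch. To stay within the Python standard library it uses `string.Template` as a stand-in for Jinja2, and the `sales` schema, the query, and the wording of the summary are all hypothetical; in FACTS the template would be produced by an LLM that sees only the schema.

```python
import sqlite3
from string import Template

# Hypothetical offline artifact, written once from the schema alone
# (no table rows are needed to author it).
OFFLINE_TEMPLATE = {
    "sql": ("SELECT region, SUM(amount) AS total FROM sales "
            "GROUP BY region ORDER BY total DESC LIMIT 1"),
    # string.Template stands in for a Jinja2 template here.
    "text": Template("The top region is $region with total sales of $total."),
}


def summarize(conn: sqlite3.Connection, template: dict) -> str:
    """Run the stored SQL locally and render the result into text."""
    cursor = conn.execute(template["sql"])
    columns = [d[0] for d in cursor.description]
    row = cursor.fetchone()
    return template["text"].substitute(dict(zip(columns, row)))


def make_table(rows) -> sqlite3.Connection:
    """Build an in-memory table with the assumed `sales` schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Two different tables sharing one schema reuse the same template.
q1 = make_table([("north", 10), ("south", 25), ("north", 5)])
q2 = make_table([("east", 40), ("west", 7)])
print(summarize(q1, OFFLINE_TEMPLATE))
print(summarize(q2, OFFLINE_TEMPLATE))
```

Because the query executes locally and only the schema is needed to write the template, the table contents never have to leave the machine, which is the privacy argument the abstract makes.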

[53] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Daniel Adu Worae,Spyridon Mastorakis

Main category: cs.CL

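TL;DR: Proposes an LLM-based AI agent framework that converts raw packets into structured, semantically enriched representations, enabling efficient and holistic interpretation of IoT network traffic.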

Details

Motivation: IoT networks generate large volumes of diverse traffic; traditional isolated detection struggles to identify threats, so combined analysis of cross-layer behavior and context is needed. Method: Combine feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering, with an LLM-driven AI agent reasoning over the indexed traffic. Result: Evaluated on multiple IoT captures and six open models, hybrid retrieval (combining lexical and semantic search) substantially improves BLEU, ROUGE, METEOR, and BERTScore over dense-only retrieval, with low system resource overhead. Conclusion: The framework generates accurate, human-readable traffic interpretations efficiently, enabling holistic, low-overhead analysis of IoT traffic. Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
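
The abstract does not say how the lexical and semantic result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an illustration only; it is not claimed to be the authors' fusion or reranking method, and the document identifiers are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a lexical index and a dense index.
lexical = ["flow_17", "pkt_203", "flow_02"]
dense = ["pkt_203", "flow_44", "flow_17"]

print(reciprocal_rank_fusion([lexical, dense]))
```

Documents that appear in both lists rise to the top, which is the intuition behind combining exact-match signals (protocol names, addresses) with semantic similarity.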

[54] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

Congying Liu,Xingyuan Wei,Peipei Liu,Yiqing Shen,Yanxu Mao,Tiehan Cui

Main category: cs.CL

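TL;DR: Proposes BioMedSearch, an LLM-based multi-source biomedical information retrieval framework that integrates literature, protein databases, and web search to answer complex biomedical questions more accurately, and builds BioMedMCQs, a multi-level dataset of 3,000 questions, for evaluation; experiments show substantial gains over baseline models at every reasoning level.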

Details

Motivation: When generating biomedical content, LLMs often produce inaccurate or even fabricated information because they cannot access authoritative databases, and the output lacks scientific rigor; a framework is needed that integrates multiple authoritative sources to support precise retrieval and reasoning. Method: Propose BioMedSearch, which combines literature retrieval, protein databases, and web search, and answers complex biomedical questions through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering. Result: On the self-built BioMedMCQs dataset, BioMedSearch substantially improves accuracy at all three reasoning levels: from 59.1% to 91.9% at Level 1, from 47.0% to 81.0% at Level 2, and from 36.3% to 73.4% at Level 3. Conclusion: By effectively fusing multi-source biomedical information, BioMedSearch markedly improves the accuracy and reliability of LLMs on complex biomedical question answering and offers a workable approach for future biomedical QA systems. Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-query decomposition, keyword extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. 
Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search

[55] LLMs Can Get "Brain Rot"!

Shuo Xing,Junyuan Hong,Yifan Wang,Runjin Chen,Zhenyu Zhang,Ananth Grama,Zhengzhong Tu,Zhangyang Wang

Main category: cs.CL

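TL;DR: The paper proposes and tests the "LLM Brain Rot Hypothesis": continual exposure to low-quality web text causes marked cognitive decline in large language models. In controlled experiments on real Twitter/X corpora, the authors build junk and control datasets under two operationalizations (engagement degree and semantic quality) and test four LLMs under identical training conditions. Continual pre-training on junk data significantly degrades reasoning, long-context understanding, and safety, and inflates "dark traits". Mixing-ratio experiments show a dose-response relationship, with capability declining progressively as the junk ratio rises. Error analysis identifies "thought-skipping" as the main cause: models skip or truncate reasoning chains. Instruction tuning or retraining on clean data cannot fully restore the original capability, pointing to persistent representational drift. The study argues that data quality is a key driver of LLM capability decay, that data curation should be treated as a training-time safety problem, and that deployed models should receive routine "cognitive health checks".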

Details

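Motivation: As LLMs keep learning from vast amounts of internet text, uneven data quality may harm their cognitive abilities, yet causal evidence that low-quality data actually degrades model capability has been lacking. The paper therefore uses controlled experiments to establish the causal effect of data quality on LLM cognition, expose the underlying mechanism, and argue for treating training-data curation as part of safety practice. Method: The authors design two orthogonal operationalizations of data quality: M1, based on tweet engagement (e.g., likes and retweets), and M2, based on a semantic quality score. From real Twitter/X corpora they construct junk and control datasets matched in token count and training steps. They run continual pre-training on four LLMs and compare performance on benchmarks such as ARC-Challenge and RULER-CWE across data conditions, vary the junk mixing ratio to observe the dose-response relationship, use error analysis to identify the degradation mechanism, and test how well instruction tuning and clean-data retraining repair it. Result: Continual training on junk data causes significant degradation in reasoning, long-context understanding, safety, and personality traits (Hedges' g > 0.3). Performance falls in a dose-dependent way as the junk ratio increases; for example, ARC-Challenge CoT accuracy drops from 74.9 to 57.2 and RULER-CWE from 84.4 to 52.3. Error analysis finds "thought-skipping" to be the main cause: models frequently skip the reasoning process. Scaling instruction tuning or retraining on clean data partly mitigates the decline but does not restore the original level, indicating persistent representational drift. In addition, a tweet's popularity predicts the brain rot effect better than its length. Conclusion: Data quality is a key causal factor in LLM cognitive decline. Continual exposure to low-quality web content produces lasting cognitive damage that is not fully reversible, seen as weaker reasoning and weaker safety. This should be treated as a training-time safety risk, calling for data filtering and routine "cognitive health checks" to keep models reliable. Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' $g>0.3$) in reasoning, long-context understanding, and safety, and inflates "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain of Thought drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0\%$ to $100\%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. 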
Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that a tweet's popularity, a non-semantic metric, is a better indicator of the Brain Rot effect than its length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a training-time safety problem and motivating routine "cognitive health checks" for deployed LLMs.
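
The effect-size threshold quoted in this entry (Hedges' g > 0.3) can be made concrete with a short calculation. The sketch below uses the standard definition: Cohen's d computed with the pooled sample standard deviation, multiplied by the small-sample correction J = 1 - 3 / (4(n1 + n2) - 9). The two score lists are made-up toy values, not results from the paper.

```python
import math
from statistics import mean, variance
from typing import Sequence


def hedges_g(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference (a minus b) with small-sample correction."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled standard deviation from the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    cohens_d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return cohens_d * correction


# Toy benchmark scores for a control run and a degraded run (illustrative).
control = [10, 12, 11, 9, 13]
degraded = [8, 7, 10, 6, 9]

print(round(hedges_g(control, degraded), 3))
```

A positive value here means the control group scored higher; the correction factor matters mostly when the groups are as small as in this example.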

[56] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

Siying Liu,Shisheng Zhang,Indu Bala

Main category: cs.CL

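TL;DR: The study examines bias in large language models (LLMs) across socio-demographic groups in drug-safety prediction, finding systematic disparities in adverse event prediction, with disadvantaged groups tending to receive higher predicted risk, and identifying two modes of bias, explicit and implicit.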

Details

Motivation: Although socio-demographic attributes are clinically irrelevant to adverse drug reactions, it is unclear whether LLMs factor them into predictions, which could affect their fairness and reliability in biomedicine. Method: Using structured data from the US FDA Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, evaluate two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across combinations of attributes such as education, marital status, insurance, and housing, and across three user roles (general practitioner, specialist, patient). Result: The models predict markedly higher adverse event rates for disadvantaged groups (e.g., low education, unstable housing) than for privileged groups, and two modes of bias are identified: explicit bias (reasoning that directly cites persona attributes) and implicit bias (inconsistent predictions without explicit mention of the attributes). Conclusion: LLMs pose serious fairness risks in pharmacovigilance applications; fairness-aware evaluation and mitigation strategies are needed to ensure safe and equitable clinical deployment. Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. 
These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.

[57] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Kenan Alkiek,David Jurgens,Vinod Vydiswaran

Main category: cs.CL

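TL;DR: Proposes instruction intervention at inference time, in which small language models (SLMs) retrieve structured reasoning steps, substantially improving multi-step reasoning without any fine-tuning.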

Details

Motivation: Small language models run efficiently on local hardware and offer privacy, cost, and environmental benefits, but they perform poorly on multi-step reasoning and domain-knowledge tasks, so their reasoning ability needs a boost. Method: Build an instruction corpus by grouping similar training questions and generating structured reasoning instructions with GPT-5; at inference time, the SLM retrieves the most relevant instructions and follows their steps. Result: Gains of 9.4%, 7.9%, and 5.1% on MedQA, MMLU Law, and MathQA respectively; concise instructions outperform verbose ones, and the effect depends on model family and intrinsic reasoning ability. Conclusion: Instruction retrieval gives small language models effective structured reasoning support, improving complex-task performance without fine-tuning and showing practical potential. Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
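
The retrieve-then-follow loop described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the two-entry corpus, its wording, and the use of token-level Jaccard overlap as the retriever (the abstract does not specify how relevance is scored).

```python
from typing import Dict, List

# Hypothetical instruction corpus: one entry per group of similar questions.
INSTRUCTION_CORPUS: List[Dict] = [
    {"topic": "dosage calculation for a drug given patient weight",
     "steps": ["Identify the dose per kilogram.",
               "Multiply by patient weight.",
               "Check the maximum dose."]},
    {"topic": "solve a percentage word problem",
     "steps": ["Identify the base quantity.",
               "Convert the percentage to a fraction.",
               "Multiply and state units."]},
]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a simple stand-in for a real retriever."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_instruction(question: str) -> Dict:
    """Return the corpus entry whose topic best matches the question."""
    return max(INSTRUCTION_CORPUS, key=lambda e: jaccard(question, e["topic"]))


def build_prompt(question: str) -> str:
    """Prepend the retrieved steps so the model follows them in order."""
    entry = retrieve_instruction(question)
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(entry["steps"], 1))
    return f"Follow these steps:\n{steps}\n\nQuestion: {question}"


question = "What percentage of 80 is 20 in this word problem?"
print(build_prompt(question))
```

The key contrast with retrieval-augmented generation is visible in the data structure: what is retrieved is an ordered procedure, not a passage of facts.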

[58] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Fengbin Zhu,Xiang Yao Ng,Ziyang Liu,Chang Liu,Xianwei Zeng,Chao Wang,Tianhui Tan,Xuan Yao,Pengyang Shao,Min Xu,Zixuan Wang,Jing Wang,Xin Lin,Junfeng Li,Jingxian Zhu,Yang Zhang,Wenjie Wang,Fuli Feng,Richang Hong,Huanbo Luan,Ke-Wei Huang,Tat-Seng Chua

Main category: cs.CL

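TL;DR: Proposes HisRubric, an evaluation framework, and builds the FinDeepResearch benchmark to systematically evaluate deep research agents on multilingual, multi-market corporate financial analysis.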

Details

Motivation: Existing literature lacks a systematic evaluation of deep research agents in critical research analysis, so a rigorous framework is needed to measure their performance on complex financial analysis tasks. Method: Propose HisRubric, an evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric that mirrors a professional analyst's workflow; on this basis, build the FinDeepResearch benchmark covering 64 listed companies, 8 financial markets, and 4 languages, and run extensive experiments on 16 representative methods. Result: The experiments reveal how different methods (DR agents, LLMs with search, and LLMs with deep reasoning only) differ across capabilities and financial markets, identifying their respective strengths and limitations. Conclusion: HisRubric and FinDeepResearch provide reliable tools for evaluating deep research agents; the experiments show that current methods still have room to improve, and the results should inform future work. Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR agents' capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents' capabilities in corporate financial analysis. This framework mirrors the professional analyst's workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.

[59] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty,Jane C. Ginsburg,Paramveer Dhillon

Main category: cs.CL

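TL;DR: The study compares human writers with three frontier AI models (ChatGPT, Claude, Gemini) at emulating the styles of 50 award-winning authors. AI text produced by in-context prompting was strongly disfavored by experts for both stylistic fidelity and writing quality, but after author-specific fine-tuning the AI text was preferred by experts and lay readers alike, was very hard to identify as AI-generated, and cost far less, providing empirical evidence relevant to the fourth fair-use factor in copyright law.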

Details

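Motivation: Examine whether AI can produce high-quality writing that emulates the styles of copyrighted authors, and assess the effect on the potential market value of the copyrighted works, in response to current legal disputes over training AI on copyrighted books. Method: Run a preregistered study in which MFA-trained professional writers and three AI models (ChatGPT, Claude, Gemini) each write excerpts of up to 450 words emulating 50 award-winning authors; use blind pairwise evaluation by 159 expert and lay readers; compare in-context prompting with fine-tuning on each author's complete works, and use AI detectors and mediation analysis to explain the differences. Result: AI text from in-context prompting was strongly disfavored by experts for stylistic fidelity (OR=0.16) and writing quality (OR=0.13) and was readily flagged by AI detectors (97%); after author-specific fine-tuning, AI text was favored by experts for stylistic fidelity (OR=8.16) and writing quality (OR=1.87), and only 3% was flagged as AI-generated; the effects hold across authors and styles, and the median fine-tuning cost was only $81, a 99.7% reduction relative to human writer compensation. Conclusion: Author-specific fine-tuning lets AI produce non-verbatim text that readers prefer to expert human writing in both style and quality and that is hard to detect, with a significant effect on the potential market value of the source works, providing key empirical evidence for the fourth fair-use factor in copyright law. Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear whether these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. 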
While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.

[60] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Zhen Yang,Mingyang Zhang,Feng Chen,Ganggui Ding,Liang Hou,Xin Tao,Pengfei Wan,Ying-Cong Chen

Main category: cs.CL

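TL;DR: Proposes Minimal Test-Time Intervention (MTI), a training-free framework that intervenes selectively only at high-uncertainty positions, substantially improving the accuracy and stability of LLM reasoning while remaining efficient.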

Details

Motivation: Uncertainty during reasoning is concentrated in a small number of high-entropy tokens, while existing methods tend to trade efficiency for reasoning gains, so a more efficient test-time intervention strategy is needed. Method: Propose the MTI framework, which has two parts: (i) selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. Result: MTI yields consistent gains on general, coding, and STEM tasks, for example a +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 for Qwen3-32B-Reasoning. Conclusion: By keeping test-time intervention minimal, MTI improves LLM reasoning while staying highly efficient, confirming the value of intervening only where uncertainty is localized. Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
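
The selective part of MTI can be illustrated with a minimal sketch: compute the entropy of the conditional next-token distribution and apply classifier-free guidance only when it exceeds a threshold. The guidance formula `uncond + scale * (cond - uncond)` is the usual CFG combination on logits; the `scale` and `threshold` values are hypothetical, and the KV-cache reuse for the negative-prompt branch is not modeled here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_cfg(cond_logits: Sequence[float],
                  uncond_logits: Sequence[float],
                  scale: float = 1.5,
                  threshold: float = 1.0) -> List[float]:
    """Apply classifier-free guidance only at high-entropy positions."""
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits)  # confident token: leave logits untouched
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# A confident position (one dominant logit) and an uncertain one.
confident = [10.0, 0.0, 0.0, 0.0]
uncertain = [1.0, 0.9, 0.8, 0.7]
uncond = [0.0, 0.9, 0.8, 0.7]

print(selective_cfg(confident, uncond))  # unchanged
print(selective_cfg(uncertain, uncond))  # first logit pushed up by guidance
```

Because most positions fall under the threshold, the extra unconditional pass is only needed at a small fraction of decoding steps, which is where the efficiency claim comes from.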

[61] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

Kin Kwan Leung,Mouloud Belbahri,Yi Sui,Alex Labach,Xueying Zhang,Stephen Rose,Jesse C. Cresswell

Main category: cs.CL

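TL;DR: Proposes a taxonomy of error types in real-world retrieval-augmented generation (RAG) systems, with examples of and remedies for each type, builds a dataset annotated by error type, and proposes an auto-evaluation method aligned with the taxonomy for tracking and addressing errors during development.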

Details

Motivation: Real-world RAG systems are complex and can produce erroneous outputs for many reasons; understanding the range of errors that occur in practice is essential for robust deployment. Method: Propose a new taxonomy of RAG error types, build a dataset of RAG responses annotated with error types, and design an auto-evaluation method aligned with the taxonomy. Result: A detailed RAG error taxonomy, a released dataset of annotated errors, and a practical auto-evaluation tool for identifying and tracking errors in RAG systems. Conclusion: The work provides a systematic framework and practical tools for understanding and mitigating errors in RAG systems, helping improve their reliability and maintainability in practice. Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.

[62] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Lukas Gienapp,Christopher Schröder,Stefan Schweter,Christopher Akiki,Ferdinand Schlatt,Arden Zimmermann,Phillipe Genêt,Martin Potthast

Main category: cs.CL

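TL;DR: Introduces German Commons, the largest collection of openly licensed German text to date, drawing on 41 data sources across seven domains for a total of 154.56 billion tokens, to address the lack of properly licensed corpora for training German language models.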

Details

Motivation: Training large language models depends on large amounts of text, but most datasets have unclear licensing, and for non-English languages openly licensed text is especially scarce; a high-quality, legally usable open German corpus is therefore urgently needed. Method: Systematically collect German text from 41 established data sources with verifiable licensing, covering seven domains (legal, scientific, cultural, political, news, economic, and web text), and apply comprehensive quality filtering, deduplication, and text formatting fixes to ensure consistent, high-quality data. Result: The German Commons corpus contains 154.56 billion tokens; all subsets carry licenses of at least CC-BY-SA 4.0 or equivalent, permitting model training and redistribution; corpus construction and filtering code tailored to German text is also released, making the project fully reproducible and extensible. Conclusion: German Commons fills a critical gap in openly licensed German pretraining data and provides a legal, high-quality data foundation for truly open German language models. Abstract: Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.

[63] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Shehenaz Hossain,Haithem Afli

Main category: cs.CL

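TL;DR: Proposes CRaFT, an explanation-based multilingual evaluation framework for assessing how large language models reason across cultural contexts; the results indicate that cultural awareness is not intrinsic to the models but emerges through linguistic framing.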

Details

Motivation: Correct answers do not necessarily reflect cultural understanding; existing evaluations rely too heavily on accuracy and do not examine cross-cultural reasoning in depth. Method: Propose the CRaFT framework, which uses four interpretable metrics (Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation) to evaluate the quality of model explanations in Arabic, Bengali, and Spanish on 50 culturally grounded questions from the World Values Survey. Result: Arabic reduces cultural fluency, Bengali enhances it, and Spanish remains stable; GPT adapts better across languages but is less consistent, while FANAR is stable but more rigid in its reasoning. Conclusion: Cultural awareness in LLMs is not intrinsic but is shaped by linguistic framing; CRaFT offers a new lens and a practical tool for evaluating and building culturally adaptive models. Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

[64] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

César Guerra-Solano,Zhuochun Li,Xiang Lorraine Li

Main category: cs.CL

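TL;DR: Proposes GlobalGroup, a cross-lingual abstract reasoning task inspired by the New York Times Connections game, to evaluate how the reasoning of large language models varies by language. Models perform better in English modalities, and the study also reveals performance differences between open- and closed-source models.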

Details

Motivation: Prior work mostly studies reasoning tasks that depend on knowledge or strategy (such as commonsense or math) and pays little attention to linguistic bias in abstract reasoning, which requires "out-of-the-box thinking" rather than formulaic approaches. To fill this gap, the authors examine how LLMs differ in cross-lingual abstract reasoning. Method: Design GlobalGroup, a cross-lingual abstract reasoning task based on the mechanics of the New York Times Connections game; build a dataset covering five languages (English, Spanish, Chinese, Hindi, and Arabic) together with English translations; and propose game difficulty measurements so that comparisons are made at controlled difficulty. Result: LLMs generally perform better on abstract reasoning in English modalities than in other languages, and closed-source models outperform open-source models overall, with a marked performance gap. Conclusion: LLMs show clear linguistic bias in abstract reasoning, with English performing best; model provenance (open or closed source) also affects performance, suggesting that future work should aim for more balanced multilingual abstract reasoning. Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.

[65] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

George Flint,Kaustubh Kislay

Main category: cs.CL

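TL;DR: The study takes a distributional approach to quantifying phonosemantic iconicity at scale in six diverse languages, discovering new interpretable phonosemantic alignments and crosslinguistic patterns, and finding support for some previously hypothesized alignments and mixed results for others.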

Details

Motivation: Explore how far systematic relationships between phonetics and semantics show up in large-scale quantitative investigation, for both previously identified and unidentified phenomena. Method: Use statistical measures to analyze the alignment between the phonetic and semantic similarity spaces of morphemes in six languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). Result: The study finds an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns; of five previously hypothesized alignments, some are supported and others show mixed results. Conclusion: Systematic relationships between phonetics and semantics exist, and they are to some degree general and detectable across multiple languages. Abstract: Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations--both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.

[66] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Haziq Mohammad Khalid,Athikash Jeyaganthan,Timothy Do,Yicheng Fu,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

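TL;DR: ERGO is an entropy-based dynamic prompt consolidation method. It monitors an LLM's uncertainty in multi-turn conversations (measured by Shannon entropy) and triggers a context reset and consolidation when entropy spikes, substantially improving performance, accuracy, and reliability.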

Details

Motivation: LLM performance degrades significantly in multi-turn conversations where information is revealed incrementally, which hurts real-world usability. This paper addresses the problem by using model uncertainty as a signal for dynamically adjusting the conversational context. Method: ERGO continuously quantifies the model's internal uncertainty by computing the Shannon entropy of the next-token probability distribution, and triggers adaptive prompt consolidation to realign the conversational context when a sharp rise in entropy is detected. Result: On multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over baselines, raises aptitude (peak performance) by 24.7%, and lowers unreliability by 35.3%. Conclusion: Treating uncertainty as a first-class signal rather than noise improves the accuracy and reliability of conversational AI, and ERGO offers an effective mechanism for context management in multi-turn interaction. Abstract: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real-world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first-class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty-aware interventions can improve both accuracy and reliability in conversational AI.
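
The entropy signal described in the ERGO abstract can be illustrated with a minimal sketch. It computes the Shannon entropy of a next-token distribution and flags a spike between two turns. The function names, the toy distributions, and the 0.5-nat threshold are assumptions made for illustration; the abstract does not specify how ERGO sets its trigger.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_consolidate(prev_entropy, curr_entropy, threshold=0.5):
    """Return True when entropy rises by more than `threshold` between turns.

    The threshold is a hypothetical value chosen for this example.
    """
    return (curr_entropy - prev_entropy) > threshold

# Toy distributions over a four-token vocabulary (illustrative only).
confident = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
uncertain = [0.25, 0.25, 0.25, 0.25]   # model is maximally unsure

h_prev = shannon_entropy(confident)
h_curr = shannon_entropy(uncertain)
trigger = should_consolidate(h_prev, h_curr)
```

In this toy case the uniform distribution has entropy ln 4, so the jump from the confident turn exceeds the threshold and a consolidation step would be triggered.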

[67] DROID: Dual Representation for Out-of-Scope Intent Detection

Wael Rashwan,Hossam M. Zawbaa,Sourav Dutta,Haytham Assem

Main category: cs.CL

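TL;DR: This paper proposes DROID, a compact end-to-end framework that combines a universal sentence encoder with a domain-adapted Transformer-based denoising autoencoder to detect in-domain and out-of-scope intents effectively.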

Details

Motivation: Existing out-of-scope intent detection methods usually depend on strong distributional assumptions or auxiliary calibration modules, and struggle to separate in-domain from out-of-scope intents in low-resource settings. Method: DROID uses a dual-encoder design (USE and TSDAE) that fuses broad semantic generalization with domain-specific contextual features, classifies with a lightweight branched classifier and a single calibrated threshold, and adds synthetic and open-domain outlier augmentation to improve boundary learning. Result: DROID outperforms existing methods across multiple benchmarks, with macro-F1 gains of 6-15% on known intents and 8-20% on out-of-scope intents, and the largest gains in low-resource settings. Conclusion: Dual-encoder representations with simple calibration can deliver robust, scalable, and reliable out-of-scope intent detection for neural dialogue systems. Abstract: Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders -- the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6--15% for known and 8--20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.

[68] Toward Cybersecurity-Expert Small Language Models

Matan Levi,Daniel Ohayon,Ariel Blobstein,Ravid Sagi,Ian Molloy,Yair Allouche

Main category: cs.CL

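TL;DR: CyberPal 2.0 is a family of cybersecurity small language models (4B-20B parameters). Trained on a high-quality, task-grounded chain-of-thought instruction dataset, the models match or surpass mainstream large models on many cybersecurity tasks while remaining much smaller.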

Details

Motivation: LLM adoption in cybersecurity lags because high-quality, domain-specific models and training data are lacking, which motivates efficient small language models built specifically for the field. Method: The authors present the CyberPal 2.0 model family, trained on a chain-of-thought cybersecurity instruction dataset produced by the SecKnowledge 2.0 data enrichment and formatting pipeline, which combines expert-guided reasoning formats with multi-step grounding. Result: Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines. On cyber threat intelligence tasks it ranks second only to Sec-Gemini v1; on threat-investigation tasks, the 20B model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1 and ranks first, while the 4B model ranks second. Conclusion: CyberPal 2.0 achieves strong cybersecurity task performance at a small parameter scale, showing the potential of small specialized models in this domain. Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.

[69] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

Darko Sasanski,Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov

Main category: cs.CL

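TL;DR: This paper presents the first systematic effort to build a Macedonian recipe dataset. Using web scraping and structured parsing, it handles heterogeneous ingredient descriptions and analyzes distinctive ingredient combinations in Macedonian cuisine with measures such as Pointwise Mutual Information and Lift score.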

Details

Motivation: Macedonian recipes are under-represented in digital research, and no high-quality dataset exists to support computational gastronomy studies of this regional food culture. Method: Macedonian recipes are collected by web scraping and parsed into structured form, with ingredient units, quantities, and descriptors normalized; ingredient frequency and co-occurrence patterns are analyzed with Pointwise Mutual Information (PMI) and Lift score. Result: The authors build the first Macedonian recipe dataset and identify distinctive ingredient combinations that characterize Macedonian cuisine, such as frequent co-occurrence of particular dairy products and grains. Conclusion: The dataset fills a gap in food culture research for low-resource languages and offers a new resource for studying culinary traditions in regions with under-represented languages. Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in under-represented languages and offers insights into the unique patterns of Macedonian culinary tradition.
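
The two co-occurrence measures named in the abstract can be sketched as follows, using their standard definitions: Lift(a, b) = P(a, b) / (P(a) P(b)) and PMI(a, b) = log2 of the same ratio, with probabilities estimated as the fraction of recipes containing the ingredient or pair. The function name and the four toy recipes are assumptions for illustration and are not taken from the dataset.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(recipes):
    """Compute Lift and PMI for every ingredient pair that co-occurs at least once."""
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for ingredients in recipes:
        items = sorted(set(ingredients))      # de-duplicate within a recipe
        single.update(items)
        pair.update(combinations(items, 2))   # unordered pairs, alphabetical order
    scores = {}
    for (a, b), n_ab in pair.items():
        lift = (n_ab / n) / ((single[a] / n) * (single[b] / n))
        scores[(a, b)] = {"lift": lift, "pmi": math.log2(lift)}
    return scores

# Toy recipes (made up for this example).
recipes = [
    ["feta", "pepper", "tomato"],
    ["feta", "pepper"],
    ["flour", "yeast"],
    ["flour", "tomato"],
]
scores = cooccurrence_scores(recipes)
```

A Lift above 1 (PMI above 0) means the pair appears together more often than independence would predict; in the toy data, feta and pepper always appear together, while flour and tomato co-occur at exactly the chance rate.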

[70] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Zhichao Wang,Andy Wong,Ruslan Belkin

Main category: cs.CL

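TL;DR: The paper proposes RLSR, a reinforcement learning method that replaces conventional supervised fine-tuning (SFT). It uses the cosine similarity between generated and human-labeled responses in a semantic embedding space as the reward signal, and markedly improves LLM instruction following.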

Details

Motivation: To overcome the limitations of existing instruction-tuning methods such as SFT by making full use of large SFT datasets within an RL framework inspired by RFT, improving instruction following, reasoning, and low-resource domain adaptation. Method: RLSR fine-tunes the base model in an RL framework, using as reward the cosine similarity in semantic embedding space between generated and human-labeled responses; it can be used alone or together with SFT. Result: On Qwen-7B (INFINITY), RLSR (SB) reaches an AlpacaEval win rate of 26.34%, above SFT's 21.01%; SFT + RLSR raises this further to 30.73%. Conclusion: RLSR uses SFT datasets more effectively, markedly improves instruction following within an RL framework, and can be combined with SFT for better performance. Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model's instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks; for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT's 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
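
The reward described in the RLSR abstract (cosine similarity between a generated response and the human-labeled response in embedding space) can be sketched as below. The embedding vectors are hand-written toy values and the function names are hypothetical; a real setup would obtain the vectors from a sentence embedding model, which the abstract does not name.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_rewards(candidate_embeddings, reference_embedding):
    """One reward per sampled response: similarity to the human-labeled reference."""
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]

# Toy 3-dimensional embeddings (illustrative only).
reference = [1.0, 0.0, 1.0]
candidates = [
    [2.0, 0.0, 2.0],   # same direction as the reference
    [0.0, 1.0, 0.0],   # orthogonal to the reference
    [1.0, 1.0, 0.0],   # partially aligned
]
rewards = similarity_rewards(candidates, reference)
```

Because cosine similarity ignores vector length, the first candidate scores highest even though its norm differs from the reference; the rewards would then feed whatever policy-gradient update the RL framework uses.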

Bingsheng Yao,Bo Sun,Yuanzhe Dong,Yuxuan Lu,Dakuo Wang

Main category: cs.CL

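TL;DR: The paper proposes the Dynamic Persona Refinement Framework (DPRF), which improves the behavioral alignment of LLM role-playing agents by iteratively identifying and correcting cognitive divergence between generated behavior and real human behavior.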

Details

Motivation: Existing LLM role-playing agents rely on manually created persona profiles, which undermines persona fidelity because alignment with the target individual's behavior is never validated. Method: The Dynamic Persona Refinement Framework (DPRF) iteratively identifies cognitive divergence between generated behavior and real human behavior, through either free-form or theory-grounded structured analysis, and refines the persona profile to reduce that divergence. Result: DPRF is validated with five LLMs on four behavior-prediction scenarios (formal debates, social media posts related to mental health, public interviews, and movie reviews), and improves behavioral alignment considerably across models and scenarios. Conclusion: DPRF provides a robust method for building high-fidelity persona profiles and strengthens the validity of downstream applications such as user simulation, social science research, and personalized AI. Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

[72] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Beomseok Kang,Jiwon Song,Jae-Joon Kim

Main category: cs.CL

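TL;DR: LiteStage is a latency-aware layer skipping framework for multi-stage reasoning. It combines an offline search with an online confidence-based early exit to speed up inference substantially while keeping accuracy high.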

Details

Motivation: Existing adaptive acceleration techniques struggle to balance efficiency and accuracy in multi-stage reasoning because skip sensitivity varies between stages and redundant output tokens are generated. Method: LiteStage combines a stage-wise offline search over layer budgets with an online confidence-based generation early exit that cuts redundant computation. Result: On three benchmarks (OBQA, CSQA, and StrategyQA), LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss. Conclusion: LiteStage improves the efficiency of small language models in multi-stage reasoning and outperforms prior training-free layer skipping methods. Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, namely OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

[73] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Parsa Hejabi,Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani

Main category: cs.CL

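TL;DR: This paper proposes Flip-Flop Consistency (F²C), an unsupervised training method that uses a consensus cross-entropy and a representation alignment loss to improve the robustness, consistency, and generalization of LLMs under prompt perturbations.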

Details

Motivation: LLMs often give inconsistent answers to differently phrased versions of the same prompt, which hurts reliability, so robustness to prompt perturbations needs to improve. Method: F²C has two parts: (1) Consensus Cross-Entropy (CCE), which builds a hard pseudo-label from a majority vote across prompt variations; and (2) a representation alignment loss that pulls lower-confidence and non-majority predictions toward the consensus formed by high-confidence, majority-voting variations. Result: Across 11 datasets spanning four NLP tasks, F²C raises agreement by 11.62% on average, improves F₁ by 8.94%, and reduces performance variance across formats by 3.29%; it also generalizes better and more stably in out-of-domain and held-out prompt format evaluations. Conclusion: F²C is an effective unsupervised method that markedly improves LLM consistency, performance, and generalization under prompt perturbations. Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.
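
The first component, a hard pseudo-label from a majority vote followed by a cross-entropy against it, can be sketched as below. This is only one plausible reading of the abstract: the averaging over variations, the tie-breaking rule, and the names are assumptions, and the representation alignment loss is not covered.

```python
import math
from collections import Counter

def consensus_cross_entropy(variant_probs, eps=1e-12):
    """Majority-vote pseudo-label and mean cross-entropy against it.

    variant_probs: one dict per prompt variation, mapping label -> probability.
    Ties in the vote go to the label seen first (Counter.most_common ordering).
    """
    votes = [max(p, key=p.get) for p in variant_probs]   # argmax label per variation
    pseudo_label = Counter(votes).most_common(1)[0][0]   # hard consensus label
    loss = -sum(math.log(max(p[pseudo_label], eps)) for p in variant_probs)
    return pseudo_label, loss / len(variant_probs)

# Three phrasings of the same question; the third one disagrees (toy values).
variant_probs = [
    {"yes": 0.8, "no": 0.2},
    {"yes": 0.6, "no": 0.4},
    {"yes": 0.3, "no": 0.7},
]
pseudo_label, loss = consensus_cross_entropy(variant_probs)
```

The dissenting third variation contributes the largest term to the loss, which is the pressure that would push it toward the consensus during training.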

[74] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

Jihao Zhao,Zhiyuan Ji,Simin Niu,Hanyu Wang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

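TL;DR: This paper proposes MoM (Mixtures of scenario-aware document Memories), a framework that turns the passive text chunking of conventional RAG into proactive understanding, simulating human cognition during reading. It produces semantically complete document memories through structured extraction and multi-perspective evaluation, and helps small language models reach human-like intelligent text processing.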

Details

Motivation: Conventional RAG relies on passive chunking, which limits the depth of knowledge internalization and reasoning and fails to simulate human reading comprehension. Method: The MoM framework (1) uses LLMs to simulate domain experts in generating logical outlines of documents; (2) applies multi-path sampling and multi-perspective evaluation, with metrics for chunk clarity and extraction completeness, to select the best document memories; (3) adds a reverse reasoning strategy that derives expert thinking paths from high-quality outcomes for training small models; and (4) builds a three-layer document memory retrieval mechanism supported by probabilistic modeling theory. Result: Experiments in three domains show that MoM resolves chunking problems in existing RAG systems, gives LLMs semantically complete document memories, and markedly improves the ability of small models to proactively explore and construct document memories. Conclusion: MoM shifts RAG from passive retrieval to proactive understanding, moving small language models toward human-centric intelligent text processing and strengthening knowledge internalization and reasoning. Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling.
Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

[75] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Rahul Nadkarni,Yanai Elazar,Hila Gonen,Noah A. Smith

Main category: cs.CL

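TL;DR: The paper proposes an experimental method for studying how data relates to language model behavior by intervening on training data and retraining the model, and demonstrates it on factual knowledge acquisition.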

Details

Motivation: To understand how training data shapes language model behavior and to test causal links between data and behavior. Method: The experimental recipe selects evaluation items, matches relevant documents to them, modifies those documents, retrains the model, and measures the change. Co-occurrence statistics and information retrieval methods are used to identify documents that may contribute to knowledge learning. Result: The recipe can test the effect of data on model behavior, but existing document identification methods do not fully explain a model's ability to answer knowledge questions. Conclusion: The work provides a reusable experimental framework that helps researchers investigate further how training data affects language model behavior. Abstract: We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., "rewriting history" -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.

[76] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Yilun Zheng,Dan Yang,Jie Li,Lin Shang,Lihui Chen,Jiahao Xu,Sitao Luan

Main category: cs.CL

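TL;DR: This paper proposes DEG-RAG, a framework that denoises LLM-generated knowledge graphs through entity resolution and triple reflection, markedly improving the performance of graph-based retrieval-augmented generation systems.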

Details

Motivation: Existing graph-based RAG systems rely on LLMs to build knowledge graphs automatically, which often introduces noise such as redundant entities and unreliable relations; this hurts retrieval and generation, and effective denoising methods are lacking. Method: DEG-RAG has two core components, entity resolution (which eliminates redundant entities) and triple reflection (which removes erroneous relations), and the authors systematically evaluate different entity resolution strategies. Result: The method greatly reduces knowledge graph size and consistently improves question answering performance across several popular graph-based RAG variants. Conclusion: DEG-RAG addresses noise in LLM-generated knowledge graphs and is, to the authors' knowledge, the first comprehensive study of entity resolution in LLM-generated KGs, offering a new direction for optimizing graph-based RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs.
Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.
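
The entity resolution stages listed in the abstract (blocking, similarity scoring, merging) can be illustrated with a small sketch. Everything concrete here is an assumption chosen for brevity: first-letter blocking, character-level similarity from Python's `difflib.SequenceMatcher` in place of embeddings, a 0.85 threshold, and first-seen names as canonical forms. Triple reflection, which removes erroneous relations, is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def resolve_entities(names, threshold=0.85):
    """Map each entity name to a canonical name within its block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name[:1].lower()].append(name)   # blocking: compare only within a block
    canonical = {}
    for block in blocks.values():
        representatives = []
        for name in block:
            for rep in representatives:
                ratio = SequenceMatcher(None, name.lower(), rep.lower()).ratio()
                if ratio >= threshold:
                    canonical[name] = rep       # merge into an existing entity
                    break
            else:
                representatives.append(name)    # no match: becomes a new entity
                canonical[name] = name
    return canonical

def merge_triples(triples, canonical):
    """Rewrite triples with canonical entity names and drop exact duplicates."""
    return sorted({(canonical[h], r, canonical[t]) for h, r, t in triples})

# Toy knowledge graph with one redundant entity (illustrative only).
triples = [
    ("Barack Obama", "born_in", "Honolulu"),
    ("barack obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
]
names = [n for h, _, t in triples for n in (h, t)]
canonical = resolve_entities(names)
merged = merge_triples(triples, canonical)
```

Blocking keeps the number of pairwise comparisons down, which matters on large graphs; the price is that duplicates landing in different blocks are never compared, one reason the choice of blocking strategy is worth evaluating.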

[77] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

Lifu Tu,Yingbo Zhou,Semih Yavuz

Main category: cs.CL

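TL;DR: This paper studies how training data scale, negative sampling strategy, and data diversity can improve small multilingual embedding models for retrieval, and presents a compact model of only about 300M parameters whose performance matches or exceeds existing 7B models.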

Details

Motivation: Small multilingual models do well on multilingual tasks in general but usually trail large models on retrieval; this paper explores how to improve small models specifically for retrieval. Method: The authors analyze how training data scale, negative sampling strategy, and data diversity affect multilingual embeddings, with a focus on adding hard negatives and increasing task diversity. Result: Scaling training data helps at first but shows diminishing returns; hard negatives clearly improve retrieval accuracy; and task diversity matters more than language diversity. The resulting model of roughly 300M parameters matches or exceeds strong current 7B models on retrieval. Conclusion: With targeted training strategies, small multilingual models can match or beat large models on retrieval, offering a practical route to efficient deployment. Abstract: Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau, indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

[78] Qwen3Guard Technical Report

Haiquan Zhao,Chenhan Yuan,Fei Huang,Xiaomeng Hu,Yichang Zhang,An Yang,Bowen Yu,Dayiheng Liu,Jingren Zhou,Junyang Lin,Baosong Yang,Chen Cheng,Jialong Tang,Jiandong Jiang,Jianwei Zhang,Jijie Xu,Ming Yan,Minmin Sun,Pei Zhang,Pengjun Xie,Qiaoyu Tang,Qin Zhu,Rong Zhang,Shibin Wu,Shuo Zhang,Tao He,Tianyi Tang,Tingyu Xia,Wei Liao,Weizhou Shen,Wenbiao Yin,Wenmeng Zhou,Wenyuan Yu,Xiaobin Wang,Xiaodong Deng,Xiaodong Xu,Xinyu Zhang,Yang Liu,Yeqiu Li,Yi Zhang,Yong Jiang,Yu Wan,Yuxin Zhou

Main category: cs.CL

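TL;DR: This paper presents Qwen3Guard, a multilingual family of generative and streaming safety guardrail models. With fine-grained tri-class judgments and real-time token-level detection, it addresses the weaknesses of existing guardrail models in adapting to different policies and in streaming inference, while remaining scalable and low-latency.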

Details

Motivation: Existing guardrail models output only binary labels and must wait for the complete output, so they adapt poorly to differing safety policies and to streaming generation, which limits their usefulness in practice. Method: Two variants are designed. Generative Qwen3Guard treats safety classification as an instruction-following task and makes tri-class judgments (safe, controversial, unsafe); Stream Qwen3Guard adds a token-level classification head for real-time monitoring during generation. The models come in three sizes (0.6B, 4B, and 8B) and support 119 languages. Result: On English, Chinese, and multilingual benchmarks, Qwen3Guard reaches state-of-the-art performance in both prompt and response safety classification, and supports real-time intervention that reduces exposure to harmful content. Conclusion: With fine-grained classification and streaming capability, Qwen3Guard offers an efficient, flexible, low-latency safety solution for globally deployed LLMs, and all models are openly released. Abstract: As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.

[79] PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Md Mahadi Hasan Nahid,Davood Rafiei

Main category: cs.CL

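TL;DR: The paper proposes an LLM-based agentic retrieval system in which three specialized agents retrieve evidence efficiently for multi-hop question answering, improving both retrieval precision and recall.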

Details

Motivation: Multi-hop question answering requires gathering several pieces of evidence to answer a complex question, and conventional retrieval methods find it hard to balance precision and recall. Method: The system has three agents (a Question Analyzer, a Selector, and an Adder) that interact in an iterative loop to retrieve evidence with high precision and high recall. Result: Experiments on four multi-hop QA benchmarks show higher retrieval accuracy with far less irrelevant content, and downstream QA models surpass full-context answer accuracy. Conclusion: The proposed agentic retrieval system improves evidence retrieval quality in multi-hop QA, balances precision and recall, and strengthens overall QA performance. Abstract: Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrate that our approach consistently outperforms strong baselines.

[80] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL

Md Mahadi Hasan Nahid,Davood Rafiei,Weiwei Zhang,Yong Zhang

Main category: cs.CL

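TL;DR: The paper proposes a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Using two complementary strategies and several supporting techniques, it markedly improves schema recall and reduces false positives on benchmarks such as BIRD and Spider.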

Details

Motivation: Existing Text-to-SQL methods focus mostly on SQL generation and neglect the retrieval of relevant schema elements, which leads to hallucinations and execution failures. Method: The approach combines two strategies, table-first retrieval followed by column selection and column-first retrieval followed by table selection, and adds techniques such as question decomposition, keyword extraction, and keyphrase extraction. Result: On the BIRD and Spider benchmarks, the method significantly improves schema recall and reduces false positives; SQL generated from the retrieved schema outperforms full-schema baselines and approaches oracle performance without query refinement. Conclusion: Schema linking is a key lever for Text-to-SQL accuracy and efficiency; the method narrows the performance gap between the full-schema and perfect-schema settings by 50%. Abstract: Schema linking -- the process of aligning natural language questions with database schema elements -- is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
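
The two retrieval directions can be illustrated with a deliberately simple sketch in which exact keyword matching stands in for the retrieval model. The matching rule, the convention that a table's first column is its key, and the toy schema are all assumptions for illustration; the abstract does not describe the actual retrievers.

```python
def table_first(terms, schema):
    """Select tables named in the question, then their key and matching columns."""
    selected = set()
    for table, columns in schema.items():
        if table in terms:
            selected.add((table, columns[0]))   # assumed key column, kept for joins
            selected.update((table, c) for c in columns if c in terms)
    return selected

def column_first(terms, schema):
    """Select columns named in the question anywhere, which also selects their tables."""
    return {(t, c) for t, cols in schema.items() for c in cols if c in terms}

def bidirectional_retrieve(question, schema):
    """Union of both directions, so each can recover what the other misses."""
    terms = set(question.lower().replace("?", "").split())
    return table_first(terms, schema) | column_first(terms, schema)

# Toy schema and question (illustrative only).
schema = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "singer_id", "year"],
}
linked = bidirectional_retrieve("List singer age by year", schema)
```

Here the table-first pass finds `singer` and its key but misses `concert`, which the question never names, while the column-first pass recovers `concert.year`; taking the union shows why the two directions are complementary.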

[81] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Ziye Xia,Sergei S. Ospichev

Main category: cs.CL

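TL;DR: Building on the OpenAlex knowledge graph, this paper proposes a prompt-engineering-based method for analyzing key concept paths. It uses small language models for key concept extraction and innovation point identification, and a knowledge-graph-guided agent mechanism to improve analysis accuracy.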

Details

Motivation: Existing paper databases are mostly limited to similarity matching and basic classification of key concepts; they do not mine the deeper relational networks between concepts, so they offer little help to researchers trying to track the latest work efficiently. Method: The approach uses prompt engineering with fine-tuned Qwen and DeepSeek small language models for key concept extraction, and builds an agent constrained by the knowledge graph to analyze the distribution of key concept paths in nearly 8,000 open-source papers from Novosibirsk State University. Result: The method achieves precise key concept extraction and innovation point identification with significantly improved accuracy, and the models are publicly available on Hugging Face. Conclusion: The method reveals how key concepts in papers are connected and how those paths relate to innovation points, providing an interpretable, high-accuracy automated tool for academic analysis. Abstract: In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to track the latest research findings and methodologies in a timely and comprehensive way. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex open-source knowledge graph. By analyzing data from nearly 8,000 open-source papers from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

[82] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mahbub E Sobhani,Md. Faiyaz Abdullah Sayeedi,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda

Main category: cs.CL

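TL;DR: This paper introduces MathMist, a parallel multilingual mathematical reasoning benchmark with over 21K aligned question-answer pairs in seven languages. It evaluates LLM mathematical problem solving in multilingual settings and, across several reasoning paradigms, shows that performance drops markedly in low-resource languages.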

Details

Motivation: Existing mathematical reasoning benchmarks focus on English or a few high-resource languages and do not assess multilingual and cross-lingual mathematical reasoning comprehensively, which limits the fairness and effectiveness of LLMs in linguistically diverse settings. Method: The authors build MathMist, a parallel multilingual math dataset covering high-, medium-, and low-resource languages with over 21K aligned question-answer pairs, and systematically evaluate a range of open-source and proprietary LLMs under zero-shot, chain-of-thought (CoT), and code-switched reasoning. Result: Current LLMs lack consistency and interpretability in multilingual mathematical reasoning, with performance dropping markedly in low-resource languages. Conclusion: MathMist is an important benchmark for multilingual mathematical reasoning; it exposes the limits of existing models in cross-lingual mathematical understanding and underlines the need to improve reasoning in low-resource languages. Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist

[83] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Sathyanarayanan Ramamoorthy,Vishwa Shah,Simran Khanuja,Zaid Sheikh,Shan Jie,Ann Chia,Shearman Chua,Graham Neubig

Main category: cs.CL

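TL;DR: This paper introduces MERLIN, a new testbed for multilingual multimodal entity linking that contains BBC news article titles with paired images in five languages, along with benchmark results from several language models.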

Details

Motivation: To improve entity linking accuracy in multilingual settings, especially when the textual context is ambiguous or insufficient, by exploring how visual information helps. Method: The authors build a dataset with over 7,000 named entity mentions linked to 2,500 unique Wikidata entities across five languages, and run experiments that combine multilingual and multimodal entity linking methods with language models such as LLaMa-2 and Aya-23. Result: Incorporating visual data improves entity linking accuracy, with larger gains for models that have weaker multilingual abilities. Conclusion: Visual information helps multilingual multimodal entity linking, particularly when the textual context is insufficient, and MERLIN offers a useful testbed for future research. Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available at https://github.com/rsathya4802/merlin

[84] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Marwa Abdulhai,Ryan Cheng,Aryansh Shrivastava,Natasha Jaques,Yarin Gal,Sergey Levine

Main category: cs.CL

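TL;DR: This paper studies deceptive outputs from large language models (LLMs) in dialogue. It proposes a new "belief misalignment" metric to quantify deception and finds that existing models deceive even under benign prompts. Models trained with RLHF still show a high deception rate, while the authors' multi-turn reinforcement learning fine-tuning substantially reduces this behavior.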

Details

Motivation: LLMs deployed in real applications can generate misleading or deceptive content, which creates safety risks, so deception in dialogue needs to be measured and mitigated effectively. Method: The authors propose the belief misalignment metric, evaluate the deceptiveness of eight mainstream LLMs across four dialogue scenarios, and introduce a multi-turn reinforcement learning method for fine-tuning models to reduce deception. Result: The proposed metric correlates more closely with human judgments than existing metrics; the eight state-of-the-art models deceive in about 26% of dialogue turns on average; when prompted to deceive, deceptiveness rises by up to 31% relative to baselines; RLHF-trained models still deceive at an average rate of 43%; the proposed reinforcement learning method reduces deceptive behavior by 77.6%. Conclusion: LLMs carry a non-negligible risk of deception in dialogue, and current safety training such as RLHF does not fully suppress it; evaluation and training should account for the interaction history over multiple turns, and the proposed method mitigates deceptive outputs more effectively. Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metric we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses.
We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

[85] A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease

Yangyang Li

Main category: cs.CL

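TL;DR: Proposes a method for early detection of Alzheimer's disease based on hybrid word embeddings and hyperparameter tuning, reaching 91% accuracy and 97% AUC and outperforming existing models.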

Details

Motivation: Early detection of Alzheimer's disease allows timely treatment and reduces the burden on health care, and changes in language capability are an important early sign. Method: The method combines Doc2Vec and ELMo into a hybrid word embedding, computes sentence perplexity scores to capture semantics and fluency, adds linguistic features to enrich the representation, and feeds the result into logistic regression with hyperparameters tuned throughout the pipeline. Result: Classification accuracy reaches 91% and AUC reaches 97%; the model is stable (standard deviation of accuracy 0.0403, standard deviation of AUC 0.0174) and outperforms the best existing NLP model (88% accuracy). Conclusion: The method is accurate and stable for early detection of Alzheimer's disease and can serve as a large-scale screening tool or as a complementary examination for doctors. Abstract: Early detection of Alzheimer's Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating the financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, I create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture the semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, I input an embedded feature vector into logistic regression and fine-tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. To my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms that the proposed method is accurate and stable.
This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.

[86] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Perapard Ngokpol,Kun Kerdthaisong,Pasin Buakhaw,Pitikorn Khlaisamniang,Supasate Vorathammathorn,Piyalitt Ittichaiwong,Nutchanon Yongsatianchot

Main category: cs.CL

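TL;DR: This paper presents the Beyond One World benchmark for evaluating large language models in multi-version role-play, focusing on factual recall and consistency of moral decisions for superheroes across different universe versions.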

Details

Motivation: To explore whether large language models can maintain character consistency and credible reasoning when playing the same character across multiple canonical versions. Method: The authors build a benchmark covering 30 iconic heroes and 90 versions with two tasks, Canon Events and Moral Dilemmas, and propose the Think-Act Matching metric to measure agreement between a model's internal reasoning and its outward behavior. Result: Chain-of-thought prompting helps weaker models but can reduce accuracy in stronger ones; cross-version generalization remains a challenge; models usually excel at either "thinking" or "acting", but not both. Conclusion: Current role-playing models have significant gaps in multiversal consistency and reasoning alignment, and Beyond One World provides an effective tool for evaluating these abilities. Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

[87] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

Ziad Elshaer,Essam A. Rashed

Main category: cs.CL

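TL;DR: Proposes a confidence-driven multi-model framework that requires no fine-tuning and improves medical question answering through model collaboration, with the advantages of computational efficiency and broad accessibility.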

Details

Motivation: High-performing medical LLMs usually require fine-tuning with substantial computational resources, which limits their use by resource-constrained institutions, so an efficient alternative that needs no fine-tuning is required. Method: A two-stage architecture is used: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism sends low-confidence questions to helper models with complementary knowledge for collaborative reasoning. Result: The method is validated on three medical benchmarks, MedQA, MedMCQA, and PubMedQA, reaching 95.0% on PubMedQA and 78.0% on MedMCQA, and substantially outperforms single-model approaches and uniform reasoning strategies. Conclusion: Confidence-based multi-model collaboration effectively improves medical question answering and offers a feasible path for making medical AI available in resource-limited settings. Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism directs low-confidence queries to helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
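
The two-stage idea described above (check the primary model's confidence, then route uncertain questions to helper models) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Model` interface, the 0.7 threshold, and the majority vote used for "collaborative reasoning" are hypothetical stand-ins, not the paper's actual components.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a question to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(question: str, primary: Model, helpers: List[Model],
          threshold: float = 0.7) -> str:
    """Answer with the primary model alone when it is confident enough;
    otherwise collect helper answers and take a majority vote."""
    answer, confidence = primary(question)
    if confidence >= threshold:
        return answer  # stage 1: the primary model is trusted
    # Stage 2 (assumed aggregation rule): vote over primary and helper answers.
    votes = Counter([answer] + [helper(question)[0] for helper in helpers])
    return votes.most_common(1)[0][0]
```

Only low-confidence questions pay for the extra helper calls, which is where the efficiency argument in the abstract comes from.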

[88] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

Anyun Zhuo,Xuefei Ning,Ningyuan Li,Yu Wang,Pinyan Lu

Main category: cs.CL

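TL;DR: This paper studies the robustness of modern large language models (LLMs) to frequent, structured character-level perturbations and proposes a method that inserts invisible Unicode control characters into text to discourage LLM misuse. Although the noise severely disrupts tokenization and the signal-to-noise ratio, many LLMs still perform notably well. A multi-dimensional evaluation examines how LLMs handle character-level noise and reveals their low-level robustness, offering insight for preventing misuse and improving deployment reliability.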

Details

Motivation: To discourage misuse of LLMs in scenarios such as online exams, and to investigate the risks and mechanisms of their robustness under character-level noise. Method: The proposed method introduces noise by inserting invisible Unicode control characters after each input character; it is evaluated across different models, tasks, and noise configurations, with analysis of the effect on tokenization and of explicit versus implicit denoising mechanisms. Result: Although character-level noise severely fragments tokenization and lowers the signal-to-noise ratio, many LLMs retain strong performance, showing considerable low-level robustness; the study suggests models may rely on implicit denoising to cope with such interference. Conclusion: Modern LLMs are unexpectedly robust to structured character-level noise, which both creates a misuse risk and calls for a deeper understanding of their low-level processing in order to improve safety and deployment reliability. Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and implicit versus explicit denoising mechanism hypotheses of character-level noises. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
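
To make the perturbation concrete, here is a minimal sketch of inserting one invisible character after every input character, together with the trivial "explicit denoising" inverse. The choice of U+200B (zero-width space) is an assumption for illustration; the abstract does not say which characters the method actually uses.

```python
# Assumed noise character: U+200B ZERO WIDTH SPACE (invisible when rendered).
NOISE = "\u200b"


def perturb(text: str, noise: str = NOISE) -> str:
    """Insert the invisible noise character after every character of the text."""
    return "".join(ch + noise for ch in text)


def denoise(text: str, noise: str = NOISE) -> str:
    """Explicit denoising: strip the noise character to recover the original."""
    return text.replace(noise, "")


noisy = perturb("exam")
print(len("exam"), len(noisy))  # the string looks the same but is twice as long
```

A human reader sees identical text, while a tokenizer receives a string in which no two original characters are adjacent, which is why tokenization fragments so badly.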

[89] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

Joseph E. Trujillo-Falcon,Monica L. Bozeman,Liam E. Llewellyn,Samuel T. Halvorson,Meryl Mizell,Stuti Deshpande,Bob Manning,Todd Fagin

Main category: cs.CL

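TL;DR: The U.S. National Weather Service (NWS) is developing an AI-based automated translation system to deliver accurate, timely, and culturally relevant weather information to non-English speakers. It initially supports Spanish, Simplified Chinese, and Vietnamese, and combines GIS mapping with ethical AI practices to advance a national warning system that reaches everyone.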

Details

Motivation: To serve the 68.8 million people in the U.S. who do not speak English at home, improve access to weather warnings and information, and help build a Weather-Ready Nation. Method: In partnership with LILT and its patented training process, the NWS is developing a scalable automated translation tool based on large language models (LLMs) and neural machine translation (NMT); the work draws on best practices for multilingual risk communication and uses GIS mapping to analyze regional language needs and prioritize resources. Result: A translation system supporting Spanish, Simplified Chinese, Vietnamese, and other languages has been developed; it significantly reduces manual translation time and workload, and an experimental website offers multilingual warnings, 7-day forecasts, and educational materials. Conclusion: By combining technology, geographic analysis, and ethical AI, the system advances fair, efficient, and inclusive weather information services nationwide and moves the country closer to a warning system that covers everyone. Abstract: To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

[90] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Mykolas Sveistrys,Richard Kunert

Main category: cs.CL

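TL;DR: This paper introduces "pluri-hop questions" as a new category, formalizes them with three criteria, and builds PluriHopWIND, a multilingual diagnostic dataset for studying question answering over recurring report documents. Experiments show that existing RAG methods perform poorly, so the authors propose the PluriHopRAG architecture, which substantially improves F1 through query decomposition and early filtering.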

Details

Motivation: Many real-world questions (for example about medical records or compliance filings) require aggregating over all documents and are highly sensitive to omissions, and existing single-hop or multi-hop QA methods cannot handle this need for exhaustive retrieval. Method: The authors introduce the PluriHopWIND dataset, with 48 pluri-hop questions built from 191 real-world wind industry reports, and design the PluriHopRAG architecture, which decomposes queries into document-level subquestions and uses a cross-encoder to filter out irrelevant documents before LLM reasoning. Result: PluriHopWIND is 8-40% more repetitive than other datasets and has a higher density of distractor documents; traditional RAG and its variants all score no more than 40% F1 on it; PluriHopRAG achieves relative F1 improvements of 18-52% over the baselines. Conclusion: Current QA systems perform poorly on highly repetitive, distractor-dense document collections, and PluriHopRAG, by combining exhaustive retrieval with early filtering, offers an alternative to top-k retrieval. Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM.
Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
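
The "check all documents individually, filter cheaply" pattern can be sketched as below. Everything here is an assumption made for illustration: the word-overlap score merely stands in for a cross-encoder, and `reader` stands in for the costly LLM call; neither reflects the paper's actual components.

```python
from typing import Callable, Dict


def cheap_relevance(question: str, document: str) -> float:
    """Stand-in for a cross-encoder score (assumption): the fraction of
    question words that also occur in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def answer_exhaustively(question: str, documents: Dict[str, str],
                        reader: Callable[[str, str], str],
                        min_score: float = 0.5) -> Dict[str, str]:
    """Visit every document (no top-k cutoff), drop clearly irrelevant ones
    with the cheap filter, and call the expensive reader only on the rest."""
    answers = {}
    for doc_id, text in documents.items():
        if cheap_relevance(question, text) < min_score:
            continue  # filtered out before any costly reasoning
        answers[doc_id] = reader(question, text)
    return answers
```

The per-document answers would then be aggregated into the final response; unlike top-k retrieval, no document is skipped merely because k others ranked higher.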

[91] Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis

Jun Li,Qun Zhao

Main category: cs.CL

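TL;DR: This study builds a high-quality annotated dataset from Reddit that includes users' posting histories and comments, labeled with a four-level framework based on the C-SSRS, and examines how comment-tree information affects the identification and prediction of users' suicide risk levels. Experiments show that incorporating comment trees significantly improves both discrimination and prediction of risk.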

Details

Motivation: Existing work mostly detects suicidal tendencies in individual social media posts and lacks analysis of users' longitudinal, sequential comment trees, making it hard to capture how suicide risk evolves; the value of interaction history for risk prediction therefore needs study. Method: The authors build a Reddit-based dataset of user posts and comments, apply a refined four-class annotation scheme based on the C-SSRS, and assess the role of comment-tree information in risk identification and prediction through statistical analysis and large language model (LLM) experiments. Result: Statistical analysis and LLM experiments show that adding comment-tree information significantly improves discrimination and prediction accuracy for users' suicide risk levels. Conclusion: Integrating users' historical interaction comment trees improves the accuracy of suicide risk detection and provides an important foundation for early intervention strategies. Abstract: Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user's evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users' suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users' posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with results from experiments with Large Language Models (LLMs), demonstrates that incorporating comment-tree data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers novel insight into enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.

[92] Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

Shiyao Ding,Takayuki Ito

Main category: cs.CL

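TL;DR: Proposes the "Your Next Token Prediction" (YNTP) task and builds a multilingual benchmark from conversations between people and NPCs grounded in the MBTI personality model, for modeling users' personalized language generation.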

Details

Motivation: LLMs perform well at general next-token prediction but still fall short at imitating an individual's real communication style (such as replies to emails or social messages), and real social data is hard to obtain because of privacy concerns. Method: The authors design the YNTP task and build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, in which users interact for five days with psychologically grounded NPCs based on MBTI dimensions, capturing daily-life communication patterns. Result: The work establishes the first YNTP benchmark, evaluates prompt-based and fine-tuning-based personalization methods, reveals differences in users' language behavior when interacting with NPCs of different personalities, and supports analysis of users' internal models. Conclusion: YNTP opens a new direction for personalized language modeling, and the dataset and benchmark help advance research on user-aligned language generation. Abstract: Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users' internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: https://github.com/AnonymousHub4Submissions/your-next-token-prediction-dataset-100

[93] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Yingpeng Ning,Yuanyuan Sun,Ling Luo,Yanhua Wang,Yuchen Pan,Hongfei Lin

Main category: cs.CL

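TL;DR: This paper proposes MedTrust-Guided Iterative RAG, a framework that aims to improve factual consistency and reduce hallucinations in biomedical question answering. It improves on existing models through citation-aware reasoning, iterative retrieval-verification, and the MedTrust-Align module.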

Details

Motivation: Existing RAG-based biomedical QA systems suffer from hallucinations, mainly because of post-retrieval noise and insufficient evidence verification, which undermines answer reliability. Method: 1) Citation-aware reasoning requires all generated content to be grounded in retrieved documents and uses Negative Knowledge Assertions when evidence is insufficient; 2) an iterative retrieval-verification process refines queries through Medical Gap Analysis; 3) the MedTrust-Align Module (MTAM) combines positive examples with hallucination-aware negative samples and uses Direct Preference Optimization to reinforce citation-grounded reasoning. Result: On MedMCQA, MedQA, and MMLU-Med, the method outperforms baselines across multiple model architectures, with average accuracy gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B. Conclusion: MedTrust-Guided Iterative RAG strengthens the factual consistency and reliability of biomedical QA systems, reduces hallucinations, and offers a feasible approach to trustworthy medical AI. Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns.
Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.
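
The iterative retrieval-verification loop can be pictured with the rough sketch below. The callables `retrieve`, `is_sufficient`, and `refine` are hypothetical placeholders for the retriever, the verification agent, and the gap-analysis step; returning `None` is an assumed way of signaling that the caller should abstain rather than answer without evidence.

```python
from typing import Callable, List, Optional


def iterative_retrieve(query: str,
                       retrieve: Callable[[str], List[str]],
                       is_sufficient: Callable[[List[str]], bool],
                       refine: Callable[[str, List[str]], str],
                       max_rounds: int = 3) -> Optional[List[str]]:
    """Retrieve, verify, and refine the query until the evidence is judged
    adequate or the round budget runs out."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        evidence.extend(retrieve(query))
        if is_sufficient(evidence):
            return evidence
        # Rewrite the query to target what the evidence is still missing.
        query = refine(query, evidence)
    return None  # insufficient evidence: abstain instead of guessing
```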

[94] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Qingyu Ren,Qianyu He,Bowei Zhang,Jie Zeng,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu

Main category: cs.CL

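TL;DR: Proposes a self-supervised reinforcement learning framework that needs no external supervision; it derives reward signals from the instructions themselves and generates pseudo-labels to address sparse rewards in multi-constraint instruction following.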

Details

Motivation: Language models struggle to follow multi-constraint instructions, and existing reinforcement learning methods depend on external supervision and sparse reward signals, which limits their practical effectiveness. Method: The authors propose a label-free self-supervised RL framework that uses constraint decomposition and efficient constraint-wise binary classification to derive reward signals and pseudo-labels for reward model training directly from instructions. Result: The method achieves strong improvements on 3 in-domain and 5 out-of-domain datasets, notably on challenging agentic and multi-turn instruction following tasks. Conclusion: The method removes the dependency on external supervision, mitigates the sparse reward problem, and offers good generalization and computational efficiency. Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
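
To illustrate why constraint-wise scoring gives a denser signal than an all-or-nothing reward, here is a minimal sketch. The hand-written checkers are stand-ins chosen for illustration; in the paper the per-constraint yes/no decision comes from a trained reward model, and averaging the verdicts is an assumed aggregation rule.

```python
from typing import Callable, List

# Hypothetical per-constraint binary classifier: response -> satisfied or not.
Constraint = Callable[[str], bool]


def constraint_reward(response: str, constraints: List[Constraint]) -> float:
    """Score a response by the fraction of decomposed constraints it meets,
    so partial compliance still earns partial reward."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


# Instruction: "Reply in at most five words, end with a period, say please."
checks = [
    lambda r: len(r.split()) <= 5,
    lambda r: r.endswith("."),
    lambda r: "please" in r.lower(),
]
print(constraint_reward("Send it today.", checks))  # two of three constraints met
```

An all-or-nothing reward would give this response zero, the same as a response that violates every constraint.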

[95] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang,Ce Zhang,Jun-Yu Ma,Jianshu Zhang,Hongru Wang,Yi Chen,Boyang Xue,Tianqing Fang,Zhisong Zhang,Hongming Zhang,Haitao Mi,Dong Yu,Kam-Fai Wong

Main category: cs.CL

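TL;DR: Proposes the "Explore to Evolve" paradigm, which uses proactive online exploration and self-evolved aggregation programs to build the verifiable WebAggregatorQA dataset, and on that basis develops WebAggregator models that surpass GPT-4.1 on information aggregation tasks.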

Details

Motivation: Existing open-source deep research agents mostly focus on information seeking and overlook the essential need for information aggregation, which limits their ability to support in-depth research; agents therefore need stronger knowledge integration for complex research tasks. Method: The Explore to Evolve paradigm first has an agent proactively explore the real web to gather grounded information; using the collected evidence, the agent then selects, composes, and refines operations from 12 high-level logical types to self-evolve verifiable QA pairs, producing the large-scale training dataset WebAggregatorQA (10K samples across 50K websites and 11 domains); supervised fine-tuning trajectories are collected with the SmolAgents framework to train the WebAggregator model series. Result: WebAggregator-8B matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and approaches Claude-3.7-sonnet; on a human-annotated evaluation split built as a challenging test set, Claude-3.7-sonnet scores 28% and GPT-4.1 scores 25.8%, showing that existing agents struggle with aggregation even when they retrieve all the information. Conclusion: Information aggregation is a key bottleneck for current web agents; WebAggregator substantially improves aggregation ability through a scalable self-evolution method, demonstrating the importance of strengthening information integration for building effective research agents. Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Beginning with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet.
Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

[96] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

Reid T. Johnson,Michelle D. Pain,Jordan D. West

Main category: cs.CL

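TL;DR: This paper proposes Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models with natural language outputs, improving the accuracy and stability of tool calls.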

Details

Motivation: Programmatic JSON tool calling suffers from task interference and format constraints that hurt LLM performance in practice, so a more flexible and robust approach to tool calling is needed. Method: NLT decouples tool selection from response generation and expresses tool calls in natural language, avoiding format constraints and interference between tasks; it is evaluated across multiple models and application scenarios. Result: Across 10 models and 6,400 trials, NLT improves tool calling accuracy by 18.4 percentage points and reduces output variance by 70%; gains are largest for open-weight models, which surpass flagship closed-weight models. Conclusion: NLT effectively improves LLM tool calling, applies broadly, is especially useful for models lacking native tool support, and has implications for model training in both reinforcement learning and supervised fine-tuning stages. Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
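
One way to picture natural-language tool selection is a parser that reads a plain-text verdict for each tool instead of a JSON payload. The `tool name: yes/no` line format, the tool names, and the parser itself are assumptions made purely for illustration; the abstract does not specify NLT's actual output format.

```python
import re
from typing import List

# Assumed line format: "<tool name>: yes" or "<tool name> - no" (case-insensitive).
LINE = re.compile(r"^\s*(?P<tool>[\w ]+?)\s*[-:]\s*(?P<verdict>yes|no)\s*$",
                  re.IGNORECASE)


def parse_tool_choices(reply: str, known_tools: List[str]) -> List[str]:
    """Return the known tools the model marked 'yes', ignoring any free-form
    prose or unknown tool names in the reply."""
    selected = []
    for line in reply.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # ordinary prose needs no strict format
        tool = match.group("tool").strip().lower()
        if tool in known_tools and match.group("verdict").lower() == "yes":
            selected.append(tool)
    return selected
```

Because malformed or extra lines are simply skipped, a stray sentence cannot invalidate the whole selection the way a single syntax error invalidates a JSON object.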

[97] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Haolin Li,Haipeng Zhang,Mang Li,Yaohua Wang,Lijie Wen,Yu Zhang,Biqing Huang

Main category: cs.CL

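TL;DR: Proposes the LiRA framework, which uses the Arca and LaSR modules to improve large models' cross-lingual representation, retrieval, and reasoning in low-resource languages.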

Details

Motivation: Large models are nearing saturation on high-resource languages but perform worse on low-resource languages because of scarce data, translation noise, and unstable cross-lingual alignment. Method: The LiRA framework comprises Arca (anchor-based alignment with multi-agent collaborative encoding) and LaSR (a language-aware lightweight reasoning head with consistency regularization), jointly optimizing cross-lingual representation, retrieval, and reasoning. Result: LiRA yields consistent and robust gains on several low-resource cross-lingual tasks (retrieval, semantic similarity, reasoning), including few-shot and noise-amplified settings; ablations confirm that both Arca and LaSR contribute. Conclusion: LiRA strengthens large models' cross-lingual understanding and multi-task robustness in low-resource languages, and the authors also release a new product retrieval dataset covering Southeast Asian and South Asian languages. Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

[98] Efficient Seq2seq Coreference Resolution Using Entity Representations

Matt Grenander,Shay B. Cohen,Mark Steedman

Main category: cs.CL

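TL;DR: Proposes a compressed representation that makes seq2seq coreference models more efficient in incremental settings, achieving substantial token compression while staying close to state-of-the-art performance.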

Details

Motivation: Existing seq2seq coreference models are inefficient and inflexible in incremental settings such as dialogue. Method: The method extracts and reorganizes entity-level tokens and discards most other input tokens, using this compressed representation to improve efficiency in incremental settings. Result: On OntoNotes, the model scores only 0.6 CoNLL F1 below a full-prefix incremental baseline at a compression ratio of 1.8; on LitBank, where singleton mentions are annotated, it surpasses the state of the art. Conclusion: Discarding a large share of tokens in seq2seq coreference resolvers is a feasible strategy for efficient incremental coreference resolution. Abstract: Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must be processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model scores just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it surpasses state-of-the-art performance. Our results indicate that discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.
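
As a toy illustration of keeping entity-level tokens and discarding the rest, the sketch below compresses a token list to its mention spans and computes a compression ratio. Both the span format and the definition of the ratio as original length divided by compressed length are assumptions; the abstract does not define how the reported ratio of 1.8 is measured.

```python
from typing import List, Tuple


def compress(tokens: List[str], mention_spans: List[Tuple[int, int]]) -> List[str]:
    """Keep only tokens inside mention spans (start inclusive, end exclusive)."""
    keep = set()
    for start, end in mention_spans:
        keep.update(range(start, end))
    return [tok for i, tok in enumerate(tokens) if i in keep]


def compression_ratio(original_len: int, compressed_len: int) -> float:
    """Assumed definition: original length divided by compressed length."""
    return original_len / compressed_len


tokens = "Alice said she would call Bob tomorrow".split()
kept = compress(tokens, [(0, 1), (2, 3), (5, 6)])
print(kept, compression_ratio(len(tokens), len(kept)))
```

Under this assumed definition, a ratio of 1.8 would correspond to a model input of about 1/1.8, or roughly 56%, of the original length.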

[99] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Kyubyung Chae,Gihoon Kim,Gyuseong Lee,Taesup Kim,Jaejin Lee,Heejin Kim

Main category: cs.CL

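TL;DR: Builds a new dataset and analytic framework for evaluating the socio-cultural alignment and technical robustness of sovereign LLMs, finding that although these models help support low-resource languages, they do not always serve their target users well and may neglect key quality attributes such as safety.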

Details

Motivation: There is currently no framework or dataset for assessing whether sovereign LLMs truly match their users' socio-cultural backgrounds or whether they are safe and technically robust, so a systematic evaluation method is needed to verify their real-world effect. Method: Construct a new dataset and propose an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, while also assessing their technical robustness and safety. Result: Sovereign LLMs help support low-resource languages, but their socio-cultural alignment is uneven, and the pursuit of localization may lead to underestimating key quality attributes such as safety. Conclusion: Advancing sovereign LLMs requires a more comprehensive evaluation regime with a broader set of practical, well-grounded criteria to ensure effectiveness, safety, and socio-cultural fit. Abstract: Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users' socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

[100] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying,Yunwen Li,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Xeron Du,Tianyu Zheng,Yichi Zhang,Letian Ni,Yuyang Cheng,Qiguang Chen,Jingzhe Ding,Shengda Long,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Ge Zhang,Wenhao Huang,Wanxiang Che,Chenghua Lin

Main category: cs.CL

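TL;DR: Introduces WritingPreferenceBench, a new benchmark for evaluating preference learning methods once objective quality signals are removed. Current methods are found to rely mainly on detecting objective errors rather than capturing subjective quality preferences, while generative reward models with explicit reasoning chains are markedly more accurate.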

Details

Motivation: Existing preference learning methods do well on standard benchmarks, but their performance drops sharply once objective quality signals are removed, and they struggle to capture subjective writing quality (creativity, style, emotional resonance); better evaluation benchmarks and modeling approaches are needed. Method: Build a dataset of 1,800 human-annotated preference pairs (8 creative writing genres, in both English and Chinese) in which all responses are matched for objective correctness, factual accuracy, and length; compare sequence-based reward models, zero-shot language model judges, and generative reward models with reasoning chains. Result: Sequence-based reward models average 52.7% accuracy and zero-shot language model judges 53.9%, while generative reward models reach 81.8%; accuracy varies widely across genres (18.2% to 81.8%), and model scale (8B vs. 27B) brings no consistent improvement. Conclusion: Current RLHF methods mainly learn to detect objective errors rather than model subjective preferences; successful preference modeling may require intermediate reasoning rather than direct classification. Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

[101] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Kedi Chen,Zhikai Lei,Xu Guo,Xuecheng Wu,Siyuan Zeng,Jianghao Yin,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang

Main category: cs.CL

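TL;DR: Introduces CodeSeq, a synthetic post-training dataset built from number sequences, to improve LLM performance on inductive reasoning tasks. By defining a general term generation (GTG) task and combining iterative correction with reinforcement learning, models learn to generate test cases and check themselves autonomously, and so learn more effectively from both successes and failures.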

Details

Motivation: Existing inductive reasoning research mostly targets superficial regularities, lacks complex internal patterns, and provides neither fine-grained thinking processes nor difficulty control; a method is needed that supports deep pattern learning and controllable reasoning training. Method: Build the CodeSeq dataset by turning number sequences into algorithmic problems and defining the GTG task; generate supervised finetuning data through reflection on failed cases and iterative correction; apply reinforcement learning with a Case-Synergy Solvability Scaling Reward based on solvability and the success rate of self-generated cases. Result: Models trained with CodeSeq improve on a range of reasoning tasks while retaining good generalization to out-of-distribution (OOD) data. Conclusion: Through structured inductive task design and a training scheme that combines supervised finetuning with reinforcement learning, CodeSeq effectively improves the inductive reasoning of LLMs and offers a new route to learning complex patterns. Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but neither provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models' OOD performance.

[102] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Qing Yang,Zhenghao Liu,Junxin Wang,Yangfan Du,Pengcheng Huang,Tong Xiao

Main category: cs.CL

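TL;DR: Proposes RLAIF-SPA, a reinforcement learning from AI feedback framework that improves the emotional expressiveness and naturalness of text-to-speech synthesis by optimizing for semantic accuracy and prosodic-emotional alignment.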

Details

Motivation: Existing emotional speech synthesis methods depend on costly emotion annotations or indirect optimization objectives, fail to capture the emotional expressiveness and perceptual naturalness of speech, and so produce speech with little emotional variation. Method: The RLAIF-SPA framework uses automatic speech recognition (ASR) and a large language model (LLM) as AI feedback to judge semantic accuracy and prosodic-emotional label alignment respectively, jointly optimizes four fine-grained dimensions (structure, emotion, speed, and tone), and trains end to end with reinforcement learning. Result: On the Libri Speech dataset, the method clearly outperforms Chat-TTS: word error rate (WER) falls by 26.1%, speaker similarity (SIM-O) rises by 9.1%, and human evaluation improves by more than 10%. Conclusion: By introducing AI feedback, RLAIF-SPA improves the expressiveness, intelligibility, and naturalness of emotional speech synthesis without human emotion annotations. Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
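
For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by reference length). The function names are illustrative, and treating the reported 26.1% as a relative reduction is an assumption; the abstract does not say whether the figure is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative drop from baseline to improved (assumed reading of 'reduction')."""
    return (baseline - improved) / baseline
```

In an ASR-based reward, the hypothesis would be the transcript of the synthesized audio and the reference the input text; how RLAIF-SPA converts this score into a reward is not specified in the abstract.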

[103] Intent Clustering with Shared Pseudo-Labels

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

Main category: cs.CL

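TL;DR: Proposes a training-free, label-free intent clustering method that uses lightweight open-source LLMs to generate pseudo-labels and clusters via multi-label classification; it matches or beats existing methods and suits low-resource scenarios.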

Details

Motivation: Existing methods depend on commercial LLMs and on knowing the number of clusters in advance; they are costly and opaque, and hard to apply in realistic settings. Method: Use a lightweight open-source LLM to generate pseudo-labels for each text, perform multi-label classification over the pseudo-label set, and use the degree of label overlap as a measure of text similarity for clustering. Result: On four benchmark datasets the method matches or outperforms recent baselines, with good stability, computational efficiency, and robustness across models and datasets. Conclusion: The method is simple and efficient, suits low-resource settings, and offers an interpretable, low-cost solution for intent clustering. Abstract: In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to or better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
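
The shared-label hypothesis can be illustrated with a small sketch. This is a simplification, not the authors' pipeline: the abstract describes encoding texts into embeddings, whereas the sketch below compares pseudo-label sets directly with Jaccard overlap and merges texts whose overlap passes a threshold. The threshold value, the function names, and the hard-coded pseudo-labels (which an LLM would produce in practice) are all assumptions.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_label_overlap(labels: List[Set[str]], threshold: float = 0.5) -> List[List[int]]:
    """Group text indices whose pseudo-label sets overlap enough.

    Uses union-find so the number of clusters does not have to be
    known in advance.
    """
    parent = list(range(len(labels)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(labels)), 2):
        if jaccard(labels[i], labels[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(labels)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pseudo-labels for four utterances.
pseudo_labels = [
    {"card", "lost"},
    {"card", "lost", "replace"},
    {"balance", "check"},
    {"balance", "check", "account"},
]
clusters = cluster_by_label_overlap(pseudo_labels)
```

With these inputs the first two utterances share two of three labels and fall into one group, as do the last two, while no label is shared across the groups.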

[104] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Linyue Ma,Yilong Xu,Xiang Long,Zhi Zheng

Main category: cs.CL

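TL;DR: Proposes a unified, verifiable "nugget-as-rubric" paradigm to improve reward modeling for search-augmented LLMs; by constructing rubrics automatically and designing an efficient generative verifier, Search-Gen-V, it achieves high verification accuracy, robustness, and computational efficiency across tasks.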

Details

Motivation: Existing reward models for search-augmented LLMs are limited: rule-based rewards such as Exact Match are sensitive to variations in expression and hard to apply to long-form tasks, while generative rewards are more robust but remain difficult to make verifiable and stable on dynamic corpora and are computationally expensive. A unified reward mechanism that is verifiable and works for both short-form and long-form tasks is needed. Method: Propose the "nugget-as-rubric" paradigm, which treats atomic information points as structured evaluation criteria; for long-form tasks, design an automatic rubric construction pipeline based on query rewriting that retrieves relevant passages from static and dynamic web content and extracts rubrics from them; on this basis, develop Search-Gen-V, an efficient 4B-parameter generative verifier trained with distillation and a two-stage strategy. Result: Search-Gen-V shows strong verification accuracy across workloads and is more scalable, robust, and computationally efficient than existing methods, supporting reward construction for long-form and dynamic-content scenarios. Conclusion: "Nugget-as-rubric" is a unified, verifiable, and scalable reward modeling paradigm; together with the Search-Gen-V verifier it improves the performance and reliability of search-augmented LLMs across diverse tasks and offers a new route to efficient, trustworthy retrieval-augmented systems. Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content.
Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
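
The "single rubric for short-form, multiple rubrics for long-form" idea can be sketched as a reward function. This is only an illustration under stated assumptions: the abstract does not say how per-rubric judgments are aggregated, so averaging is an assumption, and `toy_verify` is a hypothetical substring check standing in for a generative verifier such as Search-Gen-V.

```python
from typing import Callable, List


def rubric_reward(
    answer: str,
    rubrics: List[str],
    verify: Callable[[str, str], bool],
) -> float:
    """Fraction of rubrics (nuggets) that the verifier judges as satisfied.

    Averaging is an assumed aggregation rule, chosen for illustration.
    """
    if not rubrics:
        raise ValueError("at least one rubric is required")
    satisfied = sum(1 for rubric in rubrics if verify(answer, rubric))
    return satisfied / len(rubrics)


def toy_verify(answer: str, rubric: str) -> bool:
    """Hypothetical stand-in verifier: case-insensitive substring match."""
    return rubric.lower() in answer.lower()


# Short-form task: one rubric, so the reward is 0 or 1.
short_reward = rubric_reward("Paris is the capital of France.", ["Paris"], toy_verify)

# Long-form task: several rubrics, so the reward is graded.
long_reward = rubric_reward(
    "The Eiffel Tower is in Paris and opened in 1889.",
    ["Paris", "1889", "Gustave Eiffel"],
    toy_verify,
)
```

A substring check is exactly the kind of brittle rule the abstract argues against; it is used here only so the sketch runs without a model.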

[105] Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures

Xinyue Ma,Pol Pastells,Mireia Farrús,Mariona Taulé

Main category: cs.CL

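TL;DR: Proposes fine-tuning machine translation models to capture the negative semantic prosody of Chinese BEI passives, using a purpose-built English-Chinese parallel corpus; the fine-tuned models better distinguish the contexts in which BEI passives are appropriate, and a multilingual model transfers this knowledge across language pairs.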

Details

Motivation: Literal translations of a word can carry different semantic prosody, and existing machine translation models handle this poorly, so it is worth studying how to teach models the semantic prosody of a specific structure to improve translation accuracy. Method: Focus on the Chinese BEI passive construction, build an English-Chinese dataset that demonstrates its negative semantic prosody, fine-tune OPUS-MT, NLLB-600M, and mBART50 on it, and evaluate how the models use BEI passives when translating unfavourable, neutral, and favourable content. Result: The fine-tuned models are more inclined to use BEI passives for unfavourable content and avoid them for neutral or favourable content; NLLB-600M also transfers this knowledge from English-Chinese translation to other language pairs such as Spanish-Chinese. Conclusion: Targeted dataset construction and fine-tuning can improve a machine translation model's grasp of semantic prosody, and multilingual models show potential for cross-lingual transfer of that knowledge. Abstract: Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better on using BEI passives for translating unfavourable content and avoid using it for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.

[106] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Ziqi Dai,Xin Zhang,Mingxin Li,Yanzhao Zhang,Dingkun Long,Pengjun Xie,Meishan Zhang,Wenjie Li,Min Zhang

Main category: cs.CL

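TL;DR: Compares contrastive learning (CL) and supervised fine-tuning (SFT) for LLM-based reranking, finding within a unified framework that SFT outperforms CL because of its stronger weighting scheme, and confirming state-of-the-art results on the MRB benchmark through large-scale experiments.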

Details

Motivation: Determine which training objective (CL or SFT) is better suited to LLM-based reranking and reveal the mechanism behind the difference. Method: Decompose the training objectives into two components, weight and direction, propose a unified framework for analyzing how CL and SFT interact, and run probing experiments and large-scale training in the universal multimodal retrieval (UMR) setting. Result: SFT provides a stronger weighting scheme than CL, while neither is clearly better in update direction; overall SFT holds a consistent advantage for LLM reranking and sets new state-of-the-art results on the MRB benchmark. Conclusion: SFT suits LLM-based reranking better than CL, mainly because of its more effective weighting scheme; the findings offer guidance for follow-up work. Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts the "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of the model updates, and direction, which guides those updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark.
We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
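
To make the two objectives concrete, here is a minimal numeric sketch of their textbook forms: an InfoNCE-style contrastive loss over one relevant and several irrelevant candidates, and a two-way cross-entropy over "yes"/"no" token logits for a single query-document pair. These are generic formulations chosen for illustration; the paper's exact losses and its weight/direction decomposition are not given in the abstract, and the temperature parameter is an assumption.

```python
import math
from typing import List


def _log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def info_nce_loss(pos_score: float, neg_scores: List[float], temperature: float = 1.0) -> float:
    """Contrastive loss: negative log-softmax of the relevant candidate's score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    return _log_sum_exp(logits) - logits[0]


def sft_yes_no_loss(yes_logit: float, no_logit: float, relevant: bool) -> float:
    """Classification loss: cross-entropy over the 'yes' and 'no' token logits."""
    target = yes_logit if relevant else no_logit
    return _log_sum_exp([yes_logit, no_logit]) - target
```

The contrastive loss depends on how one candidate scores relative to the others in the same group, while the yes/no loss is computed for each pair independently, which is one way to see why the two objectives can weight updates differently.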

[107] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms

Xingmeng Zhao,Dan Schumacher,Veronica Rammouz,Anthony Rios

Main category: cs.CL

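TL;DR: Proposes a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before an AI system is deployed. In a user study, participants who read stories recognized a broader range of harm types, while those who did not focused mainly on privacy and well-being.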

Details

Motivation: Rapid, low-barrier AI development can introduce risks of bias, privacy violations, and unequal access, especially when real-world contexts and diverse user needs are ignored. Existing methods mostly detect risks automatically, which reduces human engagement in understanding how harms arise and whom they affect. Method: Propose a human-centered framework that generates user stories and supports multi-agent discussions to encourage creative thinking about potential benefits and harms before deployment, and evaluate it in a user study. Result: Participants who read user stories recognized a broader range of harms, with responses distributed more evenly across all 13 harm types; among those who did not read stories, 58.3% of responses focused on privacy and well-being. Conclusion: Storytelling helps participants speculate more comprehensively about the potential impact of AI and think more creatively about user-level effects, strengthening risk awareness and inclusiveness in AI design and deployment. Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.

[108] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou,Sohyun An,Haikang Deng,Da Yin,Clark Peng,Cho-Jui Hsieh,Kai-Wei Chang,Nanyun Peng

Main category: cs.CL

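TL;DR: Studies how multimodal generative models handle English dialect text input and finds that existing models degrade markedly on dialect input; proposes an encoder-based mitigation strategy that substantially improves dialect support without harming Standard American English performance.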

Details

Motivation: Investigate whether current multimodal generative models can handle text input containing dialect, and address the resulting drop in performance. Method: Build a large-scale benchmark spanning six common English dialects, collect and verify more than 4,200 unique prompts, and evaluate 17 image and video generative models; design an encoder-based mitigation strategy that improves dialect understanding while preserving Standard American English performance. Result: State-of-the-art multimodal generative models degrade by 32.26% to 48.17% when a single dialect word is used; common fine-tuning and prompt rewriting methods help only marginally (< 7%) and may harm standard English performance; the proposed method brings models such as Stable Diffusion 1.5 to near-parity with standard English on five dialects (+34.4%) with minimal impact on standard English performance. Conclusion: Current multimodal generative models have clear deficiencies on dialect input, and the proposed encoder strategy mitigates the problem, raising dialect performance without sacrificing standard-language performance. Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

[109] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia,Zhihan Zhang,Ignacio Cases,Zheyuan Liu,Meng Jiang,Peng Qi

Main category: cs.CL

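TL;DR: Proposes AutoRubric-R1V, a framework that combines reinforcement learning with automatically generated rubrics to provide process-level supervision for multimodal LLM reasoning, improving both reasoning faithfulness and performance.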

Details

Motivation: Existing reinforcement learning methods reward only final-answer correctness, which leads to spurious reasoning. Method: Use a self-aggregation method to distill consistent reasoning checkpoints from successful trajectories, construct problem-specific rubrics automatically, and train with rubric-based and outcome rewards jointly. Result: State-of-the-art performance on six multimodal reasoning benchmarks and substantially improved reasoning faithfulness. Conclusion: AutoRubric-R1V effectively addresses the unfaithful reasoning that arises in multimodal LLMs when training relies on outcome rewards alone. Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

[110] Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

Manar Abdelatty,Maryam Nouh,Jacob K. Rosenstein,Sherief Reda

Main category: cs.CL

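TL;DR: Pluto is a benchmark framework for evaluating the synthesis efficiency (area, delay, power) of LLM-generated Verilog code, comprising 114 problems with self-checking testbenches and Pareto-optimal reference implementations.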

Details

Motivation: Existing benchmarks focus mainly on functional correctness, do not comprehensively evaluate synthesis metrics such as area, delay, and power, and lack optimized baselines and verification testbenches. Method: Propose the Pluto benchmark framework, with 114 problems that have self-checking testbenches and multiple Pareto-optimal reference implementations, and evaluate LLM-generated Verilog for both functional correctness and synthesis efficiency. Result: State-of-the-art LLMs reach 78.3% functional correctness (pass@1) but still trail expert designs, with 63.8% area efficiency, 65.9% delay efficiency, and 64.0% power efficiency (eff@1). Conclusion: Efficiency-aware evaluation frameworks such as Pluto are needed to drive progress in hardware-focused LLM research. Abstract: Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.
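
As background for the pass@1 figure, the sketch below implements the unbiased pass@k estimator commonly used in code-generation evaluation: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether Pluto uses exactly this estimator is an assumption, and the eff@1 metric is not reproduced here because the abstract does not define it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes.

    n: samples generated per problem, c: samples that pass the testbench,
    k: evaluation budget. For k = 1 this reduces to c / n.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would average this value over all problems.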

[111] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Yunwen Li,Shuangshuang Ying,Xingwei Qu,Xin Li,Sheng Jin,Minghao Liu,Zhoufutu Wen,Tianyu Zheng,Xeron Du,Qiguang Chen,Jiajun Shi,Wangchunshu Zhou,Jiazhan Feng,Wanjun Zhong,Libo Qin,Stephen Huang,Wanxiang Che,Chenghua Lin,Eli Zhang

Main category: cs.CL

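TL;DR: Introduces COIG-Writer, a Chinese creative writing dataset that includes thought processes; shows that creative writing depends on the interplay of logical structure and linguistic expression, that creative capability is culturally bound, and that lexical diversity correlates negatively with creative quality.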

Details

Motivation: LLMs perform poorly at creative writing in non-English contexts, mainly because training data is scarce and lacks process-level supervision; a high-quality Chinese dataset that captures the creative thought process is needed to improve model creativity. Method: Systematically reverse-engineer high-quality texts to build COIG-Writer, a dataset of 1,665 triplets spanning 51 genres, each containing a reverse-engineered prompt, detailed creative reasoning, and the final text; evaluate how process supervision affects model performance under different data ratios. Result: Process supervision improves performance only when combined with general data (best ratio 1:12); Chinese creative capability does not transfer to English (an 89.26 pp gap); and higher lexical diversity goes with lower creative quality (the TTR paradox). Conclusion: Creative excellence arises from the interaction between logical scaffolding and linguistic grounding; increasing linguistic diversity or relying on cross-lingual transfer cannot substitute for process supervision, much as mathematical reasoning enhances but cannot replace linguistic competence. Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data, and a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance; below this threshold, the win rate progressively degrades (from 62.75% down to 35.78%), (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance), and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies.
These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
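
The "TTR paradox" refers to the type-token ratio, a standard lexical diversity measure: the number of distinct tokens divided by the total number of tokens. A minimal sketch follows; it assumes the text is already tokenized (Chinese text would need a word segmenter first, which is not shown), and the paper's exact tokenization and any length normalization are not stated in the abstract.

```python
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Distinct tokens divided by total tokens, in the range (0, 1]."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


# Toy example: "the" repeats, so 5 distinct types over 6 tokens.
ratio = type_token_ratio("the cat saw the dog run".split())
```

Because TTR tends to fall as texts get longer, comparisons are usually made between texts of similar length.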

[112] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Hwiyeol Jo,Joosung Lee,Jaehone Lee,Sang-Woo Lee,Joonsuk Park,Kang Min Yoo

Main category: cs.CL

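TL;DR: Proposes a framework called Answer Regeneration, which uses one additional model inference to make answer extraction for reasoning models more robust and to improve performance.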

Details

Motivation: Measured performance of reasoning models is sensitive to the answer extraction method and therefore unstable; a more reliable evaluation approach is needed. Method: Introduce the Answer Regeneration framework: run a second inference on the original input and output together with an "Answer:" prompt, and extract the final answer from the regenerated output. Result: The method is more robust and performs better on math problems and open-ended question answering, and does not depend on specific extraction rules. Conclusion: Answer Regeneration reduces the influence of the extraction algorithm on evaluation results and offers a more reliable way to evaluate generative models. Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
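
The second-pass idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `generate` is a hypothetical callable wrapping whatever model is being evaluated, and the prompt layout (question, then the first-pass output, then "Answer:") is one plausible reading of the abstract, not a confirmed template.

```python
from typing import Callable


def build_regeneration_prompt(question: str, reasoning_output: str) -> str:
    """Assumed layout: original input, first-pass output, then the answer cue."""
    return f"{question}\n{reasoning_output}\nAnswer:"


def regenerate_answer(
    question: str,
    reasoning_output: str,
    generate: Callable[[str], str],
) -> str:
    """Run one extra inference and return the short regenerated answer."""
    prompt = build_regeneration_prompt(question, reasoning_output)
    return generate(prompt).strip()
```

The point of the design is that the model itself restates its answer in a short, predictable form, so evaluation does not depend on hand-written rules for parsing a long reasoning trace.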

[113] Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models

Guinan Su,Yanwu Yang,Li Shen,Lu Yin,Shiwei Liu,Jonas Geiping

Main category: cs.CL

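TL;DR: Proposes an online test-time adaptation framework for Mixture-of-Experts (MoE) models that needs no external data and optimizes routing decisions through self-supervision on the input context, yielding clear gains on reasoning tasks.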

Details

Motivation: Distribution shift at deployment leads MoE models to make poor expert routing decisions, and existing test-time adaptation methods mostly depend on external data and target dense models, so they transfer poorly to MoE architectures. Method: Propose a data-free, online test-time adaptation framework that uses the already generated sequence for self-supervision and optimizes routing decisions during the prefill stage and at regular intervals; lightweight additive vectors update only the router logits in selected layers, which keeps computation efficient and prevents over-adaptation. Result: A 5.5% improvement on HumanEval with OLMoE and a 6% average gain on DeepSeek-V2-Lite when combined with self-consistency, with robustness to context shifts on reasoning tasks. Conclusion: The method requires no external data, is plug-and-play, is compatible with existing test-time scaling techniques, and improves the performance and adaptability of MoE models on complex reasoning tasks. Abstract: Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaptation. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.
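
The "lightweight additive vectors" mechanism can be pictured with a toy router. The sketch below only shows how a per-layer offset added to router logits can change which experts are selected while the router's own weights stay untouched; how the offset is optimized (the self-supervised step on the already generated sequence) is not shown. All names and numbers are hypothetical.

```python
from typing import List


def top_k_experts(router_logits: List[float], k: int) -> List[int]:
    """Indices of the k experts with the highest logits, best first."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]


def adapted_routing(router_logits: List[float], delta: List[float], k: int) -> List[int]:
    """Select experts after adding a learned offset to the frozen router's logits."""
    if len(router_logits) != len(delta):
        raise ValueError("delta must have one entry per expert")
    shifted = [logit + offset for logit, offset in zip(router_logits, delta)]
    return top_k_experts(shifted, k)


# Frozen router output for one token over four experts (made-up values).
logits = [2.0, 1.0, 0.5, 0.1]
before = top_k_experts(logits, k=2)

# A small offset, as test-time adaptation might learn, promotes expert 2.
delta = [0.0, 0.0, 1.0, 0.0]
after = adapted_routing(logits, delta, k=2)
```

Keeping the adaptation in a separate vector means it can be reset or re-optimized at each interval without modifying the pretrained router parameters.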

[114] Midtraining Bridges Pretraining and Posttraining Distributions

Emmy Liu,Graham Neubig,Chenyan Xiong

Main category: cs.CL

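TL;DR: Presents the first systematic study of the "midtraining" phase of language model pretraining, finding that it is most effective in the math and code domains, narrows the syntactic gap between pretraining and posttraining data, reduces forgetting, and outperforms continued pretraining.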

Details

Motivation: Midtraining is widely used in practice, but its mechanism and effectiveness are poorly understood scientifically; this work investigates its effect systematically through controlled experiments. Method: Pretrain language models from scratch, fine-tune them on supervised finetuning datasets in different domains, and analyze the effect of midtraining through controlled experiments; run ablations on the start time and data mixture weights using code as a case study. Result: Midtraining works best in the math and code domains, where it markedly lowers in-domain validation loss and reduces forgetting of pretraining knowledge, and it outperforms continued pretraining; ablations show that start time matters more than mixture weights, with earlier introduction of specialized data giving larger gains. Conclusion: Midtraining is an effective domain adaptation technique that, by reducing forgetting, outperforms continued pretraining in specific domains (especially math and code), and its timing is the key factor. Abstract: Recently, many language models have been pretrained with a "midtraining" phase, in which higher quality, often instruction-formatted data, is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised finetuning datasets in different domains. We find that when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss and pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights, with earlier introduction of specialized data yielding greater in-domain benefits as well as better preserving general language modeling. These findings establish midtraining as a domain adaptation technique that, compared to continued pretraining, yields better performance through reduced forgetting.

[115] From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Erwei Wang,Samuel Bayliss,Andra Bisca,Zachary Blair,Sangeeta Chowdhary,Kristof Denolf,Jeff Fifield,Brandon Freiberger,Erika Hunhoff,Phil James-Roxby,Jack Lo,Joseph Melber,Stephen Neuendorffer,Eddie Richter,Andre Rosti,Javier Setoain,Gagandeep Singh,Endri Taka,Pranathi Vasireddy,Zhewen Yu,Niansong Zhang,Jinming Zhuang

Main category: cs.CL

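TL;DR: MLIR-AIR is an open-source compiler stack built on MLIR that aims to bridge the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD NPUs. Its AIR dialect enables explicit orchestration of compute and data with asynchronous, hierarchical operations; it approaches hand-optimized performance on matrix multiplication and expresses a fused LLaMA 2 multi-head attention block in about 150 lines of code.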

Details

Motivation: General-purpose compilers abstract away parallelism, locality, and synchronization, which makes it hard to exploit the performance potential of modern spatial architectures; finer-grained control over data movement, execution order, and compute placement is needed. Method: Build the MLIR-AIR compiler stack and introduce the AIR dialect, which provides structured, asynchronous, and hierarchical representations of operations across compute and memory resources, and achieve efficient mapping through compiler-managed spatial scheduling, tiling, and communication overlap. Result: On matrix multiplication, MLIR-AIR reaches up to 78.7% compute efficiency, with performance close to a hand-optimized implementation in the lower-level MLIR-AIE framework; for multi-head attention, a fused implementation takes only about 150 lines of code and maps efficiently to spatial hardware. Conclusion: Through explicit compiler-managed scheduling, MLIR-AIR turns high-level control flow into spatial programs that use the NPU's compute resources and memory hierarchy efficiently, showing its potential for targeting modern spatial architectures with complex workloads. Abstract: General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware.
MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.

[116] Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Xujun Peng,Anoop Kumar,Jingyu Wu,Parker Glenn,Daben Liu

Main category: cs.CL

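TL;DR: Proposes a new approach that combines synthetic data generation, triplet loss, and layer-wise model merging to substantially improve the consistency of LLM outputs in RAG systems.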

Details

Motivation: Existing large language models produce inconsistent outputs for semantically equivalent inputs, and both consistency-focused training data and effective fine-tuning techniques are lacking. Method: Systematically generate synthetic data, use triplet loss to improve the embedding representations, and propose a layer-wise model merging method based on intermediate-layer activations that combines the knowledge of specialized models. Result: The merged model improves response similarity by about 47.5% over the baseline, substantially increasing output consistency. Conclusion: The method offers an effective and practical way to improve the reliability of industrial RAG systems. Abstract: Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.
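Two ingredients named above can be sketched briefly. The first function is the standard triplet margin loss; the second is a hypothetical per-layer weighted average. The abstract does not say how the consistency-aware weights are computed from activations, so `weights` is simply taken as given here, and `merge_layer` is an illustrative name, not the authors' API.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def merge_layer(layer_params, weights):
    """Hypothetical merge of ONE layer across several models.

    `layer_params` holds that layer's flattened parameters for each model;
    `weights` stands in for the consistency-aware weights (assumed given).
    Returns the weight-normalized average.
    """
    total = sum(weights)
    size = len(layer_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, layer_params)) / total
        for i in range(size)
    ]
```

Applying `merge_layer` once per layer, with a different weight vector for each layer, gives a layer-wise merge in the loose sense used here.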

[117] Predicting Task Performance with Context-aware Scaling Laws

Kyle Montgomery,David Park,Jianhong Tu,Michael Bendersky,Beliz Gunel,Dawn Song,Chenguang Wang

Main category: cs.CL

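TL;DR: Proposes a new interpretable framework that jointly models downstream task performance as a function of training compute and context length, and validates it across several tasks and models.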

Details

Motivation: Conventional scaling laws do not capture the effect of context on downstream task performance, so a method is needed that predicts downstream performance from both training compute and context. Method: Propose a simple, interpretable framework that models downstream performance as a function of training compute and input context, and fit it empirically on extended-context variants of Llama-2-7B and Llama-2-13B using 65,500 instances across three tasks. Result: The framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context grows. Conclusion: Training compute and context utilization interact in important ways, and the study offers guidance for designing more efficient long-context LLMs. Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.

[118] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin

Main category: cs.CL

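TL;DR: The study evaluates machine learning models for mental health screening on real-world semi-structured interview data, using 553 samples with ground-truth diagnostic labels. Zero-shot prompting of LLMs (such as GPT-4.1 Mini and MetaLLaMA) and LoRA-fine-tuned RoBERTa models all exceed 80% accuracy in detecting depression, anxiety, and PTSD, reaching 89% accuracy and 98% recall on PTSD, which suggests that AI models could make early mental health diagnosis more accessible and accurate.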

Details

Motivation: Because of subjective assessments, limited clinical resources, stigma, and low awareness, conditions such as depression, anxiety, and post-traumatic stress disorder (PTSD) are often underdiagnosed or misdiagnosed; in primary care, providers misidentify depression or anxiety in over 60% of cases, so scalable, accessible, and context-aware diagnostic aids are urgently needed. Method: The study uses 553 real-world semi-structured interviews with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD, and compares several model classes: zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, and RoBERTa models fine-tuned with Low-Rank Adaptation (LoRA); it also analyzes how context length affects performance. Result: All models exceed 80% accuracy across diagnostic categories, with the best results on PTSD (up to 89% accuracy and 98% recall); shorter, focused context segments improve recall; LoRA fine-tuning remains competitive at lower ranks (e.g., rank 8 and 16), showing that it is both efficient and effective. Conclusion: LLM-based screening can substantially improve on traditional self-report tools and could be deployed in low-resource or high-stigma settings, offering a practical route to integrating machine learning into clinical workflows and to low-barrier, AI-driven early intervention in mental health. Abstract: Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics.
Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

[119] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Wenkai Yang,Weijie Liu,Ruobing Xie,Yiju Guo,Lulu Wu,Saiyong Yang,Yankai Lin

Main category: cs.CL

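TL;DR: Proposes LaSeR, a reinforcement learning method that uses the self-rewarding score at the last token of a generated solution to jointly optimize an LLM's reasoning and self-rewarding capabilities, improving both reasoning performance and efficiency.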

Details

Motivation: Existing methods that combine reinforcement learning with verifiable rewards lack verification signals at test time, and they need two separate prompt templates to generate solutions and self-verifications, which is inefficient; a more efficient way to unify reasoning and verification in a single model is needed. Method: A theoretical analysis shows that the closed-form solution of the self-verification objective reduces to the self-rewarding score at the solution's last token. Building on this, LaSeR adds an MSE loss to the original RLVR loss so that the last-token self-rewarding scores align with verifier-based reasoning rewards, jointly optimizing reasoning and self-rewarding. Result: Experiments show that the method improves reasoning performance and gives the model strong self-rewarding capability, boosting inference-time scaling at the cost of only one additional token inference. Conclusion: LaSeR jointly optimizes reasoning and self-rewarding through a simple and effective mechanism, improving LLM reasoning and test-time performance with very low computational overhead. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance.
Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
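The score described in the abstract can be written out directly from its wording: the log-probability of a pre-specified token at the solution's last position, minus a pre-calculated constant, scaled by the KL coefficient. The sketch below follows that sentence and adds the MSE alignment term. All parameter names are illustrative, and the numbers are made up for the example rather than taken from the paper.

```python
def last_token_self_reward(logprob_special_token, ref_constant, kl_coef):
    """Self-rewarding score as worded in the abstract.

    logprob_special_token: policy log-probability of the pre-specified token,
        read from the next-token distribution at the solution's last token.
    ref_constant: the pre-calculated constant (assumed given here).
    kl_coef: the KL coefficient used as the scale factor.
    """
    return kl_coef * (logprob_special_token - ref_constant)


def mse_alignment_loss(self_rewards, verifier_rewards):
    """Mean squared error between self-rewarding scores and verifier rewards."""
    pairs = list(zip(self_rewards, verifier_rewards))
    return sum((s - v) ** 2 for s, v in pairs) / len(pairs)


# Illustrative values only: two sampled solutions, one verified correct (1.0)
# and one verified wrong (0.0).
scores = [
    last_token_self_reward(-2.0, ref_constant=-12.0, kl_coef=0.1),
    last_token_self_reward(-12.0, ref_constant=-12.0, kl_coef=0.1),
]
loss = mse_alignment_loss(scores, [1.0, 1.0])
```

In training, a term like `loss` would be added to the RLVR objective. The weighting between the two terms is not stated in the abstract and is left out of this sketch.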

[120] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Yuxing Lu,Xukai Zhao,J. Ben Tamo,Micky C. Nnamdi,Rui Peng,Shuang Zeng,Xingyu Hu,Jinzhuo Wang,May D. Wang

Main category: cs.CL

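TL;DR: Introduces MetaBench, the first benchmark for evaluating large language models in metabolomics. It covers five key capabilities (knowledge, understanding, grounding, reasoning, and research) and exposes the limits of current models on cross-database identifier grounding and on sparsely annotated metabolites.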

Details

Motivation: Large language models perform well on general text, but their ability in scientific domains that require deep, interconnected knowledge, such as metabolomics, is unclear, so a dedicated benchmark is needed. Method: Build the MetaBench benchmark from authoritative public resources and systematically evaluate 25 open- and closed-source LLMs on five core metabolomics capabilities. Result: Models do well on text generation tasks but degrade on cross-database identifier grounding and on long-tail metabolites, and retrieval augmentation does little to close the gap. Conclusion: MetaBench provides important infrastructure for developing and evaluating AI systems for metabolomics and supports progress toward reliable computational tools. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

[121] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Guoqing Wang,Sunhao Dai,Guangze Ye,Zeyu Gan,Wei Yao,Yong Deng,Xiaofeng Wu,Zhenzhe Ying

Main category: cs.CL

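TL;DR: Proposes Information Gain-based Policy Optimization (IGPO) for training multi-turn LLM agents. It derives dense, intrinsic turn-level rewards from the model's own belief updates and improves both accuracy and sample efficiency.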

Details

Motivation: Existing reinforcement learning methods rely on sparse outcome rewards in multi-turn tasks, which leads to advantage collapse and poor credit assignment and makes LLM-based agents hard to train effectively. Method: Model each interaction turn as an incremental acquisition of information about the ground-truth answer, use the marginal increase in the policy's probability of producing the correct answer as the turn-level reward, and combine it with outcome-level supervision to form dense reward trajectories. Result: Experiments on several in-domain and out-of-domain benchmarks show that IGPO outperforms strong baselines in multi-turn settings, with higher accuracy and better sample efficiency. Conclusion: IGPO's intrinsic information-gain rewards address reward sparsity in multi-turn reinforcement learning and offer an effective approach to training LLM agents efficiently. Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
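The turn-level reward defined above (the marginal increase in the policy's probability of producing the correct answer) is easy to illustrate. In this sketch the probabilities are supplied as plain numbers. The way the outcome reward is attached, as one extra final entry, is an assumption made for illustration, since the abstract only says the two signals are combined.

```python
def information_gain_rewards(answer_probs, outcome_reward):
    """Build a dense reward trajectory from belief updates.

    answer_probs[0] is the policy's probability of the ground-truth answer
    before any interaction; answer_probs[t] is the same quantity after turn t.
    Each turn is rewarded with the marginal increase in that probability.
    Assumption: the outcome reward is appended as the final-answer step.
    """
    gains = [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]
    return gains + [outcome_reward]


# Three tool-use turns followed by a correct final answer. The second turn
# retrieved something misleading, so its information gain is negative.
rewards = information_gain_rewards([0.10, 0.30, 0.25, 0.70], outcome_reward=1.0)
```

Even if every rollout in a group ends with the same outcome reward, the per-turn gains still differ between rollouts, which is the learning signal that the advantage-collapse case lacks.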

[122] LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Yiming Wang,Da Yin,Yuedong Cui,Ruichen Zheng,Zhiqian Li,Zongyu Lin,Di Wu,Xueqing Wu,Chenchen Ye,Yu Zhou,Kai-Wei Chang

Main category: cs.CL

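TL;DR: Proposes UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale, and UI-Simulator-Grow for data-efficient scaling; on WebArena and AndroidWorld the resulting agents perform on par with much larger models.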

Details

Motivation: Digital agents need large numbers of diverse UI trajectories to generalize to real-world tasks, but collecting such data is too expensive, so a low-cost, scalable synthesis method is needed. Method: Propose UI-Simulator, which combines a digital world simulator that generates diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality trajectories; further propose UI-Simulator-Grow, which prioritizes high-impact tasks to improve data efficiency. Result: On WebArena and AndroidWorld, agents trained with UI-Simulator match or outperform open-source agents trained on real UIs and are more robust; UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct. Conclusion: UI-Simulator and its growth strategy show that a targeted synthesis scaling paradigm can enhance digital agents efficiently, offering a practical way to reduce dependence on collected training data. Abstract: Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in terms of human annotation, infrastructure, and engineering. To this end, we introduce UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose UI-Simulator-Grow, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizing informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of the targeted synthesis scaling paradigm to continuously and efficiently enhance digital agents.

[123] TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Yinxi Li,Yuntian Deng,Pengyu Nie

Main category: cs.CL

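TL;DR: Introduces TokDrift, a framework for measuring problems caused by the mismatch between tokenization and grammar in code LLMs. Even minor formatting changes substantially alter model behavior; the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries, which points to the need for grammar-aware tokenization.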

Details

Motivation: The subword tokenizers used by current code LLMs (such as BPE) are driven by statistics rather than grammar, so semantically identical code can be tokenized differently because of surface differences such as whitespace or naming, which hurts reliability; the impact of this inconsistency needs to be studied. Method: Propose TokDrift, which applies semantic-preserving rewrite rules to generate code variants that differ only in tokenization, measure behavior changes on nine code LLMs, and run a layer-wise analysis to locate the source of the problem. Result: Even minor formatting changes cause substantial shifts in model behavior; the problem originates in the early embeddings, where subword segmentation does not align with grammar token boundaries. Conclusion: The mismatch between tokenization and grammar is a hidden obstacle to reliable code LLMs, and future models should adopt grammar-aware tokenization. Abstract: Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
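The gap between grammar tokens and surface text can be seen with Python's own lexer. The sketch below is not TokDrift itself. It only checks that a whitespace rewrite leaves the grammar-level token sequence unchanged, and it uses a deliberately naive space-splitting function as a stand-in for a statistics-driven subword tokenizer (a real BPE vocabulary would be needed to reproduce the paper's setting).

```python
import io
import tokenize

# Layout-only token types that carry no grammar content for this comparison.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def grammar_tokens(source):
    """Token strings produced by Python's lexer, ignoring layout tokens."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _LAYOUT
    ]


def naive_surface_chunks(source):
    """Toy stand-in for a whitespace-sensitive tokenizer: split on spaces."""
    return source.strip().split(" ")


original = "x=a+b\n"
rewritten = "x = a + b\n"  # semantic-preserving: only whitespace changed

same_grammar = grammar_tokens(original) == grammar_tokens(rewritten)
same_surface = naive_surface_chunks(original) == naive_surface_chunks(rewritten)
```

Here `same_grammar` holds while `same_surface` does not, which is the kind of divergence the rewrite rules are designed to expose.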

[124] Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri,Mukul Ranjan,Zhiqiang Shen

Main category: cs.CL

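TL;DR: Proposes Elastic-Cache, a training-free, architecture-agnostic strategy for adaptively recomputing key-value caches. Using attention-aware drift detection and a depth-aware refresh schedule, it substantially speeds up decoding in diffusion LLMs while preserving generation quality.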

Details

Motivation: Existing methods recompute QKV for all tokens at every step and every layer, which wastes a great deal of computation, especially since KV states in shallow layers change little; a more efficient cache update mechanism is needed. Method: Based on observations about the role of MASK tokens, the growth of KV dynamics with depth, and the fact that the most-attended token shows the smallest KV drift, Elastic-Cache uses an attention-aware drift test on the most-attended token to decide when to refresh, and a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches. Result: On the LLaDA model family, the method achieves an 8.7x speedup on GSM8K (256 tokens), 45.1x on longer sequences, and 4.8x on HumanEval while maintaining higher accuracy than the baseline; it delivers 6.8x higher throughput than confidence-based approaches while preserving generation quality. Conclusion: Elastic-Cache's adaptive, layer-aware cache updates cut redundant computation in diffusion LLM decoding, giving large speedups while maintaining or even improving accuracy, and make practical deployment of these models more feasible. Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality.
Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
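The "when" and "where" decisions can be pictured with a small sketch. The abstract does not give the exact drift test, so cosine similarity and the threshold value below are assumptions; the function names are likewise illustrative rather than the authors' implementation. The sketch also assumes non-zero state vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_attended_index(attention_mass):
    """Index of the token that receives the most attention."""
    return max(range(len(attention_mass)), key=lambda i: attention_mass[i])


def layers_to_recompute(prev_state, curr_state, num_layers, start_layer, threshold=0.9):
    """Decide when and where to refresh the cache.

    prev_state / curr_state: a state vector of the most-attended token at two
    consecutive denoising steps (what exactly is compared is an assumption).
    When: refresh only if the similarity falls below `threshold`.
    Where: recompute from `start_layer` onward and reuse the shallower caches.
    """
    if cosine_similarity(prev_state, curr_state) >= threshold:
        return []  # little drift: reuse every cached layer
    return list(range(start_layer, num_layers))
```

Because the most-attended token is described as drifting least, a drift test on that token is conservative: if it has moved noticeably, other tokens have likely moved at least as much.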

cs.CV

[125] MultiFoodChat: A potential new paradigm for intelligent food quality inspection

Yue Hu,Guohang Zhuang

Main category: cs.CV

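TL;DR: Proposes MultiFoodChat, a zero-shot food recognition framework based on multi-agent dialogue reasoning. It combines vision-language models and large language models to classify food images accurately without additional training or annotation.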

Details

Motivation: Existing supervised models depend on large labeled datasets and generalize poorly to unseen food categories, which limits their practical use. Method: Build the MultiFoodChat framework, in which vision-language models and large language models hold multi-round visual-textual dialogues; an Object Perception Token (OPT) captures fine-grained visual attributes, and an Interactive Reasoning Agent (IRA) dynamically interprets context to refine predictions. Result: On several public food datasets, MultiFoodChat outperforms existing unsupervised and few-shot methods in both recognition accuracy and interpretability. Conclusion: MultiFoodChat offers a new paradigm for zero-shot food recognition with broad potential in intelligent food quality inspection and analysis. Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[126] Post-surgical Endometriosis Segmentation in Laparoscopic Videos

Andreas Leibetseder,Klaus Schoeffmann,Jörg Keckstein,Simon Keckstein

Main category: cs.CV

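TL;DR: Presents a system that segments dark endometrial implants in laparoscopic surgery videos, marks the implant regions with multi-colored overlays, and provides a detection summary to help gynecologists diagnose endometriosis.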

Details

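Motivation: Endometriosis shows varied visual appearances at many locations inside the body, which makes it hard to identify and error-prone, especially for non-specialist practitioners, so an assistive tool is needed to improve diagnostic accuracy. Method: Develop a trained system that analyzes laparoscopic surgery videos, automatically segments the common dark endometrial implants, annotates implant regions with multi-colored overlays, and generates a detection summary to make video browsing easier. Result: The system identifies and annotates dark endometrial implants and provides visual analysis results and a detection summary that help physicians review surgical video more efficiently. Conclusion: The system gives gynecologists a useful visual aid for identifying endometrial implants more accurately and efficiently, and has potential for clinical use. Abstract: Endometriosis is a common women's condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.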

[127] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua,Argyrios Zolotas,Miguel Arana-Catania

Main category: cs.CV

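TL;DR: Studies the combination of traditional vision models such as YOLO with vision-language models (VLMs) such as LLaVA, ChatGPT, and Gemini to improve aircraft detection and scene understanding in remote sensing imagery. Experiments show clear gains in detection accuracy and contextual understanding on labelled data, unlabelled data, and degraded images.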

Details

Motivation: Traditional vision models depend on large amounts of labelled data and struggle to understand context in complex environments, while generalist vision-language models are underexplored in remote sensing, so methods that combine the strengths of both are worth investigating. Method: Combine the YOLO object detector with VLMs such as LLaVA, ChatGPT, and Gemini, using YOLO for localization and the VLM for contextual reasoning and semantic understanding, and evaluate on labelled, unlabelled, and degraded remote sensing images. Result: For aircraft detection and counting, MAE improves by 48.46% on average across models and CLIPScore improves by 6.17%, with the largest benefit under challenging conditions. Conclusion: Combining traditional vision models with vision-language models improves remote sensing image analysis, especially in few-shot learning scenarios, and points to a new direction for intelligent interpretation of remote sensing imagery. Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.
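For readers unfamiliar with the counting metric, the sketch below computes the mean absolute error (MAE) of per-image aircraft counts and a relative improvement between two MAE values. Reading the reported 48.46% as a relative reduction in MAE is an assumption, and the counts in the example are made up for illustration.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true counts per image."""
    pairs = list(zip(predicted_counts, true_counts))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline (assumed reading)."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Made-up counts for four images: a VLM alone versus the same VLM given
# detector output as additional context.
true_counts = [4, 7, 2, 10]
baseline = mean_absolute_error([6, 4, 2, 13], true_counts)
combined = mean_absolute_error([5, 6, 2, 11], true_counts)
gain = relative_improvement(baseline, combined)
```

A lower MAE means the predicted counts sit closer to the true counts, so a positive `gain` indicates that the combined pipeline counts more accurately than the baseline.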

[128] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Kelvin Szolnoky,Anders Blilie,Nita Mulliqi,Toyonori Tsuzuki,Hemamali Samaratunga,Matteo Titus,Xiaoyi Ji,Sol Erika Boman,Einar Gudlaugsson,Svein Reidar Kjosavik,José Asenjo,Marcello Gambacorta,Paolo Libretti,Marcin Braun,Radisław Kordek,Roman Łowicki,Brett Delahunt,Kenneth A. Iczkowski,Theo van der Kwast,Geert J. L. H. van Leenders,Katia R. M. Leite,Chin-Chen Pan,Emiel Adrianus Maria Janssen,Martin Eklund,Lars Egevad,Kimmo Kartasalo

Main category: cs.CV

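TL;DR: The study develops and validates an AI-based deep learning model (EfficientNetV2-S with multiple instance learning) that automatically detects cribriform morphology in prostate cancer, aiming to improve diagnostic consistency and accuracy.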

Details

Motivation: Cribriform morphology is an important histological feature indicating poor prognosis in prostate cancer, but it is underreported and subject to large interobserver variability among pathologists, so more reliable and standardized detection is needed. Method: An EfficientNetV2-S encoder is combined with multiple instance learning for end-to-end whole-slide classification, trained on 640 digitised prostate core needle biopsies from 430 patients across three cohorts; the model is validated internally (261 slides from 171 patients) and externally (266 slides from 104 patients in three independent cohorts) and compared against nine expert uropathologists. Result: The model performs strongly in internal validation (AUC 0.97, kappa 0.81) and remains robust in external validation (AUC 0.90, kappa 0.55); in the comparison with nine experts it achieves the highest average agreement (kappa 0.66), ahead of every pathologist (kappa 0.35 to 0.62). Conclusion: The AI model performs at or above the level of expert pathologists and could improve the accuracy and consistency of cribriform detection, supporting standardized reporting and better treatment decisions. Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model's performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen's kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen's kappa: 0.55, 95% CI: 0.45-0.64).
In our inter-rater analysis, the model achieved the highest average agreement (Cohen's kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen's kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.
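The agreement figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the statistic for two raters follows; the labels in the example are invented and unrelated to the study's data.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    p_o is the observed agreement rate; p_e is the agreement expected if the
    raters labelled independently with their own marginal frequencies.
    Undefined (division by zero) when p_e equals 1, i.e. both raters always
    use the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Invented example: 1 = cribriform present, 0 = absent, for four slides.
model_calls = [1, 1, 0, 0]
pathologist_calls = [1, 0, 0, 0]
kappa = cohens_kappa(model_calls, pathologist_calls)
```

Here the raters agree on three of four slides (p_o = 0.75), chance agreement is p_e = 0.5, and kappa comes out at 0.5, which is why kappa values read lower than raw agreement rates.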

[129] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan,Jianing Li,Wei Chen,Mingkun Zhang,Xueqi Cheng

Main category: cs.CV

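TL;DR: Proposes NAPPure, an extended adversarial purification framework for non-additive perturbations (such as blur, occlusion, and distortion). It separates the clean image from the perturbation parameters through likelihood maximization and significantly improves the robustness of image classifiers on GTSRB and CIFAR-10.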

Details

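Motivation: Existing adversarial purification methods are designed mainly for additive perturbations and work poorly against non-additive perturbations that are common in the real world (such as blur, occlusion, and distortion), so a method that also handles non-additive perturbations is needed. Method: Model the generation process of an adversarial image and disentangle the underlying clean image and the perturbation parameters through likelihood maximization, which purifies non-additive perturbations effectively. Result: Experiments on GTSRB and CIFAR-10 show that NAPPure significantly improves the robustness of image classification models against non-additive adversarial perturbations. Conclusion: NAPPure is an effective extended adversarial purification framework that handles non-additive perturbations and improves the safety of models in practical use. Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.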

[130] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen,Wenxuan Zhang,Jun Chen,Mohamed Elhoseiny

Main category: cs.CV

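TL;DR: Proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework that improves large video language models on long video understanding. Structured semantic graphs and an intermediate reasoning step address the disrupted temporal dependencies and noise that affect conventional RAG.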

Details

Motivation: Limited context windows and the difficulty of retaining long-term sequential information make long videos challenging for large video language models; applying retrieval-augmented generation (RAG) directly breaks temporal dependencies and introduces irrelevant information, which harms reasoning accuracy. Method: Propose the Vgent framework: (1) represent a video as a structured graph that preserves semantic relationships across clips to improve retrieval; (2) add an intermediate reasoning step that uses structured verification to reduce retrieval noise and explicitly aggregates relevant information across clips. Result: Evaluated with several open-source LVLMs on three long-video understanding benchmarks, Vgent improves on the base models by 3.0% to 5.4% on MLVU and outperforms existing video RAG methods by 8.6%. Conclusion: Through graph-structured modeling and intermediate reasoning, Vgent makes LVLMs more accurate and context-aware on long video understanding tasks and offers an effective approach to video RAG. Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[131] Synchronization of Multiple Videos

Avihai Naaman,Ron Shapira Weber,Oren Freifeld

Main category: cs.CV

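TL;DR: Proposes TPL, a prototype-based temporal alignment framework for synchronizing multiple videos from different scenes or from generative AI; it builds a shared 1D representation to improve alignment accuracy, efficiency, and robustness.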

Details

Motivation: Conventional multi-camera synchronization relies mainly on simple time shifts and cannot handle the nonlinear temporal misalignment between videos of different scenes or generative AI videos, so a more robust cross-video temporal alignment method is needed. Method: Temporal Prototype Learning (TPL) takes high-dimensional embeddings from pretrained models and builds a shared, compact 1D prototype sequence that anchors key action phases, avoiding exhaustive pairwise matching and aligning multiple videos efficiently. Result: Experiments show that TPL improves synchronization accuracy, efficiency, and robustness across several datasets, performs especially well on fine-grained frame retrieval and phase classification, and is the first approach to mitigate synchronization issues among multiple generative AI videos. Conclusion: TPL is a general and effective framework for multi-video temporal alignment that handles diverse sources, including real-world and generative AI videos, and offers a new approach to video synchronization in complex settings. Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/

[132] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin,Guy Adam,Oded Krams,Zohar Barzelay,Eran Guendelman,Michael Schwarz,Moran Vatelmacher,Yigal Shenkman,Eli Peker,Itai Druker,Uri Patish,Yoav Blum,Max Bluvstein,Junxuan Li,Rawal Khirodkar,Shunsuke Saito

Main category: cs.CV

TL;DR: Proposes a zero-shot "Capture, Canonicalize, Splat" pipeline that generates high-fidelity, identity-preserving 3D avatars from unstructured phone images.

Details

Motivation: Existing methods fall short in geometric consistency, high-frequency detail, and identity preservation, which makes realistic and identity-accurate 3D avatar generation difficult. Method: A generative canonicalization module converts multiple unstructured views into a standardized representation, and a transformer-based model is trained on a high-fidelity Gaussian splatting dataset built from dome captures of real people. Result: The method produces static quarter-body 3D avatars with compelling realism and strong identity preservation from a few unstructured photos. Conclusion: Without any fine-tuning, the proposed pipeline generates high-quality, identity-preserving 3D avatars with markedly improved realism and robustness. Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[133] cubic: CUDA-accelerated 3D Bioimage Computing

Alexandr A. Kalinin,Anne E. Carpenter,Shantanu Singh,Matthew J. O'Meara

Main category: cs.CV

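TL;DR: cubic is an open-source Python library that extends the SciPy and scikit-image APIs with GPU-accelerated functionality from CuPy and RAPIDS cuCIM for efficient, scalable analysis of 2D and 3D bioimages; its device-agnostic dispatch delivers substantial speedups while preserving algorithmic fidelity.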

Details

Motivation: Existing bioimage analysis tools are limited in scalability, efficiency, GPU acceleration support, and workflow integration, and struggle with the large datasets produced by modern microscopy. Method: cubic is an open-source Python library with a device-agnostic API that dispatches each operation to CPU or GPU according to where the data reside and stays compatible with SciPy and scikit-image, so a broad range of image processing tasks is accelerated seamlessly. Result: Both benchmarks of individual operations and reproductions of deconvolution and segmentation pipelines show substantial speedups with consistent algorithmic results, validating the approach on 2D and 3D data. Conclusion: cubic provides a solid foundation for scalable, reproducible bioimage analysis that integrates well with the Python scientific computing ecosystem and supports both interactive exploration and automated high-throughput workflows. Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programming interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic's API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. 
These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github.com/alxndrkalinin/cubic
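The device-agnostic dispatch described in this abstract can be pictured with a small sketch. This is not cubic's API: `backend_name`, `dispatch`, and the registry layout are hypothetical names chosen for illustration, and the only assumption about real libraries is that a GPU array's type lives in the `cupy` package.

```python
def backend_name(array):
    """Guess the backend from the top-level module of the array's type.

    Assumption: GPU arrays are CuPy arrays; everything else runs on CPU.
    """
    root = type(array).__module__.split(".")[0]
    return "gpu" if root == "cupy" else "cpu"


def dispatch(op_name, array, registry):
    """Run the implementation of `op_name` registered for the array's backend."""
    return registry[backend_name(array)][op_name](array)


# Hypothetical registry: one implementation of each operation per backend.
registry = {
    "cpu": {"total": sum},
    "gpu": {"total": lambda a: a.sum()},
}

print(dispatch("total", [1, 2, 3], registry))  # plain list -> CPU path
```

The caller never names a device; the data's location selects the implementation, which is the property the abstract attributes to cubic.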

[134] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Yuancheng Xu,Wenqi Xian,Li Ma,Julien Philip,Ahmet Levent Taşel,Yiwei Zhao,Ryan Burgert,Mingming He,Oliver Hermann,Oliver Pilarski,Rahul Garg,Paul Debevec,Ning Yu

Main category: cs.CV

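TL;DR: This paper proposes a video diffusion framework that achieves multi-view character consistency and 3D camera control through a novel customization data pipeline, using 4D Gaussian Splatting re-rendering and video relighting to improve generation quality and controllability for virtual production.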

Details

Motivation: Achieve multi-view character consistency and precise 3D camera control in video diffusion models at the same time, to meet the demands of virtual production. Method: Build a data pipeline of volumetric capture performances re-rendered with diverse camera trajectories and lighting variation, generate the training data with 4D Gaussian Splatting (4DGS) and a video relighting model, and fine-tune state-of-the-art open-source video diffusion models on it. Result: Delivers strong multi-view identity preservation, precise camera control, and lighting adaptability; supports multi-subject generation (joint training and noise blending), scene and real-life video customization, and control over motion and spatial layout; and improves video quality and personalization accuracy. Conclusion: The framework advances the integration of video generation into virtual production and offers a practical solution for multi-view-consistent, controllable video generation. Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

[135] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura,Shota Orihashi,Mana Ihori,Tomohiro Tanaka,Naoki Makishima,Taiga Yamane,Naotaka Kawata,Satoshi Suzuki,Taichi Katayama

Main category: cs.CV

TL;DR: Proposes a method that jointly models the Big Five and HEXACO to automatically recognize apparent personality traits from multimodal human behavior.

Details

Motivation: Most prior work focuses on the Big Five and overlooks HEXACO, which can assess traits such as Honesty-Humility; the relationship between the two under machine-learning modeling is also unclear. Method: Recognize the Big Five and HEXACO simultaneously through joint optimization, validated experimentally on a self-introduction video dataset. Result: Experiments show that the proposed method effectively recognizes both Big Five and HEXACO traits. Conclusion: Jointly modeling the Big Five and HEXACO helps improve the understanding of multimodal human behavior, particularly for apparent personality recognition. Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO, which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to jointly optimize the recognition of the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[136] LOTA: Bit-Planes Guided AI-Generated Image Detection

Hongsong Wang,Renxi Cheng,Yang Zhang,Chaolei Han,Jie Gui

Main category: cs.CV

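TL;DR: This paper proposes a bit-plane-based method for generating noise images and detecting AI-generated images, efficiently distinguishing them from real ones.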

Details

Motivation: Existing AI-generated image detectors based on reconstruction error are computationally expensive and struggle to capture the intrinsic noise features of the raw image. Method: Extract noise features with bit-plane image processing combined with several normalization strategies; design maximum gradient patch selection to amplify the noise signal; and propose a lightweight classification head in two forms, a noise-based classifier and a noise-guided classifier. Result: Achieves 98.9% average accuracy on the GenImage benchmark, 11.9% higher than existing methods, with strong cross-generator performance (over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN); error extraction runs at the millisecond level, nearly a hundred times faster than existing methods. Conclusion: The method detects AI-generated images with high accuracy, strong generalization, and very high efficiency, clearly outperforming existing techniques. Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (↑11.9%) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
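To make the bit-plane idea concrete, here is a minimal sketch of extracting a single bit plane and a rescaled low-bit "noise map" from an 8-bit grayscale image. It illustrates bit-plane slicing in general, not the LOTA implementation; the function names, the choice of two planes, and the linear rescaling are assumptions.

```python
def bit_plane(image, k):
    """Return bit plane k (0 = least significant) of an 8-bit grayscale image."""
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def low_bit_noise_map(image, num_planes=2):
    """Keep only the lowest `num_planes` bits of each pixel, rescaled to 0..255.

    Assumption: a simple linear scaling stands in for the normalization step.
    """
    mask = (1 << num_planes) - 1
    scale = 255 // mask
    return [[(pixel & mask) * scale for pixel in row] for row in image]


image = [[200, 201], [13, 14]]      # toy 2x2 image, rows of pixel values
print(bit_plane(image, 0))          # [[0, 1], [1, 0]]
print(low_bit_noise_map(image))     # [[0, 85], [85, 170]]
```

Both operations are per-pixel bit masks, which is consistent with the abstract's claim that extraction is cheap compared with reconstruction-based errors.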

[137] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta,Tanvi Ranga,Chengzhe Sun,Siwei Lyu

Main category: cs.CV

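TL;DR: Proposes PIA, a new multimodal audio-visual framework for detecting deepfakes produced by generative models, combining speech, facial dynamics, and identity features to improve detection.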

Details

Motivation: Conventional detectors struggle with deepfake videos produced by advanced generative models such as GANs and diffusion models, because they rely on hand-designed rules or unimodal cues and miss subtle temporal inconsistencies. Method: The Phoneme-Temporal and Identity-Dynamic Analysis (PIA) framework fuses phoneme sequences, lip geometry data, and advanced facial identity embeddings for multimodal deepfake detection. Result: The method identifies the subtle manipulation traces in modern deepfake videos more effectively and significantly outperforms conventional detection methods. Conclusion: Through coordinated multimodal analysis, PIA shows superior performance on deepfakes made with advanced generative models and offers a new approach to the threat of manipulated media. Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[138] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

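TL;DR: This paper proposes event interval modulation (EIM), a new modulation scheme designed for event-based optical camera communication (OCC) that conveys information in the time intervals between events and substantially raises the transmission rate. Experiments achieved 28 kbps over 10 meters and 8.4 kbps over 50 meters indoors, a new record for event-based OCC systems.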

Details

Motivation: Conventional OCC systems built on frame-based cameras suffer from low bit rates and high processing load, while existing OCC systems built on event-based vision sensors (EVS) have not fully exploited the asynchronous response and high dynamic range of the EVS and lack a purpose-built modulation scheme. Method: Propose the EIM scheme, build a theoretical model of it, and tune the EVS parameters for EIM; determine the maximum usable modulation order experimentally and then run transmission experiments at different distances. Result: Achieves data transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment, well above existing event-based OCC systems. Conclusion: EIM exploits the characteristics of the EVS and substantially raises the transmission rate of event-based OCC, offering a new technical path toward high-speed, low-latency visible light communication. Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC systems utilizing an event-based vision sensor (EVS) as the receiver have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied; however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.
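The core idea, carrying information in the interval between consecutive events, can be sketched as a toy encoder and decoder. The mapping below (a base interval plus a fixed step per symbol value) and all parameter values are illustrative assumptions; the abstract does not give the paper's actual symbol mapping or timing.

```python
def eim_encode(bits, bits_per_symbol=2, base_us=100, step_us=20):
    """Map each group of bits to the interval (microseconds) before the next event."""
    if len(bits) % bits_per_symbol:
        raise ValueError("bit count must be a multiple of bits_per_symbol")
    intervals = []
    for i in range(0, len(bits), bits_per_symbol):
        symbol = int("".join(str(b) for b in bits[i:i + bits_per_symbol]), 2)
        intervals.append(base_us + symbol * step_us)
    return intervals


def eim_decode(intervals, bits_per_symbol=2, base_us=100, step_us=20):
    """Recover bits by snapping each measured interval to the nearest symbol."""
    max_symbol = (1 << bits_per_symbol) - 1
    bits = []
    for interval in intervals:
        symbol = round((interval - base_us) / step_us)
        symbol = min(max(symbol, 0), max_symbol)   # clamp timing outliers
        bits.extend(int(c) for c in format(symbol, f"0{bits_per_symbol}b"))
    return bits


bits = [1, 0, 0, 1, 1, 1]
print(eim_encode(bits))                # [140, 120, 160]
print(eim_decode([143, 118, 161]))     # jittered intervals still decode to bits
```

In this toy model, raising `bits_per_symbol` packs more bits into each event but narrows the tolerance to timing jitter for a fixed interval range, which is one way to see why the maximum usable modulation order has to be found experimentally.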

[139] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

Mingkai Liu,Dikai Fan,Haohua Que,Haojia Gao,Xiao Liu,Shuxue Peng,Meixia Lin,Shengyu Gu,Ruicong Ye,Wanli Qiu,Handong Yao,Ruopeng Zhang,Xianliang Huang

Main category: cs.CV

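TL;DR: Proposes MACE, a mixture-of-experts accelerated coordinate encoding method for efficient localization and high-quality rendering in large-scale scenes.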

Details

Motivation: Existing scene coordinate regression methods are limited by the capacity of a single network when extended to large-scale scenes, and they are computationally expensive. Method: Introduce an MoE-inspired gating network that implicitly classifies and selects sub-networks so that only one sub-network is activated per inference, and propose an auxiliary-loss-free load balancing strategy (ALF-LB) to improve localization accuracy. Result: Experiments on the Cambridge test set show that the method achieves high-quality rendering with only 10 minutes of training while significantly reducing cost and maintaining high precision. Conclusion: MACE offers an efficient, accurate, and low-cost solution for localization and rendering in large-scale scenes. Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furthermore, we present an Auxiliary-Loss-Free Load Balancing (ALF-LB) strategy to enhance the localization accuracy in large-scale scenes. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.
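The "only a single sub-network is activated" behavior corresponds to top-1 routing in a mixture of experts. A minimal sketch, assuming a linear gate and toy experts (none of these names or shapes come from the paper, and the load-balancing strategy is not modeled):

```python
def top1_route(features, gate_weights, experts):
    """Score every expert with a linear gate, then run only the best-scoring one.

    gate_weights: one weight row per expert (hypothetical linear gate).
    experts: list of callables; only the selected callable is invoked.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    chosen = max(range(len(scores)), key=scores.__getitem__)
    return chosen, experts[chosen](features)


# Toy setup: two experts, each a stand-in for a sub-network.
gate_weights = [[1, 0], [0, 1]]
experts = [
    lambda f: [2 * x for x in f],   # expert 0
    lambda f: [x + 1 for x in f],   # expert 1
]

print(top1_route([2, 9], gate_weights, experts))  # (1, [3, 10])
```

Because unselected experts are never called, inference cost stays close to that of a single sub-network regardless of how many experts exist.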

[140] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen,Wentao Jiang,Yiran Zhu,Tiezheng Ge,Zhiguo Cao,Bo Zheng

Main category: cs.CV

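TL;DR: Proposes IPRO, a reinforcement-learning-based video diffusion framework that optimizes the diffusion model with a face identity scorer to improve identity consistency in image-to-video generation.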

Details

Motivation: Existing image-to-video models struggle to preserve identity when the face occupies a small part of the image or when expressions and movements change substantially; because humans are highly sensitive to identity changes, this is a critical yet under-explored problem. Method: Identity-Preserving Reward-guided Optimization (IPRO) uses a face identity scorer as the reward and backpropagates the reward signal through the last steps of the sampling chain for richer gradient feedback; it adds a facial scoring mechanism based on a pool of multi-angle face features from ground-truth videos, plus KL-divergence regularization to stabilize training. Result: Extensive experiments on the Wan 2.2 I2V model and an in-house I2V model show that the method improves the identity consistency of generated videos without modifying the model architecture or adding auxiliary modules. Conclusion: IPRO is an efficient, direct solution to face identity preservation in image-to-video generation, with good generalization and application prospects. Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. 
Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

[141] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng,Zixian Zhang,Zhenghao Zhang,Junchao Liao,Long Qin,Weizhi Wang

Main category: cs.CV

TL;DR: Proposes Identity-GRPO, a human-feedback-driven optimization framework that improves consistency in multi-human identity-preserving video generation.

Details

Motivation: Existing methods struggle to keep multiple identities consistent in dynamic interaction scenes, which degrades video generation quality. Method: Train a video reward model on a large-scale preference dataset and apply a GRPO variant tailored to multi-human consistency for policy optimization. Result: Improves human consistency metrics by up to 18.9% over baseline methods, with ablation studies examining the impact of annotation quality and design choices. Conclusion: Identity-GRPO effectively improves multi-human identity preservation and provides a viable path for aligning reinforcement learning with personalized video generation. Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
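For readers unfamiliar with GRPO, its defining step is to score each sampled output relative to the other samples drawn for the same prompt. The sketch below shows only that standard normalization; the multi-human variant used in this paper is not specified in the abstract, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: reward-model scores for several videos generated from one prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)               # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)  # below-average sample < 0, average sample = 0, best sample > 0
```

Samples that beat their group's average get positive advantages and are reinforced, so no separate value network is needed to provide a baseline.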

[142] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

Tingman Yan,Tao Liu,Xilian Yang,Qunfei Zhao,Zeyang Xia

Main category: cs.CV

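TL;DR: This paper proposes MatchAttention, an attention mechanism that performs efficient cross-view matching by dynamically matching relative positions; combined with MatchDecoder, gated cross-attention, and a consistency-constrained loss, it achieves state-of-the-art performance on several datasets and supports real-time high-resolution matching.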

Details

Motivation: Existing cross-attention has quadratic complexity and lacks explicit matching constraints, which makes cross-view matching of high-resolution images difficult. Method: MatchAttention uses BilinearSoftmax for continuous, differentiable sliding-window attention sampling and iteratively updates relative positions through residual connections by embedding them in the feature channels; MatchDecoder is built around MatchAttention, and gated cross-MatchAttention and a consistency-constrained loss handle cross-view occlusions. Result: MatchStereo-B ranks first in average error on the Middlebury benchmark and needs only 29 ms for KITTI-resolution inference; MatchStereo-T processes 4K UHD images in 0.1 seconds using only 3 GB of GPU memory; the models reach state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow. Conclusion: By combining high accuracy with low computational complexity, the method makes real-time, high-resolution, high-accuracy cross-view matching possible. Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. 
The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.

[143] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment

Miu Sumino,Mayu Ishii,Shun Kaizu,Daisuke Hisano,Yu Nakayama

Main category: cs.CV

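TL;DR: Proposes a robust demodulation scheme for optical camera communication with an event-based vision sensor and reports the first outdoor experiments to achieve long-range, low-bit-error-rate transmission.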

Details

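Motivation: Improve the reliability and stability of optical camera communication in complex outdoor environments. Method: Combine OOK with toggle demodulation and a digital phase-locked loop in a new robust demodulation scheme. Result: Outdoor experiments achieve a bit error rate below 10^-3 at 200 m and 60 kbps and at 400 m and 30 kbps. Conclusion: The proposed demodulation scheme markedly improves the performance of optical camera communication and suits long-range outdoor links. Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.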

[144] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde,Brian Xu,Yuyin Zhou,Meng Xu,Hongyun Wang

Main category: cs.CV

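TL;DR: This paper proposes GauSSmart, a hybrid method that combines 2D foundation models with 3D Gaussian Splatting reconstruction to improve scene reconstruction quality in sparsely covered regions.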

Details

Motivation: Existing Gaussian Splatting reconstruction struggles to capture detail and maintain realism in regions with sparse coverage, limited by the sparsity of 3D training data. Method: GauSSmart applies 2D computer vision techniques, such as convex filtering and semantic feature supervision from foundation models like DINO, together with 2D segmentation priors and high-dimensional feature embeddings to guide the densification and refinement of Gaussian splats. Result: Validated on three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of scenes and improves coverage and detail preservation in sparse regions. Conclusion: Combining 2D foundation models with 3D reconstruction pipelines can overcome the limitations of either approach alone, showing the potential of hybrid 2D-3D methods for scene reconstruction. Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[145] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi,Huy-Hieu Pham,Duc-Trong Le

Main category: cs.CV

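TL;DR: Proposes a causal-inference-based framework that explicitly models mediators and tissue slides to mitigate domain shift in histopathology images, improving performance by up to 7% on CAMELYON17 and a private dataset.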

Details

Motivation: Existing methods rely mainly on modeling statistical correlations and overlook causal relationships, so they cope poorly with domain shift caused by differences in acquisition processes or data sources. Method: Following the front-door principle of causal inference, design transformation strategies that incorporate mediators and observed tissue slides, leveraging semantic features while mitigating the influence of confounders. Result: On CAMELYON17 and a private histopathology dataset, the method improves on existing baselines by up to 7% in unseen domains. Conclusion: Causal inference is an effective tool for addressing domain shift in histopathology image analysis and generalizes well. Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.
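For reference, the front-door principle mentioned here is usually stated as the front-door adjustment formula. In the form below, $X$ is the input, $M$ a mediator through which $X$ affects the outcome $Y$, and the confounder is unobserved. How the paper maps tissue slides and semantic features onto $X$, $M$, and $Y$ is not given in the abstract, so the symbols are generic:

$$P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x')$$

The inner sum averages the outcome over all inputs $x'$ for a fixed mediator value, which removes the confounder's influence on the $M \to Y$ link; the outer sum then weights each mediator value by how likely the actual input $x$ is to produce it.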

[146] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back,Seongbeom Park,Milim Kim,Mincheol Kwon,SangHyeok Lee,Hyunyoung Lee,Junhee Cho,Seunghyun Park,Jinkyu Kim

Main category: cs.CV

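TL;DR: Proposes a training-free tri-layer contrastive decoding method that selects a pivot layer using watermark-related questions, reducing hallucinations in large vision-language models.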

Details

Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are prone to hallucination, relying on a single modality or memorized training data without visually grounding their outputs. Method: Training-free tri-layer contrastive decoding with watermarking: first select a mature layer and an amateur layer, then use a watermark-related question to identify a visually well-grounded pivot layer, and finally apply tri-layer contrastive decoding to generate the output. Result: Experiments on public benchmarks such as POPE, MME, and AMBER show state-of-the-art performance in reducing LVLM hallucinations and generating more visually grounded responses. Conclusion: The proposed tri-layer contrastive decoding improves the visual grounding of LVLMs and substantially reduces hallucinations without additional training. Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.

[147] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Shivangi Yadav,Arun Ross

Main category: cs.CV

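TL;DR: Proposes MID-StyleGAN, a framework that combines diffusion models and GANs to generate multi-domain synthetic ocular images, easing data scarcity in presentation attack detection and significantly improving attack detection performance.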

Details

Motivation: Because presentation attack (PA) samples are difficult to construct and image, datasets for training and evaluating iris presentation attack detection (PAD) are scarce, and an effective data augmentation solution is needed. Method: MID-StyleGAN fuses diffusion models with generative adversarial networks (GANs) in a multi-domain architecture that translates images among domains such as bonafide eyes, printed eyes, and cosmetic contact lenses, with an adaptive loss function that maintains domain consistency for ocular data. Result: MID-StyleGAN outperforms existing methods at generating high-quality, diverse synthetic ocular images; on the LivDet2020 dataset, the true detect rate at a 1% false detect rate rises from 93.41% to 98.72%. Conclusion: MID-StyleGAN generates realistic multi-domain synthetic ocular images, significantly improves PAD performance, and offers a scalable solution to data scarcity in iris and ocular biometrics. Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
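The headline metric, true detect rate (TDR) at a fixed false detect rate (FDR), can be computed from detector scores as sketched below. This assumes the usual reading of the terms (FDR is the share of bonafide samples flagged as attacks, TDR the share of attacks flagged) and that higher scores mean "more likely an attack"; it is not the LivDet evaluation code, and the thresholding rule is one reasonable choice among several.

```python
def tdr_at_fdr(bonafide_scores, attack_scores, target_fdr=0.01):
    """True detect rate at the threshold that flags at most target_fdr of bonafide.

    Scores are PA scores: higher means more likely a presentation attack.
    """
    ranked = sorted(bonafide_scores, reverse=True)
    allowed = int(target_fdr * len(ranked))   # bonafide samples we may flag
    threshold = ranked[allowed]               # flag only scores above this value
    detected = sum(score > threshold for score in attack_scores)
    return detected / len(attack_scores)


bonafide = list(range(10))                    # toy scores 0..9
attacks = [7, 8.5, 12, 15]
print(tdr_at_fdr(bonafide, attacks, target_fdr=0.1))  # 0.75
```

Fixing the false detect rate first makes detectors comparable at the same cost to genuine users, which is why the abstract quotes both numbers at the 1% operating point.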

[148] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang,Fan Lu,Kecheng Zheng,Ziyuan Huang,Ziqiang Li,Wenjun Zeng,Xin Jin

Main category: cs.CV

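TL;DR: This paper proposes VaCo, which introduces a vision-centric activation and coordination mechanism that draws on multiple vision foundation models to optimize the representations of multimodal large language models (MLLMs).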

Details

Motivation: Mainstream MLLMs are supervised only by next-token prediction on textual tokens, neglecting vision-centric information that is critical for analytical abilities. Method: Introduces visual discriminative alignment, combining learnable Modular Task Queries (MTQs) with Visual Alignment Layers (VALs), and applies a Token Gateway Mask (TGM) across multiple groups of MTQs to coordinate representation conflicts among different vision foundation models. Result: Extensive experiments show that VaCo significantly improves the performance of different MLLMs on various benchmarks, demonstrating strong visual comprehension. Conclusion: VaCo effectively strengthens the visual comprehension of MLLMs and, by integrating information from multiple vision foundation models, unifies the optimization of textual and visual outputs. Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.

[149] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration

Siddharth Tourani,Jayaram Reddy,Sarvesh Thakur,K Madhava Krishna,Muhammad Haris Khan,N Dinesh Reddy

Main category: cs.CV

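TL;DR: Proposes a self-supervised RGB-D registration method based on cycle-consistent keypoints and a novel pose block; it outperforms previous self-supervised methods on ScanNet and 3DMatch and can be integrated into existing methods to improve their performance.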

Details

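Motivation: To exploit the growing volume of unlabeled RGB-D data for geometric reasoning about scenes, and to explore a self-supervised registration approach that does not rely on conventional geometric or feature-based similarity. Method: Introduces cycle-consistent keypoints as salient points to strengthen spatial coherence constraints during matching, and designs a novel pose block that combines a GRU recurrent unit with transformation synchronization to blend historical and multi-view information. Result: Surpasses previous self-supervised registration methods on ScanNet and 3DMatch and even outperforms some older supervised methods; experiments that integrate the components into existing methods confirm their effectiveness. Conclusion: The proposed method effectively improves RGB-D registration accuracy and offers a new direction and technical path for self-supervised geometric learning. Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration methods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self-supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.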

[150] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu,Peng Gao,Lewei Lu,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

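TL;DR: Proposes SPR (Spatial Preference Rewarding), a reward-based method that improves the fine-grained spatial understanding of multimodal large language models, particularly precise object localization and region description generation.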

Details

Motivation: Existing MLLMs fall short in fine-grained spatial perception and lack direct supervision of their actual responses, so they cannot meet users' needs for fine-grained spatial understanding. Method: SPR introduces semantic and localization scores to evaluate the quality of MLLM-generated descriptions, and performs direct preference optimization by pairing the best-scored refinement with the lowest-scored initial description, strengthening fine-grained alignment with the visual input. Result: Experiments on standard referring and grounding benchmarks show that SPR effectively improves MLLM spatial understanding with minimal training overhead. Conclusion: By rewarding fine-grained responses, SPR significantly strengthens the fine-grained spatial perception of MLLMs and offers an efficient, practical optimization path for spatial understanding in future multimodal models. Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with SPR, a Spatial Preference Rewarding approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
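
The pairing step described above (best-scored refinement as the preferred response, lowest-scored initial description as the rejected one) might look roughly like the sketch below. The `Candidate` type, the `combined` weighting of the two scores, and the output dictionary layout are all hypothetical; the abstract does not say how the semantic and localization scores are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic: float      # assumed text-quality score in [0, 1]
    localization: float  # assumed localization-quality score in [0, 1]


def combined(c, w_loc=0.5):
    # Assumption: a simple weighted average of the two scores.
    return (1.0 - w_loc) * c.semantic + w_loc * c.localization


def build_preference_pair(initial, refined):
    """Pick the best refinement as 'chosen' and the worst initial
    description as 'rejected' for preference optimization."""
    chosen = max(refined, key=combined)
    rejected = min(initial, key=combined)
    return {"chosen": chosen.text, "rejected": rejected.text}


initial = [
    Candidate("a dog somewhere on the left", 0.6, 0.3),
    Candidate("an animal", 0.2, 0.1),
]
refined = [
    Candidate("a brown dog at [12, 40, 96, 180]", 0.8, 0.9),
    Candidate("a dog at [10, 35, 120, 200]", 0.7, 0.6),
]
pair = build_preference_pair(initial, refined)
print(pair)
```

Pairs built this way would then be fed to a standard direct preference optimization trainer, which is outside the scope of this sketch.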

[151] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun,Jungwon Park,Jumgmin Ko,Changin Choi,Wonjong Rhee

Main category: cs.CV

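TL;DR: This paper proposes DOS (Directional Object Separation), a method that adjusts CLIP text embeddings to improve text-to-image generation for multi-object prompts, effectively reducing object neglect and object mixing.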

Details

Motivation: Existing text-to-image models are prone to object neglect or object mixing when prompts contain multiple objects, especially when objects have similar shapes or textures or when background biases differ markedly. By analyzing the properties of CLIP embeddings, the authors aim to provide a general and effective remedy. Method: Based on two key observations about CLIP embeddings, the authors propose DOS, which applies directional separation to three types of CLIP text embeddings before they are passed to the generative model, increasing the distinction between objects. Result: DOS consistently improves the success rate of multi-object image generation across several benchmarks and reduces object mixing. In human evaluations, DOS receives 26.24% to 43.04% more votes than four competing methods. Conclusion: DOS is a practical and effective method that significantly improves the generation quality of text-to-image models in complex multi-object scenes and has broad application prospects. Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[152] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights

Danish Ali,Ajmal Mian,Naveed Akhtar,Ghulam Mubashar Hassan

Main category: cs.CV

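TL;DR: This paper proposes DRBD-Mamba, an efficient 3D brain tumor segmentation model. With a dual-resolution bi-directional Mamba structure, space-filling curve mapping, and a gated fusion module, it maintains high accuracy while markedly improving computational efficiency and robustness, and it outperforms existing methods on the BraTS2023 dataset.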

Details

Motivation: Existing Mamba-based models for brain tumor segmentation incur heavy computational overhead, and their robustness across different data partitions has not been sufficiently explored, so an efficient and reliable segmentation method is needed. Method: Proposes DRBD-Mamba: a space-filling curve performs the 3D-to-1D feature mapping to preserve spatial locality and reduce multi-axial scanning overhead; a gated fusion module adaptively integrates forward and reverse contexts; a quantization block improves feature robustness; and five systematic cross-validation folds are constructed for comprehensive evaluation. Result: On the 20% BraTS2023 test set, Dice improves by 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor; on the proposed five folds, average Dice improves by 0.86% for tumor core and 1.45% for enhancing tumor, with a 15-fold gain in computational efficiency. Conclusion: DRBD-Mamba maintains high segmentation accuracy while substantially reducing computational cost, shows stronger robustness and practical potential, and offers an efficient and reliable approach to Mamba-based medical image segmentation. Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor.
Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.
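
The Dice gains quoted above refer to the standard Dice overlap between a predicted and a reference mask. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions rather than details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for two flattened
    binary masks given as equal-length sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(round(dice_score(pred, target), 4))  # 0.6667
```

For BraTS-style evaluation the same function would be applied once per subregion (whole tumor, tumor core, enhancing tumor) after binarizing the corresponding labels.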

[153] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Brandon Hill,Kma Solaiman

Main category: cs.CV

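TL;DR: This paper presents BoardVision, a reproducible framework for detecting assembly-level motherboard defects, proposes CTV Voter, a lightweight ensemble that balances precision and recall, and releases a GUI inspection tool for practical use.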

Details

Motivation: Prior work has mostly targeted bare-board or trace-level defects, while assembly-level defect detection on full boards remains underexplored; a reliable motherboard defect detection solution suited to high-volume electronics manufacturing is needed. Method: Benchmarks two detectors, YOLOv7 and Faster R-CNN, on the MiracleFactory dataset and proposes CTV Voter, a voting ensemble based on confidence and temporal information, to improve performance. Result: Provides a systematic comparison for assembly-level defect detection; CTV Voter effectively balances precision and recall, model robustness is evaluated under realistic perturbations, and the GUI tool improves practical usability. Conclusion: By combining an ensemble method with a deployable tool, the BoardVision framework helps move computer vision techniques from benchmark results to practical motherboard quality inspection. Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors, YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.

[154] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis

Chao Tu,Kun Huang,Jie Zhang,Qianjin Feng,Yu Zhang,Zhenyuan Ning

Main category: cs.CV

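TL;DR: Proposes DCMIL, a progressive representation learning model that efficiently processes whole slide images (WSIs) for cancer prognosis prediction without dense annotations and outperforms existing methods across multiple cancer types.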

Details

Motivation: Computational pathology currently faces a computational bottleneck from gigapixel inputs and a lack of dense manual annotations, and existing methods overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Method: Proposes the dual-curriculum contrastive multi-instance learning (DCMIL) model, which uses an easy-to-hard progressive learning strategy to turn gigapixel WSIs directly into prognosis predictions without relying on dense annotations. Result: Experiments on 12 cancer types (5,954 patients, 12.54 million tiles) show that DCMIL outperforms standard WSI-based prognostic models, identifies prognosis-salient regions, provides instance uncertainty estimates, and captures morphological differences between normal and tumor tissues. Conclusion: DCMIL is an efficient WSI analysis framework that needs no dense annotations, performs strongly in cancer prognosis prediction, and has the potential to generate new biological insights. Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic models for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All code has been made publicly accessible at https://github.com/tuuuc/DCMIL.

[155] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang,Yifan Bian,Li Li,Jingran Wu,Xianguo Zhang,Dong Liu

Main category: cs.CV

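TL;DR: Proposes a unified neural video compression framework that combines intra and inter coding, with a single model processing every frame adaptively. It addresses disocclusion, new content, and interframe error propagation, adds a bidirectional two-frame compression design, and clearly outperforms DCVC-RT.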

Details

Motivation: Existing neural video compression schemes handle disocclusion, new content, and interframe error propagation poorly and lack an efficient intra coding mechanism. Method: Borrowing the idea of intra coding from classic video coding, the work enables intra coding tools within inter-coded frames and designs a unified neural network model that adaptively performs intra/inter coding; it also proposes compressing two frames simultaneously to exploit forward and backward interframe redundancy. Result: Achieves an average 10.7% BD-rate reduction relative to DCVC-RT, delivers more stable per-frame bitrate and quality, and retains real-time encoding/decoding performance. Conclusion: The proposed unified intra/inter neural video compression framework overcomes key shortcomings of existing NVC methods and yields clear gains in both compression efficiency and stability. Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performance. Code and models will be released.

[156] Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob,Weijia Shao,Gjergji Kasneci

Main category: cs.CV

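TL;DR: Proposes a minimally distorted universal adversarial attack for video object detection based on nuclear norm regularization and optimized with an adaptive optimistic exponentiated gradient method; it produces structured perturbations in the background that are both stealthy and effective.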

Details

Motivation: Deep learning models for video object detection are vulnerable to universal adversarial perturbations, and existing methods suffer from poorly structured perturbations and insufficient stealthiness. Method: Uses nuclear norm regularization to concentrate perturbations in background regions and designs an adaptive optimistic exponentiated gradient algorithm for efficient optimization, yielding a minimally distorted universal adversarial attack. Result: The proposed method is more effective than low-rank projected gradient descent and Frank-Wolfe based attacks while maintaining higher visual stealthiness. Conclusion: The method generates structured universal perturbations concentrated in the background and provides a strong new tool for assessing the security of video object detection models. Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

[157] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi

Main category: cs.CV

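TL;DR: This review summarizes 49 studies from 2018 to 2025 on unsupervised deep generative models (autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models) for anomaly detection in neuroimaging. It finds that, trained only on healthy data, these models can identify lesions in brain MRI and produce interpretable pseudo-healthy images, showing potential for clinical use.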

Details

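Motivation: Conventional supervised methods depend on large amounts of annotated abnormal data and generalize poorly to rare or under-characterized diseases, whereas unsupervised generative models need only healthy data for training and detect deviations from learned normal brain structure, which sidesteps the annotation bottleneck and improves detection of unseen anomalies. Method: Conducts a PRISMA-guided scoping review that systematically screens and analyzes 49 studies published between 2018 and 2025, covering autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models applied to brain MRI and CT, and compares their architectural designs and performance metrics. Result: Generative models perform well on large focal lesions and are steadily improving on subtle abnormalities; they produce interpretable pseudo-healthy reconstructions that help when annotated data are lacking; and different models show promise across pathologies such as tumours, stroke, and multiple sclerosis. Conclusion: Unsupervised deep generative models are a promising approach to anomaly detection in neuroimaging. Future work should focus on anatomy-aware modelling, foundation model development, task-appropriate evaluation metrics, and rigorous clinical validation to advance clinical translation in semi-supervised learning, discovery of novel biomarkers, and cross-disease deviation mapping. Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases.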
Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.

[158] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Thomas Katraouras,Dimitrios Rafailidis

Main category: cs.CV

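TL;DR: This paper proposes MIR-L, a compression method for multi-task image restoration models that uses iterative pruning to maintain or even exceed existing performance at high sparsity.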

Details

Motivation: Lossy operations applied by online social networks often degrade image quality and harm user experience; existing multi-task image restoration models have too many parameters and are computationally inefficient, so an effective compression method is needed. Method: Uses an iterative pruning strategy that removes low-magnitude weights in each round and resets the remaining weights to their initial values, uncovering highly sparse subnetworks ("winning tickets") within overparameterized models. Result: Experiments on deraining, dehazing, and denoising show that MIR-L maintains high performance while retaining only 10% of the trainable parameters. Conclusion: MIR-L effectively compresses multi-task image restoration models, substantially cutting the parameter count while maintaining or exceeding the restoration performance of the original models. Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.
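
The prune-then-reset loop described above can be sketched on a flat list of weights. This is a generic illustration of iterative magnitude pruning with rewinding to the initialization, not MIR-L's actual code: `train_fn` is a hypothetical stand-in for a full training run, and the per-round pruning fraction is an assumed hyperparameter.

```python
def iterative_magnitude_pruning(init_weights, train_fn, rounds, prune_fraction):
    """Return (mask, rewound_weights) after `rounds` of pruning.

    Each round: start from the ORIGINAL initialization restricted to the
    current mask, train, then drop the `prune_fraction` of surviving
    weights with the smallest trained magnitude.
    """
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Rewind: surviving weights restart from their initial values.
        start = [w * m for w, m in zip(init_weights, mask)]
        trained = train_fn(start, mask)
        alive = [i for i, m in enumerate(mask) if m]
        n_prune = int(prune_fraction * len(alive))
        # Remove the lowest-magnitude surviving weights.
        for i in sorted(alive, key=lambda idx: abs(trained[idx]))[:n_prune]:
            mask[i] = 0
    rewound = [w * m for w, m in zip(init_weights, mask)]
    return mask, rewound


# Toy "training" that just doubles every weight (hypothetical).
init = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -1.0]
mask, sparse_init = iterative_magnitude_pruning(
    init, lambda w, m: [2 * x for x in w], rounds=2, prune_fraction=0.5
)
print(sum(mask))  # 3
```

With two rounds at 50% per round, 10 weights shrink to 5 and then to 3; reaching the 10% density reported above would take more rounds or a different per-round fraction.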

[159] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data

Aleksis Pirinen,Delia Fano Yela,Smita Chakraborty,Erik Källman

Main category: cs.CV

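TL;DR: This study uses Sentinel-2 L2A time series imagery with CNN-LSTM models to detect seasonally whether fields are grazed, achieving high recall and F1 scores and providing useful guidance for inspection planning.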

Details

Motivation: Grazing affects agricultural production and biodiversity, but scalable means of monitoring where grazing occurs are lacking, so an efficient and reliable way to identify grazing activity is needed. Method: Trains an ensemble of CNN-LSTM models on Sentinel-2 L2A multi-temporal reflectance data, analyzing April-October imagery to make a binary prediction (grazed / not grazed) within each field boundary. Result: The model achieves an average F1 score of 77% across five validation splits, with 90% recall on grazed pastures; when at most 4% of sites can be inspected each year, prioritizing fields the model predicts as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. Conclusion: Freely available, coarse-resolution satellite data combined with deep learning models can reliably steer inspection resources for conservation-oriented land-use compliance. Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.
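
The "times more than random inspection" figure is a lift under an inspection budget. One plausible way to compute such a lift is sketched below, assuming the model outputs a per-field probability of being non-grazed and that ground truth is available for evaluation; the function name and the toy data are hypothetical, and the paper may define the quantity differently.

```python
def inspection_lift(p_non_grazed, is_non_grazed, budget=0.04):
    """Ratio of confirmed non-grazed sites found by visiting the top
    `budget` share of fields (ranked by model score) to the number
    expected from visiting the same number of fields at random."""
    n = len(p_non_grazed)
    k = max(1, round(budget * n))
    ranked = sorted(range(n), key=lambda i: p_non_grazed[i], reverse=True)
    hits = sum(is_non_grazed[i] for i in ranked[:k])
    base_rate = sum(is_non_grazed) / n
    if base_rate == 0:
        raise ValueError("no non-grazed sites in the evaluation set")
    return hits / (k * base_rate)


# Toy data: 100 fields, 5 truly non-grazed, scores rank them first.
truth = [1] * 5 + [0] * 95
scores = [1.0 - i / 100 for i in range(100)]
print(inspection_lift(scores, truth))  # ≈ 20 (4 hits vs 0.2 expected)
```

The lift is bounded above by the inverse of the base rate, so a value such as 17.2 is only reachable when non-grazed fields are rare.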

[160] Vision Mamba for Permeability Prediction of Porous Media

Ali Kashefi,Tapan Mukerji

Main category: cs.CV

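TL;DR: This paper is the first to use Vision Mamba as the backbone for predicting the permeability of three-dimensional porous media, with advantages over ViT and CNN in computational efficiency, memory footprint, and parameter count.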

Details

Motivation: ViT's computational complexity grows quadratically with input resolution and CNNs have large parameter counts, which limits their use for 3D permeability prediction, so a more efficient and lightweight model is needed. Method: Builds a neural network with Vision Mamba as the backbone, compares it experimentally with ViT and CNN models, and runs an ablation study to assess how each component affects prediction accuracy. Result: Experiments show that Vision Mamba is better than ViT and CNN in linear scaling, memory efficiency, and parameter count, and delivers better overall performance in predicting the permeability of 3D porous media. Conclusion: Vision Mamba is a promising alternative that could replace ViT in large vision models, with good scalability and application potential. Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.
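
To make the linear-versus-quadratic contrast concrete, the sketch below counts tokens for a cubic volume split into cubic patches and compares a per-token cost with the all-pairs cost of self-attention. The cubic geometry, the patch size of 8, and the helper names are illustrative assumptions, not values from the paper.

```python
def token_count(side, patch):
    """Number of non-overlapping cubic patches in a side^3 volume."""
    per_axis = side // patch
    return per_axis ** 3


def attention_pairs(n_tokens):
    """Token-to-token interactions in full self-attention."""
    return n_tokens * n_tokens


for side in (64, 128):
    n = token_count(side, patch=8)
    # A sequential state-space scan touches each token once (cost ~ n),
    # whereas self-attention compares every pair of tokens (cost ~ n^2).
    print(side, n, attention_pairs(n))
# 64 512 262144
# 128 4096 16777216
```

Doubling the side length multiplies the token count by 8 but the pairwise attention cost by 64, which is the gap a linear-time backbone avoids.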

[161] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Qurrat Ul Ain,Atif Aftab Ahmed Jilani,Zunaira Shafqat,Nigar Azhar Butt

Main category: cs.CV

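TL;DR: SurgScan is a YOLOv8-based AI framework for surgical instrument defect detection that offers high accuracy (99.3%) and real-time inference (4.2-5.8 ms per image), making it suitable for industrial deployment.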

Details

Motivation: Quality control for surgical instruments traditionally relies on manual inspection, which is error-prone and inconsistent and puts patient safety at risk, so an automated, high-accuracy inspection solution is needed. Method: Proposes the SurgScan framework, which uses a YOLOv8 model trained on a dataset of 102,876 high-resolution images covering 11 instrument types and five major defect categories, and adds contrast-enhanced preprocessing to improve detection. Result: SurgScan reaches the highest accuracy (99.3%) with inference speeds of 4.2-5.8 ms per image, outperforming the compared CNN models; statistical analysis shows that contrast enhancement improves detection performance. Conclusion: SurgScan offers a scalable, cost-effective AI solution for automated quality control in surgical instrument manufacturing, reducing reliance on manual inspection while complying with ISO 13485 and FDA standards. Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.

[162] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong,Didi Zhu,Zijing Hu,Jinluan Yang,Ziyu Zhao

Main category: cs.CV

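TL;DR: This paper proposes a prompt-aware noise projection method that applies text-conditioned refinement to the initial noise before denoising, improving text-image alignment in text-to-image generation without modifying the pretrained model and at low inference cost.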

Details

Motivation: In text-to-image generation, the noise distributions at training and inference differ (prompt-dependent noise during training versus an unconditional Gaussian at inference), so generated images may drift from the prompt. This training-inference mismatch motivates the proposed fix. Method: Proposes a noise projector that, at inference, adjusts the initial noise according to the prompt embedding so that it better matches the noise distribution seen in training. The pipeline first samples multiple noises and obtains image feedback from a vision-language model, distills these signals into a reward model, and then trains the noise projector with a quasi-direct preference optimization. Result: Experiments show that the method improves text-image alignment across diverse prompts, needs no reference images or handcrafted priors, and requires only a single forward pass at inference, making it more efficient than multi-sample post-selection. Conclusion: Prompt-aware noise projection effectively mitigates the training-inference mismatch in text-to-image generation, improving consistency between outputs and prompts while keeping inference cost low. Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[163] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui,Ting Sun,Suyin Liang,Tingquan Gao,Zelun Zhang,Jiaxuan Liu,Xueqing Wang,Changda Zhou,Hongen Liu,Manhui Lin,Yue Zhang,Yubo Zhang,Handong Zheng,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

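TL;DR: PaddleOCR-VL is an advanced, resource-efficient document parsing model that integrates a dynamic-resolution visual encoder with a lightweight language model, supports 109 languages, and achieves SOTA performance on element recognition and page-level parsing.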

Details

Motivation: To achieve efficient and accurate multilingual document parsing, in particular better recognition of complex elements (such as tables, formulas, and charts) at lower resource cost. Method: Proposes PaddleOCR-VL-0.9B, a compact vision-language model that combines a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model. Result: On public and in-house benchmarks, PaddleOCR-VL achieves SOTA results in both page-level document parsing and element-level recognition, significantly outperforming existing solutions while offering fast inference and strong competitiveness. Conclusion: PaddleOCR-VL is a high-performance, low-resource document parsing solution suited to large-scale deployment in real-world scenarios. Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[164] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang,Fan Xiao,Dongming He,Anqi Gao,Dandan Li,Xiaofan Zhang,Shaoting Zhang,Xudong Wang

Main category: cs.CV

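TL;DR: This paper introduces DentVFM, the first family of vision foundation models for dentistry. Built on large-scale self-supervised learning over multi-modal dental imaging, it delivers cross-task, cross-modality generalist intelligence, significantly outperforms existing methods, and advances the scalability and clinical applicability of dental AI.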

Details

Motivation: Dental image analysis is constrained by a shortage of trained professionals and by the limits of existing AI systems in generalization, annotation cost, and multi-modal use, so a general, efficient, and scalable AI solution is needed. Method: Proposes DentVFM, 2D and 3D vision foundation models based on the Vision Transformer architecture, trained with self-supervised learning on DentVista, a dataset of about 1.6 million multi-center, multi-modal images, and builds DentBench, a comprehensive evaluation benchmark covering eight dental subspecialties. Result: DentVFM generalizes strongly across tasks including disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation, significantly outperforms a range of baselines, and provides cross-modality diagnostic performance exceeding that of experienced dentists when conventional imaging is unavailable. Conclusion: DentVFM sets a new paradigm for dental AI with good scalability, adaptability, and label efficiency, and could improve intelligent oral healthcare worldwide and help close critical resource gaps. Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability.
Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[165] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Keima Abe,Hayato Muraki,Shuhei Tomoshige,Kenichi Oishi,Hitoshi Iyatomi

Main category: cs.CV

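TL;DR: Proposes PL-SE-ADA, a domain adaptation framework for domain harmonization and interpretable representation learning on medical images such as brain MR. By separating domain-invariant and domain-specific features, it improves interpretability while preserving image reconstruction quality and disease classification performance.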

Details

Motivation: Medical images exhibit domain shift across imaging sites, which degrades machine learning performance, and existing methods lack interpretability, which limits their clinical use. Method: Two encoders (f_E and f_SE) extract domain-invariant (z_u) and domain-specific (z_d) features, respectively; these are combined with image reconstruction and adversarial training and learned end to end with a decoder f_D and a domain predictor g_D. Result: Performance on image reconstruction, disease classification, and domain recognition is equal to or better than existing methods, and both domain-independent features and domain-specific components can be visualized. Conclusion: PL-SE-ADA harmonizes domains effectively while preserving disease-relevant information, is highly interpretable, and is suited to medical image analysis. Abstract: Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability, an essential requirement in medical applications, leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder $f_D$ to reconstruct the image, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.

[166] Exploring Image Representation with Decoupled Classical Visual Descriptors

Chenyuan Qu,Hao Chen,Jianbo Jiao

Main category: cs.CV

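TL;DR: This paper proposes VisualSplit, a framework that decomposes images into classical visual descriptors to combine traditional image features with modern deep learning, yielding interpretable and efficient visual representation learning.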

Details

Motivation: Modern deep learning models perform strongly, but their internal representations lack interpretability, whereas classical visual descriptors (such as edges, colour, and intensity distribution) are intuitively understandable. The paper aims to bridge this gap and explore whether modern learning can benefit from these classical cues. Method: The VisualSplit framework explicitly decomposes images into decoupled classical visual descriptors and uses a reconstruction-driven pre-training strategy to learn the essence of each descriptor while preserving interpretability. Result: VisualSplit achieves effective attribute control in advanced visual tasks such as image generation and editing, going beyond conventional classification and segmentation, which supports the effectiveness of the approach for visual understanding. Conclusion: By combining classical visual descriptors with a modern learning framework, VisualSplit offers a visual representation learning method that is both performant and interpretable, showing the potential of classical features in modern vision tasks. Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.

Ziqi Jiang,Yanghao Wang,Long Chen

Main category: cs.CV

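TL;DR: This paper proposes Flow Matching Alignment (FMA), a new model-agnostic multi-step adjustment method that learns a cross-modal velocity field to achieve more precise and robust feature alignment. Compared with existing one-step parameter-efficient fine-tuning methods, it shows significant gains across multiple benchmarks and backbones, especially on complex datasets.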

Details

Motivation: Existing parameter-efficient fine-tuning (PEFT) methods perform only a one-step adjustment, which struggles to disentangle highly entangled cross-modal features, especially on complex datasets, so a stronger multi-step alignment mechanism is needed. Method: Flow Matching Alignment (FMA) uses a fixed coupling strategy to ensure category correspondence, introduces noise augmentation to alleviate data scarcity, and designs an early-stopping solver to improve efficiency and accuracy, aligning cross-modal features through multi-step adjustment. Result: FMA yields significant performance gains across multiple benchmarks and backbones, particularly on challenging datasets, with more precise and robust alignment. Conclusion: FMA is the first model-agnostic multi-step cross-modal alignment framework; by learning a velocity field for progressive alignment, it overcomes the limitations of conventional PEFT methods and offers a more effective solution for complex cross-modal tasks. Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
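
The multi-step adjustment with an early-stopping solver described above can be pictured as integrating a velocity field for only part of its trajectory. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which solver FMA uses, so plain Euler steps are assumed here, and `integrate_with_early_stop`, `toy_velocity`, `num_steps`, and `stop_after` are hypothetical names. The straight-line toy field stands in for the learned cross-modal velocity field.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def integrate_with_early_stop(
    x0: Sequence[float],
    velocity: Callable[[Vector, float], Vector],
    num_steps: int = 10,
    stop_after: int = 7,
) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t) on [0, 1], stopping early.

    A full trajectory would take `num_steps` steps; only the first
    `stop_after` of them are executed (assumed, hypothetical parameters).
    """
    if not 0 <= stop_after <= num_steps:
        raise ValueError("stop_after must lie in [0, num_steps]")
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(stop_after):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Sequence[float]) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned field: straight-line flow towards `target`."""
    def field(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field


feature = [0.0, 0.0]       # toy source feature
class_anchor = [1.0, 2.0]  # toy target feature of the coupled category
halfway = integrate_with_early_stop(feature, toy_velocity(class_anchor), 10, 5)
```

With the toy field, stopping after 5 of 10 steps moves the feature roughly halfway to its target; in the paper's setting the stopping point is a design choice of the solver, which the abstract reports improves both efficiency and accuracy.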

[168] Consistent text-to-image generation via scene de-contextualization

Song Tang,Peihao Gong,Kunyu Li,Kai Guo,Boyu Wang,Mao Ye,Jianwei Zhang,Xiatian Zhu

Main category: cs.CV

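TL;DR: This paper proposes SDeC, a training-free prompt embedding editing method that addresses identity shift in text-to-image generation by de-contextualizing the scene context, improving identity preservation.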

Details

Motivation: Existing methods for identity shift in text-to-image generation typically assume that all target scenes are known in advance, which is unrealistic in practice; this paper aims to remove that limitation. Method: Scene De-Contextualization (SDeC) quantifies SVD directional stability to adaptively re-weight eigenvalues, suppressing the latent scene-identity correlation in the prompt embedding. Result: Experiments show that SDeC significantly enhances identity preservation while maintaining scene diversity, without prior knowledge of all target scenes. Conclusion: SDeC is an efficient, flexible, training-free solution suited to real-world applications where scenes are unknown or change over time. Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[169] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang,Cheng Shi,Yang Wang,Sibei Yang

Main category: cs.CV

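TL;DR: This paper proposes a proactive AI assistant framework for first-person streaming video that aims to answer evolving questions in a timely, coherent, and efficient way. It introduces the ESTP-Bench benchmark and the ESTP-F1 metric, and achieves superior performance with a pipeline comprising a data engine, multi-stage training, and dynamic compression.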

Details

Motivation: For AI to be truly useful in everyday human settings, it must not only observe but also proactively understand, anticipate, and respond in time to unfolding events; existing methods fall short in proactive response, temporal consistency, and inference efficiency. Method: A three-part technical pipeline is proposed: a data engine that generates training data, a multi-stage training strategy that improves model understanding, and a dynamic compression technique that keeps processing real-time. ESTP-Bench and the ESTP-F1 metric are also built to measure proactive coherence, just-in-time responsiveness, and synchronized efficiency. Result: The proposed model significantly outperforms multiple baselines on ESTP-Bench and several other online and offline benchmarks, effectively realizing the three key properties. Conclusion: The work advances proactive AI agents for first-person video streams and provides a feasible framework and evaluation standard for deploying assistants with real-time perception and reasoning in real-life settings. Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric, a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page: https://zhangyl4.github.io/publications/eyes-wide-open/

[170] BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

Junyi Wu,Jiaming Xu,Jinhao Li,Yongkang Zhou,Jiayi Pan,Xingyang Li,Guohao Dai

Main category: cs.CV

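TL;DR: This paper proposes BalanceGS, an algorithm-system co-design for optimizing 3D Gaussian Splatting training. Using density control, adaptive sampling, and memory access reordering, it achieves a 1.44x training speedup on an NVIDIA A100 with almost no loss in reconstruction quality.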

Details

Motivation: Conventional 3D Gaussian Splatting (3DGS) training suffers from three efficiency problems that limit training speed: skewed density allocation, imbalanced computation workload, and fragmented memory access. Method: (1) Heuristic workload-sensitive Gaussian density control to balance point distributions; (2) similarity-based Gaussian sampling and merging for dynamic workload distribution; (3) a reordering-based memory access mapping strategy that optimizes batch loading of RGB data. Result: A 1.44x training speedup over conventional 3DGS on an NVIDIA A100 GPU, with almost no drop in reconstruction quality. Conclusion: Through algorithm-system co-optimization, BalanceGS significantly improves 3DGS training efficiency and resolves key performance bottlenecks in the original pipeline. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, an algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose a reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on an NVIDIA A100 GPU with negligible quality degradation.

[171] CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification

Dongwook Lee,Sol Han,Jinwhan Kim

Main category: cs.CV

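TL;DR: This paper proposes CALM-Net, a multi-branch neural network for vehicle re-identification from LiDAR point clouds, which improves feature discriminability by introducing curvature information and a multi-branch structure.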

Details

Motivation: To learn more discriminative and complementary features from 3D point clouds for distinguishing vehicles, addressing the limited use of geometric detail in existing methods. Method: A multi-branch architecture combines edge convolution, point attention, and a curvature embedding to capture local surface variation and contextual information in point clouds. Result: On the large-scale nuScenes dataset, CALM-Net improves mean re-identification accuracy by approximately 1.97 percentage points over the strongest baseline. Conclusion: Incorporating curvature information and multi-branch feature learning significantly improves LiDAR point cloud-based vehicle re-identification, confirming the effectiveness of the method. Abstract: This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97 percentage points compared with the strongest baseline in our study. The results confirm the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.

[172] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky,Shimon Malnick,Shai Avidan

Main category: cs.CV

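TL;DR: This paper proposes a new framework for pixel-level keypoint grounding, consisting of a Point Descriptor that generates keypoint descriptions and a Point Localizer that regresses precise coordinates from those descriptions, and builds the LlamaPointInPart dataset for training and evaluation.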

Details

Motivation: Existing vision-language models are limited to object-level or region-level grounding and cannot understand pixel-level keypoints through natural language; this paper aims to fill that gap. Method: A two-component framework is proposed: the Point Descriptor generates context-rich, free-form keypoint descriptions, and the Point Localizer regresses precise pixel coordinates. Multiple vision-language models are used to synthesize 20K+ image-keypoint-description triplets that form the LlamaPointInPart dataset. The Descriptor is optimized with GRPO on AP-10K, using the frozen Localizer as a reward model to improve localization accuracy. Result: Experiments show that the method outperforms baseline models on LlamaPointInPart; the new evaluation protocol measures performance by the distance between predicted and ground-truth points rather than by text matching. Conclusion: The framework achieves bidirectional alignment between natural language and pixel-level keypoints and could enable applications in keypoint-guided image understanding and language-guided precise localization. Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel-level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close the predicted point is to the ground-truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart. The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.

[173] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Zhifei Chen,Tianshuo Xu,Leyi Wu,Luozhou Wang,Dongyu Yan,Zihan You,Wenting Luo,Guo Zhang,Yingcong Chen

Main category: cs.CV

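TL;DR: STANCE is an image-to-video generation framework that introduces Instance Cues and a Dense RoPE mechanism to address weak object-motion coherence and interaction in video generation, improving temporal consistency.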

Details

Motivation: Existing video generation methods struggle to keep object motion and interactions coherent, mainly because human-provided motion hints lose effective information after encoding and because conflicts between appearance and motion optimization cause temporal inconsistency. Method: The STANCE framework has two key components: (1) Instance Cues, which turn sparse, user-editable hints into a dense 2.5D motion field; (2) Dense RoPE, which tags motion tokens with spatially addressable rotary embeddings to preserve hint salience, combined with joint prediction of RGB and an auxiliary map (such as segmentation or depth). Result: Without requiring per-frame trajectory scripts, STANCE strengthens motion guidance, improves temporal coherence, and performs better in visual quality and dynamic consistency. Conclusion: Through structured motion control and decoupled optimization of appearance and structure, STANCE improves object-motion coherence and interaction quality in video generation and gives users an easier and more efficient means of control. Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB $+$ auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.

[174] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

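TL;DR: Proposes a hierarchical re-classification system that combines SpeciesNet, CLIP embeddings, and metric learning to refine coarse animal labels toward species-level identification, reaching 96.5% accuracy on the LILA BC dataset with 64.9% of detections resolved to species level.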

Details

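Motivation: Existing animal classification models such as SpeciesNet use conservative rollup strategies that leave many labels at high taxonomic levels, without fine-grained species-level identification, which limits applications such as ecological monitoring. Method: A five-stage hierarchical re-classification pipeline is built (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) that fuses SpeciesNet EfficientNetV2-M predictions with CLIP embeddings for refined classification. Result: Validated on the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections): 761 bird detections are recovered from "blank" and "animal" labels, and 456 detections originally labeled animal, mammal, or blank are re-classified with 96.5% accuracy, of which 64.9% are refined to species level. Conclusion: The method improves the fine-grained recognition ability of existing animal classification systems, significantly reduces the share of coarse labels, and provides a more precise automatic classification tool for wildlife monitoring. Abstract: State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from "blank" and "animal" labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent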

[175] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering

Hugo Markoff,Jevgenijs Galaktionovs

Main category: cs.CV

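TL;DR: This study evaluates zero-shot organization of wildlife images using self-supervised vision transformers, developed and tested within the Animal Detect platform. Combining unsupervised clustering with dimensionality reduction yields high-accuracy image grouping and continuous similarity ordering, which has been deployed in production to speed up manual annotation in biodiversity monitoring.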

Details

Motivation: Many camera trap datasets contain species not covered by existing classifiers, so a method is needed to organize large volumes of wildlife images without labels. Method: The unsupervised clustering performance of three architectures (CLIP, DINOv2, MegaDescriptor) is compared in combination with DBSCAN and GMM clustering and PCA and UMAP dimensionality reduction, and t-SNE is used for one-dimensional similarity ordering. Result: On a 5-species test set with labels used only for evaluation, DINOv2 with UMAP and GMM reaches 88.6% accuracy (macro-F1 = 0.874); 1D sorting reaches 88.2% coherence for mammals and birds and 95.2% for fish. Conclusion: DINOv2 with UMAP and GMM performs best, and continuous similarity ordering has been deployed in production, markedly improving the efficiency of exploratory analysis and manual annotation. Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.

[176] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong,Jiaqi Gu,Qi Yang,Lubin Fan,Yue Wu,Ying Wang,Kun Ding,Shiming Xiang,Jieping Ye

Main category: cs.CV

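TL;DR: Proposes Wiki-PRF, a three-stage method whose processing, retrieval, and filtering stages improve multimodal query quality and the relevance of retrieved results in knowledge-based visual question answering.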

Details

Motivation: Existing retrieval-augmented generation methods fall short in multimodal query quality and in the relevance of retrieved results. Method: The Wiki-PRF framework comprises processing, retrieval, and filtering stages, and a visual language model is trained with reinforcement learning using answer accuracy and format consistency as reward signals. Result: Answer quality improves significantly on the E-VQA and InfoSeek datasets (by 36.0 and 42.8, respectively), reaching state-of-the-art performance. Conclusion: Wiki-PRF effectively improves multimodal information extraction, knowledge retrieval, and result filtering in KB-VQA. Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF

[177] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding,Keisuke Fujii,Toru Tamaki

Main category: cs.CV

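TL;DR: This paper proposes Shot2Tactic-Caption, a framework for semantic and temporal multi-scale captioning of badminton videos that generates shot-level and tactic-level captions, and introduces the first dataset containing both shot and tactic captions.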

Details

Motivation: Existing methods struggle to capture both individual actions and the way tactics evolve over time in badminton matches, and multi-scale captioning data and models that support tactical understanding are lacking. Method: A dual-branch framework combines a visual encoder, a spatio-temporal Transformer encoder, and a decoder, with a Tactic Unit Detector that identifies tactic types and states; a shot-wise prompt-guided mechanism injects the tactic type and state into the decoder as prompts. Result: The framework's effectiveness in generating shot and tactic captions is validated on the self-built Shot2Tactic-Caption dataset; ablations show that the ResNet50-based spatio-temporal encoder performs best and that prompt structuring improves caption coherence and accuracy. Conclusion: Shot2Tactic-Caption effectively generates multi-scale tactical captions for badminton matches, in particular capturing tactical executions that are interrupted and later resumed, offering a new approach to sports video understanding. Abstract: Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

[178] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Natan Bagrov,Eugene Khvedchenia,Borys Tymchenko,Shay Aharon,Lior Kadoch,Tomer Keren,Ofri Masad,Yonatan Geifman,Ran Zilberstein,Tuomas Rintamaki,Matthieu Le,Andrew Tao

Main category: cs.CV

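TL;DR: This paper proposes Efficient Video Sampling (EVS), which reduces token redundancy in vision-language models by identifying and pruning spatio-temporal regions that stay unchanged across consecutive frames, substantially improving inference speed and the sequence length that can be processed with almost no loss in accuracy.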

Details

Motivation: Existing vision-language models face token budget limits and high latency on long videos, because they process dense frame sequences at quadratically growing cost, which makes scalable video understanding difficult. Method: EVS detects spatial patches that stay static across consecutive frames (unchanged regions) and removes this redundant information to reduce the number of input tokens. It preserves positional information, needs no architectural changes or retraining, can be applied directly at inference time, and can be combined with an uptraining phase using stochastic pruning rates to make the model robust to different compression levels. Result: At inference time, EVS reduces large language model time-to-first-token (TTFT) by up to 4x while maintaining semantic fidelity; experiments show that it reduces token count, supports longer inputs, and improves the efficiency-accuracy trade-off across settings. Conclusion: EVS is a simple, plug-and-play method that substantially improves the scalability and inference efficiency of video-language models, offering a practical and efficient solution for long-video understanding. Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
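
The idea of pruning temporally static patches while preserving positional identity can be illustrated with a small sketch. It is an illustration under stated assumptions, not the EVS implementation: the abstract does not specify how "unchanged" is measured, so a mean absolute difference against the same patch position in the previous frame is assumed, and `prune_static_patches` and `threshold` are hypothetical names.

```python
from typing import List, Tuple

Patch = List[float]
Frame = List[Patch]


def prune_static_patches(
    frames: List[Frame], threshold: float = 0.05
) -> List[Tuple[int, int, Patch]]:
    """Keep only patches that changed relative to the previous frame.

    Each kept token carries its original (frame index, patch index), so
    positional identity survives the pruning. All first-frame patches
    are kept because there is nothing earlier to compare against.
    """
    kept: List[Tuple[int, int, Patch]] = []
    for f_idx, frame in enumerate(frames):
        for p_idx, patch in enumerate(frame):
            if f_idx == 0:
                kept.append((f_idx, p_idx, patch))
                continue
            previous = frames[f_idx - 1][p_idx]
            # Assumed change measure: mean absolute difference per value.
            change = sum(abs(a - b) for a, b in zip(patch, previous)) / len(patch)
            if change > threshold:
                kept.append((f_idx, p_idx, patch))
    return kept


# Three toy frames with two patches each; only patch 1 changes, once.
video = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 0.5]],
    [[0.0, 0.0], [1.0, 0.5]],
]
tokens = prune_static_patches(video)
positions = [(f, p) for f, p, _ in tokens]
```

In this toy clip, six patches shrink to three tokens: both patches of the first frame plus the one patch that changed in the second frame. The fewer tokens reach the language model, the shorter the prefill, which is the mechanism behind the reported time-to-first-token reduction.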

[179] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui,Johannes Schusterbauer,Timy Phan,Felix Krause,Josh Susskind,Miguel Angel Bautista,Björn Ommer

Main category: cs.CV

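TL;DR: RepTok is a generative modeling framework built on self-supervised vision transformers that represents an image with a single continuous latent token, achieving efficient image reconstruction and competitive generation performance while preserving the geometry of the SSL space.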

Details

Motivation: Existing generative models usually rely on high-dimensional 2D latent spaces, which makes training expensive and spatially redundant. This paper explores a more compact and efficient latent representation that uses pre-trained self-supervised learning (SSL) models to improve generation efficiency and quality. Method: On top of a pre-trained SSL encoder, only the semantic token embedding is fine-tuned, paired with a generative decoder trained jointly with a standard flow matching objective. A cosine-similarity loss regularizes the adapted token to preserve the favorable geometry of the original SSL space, enabling high-quality image reconstruction. Result: RepTok is competitive on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under an extremely limited training budget, while significantly reducing training cost and removing spatial redundancy. Conclusion: RepTok shows that fine-tuned SSL representations can serve as compact and effective latent spaces for efficient generative modeling, pointing to a new direction for lightweight generative models. Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
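
The cosine-similarity regularizer mentioned above can be written down in a few lines. This is a sketch of one common form, assumed here because the abstract does not give the exact loss or its weighting: the penalty is one minus the cosine similarity between the adapted token and the original SSL token. `cosine_regularizer` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_regularizer(adapted: Sequence[float], original: Sequence[float]) -> float:
    """Assumed penalty form: 0 when directions match, 2 when opposite."""
    return 1.0 - cosine_similarity(adapted, original)


original_token = [1.0, 0.0]
same_direction = cosine_regularizer([2.0, 0.0], original_token)
orthogonal = cosine_regularizer([0.0, 3.0], original_token)
```

Because the penalty depends only on direction, the adapted token may change its norm freely while being discouraged from rotating away from the original SSL embedding; in training it would be added to the flow matching objective with some weight that the abstract does not state.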

[180] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Jihyun Yu,Yoojin Oh,Wonho Bae,Mingyu Kim,Junhyug Noh

Main category: cs.CV

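TL;DR: SteeringTTA is an inference-only framework that needs no model updates or source data. It uses Feynman-Kac steering to guide diffusion-based input adaptation for classification, with pseudo-label rewards, and consistently outperforms the baseline on ImageNet-C.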

Details

Motivation: Existing diffusion-based input adaptation methods rely on gradient guidance, which limits exploration and generalization across distortion types. Method: SteeringTTA adopts Feynman-Kac steering, driving the diffusion process at inference time with pseudo-label rewards; it maintains multiple particle trajectories and combines cumulative top-K probabilities with an entropy schedule to balance exploration and confidence. Result: SteeringTTA is validated on ImageNet-C, where it shows better robustness and performance than the baseline without model updates. Conclusion: With a non-gradient steering strategy for diffusion-based input adaptation, SteeringTTA improves the robustness of classification models under distribution shift without updating model parameters or accessing source data. Abstract: Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-labels. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.
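
A reward-weighted resampling step, in the spirit of the particle steering described above, might look like the sketch below. Several things are assumptions rather than details from the abstract: the reward is taken to be the sum of the K largest class probabilities, weights are an exponential tilt of that reward with a hypothetical `scale` parameter, and the entropy schedule is omitted entirely. No gradients of the classifier are needed, only its output probabilities.

```python
import math
import random
from typing import List, Sequence


def top_k_reward(probs: Sequence[float], k: int) -> float:
    """Assumed reward: total probability mass of the k most likely classes."""
    return sum(sorted(probs, reverse=True)[:k])


def steering_weights(rewards: Sequence[float], scale: float) -> List[float]:
    """Normalized exponential tilting of rewards (numerically stabilized)."""
    best = max(rewards)
    tilted = [math.exp(scale * (r - best)) for r in rewards]
    total = sum(tilted)
    return [t / total for t in tilted]


def resample(particles: Sequence[str], weights: Sequence[float],
             rng: random.Random) -> List[str]:
    """Draw a new particle set; high-reward particles tend to be duplicated."""
    return rng.choices(list(particles), weights=list(weights), k=len(particles))


# Toy classifier outputs for three candidate denoising trajectories.
particle_probs = {
    "a": [0.90, 0.05, 0.05],  # confident prediction
    "b": [0.40, 0.30, 0.30],
    "c": [0.34, 0.33, 0.33],  # nearly uniform prediction
}
names = list(particle_probs)
rewards = [top_k_reward(particle_probs[n], k=1) for n in names]
weights = steering_weights(rewards, scale=5.0)
survivors = resample(names, weights, random.Random(0))
```

A small `scale` keeps the particle set diverse (more exploration), while a large one concentrates on confident trajectories; the paper balances the two with an entropy schedule that this sketch leaves out.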

[181] In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Xinyao Liao,Xianfang Zeng,Ziye Song,Zhoujie Fu,Gang Yu,Guosheng Lin

Main category: cs.CV

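TL;DR: Proposes a low-cost pretraining strategy that uses in-context learning from unpaired video clips to enable instruction-based video editing, improving editing results through pretraining and fine-tuning on top of HunyuanVideoT2V.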

Details

Motivation: Building large-scale paired video editing datasets is costly and complex, so research on instruction-based video editing has lagged behind, and a more efficient, low-cost approach is needed. Method: Pretraining uses in-context learning from unpaired video clips: the base model is first pretrained on approximately 1M real video clips to learn basic editing concepts, then fine-tuned on fewer than 150k curated paired examples to support editing operations such as adding, replacing, and deleting. Result: Compared with existing methods, both instruction alignment and visual fidelity improve, with a 12% gain in instruction following and a 15% gain in editing quality. Conclusion: The proposed low-resource pretraining strategy gives a foundation video generation model general editing capabilities, and fine-tuning with a small amount of high-quality paired data significantly improves performance across editing tasks. Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend to more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

[182] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg,Rob van Gastel,Melda Yeghaian,Sander Dalm,Faysal Boughorbel,Marcel van Gerven

Main category: cs.CV

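TL;DR: Integrating Decorrelated Backpropagation (DBP) into MAE pre-training speeds up convergence of vision transformers, cutting training time by 21.1% and carbon emissions by 21.4% while improving segmentation by 1.1 mIoU.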

Details

Motivation: MAE pre-training performs well but is computationally expensive, making it impractical in resource-constrained industrial settings; a more efficient optimization method is needed. Method: Introduces DBP into the MAE encoder, reducing input correlations layer by layer to accelerate convergence while keeping training stable. Result: With ImageNet-1K pre-training and ADE20K fine-tuning, training time drops by 21.1%, carbon emissions drop by 21.4%, and segmentation mIoU improves by 1.1 points; similar gains are observed on proprietary industrial data. Conclusion: DBP reduces the time and energy cost of large-scale ViT pre-training while improving downstream performance, which makes it relevant for practical industrial use. Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP), an optimization method that iteratively reduces input correlations at each layer to accelerate convergence, into MAE pre-training. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
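
To make "iteratively reduces input correlations at each layer" concrete, the following is a minimal sketch of one possible decorrelation update on a single layer's inputs. The update rule (shrinking the off-diagonal second moments of the transformed inputs), the matrix name R, and the learning rate are assumptions chosen for illustration; the abstract does not give DBP's exact rule.

```python
import numpy as np

def offdiag_mean_abs(x):
    """Mean absolute off-diagonal second moment of x (rows are samples)."""
    c = x.T @ x / len(x)
    return float(np.abs(c - np.diag(np.diag(c))).mean())

def decorrelation_step(R, z, lr=0.1):
    """One assumed update of the decorrelating matrix R for layer inputs z.

    The transformed inputs are x = z @ R.T; the step pushes their
    off-diagonal second moments toward zero.
    """
    x = z @ R.T
    c = x.T @ x / len(x)
    off = c - np.diag(np.diag(c))
    return R - lr * off @ R

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])      # strongly correlated toy inputs
z = rng.multivariate_normal(np.zeros(2), cov, size=2000)

R = np.eye(2)
before = offdiag_mean_abs(z @ R.T)
for _ in range(100):
    R = decorrelation_step(R, z)
after = offdiag_mean_abs(z @ R.T)
print(f"off-diagonal correlation: {before:.3f} -> {after:.6f}")
```

In a network, a matrix like R would sit in front of each layer's weights and be updated alongside ordinary backpropagation; the toy loop above only shows the decorrelation part.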

[183] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)

Weikang Yu,Vincent Nwazelibe,Xianping Ma,Xiaokang Zhang,Richard Gloaguen,Xiao Xiang Zhu,Pedram Ghamisi

Main category: cs.CV

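TL;DR: Presents EuroMineNet, the first multitemporal benchmark for monitoring mining footprints in Europe based on Sentinel-2 imagery, covering 133 mining sites from 2015 to 2024 and supporting GeoAI models in sustainability-driven analysis of mining-related environmental change.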

Details

Motivation: Existing data for monitoring the environmental impact of mining are limited in temporal depth and geographic scope, making long-term, large-area monitoring of land change difficult; a comprehensive benchmark dataset is needed to support sustainable resource management and environmental governance. Method: Builds EuroMineNet, a multitemporal dataset covering 133 mining sites in the European Union from 2015 to 2024, with annual expert-verified annotations based on Sentinel-2 multispectral imagery, and proposes a new evaluation metric, CA-TIoU; the benchmark supports two tasks, multitemporal mining footprint mapping and cross-temporal change detection. Result: Benchmarking 20 state-of-the-art deep learning models shows that GeoAI methods identify long-term environmental changes effectively but still struggle to detect critical short-term dynamics. Conclusion: EuroMineNet advances temporally consistent and explainable monitoring of mining activity, supports sustainable land-use management and environmental resilience, and promotes the use of GeoAI for social and environmental good. Abstract: Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. 
We release the code and datasets, in line with FAIR principles and the open science paradigm, at https://github.com/EricYu97/EuroMineNet.

[184] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Sami Azam,Asif Karim,Jemima Beissbarth,Amanda Leach

Main category: cs.CV

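TL;DR: Proposes WeCKD, a weakly-supervised chained knowledge distillation network that transfers knowledge progressively through a chain of models, each learning from and refining the knowledge of its predecessor; this improves feature learning, reduces dependence on labeled data, and performs strongly across several medical imaging datasets.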

Details

Motivation: Conventional knowledge distillation depends on a strong teacher model and large labeled datasets, and in limited-data settings it suffers from knowledge degradation and inefficient supervision, which restricts practical use. Method: Proposes the WeCKD framework, a distillation chain of multiple lightweight models in which each model is trained on only a subset of the data and learns from and refines its predecessor's knowledge, giving progressive, weakly-supervised knowledge transfer. Result: Matches or surpasses existing supervised methods on four otoscopic imaging datasets and generalizes well to two other medical imaging modalities; cumulative accuracy gains reach up to 23% over a single backbone trained on the same limited data. Conclusion: Through its chained distillation mechanism, WeCKD reduces the dependence of conventional knowledge distillation on strong teachers and large datasets, offering a new approach to efficient knowledge transfer in low-resource medical settings with good potential for practical use. Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these issues, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset, demonstrating that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.

[185] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang,Yuanfan Guo,Rolandos Alexandros Potamias,Jiankang Deng,Hang Xu,Chao Ma

Main category: cs.CV

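TL;DR: Proposes VTimeCoT, a training-free framework for video temporal grounding and reasoning that introduces progress-bar visual tools and a cross-modal visuotemporal chain-of-thought (CoT), substantially improving multimodal large language models on video understanding tasks.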

Details

Motivation: Existing video question answering systems based on multimodal large language models are weak at video temporal grounding and reasoning and fall short of real-world needs. Inspired by how people use a video player's progress bar to understand videos, the authors aim to improve models' understanding of and reasoning about temporal information in video. Method: Proposes the VTimeCoT framework, which includes two new visual tools, a plug-and-play progress bar integration tool and a high-efficiency highlighting tool, together with a visuotemporal CoT that combines video and text to strengthen cross-modal reasoning; the framework requires no additional training. Result: On both the Qwen2VL-7B and GPT4o baselines, VTimeCoT delivers significant gains in video temporal grounding and reasoning-based question answering, and yields a compositional, interpretable reasoning process. Conclusion: By introducing visual tools modeled on human viewing behavior and a new chain-of-thought mechanism, VTimeCoT strengthens the temporal understanding and reasoning of multimodal large language models on video, with good generality and interpretability. Abstract: In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io

[186] Leveraging Learned Image Prior for 3D Gaussian Compression

Seungjoo Shin,Jaesik Park,Sunghyun Cho

Main category: cs.CV

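TL;DR: Proposes a 3D Gaussian Splatting compression framework based on learned image priors that recovers compression-induced quality degradation in image space, substantially improving rate-distortion performance and rendering quality while keeping storage overhead low.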

Details

Motivation: Existing 3DGS compression methods lack learned priors, which limits further improvement of the rate-distortion trade-off. Method: Builds a restoration network that models compression artifacts in image space and uses coarse rendering residuals as side information, combined with learned image priors, to recover the rendering quality of the original Gaussians. Result: Validated on multiple benchmarks; compared with state-of-the-art 3DGS compression methods, it achieves better rate-distortion performance and rendering quality with lower storage requirements. Conclusion: The framework can be combined with existing compression methods and, by introducing learned priors and image-space restoration, substantially improves the rendering quality and storage efficiency of compressed 3DGS. Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.

[187] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery

Caleb Robinson,Kimberly T. Goetz,Christin B. Khan,Meredith Sackett,Kathleen Leonard,Rahul Dodhia,Juan M. Lavista Ferres

Main category: cs.CV

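TL;DR: Proposes a semi-automated method that uses statistical anomaly detection to surface possible whale locations in high-resolution satellite imagery and pairs it with an expert labeling interface, sharply reducing the area that needs manual inspection while achieving high recall, without relying on labeled training data.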

Details

Motivation: Traditional whale population surveys are expensive and hard to scale, and existing automated detection from satellite imagery is hampered by the lack of annotated data, variable image quality, and changing environmental conditions. Method: Uses a statistical anomaly detection method to flag spatial outliers ("interesting points") and pairs it with a web-based labeling interface that helps experts annotate them quickly; the method does not rely on labeled training data. Result: Achieves recalls of 90.3% to 96.4% on three benchmark scenes and reduces the area requiring expert inspection by up to 99.8%, from over 1,000 sq km to less than 2 sq km. Conclusion: The method offers a scalable first step, free of labeled data, toward large-scale marine mammal monitoring from satellite imagery, and has been open sourced. Abstract: Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. "interesting points". We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% -- from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.
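
As a rough illustration of flagging spatial outliers without labeled data, the sketch below scores each pixel of a single-band scene by a robust z-score (median and median absolute deviation) and returns the pixels above a threshold. The statistic, the threshold, and the function name are assumptions made for this example; the abstract does not state which anomaly detector the released pipeline uses.

```python
import numpy as np

def interesting_points(image, threshold=6.0):
    """Return (row, col) pixels whose robust z-score exceeds the threshold.

    Assumed detector for illustration only: deviation from the scene
    median, scaled by the median absolute deviation (MAD).
    """
    img = np.asarray(image, dtype=float)
    median = np.median(img)
    mad = np.median(np.abs(img - median))
    scale = 1.4826 * mad + 1e-9   # MAD rescaled to match a Gaussian std
    z = np.abs(img - median) / scale
    rows, cols = np.nonzero(z > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy scene: mostly uniform open water with one bright anomaly.
rng = np.random.default_rng(0)
scene = rng.normal(100.0, 2.0, size=(64, 64))
scene[10, 20] = 200.0
points = interesting_points(scene)
fraction_to_inspect = len(points) / scene.size
```

Only the flagged points would then be passed to an expert for annotation, which is where the reduction in inspected area comes from.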

[188] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig

Main category: cs.CV

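TL;DR: Presents the first systematic evaluation of deep video camera movement classification models on historical archival film, testing five standard video classification models on the HISTORIAN dataset (expert-annotated World War II footage); Video Swin Transformer performs best at 80.25% accuracy, indicating that existing models show promise on low-quality video but also face challenges.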

Details

Motivation: Existing camera movement classification methods perform well on modern datasets, but their generalization to historical footage has not been explored, so their suitability for low-quality, aged video needs to be assessed. Method: Summarizes representative methods and dataset differences, then evaluates five standard video classification models on the HISTORIAN dataset of expert-annotated World War II footage. Result: Video Swin Transformer reaches 80.25% accuracy on HISTORIAN, showing strong convergence despite limited training data. Conclusion: Existing deep video classification models show some promise on low-quality historical footage but still face challenges; future work should combine multiple input modalities and temporal architectures to improve performance. Abstract: Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.

[189] Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection

Dingzhou Xie,Rushi Lan,Cheng Pang,Enhao Ning,Jiahao Zeng,Wei Zheng

Main category: cs.CV

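TL;DR: Proposes a new Cross-Layer Feature Self-Attention Module (CFSAM) that models local and global dependencies across multi-scale feature maps, substantially improving SSD300 detection performance on PASCAL VOC and COCO while speeding up training convergence with little computational overhead.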

Details

Motivation: Most existing attention-based methods are limited to refining single-layer or dual-layer features and ignore the rich cross-layer dependencies among multi-scale representations, so they struggle to capture the context needed to detect objects with large scale variations. Method: Designs the CFSAM module, consisting of a convolutional local feature extractor, a Transformer-based global modeling unit, and a feature fusion mechanism, to jointly model local and global cross-layer dependencies in multi-scale feature maps, and integrates it into the SSD300 framework. Result: Achieves 78.6% mAP on PASCAL VOC (baseline 75.5%) and 52.1% mAP on COCO (baseline 43.1%), outperforming existing attention modules, with faster training convergence and low computational overhead. Conclusion: Explicitly modeling cross-layer attention is important for improving multi-scale object detection, and CFSAM offers a new way to use multi-scale features efficiently. Abstract: Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.

[190] Free-Grained Hierarchical Recognition

Seulki Park,Zilin Wang,Stella X. Yu

Main category: cs.CV

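TL;DR: Introduces the ImageNet-F benchmark and free-grain learning to handle the inconsistent annotation granularity found in real-world image labels; by strengthening semantic and visual guidance, the methods substantially improve hierarchical image classification under mixed supervision.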

Details

Motivation: Existing hierarchical image classification methods usually assume complete, fine-grained annotations, but in practice the granularity of supervision is inconsistent because of factors such as image quality and annotator expertise. Benchmarks and methods that better reflect real-world conditions are therefore needed. Method: Proposes ImageNet-F, a large-scale benchmark that simulates human annotation behavior to generate mixed-granularity labels, and introduces a free-grain learning framework that strengthens semantic guidance with pseudo-attributes from vision-language models and strengthens visual guidance with semi-supervised learning. Result: The proposed methods substantially improve hierarchical classification performance under mixed supervision, validating the semantic and visual guidance strategies. Conclusion: ImageNet-F and free-grain learning advance hierarchical image classification under real-world constraints and provide an effective way to handle incomplete annotations of varying granularity. Abstract: Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.

[191] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla,Matteo Pennisi,Sarinda Samarasinghe,Giovanni Bellitto,Simone Palazzo,Daniela Giordano,Mubarak Shah,Concetto Spampinato

Main category: cs.CV

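TL;DR: Proposes DEXTER, a data-free framework that uses diffusion models and large language models to generate global textual explanations of visual classifiers.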

Details

Motivation: To make the decision process of visual classifiers explainable when training data or ground-truth labels are unavailable, improving the transparency and trustworthiness of AI systems. Method: Optimizes text prompts to generate class-conditional images that strongly activate the target classifier, then uses these synthetic samples to prompt a large language model to produce natural language reports describing decision patterns and biases. Result: Experiments on ImageNet, Waterbirds, CelebA, and FairFaces show that DEXTER outperforms existing methods in global model explanation and class-level bias reporting, and a user study indicates that its outputs are accurate and interpretable. Conclusion: DEXTER can reveal the internal mechanisms of visual classifiers without access to data, providing a flexible and powerful new tool for model explanation. Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks (activation maximization, slice discovery and debiasing, and bias explanation), each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

[192] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement

Xu Wu,Zhihui Lai,Xianxu Hou,Jie Zhou,Ya-nan Zhang,Linlin Shen

Main category: cs.CV

TL;DR: Proposes LightQANet, a new low-light image enhancement framework that improves image quality through quantized and adaptive feature learning.

Details

Motivation: Existing methods struggle to extract reliable feature representations under low-light conditions, leading to poor texture restoration, color inconsistency, and artifacts. Method: Designs a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors, and introduces a Light-Aware Prompt Module (LAPM) that encodes illumination priors into learnable prompts to dynamically guide feature learning. Result: Experiments on multiple low-light datasets show state-of-the-art qualitative and quantitative results across a range of challenging lighting scenarios. Conclusion: LightQANet delivers consistent and robust image quality improvements across different lighting conditions. Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifacts. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.

[193] Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

Giuseppe Lorenzo Catalano,Agata Marta Soccini

Main category: cs.CV

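TL;DR: Proposes a method based on an unconditional diffusion model for reconstructing Martian surface terrain, trained on an augmented dataset of HiRISE heightmaps, that outperforms traditional interpolation and inpainting methods in accuracy and perceptual similarity.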

Details

Motivation: Martian terrain data often contain missing values; existing interpolation methods struggle to preserve geometric coherence, and the conditional deep learning methods commonly used for Earth cannot be applied, so an efficient unconditional reconstruction method suited to sparse data is needed. Method: Trains an unconditional diffusion model on 12000 Martian heightmaps derived from NASA's HiRISE survey, using a non-homogeneous rescaling strategy to preserve multi-scale terrain features before resizing everything to a fixed 128x128 input resolution. Result: On a test set of 1000 samples, the method improves on traditional techniques such as Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm by 4-15% in RMSE and 29-81% in LPIPS, showing better reconstruction accuracy and visual similarity. Conclusion: The proposed unconditional diffusion model fills missing regions in Martian terrain effectively and clearly outperforms traditional interpolation and inpainting methods, providing a new high-accuracy tool for planetary surface reconstruction. Abstract: Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth's comprehensive datasets enable the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA's HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and the Navier-Stokes algorithm, on an evaluation set of 1000 samples. 
Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.
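
For context on the baselines, here is a minimal sketch of Inverse Distance Weighting void filling on a heightmap, together with an RMSE computed over the filled pixels only. It illustrates the kind of interpolation baseline the abstract compares against, not the authors' evaluation code; the function names, the power parameter, and the brute-force neighbor search are assumptions for this example.

```python
import numpy as np

def idw_fill(heightmap, mask, power=2.0):
    """Fill pixels where mask is True using inverse distance weighting.

    Every known pixel contributes with weight 1 / distance**power.
    Brute force over all known pixels; fine for small tiles only.
    """
    h = np.asarray(heightmap, dtype=float)
    known_coords = np.argwhere(~mask)   # row-major order, matches h[~mask]
    known_vals = h[~mask]
    filled = h.copy()
    for r, c in np.argwhere(mask):
        dist = np.hypot(known_coords[:, 0] - r, known_coords[:, 1] - c)
        weights = 1.0 / dist ** power
        filled[r, c] = np.sum(weights * known_vals) / np.sum(weights)
    return filled

def rmse(reference, estimate, mask):
    """Root-mean-square error over the masked (reconstructed) pixels."""
    diff = reference[mask] - estimate[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: a tilted plane with one missing pixel in the centre.
truth = np.tile(np.arange(5, dtype=float), (5, 1))
mask = np.zeros_like(truth, dtype=bool)
mask[2, 2] = True
damaged = truth.copy()
damaged[mask] = np.nan
restored = idw_fill(damaged, mask)
error = rmse(truth, restored, mask)
```

IDW averages surrounding heights, so it tends to smooth over ridges and crater rims inside large voids, which is consistent with the abstract's remark that simple interpolation often fails to preserve geometric coherence.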

[194] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks

Zhang Nengbo,Hann Woei Ho,Ye Zhou

Main category: cs.CV

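TL;DR: Proposes a visual communication framework based on motion signaling, inspired by the honeybee waggle dance, that gives Micro Air Vehicle (MAV) swarms low-power, reliable communication in environments where radio suffers from spectrum congestion, jamming, and high power consumption.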

Details

Motivation: Conventional radio communication struggles to support reliable MAV swarm communication in environments with spectrum congestion, interference, and high power consumption; inspired by how honeybees convey information through dance without contact, the authors explore a more energy-efficient and robust alternative. Method: Uses four predefined motion primitives (up/down, left/right, left-to-up-to-right, left-to-down-to-right) as a visual codebook, captures MAV flight patterns with event cameras, recognizes actions with an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN), and parses motion sequences with an integrated decoding algorithm. Result: Experiments show that the framework decodes MAV motion signals accurately, consumes little power, and communicates reliably in constrained environments. Conclusion: The framework gives MAV swarms an efficient, energy-saving alternative to radio communication that suits electromagnetically constrained or high-interference settings and has practical potential. Abstract: Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols ("start", "end", "1", "0"). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework's effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.
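
The codebook idea can be sketched as a small decoder that turns a sequence of recognized motion primitives into a bit string. The pairing of primitives with symbols below follows the order in which the abstract lists them, which is an assumption, and the primitive labels and function name are hypothetical; in the paper, recognition itself is done by the segmentation model and SNN, which this sketch takes as given.

```python
# Assumed pairing, following the listing order in the abstract.
CODEBOOK = {
    "vertical": "start",
    "horizontal": "end",
    "left_up_right": "1",
    "left_down_right": "0",
}

def decode_message(primitives):
    """Decode recognized motion primitives into a bit string.

    Bits are collected between a "start" and the next "end" symbol.
    Returns None if no complete message is found.
    """
    bits = []
    started = False
    for primitive in primitives:
        symbol = CODEBOOK.get(primitive)
        if symbol is None:
            raise ValueError(f"unknown motion primitive: {primitive!r}")
        if symbol == "start":
            bits, started = [], True      # a new start resets the message
        elif symbol == "end":
            if started:
                return "".join(bits)
        elif started:
            bits.append(symbol)
    return None

message = decode_message(
    ["vertical", "left_up_right", "left_down_right", "left_up_right", "horizontal"]
)
```

Framing the payload with explicit start and end symbols lets a receiver ignore incidental manoeuvres that occur outside a message.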

[195] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi,Youngsun Lim,Jaeyo Shin,Hyunjung Shim

Main category: cs.CV

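TL;DR: Proposes CoT-PL, a new framework that introduces structured visual chain-of-thought (CoT) reasoning and contrastive background learning (CBL) to improve pseudo-label quality in open-vocabulary object detection, clearly outperforming existing methods in crowded or occluded scenes.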

Details

Motivation: Existing open-vocabulary object detection methods generate pseudo-labels through direct image-text matching and lack intermediate reasoning over semantically complex scenes, which limits robustness in crowded or occluded settings. Method: Proposes the CoT-PL framework, which decomposes object understanding into three steps: region perception, zero-shot category recognition, and background grounding; it also designs contrastive background learning (CBL), which uses background cues as negatives to promote feature disentanglement between objects and background. Result: Improves novel-class AP50 by +7.7 on open-vocabulary COCO and novel-class mask AP by +2.9 on LVIS; in crowded and occluded scenes, novel-class pseudo-label quality improves by 103.4% and 168.4% in relative terms, respectively. Conclusion: By introducing chain-of-thought reasoning and contrastive background learning, CoT-PL substantially improves open-vocabulary object detection in complex scenes and sets a new state of the art. Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.

[196] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images

Usama Sajjad,Abdul Rehman Akbar,Ziyu Su,Deborah Knight,Wendy L. Frankel,Metin N. Gurcan,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

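TL;DR: Develops PRISM, a new interpretable AI model for colorectal cancer prognosis that integrates spatial morphological features and phenotypic diversity and clearly outperforms existing methods at predicting five-year overall survival.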

Details

Motivation: Existing foundation models in computational pathology are mostly task-agnostic and may overlook organ-specific morphological patterns that matter for accurately predicting tumor behavior, treatment response, and patient outcomes. Method: Proposes PRISM (Prognostic Representation of Integrated Spatial Morphology), which incorporates a continuous variability spectrum within each morphology to characterize phenotypic diversity; it is trained on 8.74 million histological images from 424 patients with stage III colorectal cancer. Result: PRISM performs well at predicting five-year overall survival (AUC = 0.70 ± 0.04, accuracy 68.37% ± 4.75%, HR = 3.34, p < 0.0001), improving on existing CRC-specific methods by 15% and on AI foundation models by about 23% in accuracy; it is robust across sexes and stable across clinicopathological subgroups. Conclusion: By modeling continuous morphological variation, PRISM substantially improves colorectal cancer prognosis prediction, shows potential for clinical use, and reproduces the Alliance cohort finding of no survival difference between the two chemotherapy regimens. Abstract: Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task-agnostic methodologies that can overlook crucial organ-specific morphological patterns, which represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity and reflect the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 ± 0.04; accuracy = 68.37% ± 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. 
It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.

[197] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks

Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Szymon Płotka,Jieneng Chen,Qi Chen,Zheren Zhu,Jakub Prządo,Ibrahim E. Hamacı,Sezgin Er,Yuhan Wang,Ashwin Kumar,Bjoern Menze,Jarosław B. Ćwikła,Yuyin Zhou,Akshay S. Chaudhari,Curtis P. Langlotz,Sergio Decherchi,Andrea Cavalli,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

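TL;DR: Proposes R-Super, which trains AI models for tumor segmentation using large numbers of readily available medical reports, greatly reducing dependence on manually drawn tumor masks and matching or exceeding radiologists' detection performance on several tumor types.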

Details

Motivation: Early tumor detection is critical, but training conventional AI models depends on tumor masks that are expensive and time-consuming to annotate by hand, which limits wide adoption. The medical reports that accompany clinical CT scans contain rich, readily available tumor descriptions that have not been fully used. Method: Proposes R-Super, which supervises training by matching the tumor segmentations produced by the AI model to the textual descriptions in medical reports, so that a tumor segmentation model can be trained effectively from text reports alone and improved further by adding a small amount of mask data. Result: A model trained on 101,654 reports performs comparably to one trained on 723 masks; combining reports and masks raises sensitivity by +13% and specificity by +8%, surpassing radiologists on five of seven tumor types; the method also enables, for the first time, automatic segmentation of tumors in organs such as the spleen, gallbladder, and prostate. Conclusion: The study overturns the conventional view that large-scale manually annotated masks are indispensable for AI tumor detection and establishes a scalable, accessible path to early detection across tumor types. Abstract: Early tumor detection saves lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. 
This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super

[198] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning

Ji Cao,Yu Wang,Tongya Zheng,Zujie Ren,Canghong Jin,Gang Chen,Mingli Song

Main category: cs.CV

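TL;DR: Proposes PRTraj, a new trajectory representation learning framework that unifies environment perception with explicit route choice modeling; by enriching road network semantics and capturing route choice behavior, it significantly improves downstream task performance.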

Details

Motivation: Existing trajectory representation learning methods ignore how the external environment and internal route choice behavior shape trajectories, which limits their representational power. Method: An Environment Perception Module enriches road network semantics using POI distributions; a Route Choice Encoder models the road segment transitions in a trajectory as a sequence of decisions and aggregates them into a global trajectory representation. Result: The method performs strongly on 5 downstream tasks across 3 real-world datasets and stays robust in few-shot settings, supporting its effectiveness and generalizability. Conclusion: By combining environment perception with route choice modeling, PRTraj significantly improves the quality of trajectory representations and shows good application potential. Abstract: Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment Perception and explicit Route choice modeling for effective Trajectory representation learning, dubbed PRTraj. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.

[199] FraQAT: Quantization Aware Training with Fractional bits

Luca Morreale,Alberto Gil C. P. Ramos,Malcolm Chadwick,Mehid Noroozi,Ruchika Chavhan,Abhinav Mehrotra,Sourav Bhattacharya

Main category: cs.CV

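TL;DR: Proposes a new fractional-bit quantization approach (FraQAT) that progressively lowers model precision and exploits fractional bits during optimization to preserve generation quality; it outperforms standard QAT on several diffusion models and is deployed on a Samsung S25U phone.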

Details

Motivation: Large generative models are hard to deploy on smartphones because of memory and compute limits, and existing aggressive quantization methods struggle to preserve generation quality. Method: The proposed fractional-bit quantization approach (FraQAT) progressively reduces parameter precision from 32 bits to 4 bits and exploits fractional bits during optimization to maintain high-quality generation. Result: On diffusion models including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, FraQAT achieves 4-7% lower FiD than standard QAT, improving generation quality, and Sana is run on the Snapdragon 8 Elite HTP of a Samsung S25U. Conclusion: FraQAT balances model compression against generation quality, allowing high-performance generative models to run efficiently on mobile devices. Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (FraQAT) approach. The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FraQAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4-7% lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
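
To make the idea of a fractional bit width concrete, here is a minimal sketch of uniform fake quantization in which the number of levels is round(2^b) for a possibly non-integer b, together with a linear schedule from 32 down to 4 bits. The abstract does not say how fractional bits are realized or scheduled, so the level rule, the linear schedule, and the names `fake_quantize` and `bit_schedule` are assumptions for illustration only, not the authors' method.

```python
def fake_quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize x in [lo, hi] using round(2**bits) levels.

    Allowing a non-integer `bits` is an assumption made for illustration:
    for example, 4.5 bits gives round(2**4.5) = 23 levels.
    """
    levels = max(2, round(2 ** bits))
    step = (hi - lo) / (levels - 1)
    clipped = min(max(x, lo), hi)          # values outside the range saturate
    return lo + round((clipped - lo) / step) * step


def bit_schedule(step, total_steps, start_bits=32.0, end_bits=4.0):
    """Hypothetical linear schedule that passes through fractional bit widths."""
    frac = step / max(1, total_steps - 1)
    return start_bits + frac * (end_bits - start_bits)


if __name__ == "__main__":
    for s in range(0, 10, 3):
        b = bit_schedule(s, 10)
        print(f"step {s}: {b:.2f} bits -> {fake_quantize(0.3, b):.6f}")
```

In a real QAT loop the rounding would sit inside the forward pass with a straight-through gradient; that part is omitted here.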

[200] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Qi Chen,Xinze Zhou,Chen Liu,Hao Chen,Wenxuan Li,Zekun Jiang,Ziyan Huang,Yuxuan Zhao,Dexin Yu,Junjun He,Yefeng Zheng,Ling Shao,Alan Yuille,Zongwei Zhou

Main category: cs.CV

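TL;DR: This study uses synthetic data to reduce dependence on large amounts of real annotated data, reaching with only 500 real scans the AI performance that otherwise required 1,500 real scans. Building on this, the authors construct AbdomenAtlas 2.0, a large-scale abdominal tumor segmentation dataset of 10,135 CT scans and 15,130 tumor instances, manually annotated by 23 expert radiologists across six organs. The dataset clearly outperforms existing public datasets, improving DSC by 7% on in-distribution tests and by 16% on out-of-distribution tests.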

Details

Motivation: The lack of large-scale, voxel-wise annotated tumor datasets limits the use of AI in tumor segmentation, and creating such datasets is time-consuming and depends on medical experts. More efficient use of data and larger public datasets are therefore needed to advance AI models. Method: Experiments on the in-house JHH dataset examine the effectiveness of synthetic data; the authors then build the large-scale public dataset AbdomenAtlas 2.0, with over ten thousand CT scans and fine-grained voxel-wise annotations, and combine real and synthetic data to improve training efficiency. Result: With synthetic data, only 500 real scans are needed to reach the performance obtained with 1,500 real scans; AbdomenAtlas 2.0 improves tumor segmentation DSC over existing public datasets by 7% on in-distribution tests and by 16% on out-of-distribution tests. Conclusion: Synthetic data can markedly improve data efficiency and ease the scarcity of real annotated data; AbdomenAtlas 2.0, one of the largest abdominal multi-organ tumor annotation datasets to date, provides a solid foundation for training high-performance tumor segmentation models. Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0--a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation--based on lessons from the JHH dataset--for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.
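
The DSC figures above refer to the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name `dice_score` and the convention for two empty masks are illustrative assumptions, and the abstract does not say whether the reported gains are absolute points or relative changes.

```python
def dice_score(pred, target):
    """Dice similarity coefficient for two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    pred = [0, 1, 1, 1, 0, 0]      # flattened predicted tumor mask
    target = [0, 0, 1, 1, 1, 0]    # flattened reference mask
    print(f"DSC = {dice_score(pred, target):.4f}")
```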

[201] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li,Yuhui Chen,Mingcai Zhou,Haoran Li

Main category: cs.CV

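TL;DR: Proposes QDepth-VLA, a framework that strengthens the spatial perception and reasoning of vision-language-action models through an auxiliary depth prediction task.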

Details

Motivation: Existing VLA models lack the ability to understand and reason about key 3D structure, which makes fine-grained manipulation tasks difficult. Method: A dedicated depth expert module predicts the quantized latent tokens of depth maps produced by a VQ-VAE encoder, so that the model learns depth-aware representations. Result: Experiments on simulation benchmarks and real-world tasks show that QDepth-VLA has strong spatial reasoning and competitive manipulation performance. Conclusion: QDepth-VLA improves VLA models' understanding of 3D spatial structure and their performance on fine-grained manipulation tasks. Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
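
The "quantized latent tokens" mentioned above can be illustrated with the standard VQ-VAE lookup, in which each continuous latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that lookup; the codebook, the latent values, and the name `quantize_to_tokens` are made up for illustration, and the abstract does not give the codebook size, latent dimensions, or the loss the depth expert is trained with.

```python
def quantize_to_tokens(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


if __name__ == "__main__":
    # Toy 2-D codebook with three entries (illustrative values only).
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    # Toy encoder outputs, one vector per depth-map patch.
    latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]
    print(quantize_to_tokens(latents, codebook))
```

Token indices of this kind are discrete targets, so an auxiliary head could be trained to predict them with a classification loss; whether the paper does so is not stated in the abstract.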

[202] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Meiqi Wu,Jiashu Zhu,Xiaokun Feng,Chubin Chen,Chen Zhu,Bingze Song,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Kaiqi Huang

Main category: cs.CV

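TL;DR: This paper proposes ImagerySearch, a prompt-guided adaptive test-time search strategy that improves video generation models in imaginative scenarios, and releases LDT-Bench, the first benchmark dedicated to long-distance semantic prompts.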

Details

Motivation: Existing video generation models perform well in realistic scenarios but degrade in imaginative scenarios that involve rarely co-occurring concepts and long-distance semantic relationships, and existing test-time scaling methods adapt poorly because of fixed search spaces and static reward designs. Method: ImagerySearch dynamically adjusts the inference search space and the reward function according to the semantic relationships in the prompt, yielding more coherent and visually plausible videos. The LDT-Bench benchmark is built to evaluate creative generation capability. Result: Experiments show that ImagerySearch significantly outperforms strong baselines and existing test-time scaling methods on LDT-Bench and achieves competitive improvements on VBench. Conclusion: ImagerySearch improves the generation quality of video models in imaginative scenarios, and LDT-Bench provides an important resource for future research. Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.

[203] A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation

Harsha Kotla,Arun Kumar Rajasekaran,Hannah Rana

Main category: cs.CV

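TL;DR: Proposes a deep learning framework that classifies skin lesions, quantifies the ABCD features, and simulates the E (Evolving) feature, helping clinicians relate machine learning diagnoses to clinical criteria.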

Details

Motivation: Existing deep learning methods for skin lesion analysis are mostly black boxes that do not explicitly model interpretable clinical features such as ABCDE, which limits their clinical use. Method: A deep learning framework classifies skin lesions, quantifies each of the A, B, C, and D features separately, and simulates the E (Evolving) process through feature trajectories; changes in the ABCD features during lesion progression are visualized in latent space. Result: Experiments on the HAM10000 dataset show a classification accuracy of about 89% and a melanoma AUC of 0.96; asymmetry, color variation, and diameter are predicted well, while border irregularity is harder to model. Conclusion: The framework improves model interpretability, lets clinicians understand ML decisions in terms of clinically relevant features, helps reveal how skin cancer develops, and advances the use of AI in clinical diagnosis. Abstract: Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.

Mihai-Cristian Pîrvu,Marius Leordeanu

Main category: cs.CV

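TL;DR: This paper proposes a self-supervised learning approach that uses many visual modalities: pre-trained expert models and an automated data pipeline combine the multi-modal data, and PHG-MAE, a model designed for multi-modal data, achieves performance comparable to large models at a low parameter count, suiting applications such as real-time semantic segmentation and depth estimation.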

Details

Motivation: A fuller understanding of the real world requires integrating many independent modalities, yet traditional machine learning models are mostly unimodal and existing methods are largely limited to bi-modal fusion, so multi-modal data is not integrated effectively. Method: Pre-trained expert models and procedural combinations of them are applied to raw videos in a fully automated data pipeline that combines multiple visual modalities; PHG-MAE is used for multi-modal learning and is efficiently distilled into a low-parameter (<1M) version. Result: The low-parameter PHG-MAE model achieves results competitive with models of about 300M parameters and is deployed for real-time semantic segmentation on handheld devices or webcams; the same framework also supports other off-the-shelf models, such as DPT for near real-time depth estimation. Conclusion: A fully automated multi-modal data pipeline and a lightweight model designed for multi-modal data enable efficient multi-modal understanding on resource-constrained devices and advance multi-modal learning in real-world settings. Abstract: The real world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, which was efficiently distilled into a low-parameter (<1M) version, can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

[205] Benchmarking Multimodal Large Language Models for Face Recognition

Hatef Otroshi Shahreza,Sébastien Marcel

Main category: cs.CV

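TL;DR: This paper systematically evaluates the face recognition performance of multimodal large language models (MLLMs) on several standard face datasets and finds that, although MLLMs capture rich semantic information, they still lag behind specialized models in the zero-shot setting.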

Details

Motivation: To explore the potential of MLLMs for face recognition and to fill the gap in comparing open-source MLLMs with existing specialized models on standard benchmarks. Method: State-of-the-art MLLMs are systematically benchmarked on face recognition datasets including LFW, CALFW, CPLFW, CFP, AgeDB, and RFW, using an evaluation protocol comparable to that of conventional models. Result: Experiments show that MLLMs capture face-related semantic features effectively, but in high-precision recognition tasks their zero-shot performance remains well below that of purpose-built face recognition models. Conclusion: The study provides a foundational benchmark for MLLM-based face recognition, exposes current limitations, and points toward the design of future models with higher accuracy and stronger generalization. Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with a similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available on the project page.

[206] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Guangyi Han,Wei Zhai,Yuhang Yang,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

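TL;DR: This paper introduces Free-Form HOI Generation, which generates diverse and physically plausible hand-object interactions conditioned on fine-grained intent, and builds the WildO2 dataset and the TOUCH framework to support the task.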

Details

Motivation: Existing HOI generation research is confined to fixed grasping patterns and cannot capture the diversity of everyday interactions, so a method that generates richer interactions from fine-grained intent is needed. Method: The Free-Form HOI Generation task is proposed; WildO2, a 3D dataset of 4.4k interactions, is constructed; and TOUCH, a three-stage framework built on a multi-level diffusion model, generates interactions using explicit contact modeling together with contact-consistency and physical constraints. Result: Experiments show that the method generates controllable, diverse, and physically plausible non-grasping hand-object interactions, covering everyday actions such as pushing, poking, and rotating. Conclusion: Together with the WildO2 dataset, the TOUCH framework extends HOI generation from fixed grasps to free-form interactions and significantly improves its diversity and semantic controllability. Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is https://guangyid.github.io/hoi123touch.

[207] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data

Roni Goldshmidt,Hamish Scott,Lorenzo Niccolini,Shizhan Zhu,Daniel Moura,Orly Zvitia

Main category: cs.CV

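TL;DR: This paper presents BADAS, a family of forward-collision prediction models trained on real-world dashcam data; by re-annotating existing datasets and introducing an ego-centric evaluation benchmark, it addresses the high false-alert rate of traditional methods.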

Details

Motivation: Existing collision prediction methods have difficulty distinguishing threats that involve the ego vehicle from random accidents that do not, which produces many false alerts in practice and harms user experience and trust in the system. Method: The BADAS model family uses a V-JEPA2 backbone trained end-to-end; Nexar's real-world collision dataset is presented as the first benchmark built for ego-centric evaluation; major datasets are re-annotated to identify ego involvement, with consensus alert-time labels and synthesized negatives added to support fair AP/AUC and temporal evaluation. Result: Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC, outperforms a conventional ADAS baseline, and produces more realistic time-to-accident estimates; the BADAS-Open model weights, code, and re-annotations of all datasets are released. Conclusion: Through ego-centric data re-annotation and model design, BADAS significantly improves the accuracy and practicality of collision prediction and promotes standardization and open research in the field. Abstract: Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar's real-world dashcam collision dataset -- the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.
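
For readers unfamiliar with the AP metric quoted above, the following is a minimal sketch of one common, non-interpolated definition: rank clips by predicted collision score and average the precision at the rank of each true positive. The function name `average_precision` and the toy scores are illustrative, and the abstract does not state the exact AP protocol used in the benchmark.

```python
def average_precision(scores, labels):
    """AP as the mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # With no positives AP is undefined; 0.0 is returned here by convention.
    return sum(precisions) / len(precisions) if precisions else 0.0


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]   # predicted collision scores per clip
    labels = [1, 0, 1, 0]           # 1 = ego-involved collision, 0 = negative
    print(f"AP = {average_precision(scores, labels):.4f}")
```

Under this definition a non-ego accident that receives a high score counts as a false alert and lowers AP, which is why re-annotating ego involvement changes the measured ranking of methods.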

[208] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention

Keli Liu,Zhendong Wang,Wengang Zhou,Shaodong Xu,Ruixiao Dong,Houqiang Li

Main category: cs.CV

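TL;DR: This paper proposes ScaleWeaver, a high-fidelity, controllable text-to-image generation framework built on visual autoregressive (VAR) models, which achieves precise control through parameter-efficient fine-tuning.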

Details

Motivation: Diffusion models already have reasonably good control mechanisms, but precise and flexible control within the visual autoregressive (VAR) paradigm remains underexplored; this paper aims to fill that gap. Method: ScaleWeaver is built around an improved MMDiT block with a newly designed Reference Attention module, which discards the redundant image→condition attention, lowering computational cost and stabilizing control injection; it also emphasizes parameter reuse and adds a zero-initialized linear projection so that control signals are incorporated without disrupting the base model's generative capability. Result: Experiments show that ScaleWeaver delivers strong generation quality and control precision, with better inference efficiency than diffusion-based methods. Conclusion: ScaleWeaver offers an efficient, practical solution for controllable text-to-image generation within the visual autoregressive paradigm. Abstract: Text-to-image generation with visual autoregressive (VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within the VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
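
The role of the zero-initialized linear projection can be seen in a small sketch: when the projection weights start at zero, the control branch contributes nothing, so the fine-tuned model initially reproduces the base model exactly and only departs from it as the weights are learned. The additive residual form and the names `linear` and `inject_control` are assumptions for illustration; the abstract does not say where the projection sits inside the block.

```python
def linear(x, weight):
    """Apply a bias-free linear map; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def inject_control(base_features, control_features, weight):
    """Add a linearly projected control signal to the base features."""
    projected = linear(control_features, weight)
    return [b + p for b, p in zip(base_features, projected)]


if __name__ == "__main__":
    base = [0.5, -1.0, 2.0]                       # toy backbone features
    control = [3.0, 4.0]                          # toy control features
    zero_weight = [[0.0, 0.0] for _ in base]      # zero initialization
    print(inject_control(base, control, zero_weight))   # equals `base`

    learned = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.5]]      # after some training
    print(inject_control(base, control, learned))
```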

[209] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence,Oindrila Saha,Megan Wei,Chen Sun,Subhransu Maji,Grant Van Horn

Main category: cs.CV

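TL;DR: This paper proposes nlg2choice, a two-stage method for evaluating the free-form text responses of multimodal large language models in fine-grained visual classification; open-ended question answering followed by constrained decoding improves classification and retrieval performance.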

Details

Motivation: Existing work mostly addresses language-only tasks or multiple-choice questions with few options, and cannot handle the multiple-choice problems of fine-grained visual classification, where hundreds or thousands of options are highly similar; an effective way to evaluate free-form responses is also lacking. Method: A two-stage approach: the MLLM first answers an open-ended question in free text, then text-only constrained decoding predicts the most likely choice; in retrieval settings an early stopping strategy improves computational efficiency. Result: On seven fine-grained visual datasets the method outperforms existing approaches in both classification and retrieval and remains stable across different natural-language formulations of the task. Conclusion: nlg2choice addresses the difficulty of evaluating MLLMs when choices are numerous and highly similar, balances accuracy with computational efficiency, and suits complex visual classification and retrieval tasks. Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
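
The second stage described above, choosing among a fixed set of labels given a free-form answer, might be sketched as scoring each choice by its summed token log-probability and abandoning a choice as soon as it can no longer win. Everything concrete here is an assumption for illustration: `toy_logprob` is a stand-in for a real language model, `pick_choice` is a hypothetical name, and the pruning rule is one plausible form of early stopping rather than the one used in the paper, which the abstract does not describe.

```python
import math


def pick_choice(choices, token_logprob):
    """Return (best_choice, best_score) by summed token log-probability.

    token_logprob(prefix_tokens, token) must return a value <= 0, so a
    partial sum can only decrease; a choice is abandoned as soon as its
    partial sum falls below the best complete score seen so far.
    """
    best_choice, best_score = None, -math.inf
    for choice in choices:
        tokens, total, pruned = choice.split(), 0.0, False
        for i, tok in enumerate(tokens):
            total += token_logprob(tokens[:i], tok)
            if total < best_score:       # early stop: this choice cannot win
                pruned = True
                break
        if not pruned and total > best_score:
            best_choice, best_score = choice, total
    return best_choice, best_score


def toy_logprob(prefix, token, answer="it looks like a scarlet tanager"):
    """Toy stand-in for an LLM: tokens found in the free-form answer are likely."""
    return math.log(0.9) if token in answer.split() else math.log(0.05)


if __name__ == "__main__":
    choices = ["summer tanager", "scarlet tanager", "northern cardinal"]
    print(pick_choice(choices, toy_logprob))
```

In this toy run the third label is dropped after its first token, which is the kind of saving that matters when the choice set has hundreds or thousands of entries.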

[210] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz

Main category: cs.CV

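TL;DR: Proposes a new video anomaly detection framework based on multimodal large language models (MLLMs) that detects anomalies by generating textual descriptions of object activity and interactions, offering good explainability and state-of-the-art performance.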

Details

Motivation: Existing semi-supervised video anomaly detection methods handle complex anomalies involving object interactions poorly and lack explainability. Method: An MLLM is queried with visual inputs of object pairs at different moments to generate textual descriptions of activity and interactions, which are compared with descriptions from nominal training videos to detect anomalies. Result: Experiments on benchmark datasets show that the method detects complex interaction-based anomalies effectively and reaches state-of-the-art performance on datasets without interaction anomalies. Conclusion: The method improves detection of complex anomalies, gives video anomaly detection inherent explainability, and can be combined with traditional methods to make them more interpretable. Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
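
The comparison step, scoring a test-time description against descriptions collected from nominal videos, can be sketched as a nearest-neighbour lookup. The abstract does not say how descriptions are compared, so the word-set Jaccard similarity below is a deliberately simple stand-in (a text-embedding similarity would be a natural alternative), and the names `jaccard` and `anomaly_score` are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two descriptions (stand-in for a text similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def anomaly_score(description, nominal_descriptions):
    """One minus the similarity to the closest nominal description."""
    return 1.0 - max(jaccard(description, n) for n in nominal_descriptions)


if __name__ == "__main__":
    nominal = ["a person walks beside a bicycle", "two people walk together"]
    for text in ["a person walks beside a bicycle", "a person throws a bicycle"]:
        print(f"{anomaly_score(text, nominal):.2f}  {text}")
```

Because the score is tied to a specific nominal description, the closest match and the differing words can be shown to an operator, which is one way the explainability claimed above could surface in practice.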

[211] MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

Gabriel Fiastre,Antoine Yang,Cordelia Schmid

Main category: cs.CV

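TL;DR: This paper proposes MaskCaptioner, an end-to-end model for dense video object captioning (DVOC), trained with the synthetic caption datasets LVISCap and LV-VISCap and achieving state-of-the-art performance on several benchmarks.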

Details

Motivation: Because DVOC is complex and manual annotation is expensive, earlier methods used disjoint training strategies that yield suboptimal performance, so a jointly optimized end-to-end method is needed. Method: A state-of-the-art vision-language model (VLM) generates captions for spatio-temporally localized entities, extending the LVIS and LV-VIS datasets into LVISCap and LV-VISCap, on which MaskCaptioner is trained to jointly detect, segment, track, and caption. Result: MaskCaptioner achieves state-of-the-art DVOC results on three benchmarks: VidSTG, VLN, and BenSMOT. Conclusion: Pretraining on synthetic caption data improves overall DVOC performance and supports the advantage of end-to-end joint training. Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

[212] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

JoungBin Lee,Jaewoo Jung,Jisang Han,Takuya Narihira,Kazumi Fukuda,Junyoung Seo,Sunghwan Hong,Yuki Mitsufuji,Seungryong Kim

Main category: cs.CV

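TL;DR: Proposes 3DScenePrompt, a framework that uses dual spatio-temporal conditioning and a 3D scene memory to generate video from long inputs while preserving scene consistency and precise camera control.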

Details

Motivation: Existing methods rely on a single image or a short clip as input and struggle to maintain motion continuity, scene consistency, and camera controllability together in long-sequence generation. Method: A dual spatio-temporal conditioning mechanism combines temporally adjacent frames with spatially adjacent content; dynamic SLAM and a dynamic masking strategy build a 3D scene memory containing only static geometry, which can be projected to any viewpoint as a 3D spatial prompt. Result: The framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality while preserving computational efficiency and motion realism. Conclusion: By separating static scene memory from dynamic content, 3DScenePrompt achieves long-horizon, highly consistent video generation with flexible camera control. Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/

[213] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Zhe Li,Weihao Yuan,Weichao Shen,Siyu Zhu,Zilong Dong,Chang Xu

Main category: cs.CV

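TL;DR: Proposes a multimodal human motion generation method based on a continuous masked autoregressive motion transformer, combined with a DiT structure and attention mechanisms, that performs strongly on text-, speech-, and music-to-motion generation tasks.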

Details

Motivation: To address two main challenges in whole-body multimodal human motion generation: an effective motion generation mechanism and the fusion of multiple modalities such as text, speech, and music. Method: A continuous masked autoregressive motion transformer is designed with gated linear attention and an RMSNorm module, and a DiT structure with AdaLN and cross-attention performs the multimodal fusion. Result: The method outperforms previous methods on text-to-motion, speech-to-gesture, and music-to-dance tasks. Conclusion: The proposed method offers stronger generation capability and better cross-modal generalization in multimodal human motion generation. Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
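
RMSNorm, named above as one of the stabilizing components, rescales a feature vector by the reciprocal of its root mean square and then applies a learned per-feature gain. A minimal sketch of the standard formulation follows; the `eps` default and the function name `rms_norm` are illustrative choices, not values taken from the paper.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain if gain is not None else [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


if __name__ == "__main__":
    small = rms_norm([3.0, 4.0])
    large = rms_norm([300.0, 400.0])   # same direction, 100x the magnitude
    print(small)
    print(large)                       # nearly identical to `small`
```

The near-identical outputs for inputs of very different magnitude illustrate why such a layer can damp the effect of abnormally large movements or of modalities whose features live on different scales.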

[214] RealDPO: Real or Not Real, that is the Preference

Guo Cheng,Danni Yang,Ziqi Huang,Jianlou Si,Chenyang Si,Ziwei Liu

Main category: cs.CV

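TL;DR: Proposes RealDPO, an alignment paradigm that uses real-world data for preference learning, improving motion realism in video generation by contrasting real videos with erroneous model outputs.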

Details

Motivation: Existing video generation models struggle to keep motion natural, smooth, and contextually consistent when generating complex movements, which limits their practical use. Method: RealDPO applies Direct Preference Optimization (DPO) based on real-world data, with a tailored loss function, and performs post-training on RealAction-5K, a dataset of high-quality videos of human activities. Result: Experiments show that RealDPO significantly outperforms state-of-the-art models and existing preference optimization methods in video quality, text alignment, and motion realism. Conclusion: By using real videos as positive samples for preference learning, RealDPO improves the realism of complex motion and the overall quality of generated videos, offering a new direction for motion modeling in video generation models. Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
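
As background for the preference objective, the sketch below implements the standard DPO loss for one (preferred, rejected) pair, -log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). It is the generic objective, not RealDPO's tailored loss, which the abstract does not define; in the RealDPO setting the preferred sample would be a real video and the rejected sample an erroneous model output, and for diffusion-based video models the log-probabilities would have to be replaced by a suitable surrogate. The name `dpo_loss`, the `beta` default, and the toy log-probabilities are illustrative.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-probabilities."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy raises the preferred sample and lowers the rejected one: loss drops.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```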

[215] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi,Aldrich Yu,Rongyao Fang,Houxing Ren,Ke Wang,Aojun Zhou,Changyao Tian,Xinyu Fu,Yuxuan Hu,Zimu Lu,Linjiang Huang,Si Liu,Rui Liu,Hongsheng Li

Main category: cs.CV

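TL;DR: This paper presents MathCanvas, a framework that improves the visual chain-of-thought ability of large multimodal models in mathematics through two stages (visual manipulation and strategic visual-aided reasoning), and builds the accompanying datasets and benchmark, yielding substantial performance gains.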

Details

Motivation: Existing visual chain-of-thought methods are limited by rigid external tools or cannot generate high-fidelity, well-timed diagrams, so they struggle with complex math problems that depend on visual aids; a solution intrinsic to large multimodal models is needed. Method: The model is first pre-trained on a new 15.2M-pair corpus to master diagram generation and editing, then fine-tuned on a 219K-example dataset of interleaved visual-textual reasoning paths to learn when and how to use visual aids. Result: The resulting BAGEL-Canvas model achieves an 86% relative improvement over strong multimodal baselines on MathCanvas-Bench and generalizes well to other public math benchmarks. Conclusion: MathCanvas provides a complete toolkit, datasets, and an evaluation benchmark for complex, human-like visual-aided reasoning in large multimodal models, effectively improving their mathematical reasoning. Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

[216] C4D: 4D Made from 3D through Dual Correspondences

Shizun Wang,Zhenxiang Jiang,Xingyi Yang,Xinchao Wang

Main category: cs.CV

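TL;DR: This paper proposes C4D, a framework that extends existing 3D reconstruction methods to 4D reconstruction of dynamic scenes by introducing temporal correspondences (short-term optical flow and long-term point tracking); it jointly optimizes per-frame 3D geometry and camera parameters and performs well on several downstream tasks.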

Details

Motivation: Existing 3D reconstruction methods (such as DUSt3R) work well on static scenes but reconstruct dynamic scenes inaccurately because moving objects violate multi-view geometric constraints, so a 4D reconstruction method that handles dynamic scenes is needed. Method: Besides predicting pointmaps, C4D captures two kinds of correspondences, short-term optical flow and long-term point tracking. A dynamic-aware point tracker is trained to provide motion information, which is used to estimate motion masks that separate dynamic elements from the static background. A set of dynamic scene optimization objectives recovers per-frame 3D geometry and camera parameters, while the correspondences lift 2D trajectories into smooth 3D trajectories, giving a complete 4D reconstruction. Result: Experiments show that C4D achieves complete 4D recovery in dynamic scenes and performs strongly on downstream tasks including depth estimation, camera pose estimation, and point tracking. Conclusion: By combining temporal correspondences with dynamic-aware optimization, C4D extends existing 3D reconstruction methods to 4D dynamic scene reconstruction and offers an effective solution for 4D modeling from monocular video. Abstract: Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D

[217] RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion

Thao Nguyen,Jiaqi Ma,Fahad Shahbaz Khan,Souhaib Ben Taieb,Salman Khan

Main category: cs.CV

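TL;DR: This paper proposes a new precipitation nowcasting method that integrates token-wise attention into both the U-Net diffusion model and the spatio-temporal encoder, capturing multi-scale spatial interactions and temporal evolution without extra latent modules, and significantly outperforms existing methods on multiple datasets.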

Details

Motivation: Predicting radar echo sequences is hard because atmospheric spatio-temporal dynamics are highly complex and tightly coupled; existing diffusion-based models are limited in scalability, computational efficiency, and long-range dependency modeling. Method: Token-wise attention is integrated natively into the U-Net diffusion model and the spatio-temporal encoder, modeling multi-scale spatio-temporal dependencies directly in pixel space and avoiding both a separately trained autoencoder and computationally expensive attention structures. Result: Experiments on multiple datasets show that the method significantly outperforms state-of-the-art models in local fidelity, generalization, and robustness. Conclusion: By integrating attention natively, the method improves precipitation nowcasting without the high resource cost typical of pixel-space diffusion, suggesting a new direction for diffusion models in weather prediction. Abstract: Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.

[218] ChangingGrounding: 3D Visual Grounding in Changing Scenes

Miao Hu,Zhiwei Huang,Tai Wang,Jiangmiao Pang,Dahua Lin,Nanning Zheng,Runsen Xu

Main category: cs.CV

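TL;DR: This paper introduces ChangingGrounding, the first benchmark for 3D visual grounding in changing scenes, which emphasizes using memory and active exploration to reduce re-scanning cost, and proposes Mem-ChangingGrounder, a zero-shot method that achieves high localization accuracy at lower exploration cost.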

Details

Motivation: Existing 3D visual grounding methods rely on a complete, up-to-date point cloud, which forces frequent re-scans in changing real-world environments and limits deployment; a more efficient, memory-based active grounding approach is needed. Method: 3D visual grounding is formulated as an active, memory-driven problem. Mem-ChangingGrounder combines cross-modal retrieval with lightweight multi-view fusion: it retrieves past memories to guide actions, selectively explores the target region, performs multi-view scanning, and fuses the results into an accurate 3D bounding box. Result: On the ChangingGrounding benchmark, Mem-ChangingGrounder achieves the highest localization accuracy in the zero-shot setting while greatly reducing exploration cost. Conclusion: The work advances memory-centric 3D visual grounding research for real-world applications, providing a new benchmark and an effective method for robot localization in changing environments. Abstract: Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/

[219] WithAnyone: Towards Controllable and ID Consistent Image Generation

Hengyuan Xu,Wei Cheng,Peng Xing,Yixiao Fang,Shuhan Wu,Rui Wang,Xianfang Zeng,Daxin Jiang,Gang Yu,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

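TL;DR: This paper proposes WithAnyone, a diffusion-based method for identity-consistent text-to-image generation. By building the large-scale paired dataset MultiID-2M, designing a new evaluation benchmark, and introducing a contrastive identity loss, it mitigates the "copy-paste" problem and improves diversity and controllability while keeping identity similarity high.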

Details

Motivation: Lacking large-scale paired data with multiple images per person, existing identity-consistent text-to-image methods rely on reconstruction-based training, which tends to cause "copy-paste": the model replicates the reference face without natural variation, limiting controllability and expressiveness. Method: (1) Build MultiID-2M, a large-scale paired multi-identity dataset; (2) propose a new benchmark that quantifies copy-paste and the trade-off between identity fidelity and variation; (3) design a new training paradigm with a contrastive identity loss that uses paired data to balance fidelity and diversity. Result: WithAnyone significantly reduces copy-paste artifacts and improves controllability over pose and expression while maintaining strong perceptual quality; quantitative experiments and user studies show it outperforms existing methods in identity consistency and generation diversity. Conclusion: Through a contrastive learning strategy driven by paired data, WithAnyone addresses over-similarity to the reference in identity generation and achieves high-fidelity, expressive, controllable image generation, providing a new benchmark and solution for multi-identity text-to-image synthesis. Abstract: Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
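
The abstract does not give the exact form of the contrastive identity loss. One common way to write a contrastive objective over identity embeddings is InfoNCE, sketched below in plain Python. Everything here is an assumption for illustration: the function names, the temperature, and the 2-D toy embeddings stand in for face-recognition features of generated and reference images.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: pull the anchor toward the positive, push it from the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Toy embeddings: a generated face, another photo of the same person, a different person.
generated = [1.0, 0.0]
same_identity = [0.9, 0.1]
other_identity = [0.0, 1.0]
matched = info_nce(generated, same_identity, [other_identity])
mismatched = info_nce(generated, other_identity, [same_identity])
```

Because the positive can be a different photo of the same person rather than the reference image itself, a loss of this shape rewards identity agreement without rewarding pixel-level copying.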

[220] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Shaowei Liu,Chuan Guo,Bing Zhou,Jian Wang

Main category: cs.CV

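TL;DR: Proposes Ponimator, a framework that uses close-proximity interactive pose priors and two conditional diffusion models to generate interaction motion sequences from an image, text, or a single pose.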

Details

Motivation: Humans can infer context and dynamics from close-proximity interactive poses; the authors want a general framework that exploits this strong behavioral prior for interaction animation. Method: Using motion-capture data of close-contact two-person poses and their temporal context, the authors design two conditional diffusion models: a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and a pose generator that uses the spatial prior to synthesize interactive poses from a single pose, text, or both. Result: Experiments across multiple datasets and applications confirm the universality of the pose prior; the framework is effective and robust for image-based interaction animation, reaction animation, and text-to-interaction synthesis. Conclusion: By combining temporal and spatial priors, Ponimator transfers interaction knowledge from high-quality mocap data to open-world scenarios and supports a variety of interaction animation tasks. Abstract: Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.

[221] Terra: Explorable Native 3D World Model with Point Latents

Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Xin Tao,Pengfei Wan,Jie Zhou,Jiwen Lu

Main category: cs.CV

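TL;DR: This paper presents Terra, a world model built on a native 3D representation. It uses a point-to-Gaussian variational autoencoder (P2G-VAE) and a sparse point flow matching network (SPFlow) to model and generate explorable environments in a 3D latent space, achieving high 3D consistency and efficient rendering.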

Details

Motivation: Most existing world models rely on pixel-aligned representations and neglect the 3D nature of the physical world, which hurts 3D consistency and modeling efficiency. Method: Terra consists of P2G-VAE, which encodes 3D inputs into a latent point representation and decodes it into 3D Gaussian primitives to jointly model geometry and appearance, and SPFlow, a network that generates the point latents by denoising their positions and features. Result: Experiments on ScanNet v2 indoor scenes show that Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency, and supports rendering from any viewpoint with a single generation process. Conclusion: With a native 3D representation and architecture, Terra improves the 3D consistency and modeling efficiency of world models and supports flexible rendering and progressive generation of explorable environments. Abstract: World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
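
For readers unfamiliar with flow matching, the sketch below shows the generic linear-path variant on a single toy point: the training target is the constant velocity from a noise sample to a data sample, and sampling integrates the velocity field with Euler steps. Whether SPFlow uses this exact path is not stated in the abstract; the function names, the toy coordinates, and the oracle velocity (standing in for a trained network) are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# One toy 3D point latent: start from "noise" and flow to the "data" position.
noise = [0.0, 0.0, 0.0]
data = [1.0, 2.0, -1.0]
oracle = lambda x, t: target_velocity(noise, data)  # stands in for a trained network
sample = euler_sample(noise, oracle)
midpoint = interpolate(noise, data, 0.5)
```

In a point-latent setting the same update would apply to every point's position and feature vector at once, which matches the abstract's description of denoising both simultaneously.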

[222] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari,Sheng-Yu Wang,Nanxuan Zhao,Yotam Nitzan,Yuheng Li,Krishna Kumar Singh,Richard Zhang,Eli Shechtman,Jun-Yan Zhu,Xun Huang

Main category: cs.CV

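TL;DR: Proposes a new training paradigm for image editing that needs no paired data: a diffusion model is unrolled during training and optimized directly with feedback from a vision-language model (VLM). While preserving edit accuracy and content consistency, it matches or exceeds models trained on large amounts of supervised paired data.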

Details

Motivation: Existing instruction-based image editing models depend on supervised fine-tuning with large-scale input-target pairs, which are hard to collect at scale; synthetic pairs can propagate and magnify the artifacts of the pretrained model, creating a bottleneck. Method: A new training paradigm without paired data: a few-step diffusion model is unrolled during training, and a VLM judges whether each edit follows the instruction and preserves unchanged content, providing gradients for end-to-end optimization. A distribution matching loss (DMD) keeps generated images within the image manifold learned by the pretrained model to ensure visual fidelity. Result: The method is evaluated on standard benchmarks with an extensive ablation study. Without any paired data, it performs on par with or better than various image editing diffusion models trained on extensive supervised paired data in the few-step setting, and outperforms RL-based techniques such as Flow-GRPO when using the same VLM as the reward model. Conclusion: The method removes the dependence on paired training data; by combining VLM feedback with a distribution matching loss it achieves efficient, high-quality text-driven image editing and offers a practical path to reducing reliance on curated paired data. Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

[223] From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Haiwen Diao,Mingxuan Li,Silei Wu,Linjun Dai,Xiaohua Wang,Hanming Deng,Lewei Lu,Dahua Lin,Ziwei Liu

Main category: cs.CV

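TL;DR: This paper presents NEO, a new family of native vision-language models (VLMs) that aims to address the fundamental constraints separating native VLMs from modular ones and to make the field more accessible. NEO builds visual perception from scratch, aligns pixel and word representations, combines the strengths of vision and language modules, and performs well across diverse real-world scenarios.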

Details

Motivation: Native vision-language models (VLMs) have great potential, but their development is held back by fundamental differences from modular VLMs and by a high barrier to entry for research. The paper clarifies these challenges and proposes construction principles to promote native VLM research and applications. Method: The paper proposes principles for building native VLMs: (i) effectively align pixel and word representations in a shared semantic space; (ii) seamlessly integrate the strengths of vision and language modules; (iii) inherently support various cross-modal properties for unified vision-language encoding, aligning, and reasoning. Based on these principles, the NEO model family is trained with 390M image-text pairs, developing visual perception from scratch. Result: NEO rivals top-tier modular VLMs across diverse real-world scenarios while mitigating vision-language conflicts and learning visual perception efficiently inside a single dense model. Code and models are open-sourced to support a scalable, low-cost research ecosystem. Conclusion: NEO lays a foundation for scalable and powerful native VLMs and provides a set of reusable components that lower the barrier to research and accelerate progress in the field. Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

[224] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer,Yunzhi Zhang,Chen Geng,Jia-Bin Huang,Jiajun Wu

Main category: cs.CV

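TL;DR: Proposes an inference-time method based on coupled diffusion sampling that uses pre-trained 2D image editing models for multi-view consistent image editing, maintaining cross-view consistency through implicit 3D regularization.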

Details

Motivation: Existing methods rely on explicit 3D representations for multi-view consistent editing, but the optimization is slow and unstable under sparse views, so a more efficient and general solution is needed. Method: Coupled diffusion sampling draws samples concurrently from a multi-view image distribution and a 2D edited image distribution during the diffusion process, with a coupling term that enforces multi-view consistency among the generated images. Result: The method's effectiveness and generality are validated on three different multi-view editing tasks; it applies to a range of model architectures and produces consistent results without complex 3D optimization. Conclusion: The framework offers an efficient, stable, and general solution for multi-view consistent image editing that avoids the complexity of explicit 3D modeling. Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
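
The abstract describes two trajectories sampled concurrently and tied together by a coupling term. The toy below illustrates that idea only and is not the paper's sampler: each trajectory follows its own drift (a stand-in for one model's denoising direction) plus a term pulling it toward the other. Noise is omitted to keep the example deterministic, and all names, targets, and step sizes are assumptions.

```python
import math

def coupled_step(x_a, x_b, drift_a, drift_b, step=0.1, coupling=0.5):
    """Advance two trajectories one step; the coupling term pulls each toward the other."""
    d_a, d_b = drift_a(x_a), drift_b(x_b)
    new_a = [a + step * (da + coupling * (b - a)) for a, b, da in zip(x_a, x_b, d_a)]
    new_b = [b + step * (db + coupling * (a - b)) for a, b, db in zip(x_a, x_b, d_b)]
    return new_a, new_b

def pull_toward(target):
    """Toy drift that moves a sample toward a fixed target (hypothetical model mode)."""
    return lambda x: [t - v for t, v in zip(target, x)]

def final_gap(coupling, steps=200):
    """Distance between the two trajectories after sampling."""
    x_a, x_b = [0.0, 0.0], [0.0, 0.0]
    drift_a = pull_toward([1.0, 0.0])  # stand-in for the multi-view model
    drift_b = pull_toward([0.0, 1.0])  # stand-in for the 2D editing model
    for _ in range(steps):
        x_a, x_b = coupled_step(x_a, x_b, drift_a, drift_b, coupling=coupling)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_a, x_b)))

gap_independent = final_gap(coupling=0.0)
gap_coupled = final_gap(coupling=0.5)
```

With the coupling switched on, the two samples end up closer together than when each is drawn independently, which is the qualitative effect the coupling term is meant to have on edited views.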

Table of Contents

cs.CL [Back]

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

[4] Users as Annotators: LLM Preference Learning from Comparison Mode

[5] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

[6] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

[7] ConDABench: Interactive Evaluation of Language Models for Data Analysis

[8] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

[9] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

[10] Meronymic Ontology Extraction via Large Language Models

[11] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

[12] Serialized EHR make for good text representations

[13] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

[14] On-device System of Compositional Multi-tasking in Large Language Models

[15] Language steering in latent space to mitigate unintended code-switching

[16] Revisiting the UID Hypothesis in LLM Reasoning Traces

[17] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

[18] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

[19] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

[20] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

[21] Harnessing Consistency for Robust Test-Time LLM Ensemble

[22] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

[23] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

[24] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

[25] Unlocking the Potential of Diffusion Language Models through Template Infilling

[26] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

[27] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

[28] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

[29] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

[30] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

[31] PAGE: Prompt Augmentation for text Generation Enhancement

[32] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

[33] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

[34] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

[35] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

[36] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

[37] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

[38] Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges

[39] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

[40] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

[41] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

[42] Schema for In-Context Learning

[43] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

[44] Interpreting the Latent Structure of Operator Precedence in Language Models

[45] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

[46] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

[47] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

[48] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

[49] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

[50] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

[51] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

[52] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

[53] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

[54] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

[55] LLMs Can Get "Brain Rot"!

[56] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

[57] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

[58] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

[59] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

[60] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

[61] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

[62] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

[63] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

[64] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

[65] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

[66] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

[67] DROID: Dual Representation for Out-of-Scope Intent Detection

[68] Toward Cybersecurity-Expert Small Language Models

[69] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

[70] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

[71] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

[72] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

[73] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

[74] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

[75] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

[76] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

[77] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

[78] Qwen3Guard Technical Report