cs.CL [Back]

[1] Evaluating Long-Term Memory for Long-Context Question Answering

Alessandra Terranova,Björn Ross,Alexandra Birch

Main category: cs.CL

TL;DR: 本文提出并系统评估了多种记忆增强方法在长上下文对话任务中的有效性，使用LoCoMo基准测试，发现记忆增强方法在保持高准确率的同时可减少90%以上的令牌使用量，且记忆架构的复杂性应与模型能力相匹配。

Details

Motivation: 为了使大语言模型实现真正的对话连续性和经验学习，需要有效的记忆机制，但目前尚不清楚哪些记忆类型最适合长上下文对话任务。 Method: 通过LoCoMo这一带有问答标注的合成长上下文对话基准，系统评估了全上下文提示、基于检索增强生成的语义记忆、代理记忆、基于上下文学习的 episodic 记忆以及基于提示优化的程序记忆等方法。 Result: 记忆增强方法能减少90%以上令牌使用并保持竞争力的准确性；小规模基础模型最受益于RAG，而强大的指令调优推理模型则从包含反思的 episodic 学习和更复杂的代理语义记忆中获益更多；episodic 记忆有助于模型识别自身知识的局限。 Conclusion: 记忆架构的设计应根据模型能力进行适配，不同类型的记忆机制在效率和性能上具有互补优势，合理的组合可提升大模型在长对话中的持续学习与推理能力。 Abstract: In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning model gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.

[2] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition

Ramshankar Bhuvaneswaran,Handan Liu

Main category: cs.CL

TL;DR: 本文提出了BitSkip框架，用于系统探索量化和动态路由等技术的组合效应。出乎意料的是，一个简单的8位量化模型（BitSkip-V1）在没有Hadamard变换的情况下，性能优于更复杂的4位及Hadamard增强模型，并且接近全精度基线模型的质量，同时展现出优异的早期退出特性。

Details

Motivation: 当前对高效大语言模型的研究多关注单一技术，而复杂技术组合的相互作用尚不清楚。本文旨在系统研究量化与架构设计（如Hadamard变换）组合使用时的实际影响，揭示其潜在问题与优势。 Method: 提出BitSkip混合架构框架，通过控制变量实验比较不同量化精度（4位、8位）与是否使用Hadamard变换的组合效果，并分析训练稳定性与推理效率。 Result: 发现8位量化且无Hadamard变换的BitSkip-V1模型困惑度为1.13，优于4位及Hadamard版本，并接近全精度模型（1.19）；引入Hadamard变换导致性能下降超过37,000%；BitSkip-V1在第18层即实现32.5%的速度提升，仅损失4%质量。 Conclusion: 更复杂的量化或变换技术不一定带来更好效果，简单的8位量化方案在稳定性与效率上可能更优，挑战了当前追求极端压缩技术的趋势，强调应重视基础量化设计与训练稳定性。 Abstract: The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes the full-precision baseline in quality (perplexity of 1.13 vs 1.19) . The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, tracing fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing optimal 32.5% speed gain for minimal 4% quality loss.

[3] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Mena Attia,Aashiq Muhamed,Mai Alkhamissi,Thamar Solorio,Mona Diab

Main category: cs.CL

TL;DR: 本研究评估了大语言模型（LLM）在理解与使用文化相关隐喻表达方面的能力，特别是在阿拉伯语和英语中的习语和谚语。结果显示，LLM在阿拉伯语任务上的表现显著低于英语，且在实际语用使用和内涵理解方面存在明显困难。研究还发布了首个用于埃及阿拉伯语习语评估的数据集Kinayat。

Details

Motivation: 隐喻性语言承载丰富的文化背景和本地知识，是检验模型文化推理能力的重要指标。然而当前大语言模型在处理非主流语言和文化特定表达时可能存在偏差和局限，因此需要系统评估其跨文化语言处理能力。 Method: 以隐喻语言作为文化细微差别的代理，设计了针对上下文理解、实际语用使用和内涵解释的评估任务，涵盖埃及阿拉伯语习语、多方言阿拉伯谚语和英语谚语。对22个开源和闭源大语言模型进行了评测，并发布了一个新数据集Kinayat用于后续研究。 Result: 实验结果显示：阿拉伯语谚语平均准确率比英语低4.29%，埃及阿拉伯语习语比阿拉伯语谚语再低10.28%；在语用使用任务中准确率比理解任务下降14.07%，但提供上下文可提升10.66%的准确率；模型在内涵理解上最多仅达到85.58%的人类标注一致性（而人类间一致性为100%）。 Conclusion: 大语言模型虽能在一定程度上理解隐喻意义，但在恰当使用和深层文化内涵把握上仍存在显著不足。这表明隐喻语言是检验模型文化推理能力的有效诊断工具，未来模型训练需更加注重文化语境和语用能力的融合。 Abstract: We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.

[4] How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse

Saki Imai,Lee Kezar,Laurel Aichler,Mert Inan,Erin Walker,Alicia Wooten,Lorna Quandt,Malihe Alikhani

Main category: cs.CL

TL;DR: 本研究通过收集美国手语（ASL）STEM对话的运动捕捉数据集，比较了互动对话、独白式讲座和翻译文章中的手语差异，发现对话中的手语持续时间比孤立手语短24.6%-44.6%，并显著减少了重复提及STEM术语时的动作幅度。研究还评估了手语嵌入模型在识别STEM术语及捕捉参与者协同程度上的表现，弥合了语言学分析与计算建模之间的差距。

Details

Motivation: 现有手语模型多基于翻译或孤立词汇数据训练，忽略了自然对话中因语境和交流对象不同而产生的变异性，尤其在教育场景中师生使用新词汇时更为明显。因此需要研究自然对话中手语的动态适应特性。 Method: 采集ASL STEM对话的运动捕捉数据，对比双人互动手语、单独授课和翻译文章三种情境；利用连续运动学特征，分离出对话特有的协同效应与个体努力减少的影响，并分析STEM术语多次出现时的时空变化。 Result: 对话中的手语平均持续时间比孤立手语短24.6%-44.6%，且在重复提及STEM术语时表现出显著的动作缩减，这种缩减在独白中未见；手语嵌入模型能有效识别STEM术语并估计参与者的协同程度。 Conclusion: 自然对话中的手语具有明显的时空压缩和协同特征，这些特征在现有模型中被忽视；研究为手语技术中的语用因素建模提供了量化依据。 Abstract: Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers, and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.

[5] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection

Grace Byun,Rebecca Lipschutz,Sean T. Minton,Abigail Lott,Jinho D. Choi

Main category: cs.CL

TL;DR: 本文提出了CRADLE BENCH，一个用于多方面危机检测的基准，涵盖七种符合临床标准的危机类型，并首次引入时间标签。该基准包含临床医生标注的评估和发展示例，以及通过多语言模型集成自动标注的训练语料库，显著优于单模型标注。此外，研究还微调了六个危机检测模型，以不同共识标准训练，提供互补的检测能力。

Details

Motivation: 现有的语言模型在识别心理健康危机（如自杀意念、性侵犯等）方面存在不足，且以往研究覆盖的危机类型有限，缺乏时间信息。因此，需要一个更全面、符合临床标准的基准来提升模型对多种危机情境的检测能力，以避免严重后果。 Method: 构建了一个名为CRADLE BENCH的新基准，涵盖七种危机类型并引入时间标签；收集了600个由临床医生标注的测试样本和420个开发样本，并利用多个语言模型的多数投票集成方法自动生成约4000个训练样本；在此基础上，使用不同一致性标准（共识与一致同意）微调了六种危机检测模型。 Result: 提出的多模型集成自动标注方法显著优于单一模型标注；基于该基准训练的危机检测模型在不同一致性子集上表现出互补性能，验证了其有效性与实用性。 Conclusion: CRADLE BENCH为心理健康危机检测提供了一个更全面、高质量的评估基准，推动语言模型在现实交互中更可靠地识别多种危机情境，具有重要的临床和应用价值。 Abstract: Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user--model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.

[6] Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception

Yize Cheng,Arshia Soltani Moakhar,Chenrui Fan,Kazem Faghih,Parsa Hosseini,Wenxiao Wang,Soheil Feizi

Main category: cs.CL

TL;DR: 本文研究了大型语言模型代理在多轮对话中因缺乏时间感知而导致的工具调用决策问题，提出了TicToc-v1测试集，并发现添加时间戳仅带来有限改进，表明需要专门的后训练对齐来提升模型的时间感知能力。

Details

Motivation: 大型语言模型代理在动态环境中缺乏对消息间时间流逝的感知，导致工具调用决策不当，影响其实际应用效果。 Method: 构建包含34种时间敏感场景的TicToc-v1测试集，在对话中显式添加时间戳以提供时间信息，并收集人类在不同时间间隔下的工具调用偏好进行对比评估。 Result: 无时间信息时模型表现略优于随机水平（最高约60%），加入时间戳后略有提升（最高约65%），但提示工程等简单方法效果有限。 Conclusion: 当前LLM代理在时间感知方面存在明显不足，需通过专门的后训练对齐策略来改善其多轮交互中的时间敏感决策能力。 Abstract: Large language model agents are increasingly used in multi-turn conversational settings to interact with and execute tasks in dynamic environments. However, a key limitation is their temporal blindness: they, by default, operate with a stationary context, failing to account for the real-world time elapsed between messages. This becomes a critical liability when an agent must decide whether to invoke a tool based on how much time has passed since the last observation. Without temporal awareness, agents often either over-rely on previous context (skipping necessary tool calls), or under-rely on it (unnecessarily repeating tool calls). To study this challenge, we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across 34 scenarios with varying time sensitivity. Each trajectory ends with a user question, where the need for a tool call depends on the amount of time elapsed since the last message. To give LLMs temporal context, we augment dialogue messages with explicit timestamps, bridging the gap between static dialogue and evolving environments. We then collected human preferences for these samples, creating two subsets: one where humans preferred relying on the previous observation (prefer-noTool), and another where they preferred a new tool call (prefer-Tool). We evaluated how well LLM tool-calling decisions align with human preferences under varying time intervals on TicToc-v1. Our analysis show that without time information, most models perform only slightly better than random, with the top alignment rate being just over 60%. While adding timestamps leads to a slight improvement, particularly for larger models, the improvement is modest, peaking at around 65%. We also show that naive, prompt-based alignment have limited effectiveness. Our findings highlight the need for specific post-training alignment to align multi-turn LLM tool use with human temporal perception.

[7] Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

Jyotika Singh,Weiyi Sun,Amit Agarwal,Viji Krishnamurthy,Yassine Benajiba,Sujith Ravi,Dan Roth

Main category: cs.CL

TL;DR: 本文提出了一种新的评估方法Combo-Eval，用于评估大语言模型生成的自然语言表示（NLR），并发布了首个专门用于NLR基准测试的数据集NLR-BIRD。

Details

Motivation: 现有的大语言模型在将表格数据结果转换为自然语言时存在信息丢失或错误的问题，且缺乏有效的评估方法。 Method: 结合多种现有方法的优点，提出Combo-Eval评估方法，并构建了NLR-BIRD数据集用于基准测试。 Result: Combo-Eval在有人工评估的情况下表现出与人类判断更高的一致性，适用于有无参考答案的各种场景，并将大语言模型调用次数减少了25-61%。 Conclusion: Combo-Eval是一种高效且高保真的NLR评估方法，NLR-BIRD为未来NLR研究提供了重要资源。 Abstract: In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored. This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.

[8] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning

Marianne Menglin Liu,Sai Ashish Somayajula,Syed Fahad Allam Shah,Sujith Ravi,Dan Roth

Main category: cs.CL

TL;DR: OraPlan-SQL在Archer NL2SQL 2025挑战赛中排名第一，提出了一种基于代理的框架，包含 Planner 和 SQL 两个智能体，通过反馈引导的元提示策略优化单个规划器，并结合实体链接和计划多样化提升多语言场景下的执行准确率和鲁棒性。

Details

Motivation: 解决现有NL2SQL系统在复杂推理（如算术、常识、假设推理）和多语言场景下面临的规划协调开销大、实体不匹配、泛化能力弱等问题。 Method: 采用两阶段代理框架：Planner生成自然语言计划，SQL Agent将其转化为SQL；引入反馈引导的元提示策略，基于失败案例聚类结果优化规划器提示；加入实体链接指南以处理中英文实体表面形式差异；通过计划多样化和多数投票提升输出稳定性。 Result: 在Archer NL2SQL 2025挑战赛中取得第一，执行准确率（EX）达英文55.0%、中文56.7%，超过第二名6%以上，SQL有效性（VA）保持99%以上。 Conclusion: OraPlan-SQL通过简化多代理架构、引入可扩展的提示优化机制和增强多语言对齐，在保持高SQL有效性的前提下显著提升了复杂推理和跨语言场景下的性能。 Abstract: We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: Planner agent that generates stepwise natural language plans, and SQL agent that converts these plans into executable SQL. Since SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner's system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.

[9] Language Models for Longitudinal Clinical Prediction

Tananun Songdechakraiwut,Michael Lutz

Main category: cs.CL

TL;DR: 提出了一种轻量级框架，用于将冻结的大型语言模型适应于纵向临床数据分析，无需微调即可实现准确预测。

Details

Motivation: 为了在缺乏大量训练数据的情况下，有效利用大型语言模型进行临床时间序列预测，特别是在早期阿尔茨海默病监测中。 Method: 通过在语言模型空间内整合患者历史和上下文信息，冻结大模型参数，仅调整输入表示以适应任务，从而实现零样本或小样本预测。 Result: 在神经心理学评估数据上实现了准确且可靠的预测性能，即使训练样本很少也表现良好。 Conclusion: 该方法为将大语言模型应用于低资源医疗场景提供了可行方案，具有临床应用潜力。 Abstract: We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer's monitoring.

[10] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

Kosei Uemura,Miaoran Zhang,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: 本文提出了AfriMTEB，一个涵盖59种非洲语言、14项任务和38个数据集的多语言文本嵌入评测基准，并发布了针对非洲语言优化的AfriE5嵌入模型，在多项任务上达到SOTA性能。

Details

Motivation: 非洲语言在现有的多语言文本嵌入评测中代表性不足，许多任务源自翻译基准，缺乏对本地语言特有任务的支持。 Method: 构建AfriMTEB评测集，包含6个新数据集和新增仇恨言论检测、意图识别、情感分类等任务；通过跨语言对比蒸馏方法改进mE5模型，提出AfriE5。 Result: AfriE5在AfriMTEB上显著优于Gemini-Embeddings和mE5等基线模型，取得当前最优性能。 Conclusion: AfriMTEB填补了非洲语言文本嵌入评测的空白，AfriE5为非洲语言提供了更优的嵌入方案，推动了低资源语言在NLP中的发展。 Abstract: Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB -- a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.

[11] Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation

Kaveh Eskandari Miandoab,Mahammed Kamruzzaman,Arshia Gharooni,Gene Louis Kim,Vasanth Sarathy,Ninareh Mehrabi

Main category: cs.CL

TL;DR: 本文提出了一种新的通用增强框架，用于评估大型语言模型在公平性方面的脆弱性，发现现有模型在输入扰动下更容易表现出刻板偏见，尤其对研究较少的群体更为明显。

Details

Motivation: 由于训练数据的歧视性，大型语言模型存在刻板偏见问题，现有的去偏方法较为脆弱，亟需更鲁棒的评估手段和更广泛的公平性研究。 Method: 提出一个包含三个可插拔步骤的通用增强框架，并应用于BBQ等公平性评测基准，通过输入扰动分析模型的偏见行为。 Result: 实验表明，包括最先进的开源和闭源模型在内的LLMs在输入扰动下更倾向于表现出刻板印象，且对文献中研究较少的群体偏见更严重。 Conclusion: 当前的公平性对齐方法仍不稳固，需扩展公平性和安全性研究以涵盖更多样化的群体。 Abstract: Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.

[12] Agent-based Automated Claim Matching with Instruction-following LLMs

Dina Pisarevskaya,Arkaitz Zubiaga

Main category: cs.CL

TL;DR: 提出了一种基于代理的自动化声明匹配方法，使用指令跟随的大型语言模型（LLMs），通过两步流程生成提示并进行二分类任务，展示了LLM生成提示优于人工提示，小模型在生成中表现不逊于大模型，并揭示了LLM对声明匹配的理解。

Details

Motivation: 提高声明匹配任务的自动化水平，减少对人工设计提示的依赖，并探索不同规模和类型LLM在该任务中的协同潜力。 Method: 采用两步管道：首先用LLM生成提示，然后将声明匹配作为二分类任务由LLM执行；同时探索不同LLM在提示生成和匹配阶段的组合效果。 Result: LLM生成的提示优于人工提示的SOTA结果；较小的LLM在提示生成中表现与大模型相当；使用不同LLM分工可提升效率和性能。 Conclusion: 该代理式两步框架有效提升了声明匹配性能，降低了计算成本，并揭示了LLM在理解任务提示方面的潜力。 Abstract: We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform SOTA with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, allowing to save computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using an LLM for prompt generation, and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.

[13] Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs

Soham Satyadharma,Fatemeh Sheikholeslami,Swati Kaul,Aziz Umit Batur,Suleiman A. Khan

Main category: cs.CL

TL;DR: 提出了一种无需训练的级联自动提示方法，用于大规模评估电商产品属性质量，显著提升精度和召回率，同时将领域专家工作量减少99%。

Details

Motivation: 在复杂的工业产品目录中，通用语言模型难以直接满足特定领域的质量评估需求，需要一种可扩展的方法来桥接通用语言理解与领域知识。 Method: 基于少量人工编写的初始提示，通过级联方式自动生成和优化针对不同产品类别-属性对的提示指令，无需训练或微调模型。 Result: 相比传统思维链提示，该方法在精度和召回率上提升了8-10%，并将每个属性的专家耗时从5.1小时降至3分钟，跨五种语言和多种任务均表现出良好泛化能力。 Conclusion: 该训练免费的级联自动提示框架能高效、可扩展地适应大规模电商目录的质量评估需求，显著降低人工成本并提升性能。 Abstract: We introduce a novel, training free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluations shows the auto-prompt cascade improves precision and recall by $8-10\%$ over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute - a $99\%$ reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.

[14] Leveraging LLMs for Early Alzheimer's Prediction

Tananun Songdechakraiwut

Main category: cs.CL

TL;DR: 提出一种基于连接组信息的LLM框架，通过将动态fMRI连接性数据编码为时间序列并映射到冻结的预训练LLM中，实现对早期阿尔茨海默病的高敏感性预测。

Details

Motivation: 利用大脑连接组动态特征提升神经退行性疾病的早期检测能力，克服传统方法在敏感性和临床适用性方面的局限。 Method: 将动态fMRI连接性数据编码为时间序列，进行鲁棒归一化处理，并将其映射为适合冻结预训练大语言模型（LLM）输入的表示形式，用于临床预测。 Result: 在早期阿尔茨海默病检测中，该方法的错误率显著低于临床可接受范围，表现出高敏感性预测性能。 Conclusion: 该连接组驱动的LLM框架为早期阿尔茨海默病的精准预测提供了新途径，具有推动及时临床干预的潜力。 Abstract: We present a connectome-informed LLM framework that encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps these data into a representation suitable for a frozen pre-trained LLM for clinical prediction. Applied to early Alzheimer's detection, our method achieves sensitive prediction with error rates well below clinically recognized margins, with implications for timely Alzheimer's intervention.

[15] Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs

Kyomin Hwang,Hyeonjin Kim,Seungyeon Kim,Sunghyun Wee,Nojun Kwak

Main category: cs.CL

TL;DR: 本文研究了在多语言大模型中使用仅英语数据进行知识遗忘的不足，并从评估角度揭示了一个新的盲点：当模型在并行多语言数据集上完全微调后，会出现语言混淆现象，导致基于参考的评估指标失效。为此，作者提出了N-Mix评分来量化语言混淆，并倡导采用语义-based的新评估方式。

Details

Motivation: 现有研究多关注性能表现，缺乏对多语言大模型在遗忘过程中语言混淆现象的深入评估分析。 Method: 提出N-gram-based Language-Mix（N-Mix）评分以量化语言混淆；分析参考基指标在高N-Mix情况下的误判问题；提倡使用语义-based评估指标。 Result: 实验证明语言混淆普遍存在且影响显著，导致传统参考基指标产生大量假阴性结果。N-Mix评分能有效反映该问题。 Conclusion: 需要发展不依赖参考文本、直接评估生成内容语义的新类型遗忘评估指标，以应对多语言场景中的语言混淆挑战。 Abstract: There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce N-gram-based Language-Mix (N-Mix) score to quantitatively show the language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when N-Mix score is high, and(3) suggest the need of new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metrics as semantic-based metric.

[16] M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

Mengzhou Sun,Sendong Zhao,Jianyu Chen,Haochun Wang,Bin Qin

Main category: cs.CL

TL;DR: 提出M-Eval方法，利用循证医学中的异质性分析评估RAG系统生成的医疗回答的准确性和证据可靠性，显著提升准确性（最高23.31%），减少幻觉和诊断错误。

Details

Motivation: 当前检索增强生成（RAG）在医疗问答中存在生成错误信息（如幻觉）和无法正确使用外部知识的问题，影响了大语言模型（LLM）在医疗场景中的可靠性。 Method: 受循证医学中异质性分析启发，提出M-Eval方法：从外部知识库提取额外医学文献，结合RAG系统检索到的证据文档，通过异质性分析判断证据是否支持回答中的不同观点，从而验证回答的准确性和证据的可靠性。 Result: M-Eval方法在多种大语言模型上实现了最高达23.31%的准确率提升，能够有效检测RAG系统中的事实性错误，并评估其证据的可靠性。 Conclusion: M-Eval有助于发现现有基于RAG的医疗系统的错误，提高大语言模型在医疗应用中的可靠性，减少诊断错误。 Abstract: Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.

[17] PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine

Mengzhou Sun,Sendong Zhao,Jianyu Chen,Bin Qin

Main category: cs.CL

TL;DR: 本文提出了一种基于PICO格式的检索增强生成方法（PICOs-RAG），用于改进循证医学中的复杂查询处理，显著提升了检索效率和相关性。

Details

Motivation: 现有检索增强生成方法在处理临床场景中的复杂、信息不全或语言模糊的查询时效果不佳，导致检索结果不相关，影响医学决策的准确性和效率。 Method: 通过将用户查询扩展并规范化为符合循证医学PICO格式的专业查询，提升检索阶段的信息匹配精度，并结合RAG框架自动生成响应。 Result: 相比基线方法，PICOs-RAG在评估中实现了最高达8.8%的性能提升，显著增强了检索的相关性和生成答案的质量。 Conclusion: PICOs-RAG有效提升了大语言模型在循证医学中的实用性，使其成为更可靠、高效的医疗辅助工具。 Abstract: Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs from physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by human querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present the PICOs-RAG to expand the user queries into a better format. Our method can expand and normalize the queries into professional ones and use the PICO format, a search strategy tool present in EBM, to extract the most important information used for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8\% improvement compared to the baseline evaluated by our method. Thereby the PICOs-RAG improves the performance of the large language models into a helpful and reliable medical assistant in EBM.

[18] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine

Mengzhou Sun,Sendong Zhao,Jianyu Chen,Haochun Wang,Bin Qin

Main category: cs.CL

TL;DR: 提出一种基于循证医学（EBM）元分析思想的重排序与过滤方法，提升RAG在医疗证据检索中的质量，实验显示诊断准确率最高提升11.4%。

Details

Motivation: 现有RAG技术在循证医学应用中难以有效区分高质量证据，而EBM对证据质量要求严格，因此需要更可靠的证据筛选机制。 Method: 借鉴EBM中的元分析方法，结合可靠性分析、异质性分析和外推分析，设计多原则证据过滤与重排序机制，以优化LLMs使用的医疗证据质量。 Result: 在PubMed数据集上验证了该方法的有效性，显著提升了检索到的证据质量，实验结果显示诊断准确率最高提升11.4%。 Conclusion: 该方法能有效增强RAG系统在循证医学场景下的证据选择能力，减少错误知识注入，提高模型响应的可靠性与准确性。 Abstract: Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language models (LLMs) techniques like RAG for EBM tasks. However, the EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to filter the best evidence for LLMs to diagnose. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow the users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments and results. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.

[19] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Yizhu Jiao,Sha Li,Sizhe Zhou,Heng Ji,Jiawei Han

Main category: cs.CL

TL;DR: 本文提出了一个新的信息抽取任务TEXT2DB，旨在将信息抽取结果与目标数据库紧密结合，通过用户指令、文档集和数据库的输入，动态更新数据库以满足需求。为此，作者还提出了OPAL框架，利用观察-规划-分析的LLM代理机制来适应不同数据库模式并调用信息抽取模型。实验表明该方法有效，但也存在处理复杂依赖大数据库和抽取幻觉等挑战。

Details

Motivation: 传统信息抽取（IE）的结果常因与下游应用所需的本体不匹配而难以直接使用，因此需要一种能够根据具体数据库结构和用户需求动态调整抽取过程的新方法。 Method: 提出TEXT2DB任务，强调IE输出与目标数据库的集成；设计OPAL框架，包含Observer（与数据库交互）、Planner（生成调用IE模型的代码计划）和Analyzer（执行前评估代码质量）三个组件，实现对多样化数据库模式的自适应。 Result: 在包含数据补全、行填充和列添加等常见需求的新基准上验证了OPAL的有效性，结果显示其能生成不同的代码计划并成功调用所需IE模型完成数据库更新任务，但在处理大规模复杂依赖数据库和抽取幻觉方面仍面临挑战。 Conclusion: TEXT2DB为信息抽取提供了更贴近实际应用的新范式，OPAL展示了LLM代理在动态整合文本信息到数据库中的潜力，未来需进一步研究复杂场景下的鲁棒性和准确性。 Abstract: The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE TEXT2DB that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-PlanAnalyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code-based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB

[20] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Hao An,Yang Xu

Main category: cs.CL

TL;DR: 提出一种基于细粒度语义置信度奖励（FiSCoRe）的强化学习框架，通过样本特定的置信度引导大语言模型在高置信度簇中保留答案，在低置信度簇中放弃回答，从而提升模型在领域内和分布外基准中的可靠性。

Details

Motivation: 现有方法依赖粗粒度信号（如整体置信度或不确定性评分）来指导大语言模型拒绝回答，难以精确识别知识边界，导致幻觉问题依然存在。 Method: 提出FiSCoRe框架：对多个候选答案进行语义聚类，利用细粒度的样本特定置信度作为奖励信号，通过强化学习训练模型区分高/低置信度簇并选择性输出；同时设计新的评估指标衡量拒绝回答的可靠性。 Result: 在多个领域内和分布外基准上显著提升了模型拒绝回答的准确性和可靠性，优于现有细调方法。 Conclusion: 细粒度语义置信度信号能更有效地训练大语言模型识别自身知识边界，实现更精准的后验拒绝机制，有助于缓解幻觉问题。 Abstract: Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on $\textbf{\underline{Fi}ne-grained \underline{S}emantic \underline{Co}nfidence \underline{Re}ward (\Ours)}$, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.

[21] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs

Haiduo Huang,Jiangcheng Song,Yadong Zhang,Pengju Ren

Main category: cs.CL

TL;DR: 提出了一种名为Speculative Knowledge Distillation (SpecKD)的新框架，通过动态的token级门控机制选择性地应用蒸馏损失，仅在教师模型确认的“接受”token上进行学习，从而提升学生模型性能。

Details

Motivation: 传统知识蒸馏对所有token均匀应用损失，忽略了教师模型预测的置信度差异，导致学生模型学习到高熵、不确定的预测，影响性能。尤其当教师模型远大于学生模型时，这一问题更为严重。 Method: 受推测解码中“提出-验证”范式的启发，SpecKD在每一步由学生模型先生成token提议，再与教师模型的分布进行比对，仅对被教师“接受”的token计算蒸馏损失，而“拒绝”的token则被屏蔽，实现选择性知识迁移。 Result: 在多种文本生成任务上的实验表明，SpecKD显著优于强基线方法，训练更稳定，学生模型能力更强，并实现了最先进的性能。 Conclusion: SpecKD通过引入基于教师置信度的动态门控机制，有效减少了知识蒸馏中的噪声学习，是一种即插即用、高效的知识蒸馏新范式。 Abstract: Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance-especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.

[22] Success and Cost Elicit Convention Formation for Efficient Communication

Saujas Vaduguru,Yilun Hua,Yoav Artzi,Daniel Fried

Main category: cs.CL

TL;DR: 提出一种训练大型多模态模型形成语言惯例的方法，通过模拟指代游戏实现与人类的高效沟通，显著缩短消息长度并提高交互成功率。

Details

Motivation: 人类在交流中依赖共享语境来提升沟通效率，形成临时语言惯例以使用更简短、低成本但能被理解的表达。研究旨在让模型也能形成类似的语言惯例，实现高效沟通。 Method: 采用模拟的指代游戏在模型间进行训练，无需额外人工数据，在涉及照片和图形图像的重复游戏中训练模型形成语言惯例。 Result: 该方法使模型与人类通信时消息长度减少最多41%，交互成功率提高15%，且人类回应速度更快；仅基于成功或成本的训练不足以促成惯例形成，二者缺一不可。 Conclusion: 结合成功与成本目标的模拟指代游戏可有效训练模型形成语言惯例，显著提升人机沟通效率。 Abstract: Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.

[23] Pie: A Programmable Serving System for Emerging LLM Applications

In Gim,Zhiyao Ma,Seung-seob Lee,Lin Zhong

Main category: cs.CL

TL;DR: 本文提出了Pie，一种可编程的大型语言模型（LLM）服务系统，通过将生成循环分解为细粒度的服务处理器并引入用户自定义的inferlets程序，实现了灵活性和高效性。

Details

Motivation: 现有的LLM服务系统基于单一的token生成循环，难以支持多样化的推理策略和代理工作流，限制了应用的灵活性和效率。 Method: Pie将传统的生成循环分解为可通过API访问的细粒度服务处理器，并将生成过程的控制权交给用户提供的程序（称为inferlets），使用WebAssembly执行这些程序以确保轻量级沙箱安全。 Result: 实验表明，Pie在标准任务上性能接近最先进的系统（延迟开销3-12%），而在代理工作流中通过应用特定优化显著提升了延迟和吞吐量（提高1.3x-3.4x）。 Conclusion: Pie通过提供灵活的编程接口和高效的执行机制，有效支持复杂的LLM应用需求，特别是在需要定制化生成逻辑和高效资源管理的场景下表现出色。 Abstract: Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O-entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.

[24] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Xinwei Wu,Heng Liu,Jiang Zhou,Xiaohu Zhao,Linlong Xu,Longyue Wang,Weihua Luo,Kaifu Zhang

Main category: cs.CL

TL;DR: 本文提出了一种诊断多语言大模型翻译幻觉的新框架和基准测试集HalloMTBench，揭示了影响幻觉的多种因素。

Details

Motivation: 现有的机器翻译基准无法有效暴露多语言大模型中的幻觉问题，因此需要一个专门的诊断工具来识别和分类这些错误。 Method: 提出了一个包含指令脱离和源脱离的分类体系，并基于此构建了覆盖11个英外翻译方向的人工验证基准HalloMTBench；使用前沿大模型生成候选翻译，并通过多个大模型评判器与专家验证确保数据质量。 Result: 构建了5,435个高质量测试实例，评估了17个大模型，发现了与模型规模、输入长度敏感性、语言偏见及强化学习导致的语言混杂相关的独特幻觉触发模式。 Conclusion: HalloMTBench为诊断多语言大模型的翻译幻觉提供了有效的测试平台，有助于未来对模型可靠性的研究。 Abstract: Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinize these candidates with an ensemble of LLM judges, and expert validation. In this way, we curate 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct ``hallucination triggers'' -- unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available in https://huggingface.co/collections/AIDC-AI/marco-mt.

[25] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Tyler A. Chang,Catherine Arnett,Abdelrahman Eldesokey,Abdelrahman Sadallah,Abeer Kashar,Abolade Daud,Abosede Grace Olanihun,Adamu Labaran Mohammed,Adeyemi Praise,Adhikarinayum Meerajita Sharma,Aditi Gupta,Afitab Iyigun,Afonso Simplício,Ahmed Essouaied,Aicha Chorana,Akhil Eppa,Akintunde Oladipo,Akshay Ramesh,Aleksei Dorkin,Alfred Malengo Kondoro,Alham Fikri Aji,Ali Eren Çetintaş,Allan Hanbury,Alou Dembele,Alp Niksarli,Álvaro Arroyo,Amin Bajand,Amol Khanna,Ana Chkhaidze,Ana Condez,Andiswa Mkhonto,Andrew Hoblitzell,Andrew Tran,Angelos Poulis,Anirban Majumder,Anna Vacalopoulou,Annette Kuuipolani Kanahele Wong,Annika Simonsen,Anton Kovalev,Ashvanth. S,Ayodeji Joseph Lana,Barkin Kinay,Bashar Alhafni,Benedict Cibalinda Busole,Bernard Ghanem,Bharti Nathani,Biljana Stojanovska Đurić,Bola Agbonile,Bragi Bergsson,Bruce Torres Fischer,Burak Tutar,Burcu Alakuş Çınar,Cade J. Kanoniakapueo Kane,Can Udomcharoenchaikit,Catherine Arnett,Chadi Helwe,Chaithra Reddy Nerella,Chen Cecilia Liu,Chiamaka Glory Nwokolo,Cristina España-Bonet,Cynthia Amol,DaeYeop Lee,Dana Arad,Daniil Dzenhaliou,Daria Pugacheva,Dasol Choi,Daud Abolade,David Liu,David Semedo,Deborah Popoola,Deividas Mataciunas,Delphine Nyaboke,Dhyuthy Krishna Kumar,Diogo Glória-Silva,Diogo Tavares,Divyanshu Goyal,DongGeon Lee,Ebele Nwamaka Anajemba,Egonu Ngozi Grace,Elena Mickel,Elena Tutubalina,Elias Herranen,Emile Anand,Emmanuel Habumuremyi,Emuobonuvie Maria Ajiboye,Eryawan Presma Yulianrifat,Esther Adenuga,Ewa Rudnicka,Faith Olabisi Itiola,Faran Taimoor Butt,Fathima Thekkekara,Fatima Haouari,Filbert Aurelian Tjiaranata,Firas Laakom,Francesca Grasso,Francesco Orabona,Francesco Periti,Gbenga Kayode Solomon,Gia Nghia Ngo,Gloria Udhehdhe-oze,Gonçalo Martins,Gopi Naga Sai Ram Challagolla,Guijin Son,Gulnaz Abdykadyrova,Hafsteinn Einarsson,Hai Hu,Hamidreza Saffari,Hamza Zaidi,Haopeng Zhang,Harethah Abu Shairah,Harry Vuong,Hele-Andra Kuulmets,Houda Bouamor,Hwanjo Yu,Iben Nyholm Debess,İbrahim Ethem Deveci,Ikhlasul Akmal Hanif,Ikhyun Cho,Inês Calvo,Inês Vieira,Isaac Manzi,Ismail Daud,Itay Itzhak,Iuliia,Alekseenko,Ivan Belashkin,Ivan Spada,Ivan Zhelyazkov,Jacob Brinton,Jafar Isbarov,Jaka Čibej,Jan Čuhel,Jan Kocoń,Jauza Akbar Krito,Jebish Purbey,Jennifer Mickel,Jennifer Za,Jenny Kunz,Jihae Jeong,Jimena Tena Dávalos,Jinu Lee,João Magalhães,John Yi,Jongin Kim,Joseph Chataignon,Joseph Marvin Imperial,Jubeerathan Thevakumar,Judith Land,Junchen Jiang,Jungwhan Kim,Kairit Sirts,Kamesh R,Kamesh V,Kanda Patrick Tshinu,Kätriin Kukk,Kaustubh Ponkshe,Kavsar Huseynova,Ke He,Kelly Buchanan,Kengatharaiyer Sarveswaran,Kerem Zaman,Khalil Mrini,Kian Kyars,Krister Kruusmaa,Kusum Chouhan,Lainitha Krishnakumar,Laura Castro Sánchez,Laura Porrino Moscoso,Leshem Choshen,Levent Sencan,Lilja Øvrelid,Lisa Alazraki,Lovina Ehimen-Ugbede,Luheerathan Thevakumar,Luxshan Thavarasa,Mahnoor Malik,Mamadou K. Keita,Mansi Jangid,Marco De Santis,Marcos García,Marek Suppa,Mariam D'Ciofalo,Marii Ojastu,Maryam Sikander,Mausami Narayan,Maximos Skandalis,Mehak Mehak,Mehmet İlteriş Bozkurt,Melaku Bayu Workie,Menan Velayuthan,Michael Leventhal,Michał Marcińczuk,Mirna Potočnjak,Mohammadamin Shafiei,Mridul Sharma,Mrityunjaya Indoria,Muhammad Ravi Shulthan Habibi,Murat Kolić,Nada Galant,Naphat Permpredanun,Narada Maugin,Nicholas Kluge Corrêa,Nikola Ljubešić,Nirmal Thomas,Nisansa de Silva,Nisheeth Joshi,Nitish Ponkshe,Nizar Habash,Nneoma C. Udeze,Noel Thomas,Noémi Ligeti-Nagy,Nouhoum Coulibaly,Nsengiyumva Faustin,Odunayo Kareemat Buliaminu,Odunayo Ogundepo,Oghojafor Godswill Fejiro,Ogundipe Blessing Funmilola,Okechukwu God'spraise,Olanrewaju Samuel,Olaoye Deborah Oluwaseun,Olasoji Akindejoye,Olga Popova,Olga Snissarenko,Onyinye Anulika Chiemezie,Orkun Kinay,Osman Tursun,Owoeye Tobiloba Moses,Oyelade Oluwafemi Joshua,Oyesanmi Fiyinfoluwa,Pablo Gamallo,Pablo Rodríguez Fernández,Palak Arora,Pedro Valente,Peter Rupnik,Philip Oghenesuowho Ekiugbo,Pramit Sahoo,Prokopis Prokopidis,Pua Niau-Puhipau,Quadri Yahya,Rachele Mignone,Raghav Singhal,Ram Mohan Rao Kadiyala,Raphael Merx,Rapheal Afolayan,Ratnavel Rajalakshmi,Rishav Ghosh,Romina Oji,Ron Kekeha Solis,Rui Guerra,Rushikesh Zawar,Sa'ad Nasir Bashir,Saeed Alzaabi,Sahil Sandeep,Sai Pavan Batchu,SaiSandeep Kantareddy,Salsabila Zahirah Pranida,Sam Buchanan,Samuel Rutunda,Sander Land,Sarah Sulollari,Sardar Ali,Saroj Sapkota,Saulius Tautvaisas,Sayambhu Sen,Sayantani Banerjee,Sebastien Diarra,SenthilNathan. M,Sewoong Lee,Shaan Shah,Shankar Venkitachalam,Sharifa Djurabaeva,Sharon Ibejih,Shivanya Shomir Dutta,Siddhant Gupta,Silvia Paniagua Suárez,Sina Ahmadi,Sivasuthan Sukumar,Siyuan Song,Snegha A.,Sokratis Sofianopoulos,Sona Elza Simon,Sonja Benčina,Sophie Gvasalia,Sphurti Kirit More,Spyros Dragazis,Stephan P. Kaufhold,Suba. S,Sultan AlRashed,Surangika Ranathunga,Taiga Someya,Taja Kuzman Pungeršek,Tal Haklay,Tasi'u Jibril,Tatsuya Aoyama,Tea Abashidze,Terenz Jomar Dela Cruz,Terra Blevins,Themistoklis Nikas,Theresa Dora Idoko,Thu Mai Do,Tilek Chubakov,Tommaso Gargiani,Uma Rathore,Uni Johannesen,Uwuma Doris Ugwu,Vallerie Alexandra Putra,Vanya Bannihatti Kumar,Varsha Jeyarajalingam,Varvara Arzt,Vasudevan Nedumpozhimana,Viktoria Ondrejova,Viktoryia Horbik,Vishnu Vardhan Reddy Kummitha,Vuk Dinić,Walelign Tewabe Sewunetie,Winston Wu,Xiaojing Zhao,Yacouba Diarra,Yaniv Nikankin,Yash Mathur,Yixi Chen,Yiyuan Li,Yolanda Xavier,Yonatan Belinkov,Yusuf Ismail Abayomi,Zaid Alyafeai,Zhengyang Shan,Zhi Rui Tam,Zilu Tang,Zuzana Nadova,Baber Abbasi,Stella Biderman,David Stap,Duygu Ataman,Fabian Schmidt,Hila Gonen,Jiayi Wang,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: 本文提出了Global PIQA，一个涵盖100多种语言和文化的参与式常识推理基准，由来自65个国家的335名研究人员手工构建。该基准突显了大语言模型在低资源语言中的性能差距，并揭示了日常文化知识建模的不足。

Details

Motivation: 现有的大语言模型评估基准缺乏对多语言和多文化的覆盖，尤其是针对特定文化常识的评估工具几乎空白。因此，需要一个真正全球化、反映多样文化的常识推理基准来更全面地评估模型表现。 Method: 通过全球335名研究人员协作，手工构建包含116种语言变体的常识推理数据集Global PIQA，覆盖五大洲、14个语系和23种文字系统。数据集分为非平行结构，其中超过50%的样本包含本地食物、习俗或文化特有元素，并在多种开源与闭源大语言模型上进行评估测试。 Result: 最先进的大语言模型在整体Global PIQA上表现尚可，但在低资源语言中准确率差距高达37%（随机猜测为50%），且开源模型普遍不如闭源模型。结果显示模型在跨文化常识理解方面仍存在显著缺陷。 Conclusion: Global PIQA填补了多语言多文化常识评估的空白，揭示了当前大语言模型在低资源语言和文化特有知识上的局限性，强调需加强对日常文化常识的建模，而不仅是复杂推理或专家知识。 Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

[26] RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects

Md. Rezuwan Hassan,Azmol Hossain,Kanij Fatema,Rubayet Sabbir Faruque,Tanmoy Shome,Ruwad Naswan,Trina Chakraborty,Md. Foriduzzaman Zihad,Tawsif Tashwar Dipto,Nazia Tasnim,Nazmuddoha Ansary,Md. Mehedi Hasan Shawon,Ahmed Imtiaz Humayun,Md. Golam Rabiul Alam,Farig Sadeque,Asif Sushmit

Main category: cs.CL

TL;DR: 本研究探讨了孟加拉语方言的语音和形态特征，旨在构建针对地区性变体的自动语音识别（ASR）系统，促进语言技术的包容性发展。

Details

Motivation: 孟加拉语具有显著的方言多样性，但在计算处理方面的系统性研究有限，亟需推动对方言的识别与保护。 Method: 通过分析语音和形态特征，探索为不同孟加拉语方言构建专用自动语音识别（ASR）模型的可行性，并公开发布所创建的数据集。 Result: 成功记录并分析了主要方言的语音与形态特性，验证了构建区域性ASR系统的可行性。 Conclusion: 该研究有助于保护孟加拉语方言多样性，并推动面向孟加拉语社区的包容性数字工具发展。 Abstract: The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models particularly Automatic Speech Recognition (ASR) systems tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.

[27] Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks

Yihan Wang,Peiyu Liu,Runyu Chen,Jiaxing Pu,Wei Xu

Main category: cs.CL

TL;DR: Squrve是一个统一、模块化且广泛的Text-to-SQL框架，通过标准化执行范式和多参与者协作机制，有效整合研究进展与实际应用，显著提升复杂真实查询的处理能力。

Details

Motivation: 尽管Text-to-SQL技术在学术上取得了显著进展，但在实际系统中部署仍面临集成工具不足的挑战，缺乏统一框架来整合不同方法并支持实际应用。 Method: 提出Squrve框架，首先建立统一的执行范式以标准化调用接口，然后设计基于七个抽象原子组件的多参与者协作机制，实现不同技术的模块化集成与协同工作。 Result: 在多个广泛使用的基准测试上实验表明，Squrve中的协作工作流始终优于原始独立方法，显著提升了复杂查询的处理性能。 Conclusion: Squrve为Text-to-SQL技术提供了有效的统一框架，弥合了学术研究与实际应用之间的鸿沟，展示了通过模块化协作提升系统性能的新路径。 Abstract: Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. Despite these advances, we introduce Squrve, a unified, modular, and extensive Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new effective avenue for tackling complex real-world queries. The codes are available at https://github.com/Satissss/Squrve.

[28] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Vivek Kalyan,Martin Andrews

Main category: cs.CL

TL;DR: 本文提出通过强化学习（RL）训练大型语言模型代理，使其在法律文档检索任务中显著超越前沿模型，并表明多轮交互能提升性能。

Details

Motivation: 现有的基于提示的LLM代理虽然表现良好，但在复杂任务中仍有提升空间，因此探索通过强化学习从经验中学习以进一步提升其能力。 Method: 采用强化学习方法训练一个140亿参数的语言模型，并在法律文档搜索基准上进行实验，同时研究训练和测试阶段的不同回合限制对性能的影响。 Result: RL训练的模型在准确率上达到85%，超过基线模型的78%；并且允许更多交互回合时性能更优。 Conclusion: 强化学习能有效提升LLM代理在复杂任务中的表现，多轮推理机制对提高任务成功率至关重要。 Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.

[29] Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Chanwoo Park,Suyoung Park,Yelim Ahn,Jongmin Kim,Jongyeon Park,Jaejin Lee

Main category: cs.CL

TL;DR: 本文提出了两种模式感知的行级去重和标点过滤方法（PLD和PTF），通过考虑文本在文档中的序列分布，改进传统过滤技术，保留重要结构信息，在英文和韩文的小型语言模型训练中显著提升了多项任务性能。

Details

Motivation: 传统行级过滤方法可能误删有价值内容，影响下游任务性能，因此需要更精细的过滤策略以保留重要结构信息。 Method: 提出模式感知的行级去重（PLD）和模式感知的尾部标点过滤（PTF），结合行级信号与文本在文档中的序列分布进行过滤。 Result: 在英语和韩语的小型语言模型（1B参数）上验证，新方法在多项选择题基准上持续提升性能，并显著提高SQuAD v1和KorQuAD v1上的生成式问答准确率。 Conclusion: PLD和PTF能有效保留关键结构内容，优于传统过滤方法，适用于多语言场景下的数据预处理优化。 Abstract: While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods-pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF)-by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1 B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.

[30] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

Chanwoo Park,Suyoung Park,JiA Kang,Jongyeon Park,Sangho Kim,Hyunji M. Park,Sumin Bae,Mingyu Kang,Jaejin Lee

Main category: cs.CL

TL;DR: Ko-MuSR是首个用于评估长篇韩语叙述中多步骤、软推理能力的基准，基于MuSR构建，具有人工验证的逻辑一致性和可回答性。实验表明，多语言大模型在韩语推理任务中表现优于韩语专用模型，显示出跨语言推理泛化能力；结合少样本示例、推理轨迹和任务提示的策略显著提升准确率，接近人类水平。

Details

Motivation: 现有基准在评估韩语长文本多步推理时存在数据污染和语言适配不足的问题，缺乏专门针对韩语的高质量推理评测集。 Method: 基于MuSR框架构建全韩语叙述、推理链和选择题，由人工标注员验证逻辑一致性和可回答性；在四个大模型（两个多语言、两个韩语专用）上测试，并采用结合少样本、推理路径和任务提示的综合提示策略。 Result: 多语言模型在韩语推理任务中表现优于韩语专用模型；精心设计的提示策略显著提高模型准确率，接近人类水平。 Conclusion: 推理能力具有跨语言可迁移性，多语言模型在韩语复杂推理中仍具优势；Ko-MuSR为韩语长上下文推理和提示策略研究提供了可靠基准。 Abstract: We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.

Aaron Scott,Maike Züfle,Jan Niehues

Main category: cs.CL

TL;DR: 提出首个德语多模态讽刺检测数据集MuSaG，包含来自德国电视节目的文本、音频和视频对齐数据，用于评估单模态与多模态模型，并揭示当前模型在真实场景中的局限性。

Details

Motivation: 社交媒体和流行文化中讽刺语言普遍存在，给自然语言理解带来挑战；现有模型多依赖文本，难以捕捉多模态讽刺线索，尤其是音频信息的重要性未被充分建模。 Method: 构建MuSaG数据集，包含33分钟人工筛选和标注的德国电视节目片段，提供文本、音频、视频三种模态的独立标注，并对九种开源与商业模型进行基准测试，比较其在单模态与多模态设置下的表现。 Result: 人类在对话场景中高度依赖音频线索识别讽刺，而现有模型在文本上表现最佳，多模态模型未能有效利用非文本信息，显示出人与模型之间的感知差距。 Conclusion: 当前多模态讽刺检测模型仍以文本为主导，未能充分模拟人类对音频等模态的依赖，MuSaG有助于推动更贴近真实场景的多模态模型发展。 Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.

[32] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability

Iván Martínez-Murillo,Paloma Moreda,Elena Lloret

Main category: cs.CL

TL;DR: 本文研究了外部知识整合在自然语言生成（NLG）中的作用，特别是在常识生成任务中。作者扩展了CommonGen数据集，构建了包含ConceptNet语义关系的KITGI基准，并使用T5-Large模型比较了完整与过滤知识下的生成效果。实验表明，完整知识下生成句子正确率达91%，而过滤后骤降至6%，证明相关外部知识对生成质量至关重要。

Details

Motivation: 探索外部知识在NLG中的实际影响，尤其是其对生成文本的常识合理性和概念覆盖的作用，并推动可解释性评估方法的发展。 Method: 构建KITGI基准，结合ConceptNet的语义关系；采用T5-Large模型，在完整知识和过滤知识条件下生成句子；通过三阶段可解释性评估：移除关键知识、重新生成、人工评估。 Result: 使用完整外部知识时，生成结果在常识合理性和概念覆盖上达到91%的正确率；当移除高度相关知识后，性能下降至6%。 Conclusion: 相关外部知识对NLG中的连贯性和概念覆盖至关重要；应设计更具可解释性的知识增强型NLG系统，并发展能捕捉深层推理的评估框架。 Abstract: This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91\% correctness across both criteria, while filtering reduced performance drastically to 6\%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.

[33] Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment

Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang

Main category: cs.CL

TL;DR: 本文提出了一种基于潜在空间语义对齐的跨尺度大语言模型知识迁移方法，利用激活值作为层间知识传递的媒介，有效解决了神经不兼容问题，提升了知识迁移的效率和效果。

Details

Motivation: 现有的参数复用方法由于神经不兼容性在不同规模的大语言模型之间进行细粒度知识迁移时受到限制，因此需要一种更有效的跨尺度知识迁移机制。 Method: 通过将激活值作为层间知识迁移的媒介，利用潜在空间中的语义对齐来实现跨尺度知识迁移，而不是直接复用层参数。 Result: 在四个基准上的实验表明该方法优于先前的工作，能够更好地对齐不同规模模型的行为，并揭示了促进跨尺度知识迁移的关键因素。 Conclusion: 语义对齐是实现大语言模型跨尺度知识迁移的基础，所提出的方法为不同规模模型间的灵活、高效知识转移提供了新思路。 Abstract: Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.

[34] HACK: Hallucinations Along Certainty and Knowledge Axes

Adi Simhi,Jonathan Herzig,Itay Itzhak,Dana Arad,Zorik Gekhman,Roi Reichart,Fazl Barez,Gabriel Stanovsky,Idan Szpektor,Yonatan Belinkov

Main category: cs.CL

TL;DR: 本文提出了一种基于知识和确定性两个维度对大语言模型中的幻觉进行分类的框架，强调需根据其内在机制设计针对性缓解策略，并通过模型特定的数据集构建和 steering mitigation 方法验证了该分类的有效性。

Details

Motivation: 现有研究多从外部属性对幻觉进行分类，忽略了其内在机制差异可能导致不同类型的幻觉需要不同的缓解策略，因此需要一种基于模型内部属性（如知识与确定性）的新分类框架。 Method: 提出沿知识轴和确定性轴对幻觉进行分类的框架；构建模型特定数据集以区分不同类型幻觉；使用 steering mitigation 验证知识轴上的分类有效性；引入新的评估指标衡量缓解方法在高置信错误幻觉上的表现。 Result: 验证了知识轴上两类幻觉（缺乏知识 vs. 拥有知识但仍幻觉）存在显著差异；发现即使模型具备正确知识，仍可能出现幻觉；识别出一类高置信但错误的幻觉，现有缓解方法在这些关键案例上表现不佳。 Conclusion: 应同时考虑知识和确定性来分析幻觉，呼吁开发针对不同幻觉机制的精细化缓解方法，而非采用统一策略。 Abstract: Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucination by their external properties rather than by the LLMs' underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the hallucination underlying factors.

[35] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

Teague McMillan,Gabriele Dominici,Martin Gjoreski,Marc Langheinrich

Main category: cs.CL

TL;DR: 研究了推理和训练时的选择如何影响大语言模型在医疗等敏感领域中的解释忠实性，发现少量示例的数量和质量、提示设计以及指令微调均显著影响模型的解释忠实性。

Details

Motivation: 大语言模型生成的解释常常不能真实反映其预测依据，在医疗场景中可能导致临床医生不信任或决策风险，因此需要探究如何提高解释的忠实性。 Method: 在BBQ（社会偏见）和MedQA（医学执照考试题）两个数据集上评估GPT-4.1-mini、LLaMA 70B和LLaMA 8B三种模型，通过操控少量示例的数量与类型、提示策略和训练方式来分析其对解释忠实性的影响。 Result: （i）少量示例的数量和质量显著影响模型解释的忠实性；（ii）提示设计对忠实性敏感；（iii）指令微调阶段能提升在MedQA上的解释忠实性。 Conclusion: 通过优化少量示例、提示设计和指令微调，可在部署阶段有效提升大语言模型在敏感领域的解释忠实性和可信度。 Abstract: Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets-BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

[36] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations

Ahmad Ghannam,Naif Alharthi,Faris Alasmary,Kholood Al Tabash,Shouq Sadah,Lahouari Ghouti

Main category: cs.CL

TL;DR: 本文提出了一种结合文本和语音信息的多模态方法，用于阿拉伯语方言句子中的变音符号恢复（DR）。模型使用自研预训练模型CATT处理文本，用OpenAI Whisper基础模型处理语音，并通过早期融合或交叉注意力策略融合两种模态。实验结果表明该方法在开发集和测试集上均取得较低的词错误率和字符错误率。

Details

Motivation: 阿拉伯语方言缺乏标准书写规范，导致变音符号缺失问题严重，影响自然语言处理性能，因此需要有效的变音符号恢复方法。现有方法多依赖纯文本，忽略了语音信号中蕴含的丰富信息。 Method: 采用多模态架构，文本模态由CATT模型编码，语音模态由Whisper base模型编码；提出两种融合策略：一是将语音帧平均为150个令牌后经线性投影与文本令牌拼接进行早期融合；二是通过交叉注意力机制融合文本与语音嵌入；并在训练中随机关闭语音输入以增强鲁棒性。 Result: 在开发集上达到0.25的WER和0.9的CER，在测试集上达到0.55的WER和0.13的CER。 Conclusion: 结合文本与语音的多模态方法能有效提升阿拉伯语方言变音符号恢复性能，尤其通过早期融合与交叉注意力策略及训练时语音输入的随机屏蔽，增强了模型的鲁棒性和实用性。 Abstract: In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.

[37] Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

Syed Zohaib Hassan,Pål Halvorsen,Miriam S. Johnson,Pierre Lison

Main category: cs.CL

TL;DR: 本研究评估了五种大语言模型（LLM）生成适合5岁和9岁儿童的挪威语对话的能力，结果显示GPT-4和NorBloom-7b表现较好，但大多数模型生成的语言仍超出目标年龄段的语言发展水平，突显出在低资源语言中缺乏适龄语言数据的挑战。

Details

Motivation: 由于大语言模型主要基于成人对话数据训练，在面向儿童的特殊应用场景中难以生成真实、符合年龄特征的儿童语言，因此需要评估现有模型在生成适龄挪威语对话方面的能力。 Method: 比较研究五种LLM（GPT-4、RUTER-LLAMA-2-13b、GPTSW、NorMistral-7b 和 NorBloom-7b）生成的挪威语儿童对话，由11名教育专业人士进行盲评，并结合真实儿童访谈数据与模型生成文本，评估其真实性与发展适宜性。 Result: 评估者在判断5岁儿童对话时准确率高于9岁儿童，且评分者间信度较高（ICC=0.75）；GPT-4和NorBloom-7b表现相对较好，但多数模型生成的语言被认为比目标年龄组更复杂、更成熟。 Conclusion: 当前LLM在生成适龄儿童语言方面存在局限，主要受限于缺乏足够的儿童语言训练数据，尤其在低资源语言环境中，亟需构建适龄、高质量的儿童语言数据集以支持专用系统开发。 Abstract: Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.

[38] From Memorization to Reasoning in the Spectrum of Loss Curvature

Jack Merullo,Srihita Vatsavaya,Lucius Bushnaq,Owen Lewis

Main category: cs.CL

TL;DR: 本文通过损失曲率分解揭示了Transformer模型中记忆化的表示方式，提出一种无需标签即可分离记忆化参数的方法，并设计了一种权重编辑策略有效抑制不期望的记忆输出，同时保持较低的困惑度。研究发现事实检索和算术任务性能显著下降，表明这些任务依赖于权重空间中的特化方向。

Details

Motivation: 理解神经网络中记忆化的机制，并探索如何在不损害模型整体性能的前提下移除不必要的记忆内容，尤其是针对语言模型和视觉变换器中的过度记忆问题。 Method: 基于损失景观曲率对模型权重进行分解，识别高曲率（对应记忆化）和低曲率（对应泛化）成分，按曲率排序并实施权重编辑以抑制记忆化部分；在语言模型和视觉变换器上验证该方法的有效性。 Result: 所提方法比现有遗忘方法（BalancedSubnet）更有效地抑制非目标记忆内容，同时保持更低的困惑度；编辑操作显著影响事实检索和算术任务，但开放式的知识检索和逻辑推理能力得以保留；观察到任务数据激活强度与被编辑的低曲率成分相关，解释了性能下降的原因。 Conclusion: 记忆化可在Transformer权重中通过曲率分解被解耦，权重编辑可选择性去除记忆内容；事实检索与算术等任务依赖于权重空间中特定且窄域的结构，而非通用机制，这为理解模型内部表征和可控遗忘提供了新视角。 Abstract: We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than non memorized, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses far more recitation of untargeted memorized data more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning is conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data's activation strength with low curvature components that we edit out, and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.

[39] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?

Ziqi Ma,Sao Mai Nguyen,Philippe Xu

Main category: cs.CL

TL;DR: 研究大型语言模型（LLM）是否能将自然语言指令转化为分层强化学习中出现的内部符号表示，发现在不同任务和符号划分粒度下性能有限，揭示了当前LLM在语言与智能体内在表征对齐上的不足。

Details

Motivation: 探索大型语言模型能否理解并翻译智能体在发展学习中形成的内部符号表示，以实现跨任务的规划与泛化。 Method: 使用结构化评估框架，测试GPT、Claude、Deepseek和Grok等主流大模型在Ant Maze和Ant Fall环境中，将自然语言指令映射到分层强化学习产生的不同符号分区的能力。 Result: 发现LLM具备一定翻译能力，但性能高度依赖符号划分的粒度和任务复杂性，对细粒度或复杂任务表现不佳。 Conclusion: 当前LLM在自然语言与智能体内部表征之间的对齐能力有限，需进一步研究更鲁棒的语言-表征对齐方法。 Abstract: Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs -- GPT, Claude, Deepseek and Grok -- across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.

[40] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Mădălina Zgreabăn,Tejaswini Deoskar,Lasha Abzianidze

Main category: cs.CL

TL;DR: 提出了一种通过替换开放类词生成高质量NLI问题变体的方法MERGE，用于评估模型在保持推理结构不变的情况下的泛化能力，结果显示现有模型在这些微小变化的问题上性能下降4-20%。

Details

Motivation: 语言模型在自然语言推理（NLI）任务中的泛化能力不足，人工构建新基准成本高，自动生成高质量变体困难。 Method: 通过替换开放类词生成保持原有推理结构的NLI问题变体，构建MERGE泛化测试集。 Result: NLI模型在生成的变体上性能下降4-20%，表明其泛化能力较低；并分析了替换词的词性、词频和合理性对模型性能的影响。 Conclusion: 当前NLI模型对词语替换敏感，即使推理结构未变，性能仍显著下降，说明其依赖表面特征而非深层推理。 Abstract: In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models' perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models' performance.

[41] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

Shangyu Xing,Siyuan Wang,Chenyuan Yang,Xinyu Dai,Xiang Ren

Main category: cs.CL

TL;DR: 提出了一种基于前瞻树的回溯策略LATR，以增强强化学习中采样轨迹的多样性，显著提升大语言模型的推理能力。

Details

Motivation: 当前强化学习中的回滚采样轨迹多样性不足，导致策略学习效率低下。 Method: LATR通过在高不确定性生成步骤进行分支、前瞻模拟和剪枝来促进轨迹级多样性。 Result: 相比随机采样，LATR平均加速策略学习131%，并在GRPO和DAPO算法上将最终pass@1性能提高4.2%。 Conclusion: LATR有效提升了强化学习中轨迹多样性，显著改善了大语言模型的推理性能。 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.

[42] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Zhiheng Xi,Jixuan Huang,Xin Guo,Boyang Hong,Dingwen Yang,Xiaoran Fan,Shuo Li,Zehui Chen,Junjie Ye,Siyu Yuan,Zhengyin Du,Xuesong Yao,Yufei Xu,Jiecao Chen,Rui Zheng,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

TL;DR: 提出Critique-RL，一种无需强监督的在线强化学习方法，通过两阶段优化策略提升批评模型的判别力和有用性，在多个任务和模型上显著提升性能。

Details

Motivation: 现有批评语言模型依赖更强的监督者标注数据，限制了其应用。本文旨在不依赖更强监督的情况下，训练能够有效评估并反馈模型输出的批评模型。 Method: 采用两玩家框架：演员生成回答，批评者提供反馈，演员据此改进。提出两阶段强化学习优化策略：第一阶段使用基于规则的直接奖励信号增强批评者的判别能力；第二阶段引入间接奖励以提高帮助性，同时通过正则化保持判别能力。 Result: 在多个任务和模型（如Qwen2.5-7B）上实验表明，Critique-RL显著提升性能，域内任务提升9.02%，域外任务提升5.70%。 Conclusion: Critique-RL能有效训练出兼具高判别力和高帮助性的批评模型，且无需依赖更强的监督者，具有广泛应用于复杂推理任务的潜力。 Abstract: Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.

[43] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Hunzalah Hassan Bhatti,Firoj Alam

Main category: cs.CL

TL;DR: 提出了一种综合方法，用于评估大语言模型在多语言和多方言环境下的表现，特别是在阿拉伯语方言中的性能，并通过链式思维推理改进模型。

Details

Motivation: 大语言模型在处理文化相关和方言内容时表现不均，尤其在阿拉伯语方言中存在知识差距，需要更全面的评估方法。 Method: 将现代标准阿拉伯语的多项选择题翻译成英语和多种阿拉伯方言，转化为开放式问题，评估多种零样本和微调大语言模型，并生成链式思维推理以微调模型。 Result: 发现模型在阿拉伯方言上表现不佳；以阿拉伯语为中心的模型在多项选择题上表现良好但在开放式问题上表现较差；链式思维推理提高了判断正确性但对n-gram指标效果不一。 Conclusion: 当前大语言模型在处理阿拉伯语方言和文化相关问题时仍存在显著缺陷，需进一步优化，所构建的数据集将公开以促进包容性评估研究。 Abstract: Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.

[44] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

Zikai Xiao,Fei Huang,Jianhong Tu,Jianhui Wei,Wen Ma,Yuxuan Zhou,Jian Wu,Bowen Yu,Zuozhu Liu,Junyang Lin

Main category: cs.CL

TL;DR: 本文提出了LongWeave，一个结合现实场景与可验证评估的长文本生成评测基准，通过约束验证评估（CoV-Eval）方法构建兼具真实性与客观性的任务，评估结果显示现有大模型在复杂长文本生成中仍面临显著挑战。

Details

Motivation: 现有长文本生成评测基准要么依赖难以验证的真实世界查询，要么使用过于简化的合成设置，缺乏对现实复杂性和可量化评估的兼顾，因此需要一种更平衡的评估方法。 Method: 提出Constraint-Verifier Evaluation (CoV-Eval)框架：首先在真实场景中定义可验证目标，再系统生成对应的查询、文本材料和约束条件；基于此构建LongWeave基准，支持最多7种任务、输入64K/输出8K token的定制化长度。 Result: 在23个大语言模型上的评估表明，随着现实复杂性和输出长度增加，即使是当前最先进的模型在长文本生成任务中也表现不佳，暴露出其在遵循复杂约束和长程信息组织方面的不足。 Conclusion: LongWeave通过融合真实应用场景与可验证设计，为长文本生成提供了更严谨的评估方式，揭示了现有LLMs在复杂长输出任务中的局限性，推动未来模型向更可靠、可控的方向发展。 Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce \textbf{LongWeave}, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.

[45] Text Simplification with Sentence Embeddings

Matthew Shardlow

Main category: cs.CL

TL;DR: 本文探索了在文本简化背景下，通过解码句子嵌入来重建文本，并保持复杂度水平的方法。

Details

Motivation: 研究如何在不使用大规模模型的情况下，有效进行文本简化。 Method: 使用小型前馈神经网络学习高复杂度与低复杂度文本句子嵌入之间的转换，并与Seq2Seq和基于大语言模型的方法进行比较。 Result: 在MedEASI数据集及非训练语言（西班牙语、德语）上验证了该方法的适用性，结果令人鼓舞。 Conclusion: 在句子嵌入空间中学习转换是一种有前景的研究方向，有望推动小型但强大的文本简化及其他自然语言生成模型的发展。 Abstract: Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide comparison to a Seq2Seq and LLM-based approach, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES,DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has potential to unlock the ability to develop small, but powerful models for text simplification and other natural language generation tasks.

[46] Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models

Guangyu Xie,Yice Zhang,Jianzhu Bao,Qianlong Wang,Yang Sun,Bingbing Wang,Ruifeng Xu

Main category: cs.CL

TL;DR: 本文提出了一种高效且全面的用于情感分析的知识蒸馏框架COMPEFFDIST，通过自动构建指令和基于难度的数据筛选，显著提升了小模型性能并大幅降低了数据需求。

Details

Motivation: 现有知识蒸馏方法依赖人工编写的指令和大规模用户文本，存在指令多样性不足和计算成本高的问题。 Method: 提出COMPEFFDIST框架，包含基于属性的自动指令构造模块和基于难度的数据过滤模块，分别解决指令覆盖不全和计算开销大的问题。 Result: 在多个模型系列上验证，3B参数的学生模型在多数任务上达到20倍规模教师模型的性能，且仅用10%数据即可达到基线方法同等性能。 Conclusion: COMPEFFDIST实现了高效、高质量的情感分析模型蒸馏，显著提升了数据效率和实用性。 Abstract: Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce COMPEFFDIST, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.

[47] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Ken Gu,Advait Bhat,Mike A Merrill,Robert West,Xin Liu,Daniel McDuff,Tim Althoff

Main category: cs.CL

TL;DR: SynthWorlds是一个新框架，通过构建具有相同结构但一个基于真实世界、一个基于合成世界的平行语料库，来分离语言模型的推理能力与参数化知识，从而更准确地评估其真实推理能力。

Details

Motivation: 现有基准测试难以区分语言模型的表现是源于事实记忆还是真实推理能力，因此需要一种能清晰解耦两者影响的评估方法。 Method: 提出SynthWorlds框架，构建两个结构一致但内容不同的世界（真实映射与合成映射），并在其上设计对称任务（如多跳问答和页面导航），保持推理难度相同但知识可用性不同。 Result: 实验显示，在参数化和知识增强型设置下，模型在真实世界上的性能始终优于合成世界，表明存在持续的知识优势差距，且现有机制只能缩小而无法消除该差距。 Conclusion: SynthWorlds为评估语言模型的推理能力提供了可控、可扩展的环境，有助于未来对推理与记忆进行精确区分和比较。 Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.

[48] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Julian Valline,Cedric Lothritz,Jordi Cabot

Main category: cs.CL

TL;DR: 本文提出了LuxIT，一种针对卢森堡语的单语指令调优数据集，旨在解决低资源语言环境下高质量训练数据缺乏的问题。通过使用DeepSeek-R1-0528从本地文本语料库生成数据，并采用LLM-as-a-judge方法进行质量控制。实验对多个小型语言模型进行了微调和基准测试，结果表现不一，显示出进一步优化的必要性。

Details

Motivation: 由于缺乏高质量的训练数据，指令调优的大语言模型在低资源语言环境中的效果往往受限。因此，需要开发专门针对低资源语言（如卢森堡语）的指令调优数据集以提升其自然语言处理能力。 Method: 利用DeepSeek-R1-0528从卢森堡语原生文本语料库中合成指令调优数据集LuxIT，并采用LLM-as-a-judge的方法进行生成质量评估与筛选，确保数据质量。随后对多个小型大语言模型进行微调并进行基准测试。 Result: 在卢森堡语语言能力测试中，基于LuxIT微调的模型表现参差不齐，不同模型之间性能差异显著，部分模型有所提升，但整体结果为混合效应，未实现一致性的改进。 Conclusion: LuxIT为卢森堡语NLP研究提供了重要贡献，并提供了一种可复制的单语数据集构建方法，但其实用性受模型选择影响较大，仍需进一步研究以优化其应用效果。 Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.

[49] Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

Abdullah Mushtaq,Rafay Naeem,Ezieddin Elmahjub,Ibrahim Ghaznavi,Shawqi Al-Maliki,Mohamed Abdallah,Ala Al-Fuqaha,Junaid Qadir

Main category: cs.CL

TL;DR: 本研究评估了GPT-4o、Ansari AI和Fanar在伊斯兰教义回答中的准确性与一致性，采用双代理框架进行定量与定性分析，发现尽管GPT-4o表现最佳，但现有模型在准确引用和宗教敏感性方面仍有不足，强调需建立以穆斯林视角为核心的社区驱动评估基准。

Details

Motivation: 大型语言模型在提供伊斯兰指导时存在误引文本、误用教法或文化不一致的风险，亟需可靠评估方法以确保信仰敏感内容的准确性。 Method: 采用双代理评估框架：定量代理负责引文验证和六维评分（如结构、伊斯兰一致性、引文等），定性代理进行五维对比分析（如语气、深度、原创性），基于真实伊斯兰博客的提问对GPT-4o、Ansari AI和Fanar进行测试。 Result: GPT-4o在伊斯兰准确性（3.93）和引文（3.38）得分最高，Ansari AI次之（3.68, 3.32），Fanar较低（2.76, 1.82）；GPT-4o定量总分最高（3.90/5），Ansari AI在定性成对比较中胜出最多（116/200），Fanar虽落后但具备面向伊斯兰和阿拉伯语境的创新。 Conclusion: 当前模型尚无法稳定生成准确的伊斯兰内容和引文，必须满足信仰敏感写作的核心要求；研究呼吁建立植根于穆斯林社群的评估基准，为宗教、医学、法律等高风险领域的可信AI发展提供初步路径。 Abstract: Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.

[50] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

Viktoriia Zinkovich,Anton Antonov,Andrei Spiridonov,Denis Shepelev,Andrey Moskalenko,Daria Pugacheva,Elena Tutubalina,Andrey Kuznetsov,Vlad Shakhuro

Main category: cs.CL

TL;DR: 本文提出了一种新的对抗性改写任务，旨在生成在语法正确且语义不变的前提下降低视觉-语言模型分割性能的文本改写，并提出了SPARTA方法，该方法在低维语义空间中通过强化学习进行黑箱优化，显著提升了攻击成功率，揭示了当前推理分割模型在面对语义约束下的对抗改写时仍存在脆弱性。

Details

Motivation: 现有研究多关注图像输入的扰动，而忽视了在真实应用场景中用户以不同方式表达相同意图的文本同义改写对模型鲁棒性的影响，因此需要探索文本对抗改写对多模态大模型的影响。 Method: 提出SPARTA方法，一种基于文本自编码器低维语义潜空间的黑箱、句子级优化方法，结合强化学习生成对抗性改写，并设计了综合的自动评估协议，通过人工实验验证其有效性。 Result: SPARTA在ReasonSeg和LLMSeg-40k数据集上的攻击成功率比现有方法最高提升2倍，并揭示了先进推理分割模型在严格语义和语法约束下仍易受对抗改写影响。 Conclusion: 多模态大语言模型在面对语义保持的对抗性文本改写时仍然脆弱，SPARTA提供了一种有效的评估与攻击框架，强调了提升模型语言理解鲁棒性的必要性。 Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases-crucial in real-world applications where users express the same intent in varied ways-remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA-a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing-even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.

[51] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices

Špela Vintar,Taja Kuzman Pungeršek,Mojca Brglez,Nikola Ljubešić

Main category: cs.CL

TL;DR: 本文讨论了大语言模型（LLM）在非英语语言中评估的不足，提出了一种针对多语言或非英语使用场景的基准测试新分类法，并建议了一套最佳实践和质量标准，以促进欧洲语言基准测试的协调发展，强调评估方法应具有更高的语言和文化敏感性。

Details

Motivation: 随着大语言模型能力不断提升，现有的基准测试多集中于英语，非英语语言的评估体系发展滞后，缺乏统一标准和文化敏感性，亟需系统性改进。 Method: 通过综述现有LLM基准测试的发展现状，提出一种面向多语言场景的基准分类新体系，并结合实际需求，制定适用于欧洲语言的基准开发最佳实践与质量标准。 Result: 提出了一套针对多语言LLM基准测试的分类法和一系列最佳实践，强调语言和文化敏感性在评估中的重要性，为非英语基准建设提供了系统性指导。 Conclusion: 为了更公平、准确地评估非英语大语言模型，需要建立更具文化与语言敏感性的标准化基准体系，推动多语言AI的均衡发展。 Abstract: While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.

[52] Iterative Critique-Refine Framework for Enhancing LLM Personalization

Durga Prasad Maram,Dhruvin Gandhi,Zonghai Yao,Gayathri Akkinapalli,Franck Dernoncourt,Yu Wang,Ryan A. Rossi,Nesreen K. Ahmed

Main category: cs.CL

TL;DR: PerFine是一个无需训练的批评-精炼框架，通过迭代的、基于用户画像的反馈来提升个性化文本生成的质量。

Details

Motivation: 现有的检索增强方法在生成过程中容易偏离用户的风格、语气和主题焦点，缺乏有效的后处理优化机制。 Method: 提出PerFine框架：使用LLM生成器根据检索到的用户画像生成草稿，并由同样基于该画像的批评LLM提供关于语气、词汇、句式和主题性的结构化反馈；通过迭代修订和新颖的淘汰策略保留更优草案，同时研究Best-of-N和主题提取等推理时策略。 Result: 在Yelp、Goodreads和Amazon数据集上，PerFine相比PGraphRAG consistently提升了个人化效果，GEval指标提高7-13%，在3-5次精炼迭代中持续改进，并且随着批评模型增大表现出可扩展性。 Conclusion: 事后进行的、基于用户画像的反馈是一种强大且无需训练、模型无关的个性化LLM生成范式。 Abstract: Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.

[53] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems

Yihan Li,Xiyuan Fu,Ghanshyam Verma,Paul Buitelaar,Mingming Liu

Main category: cs.CL

TL;DR: 本文综述了检索增强生成（RAG）和推理增强在缓解大语言模型幻觉方面的协同作用，提出了区分基于知识和基于逻辑的幻觉的分类法，并提供了一个统一框架。

Details

Motivation: 幻觉是大语言模型在实际应用中可靠部署的主要障碍之一，现有方法缺乏对RAG与推理增强协同机制的系统性研究。 Method: 采用面向应用的能力增强视角，提出新的幻觉分类法，系统分析RAG、推理增强及其在智能体系统中的整合作用，并结合实际应用、评估和基准测试构建统一框架。 Result: 明确了RAG主要缓解基于知识的幻觉，推理增强主要应对基于逻辑的幻觉，二者结合可有效提升生成的可靠性与创造力平衡。 Conclusion: RAG与推理增强具有互补性，其集成在智能体系统中为未来减少大语言模型幻觉提供了有前景的方向。 Abstract: Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.

[54] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Frederik Broy,Maike Züfle,Jan Niehues

Main category: cs.CL

TL;DR: 本文提出了一个新任务RPT（从学术报告中预测参考文献），并发布了首个大规模数据集Talk2Ref，包含6,279场报告和43,429篇被引论文。通过评估最先进的文本嵌入模型并提出一种双编码器架构，证明在该数据集上微调能显著提升引用预测性能。

Details

Motivation: 学术报告是传播研究成果的重要方式，自动识别与报告内容相关的参考文献对研究者和学生具有重要价值。现有工作缺乏针对口头科学内容的引用推荐支持。 Method: 构建了Talk2Ref数据集，以报告对应出版物中的参考文献作为相关性标注；采用零样本检索评估主流文本嵌入模型，并设计基于该数据集训练的双编码器模型；探索处理长文本和领域自适应的方法。 Result: 实验表明，在Talk2Ref上微调显著提升了引用预测效果，验证了该任务的挑战性和数据集的有效性。最佳模型在零样本设置下表现有限，但有监督训练带来明显改进。 Conclusion: RPT是一个有价值的新任务，Talk2Ref为从口语化科研内容中学习语义表示提供了有效资源，推动将口头科学交流整合到引用推荐系统中的研究。 Abstract: Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.

[55] A word association network methodology for evaluating implicit biases in LLMs compared to humans

Katherine Abramski,Giulio Rossetti,Massimo Stella

Main category: cs.CL

TL;DR: 提出一种基于词关联网络的语义启动模拟方法，用于评估大语言模型中的隐性社会偏见，可实现模型与人类偏见的直接比较。

Details

Motivation: 大语言模型（LLMs）日益普及，但其隐含的社会偏见难以检测，需开发能评估其隐性知识表征的方法。 Method: 构建基于提示的词关联网络，模拟语义启动效应，分析LLMs生成的词间关联结构，以量化和质性方式评估性别、宗教、种族等隐性偏见，并与人类反应进行对比。 Result: 在多个LLM和人类样本上验证了该方法的有效性，发现LLM与人类在某些偏见上趋同，但在其他方面存在差异，揭示了LLM潜在的社会风险。 Conclusion: 该方法为系统、可扩展地评估和比较不同LLM及人类之间的隐性偏见提供了新框架，有助于推动透明且负责任的语言技术发展。 Abstract: As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.

[56] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

Qing Zong,Jiayu Liu,Tianshi Zheng,Chunyang Li,Baixuan Xu,Haochen Shi,Weiqi Wang,Zhaowei Wang,Chunkit Chan,Yangqiu Song

Main category: cs.CL

TL;DR: 提出CritiCal方法，利用自然语言批评提升大模型置信度校准，尤其在复杂推理和分布外场景中表现优异。

Details

Motivation: 传统方法难以准确捕捉大模型的置信度，尤其是在缺乏精确标注的情况下，需要更有效的校准方式。 Method: 引入自然语言批评，研究批评对象（不确定性或置信度）和方式（自批评或CritiCal训练），提出Self-Critique和CritiCal方法。 Result: CritiCal在多项任务上显著优于基线方法，甚至超过教师模型GPT-4o，并在分布外场景中表现出强泛化能力。 Conclusion: 自然语言批评是提升大模型口头置信度校准的有效途径，CritiCal为可靠的大模型应用提供了新方向。 Abstract: Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM's reliability.

[57] Levée d'ambiguïtés par grammaires locales

Eric G. C. Laporte

Main category: cs.CL

TL;DR: 本文提出了一种适应零静音率目标的词性消歧方法，并在INTEX系统中实现，强调了局部消歧文法需仔细测试，且多个转换器组合使用时其结果无法单独预测。

Details

Motivation: 为了实现词性标注中的零静音率目标，确保正确的词性标签从不被丢弃，需要对消歧方法进行精确建模和验证。 Method: 采用基于上下文的局部消歧文法，利用转换器（transducer）处理词性歧义，并分析其相互作用；在INTEX系统中实现并形式化描述该方法。 Result: 发现不能单独分析转换器路径，必须考虑它们之间的交互；多个转换器组合的结果无法孤立预测；语法直觉可能因意外结构或歧义而不准确。 Conclusion: 为确保零静音率，局部文法必须经过严格测试，需对文法在文本上的应用效果进行详细规范。 Abstract: Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein's INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.

[58] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written

Venkata S Govindarajan,Laura Biester

Main category: cs.CL

TL;DR: 本文研究了Bulwer-Lytton小说竞赛中的“糟糕”幽默语料库，发现标准幽默检测模型表现不佳，且LLM生成的句子虽模仿了形式但过度使用修辞手法。

Details

Motivation: 为了更全面地理解英语中“糟糕”的幽默，需扩展计算研究覆盖包括刻意制造的低质量幽默在内的广泛文本幽默类型。 Method: 构建并分析Bulwer-Lytton小说竞赛的新语料库，评估标准幽默检测模型的表现，并分析文学手法；使用大语言模型生成类似风格句子进行对比。 Result: 标准幽默检测模型在该语料上表现差；人类创作结合了双关、反讽、隐喻、元小说和明喻等手法；LLM生成文本过度使用某些修辞，且包含更多新颖的形容词-名词搭配。 Conclusion: ‘糟糕’幽默具有独特特征，现有模型难以捕捉，LLM虽可模仿形式但存在夸张倾向，需进一步建模真实幽默多样性。 Abstract: Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton

[59] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

Seyoung Song,Nawon Kim,Songeun Chae,Kiwoong Park,Jiho Jin,Haneul Yoo,Kyunghyun Cho,Alice Oh

Main category: cs.CL

TL;DR: 本文介绍了Open Korean Historical Corpus，一个涵盖1300年历史、多种语言和书写系统的韩语历史语料库，用于填补韩语历史语言学在NLP领域的空白。

Details

Motivation: 由于缺乏可获取的历史语料，韩语的历史演变在自然语言处理领域长期未被充分研究。本文旨在通过构建大规模开放语料库来弥补这一空白。 Method: 收集并整理了跨越1300年、来自19个来源的1800万份文档（共50亿词符），涵盖多种语言和书写系统（如Idu、汉谚混写等），并基于该语料库进行定量的语言变迁分析。 Result: 发现Idu使用在1860年代达到顶峰后急剧下降；汉字到韩文的转变始于约1890年并迅速完成；朝鲜的词汇分化导致现代分词器的未知词率高达51倍。 Conclusion: 该语料库为韩语的历史语言学提供了基础资源，可用于大模型预训练，提升对现代韩文中的汉韩词汇及古文字系统的理解。 Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.

[60] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Raphaël Bagat,Irina Illina,Emmanuel Vincent

Main category: cs.CL

TL;DR: 提出BEARD框架，利用无标签数据通过BEST-RQ目标和知识蒸馏来适应Whisper的编码器，在航空交通控制语音识别任务中显著优于基线模型。

Details

Motivation: ASR系统在领域外和低资源场景下表现不佳，尤其当标注数据稀缺时。需要一种有效利用无标签数据进行领域自适应的方法。 Method: 提出BEARD框架，结合BEST-RQ目标和来自冻结教师编码器的知识蒸馏，使用无标签数据微调Whisper编码器，保持与预训练解码器的兼容性。 Result: 在ATCO2语料库上，使用约5000小时无转录语音进行BEARD训练，2小时有标签数据微调，相比微调模型相对错误率降低12%。 Conclusion: BEARD是首个将自监督学习目标用于Whisper领域自适应的工作，在低资源、高噪声、非母语语音场景中表现出显著优势。 Abstract: Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.

[61] ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Christine Ye,Sihan Yuan,Suchetha Cooray,Steven Dillmann,Ian L. V. Roque,Dalya Baron,Philipp Frank,Sergio Martin-Alvarez,Nolan Koblischke,Frank J Qu,Diyi Yang,Risa Wechsler,Ioana Ciuca

Main category: cs.CL

TL;DR: 本文提出了ReplicationBench，一个用于评估AI代理在天体物理学中复制整篇研究论文能力的框架，通过与原作者合作设计任务来检验AI代理的忠实度和正确性，发现当前最先进的语言模型表现不佳，揭示了AI代理在科学研究中的多种失败模式。

Details

Motivation: 为了评估AI代理作为科研助手的可行性和可靠性，特别是在需要长时间、开放式研究流程的任务中，有必要建立一个能够测试AI代理复制整个研究论文能力的评估框架。 Method: 引入ReplicationBench评估框架，将每篇论文分解为多个任务，涵盖实验设置、推导、数据分析和代码库等核心贡献，并由原作者共同开发这些任务以确保客观评价。 Result: 当前最先进语言模型在ReplicationBench上的得分低于20%，分析显示存在多种失败模式。 Conclusion: ReplicationBench建立了首个经过专家验证的大规模天体物理研究任务基准，揭示了AI代理性能对其他数据驱动科学领域的普遍意义，并提供了衡量AI代理在科学研究中可靠性的可扩展框架。 Abstract: Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.

[62] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Guoxin Chen,Jing Wu,Xinjie Chen,Wayne Xin Zhao,Ruihua Song,Chengxi Li,Kai Fan,Dayiheng Liu,Minpeng Liao

Main category: cs.CL

TL;DR: 本文提出了一种名为ReForm的反思式自动形式化方法（Reflective Autoformalization），通过引入语义一致性评估机制，实现形式化语句的迭代生成、自我评估与纠错。为有效训练该模型，作者提出了前瞻性有界序列优化（PBSO）方法，并在多个基准上实现了平均17.2个百分点的提升。此外，新构建的ConsistencyCheck基准揭示了自动形式化的难度，即使人类专家也会在高达38.5%的情况下出现语义错误。

Details

Motivation: 现有大语言模型在将自然语言数学问题转化为机器可验证的形式语言时，虽能保证语法正确，但常丢失语义意图。其根本原因在于当前方法将自动形式化视为简单翻译任务，缺乏自我反思和迭代改进机制。因此，需要一种能够模拟人类专家反思过程的方法来提升语义保真度。 Method: 提出ReForm方法，将语义一致性评估嵌入自动形式化过程，实现生成-评估-修正的迭代流程；同时设计Prospective Bounded Sequence Optimization（PBSO）训练机制，对不同序列位置赋予差异化奖励，以同步优化形式化准确性和语义验证能力。 Result: 在四个自动形式化基准上，ReForm比最强基线平均提升17.2个百分点；新构建的ConsistencyCheck包含859个专家标注样本，验证了LLM作为评判者的可靠性，并发现人类专家在自动形式化中也会产生最高达38.5%的语义错误。 Conclusion: ReForm通过引入反思机制显著提升了自动形式化的语义准确性，PBSO训练策略有效支持了生成与验证能力的协同学习，ConsistencyCheck基准进一步揭示了该任务的本质难度，为未来研究提供了可靠评估标准。 Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises from the LLM approaches' treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.

[63] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

Yicun Yang,Cong Wang,Shaobo Wang,Zichen Wen,Biqing Qi,Hanlin Xu,Linfeng Zhang

Main category: cs.CL

TL;DR: 本文提出了一种具有可变生成长度的扩散式大语言模型dLLM-Var，通过准确预测[EOS]标记实现块状扩散生成，兼顾全局双向注意力与高并行性，在标准基准上实现了相比传统扩散模型30.1倍、自回归模型2.4倍的加速，兼具高效性与实用性。

Details

Motivation: 现有扩散式大语言模型（dLLMs）需预设固定的生成长度作为超参数，导致在生成效率和灵活性方面存在局限，难以适应实际应用中可变长度的文本生成需求。 Method: 提出dLLM-Var，训练模型准确预测生成文本中的[EOS]标记，使其能够以块扩散方式原生推断，同时保持全局双向注意力机制和高并行生成能力。 Result: 在标准基准测试中，该方法相比传统dLLM推理范式实现30.1倍速度提升，相比Qwen和Llama等自回归模型提升2.4倍，且具备更高准确性。 Conclusion: dLLM-Var解决了dLLMs固定生成长度的问题，显著提升了推理速度与灵活性，推动扩散式大语言模型从学术探索迈向实际应用。 Abstract: Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.

[64] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

Siheng Xiong,Joe Zou,Faramarz Fekri,Yae Jee Cho

Main category: cs.CL

TL;DR: 本文提出了Dynamic Hierarchical Sparse Attention (DHSA)，一种数据驱动的动态稀疏注意力机制，能够在不重新训练的情况下在线预测注意力稀疏性，有效降低长上下文大模型的计算和内存开销，同时保持与全注意力相近的精度。

Details

Motivation: 现有的静态稀疏注意力方法（如滑动窗口或全局token）因缺乏内容适应性而表现受限，而现有动态方法依赖预定义模板或启发式规则，通用性差且可能剪枝重要token，影响多任务准确性。 Method: DHSA将序列自适应地划分为变长块，通过长度归一化的嵌入聚合生成块表示，再将块级相似度上采样至token级别以计算重要性分数，动态决定保留哪些token交互，实现在线稀疏化。 Result: 在Gemma2上的实验显示，DHSA在Needle-in-a-Haystack和LongBench测试中达到与全注意力相当的准确率，预填充延迟减少20-60%，峰值内存降低35%；相比块稀疏注意力等基线，准确率相对提升6-18%，成本相当或更低。 Conclusion: DHSA提供了一种高效、自适应的长上下文建模方案，特别适用于资源受限的设备端大语言模型部署。 Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilizes the sparsity of attention to reduce the cost of attention, but poorly adapts to the content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.

[65] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation

Snegha A,Sayambhu Sen,Piyush Singh Pasi,Abhishek Singhania,Preethi Jyothi

Main category: cs.CL

TL;DR: 本文研究了前缀式方法在解码器-only大语言模型上的零样本跨语言迁移效果，发现其在多语言场景下优于LoRA等基线方法，尤其在低资源语言中表现突出。

Details

Motivation: 尽管LoRA等参数高效微调技术被广泛使用，但前缀式提示调优在解码器-only模型中的零样本跨语言迁移能力尚未充分探索。 Method: 对三种前缀式方法（软提示调优、前缀调优、Llama Adapter）在35种以上高/低资源语言上进行系统性实验，涵盖不同语系和文字，并在1B到24B规模的模型上评估性能。 Result: 在Belebele基准上，Llama 3.1 8B使用前缀方法比LoRA提升达6%；Mistral 7B也表现出类似改进，且仅用1.23M参数即实现跨多个基准的一致提升。 Conclusion: 前缀式方法是LoRA之外一种高效且可扩展的替代方案，特别适用于低资源多语言场景下的零样本跨语言迁移。 Abstract: With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.

[66] Relative Scaling Laws for LLMs

William Held,David Hall,Percy Liang,Diyi Yang

Main category: cs.CL

TL;DR: 本文提出了相对扩展定律，通过255个解码器-only Transformer模型在相同计算预算下的实验，揭示了不同测试分布间性能差距随规模变化的多样性轨迹，表明扩展虽提升整体性能但并非普遍均衡器。

Details

Motivation: 现有的扩展定律通常基于聚合测试集评估，掩盖了不同子群体之间的性能差异，因此需要引入能够追踪不同测试分布之间性能差距演变的相对扩展定律。 Method: 训练了255个解码器-only Transformer模型，在$10^{18}$--$10^{20}$ FLOPs的匹配计算预算下，使用标准预训练数据集，分析不同测试分布间的性能差距演变。 Result: 发现学术领域趋向性能均衡，区域英语方言表现依赖人口规模，AI风险行为在预训练中能力与影响力相关风险增加而对抗性风险不增加。 Conclusion: 扩展虽然提升了整体性能，但并不意味着所有子群体性能差距缩小，因此不能视为通用的平等化工具。 Abstract: Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.

[67] "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue

Anh Ngo,Nicolas Rollet,Catherine Pelachaud,Chloe Clavel

Main category: cs.CL

TL;DR: 提出一种多模态模型，结合语言学和韵律特征，用于自动检测荷兰语对话中的他人发起的修复（OIR），结果表明韵律线索能显著提升预训练文本和音频嵌入的效果。

Details

Motivation: 当前对话系统难以识别用户的修复意图，导致对话中断或用户脱离，因此需要更有效的修复启动检测方法。 Method: 基于会话分析，融合语言学与韵律特征构建多模态模型，使用预训练的文本和音频嵌入进行修复启动检测。 Result: 韵律线索能够补充语言特征，显著提升检测性能，揭示了不同特征间的交互作用。 Conclusion: 所提出的多模态模型有效提升了OIR检测效果，未来可引入视觉线索并扩展至多语言和跨场景语料以增强泛化能力。 Abstract: Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.

[68] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

Ziyou Hu,Zhengliang Shi,Minghang Zhu,Haitao Li,Teng Sun,Pengjie Ren,Suzan Verberne,Zhaochun Ren

Main category: cs.CL

TL;DR: 本文提出了OpenRM，一种工具增强的长篇幅奖励模型，通过调用外部工具获取证据来系统评估开放性回答，并在多个数据集上显著优于现有方法。

Details

Motivation: 现有的奖励模型在知识密集型和长篇任务上表现不佳，难以准确评估需要外部证据支持的回答质量。 Method: 提出OpenRM模型，结合外部工具进行证据收集，并使用Group Relative Policy Optimization（GRPO）在超过27K个合成的成对样本上训练，联合监督工具使用和最终结果准确性。 Result: 在三个新构建的数据集和两个常用基准上实验表明，OpenRM显著优于现有奖励模型；将其应用于推理阶段和训练阶段均带来一致的性能提升。 Conclusion: 工具增强的奖励模型（如OpenRM）能够有效提升长篇、知识密集型任务中的评估可靠性，为大规模语言模型的对齐提供了新方向。 Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.

[69] Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia

Hugo Rydel-Johnston,Alex Kafkas

Main category: cs.CL

TL;DR: 该研究通过眼动追踪技术分析了阅读障碍者在自然阅读中的时间成本，发现词长、词频和可预测性均显著影响阅读时间，且阅读障碍者对这些特征更敏感，尤其是可预测性。通过反事实操纵这些因素，可缩小约三分之一的阅读障碍与对照组之间的差距。

Details

Motivation: 探究阅读障碍者在真实阅读情境中产生额外阅读成本的具体条件和位置，以理解其认知机制并为干预措施提供依据。 Method: 使用眼动追踪数据，结合词语层面的特征（词长、词频、可预测性），建模分析这些特征如何影响典型读者和阅读障碍者的阅读时间，并进行反事实操纵实验。 Result: 阅读障碍者对词长、词频和可预测性均表现出更强的敏感性，尤其是可预测性；通过调整这些特征可使阅读障碍与对照组的差距缩小约三分之一；结果支持语言工作记忆和语音编码负荷增强的理论。 Conclusion: 阅读障碍者的额外阅读成本主要源于对词汇特征的高敏感性，特别是语境可预测性；研究量化了这些成本的大小和发生时机，并为干预策略及计算模型提供了可行指导。 Abstract: We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.

[70] Optimizing Retrieval for RAG via Reinforced Contrastive Learning

Jiawei Zhou,Lei Chen

Main category: cs.CL

TL;DR: 提出R3，一种通过试错强化对比学习优化检索增强生成（RAG）的检索框架，无需依赖标注或合成数据，能在RAG环境中动态优化相关性，显著提升性能且高效实用。

Details

Motivation: 随着检索增强生成（RAG）的普及，信息检索（IR）的目标从服务人类转向服务于AI系统，传统相关性定义难以适用，缺乏明确标注，因此需要一种能自适应优化检索效果的方法。 Method: 提出R3框架，采用试错强化对比学习（Reinforced contrastive learning），在RAG环境中通过检索结果与环境交互生成对比信号，实现检索器的自我优化，无需依赖标注或合成监督数据。 Result: 在多种任务上实验表明，R3比原始检索器提升5.2%，优于当前最先进的检索器4.9%，性能媲美基于大模型增强检索或指令微调LLM的RAG系统，且仅需4块GPU并在一天内完成训练。 Conclusion: R3为RAG场景下的检索优化提供了一种高效、实用且无需人工标注的新范式，具备良好的性能和部署可行性。 Abstract: As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trialand-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.

[71] Evolving Diagnostic Agents in a Virtual Clinical Environment

Pengcheng Qiu,Chaoyi Wu,Junwei Liu,Qiaoyu Zheng,Yusheng Liao,Haowen Wang,Yun Yue,Qianrui Fan,Shuai Zhen,Jian Wang,Jinjie Gu,Yanfeng Wang,Ya Zhang,Weidi Xie

Main category: cs.CL

TL;DR: 本文提出了一种通过强化学习训练大语言模型作为诊断代理的框架，能够在交互式临床环境中学习动态诊断策略，在多种诊断场景下显著优于现有大模型。

Details

Motivation: 传统基于静态病例摘要训练的指令调优模型无法有效模拟真实临床诊断过程中的多轮交互与决策，因此需要一种能够通过互动探索和结果反馈来学习诊断策略的新方法。 Method: 提出DiagGym作为虚拟临床环境，结合电子健康记录生成检查结果；利用端到端的多轮强化学习训练DiagAgent以优化信息获取和诊断准确性；构建包含医师验证数据的诊断基准DiagBench进行评估。 Result: DiagAgent在单轮和端到端设置下分别比现有最佳模型提高9.34%和15.12%的诊断准确率，并在检查推荐、F1分数及基于评分细则的评估中均显著优于包括GPT-4o和Claude-sonnet-4在内的主流模型。 Conclusion: 通过在交互式临床环境中学习诊断策略，可赋予大语言模型更动态且具有临床意义的诊断管理能力，超越被动训练模型的局限性。 Abstract: In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.

[72] MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation

Parker Riley,Daniel Deutsch,Mara Finkelstein,Colten DiIanni,Juraj Juraska,Markus Freitag

Main category: cs.CL

TL;DR: 提出MQM重注释的两阶段翻译评估方法，通过修正先前的注释提高评估质量。

Details

Motivation: 随着机器翻译模型质量提升，需要改进评估方法以减少评估噪声。 Method: 在现有MQM框架基础上引入重注释阶段，由标注者审查和修改已有注释。 Result: 重注释能发现首轮遗漏的错误，显著提高标注质量。 Conclusion: MQM重注释是一种有效的评估改进方法，有助于更准确地衡量翻译质量。 Abstract: Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.

[73] InteractComp: Evaluating Search Agents With Ambiguous Queries

Mingyi Deng,Lijun Huang,Yani Fan,Jiayi Zhang,Fashen Ren,Jinyi Bai,Fuzhen Yang,Dayi Miao,Zhaoyang Yu,Yifan Wu,Yanfei Zhang,Fengwei Teng,Yingjia Wan,Song Hu,Yude Li,Xin Jin,Conghao Hu,Haoyu Li,Qirui Fu,Tai Zhong,Xinyu Wang,Xiangru Tang,Nan Tang,Chenglin Wu,Yuyu Luo

Main category: cs.CL

TL;DR: 本文提出了InteractComp，一个用于评估搜索代理在面对模糊查询时能否主动交互以消除歧义的新基准。研究发现现有模型普遍表现不佳且过于自信，强制交互可显著提升性能，但过去15个月中交互能力停滞不前，凸显该领域的重要盲点。

Details

Motivation: 现实中的用户查询往往是不完整或模糊的，需要通过交互澄清，但现有搜索代理缺乏此类机制，且缺少评估该能力的基准。 Method: 基于“易于验证、需交互消歧”的原则，采用目标-干扰项方法构建了包含9个领域共210个专家标注问题的InteractComp基准，用以测试语言代理在搜索过程中的交互式消歧能力。 Result: 对17个模型的评估显示最佳准确率仅为13.73%（完整上下文下为71.50%），表现出系统性过度自信；强制交互能带来显著性能提升；纵向分析发现交互能力在过去15个月未有进展，而搜索性能大幅提升。 Conclusion: InteractComp揭示了当前语言代理在交互式搜索中的关键缺陷，尤其是交互能力发展停滞，该基准可有效支持未来对交互能力的评估与训练。 Abstract: Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.

[74] Dissecting Role Cognition in Medical LLMs via Neuronal Ablation

Xun Liang,Huayi Lai,Hanyu Wang,Wentao Zhang,Linfeng Zhang,Yanfang Chen,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

TL;DR: 本研究提出RPNA框架，评估角色提示是否能引发大语言模型在医疗决策中的不同认知过程，结果表明角色提示主要影响语言风格，而非提升医学推理能力。

Details

Motivation: 探究角色提示对大语言模型医学推理能力的影响，揭示当前基于提示的角色扮演方法是否真正模拟了临床思维。 Method: 提出RP-Neuron-Activated评价框架（RPNA），结合神经元消融和表征分析技术，在三个医疗问答数据集上分析角色提示对模型推理路径的影响。 Result: 角色提示并未显著增强模型的医学推理能力，仅改变语言表层特征，未发现不同临床角色间存在差异化的推理路径或认知分化，核心决策机制保持一致。 Conclusion: 当前的角色提示方法未能复制真实医疗实践中的认知复杂性，仅实现语言模仿，需开发能真正模拟临床思维的认知型模型。 Abstract: Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework(RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation.We have released the related code in the following repository:https: //github.com/IAAR-Shanghai/RolePlay_LLMDoctor

[75] SPICE: Self-Play In Corpus Environments Improves Reasoning

Bo Liu,Chuanyang Jin,Seungone Kim,Weizhe Yuan,Wenting Zhao,Ilia Kulikov,Xian Li,Sainbayar Sukhbaatar,Jack Lanchantin,Jason Weston

Main category: cs.CL

TL;DR: SPICE是一种基于语料库环境的自对弈强化学习框架，通过将模型分为挑战者和推理者两个角色，实现持续自我提升。

Details

Motivation: 现有的无 grounding 自对弈方法改进有限，缺乏持续适应所需的外部信号。 Method: 提出SPICE框架，模型在大规模语料库中扮演Challenger（挖掘文档生成任务）和Reasoner（解决问题）双重角色，通过对抗性动态形成自动课程。 Result: 在数学（+8.9%）和通用推理（+9.8%）基准上均取得一致提升，且效果跨多个模型家族稳定。 Conclusion: 语料库 grounding 是实现持续自我改进的关键，使系统能不断生成并完成更具挑战性的目标。 Abstract: Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.

[76] Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Yida Zhao,Kuan Li,Xixi Wu,Liwen Zhang,Dingchu Zhang,Baixuan Li,Maojia Song,Zhuo Chen,Chenxi Wang,Xinyu Wang,Kewei Tu,Pengjun Xie,Jingren Zhou,Yong Jiang

Main category: cs.CL

TL;DR: 本文提出了一种新的训练框架E-GRPO，通过引入基于实体匹配的密集奖励函数，使LLM-based搜索代理能从“接近正确”的样本中学习，从而在问答和深度研究任务上显著优于传统的GRPO方法。

Details

Motivation: 现有训练方法（如GRPO）忽略推理过程中的实体信息，仅依赖稀疏的结果奖励，无法有效利用包含正确推理但答案错误的‘近似正确’样本，导致学习信号浪费。 Method: 提出Entity-aware Group Relative Policy Optimization (E-GRPO)，在训练中利用推理过程中识别出的ground-truth实体数量构建密集奖励函数，对错误样本根据实体匹配率给予部分奖励。 Result: 在多个问答和深度研究基准上的实验表明，E-GRPO显著优于GRPO基线，不仅提高了准确率，还减少了工具调用次数，实现了更高效、更优的推理策略。 Conclusion: E-GRPO通过利用被忽略的实体信息，实现了更有效的学习信号利用，在知识密集型任务中展现出更强的性能和样本效率，为搜索代理的对齐提供了新方向。 Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples-those with substantially correct reasoning but a flawed final answer-from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.

[77] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

Xuanzhong Chen,Zile Qiao,Guoxin Chen,Liangcai Su,Zhen Zhang,Xinyu Wang,Pengjun Xie,Fei Huang,Jingren Zhou,Yong Jiang

Main category: cs.CL

TL;DR: 提出基于最近发展区（ZPD）理论的数据合成方法，通过AgentFrontier引擎生成位于大语言模型能力前沿的高质量多学科训练数据，并构建ZPD考试评估代理能力，在多个复杂任务上实现最先进的性能。

Details

Motivation: 为了提升大语言模型代理在接近其能力极限的任务上的推理能力，需要一种能够精准定位并利用模型‘最近发展区’的训练数据生成方法。 Method: 提出AgentFrontier引擎，自动化合成处于LLM ZPD区域内的知识密集型和复杂推理任务数据，支持持续预训练和针对性后训练；同时构建ZPD Exam作为动态评估基准。 Result: 训练出的AgentFrontier-30B-A3B模型在Humanity's Last Exam等高难度基准上达到最先进水平，表现优于部分领先的专有代理模型。 Conclusion: 基于ZPD的数据合成方法为构建更强大、可扩展的LLM代理提供了一条有效路径。 Abstract: Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM's ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity's Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.

[78] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

Zhengwei Tao,Haiyang Shen,Baixuan Li,Wenbiao Yin,Jialong Wu,Kuan Li,Zhongwang Zhang,Huifeng Yin,Rui Ye,Liwen Zhang,Xinyu Wang,Pengjun Xie,Jingren Zhou,Yong Jiang

Main category: cs.CL

TL;DR: 本文提出了WebLeaper框架，通过构建高覆盖率的信息检索任务和高效搜索路径，提升基于大语言模型的智能体在信息检索中的效率与效果。

Details

Motivation: 现有信息检索智能体因目标实体稀疏导致搜索效率低下，限制了整体性能。 Method: 将信息检索建模为树结构推理问题，利用维基百科表格生成三种合成任务变体（Basic、Union、Reverse-Union），并通过筛选准确且高效的训练轨迹来优化模型。 Result: 在五个基准测试（BrowserComp、GAIA等）上实验表明，该方法在有效性和效率方面均优于强基线。 Conclusion: WebLeaper通过高覆盖任务构造和高效轨迹学习，显著提升了LLM智能体在信息检索任务中的性能。 Abstract: Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.

[79] ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

Baixuan Li,Dingchu Zhang,Jialong Wu,Wenbiao Yin,Zhengwei Tao,Yida Zhao,Liwen Zhang,Haiyang Shen,Runnan Fang,Pengjun Xie,Jingren Zhou,Yong Jiang

Main category: cs.CL

TL;DR: 提出ParallelMuse，一种两阶段并行思考范式，通过功能指定的部分 rollout 和压缩推理聚合，提升信息寻求代理的探索效率和答案生成能力。

Details

Motivation: 传统并行思考在从头重复展开和长程推理轨迹整合方面存在效率低下和上下文容量限制的问题。 Method: 第一阶段采用功能指定的部分 rollout，进行不确定性引导的路径重用与分支；第二阶段利用推理冗余进行无损压缩并聚合生成连贯答案。 Result: 在多个开源代理和基准测试中，性能最高提升62%，探索性token消耗减少10%-30%。 Conclusion: ParallelMuse有效提升了深度信息寻求代理的问题解决能力和推理效率。 Abstract: Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10--30% reduction in exploratory token consumption.

[80] AgentFold: Long-Horizon Web Agents with Proactive Context Management

Rui Ye,Zhongwang Zhang,Kuan Li,Huifeng Yin,Zhengwei Tao,Yida Zhao,Liangcai Su,Liwen Zhang,Zile Qiao,Xinyu Wang,Pengjun Xie,Fei Huang,Siheng Chen,Jingren Zhou,Yong Jiang

Main category: cs.CL

TL;DR: 本文提出了AgentFold，一种受人类认知回溯巩固启发的新型代理范式，通过主动上下文管理（“折叠”操作）在多尺度上压缩历史轨迹，有效解决了基于ReAct的代理在长周期任务中面临的上下文饱和或关键信息丢失问题。

Details

Motivation: 现有LLM-based代理在处理长周期任务时因上下文管理不当而受限：传统方法要么积累过多噪声导致上下文饱和，要么固定摘要导致关键细节永久丢失。需要一种更智能、动态的上下文管理机制。 Method: 提出AgentFold，将上下文视为可主动塑造的认知工作区，每一步学习执行‘折叠’操作，支持细粒度压缩保留关键细节或深度整合抽象多步子任务，从而实现多尺度历史管理。 Result: 在BrowseComp和BrowseComp-ZH基准上，仅通过监督微调的AgentFold-30B-A3B分别达到36.2%和47.3%的成绩，性能超越或媲美如DeepSeek-V3.1-671B等更大规模开源模型及OpenAI o4-mini等领先专有代理。 Conclusion: AgentFold通过模仿人类认知的主动上下文管理机制，在不依赖大规模预训练或强化学习的情况下显著提升长周期任务表现，为高效web代理设计提供了新范式。 Abstract: LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a `folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.

[81] Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team,Baixuan Li,Bo Zhang,Dingchu Zhang,Fei Huang,Guangyu Li,Guoxin Chen,Huifeng Yin,Jialong Wu,Jingren Zhou,Kuan Li,Liangcai Su,Litu Ou,Liwen Zhang,Pengjun Xie,Rui Ye,Wenbiao Yin,Xinmiao Yu,Xinyu Wang,Xixi Wu,Xuanzhong Chen,Yida Zhao,Zhen Zhang,Zhengwei Tao,Zhongwang Zhang,Zile Qiao,Chenxi Wang,Donglei Yu,Gang Fu,Haiyang Shen,Jiayin Yang,Jun Lin,Junkai Zhang,Kui Zeng,Li Yang,Hailong Yin,Maojia Song,Ming Yan,Peng Xia,Qian Xiao,Rui Min,Ruixue Ding,Runnan Fang,Shaowei Chen,Shen Huang,Shihang Wang,Shihao Cai,Weizhou Shen,Xiaobin Wang,Xin Guan,Xinyu Geng,Yingcheng Shi,Yuning Wu,Zhuo Chen,Zijian Li,Yong Jiang

Main category: cs.CL

TL;DR: Tongyi DeepResearch 是一个专为长期、深度信息检索研究任务设计的代理式大语言模型，通过结合代理中训练和代理后训练的端到端框架进行开发，并实现了在多个深度研究基准上的最先进性能。

Details

Motivation: 为了提升大语言模型在复杂、长周期的信息寻求任务中的自主深度研究能力，解决传统方法依赖人工标注、难以扩展的问题。 Method: 提出了一种端到端的训练框架，包含代理中训练和代理后训练，并设计了全自动、可扩展的数据合成 pipeline 和定制化环境以支持各训练阶段的稳定交互。 Result: Tongyi DeepResearch 拥有305亿总参数（每token激活33亿），在 Humanity's Last Exam、BrowseComp 等多个深度研究基准上达到最先进的性能。 Conclusion: 该模型及其训练框架、完整解决方案已开源，有效推动了自主深度研究智能体的发展与应用。 Abstract: We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

[82] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Yueqi Song,Ketan Ramaneti,Zaid Sheikh,Ziru Chen,Boyu Gou,Tianbao Xie,Yiheng Xu,Danyang Zhang,Apurva Gandhi,Fan Yang,Joseph Liu,Tianyue Ou,Zhihao Yuan,Frank Xu,Shuyan Zhou,Xingyao Wang,Xiang Yue,Tao Yu,Huan Sun,Yu Su,Graham Neubig

Main category: cs.CL

TL;DR: 本文提出了代理数据协议（ADP），一种轻量级的表示语言，用于统一多种格式的AI代理训练数据，解决了数据碎片化问题，并在多个基准测试中实现了约20%的性能提升，达到或接近最先进水平。

Details

Motivation: 由于代理训练数据分散在不同格式、工具和接口中，大规模监督微调的研究进展受限，亟需一种统一的数据表示方法以降低数据整合的复杂性。 Method: 设计并实现了一种轻量级、表达性强的代理数据协议（ADP），作为异构数据集与统一训练流程之间的中间语言，并将13个现有数据集统一为ADP格式，转换为多种代理框架可用的训练格式。 Result: 通过SFT实验，使用ADP统一的数据使模型在编码、浏览、工具使用和研究等标准基准上平均性能提升约20%，达到或接近最优性能，且无需领域特定调优。 Conclusion: ADP有效降低了代理训练数据的整合门槛，促进了标准化、可扩展和可复现的AI代理训练，所有代码和数据已公开发布。 Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

[83] ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

Shuqing Li,Jiayi Yan,Chenyu Niu,Jen-tse Huang,Yun Peng,Wenxuan Wang,Yepang Liu,Michael R. Lyu

Main category: cs.CL

TL;DR: 本文提出了ComboBench，一个评估大语言模型（LLM）将语义动作转化为虚拟现实（VR）设备操作序列能力的基准，涵盖四款流行VR游戏中的262个场景，并对七种LLM进行了评估，发现尽管顶级模型表现出较强的任務分解能力，但在程序推理和空间理解方面仍不及人类。

Details

Motivation: 探索大语言模型是否能像人类一样基于常识和具身理解，将高层语义动作转化为精确的VR设备操作。 Method: 构建包含262个场景的基准ComboBench，涵盖四款主流VR游戏，评估GPT-3.5、GPT-4等七种LLM在语义到操作映射任务上的表现，并与真实标注和人类表现对比。 Result: 发现顶级模型如Gemini-1.5-Pro虽具备较强的任务分解能力，但在程序推理和空间理解上仍显著弱于人类；模型表现因游戏而异，显示对交互复杂度敏感；少样本示例可显著提升性能。 Conclusion: 当前大语言模型在VR动作翻译任务上仍有局限，特别是在程序性和空间性推理方面，但通过少样本学习有改进潜力，为未来提升LLM在VR环境中的实用性提供了方向。 Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.

[84] MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task

Juraj Juraska,Tobias Domhan,Mara Finkelstein,Tetsuji Nakagawa,Geza Kovacs,Daniel Deutsch,Pidong Wang,Markus Freitag

Main category: cs.CL

TL;DR: 本文介绍了在WMT25翻译评估共享任务中的提交方案，提出了改进的MetricX-25用于质量评分预测，以及新的生成式模型GemSpanEval用于错误片段检测。

Details

Motivation: 提升自动翻译评估系统的性能，实现更准确的质量评分预测和细粒度的错误定位。 Method: 基于Gemma 3模型，将MetricX改进为编码器-only架构并添加回归头用于质量评分；设计解码器-only的GemSpanEval模型，以生成方式预测错误片段及其上下文、严重程度和类别。 Result: MetricX-25显著优于前代模型，在MQM和ESA评分预测上表现优异；GemSpanEval在错误片段检测上与强基线xCOMET相当，并能生成明确的上下文信息。 Conclusion: 所提方法有效提升了翻译评估中质量评分和错误定位的性能，展示了生成式建模在评估任务中的潜力。 Abstract: In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.

cs.CV [Back]

[85] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices

Aryan Mathur,Asaduddin Ahmed,Pushti Amit Vasoya,Simeon Kandan Sonar,Yasir Z,Madesh Kuppusamy

Main category: cs.CV

TL;DR: 提出了一种结合轻量级卷积分类器和视觉-语言模型的可解释图像真实性检测系统，在32x32低分辨率图像上实现了96.5%的检测准确率，并能定位和解释生成图像中的伪造痕迹。

Details

Motivation: 随着AI生成图像的真实性不断提高，验证视觉内容的真实性变得愈发困难，亟需可解释且高效的真实性和伪造检测方法。 Method: 采用轻量级卷积分类器（Faster-Than-Lies）进行初步分类，结合Qwen2-VL-7B视觉-语言模型实现分类、定位与解释；利用自编码器重建误差图生成伪造区域热力图，并将70种视觉伪影归为8类语义组以支持可解释性文本生成。 Result: 在加入对抗扰动的扩展CiFAKE数据集上达到96.5%的准确率，推理时间仅175ms（8核CPU），可在边缘设备部署；成功生成伪影定位热力图并实现针对每类异常的可解释文本输出。 Conclusion: 结合视觉与语言推理的框架在低分辨率图像真实性检测中具有可行性，具备在法医学、工业检测和社交媒体审核等领域的跨领域应用潜力。 Abstract: The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier ("Faster-Than-Lies") with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.

[86] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

Md Tanvir Hossain,Akif Islam,Mohd Ruhul Ameen

Main category: cs.CV

TL;DR: 本文提出了一种基于Transformer的类无关物体计数框架CountFormer，通过引入DINOv2基础模型和位置编码融合，在复杂结构场景中实现了优于现有方法的计数性能。

Details

Motivation: 现有计数模型在处理复杂形状、内部对称或重叠物体时容易出错，难以模拟人类不依赖类别身份的计数能力。 Method: 基于CounTR架构，用自监督基础模型DINOv2替换视觉编码器，并引入位置编码融合以保持几何关系，最后通过轻量卷积解码器生成密度图。 Result: 在FSC-147数据集上达到当前最先进水平，且在结构复杂或密集排列的场景中表现出更高精度。 Conclusion: 结合DINOv2等基础模型可提升计数系统对结构特征的理解能力，推动实现真正通用、无需示例的计数范式。 Abstract: Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.

[87] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

Gauthier Grimmer,Romain Wenger,Clément Flint,Germain Forestier,Gilles Rixhon,Valentin Chardon

Main category: cs.CV

TL;DR: 提出一种基于固定摄像头和深度学习的河流漂浮垃圾监测新方法，实现了连续量化监测和物体尺寸估计。

Details

Motivation: 河流中的人为漂浮垃圾对生态环境、水质和人类活动造成负面影响，亟需有效的监测手段。 Method: 利用固定摄像头采集数据，采用多种深度学习模型进行漂浮垃圾检测，并结合投影几何与回归校正建立几何模型以估算物体实际尺寸。 Result: 确定了在复杂环境条件下精度和推理速度最优的深度学习模型，验证了数据集构建中负样本和时间泄漏的重要性，并成功实现从2D图像估计实际尺寸。 Conclusion: 该方法可行且成本低，为城市水环境自动监测系统的发展提供了技术基础。 Abstract: The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for the monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.

[88] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features

Forouzan Fallah,Wenwen Li,Chia-Yu Hsu,Hyunho Lee,Yezhou Yang

Main category: cs.CV

TL;DR: 提出RareFlow，一种面向分布外鲁棒性的物理感知超分辨率框架，通过双条件架构和多面损失函数提升遥感图像超分辨率的几何保真度与物理一致性。

Details

Motivation: 现有遥感图像超分辨率方法在面对罕见地貌或不同传感器采集的分布外数据时，常产生视觉合理但物理不准确的结果，缺乏对物理一致性和不确定性建模的能力。 Method: 提出RareFlow，采用双条件架构：Gated ControlNet保持低分辨率输入的几何细节，文本提示提供语义指导；引入结合光谱和辐射一致性的多面损失函数以符合传感器特性，并通过随机前向传播量化预测不确定性，识别异常输入。 Result: 在多传感器卫星图像新基准上验证，地球物理专家盲评认为其输出接近真实图像质量，显著优于现有最先进方法；FID指标下降近40%，感知质量提升明显。 Conclusion: RareFlow为数据稀缺的科学领域提供了高保真合成的鲁棒框架，提出了在严重域偏移下受控生成的新范式。 Abstract: Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow's core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model's outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40\% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.

[89] TRELLISWorld: Training-Free World Generation from Object Generators

Hanke Chen,Yuan Liu,Minchen Li

Main category: cs.CV

TL;DR: 提出一种无需训练的3D场景合成方法，通过将文本到3D对象扩散模型用作模块化瓦片生成器，实现可扩展、连贯且支持全视角的语言驱动3D场景生成。

Details

Motivation: 现有3D场景生成方法受限于单物体生成、需要特定领域训练或不支持360度视图，难以满足实际应用需求。 Method: 将场景生成重新定义为多瓦片去噪问题，独立生成重叠的3D区域，并通过加权平均无缝融合，利用通用文本到3D对象扩散模型作为模块化生成单元。 Result: 实现了大规模、语义连贯的3D场景生成，支持多样化布局、高效生成和灵活编辑，无需场景级数据或重新训练。 Conclusion: 该方法提供了一种简单而强大的通用语言驱动3D场景构建框架，具备良好的泛化能力和实用性。 Abstract: Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.

[90] Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

Jinxin Zhou,Jiachen Jiang,Zhihui Zhu

Main category: cs.CV

TL;DR: 提出LHT-CLIP，一种无需训练的框架，通过在层、头和token级别上系统性利用CLIP的视觉判别能力，显著提升语义分割性能。

Details

Motivation: 现有CLIP模型因图像级预训练目标与像素级理解任务不匹配，导致语义分割性能受限，且先前方法继承了深层的全局对齐偏差。 Method: 通过分析发现层、注意力头和token的特性，提出语义-空间重加权、选择性头增强和异常token替换三种技术，在不训练的情况下恢复视觉判别力。 Result: 在8个主流语义分割基准上达到SOTA性能，验证了方法的有效性和实用性。 Conclusion: LHT-CLIP无需额外训练或调参，即可有效提升CLIP在密集预测任务中的表现，具有良好的部署前景。 Abstract: Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment with sacrifice of visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation pattern compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.

[91] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning

Eddison Pham,Prisha Priyadarshini,Adrian Maliackel,Kanishk Bandi,Cristian Meo,Kevin Zhu

Main category: cs.CV

TL;DR: 提出DynaStride方法，用于生成连贯的场景级字幕，无需手动场景分割，在YouCookII数据集上表现优于现有模型。

Details

Motivation: 现有的视频字幕方法难以捕捉教学视频中的时间结构和视觉线索，导致字幕缺乏连贯性和质量，影响学习效果。 Method: DynaStride通过自适应帧采样和多模态窗口化捕捉场景内关键转换，并采用多模态思维链生成动作-对象对，结合动态步幅窗口选择算法融合信息，生成最终的场景级字幕。 Result: 在YouCookII数据集上，DynaStride在BLEU、METEOR、BERTScore和CLIPScore等指标上均优于VLLaMA3和GPT-4o等强基线模型。 Conclusion: DynaStride能有效整合视觉语义与时间推理，生成更连贯、信息更丰富的教学视频字幕，为AI驱动的教学内容生成提供了新方向。 Abstract: Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.

[92] TurboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis

Emily Kim,Julieta Martinez,Timur Bagautdinov,Jessica Hodgins

Main category: cs.CV

TL;DR: TurboPortrait3D 是一种低延迟的人像新视角合成方法，结合图像到3D模型与扩散模型的优势，通过单步扩散模型在保持多视角一致性的同时提升渲染质量。

Details

Motivation: 现有图像到3D人像生成模型存在视觉伪影、细节缺失和身份保持不佳的问题，而图像扩散模型虽能生成高质量图像但缺乏3D一致性，无法直接用于多视角合成。 Method: 输入单张正面人脸图像，先通过前馈图像到头像管道生成初始3D表示和噪声渲染；再使用基于输入图像条件的单步扩散模型，在多视角一致性的约束下对渲染结果进行精细化优化；并采用在大规模合成多视图数据上预训练、再在高质量真实图像上微调的训练策略。 Result: 在定性和定量评估中均优于当前最先进的人像新视角合成方法，同时具有较低的计算延迟。 Conclusion: TurboPortrait3D 成功融合了3D感知能力与扩散模型的高保真图像生成优势，实现了高质量、多视角一致且低延迟的人像新视角合成。 Abstract: We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack of detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low-latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms current state-of-the-art for portrait novel-view synthesis, while being efficient in time.

[93] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors

Xirui Jin,Renbiao Jin,Boying Li,Danping Zou,Wenxian Yu

Main category: cs.CV

TL;DR: 提出PlanarGS，一种基于3D高斯点阵的室内场景重建框架，通过引入语言提示的平面先验和几何约束，显著提升大范围低纹理区域的三维重建精度与质量。

Details

Motivation: 在室内环境中，大面积低纹理区域导致传统3D高斯点阵（3DGS）因光度损失模糊而难以恢复高保真几何结构。 Method: 设计语言提示平面先验（LP3）流程，利用预训练视觉-语言分割模型生成区域提议，并通过跨视图融合与几何先验优化；在3DGS优化中加入平面一致性监督和平面几何先验（深度与法向引导）两项新损失项。 Result: 在标准室内数据集上实验表明，PlanarGS在三维表面重建的准确性和细节表现上显著优于现有最先进方法。 Conclusion: PlanarGS通过引入语义感知的平面先验与多视角几何约束，有效解决了低纹理场景下3DGS的几何模糊问题，为室内重建提供了高效且鲁棒的新方案。 Abstract: Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io

[94] Adaptive Training of INRs via Pruning and Densification

Diana Aldana,João Paulo Lima,Daniel Csillag,Daniel Perazzo,Haoan Feng,Luiz Velho,Tiago Novello

Main category: cs.CV

TL;DR: 本文提出了AIRe，一种自适应隐式神经表示方法，通过神经元剪枝和输入频率致密化来优化网络结构，在减小模型大小的同时保持甚至提升重建质量。

Details

Motivation: 传统INR方法在选择输入频率和网络架构时依赖启发式方法和大量超参数调优，且存在参数冗余问题，难以有效平衡模型大小与表示能力。 Method: 提出AIRe，包含两个阶段：1）剪枝阶段，识别贡献较小的神经元，通过定向权重衰减将其信息转移后进行结构化剪枝；2）致密化阶段，向信号欠拟合的频谱区域添加输入频率以增强表示能力。 Result: 在图像和符号距离场（SDF）上的实验表明，AIRe能够在显著减小模型规模的同时，保持或提升重建质量。 Conclusion: AIRe通过自适应调整网络结构和输入频率，有效改善了隐式神经表示中模型大小与重建质量之间的权衡，为INR的高效训练提供了新思路。 Abstract: Encoding input coordinates with sinusoidal functions into multilayer perceptrons (MLPs) has proven effective for implicit neural representations (INRs) of low-dimensional signals, enabling the modeling of high-frequency details. However, selecting appropriate input frequencies and architectures while managing parameter redundancy remains an open challenge, often addressed through heuristics and heavy hyperparameter optimization schemes. In this paper, we introduce AIRe ($\textbf{A}$daptive $\textbf{I}$mplicit neural $\textbf{Re}$presentation), an adaptive training scheme that refines the INR architecture over the course of optimization. Our method uses a neuron pruning mechanism to avoid redundancy and input frequency densification to improve representation capacity, leading to an improved trade-off between network size and reconstruction quality. For pruning, we first identify less-contributory neurons and apply a targeted weight decay to transfer their information to the remaining neurons, followed by structured pruning. Next, the densification stage adds input frequencies to spectrum regions where the signal underfits, expanding the representational basis. Through experiments on images and SDFs, we show that AIRe reduces model size while preserving, or even improving, reconstruction quality. Code and pretrained models will be released for public use.

[95] Neural USD: An object-centric framework for iterative editing and control

Alejandro Escontrela,Shrinu Kushagra,Sjoerd van Steenkiste,Yulia Rubanova,Aleksander Holynski,Kelsey Allen,Kevin Murphy,Thomas Kipf

Main category: cs.CV

TL;DR: 本文提出了“神经通用场景描述符”（Neural USD），一种受计算机图形学中USD标准启发的可控生成建模框架，通过分层结构化表示实现对场景中各个对象在外观、几何和姿态上的精确、迭代编辑。

Details

Motivation: 现有生成模型在进行对象级编辑时容易引发非预期的全局变化，缺乏精细控制能力，因此需要一种能够支持逐对象独立控制且兼容多种信号的框架。 Method: 借鉴计算机图形学中的通用场景描述符（USD）标准，设计了Neural USD框架，采用分层结构化表示场景与对象，并结合微调方法实现控制信号的解耦。 Result: 实验验证了该框架在不同设计选择下的有效性，展示了其支持迭代式、增量式编辑工作流的能力，实现了更精确的对象级编辑。 Conclusion: Neural USD为可控生成建模提供了新的结构化范式，有效解决了生成模型中对象编辑的精确性与解耦控制问题，推动了生成模型在复杂创作流程中的应用。 Abstract: Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the "Neural Universal Scene Descriptor" or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .

[96] SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability

Peiyang Xu,Minzhou Pan,Zhaorun Chen,Shuang Yang,Chaowei Xiao,Bo Li

Main category: cs.CV

TL;DR: SafeVision是一种新型图像安全防护模型，通过结合类人推理实现动态适应和透明化风险评估，无需重新训练即可应对新兴威胁，在新提出的VisionHarm数据集上性能优于GPT-4o并速度快16倍以上。

Details

Motivation: 传统图像安全模型受限于预定义类别，缺乏语义推理能力，难以适应新出现的威胁且不透明，现有基准数据集也存在覆盖范围窄或粒度不足的问题。 Method: 提出SafeVision，采用有效的数据收集与生成框架、遵循策略的训练流程和定制损失函数，并通过多样化的问答生成策略提升学习效果；同时构建了高质量的细粒度数据集VisionHarm（含VisionHarm-T和VisionHarm-C两个子集）。 Result: SafeVision在VisionHarm-T上比GPT-4o高8.6%，在VisionHarm-C上高15.5%，且推理速度超过其16倍，实现了最先进的性能。 Conclusion: SafeVision实现了可解释、可动态适应安全策略变化的图像内容安全防护，为应对不断演变的网络风险提供了高效、精准的新方案。 Abstract: With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive(VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.

[97] Reasoning Visual Language Model for Chest X-Ray Analysis

Andriy Myronenko,Dong Yang,Baris Turkbey,Mariam Aboian,Sena Azamat,Esra Akcicek,Hongxu Yin,Pavlo Molchanov,Marc Edgar,Yufan He,Pengfei Guo,Yucheng Tang,Daguang Xu

Main category: cs.CV

TL;DR: 提出一种结合链式思维（CoT）推理的视觉语言模型框架，用于胸部X光解读，通过两阶段训练提升可解释性和准确性，并支持临床审计与人机协作。

Details

Motivation: 现有医学视觉语言模型缺乏透明的逐步推理过程，难以满足临床对可解释性和可信AI的需求。 Method: 采用高保真视觉编码，结合推理风格的监督微调（SFT）和基于可验证奖励的强化学习（RL），生成与放射科医生思维过程一致的推理轨迹。 Result: 在分布外评估中取得有竞争力的多标签分类性能；专家读片研究显示推理轨迹能提升信心、减少报告时间并支持错误审计。 Conclusion: 该框架实现了更透明、可审计的AI辅助诊断，推动了在医学影像中对高质量推理与预测并重的发展方向。 Abstract: Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.

[98] Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints

Kazutoshi Akita,Norimichi Ukita

Main category: cs.CV

TL;DR: 提出联合预测多个傅里叶分量的方法，以提升任意尺度超分辨率中的质量与效率。

Details

Motivation: 现有方法逐个预测傅里叶分量导致性能下降和效率低下，难以实现成本与质量的平衡控制。 Method: 采用联合预测多个傅里叶分量的方式，取代传统的循环神经网络逐项预测。 Result: 在任意尺度超分辨率任务中实现了更高的图像质量和更优的计算效率。 Conclusion: 联合预测策略优于独立预测，有效提升了超分辨率的成本与质量可控性。 Abstract: Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.

[99] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild

Jiaqi Yan,Ruilong Ren,Jingren Liu,Shuning Xu,Ling Wang,Yiheng Wang,Yun Wang,Long Zhang,Xiangyu Chen,Changzhi Sun,Jixiang Luo,Dell Zhang,Hao Sun,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: TeleEgo 是一个面向现实场景的长时间、流式、全模态基准，用于评估以自我为中心的AI助手在日常情境中的表现，涵盖记忆、理解和跨记忆推理三大核心能力。

Details

Motivation: 现有的基准通常孤立地评估AI助手的能力，缺乏真实的流式场景或仅支持短期任务，无法全面衡量AI助手在真实世界中处理多模态输入、实时响应和长期记忆保持的能力。 Method: 构建了一个包含超过14小时/参与者同步的自我中心视频、音频和文本数据集，覆盖工作学习、生活方式、社交活动和外出文化四个领域，并设计了12个诊断性子任务和3,291个人工验证的问答项，在统一的时间线上进行流式评估。 Result: 提出了两个关键指标——实时准确率和记忆持久时间，用以联合评估正确性、时间响应性和长期记忆保持能力；实验表明该基准能有效推动实用AI助手的发展。 Conclusion: TeleEgo 提供了一个真实且全面的评估框架，有助于推进具备长期记忆与多模态理解能力的实用型AI助手的研究与发展。 Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work \& study, lifestyle \& routines, social activities, and outings \& culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement.TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.

[100] AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization

Heethanjan Kanagalingam,Thenukan Pathmanathan,Mokeeshan Vathanakumar,Tharmakulasingam Mukunthan

Main category: cs.CV

TL;DR: 提出一种名为AdvBlur的糖尿病视网膜病变分类新方法，通过引入对抗性模糊图像和双损失函数框架来提升模型在未见分布变化下的泛化能力。

Details

Motivation: 现有深度学习模型在不同设备、人群和成像条件导致的数据分布差异下鲁棒性不足，影响糖尿病视网膜病变的准确检测。 Method: 将对抗性模糊图像加入训练数据，并采用双损失函数框架以增强域泛化能力。 Result: 在多个外部数据集上表现出色，相较于当前最先进的域泛化方法具有竞争力，且在低质量图像、不同相机类型和小样本情况下表现稳健。 Conclusion: AdvBlur能有效缓解分布变异带来的影响，提升了DR分类模型的鲁棒性和泛化性能。 Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.

[101] Towards the Automatic Segmentation, Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge

Yuan Jin,Antonio Pepe,Gian Marco Melito,Yuxuan Chen,Yunsu Byeon,Hyeseong Kim,Kyungwon Kim,Doohyun Park,Euijoon Choi,Dosik Hwang,Andriy Myronenko,Dong Yang,Yufan He,Daguang Xu,Ayman El-Ghotni,Mohamed Nabil,Hossam El-Kady,Ahmed Ayyad,Amr Nasr,Marek Wodzinski,Henning Müller,Hyeongyu Kim,Yejee Shin,Abbas Khan,Muhammad Asad,Alexander Zolotarev,Caroline Roney,Anthony Mathur,Martin Benning,Gregory Slabaugh,Theodoros Panagiotis Vagenas,Konstantinos Georgas,George K. Matsopoulos,Jihan Zhang,Zhen Zhang,Liqin Huang,Christian Mayer,Heinrich Mächler,Jan Egger

Main category: cs.CV

TL;DR: 本研究通过发起SEG.A.挑战赛，推出了一个大规模、公开的多机构主动脉血管树（AVT）分割数据集，推动CTA图像中AVT自动分析的发展。结果显示3D U-Net架构占主导地位，模型融合显著提升性能，且算法设计和训练数据特征对表现有重要影响。

Details

Motivation: 主动脉血管树（AVT）的自动分析具有重要临床价值，但因缺乏高质量共享数据而受限，因此需要建立公开数据集以推动该领域发展。 Method: 组织SEG.A.挑战赛，提供大型多中心CTA数据集，评估参赛算法在隐藏测试集上的分割性能，并鼓励后续表面网格化任务。 Result: 3D U-Net架构在比赛中表现最佳，模型集成显著优于单一模型，性能与算法设计（如定制后处理）和训练数据特性密切相关。 Conclusion: 该挑战赛不仅建立了AVT分割的新性能基准，还提供了长期可用的资源，有助于推动未来临床可转化工具的发展。 Abstract: The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.

[102] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks

Mirali Purohit,Bimal Gajera,Vatsal Malaviya,Irish Mehta,Kunal Kasodekar,Jacob Adler,Steven Lu,Umaa Rebbapragada,Hannah Kerner

Main category: cs.CV

TL;DR: 本文提出了Mars-Bench，首个用于系统评估火星相关任务（包括分类、分割和目标检测）的基准，涵盖20个基于轨道和地表图像的数据集，旨在推动火星科学领域基础模型的发展。

Details

Motivation: 尽管基础模型在地球观测等领域取得进展，但在火星科学研究中仍缺乏标准化的基准和评估框架，限制了该领域的发展。 Method: 构建了一个包含20个数据集的标准化基准Mars-Bench，涵盖分类、分割和目标检测任务，并使用在自然图像、地球卫星数据和最先进的视觉-语言模型上预训练的模型进行基线评估。 Result: 实验结果表明，针对火星任务，特定领域的基础模型可能优于通用领域模型，验证了领域自适应预训练的潜力。 Conclusion: Mars-Bench为火星科学提供了标准化的评估平台，有助于推动针对火星任务的机器学习模型开发与比较，促进该领域的进一步研究。 Abstract: Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.

[103] AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts

Yufan Liu,Wanqian Zhang,Huashan Chen,Lin Wang,Xiaojun Jia,Zheng Lin,Weiping Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为APT的黑盒框架，利用大语言模型（LLM）自动生成人类可读的对抗性后缀，以评估文本到图像模型的安全漏洞。

Details

Motivation: 现有的红队测试方法通常需要白盒访问权限，依赖低效的逐提示优化，并生成语义无意义的提示，容易被过滤器拦截。 Method: 提出交替优化-微调流程，在对抗性后缀优化和基于优化结果微调LLM之间迭代；引入双重规避策略，通过辅助LLM困惑度评分限制生成可读提示，并使用禁用词惩罚抑制黑名单词汇生成。 Result: 实验表明，该方法生成的人类可读、抗过滤的对抗性提示在红队测试中表现优异，具备出色的零样本迁移能力，能快速适应未见提示，并暴露商业API中的关键漏洞。 Conclusion: APT框架有效提升了对T2I模型安全性的评估能力，能够在黑盒条件下生成高效且难以检测的对抗性提示。 Abstract: Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, and rely on inefficient per-prompt optimization, as well as inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrates a dual-evasion strategy in optimization phase, enabling the bypass of both perplexity-based filter and blacklist word filter: (1) we constrain the LLM generating human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned-tokens in blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai.).

[104] ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning

Xingyu Liu,Kun Ming Goh

Main category: cs.CV

TL;DR: 本文探讨了ResNet如何通过引入跳跃连接解决深度CNN中的梯度消失问题，并在CIFAR-10上验证其优于传统深层CNN的性能。

Details

Motivation: 训练极深的卷积神经网络因梯度消失问题而具有挑战性，需要一种有效的方法来提升深度网络的训练稳定性和性能。 Method: 采用ResNet架构，利用残差块中的跳跃连接使梯度直接传播，从而支持更深网络的训练，并在CIFAR-10数据集上实现ResNet-18与传统深层CNN的对比实验。 Result: ResNet-18在CIFAR-10上达到89.9%的准确率，优于传统CNN的84.1%，且收敛更快、训练更稳定。 Conclusion: ResNet通过跳跃连接有效缓解梯度消失问题，显著提升深度网络的训练效率和分类性能，是深度模型设计的重要突破。 Abstract: Convolutional Neural Networks (CNNs) has revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.

[105] Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models

Shufan Shen,Junshu Sun,Shuhui Wang,Qingming Huang

Main category: cs.CV

TL;DR: 本文提出了一种名为SNELLA的一阶段稀疏调优方法，用于高效微调预训练视觉模型，在降低内存消耗的同时实现了SOTA性能。

Details

Motivation: 现有稀疏调优方法采用两阶段范式，依赖梯度信息定位任务相关权重，忽略微调过程中的参数调整，且因存储完整权重矩阵导致内存开销高。 Method: SNELLA通过将权重矩阵与两个低秩可学习矩阵合并成的稀疏矩阵相加来选择性更新；引入非线性核函数扩展低秩分解以提升合并矩阵的秩，并设计自适应双层稀疏分配机制，端到端地跨层和层内竞争重要权重。 Result: 在分类、分割和生成任务上验证了SNELLA的有效性，相比SPT-LoRA在FGVC基准上Top-1准确率提高1.8%，内存减少31.1%-39.9%（模型参数规模86M至632M）。 Conclusion: SNELLA通过一阶段自适应稀疏调优机制，在显著降低内存使用的同时提升了下游任务性能，优于现有PEFT方法。 Abstract: Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.

[106] Enhancing CLIP Robustness via Cross-Modality Alignment

Xingyu Zhu,Beier Zhu,Shuo Wang,Kesen Zhao,Hanwang Zhang

Main category: cs.CV

TL;DR: 提出了一种基于最优传输的无训练跨模态对齐框架COLA，用于缓解视觉-语言模型在对抗扰动下的特征错位问题，显著提升零样本分类的鲁棒性。

Details

Motivation: CLIP等视觉-语言模型在零样本分类中表现良好，但在对抗扰动下特征错位严重，图文特征距离拉大，导致性能下降。现有方法未能有效解决这一对齐退化问题。 Method: 提出COLA框架：(1) 将对抗性图像嵌入投影到类别文本特征张成的子空间，滤除非语义扰动；(2) 将图像和文本建模为多增强视图上的离散分布，通过最优传输（OT）优化对齐，且将子空间投影融入代价计算。该方法无需训练，兼容已有的微调模型。 Result: 在14个零样本分类基准上验证了COLA的有效性，在PGD攻击下ImageNet及其变体平均提升6.7%，同时保持对干净样本的高准确率。 Conclusion: COLA通过显式恢复全局图文对齐和局部结构一致性，有效增强了VLM在对抗条件下的鲁棒性，是一种高效、通用且无需训练的防御方法。 Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP's encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.

[107] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

William Yang,Xindi Wu,Zhiwei Deng,Esin Tureci,Olga Russakovsky

Main category: cs.CV

TL;DR: 提出了一种名为BOB（BeyondOBjects）的微调策略，用于提升文本到图像模型在细粒度分类任务中生成合成训练数据的效果，通过提取类无关属性并在微调时显式建模、生成时边缘化，有效缓解过拟合并保持生成多样性。

Details

Motivation: 现有文本到图像模型在生成用于分类的合成数据时面临质量与多样性的权衡问题，尤其是小样本场景下微调易导致过拟合和多样性下降，因此需要一种既能提升数据质量又能保持生成多样性的方法。 Method: 提出BOB方法：首先从少量真实样本中提取类无关属性（如背景、姿态），在微调T2I模型时显式引入这些属性作为条件，而在生成时将其边缘化，从而保留模型先验并减少过拟合。 Result: 在多个T2I模型、主干网络和数据集上实验表明，BOB在低样本细粒度分类中达到SOTA性能；例如在Aircraft数据集上比DataDream提升7.4%，且用5张真实图加合成数据优于仅用10张真实图的效果，在24个设置中18个优于先前方法，其中14个提升超2%。 Conclusion: BOB通过解耦类无关属性，在微调与生成阶段合理利用这些信息，有效平衡了合成数据的质量与多样性，显著提升了低样本细粒度分类性能，具有广泛适用性和实用价值。 Abstract: Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.

[108] OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation

Agus Gunawan,Samuel Teodoro,Yun Chen,Soo Ye Kim,Jihyong Oh,Munchurl Kim

Main category: cs.CV

TL;DR: 本文提出了OmniText，一种无需训练的通用框架，用于解决文本图像操作（TIM）中的三大挑战：无法删除文本、缺乏对文本风格的控制以及重复字符生成问题。通过自注意力反转和交叉注意力重分布，结合新的损失函数，实现了文本移除与风格化编辑，并推出了OmniText-Bench基准测试集。

Details

Motivation: 现有的基于扩散模型的文本修复方法在文本图像操作中存在三个关键限制：无法删除文本、难以控制文本样式、容易生成重复字符。这些限制阻碍了其在更广泛TIM任务中的应用。 Method: 利用交叉注意力和自注意力机制的特性，提出自注意力反转实现文本删除，通过重分布交叉注意力减少文本幻觉；设计新的潜在优化框架，引入交叉注意力内容损失和自注意力风格损失，以实现可控的文本插入与风格定制。 Result: OmniText在多个TIM任务上达到最先进水平，性能优于现有文本修复方法，且与专用方法相当；提出的OmniText-Bench为评估多种TIM任务提供了全面基准。 Conclusion: OmniText是首个无需训练的通用TIM框架，能够统一处理文本删除、重缩放、重定位、插入与风格化编辑等多种任务，显著提升了文本图像操作的灵活性与实用性。 Abstract: Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model's tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.

[109] Enhancing Pre-trained Representation Classifiability can Boost its Interpretability

Shufan Shen,Zhaobo Qi,Junshu Sun,Qingming Huang,Qi Tian,Shuhui Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为Inherent Interpretability Score (IIS)的新指标，用于量化预训练视觉模型表示的可解释性，并发现可解释性与分类能力之间存在正相关关系。

Details

Motivation: 预训练视觉模型在下游任务中注重分类性能，但其表示的可解释性日益受到关注。然而，尚不清楚这两者是否可以同时提升。因此，本文旨在探究并量化这种关系。 Method: 通过引入可解释语义在表示中的比例来衡量可解释性，提出IIS指标，该指标基于解释过程中信息损失的程度评估表示的可解释性，并在不同分类性能的表示上进行可解释性评估。 Result: 实验发现，表示的可解释性与分类能力呈正相关；通过最大化可解释性的微调可进一步提升分类性能；且基于解释的预测具有更小的精度下降。 Conclusion: 可解释性与分类能力可以协同提升，IIS为统一优化这两个目标提供了理论支持和实践工具。 Abstract: The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.

[110] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations

Fengming Yu,Haiwei Pan,Kejia Zhang,Jian Guan,Haiying Jiang

Main category: cs.CV

TL;DR: 提出了一种基于频域中间特征的统一异构知识蒸馏框架UHKD，通过傅里叶变换缓解异构模型间的表征差异，并利用特征变换和对齐模块实现跨架构知识迁移，在CIFAR-100和ImageNet-1K上显著优于现有方法。

Details

Motivation: 现有知识蒸馏方法多针对同构模型设计，在异构架构下性能下降，尤其难以有效利用中间层语义信息；需解决跨架构模型间语义差异大、中间表示不兼容的问题。 Method: 提出UHKD框架，将教师和学生的中间特征转换到频域，使用傅里叶变换捕捉全局信息；设计特征变换模块（FTM）生成紧凑频域表示，引入可学习的特征对齐模块（FAM）进行多层次特征匹配，并结合中间层均方误差与logits层KL散度进行联合优化。 Result: 在CIFAR-100和ImageNet-1K数据集上，相比最新方法分别取得5.59%和0.83%的性能提升，验证了UHKD在异构知识蒸馏中的有效性。 Conclusion: UHKD通过频域特征对齐有效缓解了异构模型间的语义差异，实现了高效的跨架构知识迁移，为异构知识蒸馏提供了一种通用且有效的解决方案。 Abstract: Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge

[111] DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery

Zan Wang,Siyu Chen,Luya Mo,Xinfeng Gao,Yuxin Shen,Lebin Ding,Wei Liang

Main category: cs.CV

TL;DR: DogMo是一个大规模多视角RGB-D视频数据集，用于从图像中恢复犬类运动，包含1.2k个动作序列，涵盖10种不同犬种，解决了现有数据集在多视角、3D数据和多样性方面的不足。

Details

Motivation: 现有犬类运动数据集缺乏多视角和真实3D数据，且规模和多样性有限，限制了运动恢复研究的发展。 Method: 提出DogMo数据集，并构建四个运动恢复基准测试场景；引入一个三阶段的实例特定优化流程，逐步优化SMAL模型的身体形状与姿态，包括粗对齐、密集对应监督和时序正则化。 Result: 提供了丰富的多视角RGB-D数据支持单目和多视图、RGB与RGB-D输入下的系统评估，在犬类运动恢复方面实现了更准确的姿态与形变建模。 Conclusion: DogMo数据集和所提方法为犬类运动恢复研究提供了原则性基础，并推动了计算机视觉、计算机图形学与动物行为建模的交叉研究。 Abstract: We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.

[112] ETC: training-free diffusion models acceleration with Error-aware Trend Consistency

Jiajian Xie,Hubery Yin,Chen Li,Zhou Zhao,Shengyu Zhang

Main category: cs.CV

TL;DR: 提出了一种名为Error-aware Trend Consistency (ETC)的加速扩散模型生成过程的框架，通过趋势预测和模型特定误差容限控制，在显著加速的同时保持生成一致性。

Details

Motivation: 现有训练-free加速方法忽略去噪趋势且缺乏对模型特有误差容忍度的控制，导致多步重用时轨迹偏离和结果不一致。 Method: 1) 设计一致趋势预测器，利用扩散轨迹的平滑连续性，将历史去噪模式投射为稳定的未来方向，并分布于多个近似步骤；2) 提出模型特定的误差容限搜索机制，通过识别语义规划到质量优化的过渡点来确定校正阈值。 Result: ETC在FLUX上实现了2.65倍加速，同时一致性指标仅下降-0.074 SSIM分。 Conclusion: ETC有效平衡了扩散模型的生成速度与一致性，通过趋势保持和误差感知机制，为训练-free加速方法提供了更鲁棒的解决方案。 Abstract: Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.

[113] Compositional Image Synthesis with Inference-Time Scaling

Minsuk Ji,Sanghyeok Lee,Namhyuk Ahn

Main category: cs.CV

TL;DR: 提出一种无需训练的框架，通过结合基于对象的方法和自优化机制，提升文本到图像生成中的布局准确性，同时保持视觉质量。

Details

Motivation: 现代文本到图像模型在组合性方面存在困难，如对象数量、属性和空间关系的错误。 Method: 利用大语言模型生成显式布局，并将其注入图像生成过程，使用以对象为中心的视觉-语言模型对多个候选结果进行重排序，迭代选择最符合提示的结果。 Result: 该框架在场景与提示对齐方面优于近期的文本到图像模型。 Conclusion: 所提方法在不需训练的前提下，有效提升了生成图像的布局忠实度和语义一致性。 Abstract: Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code are available at https://github.com/gcl-inha/ReFocus.

[114] VC4VG: Optimizing Video Captions for Text-to-Video Generation

Yang Du,Zhuoran Lin,Kaiqiang Song,Biao Wang,Zhicheng Zheng,Tiezheng Ge,Bo Zheng,Qin Jin

Main category: cs.CV

TL;DR: 本文提出了VC4VG，一个针对文本到视频生成模型训练优化的视频字幕优化框架，并构建了配套的多维度评估基准VC4VG-Bench，实验证明高质量字幕显著提升视频生成效果。

Details

Motivation: 现有的视频-文本对在用于文本到视频生成模型训练时，字幕质量影响生成效果，但缺乏针对该任务的字幕优化策略。 Method: 从T2V生成需求出发，分解视频重建所需的字幕要素维度，提出系统性的字幕设计方法，并构建包含细粒度、多维度、必要性分级指标的VC4VG-Bench评估基准。 Result: 通过大量T2V微调实验，验证了字幕质量提升与视频生成性能之间存在强相关性，表明所提方法有效提升了生成视频的连贯性和指令对齐能力。 Conclusion: VC4VG框架和VC4VG-Bench基准为文本到视频生成中的字幕优化提供了有效工具和评估标准，推动了该方向的研究发展。 Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models.We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements.Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.

[115] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning

Aodi Wu,Xubo Luo

Main category: cs.CV

TL;DR: 本文提出了一种基于Qwen2.5-VL-72B的系统性框架，通过混合提示路由、任务特定提示、视觉组装模块和推理参数配置，提升视觉语言模型在自动驾驶场景理解中的性能，在RoboSense挑战赛中取得优异成绩。

Details

Motivation: 为了提升视觉语言模型在自动驾驶复杂场景（包括感知、预测、规划和干扰检测）中的理解和推理能力，解决多任务干扰和空间推理不足的问题。 Method: 采用四部分框架：1）混合提示路由器将问题分类并分配给特定任务提示；2）任务特定提示融合坐标系统、空间推理规则、角色扮演和思维链/树；3）视觉组装模块整合多视角图像与对象裁剪；4）按任务调整推理参数。 Result: 在RoboSense Challenge IROS 2025的两个阶段分别取得70.87%（干净数据）和72.85%（干扰数据）的平均准确率，表明结构化提示和空间锚定有效提升了模型表现。 Conclusion: 结构化提示设计与视觉输入的空间对齐能显著增强VLM在安全关键型自动驾驶任务中的性能，验证了系统性提示工程的重要性。 Abstract: This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.

[116] Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2

Ziqi Zhou,Yifan Hu,Yufei Song,Zijing Li,Shengshan Hu,Leo Yu Zhang,Dezhong Yao,Long Zheng,Hai Jin

Main category: cs.CV

TL;DR: 本文提出了UAP-SAM2，首个针对SAM2的跨提示通用对抗攻击方法，通过双重语义偏差提升攻击效果，在六大数据集上显著优于现有最先进攻击方法。

Details

Motivation: SAM2在视频分割中表现出强泛化能力，但其对对抗样本的鲁棒性尚未被探索，且现有对SAM的攻击是否可迁移到SAM2仍不清楚。 Method: 提出UAP-SAM2，采用目标扫描策略减少提示依赖，并设计双重语义偏差框架，通过扭曲单帧内语义和破坏帧间语义一致性来优化通用对抗扰动。 Result: 在六个数据集的两个分割任务上实验表明，UAP-SAM2在跨提示迁移性和攻击有效性方面均显著优于现有SOTA方法。 Conclusion: UAP-SAM2有效揭示了SAM2在对抗攻击下的脆弱性，为视频基础模型的鲁棒性评估提供了新思路。 Abstract: Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.

[117] CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation

Anshul Kaushal,Kunal Jangid,Vinod K. Kurmi

Main category: cs.CV

TL;DR: 本文提出了一种基于模糊卷积模块的编码器-解码器框架CLFSeg，用于提高结肠息肉和心脏分割的准确性和鲁棒性，有效处理边界区域的不确定性与噪声，并在多个公开数据集上实现了优于现有方法的性能。

Details

Motivation: 传统CNN模型在医学图像分割中存在泛化能力弱、鲁棒性差及难以处理不确定性的缺点，尤其在处理小目标和边界模糊区域时表现不佳，因此需要更有效的分割方法。 Method: 提出CLFSeg框架，结合模糊逻辑与卷积层设计Fuzzy-Convolutional（FC）模块，增强局部与全局特征提取能力，并采用BCE与Dice损失联合优化以应对类别不平衡问题。 Result: 在CVC-ColonDB、CVC-ClinicDB、EtisLaribPolypDB和ACDC四个公开数据集上实验表明，CLFSeg在分割性能上超越现有SOTA方法，视觉分析显示其能更好关注解剖结构中的关键区域，同时保持计算效率。 Conclusion: CLFSeg通过融合模糊逻辑与深度学习，在减少不确定性、提升分割精度和计算效率方面表现出色，具有应用于实际医学诊断场景的潜力。 Abstract: Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have represented limited generalizability, robustness, and inability to handle uncertainty, which affects the segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework that aggregates the Fuzzy-Convolutional (FC) module leveraging convolutional layers and fuzzy logic. This module enhances the segmentation performance by identifying local and global features while minimizing the uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. In order to handle class imbalance problem while focusing on the areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/

[118] MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration

Junhyuk So,Hyunho Kook,Chaeyeon Jang,Eunhyeok Park

Main category: cs.CV

TL;DR: 提出MC-SJD，一种无需训练、无损的并行解码框架，通过耦合信息论方法提升自回归视觉生成的推理速度，显著加速图像和视频生成。

Details

Motivation: 自回归生成中逐token解码导致推理速度慢，限制了实际应用。 Method: 基于Speculative Jacobi Decoding (SJD)，引入耦合机制以提高迭代间草稿token的一致性，从而提升接受率，实现并行解码。 Result: 在不损失质量的前提下，图像生成最高速度提升约4.2倍，视频生成提升约13.3倍。 Conclusion: MC-SJD通过单行代码修改显著加速AR视觉生成，兼具高效性与实用性，且保持解码无损。 Abstract: While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.

[119] Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization

Haoxin Yang,Yihong Lin,Jingdan Kang,Xuemiao Xu,Yue Li,Cheng Xu,Shengfeng He

Main category: cs.CV

TL;DR: 提出ID²Face，一种以训练为中心的面部匿名化框架，通过在潜空间中显式解耦身份与非身份信息，实现无需推理时优化的高效匿名化。

Details

Motivation: 现有扩散模型依赖推理时干预，易导致分布偏移和属性纠缠，影响视觉质量和数据可用性。 Method: 设计条件扩散模型，采用身份掩码学习策略，通过身份解耦潜重组器和身份引导潜协调器实现特征分离与融合，并引入正交身份映射抑制身份泄露。 Result: 实验表明，ID²Face在视觉质量、身份抑制和数据效用保持方面优于现有方法。 Conclusion: ID²Face通过训练阶段的结构化潜空间学习，有效实现可控面部匿名化，避免了推理时优化的缺陷。 Abstract: Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose \textbf{ID\textsuperscript{2}Face}, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID\textsuperscript{2}Face outperforms existing methods in visual quality, identity suppression, and utility preservation.

[120] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs

Jinhong Deng,Wen Li,Joey Tianyi Zhou,Yang He

Main category: cs.CV

TL;DR: 提出了一种新的视觉令牌剪枝策略SCOPE，通过联合建模显著性和覆盖性来提高多模态大语言模型的效率和语义完整性。

Details

Motivation: 现有的视觉令牌剪枝方法主要关注基于注意力得分选择最显著的令牌，导致所选令牌的语义不完整。 Method: 引入集合覆盖率和令牌覆盖率增益，并结合显著性得分提出SCOPE得分，迭代选择最高SCOPE得分的令牌。 Result: 在多个视觉-语言理解基准上使用LLaVA-1.5和LLaVA-Next模型进行实验，结果表明该方法一致优于先前的方法。 Conclusion: SCOPE策略能有效提升多模态大语言模型的效率，同时更好地保持语义完整性。 Abstract: Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called \textbf{S}aliency-\textbf{C}overage \textbf{O}riented token \textbf{P}runing for \textbf{E}fficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at \href{https://github.com/kinredon/SCOPE}{https://github.com/kinredon/SCOPE}.

[121] Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation

Waseem Shariff,Timothy Hanley,Maciej Stec,Hossein Javidnia,Peter Corcoran

Main category: cs.CV

TL;DR: 本文提出了一种基于事件的微扫视数据集，利用Blender渲染高保真眼动场景，并通过v2e生成事件流，结合脉冲神经网络模型（如Spiking-VGG系列）实现对微扫视角度位移的高效分类，准确率达90%左右，为基于事件的视觉研究提供了新基准。

Details

Motivation: 传统微扫视研究依赖于昂贵的眼动仪和帧基分析方法，存在可扩展性差、时间分辨率有限的问题。因此，需要一种更高效、低成本的替代方案来推动认知计算中的微小眼动研究。 Method: 使用Blender模拟0.5至2.0度角位移的七类微扫视，通过v2e工具将其转换为事件流；采用Spiking-VGG11、Spiking-VGG13、Spiking-VGG16及提出的Spiking-VGG16Flow模型在SpikingJelly框架下进行评估。 Result: 模型在不依赖事件数量或持续时间的情况下，实现了约90%的平均分类准确率，验证了脉冲神经网络在精细运动识别中的潜力。 Conclusion: 该工作建立了首个基于事件的微扫视数据集，展示了脉冲神经网络在高时空分辨率眼动识别中的有效性，为未来事件驱动的视觉感知研究提供了重要资源和基准。 Abstract: Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .

[122] Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy

Qing Zhao,Weijian Deng,Pengxu Wei,ZiYi Dong,Hannan Lu,Xiangyang Ji,Liang Lin

Main category: cs.CV

TL;DR: 提出Lipschitz正则化目标检测框架（LROD），通过在训练中统一图像恢复与检测任务的Lipschitz连续性，提升恶劣环境下检测的稳定性与准确性。

Details

Motivation: 传统级联框架中图像恢复与检测网络之间存在功能不匹配，导致因微小扰动引发的不稳定性问题，影响梯度流动和优化过程。 Method: 从Lipschitz连续性的角度分析恢复与检测网络在输入空间和参数空间的功能差异，提出LROD框架，将图像恢复融入检测器的特征学习过程中，并在训练中对齐两者的Lipschitz特性。实现为LR-YOLO，兼容现有YOLO系列模型。 Result: 在雾天和低光基准数据集上的实验表明，LR-YOLO显著提升了检测稳定性、优化平滑性和整体检测精度。 Conclusion: 通过协调恢复与检测任务的Lipschitz连续性，LROD有效缓解了级联系统中的不匹配问题，为鲁棒目标检测提供了新的解决方案。 Abstract: To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration -- an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector's feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.

[123] DeshadowMamba: Deshadowing as 1D Sequential Similarity

Zhaotong Yang,Yi Chen,Yanying Li,Shengfeng He,Yangyang Xu,Junyu Dong,Jian Yang,Yong Du

Main category: cs.CV

TL;DR: 本文提出DeshadowMamba模型，结合Mamba架构与阴影感知的CrossGate机制和ColorShift正则化，在图像去阴影任务中实现结构完整性和色彩一致性，取得当前最优性能。

Details

Motivation: 现有基于注意力机制的去阴影方法因固定注意力模式易引入无关区域光照干扰，导致结构失真和色彩不一致。 Method: 将去阴影视为序列建模问题，采用Mamba的定向状态转移获取全局感受野；提出CrossGate机制注入阴影语义信息以选择性融合上下文；设计ColorShift正则化通过对比学习抑制颜色污染。 Result: 在公开数据集上实验表明，该方法在视觉质量和定量指标上均达到最先进水平。 Conclusion: 通过改进Mamba架构并引入阴影感知与色彩保真机制，有效提升了图像去阴影的结构与色彩恢复能力。 Abstract: Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba's input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.

[124] UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation

Jiyu Guo,Shuo Yang,Yiming Huang,Yancheng Long,Xiaobo Xia,Xiu Su,Bo Zhao,Zeke Xie,Liqiang Nie

Main category: cs.CV

TL;DR: 提出了一种以任务效用为中心的数据增强框架UtilGen，通过下游任务反馈自适应优化生成数据的过程，在多个基准数据集上显著优于现有方法。

Details

Motivation: 现有数据增强方法主要关注生成数据的视觉质量（如保真度和多样性），而忽视了不同下游任务对训练数据的不同需求，缺乏任务特定的优化机制。 Method: 提出UtilGen框架，引入权重分配网络评估每个合成样本的任务特定效用，并采用双层优化策略：模型级优化使生成模型适配下游任务，实例级优化调整每轮生成的提示嵌入和初始噪声等策略。 Result: 在八个不同复杂度和粒度的基准数据集上实验表明，UtilGen平均准确率比先前最先进方法提升3.87%，且生成的数据更具影响力和任务相关性。 Conclusion: 从以视觉特征为中心转向以任务效用为中心的数据增强范式是有效的，UtilGen通过利用下游任务反馈生成高实用性的合成数据，显著提升了模型性能。 Abstract: Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes -- such as fidelity and diversity -- to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies -- such as prompt embeddings and initial noise -- at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.

[125] Training-free Source Attribution of AI-generated Images via Resynthesis

Pietro Bongini,Valentina Molinari,Andrea Costanzo,Benedetta Tondi,Mauro Barni

Main category: cs.CV

TL;DR: 提出一种无需训练的单样本合成图像溯源方法，通过图像重合成和特征空间比对实现模型溯源，并发布一个新的用于评估少样本和零样本溯源能力的数据集。

Details

Motivation: 在数据稀缺条件下（如少样本或零样本场景），现有合成图像溯源方法性能有限，亟需更有效的技术来应对不同生成模型的溯源挑战。 Method: 基于图像重合成的方法：首先生成描述待分析图像的提示词，然后使用各候选源模型重新生成该图像，最后在特定特征空间中比较重合成图像与原图的相似度，将最接近的源模型判定为图像来源。 Result: 在新构建的合成图像数据集上实验表明，该方法在仅有少量样本可用时优于现有的少样本方法；同时新数据集具有较高挑战性，适合作为未来少样本和零样本方法的基准测试平台。 Conclusion: 所提出的无需训练的单样本溯源方法在数据稀缺场景下表现优越，结合新数据集为未来合成图像溯源研究提供了有效工具和评估标准。 Abstract: Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows to test approaches based on resynthesis and to compare them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.

[126] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

Juntian Zhang,Song Jin,Chuanqi Cheng,Yuhan Liu,Yankai Lin,Xun Zhang,Yufei Zhang,Fei Jiang,Guojun Yin,Wei Lin,Rui Yan

Main category: cs.CV

TL;DR: 本文提出ViPER框架，通过自引导、自批评和自预测的闭环训练范式，提升视觉语言模型的细粒度视觉感知能力，在Qwen2.5-VL系列上实现显著性能提升。

Details

Motivation: 现有方法在提升视觉语言模型细粒度视觉感知能力时受限于高质量数据稀缺，且监督微调可能损害泛化能力，强化微调偏重文本推理而忽视视觉感知。 Method: 提出两阶段粗到精的任务设计，结合图像级与实例级重建，采用两阶段强化学习策略，构建基于自我批评与自我预测的自举式框架ViPER。 Result: 在七个综合基准上平均提升1.7%，细粒度感知任务最高提升6.0%，Qwen-Viper系列模型在保持泛化能力的同时显著增强视觉感知性能。 Conclusion: ViPER实现了视觉语言模型在细粒度感知上的自我进化，验证了生成与理解之间的双向促进关系，为更自主、强大的VLM发展提供了新路径。 Abstract: The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.

[127] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning

Ivica Dimitrovski,Vlatko Spasev,Ivan Kitanovski

Main category: cs.CV

TL;DR: 本文研究了提示学习（prompt learning）在少样本遥感图像场景分类中的应用，通过对比多种提示方法，发现其在跨域性能上显著优于传统基线方法，尤其是具有自调节约束的提示方法表现最佳。

Details

Motivation: 由于标注数据稀缺且跨地理和传感器域的标注成本高，现有深度学习在遥感场景分类中受限；同时，通用视觉-语言模型（如CLIP）因领域差异和缺乏任务特定语义适应，在遥感中表现不佳。 Method: 系统探索了多种提示学习方法，包括上下文优化（CoOp）、条件上下文优化（COCO）、多模态提示学习和带自调节约束的提示，在多个遥感数据集上进行少样本分类实验，并与零样本CLIP和基于冻结特征的线性探针进行对比。 Result: 提示学习在所有少样本设置下均优于两个基线；其中“带自调节约束的提示”在跨数据集泛化中表现出最强的鲁棒性。 Conclusion: 提示学习是一种高效、可扩展的策略，能有效缩小遥感图像中的领域差距，为未来研究提供了坚实基础。 Abstract: Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.

[128] Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation

Thanh-Huy Nguyen,Hoang-Thien Nguyen,Ba-Thinh Lam,Vi Vu,Bach X. Nguyen,Jianhua Xing,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

TL;DR: 提出一种基于切换双学生架构的半监督医学图像分割方法，通过动态选择更可靠的学生网络和改进的伪标签生成策略提升性能。

Details

Motivation: 传统教师-学生框架在半监督医学图像分割中存在网络间强相关性和知识传递不可靠的问题，限制了学习效果。 Method: 设计了一种切换双学生架构，在每次迭代时选择更可靠的学生网络以增强协作并防止错误累积；同时引入损失感知的指数移动平均策略，使教师网络能动态吸收有意义的信息以提高伪标签质量。 Result: 在多个3D医学图像分割数据集上验证了该方法的有效性，性能优于当前最先进的半监督方法。 Conclusion: 所提出的即插即用框架显著提升了半监督医学图像分割的准确性，具有良好的应用潜力。 Abstract: Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.

[129] Decoupling What to Count and Where to See for Referring Expression Counting

Yuda Zou,Zijian Zhang,Yongchao Xu

Main category: cs.CV

TL;DR: 本文提出了W2-Net，一种用于指代表达计数（REC）的新框架，通过双查询机制解耦‘要计数的内容’和‘需关注的视觉区域’，并引入子类可分离匹配策略，显著提升了细粒度子类对象的计数与定位性能。

Details

Motivation: 现有方法在标注时通常将点置于类别代表性位置（如头部），导致模型忽略属性相关区域（如‘行走’依赖腿部信息），难以实现细粒度子类区分。 Method: 提出W2-Net，包含‘what-to-count’（w2c）和‘where-to-see’（w2s）双查询机制，w2s专门提取属性相关区域特征；并设计子类可分离匹配（SSM）策略，引入排斥力增强子类间区分度。 Result: 在REC-8K数据集上，相比当前最优方法，计数误差在验证集和测试集分别降低22.5%和18.0%，定位F1分数提升7%和8%。 Conclusion: W2-Net通过显式建模属性相关视觉区域与改进匹配策略，有效解决了REC中因标注偏差导致的属性忽略问题，显著提升了细粒度对象计数与定位性能。 Abstract: Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.

[130] Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool

Yann Kerverdo,Florent Leray,Youwan Mahé,Stéphanie Leplaideur,Francesca Galassi

Main category: cs.CV

TL;DR: StrokeSeg是一个模块化、轻量化的框架，用于将研究级的脑卒中病灶分割模型转化为可部署的应用程序，在保持与原始PyTorch流程相当性能的同时，显著降低依赖和体积。

Details

Motivation: 深度学习框架如nnU-Net在脑病变分割中表现优异，但因依赖复杂和架构臃肿而难以临床部署。 Method: StrokeSeg采用解耦设计，预处理使用Anima工具箱并支持BIDS标准输出，推理阶段采用ONNX Runtime并结合Float16量化技术，减小模型约50%体积，并提供图形界面和命令行接口，支持Python脚本和独立Windows可执行文件分发。 Result: 在300例亚急性和慢性卒中患者数据上验证，分割性能与原始PyTorch流程相当（Dice差异<10⁻³）。 Conclusion: 高性能的研究级分割流程可以成功转化为便携、可临床应用的工具，StrokeSeg为临床转化提供了高效、灵活的解决方案。 Abstract: Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce \textit{StrokeSeg}, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with \texttt{Float16} quantisation, reducing model size by about 50\%. \textit{StrokeSeg} provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference $<10^{-3}$), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.

[131] A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset

Zhuangfan Huang,Xiaosong Li,Gao Wang,Tao Ye,Haishu Tan,Huafeng Li

Main category: cs.CV

TL;DR: 本文提出了一种亮度感知的多尺度网络（MLSN），用于在复杂光照环境下融合偏振图像（S0和DOLP），通过引入亮度分支和多尺度空间权重矩阵，有效解决了偏振图像固有的对比度差异问题，并设计了全局-局部特征融合机制与亮度增强模块，提升了融合图像的质量。同时构建了包含1000对图像的MSP数据集，实验证明MLSN在多个指标上优于现有方法。

Details

Motivation: 偏振图像融合在复杂光照下常因亮度差异导致融合效果不佳，现有方法难以充分整合互补信息，且缺乏高质量公开数据集，限制了该领域的发展。 Method: 提出亮度感知多尺度网络MLSN：编码器阶段通过亮度分支生成多尺度空间权重矩阵，动态注入亮度信息；瓶颈层采用全局-局部特征融合机制进行窗口自注意力计算；解码器阶段引入亮度增强模块，建立亮度分布与纹理特征的非线性映射，实现融合结果的亮度校正。同时构建MSP数据集以支持研究。 Result: 在MSP、PIF和GAND数据集上的实验表明，MLSN在主观和客观评价中均优于现有方法，MS-SSIM和SD指标分别平均提升8.57%~22.21%和54.31%~63.53%。所发布数据集包含1000对覆盖17类复杂场景的偏振图像。 Conclusion: MLSN能有效融合不同偏振图像中的互补信息，在复杂光照条件下显著提升融合质量，所提出的亮度感知机制和MSP数据集为偏振图像融合提供了新的解决方案和数据支持。 Abstract: Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection and other fields. To intergrate coL-Splementary information from different polarized images in complex luminance environment, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix through a brightness-branch , which dynamically weighted inject the luminance into the feature maps, solving the problem of inherent contrast difference in polarized images. The global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, to balance the global context and local details through residual linking in the feature dimension restructuring stage. In the decoder stage, to further improve the adaptability to complex lighting, we propose a Brightness-Enhancement module, establishing the mapping relationship between luminance distribution and texture features, realizing the nonlinear luminance correction of the fusion result. We also present MSP, an 1000 pairs of polarized images that covers 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, solving the scarcity of high-quality datasets in polarization image fusion. Extensive experiment on MSP, PIF and GAND datasets verify that the proposed MLSN outperms the state-of-the-art methods in subjective and objective evaluations, and the MS-SSIM and SD metircs are higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset is avalable at https://github.com/1hzf/MLS-UNet.

[132] When are radiology reports useful for training medical image classifiers?

Herman Bergström,Zhongqi Yue,Fredrik D. Johansson

Main category: cs.CV

TL;DR: 本文研究了在医学图像分类中如何利用放射学报告文本数据进行预训练和微调，发现在标签与文本强相关时使用报告有益，但在弱关联时可能适得其反，且微调阶段引入报告可带来显著提升。

Details

Motivation: 放射学报告包含丰富的专家标注信息，但目前大多数研究仅关注诊断标签的提取，忽略了与文本弱相关的任务，缺乏对文本在不同训练阶段作用的系统性研究。 Method: 系统评估了在预训练和微调阶段利用放射学报告的方法，在多种诊断和预测任务（如12个月再入院）以及不同训练集规模下进行实验，比较了图像-文本对齐等策略的效果。 Result: （1）在标签能被文本良好表示的任务中，利用报告进行预训练有助于下游分类；但在文本不直接关联标签时，显式的图像-文本对齐反而有害；（2）在微调阶段结合报告可显著提升性能，有时甚至超过预训练方法的影响。 Conclusion: 研究表明应根据任务与文本的相关性决定是否使用放射学报告，并指出当前研究在弱监督和多模态融合方面的不足，为医学图像分类中利用特权信息提供了实践指导。 Abstract: Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it's not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.

[133] Unsupervised Detection of Post-Stroke Brain Abnormalities

Youwan Mahé,Elise Bannier,Stéphanie Leplaideur,Elisa Fromont,Francesca Galassi

Main category: cs.CV

TL;DR: 本研究评估了一种基于流的生成模型REFLECT，用于无监督检测卒中后患者的局灶性和非病灶性异常。结果表明，使用健康对照数据训练的模型在异常检测方面表现更优。

Details

Motivation: 现有的监督分割方法难以充分捕捉卒中后MRI中的非病灶性结构变化（如萎缩和脑室扩大），需要更有效的异常检测方法。 Method: 采用基于流的生成模型REFLECT，分别在卒中患者（ATLAS）和健康对照（IXI）数据上进行训练，并通过双专家中心切片标注和自由响应ROC分析评估模型在异常图上的表现。 Result: 在ATLAS测试集中，IXI训练的模型在病灶分割（Dice = 0.37 vs 0.27）和非病灶异常敏感性（FROC = 0.62 vs 0.43）上均优于ATLAS训练的模型。 Conclusion: 使用完全健康的解剖数据训练可更好建模正常变异，从而实现更广泛且可靠的结构性异常检测。 Abstract: Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.

[134] GenTrack: A New Generation of Multi-Object Tracking

Toan Van Nguyen,Rasmus G. K. Christiansen,Dirk Kraft,Leon Bodenhagen

Main category: cs.CV

TL;DR: 本文提出了一种名为GenTrack的新型多目标跟踪方法，结合随机与确定性策略，利用粒子群优化（PSO）和社交交互建模，有效应对目标数量时变、遮挡和检测噪声等挑战，在ID一致性和轨迹连续性方面表现优异，并提供了首个开源基准实现。

Details

Motivation: 现有MOT方法在处理目标数量未知且时变、ID切换频繁、遮挡严重及检测器性能弱的情况下表现不佳，缺乏统一且可复现的基准框架，因此需要一种更鲁棒、系统化的方法来提升跟踪稳定性与可比性。 Method: 提出GenTrack，采用混合跟踪策略，结合PSO优化与新设计的适应度函数引导粒子搜索目标模式；引入社会交互信息增强粒子更新；构建包含空间一致性、外观、置信度、轨迹惩罚和社会评分的综合状态与观测模型；提供三种变体（Basic、PSO、PSO-Social）并开源代码。 Result: 实验表明GenTrack在标准数据集和真实场景中性能优于现有最先进方法，显著减少ID切换和轨迹丢失，尤其在遮挡和低质量检测下表现突出，且提供的开源实现实现了公平比较。 Conclusion: GenTrack通过融合PSO优化与社会交互建模，实现了鲁棒的多目标跟踪，解决了ID不一致和轨迹断裂问题，其开源贡献为领域内可复现研究建立了新基准。 Abstract: This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack

[135] A Hybrid Approach for Visual Multi-Object Tracking

Toan Van Nguyen,Rasmus G. K. Christiansen,Dirk Kraft,Leon Bodenhagen

Main category: cs.CV

TL;DR: 本文提出了一种结合随机与确定性机制的视觉多目标跟踪方法，用于在非线性动态和未知时变目标数下保持身份一致性。

Details

Motivation: 解决非线性动态、非高斯噪声以及目标数量变化情况下多目标跟踪的身份一致性问题。 Method: 采用基于粒子群优化的粒子滤波处理非线性动态，并设计融合运动、外观和社会交互信息的确定性关联策略和代价矩阵，实现身份保持的状态更新与速度回归。 Result: 实验表明该方法优于现有最先进跟踪器，且支持视频回放和实时摄像头流处理。 Conclusion: 所提方法在复杂场景下实现了鲁棒的身份保持多目标跟踪，具有实际应用价值。 Abstract: This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

[136] 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon

Ali Ahmad Faour,Nabil Amacha,Ali J. Ghandour

Main category: cs.CV

TL;DR: 提出了一种基于卫星影像和机器学习的无传感器方法，用于实时估算黎巴嫩Qaraaoun水库的表面积和蓄水量，具有高精度和低成本优势。

Details

Motivation: 由于传感器故障频繁且维护能力有限，传统监测方法难以可靠地管理Qaraaoun水库的蓄水体积，因此需要一种不依赖传感器的替代方案。 Method: 结合开源卫星影像（Sentinel-2和Landsat）、新提出的水体分割指数进行水体范围提取，并利用支持向量回归（SVR）模型，基于水库地形数据建立表面积与蓄水量的关系模型。 Result: 水体分割准确率超过95%；SVR模型误差低于满库容的1.5%，决定系数超过0.98；实现了仅凭卫星影像表面积估算水库体积。 Conclusion: 该方法具有高鲁棒性和成本效益，可实现水库的连续、无传感器监测，并可推广至其他水体，同时生成的长期时序数据对气候变化研究具有价值。 Abstract: The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.

[137] XAI Evaluation Framework for Semantic Segmentation

Reem Hammoud,Abdul karim Gizzini,Ali J. Ghandour

Main category: cs.CV

TL;DR: 本文提出了一种针对语义分割任务中可解释人工智能（XAI）的综合评估框架，考虑了空间和上下文复杂性，并采用像素级评估策略和定制指标来提供细粒度的可解释性洞察。

Details

Motivation: 由于AI模型在安全关键和高风险领域的广泛应用，确保其透明性和可信度至关重要。尽管XAI已取得进展，但针对语义分割任务的XAI评估方法仍缺乏系统研究。 Method: 提出一个专为语义分割设计的系统化XAI评估框架，结合像素级评估策略与精心设计的指标，以应对空间和上下文复杂性。 Result: 通过基于类激活映射（CAM）的XAI方法的仿真结果表明，所提框架具有高效性、鲁棒性和可靠性。 Conclusion: 该框架有助于推动透明、可信且可问责的语义分割模型的发展。 Abstract: Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.

[138] Deeply-Conditioned Image Compression via Self-Generated Priors

Zhineng Zhao,Zhihai He,Zikun Zhou,Siwei Ma,Yaowei Wang

Main category: cs.CV

TL;DR: 本文提出了一种基于功能分解的深度条件图像压缩框架DCIC-sgp，通过自生成先验对图像结构进行建模，并在压缩流程中深度调节分析变换，有效解耦全局结构与局部纹理，显著减少低码率下的几何形变，提升了率失真性能。

Details

Motivation: 现有学习型图像压缩方法难以有效建模自然图像中全局结构与局部纹理的复杂相关性，导致低码率下出现严重几何变形。 Method: 引入功能分解思想，首先编码一个自生成的强先验以捕捉图像的结构主干，并将该先验用于深度调节整个压缩管道（特别是分析变换），使其专注于残差细节的表示。 Result: 实验表明，该方法在Kodak、CLIC和Tecnick数据集上相比VVC测试模型VTM-12.1实现了14.4%、15.7%和15.1%的BD-rate降低，显著抑制了几何变形伪影。 Conclusion: 所提出的DCIC-sgp框架通过深度条件调节实现信息流的有效解耦，在低码率图像压缩中取得了具有竞争力的性能提升。 Abstract: Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image's structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.

[139] Rethinking Visual Intelligence: Insights from Video Pretraining

Pablo Acuaviva,Aram Davtyan,Mariam Hassan,Sebastian Stapf,Ahmad Rahimi,Alexandre Alahi,Paolo Favaro

Main category: cs.CV

TL;DR: 视频扩散模型（VDM）通过时空数据预训练展现出比大型语言模型更强的视觉任务数据效率，表明其在构建通用视觉基础模型方面具有潜力。

Details

Motivation: 尽管大语言模型在语言领域表现出色，但在视觉领域的组合理解、样本效率和通用问题解决方面仍存在挑战，亟需探索更有效的视觉建模方法。 Method: 研究采用预训练的视频扩散模型（VDM）与大语言模型（LLM），在各自模态下配备轻量适配器，进行包括ARC-AGI、ConceptARC、视觉游戏、路径规划和元胞自动机在内的多任务评估。 Result: 实验结果显示，VDM在多个视觉任务中均表现出高于LLM的数据效率，展现出更强的任务适应能力。 Conclusion: 视频预训练提供的归纳偏置有助于提升模型的视觉理解与泛化能力，VDM是迈向通用视觉基础模型的有前景方向。 Abstract: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

[140] A Critical Study towards the Detection of Parkinsons Disease using ML Technologies

Vivek Chetia,Abdul Taher Khan,Rahish Gogoi,David Kapsian Khual,Purnendu Bikash,Sajal Saha

Main category: cs.CV

TL;DR: 本文提出了一种基于深度学习的茶叶病害分类与受损区域检测方法，使用SSD MobileNet V2、Faster R-CNN ResNet50 V1进行病害目标检测，并采用Mask R-CNN实现叶片病斑区域的实例分割，其中Faster R-CNN表现更优，mAP达到25%。

Details

Motivation: 茶叶病害严重影响产量和品质，传统依赖人工识别效率低且易出错，因此需要一种自动、准确的病害识别与受损评估方法。 Method: 采用SSD MobileNet V2和Faster R-CNN ResNet50 V1进行三种茶树病害（红锈病、Helopeltis、红蜘蛛螨）的目标检测，并使用Mask R-CNN进行实例分割，结合自定义方法计算叶片受损面积。 Result: SSD MobileNet V2在IOU 0.50:0.95下mAP为20.9%，Faster R-CNN ResNet50 V1的mAP为25%，表现更优；同时实现了病害区域的像素级分割与受损面积估算。 Conclusion: Faster R-CNN在茶树叶部病害检测中优于SSD模型，结合Mask R-CNN可有效实现病害识别与损伤程度量化，具有应用于智能农业病害管理的潜力。 Abstract: The proposed solution is Deep Learning Technique that will be able classify three types of tea leaves diseases from which two diseases are caused by the pests and one due to pathogens (infectious organisms) and environmental conditions and also show the area damaged by a disease in leaves. Namely Red Rust, Helopeltis and Red spider mite respectively. In this paper we have evaluated two models namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for the object detection. The SSD MobileNet V2 gave precision of 0.209 for IOU range of 0.50:0.95 with recall of 0.02 on IOU 0.50:0.95 and final mAP of 20.9%. While Faster R-CNN ResNet50 V1 has precision of 0.252 on IOU range of 0.50:0.95 and recall of 0.044 on IOU of 0.50:0.95 with a mAP of 25%, which is better than SSD. Also used Mask R-CNN for Object Instance Segmentation where we have implemented our custom method to calculate the damaged diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.

[141] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras

Charles Javerliat,Pierre Raimbaud,Guillaume Lavoué

Main category: cs.CV

TL;DR: Kineo是一个无需相机标定、全自动的无标记运动捕捉管道，利用消费级RGB相机视频实现高精度3D人体运动重建，显著优于现有方法。

Details

Motivation: 现有无标记多视角运动捕捉方法依赖精确相机标定，限制了在非专业场景和野外应用；而现有的免标定方法存在计算成本高、重建精度低的问题。 Method: Kineo利用现成的2D关键点检测器输出，通过置信度驱动的时空关键点采样策略和基于图的全局优化，联合进行相机标定（包括镜头畸变参数）与3D关键点及稠密场景点云的度量尺度重建，并引入成对重投影共识分数评估重建可靠性。 Result: 在EgoHumans和Human3.6M数据集上，相比先前免标定方法，相机平移误差减少83-85%，角度误差减少86-92%，世界坐标下平均关节误差（W-MPJPE）减少83-91%；且处理速度超过实时（如36分钟处理80分钟视频）。 Conclusion: Kineo实现了高精度、高效、完全自动化的免标定运动捕捉，适用于普通用户和真实场景，推动了运动捕捉技术的普及化和实用化。 Abstract: Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configuration (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.

[142] Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling

Kyungmin Lee,Sihyun Yu,Jinwoo Shin

Main category: cs.CV

TL;DR: 本文提出了Decoupled MeanFlow，一种无需修改架构即可将流模型转换为流图模型的解码策略，显著减少去噪步骤至1-4步，实现高效高质量生成。

Details

Motivation: 去噪生成模型（如扩散模型和基于流的模型）因离散化误差需要大量去噪步骤，影响采样速度；现有流图方法虽可缓解此问题，但训练时通常需改变网络结构，难以兼容预训练模型。 Method: 提出Decoupled MeanFlow方法，通过在扩散变换器的最后几层条件化后续时间步的信息，将预训练流模型直接转化为流图模型，无需架构修改，并结合增强训练技术实现快速高质量生成。 Result: 在ImageNet 256x256和512x512上分别实现了1步FID 2.16和2.12，4步FID达1.51和1.68，显著优于先前方法，且推理速度提升超100倍。 Conclusion: 训练流模型后转换为流图模型比从头训练更高效有效，Decoupled MeanFlow为快速推理提供了兼容性强、性能优越的新路径。 Abstract: Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.

[143] Fast and accurate neural reflectance transformation imaging through knowledge distillation

Tinsae G. Dulecha,Leonardo Righetto,Ruggero Pintus,Enrico Gobbetti,Andrea Giachetti

Main category: cs.CV

TL;DR: 提出了一种基于知识蒸馏的NeuralRTI方法（DisK-NeuralRTI），在保持高质量反射场重建的同时显著降低计算成本，提升渲染效率。

Details

Motivation: 传统RTI方法难以准确建模复杂反射场，而NeuralRTI虽质量高但计算开销大，尤其在大图像和有限硬件上难以实时渲染。 Method: 采用知识蒸馏策略，通过训练轻量级学生网络来模仿高性能但复杂的教师网络（NeuralRTI），从而降低推理时的计算负担。 Result: DisK-NeuralRTI在保持与原始NeuralRTI相当的重建质量的同时，显著提高了渲染速度，实现在有限硬件上的高效全分辨率交互式重光照。 Conclusion: 知识蒸馏是优化NeuralRTI计算效率的有效手段，DisK-NeuralRTI为高保真RTI应用提供了更实用的解决方案。 Abstract: Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). ...

[144] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

Huanyu Zhang,Wenshan Wu,Chengzu Li,Ning Shang,Yan Xia,Yangyu Huang,Yifan Zhang,Li Dong,Zhang Zhang,Liang Wang,Tieniu Tan,Furu Wei

Main category: cs.CV

TL;DR: 本文提出了Latent Sketchpad框架，通过在多模态大语言模型（MLLMs）中引入内部视觉草图板，增强其视觉规划与想象能力。

Details

Motivation: MLLMs在视觉理解方面表现出色，但在需要视觉规划和想象的复杂场景中表现不佳。受人类通过素描进行视觉思维启发，作者希望赋予MLLMs生成性视觉思维能力。 Method: 在前沿MLLM基础上，将视觉生成融入其自回归推理过程，提出上下文感知视觉头来自回归生成视觉隐表示，并利用预训练的素描解码器将其转化为可解释的草图图像。 Result: 在新构建的MazePlanning数据集上实验表明，Latent Sketchpad在推理性能上与基线模型相当甚至更优，且能泛化到Gemma3和Qwen2.5-VL等不同MLLM上。 Conclusion: 该框架通过扩展文本推理至视觉思维，为更丰富的交互和应用提供了新可能。 Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model's textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.

[145] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Hongrui Jia,Jitong Liao,Xi Zhang,Haiyang Xu,Tianbao Xie,Chaoya Jiang,Ming Yan,Si Liu,Wei Ye,Fei Huang

Main category: cs.CV

TL;DR: OSWorld-MCP是首个全面且公平的基准，用于评估多模态代理在真实环境中调用工具、操作GUI和决策的能力，揭示了当前模型在工具调用方面的不足并提供了改进方向。

Details

Motivation: 以往对多模态代理的评估主要集中在GUI交互能力上，忽略了如MCP协议支持的工具调用能力，导致评估不公。因此需要一个更全面、公平的基准来准确衡量代理在现实场景中的综合能力。 Method: 设计了一种新颖的自动化代码生成管道来创建工具，并结合现有精选工具；通过严格的手动验证构建包含158个高质量工具的数据集（覆盖7个常见应用）；在真实环境中对最先进的多模态代理进行广泛评估，明确测量其MCP工具使用技能。 Result: 实验表明，集成MCP工具后任务成功率显著提升（例如OpenAI o3从8.3%升至20.4%，Claude 4 Sonnet从40.1%升至43.3%），但最强模型的工具调用率仍只有36.3%，说明仍有较大改进空间。 Conclusion: OSWorld-MCP通过显式衡量工具调用能力，深化了对多模态代理的理解，为复杂、工具辅助环境下的性能评估设立了新标准。 Abstract: With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates, Only 36.3%, indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.

[146] Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT

Xu Jiang,Huiying Pan,Ligen Shi,Jianing Sun,Wenfeng Xu,Xing Zhao

Main category: cs.CV

TL;DR: 提出一种基于深度学习的散射伪影校正方法，结合物理先验知识与Kolmogorov-Arnold网络（KAN）和高斯径向基函数（RBF），有效提升锥束CT图像质量。

Details

Motivation: 锥束CT（CBCT）在数据采集过程中易受散射影响，导致重建图像中出现CT值偏差和组织对比度下降，从而降低诊断准确性。现有方法难以精确建模散射特征，因此需要更有效的校正方法。 Method: 利用投影域中散射点概率密度分布具有旋转对称性的物理特性，采用高斯径向基函数（RBF）建模点散射函数，并将其嵌入Kolmogorov-Arnold网络（KAN）层中，结合物理先验与非线性映射能力，实现高维散射特征的学习与估计。 Result: 在合成数据和真实扫描实验中验证了该方法的有效性，结果显示该模型能有效校正重建图像中的散射伪影，并在定量指标上优于当前主流方法。 Conclusion: 所提方法通过融合物理先验与深度学习框架，显著提升了CBCT散射校正的精度，有望改善临床图像质量和诊断可靠性。 Abstract: Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain. The method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.

[147] A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries

Xin Zhang,Yuqi Song,Fei Zuo

Main category: cs.CV

TL;DR: 提出一种基于双分支卷积神经网络的面部伪造检测方法，结合空间和频率域特征，利用通道注意力机制融合异构特征，并设计FSC Loss损失函数提升分类可分性与鲁棒性，在DiFF基准上表现优于人类平均水平。

Details

Motivation: 生成式AI快速发展导致逼真的伪造人脸图像泛滥，带来 misinformation、身份欺诈等安全威胁，亟需鲁棒且通用的检测方法以维护AI安全与媒体可信度。 Method: 采用双分支CNN架构，RGB分支提取语义信息，频率分支捕捉高频伪影；引入通道注意力模块自适应融合特征；设计FSC Loss（结合focal loss、监督对比损失和频率中心间隔损失）优化训练过程。 Result: 在包含文本生成图像、图像到图像、换脸和人脸编辑四类伪造方法的DiFF基准上取得优异性能，跨类别检测效果优于平均人类识别准确率。 Conclusion: 所提方法能有效检测多种生成机制的面部伪造图像，具备良好的泛化性与鲁棒性，对构建安全可信的AI系统具有重要意义。 Abstract: The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.

[148] Eye-Tracking, Mouse Tracking, Stimulus Tracking,and Decision-Making Datasets in Digital Pathology

Veronica Thai,Rui Li,Meng Ling,Shuning Jiang,Jeremy Wolfe,Raghu Machiraju,Yan Hu,Zaibo Li,Anil Parwani,Jian Chen

Main category: cs.CV

TL;DR: PathoGaze1.0是一个包含19名病理学家解读397张全切片图像时的眼动、鼠标交互、视口导航和诊断决策数据的综合行为数据集，旨在揭示癌症诊断中的视觉搜索与决策过程，提升病理学家和AI系统的训练效果。

Details

Motivation: 病理学家在解读千兆像素级全切片图像时诊断准确率平均仅为70%，且缺乏足够的行为数据来解释诊断错误和不一致性，因此需要一个高生态效度的行为数据集来填补这一空白。 Method: 通过名为PTAH的应用导向测试平台，收集19名病理学家在真实诊断流程中解读397张全切片图像时的眼动、鼠标交互、刺激追踪、视口导航及诊断决策（EMSVD）数据，整个实验过程预先注册以确保可重复性。 Result: 共收集18.69小时的行为数据，包括171,909次注视、263,320次扫视和1,867,362次鼠标交互事件，形成了一个高质量、高生态效度的多模态行为数据集PathoGaze1.0。 Conclusion: PathoGaze1.0为理解病理诊断中的认知过程提供了宝贵资源，可用于改进病理学家培训和开发支持人类专家的AI辅助系统。 Abstract: Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.

[149] Group Relative Attention Guidance for Image Editing

Xuanpu Zhang,Xuesong Niu,Ruidong Chen,Dan Song,Jianhao Zeng,Penghui Du,Haoxiang Cao,Kai Wu,An-an Liu

Main category: cs.CV

TL;DR: 提出了一种名为Group Relative Attention Guidance (GRAG) 的新方法，用于在Diffusion-in-Transformer模型中实现对图像编辑强度的连续、细粒度控制，无需额外调参，仅需少量代码即可集成并提升编辑质量。

Details

Motivation: 现有基于DiT的图像编辑方法缺乏对编辑程度的有效控制，难以实现定制化结果。作者希望通过改进注意力机制来实现更精确的编辑强度调节。 Method: 通过分析DiT中的MM-Attention机制，发现Query和Key共享一个仅依赖于层的偏置向量，将其解释为模型固有编辑行为，并利用token与其偏置之间的差值（delta）编码内容相关信号。GRAG通过对不同token的delta值进行重加权，调节模型对输入图像与编辑指令间关注的焦点，从而实现编辑强度控制。 Result: 实验表明，GRAG可在多个现有编辑框架上以少至四行代码集成，显著提升编辑质量；相比Classifier-Free Guidance，GRAG能实现更平滑、更精确的编辑强度控制。 Conclusion: GRAG是一种简单而有效的方法，为基于DiT的图像编辑提供了无需微调的连续强度控制方案，具有良好的通用性和应用潜力。 Abstract: Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.

[150] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips

Mia Kan,Yilin Liu,Niloy Mitra

Main category: cs.CV

TL;DR: 本文提出了一种名为SAGE的零样本视频过渡方法，通过结合结构引导（如线条图和运动流）与生成合成，实现跨差异较大的视频片段之间的平滑、语义一致的过渡。

Details

Motivation: 现有的视频过渡方法在处理具有大时间间隔或显著语义差异的视频片段时，难以生成内容感知且视觉连贯的过渡效果，限制了其专业应用。 Method: 借鉴艺术创作流程中的策略（如对齐轮廓、插值显著特征），利用线图和光流提供结构引导，并结合生成模型进行视频中间帧合成，无需微调即可实现零样本视频过渡。 Result: 在多个定量指标和用户研究中，SAGE优于现有经典方法和生成式基线方法（如FILM、TVG等），能更有效地生成多样视频片段间的高质量过渡。 Conclusion: SAGE通过引入结构感知机制，在不需训练的情况下实现了高质量、语义连贯的视频过渡，适用于差异较大的视频片段，推动了视频编辑中自动过渡技术的发展。 Abstract: Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.

[151] MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection

Yun Zhang,Zhaoliang Zheng,Johnson Liu,Zhiyu Huang,Zewei Zhou,Zonglin Meng,Tianhui Cai,Jiaqi Ma

Main category: cs.CV

TL;DR: 本文提出了MIC-BEV，一种基于Transformer的鸟瞰图（BEV）感知框架，用于基础设施多摄像头3D目标检测，支持不同数量和参数的摄像头，并在传感器退化情况下表现出强鲁棒性。

Details

Motivation: 现有基于相机的检测模型在多视角基础设施设置、多样化相机配置、图像质量下降和复杂道路布局等挑战下表现不佳，因此需要更强大的基础设施感知方法。 Method: 提出MIC-BEV框架，采用图增强融合模块，利用相机与BEV网格间的几何关系及潜在视觉线索，将多视角图像特征融合至BEV空间；同时构建了合成数据集M2I用于训练和评估。 Result: 在M2I和真实数据集RoScenes上的实验表明，MIC-BEV在3D目标检测上达到SOTA性能，并在极端天气和传感器退化等挑战条件下保持稳健。 Conclusion: MIC-BEV在基础设施感知中展现出巨大潜力，具备良好的灵活性和鲁棒性，适合实际部署。 Abstract: Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.

[152] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Yihao Li,Saeed Salehi,Lyle Ungar,Konrad P. Kording

Main category: cs.CV

TL;DR: 本文研究了视觉Transformer（ViT）是否能够自然地涌现出将图像块绑定为同一物体的能力（IsSameObject），发现这种能力在自监督预训练模型中显著存在，且对注意力机制和下游任务性能有积极影响。

Details

Motivation: 探讨ViT是否能在无显式对象注意力机制的情况下，自然涌现出对象绑定能力，以理解其内部表征是否包含‘哪些部分属于同一物体’的符号性知识。 Method: 通过在ViT各层的图像块嵌入上使用相似性探针解码IsSameObject信号，并分析其在不同预训练范式下的表现、子空间结构及其对注意力的引导作用。 Result: IsSameObject信号在自监督ViT（如DINO、MAE、CLIP）中准确率超过90%，但在ImageNet监督模型中较弱；该信号存在于低维子空间并主动引导注意力，去除后会损害下游性能。 Conclusion: ViT在适当预训练下能自然涌现出对象绑定能力，挑战了其缺乏对象表征的观点，揭示了连接主义系统中符号性知识的 emergence。 Abstract: Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.

[153] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Yujie Wei,Shiwei Zhang,Hangjie Yuan,Yujin Han,Zhekai Chen,Jiayu Wang,Difan Zou,Xihui Liu,Yingya Zhang,Yu Liu,Hongming Shan

Main category: cs.CV

TL;DR: 本文提出了ProMoE，一种用于Diffusion Transformers的Mixture-of-Experts框架，通过两步路由机制和显式语义引导提升专家专业化，显著提高了视觉任务性能。

Details

Motivation: 现有MoE在语言模型中成功，但在视觉扩散Transformer中效果有限，主要因视觉token存在空间冗余和功能异质性，阻碍专家 specialization。 Method: 提出ProMoE，采用两步路由：条件路由将图像token划分为条件与非条件集合；原型路由基于可学习原型按语义内容优化条件token分配，并引入基于相似性的潜在空间专家分配机制及路由对比损失，增强专家内聚性和多样性。 Result: 在ImageNet上实验表明，ProMoE在Rectified Flow和DDPM目标下均优于当前最先进方法。 Conclusion: 显式的语义引导和结构化路由机制对视觉MoE至关重要，ProMoE为扩散Transformer中的MoE应用提供了有效解决方案。 Abstract: Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.

[154] Uniform Discrete Diffusion with Metric Path for Video Generation

Haoge Deng,Ting Pan,Fan Zhang,Yang Liu,Zhuoyan Luo,Yufeng Cui,Wenxuan Wang,Chunhua Shen,Shiguang Shan,Zhaoxiang Zhang,Xinlong Wang

Main category: cs.CV

TL;DR: 本文提出了URSA，一种基于离散空间的视频生成框架，通过线性化度量路径和分辨率依赖的时间步移机制，实现了与连续扩散模型相媲美的性能，同时支持高分辨率和长时视频生成。

Details

Motivation: 离散视频生成方法由于误差累积和长序列不一致问题落后于连续方法，本文旨在重新审视离散生成建模并缩小这一差距。 Method: 提出URSA框架，将视频生成视为离散时空token的迭代全局优化过程，引入线性化度量路径和分辨率依赖的时间步移机制，并采用异步时间微调策略统一多种任务。 Result: 在多个视频和图像生成基准上，URSA显著优于现有离散方法，性能媲美最先进的连续扩散模型，且推理步数更少。 Conclusion: URSA有效提升了离散视频生成的能力，证明了离散方法在可扩展视频生成中的潜力。 Abstract: Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

[155] Generative View Stitching

Chonghyuk Song,Michal Stary,Boyuan Chen,George Kopanas,Vincent Sitzmann

Main category: cs.CV

TL;DR: 提出Generative View Stitching (GVS) 方法，通过并行采样实现相机轨迹引导的视频生成，避免自回归模型中的碰撞与崩溃问题，并利用Diffusion Forcing框架和Omni Guidance提升时序一致性与长程连贯性。

Details

Motivation: 现有自回归视频扩散模型无法利用未来帧的条件信息，导致在预定义相机轨迹引导生成时出现场景碰撞并迅速崩溃。 Method: 提出Generative View Stitching (GVS)，基于Diffusion Forcing框架，在任意现成视频扩散模型上实现并行序列采样；引入Omni Guidance机制，结合过去和未来的条件信息增强时序一致性，并设计闭环机制实现长程连贯。 Result: GVS实现了稳定、无碰撞、帧间一致且能闭合回环的相机引导视频生成，适用于多种预定义路径（包括‘不可能楼梯’等复杂轨迹），在多个相机路径上验证了有效性。 Conclusion: GVS克服了自回归扩散模型在相机引导生成中的局限，通过扩展扩散拼接方法，兼容现有模型，实现了高质量、长时程、几何一致的视频生成。 Abstract: Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersv\"ard's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.

eess.IV [Back]

[156] MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images

Ovi Sarkar,Md Shafiuzzaman,Md. Faysal Ahamed,Golam Mahmud,Muhammad E. H. Chowdhury

Main category: eess.IV

TL;DR: 提出一种基于ResNet50V2并融合注意力机制与SE模块的MSRANetV2模型，用于结直肠组织图像分类，在两个公开数据集上表现出高精度和强鲁棒性。

Details

Motivation: 传统结直肠癌诊断方法如结肠镜和组织学检查存在主观性强、耗时长和易变异性问题，亟需提高诊断的准确性和效率。 Method: 设计了一种名为MSRANetV2的卷积神经网络，采用ResNet50V2为主干，结合残差注意力机制和squeeze-and-excitation模块，并通过通道对齐和上采样操作融合多尺度特征；使用五折交叉验证在CRC-VAL-HE-7K和NCT-CRC-HE-100K数据集上评估性能，并引入Grad-CAM提升可解释性。 Result: 在7K数据集上平均准确率达0.9905，在100K数据集上达0.9902，各项指标（Precision、Recall、F1、AUC）均接近最优；Grad-CAM可视化显示模型关注医学相关区域，具备良好可解释性。 Conclusion: MSRANetV2是一种可靠、高性能且可解释的结直肠组织分类模型，有望推动数字病理中自动化诊断的发展。 Abstract: Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model on a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable average Precision, recall, F1-score, AUC, and test accuracy were 0.9884 plus-minus 0.0151, 0.9900 plus-minus 0.0151, 0.9900 plus-minus 0.0145, 0.9999 plus-minus 0.00006, and 0.9905 plus-minus 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 plus-minus 0.0091, 0.9900 plus-minus 0.0071, 0.9900 plus-minus 0.0071, 0.9997 plus-minus 0.00016, and 0.9902 plus-minus 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.

Table of Contents

cs.CL [Back]

[1] Evaluating Long-Term Memory for Long-Context Question Answering

[2] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition

[3] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

[4] How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse

[5] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection

[6] Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception

[7] Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

[8] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning

[9] Language Models for Longitudinal Clinical Prediction

[10] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

[11] Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation

[12] Agent-based Automated Claim Matching with Instruction-following LLMs

[13] Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs

[14] Leveraging LLMs for Early Alzheimer's Prediction

[15] Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs

[16] M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

[17] PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine

[18] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine

[19] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

[20] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

[21] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs

[22] Success and Cost Elicit Convention Formation for Efficient Communication

[23] Pie: A Programmable Serving System for Emerging LLM Applications

[24] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

[25] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

[26] RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects

[27] Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks

[28] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

[29] Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

[30] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

[31] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

[32] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability

[33] Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment

[34] HACK: Hallucinations Along Certainty and Knowledge Axes

[35] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

[36] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations

[37] Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

[38] From Memorization to Reasoning in the Spectrum of Loss Curvature

[39] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?

[40] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

[41] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

[42] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

[43] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

[44] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

[45] Text Simplification with Sentence Embeddings

[46] Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models

[47] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

[48] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

[49] Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

[50] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

[51] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices

[52] Iterative Critique-Refine Framework for Enhancing LLM Personalization

[53] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems

[54] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

[55] A word association network methodology for evaluating implicit biases in LLMs compared to humans

[56] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

[57] Levée d'ambiguïtés par grammaires locales

[58] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written

[59] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

[60] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

[61] ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

[62] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

[63] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

[64] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

[65] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation

[66] Relative Scaling Laws for LLMs

[67] "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue

[68] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

[69] Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia

[70] Optimizing Retrieval for RAG via Reinforced Contrastive Learning

[71] Evolving Diagnostic Agents in a Virtual Clinical Environment

[72] MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation

[73] InteractComp: Evaluating Search Agents With Ambiguous Queries

[74] Dissecting Role Cognition in Medical LLMs via Neuronal Ablation

[75] SPICE: Self-Play In Corpus Environments Improves Reasoning

[76] Repurposing Synthetic Data for Fine-grained Search Agent Supervision

[77] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

[78] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking