Skip to content

Table of Contents

cs.CL [Back]

[1] Inconsistent Affective Reaction: Sentiment of Perception and Opinion in Urban Environments

Jingfei Huang,Han Tu

Main category: cs.CL

TL;DR: 本研究通过结合街景图像和社交媒体文本,提出了一种新的方法来分析城市环境中感知与意见之间的情感不一致,并揭示了北京二环路在2016年和2022年情感反应的变化及其与建成环境因素的关系。

Details Motivation: 现有城市情感分析方法难以捕捉人类对城市环境的多维情感反应,尤其是感知(视觉)与意见(语言)之间的差异,因此需要新方法来识别和解释这种情感不一致性。 Method: 构建包含14万张街景图像和98万条微博文本的数据集,结合目标检测与自然语言处理技术,提出情感反应指数,并利用回归分析、图像分割和词频分析对北京二环区域进行情感分类与可视化。 Result: 发现感知情感分布趋于均衡且更积极,而意见情感变化更为极端;情感错配图显示感知与意见存在显著差异,且情感变化与建筑密度、行人活动等因素密切相关。 Conclusion: 感知与意见的情感反应存在系统性差异,尤其是在疫情前后,该不一致性能为城市更新和环境管理提供重要参考。 Abstract: The ascension of social media platforms has transformed our understanding of urban environments, giving rise to nuanced variations in sentiment reaction embedded within human perception and opinion, and challenging existing multidimensional sentiment analysis approaches in urban studies. This study presents novel methodologies for identifying and elucidating sentiment inconsistency, constructing a dataset encompassing 140,750 Baidu and Tencent Street view images to measure perceptions, and 984,024 Weibo social media text posts to measure opinions. A reaction index is developed, integrating object detection and natural language processing techniques to classify sentiment in Beijing Second Ring for 2016 and 2022. Classified sentiment reaction is analysed and visualized using regression analysis, image segmentation, and word frequency based on land-use distribution to discern underlying factors. The perception affective reaction trend map reveals a shift toward more evenly distributed positive sentiment, while the opinion affective reaction trend map shows more extreme changes. Our mismatch map indicates significant disparities between the sentiments of human perception and opinion of urban areas over the years. Changes in sentiment reactions have significant relationships with elements such as dense buildings and pedestrian presence. Our inconsistent maps present perception and opinion sentiments before and after the pandemic and offer potential explanations and directions for environmental management, in formulating strategies for urban renewal.

[2] Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Mufei Li,Dongqi Fu,Limei Wang,Si Zhang,Hanqing Zeng,Kaan Sancak,Ruizhong Qiu,Haoyu Wang,Xiaoxin He,Xavier Bresson,Yinglong Xia,Chonglin Sun,Pan Li

Main category: cs.CL

TL;DR: HaystackCraft 是一个基于英文维基百科超链接网络的新颖长上下文基准测试,用于评估大语言模型在噪声上下文中的鲁棒性,揭示了现有模型在代理式工作流中仍存在级联错误和自生成干扰问题。

Details Motivation: 现有的‘针在 haystack 中’(NIAH)基准测试忽略了现实场景中由偏差检索和代理工作流带来的噪声上下文问题,因此需要更贴近真实情况的测试方法。 Method: 提出 HaystackCraft,利用维基百科的超链接网络构建多跳问题,并模拟多种检索策略(稀疏、密集、混合、基于图)及动态代理操作(如查询优化、推理反思、停止决策),以评估模型在噪声上下文中的表现。 Result: 实验表明:更强的密集检索器可能引入更具挑战性的干扰项,而基于图的重排序能有效缓解有害干扰;在代理测试中,即使是 Gemini 2.5 Pro 和 GPT-5 等先进模型也容易因自生成干扰或无法及时停止而出现级联失败。 Conclusion: HaystackCraft 揭示了当前长上下文模型在真实复杂环境下的局限性,强调了构建更具现实代表性的评测基准的重要性,为未来研究提供了有价值的测试平台。 Abstract: Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

[3] Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data

Olia Toporkov,Alan Akbik,Rodrigo Agerri

Main category: cs.CL

TL;DR: 本文研究了大型语言模型(LLM)在上下文词形还原任务中的表现,发现在12种不同形态复杂度的语言中,无需微调、仅通过少量示例的上下文生成即可达到最先进的效果。

Details Motivation: 探索大型语言模型在缺乏监督训练数据的目标领域或语言中进行上下文词形还原的有效性。 Method: 比较了基于编码器的监督方法(跨领域微调)、跨语言方法与大型语言模型的上下文内词形生成方法。 Result: 实验表明,在多数语言中,无需微调的大型语言模型通过上下文学习可达到最优性能,而传统编码器模型在跨领域微调后仍具竞争力。 Conclusion: 当前的大型语言模型在上下文词形还原任务中表现优异,尤其在缺乏目标领域标注数据时,展现出优于传统方法的潜力。 Abstract: Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: https://github.com/oltoporkov/lemma-dilemma

[4] LASER: An LLM-based ASR Scoring and Evaluation Rubric

Amruta Parulekar,Preethi Jyothi

Main category: cs.CL

TL;DR: 提出了一种基于大语言模型的ASR评估方法LASER,利用上下文学习能力减少对不影响语义的语言细节错误的过度惩罚,在印度语言中表现出与人类标注高度相关的结果。

Details Motivation: 传统ASR评估指标如词错误率(WER)会不公平地惩罚不影响语义的形态和句法差异,因此需要一种更语义感知的评估方式。 Method: 设计了一个基于LLM的评分标准LASER,利用大模型的上下文学习能力,通过包含详细示例的提示进行学习;同时对Llama 3等较小模型进行微调,以预测应施加的惩罚类型。 Result: Gemini 2.5 Pro在印地语上的LASER评分与人工标注的相关性高达94%;提示中的印地语示例也有效适用于其他印度语言如马拉地语、卡纳达语和马拉雅拉姆语;微调后的Llama 3在词对判断上准确率达89%。 Conclusion: LASER提供了一种更符合语义一致性的ASR评估方法,具有跨语言适用性和在小模型上的可微调性,显著优于传统WER指标。 Abstract: Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.

[5] Meaningful Pose-Based Sign Language Evaluation

Zifan Jiang,Colin Leong,Amit Moryossef,Anne Göhring,Annette Rios,Oliver Cory,Maksym Ivashechkin,Neha Tarigopula,Biao Zhang,Rico Sennrich,Sarah Ebling

Main category: cs.CL

TL;DR: 本文研究了基于人体骨骼姿态的手语表达评估方法,比较了关键点距离、嵌入和回译等多种指标的优劣,并通过自动元评估和人类相关性研究验证其有效性。

Details Motivation: 手语翻译和生成系统的评估缺乏统一、有效的度量标准,现有方法在不同场景下表现不一,需要系统性比较和验证。 Method: 采用关键点距离、嵌入空间相似性和回译三种评估方法,在多个手语数据集上进行句子级检索的自动元评估,并开展文本到姿态翻译的人类相关性研究。 Result: 揭示了不同评估指标在不同场景下的权衡关系,发现某些嵌入方法与人类判断具有更高相关性,同时开源了一个姿态评估工具包。 Conclusion: 推荐结合多种评估指标以更全面地衡量手语生成质量,所提出的工具包有助于推动手语翻译系统的可重复研究与开发。 Abstract: We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.

[6] Populism Meets AI: Advancing Populism Research with LLMs

Eduardo Ryô Tamaki,Yujin J. Jung,Julia Chatterley,Grant Mitchell,Semir Dzebo,Cristóbal Sandoval,Levente Littvay,Kirk A. Hawkins

Main category: cs.CL

TL;DR: 提出一种基于思维链提示的领域特定方法,利用大型语言模型复制专家编码员对民粹主义话语的评分,准确度与人类专家相当。

Details Motivation: 传统文本分析方法在跨语言、跨语境和大规模语料库中测量民粹主义具有成本高、耗时长、难以扩展的问题,因此需要更高效的方法。 Method: 采用基于评分标准和锚点引导的思维链(CoT)提示策略,模仿人类编码员培训过程,使用全球民粹主义数据库(GPD)中的标注数据指导大模型推理,并测试多种闭源和开源模型对GPD评分的复现能力。 Result: 该方法使大语言模型在民粹主义话语分类上的准确率与专家人类编码员相当,显示出其处理民粹主义复杂性和语境敏感性的能力。 Conclusion: 领域特定的思维链提示策略能有效提升大语言模型在意识形态内容分析中的表现,为民粹主义的大规模跨语言研究提供了可行且高效的工具。 Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.

[7] MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference

Zheyuan Zhang,Lin Ge,Hongjiang Li,Weicheng Zhu,Chuxu Zhang,Yanfang Ye

Main category: cs.CL

TL;DR: 本文提出了多智能体提示优化框架MAPRO,通过将多智能体系统提示优化建模为最大后验推断问题,并采用语言引导的max-product置信传播算法求解,显著提升了多智能体系统的性能和稳定性。

Details Motivation: 设计高效的多智能体系统因提示敏感性和复合不稳定性而困难,现有自动化提示设计方法在多智能体场景下仍不足,缺乏系统性优化方法。 Method: 提出MAPRO框架,包含四阶段流程:将提示优化建模为MAP推断问题,使用语言引导的max-product置信传播算法;引入拓扑感知的精细化机制,结合执行反馈与下游归因来选择性更新智能体提示。 Result: 在多个任务基准上,MAPRO实现了最先进的性能,持续优于人工设计基线和最新自动化方法。 Conclusion: MAPRO为多智能体提示优化提供了有效解决方案,其基于MAP的建模方式也为构建更可靠、原则性的多智能体系统提供了通用指导。 Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce M}ulti-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of max-product belief propagation algorithm. To address credit assignment and updates the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future

[8] AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding

Shuqing Luo,Yilin Guan,Pingzhi Li,Hanrui Wang,Tianlong Chen

Main category: cs.CL

TL;DR: 本文提出了AsyncSpade,一种异步框架,通过解耦KV缓存过滤与自回归解码循环,实现高效的测试时扩展(TTS),在不牺牲模型性能的前提下显著降低延迟并提升吞吐。

Details Motivation: 现有的查询感知稀疏解码方法受限于页面级过滤的序列依赖性和粗粒度的token选择,在高并发和长思维链场景下效率低下,甚至运行时间超过前向推理本身。 Method: 提出AsyncSpade,包含两个核心组件:(1) 轻量级时序回归模块,预测下一token的查询状态;(2) 异步解耦框架,将KV缓存过滤与解码循环分离,实现KV选择与前向计算的重叠。 Result: 在A100节点上验证,AsyncSpade完全重叠了KV缓存操作与推理流水线,相比当前最优方法Quest减少20%以上每输出token时间(TPOT),相比全注意力减少至少50%,并在多个TTS基准上保持或超越其准确率。 Conclusion: AsyncSpade是首个消除序列依赖且不损失性能的TTS加速框架,实现了理论最优的TPOT,显著提升了长CoT和高并发场景下的LLM推理效率。 Abstract: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretical optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).

[9] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics

Rasika Muralidharan,Jaewoon Kwak,Jisun An

Main category: cs.CL

TL;DR: 提出了一种多智能体框架,用于研究团队科学中的结构、多样性和交互动态,发现扁平化团队表现优于层级化团队,多样性影响复杂,且智能体对团队表现过于自信。

Details Motivation: 受人类团队科学启发,探索大语言模型驱动的多智能体系统在团队协作中的动态特性。 Method: 设计并评估了在CommonsenseQA、StrategyQA、Social IQa和Latent Implicit Hate四个任务上的多智能体团队框架,分析团队结构、多样性和交互模式。 Result: 扁平化团队表现优于层级化团队;多样性影响较为复杂;智能体在事后反思中表现出对协作的认可,但也暴露出协调不足等问题。 Conclusion: 多智能体系统的团队结构显著影响性能,扁平化结构更优,未来需改进交互协调机制以提升合作效果。 Abstract: Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.

[10] Can Speech LLMs Think while Listening?

Yi-Jen Shih,Desh Raj,Chunyang Wu,Wei Zhou,SK Bong,Yashesh Gaur,Jay Mahadeokar,Ozlem Kalinli,Mike Seltzer

Main category: cs.CL

TL;DR: 本文研究了多流语音大语言模型中的思维链(CoT)微调方法,提出通过在文本空间中进行推理以提升语音LLM在口语推理任务上的准确性,并引入基于熵的“问题完整性”指标,在用户提问结束前启动推理以减少响应延迟,结合DPO优化进一步在不损失精度的情况下显著降低延迟。

Details Motivation: 尽管语音大语言模型在口语交互方面取得进展,但在复杂推理任务上仍表现不佳,且响应延迟影响用户体验,因此需要提升其推理能力并优化准确率与延迟之间的权衡。 Method: 采用思维链(CoT)微调多流语音LLM,在文本空间中进行推理;提出基于熵的“问题完整性”指标,用于判断在用户说话过程中何时开始推理;使用拒绝采样构建偏好数据,并应用直接偏好优化(DPO)进一步优化准确率与延迟的权衡。 Result: CoT微调使语音LLM在口语推理任务上的准确率平均提升2.4倍;所提方法在等效延迟下使ARC-Easy准确率提升4%;结合DPO实现延迟减少70%且无精度损失。 Conclusion: 在多流语音LLM中,通过文本空间的CoT推理、提前推理机制和DPO优化,可有效提升推理准确率并显著降低响应延迟,为语音助手等实时应用提供了更优的解决方案。 Abstract: Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been to shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.

[11] When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Soyeong Jeong,Taehee Jung,Sung Ju Hwang,Joo-Kyung Kim,Dongyeop Kang

Main category: cs.CL

TL;DR: 提出Thought Template Augmented LCLMs (ToTAL)框架,通过可复用的思维模板和自然语言反馈优化多跳推理,提升长上下文语言模型在知识密集型任务中的表现。

Details Motivation: 现有长上下文语言模型虽能处理大量文本,但缺乏有效连接证据进行多跳推理的能力,导致推理效果受限。 Method: 引入思维模板将推理过程结构化,利用先前问题求解轨迹生成可复用的思维缓存,并通过自然语言反馈迭代优化模板。 Result: 在多种基准和LCLM模型上,ToTAL在检索与非检索场景中均优于强基线,且可将优化后的模板蒸馏至小型开源模型。 Conclusion: ToTAL有效提升了长上下文语言模型在知识密集型多跳推理任务中的性能,具备广泛适用性和可解释性。 Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).

[12] ParsTranslit: Truly Versatile Tajik-Farsi Transliteration

Rayyan Merchant,Kevin Tang

Main category: cs.CL

TL;DR: 本文提出了一种新的序列到序列模型,用于波斯语(Farsi)与塔吉克语(Tajik)之间的文字转写,训练数据涵盖所有可用数据集,并发布了两个新数据集。实验结果在多个领域展现了当前任务的真实难度,并建立了全面可比的基准。模型在双向转写中取得了优异的chrF++和归一化CER分数。

Details Motivation: 由于波斯语在不同国家使用不同的书写系统(波斯-阿拉伯文与西里尔文),导致书面交流困难。现有转写模型受限于特定领域数据,缺乏跨领域的通用性,难以实际应用。因此需要一个能处理多样化文本领域的高效、通用转写系统。 Method: 采用序列到序列(sequence-to-sequence)架构,统一训练所有可用的波斯语-塔吉克语转写数据集,并引入两个新构建的数据集,以提升模型在不同文本领域中的泛化能力。 Result: 模型在从Farsi到Tajik方向取得87.91的chrF++和0.05的归一化CER;从Tajik到Farsi方向取得92.28的chrF++和0.04的归一化CER,显著优于先前方法,并在多领域测试中展现出更强的适应性和稳定性。 Conclusion: 本文提出的模型是目前波斯语-塔吉克语转写的最优方案,具备良好的跨领域适用性,为未来研究提供了公开可用的数据、代码和评估基准。 Abstract: As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking ``siblings''. To overcome this, previously-published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that suck models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task's true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://anonymous.4open.science/r/ParsTranslit-FB30/.

[13] OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

Jaeseong Lee,seung-won hwang,Aurick Qiao,Gabriele Oliaro,Ye Wang,Samyam Rajbhandari

Main category: cs.CL

TL;DR: 本文提出了一种针对长上下文场景的新型推测解码模型OWL,并发布了长上下文基准测试LongSpecBench,解决了现有方法在长上下文下性能下降的问题。

Details Motivation: 现有的推测解码方法在短上下文基准上表现良好,但在实际的长上下文应用中性能显著下降,缺乏通用性。 Method: 提出OWL模型,包含三个创新:基于LSTM且仅依赖最后token状态的drafting模型、在verifier中引入[SPEC]特殊token以增强表示、结合树与非树解码的混合算法;同时构建了LongSpecBench基准。 Result: OWL在长上下文输入下的接受长度比EAGLE3高约5倍,且有效提升了生成速度,而EAGLE3在长上下文中甚至变慢至0.81倍。 Conclusion: OWL通过结构和算法创新显著提升了长上下文下的推测解码效率,具备良好的泛化能力,推动了该方向的研究发展。 Abstract: Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.

[14] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Md Tahmid Rahman Laskar,Mohammed Saidul Islam,Ridwan Mahbub,Mizanur Rahman,Amran Bhuiyan,Israt Jahan,Mir Tafseer Nayeem,Shafiq Joty,Enamul Hoque,Jimmy Huang

Main category: cs.CL

TL;DR: 提出多标准提示和领域自适应迁移学习方法,提升2B参数量级的视觉-语言模型在图表理解任务中的评估能力。

Details Motivation: 小型模型(≤2B参数)在作为自动评判模型时表现不佳,限制了其在资源受限场景下的实际应用。 Method: 采用多标准提示将多个评估标准整合到单个查询中,并通过在合成判断数据集上微调2B参数的LVLM实现领域自适应迁移学习,构建ChartJudge模型。 Result: 实验表明,多标准提示暴露了7B模型的鲁棒性缺陷,而所提出的ChartJudge在跨数据集知识迁移方面表现优异,提升了小型模型的评估性能。 Conclusion: 通过改进提示设计和迁移学习,小型LVLM可在图表推理任务中实现高效、低成本的评估,具备良好的可扩展性和实用性。 Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.

[15] Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER

Junyi Zhu,Savas Ozkan,Andrea Maracani,Sinan Mutlu,Cho Jung Min,Mete Ozay

Main category: cs.CL

TL;DR: 提出基于任务主LoRA模块的多任务预微调框架,以提升轻量级BERT编码器在命名实体识别和文本分类中的适应性和效率。

Details Motivation: 在移动平台上部署NLP模型需要兼顾跨应用适应性与计算效率,但传统的多任务预微调会因优化信号冲突而降低性能。 Method: 设计一种基于任务主LoRA模块的多任务预微调框架,使用共享编码器主干和模块化适配器来避免优化冲突。 Result: 在21个下游任务上实验显示,相比传统方法,NER平均提升0.8%,文本分类平均提升8.8%。 Conclusion: 该方法在满足部署约束的同时,实现了与单任务预微调相当的性能,有效支持多样化的移动端NLP应用。 Abstract: Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that na\"ive multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraint. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.

Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao

Main category: cs.CL

TL;DR: 该研究通过计算语言学方法分析了与疫情相关的网络言论,比较了健康虚假信息与事实传播在可读性、修辞标记和说服性语言使用上的差异,发现虚假信息更具复杂修辞并嵌入情感线索,可能增强其可信度。

Details Motivation: 识别健康虚假信息的语言特征,以支持其检测并改进公共健康传播策略。 Method: 基于三个语料库(COVID-19虚假叙述、一般COVID-19内容和猴痘相关帖子)进行计算语言学分析,比较可读性、情感词汇和修辞特征。 Result: COVID-19虚假信息可读性更低,恐惧和说服性词汇频率高出两倍以上,感叹号使用较少;猴痘内容则更情绪化。虚假信息采用复杂修辞结合情感线索的策略,可能提升其表面可信度。 Conclusion: 语言特征可作为数字健康虚假信息的识别指标,有助于改进检测工具和公共健康危机沟通策略,但需在动态、跨平台环境中进一步验证。 Abstract: This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.

[17] IASC: Interactive Agentic System for ConLangs

Chihiro Taguchi,Richard Sproat

Main category: cs.CL

TL;DR: 本文提出一个基于大语言模型(LLM)的模块化系统,用于构造人工语言,涵盖语音、形态句法、词典、正字法及语法手册生成,并探讨LLM对语言学概念的理解能力及其在高低资源语言翻译中的潜在应用。

Details Motivation: 开发有趣且实用的工具以辅助人工语言构建,同时探究大语言模型对语言和语言学概念的普遍理解程度,而非特定语言知识。 Method: 采用模块化代理方法:首先通过迭代反馈建立目标音系;然后将英语句子转换为反映目标语言形态句法结构的标记形式;接着结合音系模型和提取的语素生成词汇库;再设计适合的书写系统;最后生成简明语法手册并实现新句子的翻译。 Result: 系统能有效生成人工语言的多个组成部分,不同LLM在处理常见与罕见语言结构时表现差异显著;初步尝试应用于高低资源语言翻译效果不佳,但显示改进后可能带来实际收益。 Conclusion: 该系统不仅为构语爱好者提供有趣工具,也揭示了LLM在语言学知识建模方面的潜力与局限,为进一步优化及跨语言应用提供了方向。 Abstract: We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC

[18] Vocabulary embeddings organize linguistic structure early in language model training

Isabel Papadimitriou,Jacob Prince

Main category: cs.CL

TL;DR: 研究了大语言模型在训练过程中输入词汇表示的几何结构如何随时间演变,发现语义和句法特征的相关性迅速建立,高频词和功能词比低频词更快收敛。

Details Motivation: 探究大语言模型在训练过程中词汇表示的结构演化机制,理解语言结构如何在嵌入空间中形成。 Method: 使用表征相似性分析,对Pythia 12B和OLMo 7B两个开源模型的输入和输出嵌入进行实验,关联其几何结构与语义、句法及频率指标。 Result: 1) 嵌入几何结构在训练早期即与语义和句法特征高度相关;2) 高频词和功能词收敛更快,低频词仍保留部分初始随机偏置的影响。 Conclusion: 词汇嵌入的几何结构在训练中快速组织成语言相关的结构,词频和功能在其中起关键作用,提示应进一步研究嵌入演化与模型能力提升的关系。 Abstract: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.

[19] Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation

Zhangdie Yuan,Han-Chin Shing,Mitch Strong,Chaitanya Shivade

Main category: cs.CL

TL;DR: 本文提出临床编码验证作为改进LLM在医疗编码中表现的新方法,通过轻量级干预减少层次性错误,并发布了一个专家双标注的门诊临床笔记基准数据集。

Details Motivation: 现有研究表明大语言模型在临床编码任务中表现不佳,且传统评估指标忽略层次相近的错误预测;同时现有数据集存在证据不全和住院数据偏差问题。 Method: 采用提示工程和小规模微调等轻量级干预方法,并引入临床编码验证任务,结合新发布的专家双标注门诊数据集进行评估。 Result: 轻量级干预有效提升编码准确率,验证机制能显著减少层次性近似错误,新数据集缓解了原有数据偏差问题。 Conclusion: 临床编码验证是提升LLM在医疗编码中可靠性和准确性的有效步骤,结合高质量数据集可推动该领域发展。 Abstract: Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.

[20] Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models

Đorđe Klisura,Joseph Khoury,Ashish Kundu,Ram Krishnan,Anthony Rios

Main category: cs.CL

TL;DR: 研究了大语言模型在角色条件下的拒绝行为,提出并评估了三种设计方法以实现基于角色的访问控制(RBAC),并在扩展的Spider和BIRD数据集上进行实验,结果表明显式验证能提高拒绝精度,而微调则在安全与实用性间取得更好平衡。

Details Motivation: 大语言模型常因生成无限制响应而模糊角色边界,缺乏对访问控制策略的遵循,导致潜在安全风险。 Method: 构建了一个扩展Spider和BIRD数据集的新数据集,引入真实的PostgreSQL基于角色的表级和列级策略;比较三种方法:零样本/少样本提示、两步生成-验证流程、以及使用LoRA微调的模型。 Result: 显式验证方法提高了拒绝精度并减少了错误授权;LoRA微调在保持较高执行准确率的同时实现了更好的安全与效用平衡;所有系统在面对更长更复杂的策略时性能均下降。 Conclusion: 通过显式验证或微调可提升LLM在角色条件拒绝任务中的表现,但复杂策略仍构成挑战,未来需进一步增强模型对精细权限政策的理解与遵守能力。 Abstract: Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM's ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.

[21] Banking Done Right: Redefining Retail Banking with Language-Centric AI

Xin Jie Chua,Jeraelyn Ming Li Tan,Jia Xuan Tan,Soon Chang Poh,Yi Xian Goh,Debbie Hui Tian Choong,Chee Mun Foong,Sze Jue Yang,Chee Seng Chan

Main category: cs.CL

TL;DR: Ryt AI 是一个基于自研大模型 ILMU 的对话式AI代理框架,首次实现全球监管批准的自然语言作为主要银行界面,支持核心金融交易。

Details Motivation: 传统银行助手局限于咨询或支持角色,缺乏对核心金融操作的直接执行能力,且难以满足严格的安全与合规要求。 Method: 构建全自研的闭源大模型 ILMU,并设计四个基于LoRA适配器的LLM代理(Guardrails、Intent、Payment、FAQ),通过单一对话流程替代传统多页面操作,在银行内部部署以确保安全可控。 Result: 成功部署全球首个获监管批准的自然语言银行接口,实现核心金融交易的自动化处理,同时通过确定性防护机制、人机协同确认和无状态审计架构保障安全性与合规性。 Conclusion: 该框架证明在严格治理下,符合监管要求的自然语言界面可可靠地支持核心银行业务,推动银行服务向更高效、直观的模式演进。 Abstract: This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.

[22] OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Yuzhe Gu,Xiyu Liang,Jiaojiao Zhao,Enmao Diao

Main category: cs.CL

TL;DR: 提出了一种名为OBCache的新框架,通过基于Optimal Brain Damage理论的层间结构化剪枝方法,量化注意力输出扰动来优化KV缓存淘汰,显著提升了长上下文场景下的模型准确性。

Details Motivation: 现有KV缓存淘汰方法仅使用启发式注意力权重评估token重要性,未考虑其对注意力输出的真实影响,导致精度损失。 Method: 将缓存淘汰建模为层间结构化剪枝问题,基于OBD理论推导出闭式解,分别衡量键、值及键值对删除对注意力输出的扰动,引入输出感知信号改进淘汰策略。 Result: 在LLaMA和Qwen模型上的实验表明,用OBCache的评分替换原有启发式评分,能一致地提升长上下文任务的准确性。 Conclusion: OBCache通过更精确的token显著性评估机制,有效改善了大模型在扩展上下文窗口下的缓存效率与性能平衡。 Abstract: Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.

[23] Textual Entailment and Token Probability as Bias Evaluation Metrics

Virginia K. Felkner,Allison Lim,Jonathan May

Main category: cs.CL

TL;DR: 本文探讨了使用自然语言推断(NLI)作为语言模型社会偏见测量的替代方法,发现NLI与传统的词元概率(TP)指标在评估偏见时表现差异显著,且相关性很低。NLI更可能检测到“去偏不足”的情况,但对反刻板印象句子的措辞更敏感、更脆弱。研究建议结合TP、NLI和下游偏见评估以实现全面评估。

Details Motivation: 由于传统的词元概率(TP)偏见度量方法与实际语言模型应用场景和危害关联较远,本文旨在探索更贴近现实的自然语言推断(NLI)作为替代偏见评估指标的有效性。 Method: 通过比较不同NLI度量方法以及NLI与TP度量之间的相关性,分析它们在检测语言模型社会偏见方面的表现差异,并测试NLI对反刻板印象句子表述变化的敏感性。 Result: NLI与TP偏见评估方法之间表现出极低的相关性;NLI更倾向于检测出‘去偏不足’的情况,但对反刻板印象句子的具体措辞更为敏感和脆弱。 Conclusion: TP和NLI都不是在所有情况下都优越的偏见度量方法,推荐结合TP、NLI及下游任务的偏见评估,以确保对语言模型偏见的全面衡量。 Abstract: Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world langugage model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.

[24] Stress-Testing Model Specs Reveals Character Differences among Language Models

Jifan Zhang,Henry Sleight,Andi Peng,John Schulman,Esin Durmus

Main category: cs.CL

TL;DR: 本文提出了一种系统性方法来压力测试大型语言模型的行为规范,揭示了现有模型规范中存在大量原则矛盾和解释模糊的问题,并通过生成价值权衡场景发现了超过70,000个行为分歧案例。

Details Motivation: 现有的AI行为规范常面临原则间冲突和覆盖不足的问题,缺乏有效方法检测这些问题,因此需要一种系统性方法来识别模型规范中的缺陷。 Method: 构建了一个全面的价值观分类体系,生成迫使模型在相互竞争的原则之间做出权衡的场景,并对12个前沿大模型进行评估,使用价值分类分数衡量其行为差异。 Result: 发现了超过70,000个显著行为分歧案例,验证了高行为分歧可预测模型规范中的问题,并通过定性分析揭示了规范中的直接矛盾、解释模糊、错位对齐和假拒绝等问题,同时总结了不同模型的价值优先级模式。 Conclusion: 当前大型语言模型的规范存在严重缺陷,需更精细的设计与验证机制,本文提供的方法和数据集有助于改进未来模型的行为一致性与对齐性。 Abstract: Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.

[25] Large Language Models Meet Virtual Cell: A Survey

Krinos Li,Xianglu Xiao,Shenglong Deng,Lucas He,Zijun Zhong,Yuanjie Zou,Zhonghao Zhan,Zheng Hui,Weiye Bao,Guang Yang

Main category: cs.CL

TL;DR: 本文综述了大型语言模型(LLMs)在虚拟细胞建模中的应用,提出了将现有方法分为“LLM作为预言机”和“LLM作为代理”的统一分类法,并探讨了细胞表征、扰动预测和基因调控推断三大核心任务及相关挑战。

Details Motivation: 随着LLMs在多个领域的成功,将其应用于复杂的细胞生物学问题,以构建能够表示、预测和推理细胞状态与行为的“虚拟细胞”系统成为可能,但缺乏系统性梳理与分类。 Method: 提出一个统一的分类体系,将LLM在虚拟细胞中的应用分为两类:作为直接建模工具的‘预言机’和用于协调复杂科学任务的‘代理’;并围绕三个核心任务进行系统综述。 Result: 总结了当前用于虚拟细胞建模的LLM方法、数据集和评估基准,识别出在可扩展性、泛化性和可解释性方面的主要挑战。 Conclusion: 该综述为LLM在细胞生物学中的进一步发展提供了结构化视角和未来研究方向。 Abstract: Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.

[26] Causality Guided Representation Learning for Cross-Style Hate Speech Detection

Chengshuai Zhao,Shu Wan,Paras Sheth,Karan Patwa,K. Selçuk Candan,Huan Liu

Main category: cs.CL

TL;DR: 本文提出了一种基于因果表示学习的隐式仇恨言论检测框架CADET,通过解耦上下文、动机、目标和风格等潜在因素,有效提升跨风格和跨平台仇恨言论检测的泛化能力。

Details Motivation: 现有仇恨言论检测模型依赖表面语言特征,难以应对不同风格和平台上的隐式仇恨言论,且易受虚假相关性影响。 Method: 基于因果图假设,提出CADET框架,将仇恨言论生成建模为包含环境、动机、目标和风格的因果过程,并在潜在空间中进行去混杂和反事实推理,以分离真实仇恨意图与表层语言线索。 Result: 实验表明CADET在多种设定下均优于现有方法,展现出因果先验在提升仇恨言论检测鲁棒性和可解释性方面的潜力。 Conclusion: 通过引入因果表示学习,CADET能够更准确地识别隐式仇恨言论,支持跨风格和跨平台的泛化检测,为内容审核提供了更具鲁棒性的解决方案。 Abstract: The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.

[27] MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation

Shuo Yu,Mingyue Cheng,Daoyu Wang,Qi Liu,Zirui Liu,Ze Guo,Xiaoyu Tao

Main category: cs.CL

TL;DR: 本文提出了MemWeaver框架,通过构建包含行为记忆和认知记忆的分层记忆结构,对用户文本历史进行建模,以实现深度个性化生成。

Details Motivation: 现有方法将用户历史视为扁平文本列表,未能捕捉用户兴趣的时间演化和语义关系,导致个性化程度较浅。 Method: MemWeaver构建了两个融合时间与语义信息的互补记忆组件:行为记忆(捕捉具体用户行为)和认知记忆(表征长期偏好),形成统一的用户表示,供大语言模型推理使用。 Result: 在LaMP基准上的实验验证了MemWeaver的有效性,显著优于现有方法。 Conclusion: MemWeaver通过分层记忆结构有效建模用户文本历史中的时序与语义特征,提升了个性化生成的效果。 Abstract: The primary form of user-internet engagement is shifting from leveraging implicit feedback signals, such as browsing and clicks, to harnessing the rich explicit feedback provided by textual interactive behaviors. This shift unlocks a rich source of user textual history, presenting a profound opportunity for a deeper form of personalization. However, prevailing approaches offer only a shallow form of personalization, as they treat user history as a flat list of texts for retrieval and fail to model the rich temporal and semantic structures reflecting dynamic nature of user interests. In this work, we propose \textbf{MemWeaver}, a framework that weaves the user's entire textual history into a hierarchical memory to power deeply personalized generation. The core innovation of our memory lies in its ability to capture both the temporal evolution of interests and the semantic relationships between different activities. To achieve this, MemWeaver builds two complementary memory components that both integrate temporal and semantic information, but at different levels of abstraction: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. This dual-component memory serves as a unified representation of the user, allowing large language models (LLMs) to reason over both concrete behaviors and abstracted traits. Experiments on the Language Model Personalization (LaMP) benchmark validate the efficacy of MemWeaver. Our code is available\footnote{https://github.com/fishsure/MemWeaver}.

[28] SUBQRAG: sub-question driven dynamic graph rag

Jiaoyang Li,Junhao Ruan,Shengwei Tang,Saihan Chen,Kaiyan Chang,Yuan Ge,Tong Xiao,Jingbo Zhu

Main category: cs.CL

TL;DR: 提出SubQRAG,一种基于子问题驱动的图检索增强生成框架,通过分解复杂问题、动态扩展知识图并构建图记忆来提升多跳问答的推理深度和准确性。

Details Motivation: 现有Graph RAG在处理复杂多跳问答时缺乏深层结构化推理,导致证据不全和错误累积。 Method: 将复杂问题分解为有序的可验证子问题,针对每个子问题从知识图中检索三元组,并在图信息不足时实时从原文档中提取新三元组以动态扩展图;所有用于推理的三元组被聚合为“图记忆”,形成可追溯的证据路径。 Result: 在三个多跳问答基准上的实验表明,SubQRAG在Exact Match等指标上实现了持续且显著的提升。 Conclusion: SubQRAG通过子问题驱动和动态图扩展机制,增强了复杂问答中的结构化推理能力,提高了答案的准确性和可解释性。 Abstract: Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a knowledge graph (KG) to connect disparate facts across a large document corpus. However, this broad-view approach often lacks the deep structured reasoning needed for complex multi-hop question answering (QA), leading to incomplete evidence and error accumulation. To address these limitations, we propose SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG decomposes a complex question into an ordered chain of verifiable sub-questions. For each sub-question, it retrieves relevant triples from the graph. When the existing graph is insufficient, the system dynamically expands it by extracting new triples from source documents in real time. All triples used in the reasoning process are aggregated into a "graph memory," forming a structured and traceable evidence path for final answer generation. Experiments on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent and significant improvements, especially in Exact Match scores.

[29] Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing

Cunli Mao,Xiaofei Gao,Ran Song,Shizhu He,Shengxiang Gao,Kang Liu,Zhengtao Yu

Main category: cs.CL

TL;DR: 本文提出了一种新的多语言知识图谱补全(MKGC)框架,通过知识级分组专家混合(KL-GMoE)和迭代实体重排序(IER)来利用多语言共享知识,显著提升了性能。实验结果表明,该方法在Hits@1、Hits@3和Hits@10指标上均优于现有最先进方法。

Details Motivation: 现有MKGC研究未能充分利用大语言模型的多语言能力,且忽视了跨语言知识的可共享性。 Method: 提出包含KL-GMoE和IER两个组件的新框架,KL-GMoE用于高效建模共享知识,IER用于增强其利用效果。 Result: 在包含5种语言的mKG数据集上,相比现有SOTA方法,Hits@1、Hits@3和Hits@10分别提升了5.47%、3.27%和1.01%。 Conclusion: 所提出的框架有效利用多语言共享知识,显著提升MKGC性能,并展现出对未见语言和不平衡语言设置的良好适应性。 Abstract: Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs' multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work on https://github.com/gaoxiaofei07/KL-GMoE.

[30] ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Fu Chen,Peng Wang,Xiyin Li,Wen Li,Shichi Lei,Dongdong Xiang

Main category: cs.CL

TL;DR: 提出ToolExpander框架,通过动态多轮难例采样和自示范思维机制,提升小规模大模型在GRPO训练中的稳定性与工具使用能力。

Details Motivation: 解决小规模架构下GRPO训练中模型难以生成准确响应、易发生训练崩溃的问题,提升训练稳定性和最终性能。 Method: 1) 动态多轮难例采样:用高质量少样本示例替换无正确输出的困难样本,并结合指数学习率衰减抑制振荡;2) 自示范思维:去除KL散度,引入调整后的裁剪系数,通过微小奖励(0.01)激励模型自主生成并分析少样本示例。 Result: 实验表明ToolExpander显著提升了小规模LLM的工具使用能力,增强了训练稳定性和整体性能,尤其在较弱模型上效果明显。 Conclusion: ToolExpander有效克服了GRPO在小规模模型上的局限性,为资源受限的LLM提供了一种高效的工具导向强化学习方案。 Abstract: Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations:(1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples(those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations;(2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.

[31] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Tianci Liu,Ran Xu,Tony Yu,Ilgee Hong,Carl Yang,Tuo Zhao,Haoyu Wang

Main category: cs.CL

TL;DR: 本文提出了OpenRubrics,一个大规模的(提示,评分标准)数据集,并引入对比评分标准生成(CRG)方法来提升奖励建模的效果。

Details Motivation: 现有奖励模型依赖标量或成对判断,难以捕捉人类偏好的多维性,且结构化评分标准(rubrics)的生成面临可靠性和可扩展性挑战。 Method: 提出Contrastive Rubric Generation(CRG),通过对比优选和被拒响应提取硬规则和隐含原则,并利用拒绝采样确保评分标准与偏好标签一致,构建OpenRubrics数据集用于训练评分标准生成与奖励模型。 Result: 所提出的Rubric-RM在多个奖励建模基准上超越强基线6.8%,并在指令遵循和生物医学任务中将收益传递至策略模型。 Conclusion: 评分标准提供了可扩展的对齐信号,缩小了人工评估与自动奖励建模之间的差距,推动了以原则驱动的大模型对齐新范式。 Abstract: Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured natural language criteria that capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.

[32] Parallel Test-Time Scaling for Latent Reasoning Models

Runyang You,Yongqi Li,Meng Liu,Wenjie Wang,Liqiang Nie,Wenjie Li

Main category: cs.CL

TL;DR: 本研究提出了一种用于潜在推理模型的并行测试时扩展(Parallel TTS)方法,通过引入基于不确定性的采样策略和潜在奖励模型(LatentRM)实现可扩展的连续空间推理。

Details Motivation: 现有的并行TTS方法主要依赖显式的基于token的思维链,而潜在推理模型在连续向量空间中进行推理更高效,但缺乏适用于连续空间的采样机制和轨迹聚合的概率信号,限制了其在并行TTS中的应用。 Method: 提出了两种基于不确定性的随机采样策略:蒙特卡洛Dropout和加性高斯噪声;设计了一个通过步级对比目标训练的潜在奖励模型(LatentRM),用于评分和引导潜在推理轨迹的聚合。 Result: 实验和可视化分析表明,两种采样策略能有效随计算资源扩展,并表现出不同的探索动态;LatentRM能够有效选择高质量的推理轨迹。 Conclusion: 该工作成功将并行TTS应用于潜在推理模型,为连续空间中的可扩展推理提供了新方向。 Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. \ This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.

[33] Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers

Nishant Balepur,Atrey Desai,Rachel Rudinger

Main category: cs.CL

TL;DR: 研究表明,大语言模型在仅凭选项(choices-only)的情况下也能成功回答多项选择题,且推理过程揭示了其使用了如推断缺失问题等非浅层策略,挑战了部分输入成功总是缺陷的观点。

Details Motivation: 探讨大语言模型在不依赖问题文本、仅凭选项情况下仍能成功作答的现象,分析其推理过程是否真正浅薄。 Method: 通过对比大语言模型在完整输入和仅选项输入下的推理表现,分析推理轨迹的忠实性及其策略深度。 Result: 发现半数情况下测试时推理提升了准确率,且推理轨迹长度对结果影响小,通过忠实性测试表明模型使用了合理的推理策略。 Conclusion: 部分输入成功并不总是模型缺陷,推理轨迹有助于区分有问题的数据与较合理的推理行为。 Abstract: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often deemed problematic, but reasoning traces could reveal if these strategies are truly shallow in choices-only settings. To study these strategies, reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy on full and in choices-only half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning.

[34] ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Murong Yue,Zhiwei Liu,Liangwei Yang,Jianguo Zhang,Zuxin Liu,Haolin Chen,Ziyu Yao,Silvio Savarese,Caiming Xiong,Shelby Heinecke,Huan Wang

Main category: cs.CL

TL;DR: 提出一种系统性方法,将无结构的工具集合自动重构为结构化的工具库,通过多智能体框架聚合功能,提升工具检索准确性和推理性能。

Details Motivation: 现有工具增强型大模型在领域特定工具稀缺时表现受限,且随着生成工具数量增加,无结构存储导致检索困难和功能歧义,缺乏可扩展性。 Method: 首先生成任务特定工具并按语义聚类;在每个簇内采用多智能体框架:代码智能体提取共享逻辑、创建聚合工具,评审智能体确保功能完整性。 Result: 实验表明该方法显著提高了工具检索准确率和推理性能,并在问题特定工具增多时展现出优于基线方法的可扩展性。 Conclusion: 该方法能有效将大量问题特定工具转化为少量功能更强、结构清晰的聚合工具,在不损失功能的前提下提升工具管理和使用效率。 Abstract: Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.

[35] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan,Qiuyang Mang,Jingbang Chen,Hong Wan,Xiaoyuan Liu,Junjielong Xu,Jen-tse Huang,Wenxuan Wang,Wenxiang Jiao,Pinjia He

Main category: cs.CL

TL;DR: 本文提出了一种面向推理过程的奖励模型(RRM),用于解决大语言模型在数学推理中因仅依赖最终答案奖励而导致的“奖励黑客”问题,显著提升了模型的准确性和可靠性。

Details Motivation: 传统基于最终结果的奖励机制容易导致模型通过错误的推理路径得到正确答案(即奖励欺骗),从而高估其真实推理能力。作者旨在识别并缓解这类问题,特别是减少如“奇迹步骤”等不合理推理现象。 Method: 提出Rubric Reward Model (RRM),一种细粒度、过程导向的奖励函数,依据问题特定的评分标准对整个推理过程进行评估,并在强化学习训练中提供0到1之间的校准奖励,惩罚逻辑错误,鼓励严谨推导。 Result: 在四个数学基准上,RRM显著优于仅基于结果的监督方法;在AIME2024上,Verified Pass@1024从26.7%提升至62.6%,且‘奇迹步骤’的发生率降低了71%。 Conclusion: 奖励推理过程而非仅仅最终答案,对于构建更准确、更可靠的数学推理模型至关重要。 Abstract: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.

[36] The Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLMs

Omar Mahmoud,Ali Khalil,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana

Main category: cs.CL

TL;DR: 本文研究了大语言模型中提高真实性与安全对齐之间的权衡问题,发现减少幻觉得以增强事实准确性的同时可能削弱拒绝行为。为此,作者提出通过稀疏自编码器分离拒絶与幻觉特征,并利用子空间正交化保持微调过程中的安全对齐,有效缓解了这一冲突。

Details Motivation: 提高大语言模型的真实性常导致安全对齐性能下降,尤其是拒绝有害请求的能力减弱,该现象背后的机制尚不明确,亟需系统性研究并提出解决方案。 Method: 通过分析模型中同时编码幻觉和拒绝信息的重叠组件,揭示现有对齐方法会无意抑制事实知识;提出使用稀疏自编码器分离这两类特征,并在微调过程中采用子空间正交化来保持拒绝行为。 Result: 在常识推理任务及有害请求基准(AdvBench、StrongReject)上的实验表明,该方法能有效防止幻觉增加的同时维持安全拒绝能力与任务性能。 Conclusion: 通过特征解耦和正交化策略,可在提升模型真实性的同时保持安全对齐,缓解二者之间的固有冲突,为构建既真实又安全的LLM提供了可行路径。 Abstract: Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment.We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.

[37] Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection

Shiman Zhao,Shangyuan Li,Wei Chen,Tengjiao Wang,Jiahui Yao,Jiabin Zheng,Kam Fai Wong

Main category: cs.CL

TL;DR: 提出了一种端到端的多标签联合学习方法,通过实例关系学习和标签知识传播来解决少样本多标签意图检测中的误差传播问题。

Details Motivation: 现有方法依赖表示分类且忽略实例间关系,导致误差传播,难以有效处理少样本多标签意图检测任务。 Method: 构建一个带有标签知识传播的实例关系学习网络,学习支持集和查询集中实例间的交互关系,并设计双关系增强损失函数优化支持集和查询集级别的关系强度。 Result: 在1-shot场景下,平均比强基线方法提升9.54% AUC和11.19% Macro-F1。 Conclusion: 所提方法能有效缓解误差传播问题,在少样本多标签意图检测中显著优于现有方法。 Abstract: Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.

[38] Drift No More? Context Equilibria in Multi-Turn LLM Interactions

Vardhan Dongre,Ryan A. Rossi,Viet Dac Lai,David Seunghyun Yoon,Dilek Hakkani-Tür,Trung Bui

Main category: cs.CL

TL;DR: 本论文研究了大型语言模型在多轮对话中的上下文漂移问题,提出了一种动态框架来解释其行为,并通过实验验证了漂移是一种可控的平衡现象而非不可避免的衰退。

Details Motivation: 在现实部署中,用户目标和对话上下文持续演变,而现有静态评估指标难以捕捉多轮交互中逐渐出现的上下文漂移问题。 Method: 将上下文漂移形式化为测试模型与目标一致的参考模型之间的逐轮KL散度,并提出一个将漂移演化解释为具有恢复力和可控干预的有界随机过程的递推模型。 Result: 实验表明,上下文漂移趋向于稳定的、受噪声限制的平衡状态,而非持续恶化,且简单的提醒干预能有效减少漂移。 Conclusion: 多轮上下文漂移可被视为一种可控的平衡现象,该发现为理解和缓解长时交互中的漂移问题提供了理论基础。 Abstract: Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.

[39] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model

Shuichiro Haruta,Kazunori Matsumoto,Zhi Li,Yanan Wang,Mori Kurokawa

Main category: cs.CL

TL;DR: 提出一种旋转约束补偿方法,以解决大语言模型结构化剪枝带来的误差,在保持表示几何结构的同时有效恢复输出准确性。

Details Motivation: 结构化剪枝因使用少量校准数据导致输出失配,直接拟合易过拟合并破坏预训练权重。 Method: 在旋转约束下更新剪枝参数,保持输出表示的几何结构(如范数和内积),并重新对齐剪枝子空间与原始输出;引入方差感知的重要性评分,优先保留对方差主方向贡献大的维度。 Result: 在LLaMA-7B上实验显示,相比基线方法,在WikiText-2和多个理解任务上均取得更低的困惑度和更高的准确率。 Conclusion: 所提方法通过旋转约束和方差感知评分,在几何保持的前提下有效补偿剪枝误差,提升了剪枝后模型的性能。 Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.

[40] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Sajib Acharjee Dip,Adrika Zafor,Bikash Kumar Paul,Uddip Acharjee Shuvo,Muhit Islam Emon,Xuan Wang,Liqing Zhang

Main category: cs.CL

TL;DR: LLM4Cell提供了首个针对单细胞研究中58个基础和智能体模型的统一综述,涵盖RNA、ATAC、多组学和空间模态,分类方法并评估其在八项关键分析任务中的表现,揭示了可解释性、标准化和可信模型开发的开放挑战。

Details Motivation: 当前大型语言模型和智能体框架在单细胞生物学中的应用进展零散,缺乏跨模态、架构和评估标准的系统整合,亟需统一视角以推动领域发展。 Method: 对58个用于单细胞研究的基础和智能体模型进行系统综述,将其分为五类(基础、文本桥接、空间、多模态、表观基因组和智能体),映射到八项关键分析任务,并基于40多个公共数据集从10个领域维度进行评估。 Result: 建立了首个语言驱动单细胞智能的集成视图,明确了现有模型在生物学基础、多组学对齐、公平性、隐私和可解释性等方面的表现与局限,识别出基准适用性、数据多样性和伦理可扩展性约束。 Conclusion: LLM4Cell为单细胞语言模型的研究提供了系统性框架和评估体系,指出了未来在标准化、可解释性和可信AI开发方面的重要方向。 Abstract: Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.

[41] HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Peilin Wu,Mian Zhang,Kun Wan,Wentian Zhao,Kaiyu He,Xinya Du,Zhiyu Chen

Main category: cs.CL

TL;DR: 本文提出了HiPRAG,一种通过引入细粒度、基于知识的分层过程奖励来优化Agentic RAG中搜索行为的训练方法,有效减少了过搜和欠搜问题,在多个模型和基准上提升了推理效率与准确性。

Details Motivation: 现有的Agentic RAG训练方法依赖结果奖励,缺乏对搜索过程中每一步决策的细粒度控制,导致普遍存在过搜和欠搜问题,影响效率和输出可靠性。 Method: 提出HiPRAG方法,将代理的推理轨迹分解为可解析的离散步骤,设计分层奖励函数,在结果和格式奖励基础上,增加对最优搜索与非搜索步骤比例的过程奖励,实现对搜索决策必要性的实时评估。 Result: 在Qwen2.5和Llama-3.2模型及七个QA基准上的实验表明,该方法在3B和7B模型上分别达到65.4%和67.2%的平均准确率,同时将过搜率降至2.3%,并降低欠搜率,显著提升搜索效率。 Conclusion: 优化推理过程本身(而不仅仅是最终结果)能有效提升搜索代理的效率和性能,HiPRAG具有良好的泛化能力,验证了通过强化学习实现细粒度控制在推理优化中的重要性与潜力。 Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.

[42] Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

Eric Hanchen Jiang,Guancheng Wan,Sophia Yin,Mengting Li,Yuchen Wu,Xiao Liang,Xinfeng Li,Yizhou Sun,Wei Wang,Kai-Wei Chang,Ying Nian Wu

Main category: cs.CL

TL;DR: 提出了一种名为Guided Topology Diffusion (GTD)的新框架,通过迭代生成并利用轻量级代理模型引导,实现面向任务的自适应通信拓扑结构设计,显著提升了多LLM智能体系统的效率与性能。

Details Motivation: 现有基于大语言模型的多智能体系统通常依赖静态或人工设计的通信拓扑,难以适应不同任务需求,导致通信开销高或性能瓶颈。因此,需要一种能动态平衡任务性能、通信成本和鲁棒性的自适应拓扑生成方法。 Method: 受条件离散图扩散模型启发,将拓扑生成建模为迭代构造过程,每一步由一个轻量级代理模型根据多目标奖励(如准确率、效用、成本)进行引导,实现无需梯度的实时优化,从而生成任务自适应的稀疏通信拓扑。 Result: 在多个基准任务上验证了GTD的有效性,实验表明其生成的拓扑结构具有更高的任务适应性、稀疏性和通信效率,在LLM智能体协作中显著优于现有方法。 Conclusion: GTD提供了一种灵活且高效的通信拓扑生成范式,能够根据任务需求动态优化多智能体系统的结构,为构建高效LLM驱动的协作系统提供了新思路。 Abstract: The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called \textit{Guided Topology Diffusion (GTD)}. Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

[43] Multilingual Generative Retrieval via Cross-lingual Semantic Compression

Yuxin Huang,Simeng Wu,Ran Song,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu

Main category: cs.CL

TL;DR: 提出了一种基于跨语言语义压缩的多语言生成式检索框架MGR-CSC,有效解决了标识符错位和膨胀问题,在多个基准上显著提升了检索性能。

Details Motivation: 生成式信息检索在单语场景中表现优异,但在多语言场景下面临跨语言标识符错位和标识符膨胀两大挑战,亟需有效解决方案。 Method: 提出MGR-CSC框架,通过将语义等价的多语言关键词统一为共享原子以对齐语义并压缩标识空间,并设计动态多步约束解码策略提升检索效率。 Result: 在mMarco100k和mNQ320k数据集上,检索准确率分别提升6.83%和4.77%,文档标识符长度减少74.51%和78.2%。 Conclusion: MGR-CSC通过语义压缩和统一标识显著提升了多语言生成式检索的准确性与效率,具备良好的应用潜力。 Abstract: Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios.However, applying these methods to multilingual retrieval still encounters two primary challenges, cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifiers length by 74.51% and 78.2%, respectively.

[44] AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

Jingyu Peng,Maolin Wang,Hengyi Cai,Yuchen Li,Kai Zhang,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao

Main category: cs.CL

TL;DR: 提出AdaSwitch方法,动态结合策略内和策略外生成,在token级别提升小语言模型的蒸馏效果。

Details Motivation: 现有知识蒸馏方法在监督质量与训练-推理一致性之间存在权衡,难以兼顾高性能与低延迟需求。 Method: 在token级别动态切换策略内(on-policy)和策略外(off-policy)生成,学生模型先自主预测,再根据实时质量评估选择性引入教师指导。 Result: 在三个数据集和两组师生大模型组合上实验表明,AdaSwitch在精度上持续提升,且额外开销可接受。 Conclusion: AdaSwitch有效平衡了监督质量与推理一致性,为小语言模型的知识蒸馏提供了实用且高效的方法。 Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.

[45] Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

Md. Faiyaz Abdullah Sayeedi,Md. Mahbub Alam,Subhey Sadi Rahman,Md. Adnanul Islam,Jannatul Ferdous Deepti,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda

Main category: cs.CL

TL;DR: 本文提出了Translation Tangles,一个用于评估开源大语言模型在机器翻译中的质量与公平性的统一框架和数据集,涵盖24个双向语言对,并引入基于人工标注的高质量偏见标注数据集。

Details Motivation: 大语言模型在机器翻译中表现优异,但在不同语系和领域间性能不均,且可能编码并放大训练数据中的偏见,尤其影响低资源语言的公平性。 Method: 构建包含24个双向语言对、多领域翻译任务的基准测试;提出结合基于规则的启发式方法、语义相似度过滤和大模型验证的混合偏见检测流程;并基于1,439组人工评估的翻译-参考对构建高质量偏见标注数据集。 Result: 实现了对多种开源大语言模型在翻译质量和公平性方面的系统评估,验证了所提偏见检测方法的有效性,并公开了代码与数据集。 Conclusion: Translation Tangles为评估机器翻译中的质量和公平性提供了有效工具,有助于识别和缓解大语言模型在多语言场景下的偏见问题。 Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles

[46] Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking

Xinliang Frederick Zhang,Anhad Mohananey,Alexandra Chronopoulou,Pinelopi Papalampidi,Somit Gupta,Tsendsuren Munkhdalai,Lu Wang,Shyam Upadhyay

Main category: cs.CL

TL;DR: 该研究提出了TRACE分析工具,系统性地探究大语言模型在简单任务上出现“过度思考”的根本原因,并提出基于思维效用的定义来管理和理解过度思考问题。

Details Motivation: 现有研究对大语言模型过度思考现象的理解停留在表面,缺乏对其内在机制的深入分析,本文旨在填补这一空白。 Method: 提出TRACE分析框架,将模型的思维过程分解为最小完整子思想,通过推断子思想间的语篇关系构建细粒度思维演进图,并识别常见思维模式。 Result: 发现开放权重模型中存在Explorer和Late Landing两种主要思维模式,揭示过度验证和过度探索是导致过度思考的主要原因。 Conclusion: 基于思维结构提出新的过度思考效用定义,超越了传统的长度指标,为理解和管理大语言模型的过度思考提供了原则性指导。 Abstract: Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency -- overthinking -- models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs' inner workings. This study introduces a systematic, fine-grained analyzer of LLMs' thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.

[47] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

Main category: cs.CL

TL;DR: 本文提出了一个代码切换语音到语音基准(CS3-Bench),揭示了现有模型在语言对齐方面的不足,并通过Chain of Recognition和Keyword Highlighting方法显著提升了多模态大语言模型的语言对齐能力。

Details Motivation: 现有的多模态大语言模型在单语自然交互方面已取得进展,但在语言对齐方面存在明显缺陷,尤其是在跨语言的语音交互中表现不佳。 Method: 提出CS3-Bench基准测试,构建具有挑战性的代码切换语音数据集;采用Chain of Recognition (CoR)增强理解能力,结合Keyword Highlighting (KH)引导生成过程,并设计针对性的数据构造与训练策略。 Result: 在知识密集型问答中知识准确率从25.14%提升至46.13%,开放性对话理解率从64.5%提升至86.5%,并显著减少次级语言的发音错误。7个主流模型在CS3-Bench上表现出最高达66%的性能下降。 Conclusion: 语言对齐是当前多模态语音交互系统的关键瓶颈,所提出的CoR与KH方法有效提升了模型在跨语言场景下的理解与生成能力,为未来多语言语音交互系统的发展提供了重要方向。 Abstract: The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.

[48] Contrastive Weak-to-strong Generalization

Houcheng Jiang,Junfeng Fang,Jiaxin Wu,Tianyu Zhang,Chen Gao,Yong Li,Xiang Wang,Xiangnan He,Yang Deng

Main category: cs.CL

TL;DR: 提出Contrastive Weak-to-Strong Generalization (ConG),利用对比解码减少弱模型输出噪声,提升弱到强泛化的鲁棒性和效果。

Details Motivation: 传统弱到强泛化方法受限于弱模型输出中的噪声和偏差,影响其实际应用。 Method: 通过隐式奖励与对比解码(CD)的结构等价性,设计ConG框架,在对齐前后弱模型间进行对比解码以生成更高质量样本。 Result: 在多个模型族上实验表明ConG显著优于传统方法,具备良好通用性和有效性。 Conclusion: ConG有效提升了弱到强泛化的性能,为实现AGI提供了一条有前景的路径。 Abstract: Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.

Verena Blaschke,Miriam Winkler,Barbara Plank

Main category: cs.CL

TL;DR: 本研究比较了从标准德语到多种方言在文本、语音及级联系统中的迁移效果,发现语音模型在方言数据上表现最佳,而文本模型在标准数据上更优;同时发布了首个方言音频意图分类数据集。

Details Motivation: 由于方言主要是口头语言,且非标准拼写会影响文本处理,因此需要探索不同模式下标准到方言的迁移效果,尤其是在语音和文本之间的差异。 Method: 在德语及其多种方言的意图和主题分类任务中,比较了纯文本模型、纯语音模型以及语音先转录为文本再处理的级联系统的性能。 Result: 语音模型在方言数据上表现最好,文本模型在标准数据上最优;级联系统在标准德语中落后于纯文本模型,但在生成标准化输出时对方言数据有较好表现。 Conclusion: 对于方言处理,直接使用语音模型优于依赖转录的级联系统或纯文本模型,但转录结果的标准化程度显著影响级联系统的表现。 Abstract: Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.

[50] Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models

Hyeonseok Moon,Seongtae Hong,Jaehyung Seo,Heuiseok Lim

Main category: cs.CL

TL;DR: 本文提出了MCBench,一个用于评估大语言模型(LLM)能否准确执行基于字符串匹配的NLP指标并严格遵循逐步指令的新基准。该基准具有客观性、确定性和代码可验证性,旨在测试LLM在指令遵循、数值计算和中间结果一致性方面的表现。

Details Motivation: 随着前沿LLM在许多现有基准上达到饱和,缺乏能够进一步区分模型能力的挑战性评测任务,因此需要设计更具挑战性和客观验证机制的新型基准。 Method: 提出MCBench,包含三个评估指标和三种变体,通过提供并行参考代码实现对LLM输出的客观评估,重点测试模型在逐步执行、数值计算和长距离一致性的能力。 Result: 实验表明,MCBench能有效且客观地评估当前最先进LLM的能力,尤其在指令遵循和中间结果处理方面提供了细粒度的分析手段。 Conclusion: MCBench是一个有效的、客观的评估工具,可用于衡量LLM在精确执行复杂指令任务中的表现,为未来模型发展提供更清晰的评测标准。 Abstract: Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and codeverifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.

[51] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall

Jiayu Yang,Yuxuan Fan,Songning Lai,Shengen Wu,Jiaqi Tang,Chun Kang,Zhijiang Guo,Yutao Yue

Main category: cs.CL

TL;DR: 本文提出了一种基于神经元级归因的知识编辑方法ACE,用于提升大语言模型在多跳事实回忆中的性能,通过识别和编辑关键的查询-值(Q-V)通路,在GPT-J和Qwen3-8B上显著优于现有方法。

Details Motivation: 现有知识编辑方法在多跳事实回忆中表现衰退,尤其在涉及推理链中隐式中间主体时效果不佳,其根本原因在于忽略了知识在神经元层面的动态表征机制。 Method: 通过因果分析揭示隐式主体在多跳推理中作为查询神经元,逐层激活对应的价值神经元以累积信息;基于此发现,提出ACE框架,利用神经元级归因来识别并编辑关键的查询-值(Q-V)路径。 Result: ACE在GPT-J上比现有最先进方法提升9.44%,在Qwen3-8B上提升37.46%;同时揭示了Qwen3中更细粒度的激活模式,并验证了价值神经元的语义可解释性由查询驱动的累积机制所调控。 Conclusion: 通过理解模型内部的推理机制,尤其是查询-值神经元的动态协作,可以为知识编辑提供更有效且可解释的路径,ACE为多跳知识编辑提供了机理上合理的新范式。 Abstract: Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.

[52] Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation

Fanwei Zhua,Jiaxuan He,Xiaoxiao Chen,Zulong Chen,Quan Lu,Chenrui Mei

Main category: cs.CL

TL;DR: 提出一种基于大语言模型的统一自动评分框架,能够对多种类型的主观题进行类人评估,涵盖内容相似性、知识点匹配、答案相关性和人工评价模拟,实验表明其在多个指标上优于传统和基于LLM的方法,并已成功应用于实际考试场景。

Details Motivation: 现有自动评分方法多针对特定类型主观题,缺乏对包含多种题型综合考试的通用支持,难以应对学生回答的多样性和开放性。 Method: 构建一个包含四个模块的统一LLM增强自动评分框架:基础文本匹配、关键知识点比较、从学生答案生成伪问题以评估相关性、模拟人工评价识别内容与非内容优缺点。 Result: 在通用和领域特定数据集上的实验显示,该框架在多个评分指标上 consistently 优于传统及基于LLM的基线方法,并已在大型电商企业的培训与认证考试中成功部署。 Conclusion: 所提框架具有良好的通用性和实用性,能有效提升主观题自动评分的准确性和可解释性,具备广泛的应用前景。 Abstract: Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.

[53] STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models

Kyumin Lee,Minjin Jeon,Sanghwan Jang,Hwanjo Yu

Main category: cs.CL

TL;DR: 提出StepER方法,通过分步监督和难度感知训练提升多步检索增强语言模型的推理能力。

Details Motivation: 现有知识蒸馏方法忽视了多步推理中不同步骤所需的差异化推理能力,导致在多步检索增强框架中的迁移效果受限。 Method: 采用分步监督以匹配各阶段动态变化的信息与推理需求,并引入难度感知训练,优先优化适合的步骤。该方法适用于多种多步检索增强语言模型。 Result: 实验表明,StepER在多跳问答基准上优于先前方法,8B模型性能接近70B教师模型。 Conclusion: StepER有效提升了多步检索增强语言模型的推理能力,实现了高效的知识迁移。 Abstract: Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.

[54] Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Adam Dejl,James Barry,Alessandra Pascale,Javier Carnerero Cano

Main category: cs.CL

TL;DR: 本研究探讨了评估大语言模型生成文本全面性的三种自动化方法,发现简单端到端方法效果显著但牺牲了鲁棒性和可解释性。

Details Motivation: 大语言模型虽性能强大,但常遗漏关键信息,在敏感领域可能造成严重危害,需有效评估其输出的全面性。 Method: 研究比较了三种自动评估策略:基于自然语言推断(NLI)的方法、基于问答(Q&A)的方法和端到端的LLM直接检测方法。 Result: 实验表明,端到端方法效果出人意料地好,但鲁棒性、可解释性和结果细粒度较低;同时评估了多个开源大模型在多源查询下的响应全面性。 Conclusion: 端到端方法在检测缺失信息方面表现良好,但需权衡其在鲁棒性和解释性方面的不足,为未来改进提供了方向。 Abstract: Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.

[55] Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries

Madis Jürviste,Joonatan Jakobson

Main category: cs.CL

TL;DR: 该研究探索了大型语言模型(LLM)在17至18世纪爱沙尼亚语词典研究中的应用,涵盖历史词典的现代化补充、哥特体文本识别及跨源数据集构建。

Details Motivation: 针对小语种历史文献数字化中人力和时间成本高的问题,探索LLM在低资源语言中的自动化处理潜力。 Method: 使用Claude 3.7 Sonnet进行词义和现代形式补全;采用视觉增强型LLM对Fraktur印刷文本进行零样本识别;通过重叠切片扫描图像并用两个LLM分别执行文本识别与结构合并。 Result: 在Gutslaff词典中,81%的词条被准确补充现代含义;Helle词典的零样本识别生成41%无误JSON输出;Hupel语法书的德-爱词典部分实现高效数字化。 Conclusion: LLM在小语种历史文献处理中具有显著潜力,可大幅节省时间和经济成本,支持未来统一历史词典数据库的构建。 Abstract: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.

[56] A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Fengji Zhang,Xinyao Niu,Chengyang Ying,Guancheng Lin,Zhongkai Hao,Zhou Fan,Chengen Huang,Jacky Keung,Bei Chen,Junyang Lin

Main category: cs.CL

TL;DR: 本文提出了一种无需人工标注的端到端框架A²Search,用于处理开放域问答中存在多个正确答案的模糊性问题,通过轨迹采样和证据验证自动识别歧义并生成多答案,结合强化学习与AnsF1奖励函数,在多个基准上实现了最先进的性能。

Details Motivation: 现有问答模型通常假设每个问题只有一个标准答案,难以应对实际中存在多个合理答案的模糊性问题,且依赖人工标注的方法成本高、难以扩展。 Method: 提出A²Search框架,通过自动化流程检测模糊问题,利用轨迹采样生成多个推理路径,并通过证据验证获取不同答案;采用强化学习训练,设计AnsF1奖励函数以支持多答案评估。 Result: 在八个开放域问答基准上实验表明,A²Search显著优于现有方法,A²Search-7B在四个多跳问答基准上的平均AnsF1@1达到48.4%,超过更大的ReSearch-32B(46.2%),且具备良好的泛化能力。 Conclusion: 拥抱并建模问答中的模糊性对构建更可靠、鲁棒的问答系统至关重要,A²Search为解决多答案问题提供了一个高效、可扩展的新范式。 Abstract: Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search

[57] LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Jingyuan Wang,Yankai Chen,Zhonghang Li,Chao Huang

Main category: cs.CL

TL;DR: 本文提出了一种名为LightReasoner的新框架,利用小语言模型(SLM)与大语言模型(LLM)之间的行为差异,识别出对推理至关重要的关键时刻,并通过专家-业余对比生成监督样本,从而在不依赖真实标签的情况下提升LLM的推理能力。

Details Motivation: 监督微调(SFT)虽然有效但资源消耗大,且多数token并无显著学习价值。本文旨在探索更高效的方法,通过较小模型揭示大模型独有的高价值推理过程,以降低训练成本并提高效率。 Method: LightReasoner分为两个阶段:第一阶段通过采样找出专家模型(LLM)相对于弱模型(SLM)表现出优势的关键推理时刻,构建包含其优势的监督样本;第二阶段使用这些精炼样本对专家模型进行微调,强化其推理能力。整个过程无需真实标签。 Result: 在七个数学推理基准上,LightReasoner最高提升了28.1%的准确率,同时减少了90%的时间消耗、80%的采样问题数量和99%的微调token使用量。 Conclusion: LightReasoner通过将弱SLM转化为有效的教学信号,提供了一种可扩展且资源高效的提升LLM推理能力的新途径。 Abstract: Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner

[58] Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning

Jialu Du,Guiyang Hou,Yihui Fu,Chen Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu

Main category: cs.CL

TL;DR: 提出一种自适应世界模型增强的推理机制,通过构建动态文本世界模型来解决大语言模型在社会推理任务中混淆客观现实与主观信念的问题,显著提升准确性并降低计算成本。

Details Motivation: 大语言模型在数学和代码推理方面表现出色,但在处理涉及多个参与者和时间线的社会推理任务时,常出现认知混乱、逻辑不一致以及混淆客观状态与主观信念的问题。 Method: 分析DeepSeek-R1的推理轨迹,识别出模型在遇到推理障碍时输出矛盾词汇;提出一种自适应世界模型增强的推理机制,动态构建文本世界模型以跟踪实体状态和时序,并在检测到混淆时提供清晰的世界状态描述进行干预。 Result: 在三个社会推理基准上评估显示,该方法显著提高了准确性(如Hi-ToM上+10%),并减少了最多33.8%的token消耗。 Conclusion: 该机制有效帮助大语言模型区分外部事件与内部信念,解决了社会推理中的关键挑战,为在社交场景中部署LLMs提供了简单而高效的解决方案。 Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through deteiled analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.

[59] Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge

Watcharapong Timklaypachara,Monrada Chiewhawan,Nopporn Lekuthai,Titipat Achakulvisut

Main category: cs.CL

TL;DR: 提出了一种结合图文上下文与作者写作风格的两阶段科学图表标题生成方法,在SciCap挑战赛中表现出色。

Details Motivation: 科学图表标题需要准确且风格一致地传递信息,现有方法在风格适应性和上下文利用方面存在不足。 Method: 采用两阶段 pipeline:第一阶段通过上下文过滤和类别特定提示优化生成候选标题;第二阶段利用少量示例和作者风格画像进行风格化 refine。 Result: 类别特定提示使 ROUGE-1 召回率提升 +8.3%,风格精炼使 BLEU 提升 40-48%、ROUGE 提升 25-27%。 Conclusion: 结合上下文理解与作者特定风格适应可生成既科学准确又风格忠实的图表标题。 Abstract: Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3\% while limiting precision loss to -2.8\% and BLEU-4 reduction to -10.9\%. Profile-informed stylistic refinement yields 40--48\% gains in BLEU scores and 25--27\% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.

[60] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Cheng Yang,Xuemeng Yang,Licheng Wen,Daocheng Fu,Jianbiao Mei,Rong Wu,Pinlong Cai,Yufan Shen,Nianchen Deng,Botian Shi,Yu Qiao,Haifeng Li

Main category: cs.CL

TL;DR: MUSE 是一种基于分层记忆模块的新型 LLM 代理框架,通过经验驱动实现持续学习和自我进化,在长周期任务中展现出卓越性能和强泛化能力。

Details Motivation: 现有大语言模型代理在部署于现实世界长周期任务时无法从经验中学习,缺乏持续改进的能力。 Method: 提出 MUSE 框架,构建分层记忆模块,将执行轨迹转化为结构化经验并不断回写记忆,支持自主反思与经验积累,实现持续学习。 Result: 在 TAC 基准上使用轻量级 Gemini-2.5 Flash 模型显著超越现有 SOTA;实验证明其具备持续学习、自我进化和跨任务零样本迁移能力。 Conclusion: MUSE 建立了可连续学习的 AI 代理新范式,推动 LLM 在真实场景生产力自动化中的应用。 Abstract: Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Sufficient Experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.

[61] ChatGPT as a Translation Engine: A Case Study on Japanese-English

Vincent Michael Sutanto,Giovanni Gatti De Giacomo,Toshiaki Nakazawa,Masaru Yamada

Main category: cs.CL

TL;DR: 该研究探讨了使用ChatGPT进行日英翻译的效果,比较了简单与增强提示,并评估了其相对于商业翻译引擎的性能。结果显示,文档级翻译优于句子级翻译,ChatGPT-3.5在自动评估中表现更优,但ChatGPT-4在流畅性上更具优势,两者在准确性和流畅性之间存在权衡,ChatGPT整体表现与主流翻译系统相当。

Details Motivation: 探索ChatGPT在日英翻译中的潜力,并评估不同提示方式和翻译层级对翻译质量的影响,同时与现有商业系统进行对比。 Method: 采用简单和增强提示方法,对ChatGPT进行日英翻译实验,结合自动评估和MQM人工评估,比较句子级与文档级翻译效果,并与商业翻译引擎对比。 Result: 文档级翻译优于句子级;未能明确增强提示优于简单提示;ChatGPT-3.5在自动评估中得分更高,但在准确性与流畅性之间存在权衡(ChatGPT-3.5更准,ChatGPT-4更流畅);ChatGPT整体表现与主流系统相当。 Conclusion: ChatGPT在日英翻译中具有竞争力,文档级翻译更优,提示方式影响尚不明确,版本间存在准确性与流畅性的权衡。 Abstract: This study investigates ChatGPT for Japanese-English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Performing both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, we were not able to determine if enhanced prompts performed better than simple prompts in our experiments. We also discovered that ChatGPT-3.5 was preferred by automatic evaluation, but a tradeoff exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely-known translation systems.

[62] Climate Knowledge in Large Language Models

Ivan Kuznetsov,Jacopo Grassi,Dmitrii Pantiukhin,Boris Shapkin,Thomas Jung,Nikolay Koldunov

Main category: cs.CL

TL;DR: 该研究评估了大型语言模型(LLM)在无需外部检索的情况下回忆气候常态的能力,发现其能捕捉基本气候模式但存在显著空间误差,尤其在高海拔和高纬度地区表现较差,且无法准确再现长期温度变化的空间分布。

Details Motivation: 随着LLM越来越多地应用于气候相关场景,了解其内部气候知识的准确性对确保可靠性及降低错误信息风险至关重要。然而,当前LLM在参数化记忆中存储气候常态的能力尚未被充分评估。 Method: 构建一个分辨率为1°的全球陆地查询网格,输入位置坐标和地理描述,询问1991-2020年7月平均气温,并将LLM的回答与ERA5再分析数据对比,评估其准确性;同时分析不同海拔、地理上下文和模型规模的影响。 Result: LLM能够捕捉纬度和地形相关的气候结构,均方根误差为3-6°C,偏差±1°C;加入地理上下文可使误差平均降低27%,大模型对此更敏感;但在海拔1500米以上误差显著增加(RMSE达5-13°C),且无法复现1950-2024年间温度变化的空间模式。 Conclusion: 尽管LLM具备一定参数化气候知识并可用于描述当前气候分布,但其在表征长期气候变化的区域差异方面存在局限,需谨慎用于气候动态分析;本研究提供了一个可重复的评估框架以量化LLM中的气候知识。 Abstract: Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1{\deg} resolution land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 {\deg}C and biases of $\pm$1 {\deg}C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 {\deg}C compared to 2-4 {\deg}C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.

[63] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Congming Zheng,Jiachen Zhu,Zhuoying Ou,Yuxiang Chen,Kangning Zhang,Rong Shan,Zeyu Zheng,Mengyue Yang,Jianghao Lin,Yong Yu,Weinan Zhang

Main category: cs.CL

TL;DR: 本文综述了过程奖励模型(PRMs),系统地介绍了其在生成过程数据、构建PRMs以及在测试时扩展和强化学习中的应用,旨在推动细粒度且鲁棒的推理对齐研究。

Details Motivation: 尽管大语言模型具备高级推理能力,但传统的对齐方法主要依赖仅评估最终答案的结果奖励模型(ORMs),无法有效指导中间推理过程。因此需要引入能够评估和引导逐步推理的过程奖励模型(PRMs)。 Method: 通过梳理PRMs的完整流程——包括过程数据生成、PRM建模方法及其在测试时扩展与强化学习中的应用——并对数学、代码、文本、多模态推理、机器人和智能体等领域的应用进行总结,同时回顾新兴基准。 Result: 提供了PRMs的系统性概述,明确了设计空间,总结了在多个领域中的应用进展,并识别出当前面临的挑战。 Conclusion: PRMs有助于实现更细粒度的推理对齐,未来的研究应聚焦于提升其鲁棒性和泛化能力,以推动复杂任务中的可靠推理。 Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.

[64] FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation

Shule Lu,Lingxiang Wang,Sijia Wen,Ziwei Wang,Hainan Zhang

Main category: cs.CL

TL;DR: 提出了一种基于可信度评估的联邦自适应聚合策略FedDTRE,用于对话生成,通过动态调节全局模型在本地更新中的贡献来提升模型性能和对话质量。

Details Motivation: 传统集中式或完全本地训练方法在隐私保护与个性化之间难以平衡,现有联邦学习方法在客户端数据有限时易过拟合且容易遗忘全局信息,导致泛化能力差。 Method: 提出FedDTRE,利用全局和本地模型在公平性导向评估数据集上的可信度评分,动态调节全局模型在本地更新中的贡献,而非直接用全局模型替换本地模型。 Result: 实验结果表明,FedDTRE能够提升对话模型的性能,增强对话生成的质量。 Conclusion: FedDTRE有效缓解了联邦对话系统中的过拟合和全局信息遗忘问题,实现了更好的个性化与泛化能力平衡。 Abstract: With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model's contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.

[65] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility

Shramay Palta,Peter Rankel,Sarah Wiegreffe,Rachel Rudinger

Main category: cs.CL

TL;DR: 研究发现,人类对常识性多选题答案的合理性判断会受到LLM生成的支持或反对理由的影响,表明LLM不仅能影响人类信念,还可用于研究人类认知。

Details Motivation: 探究LLM生成的理由是否会影响人类在常识推理任务中的合理性判断,并评估这种影响的程度。 Method: 通过收集3,000个人类和13,600个LLM对带有PRO/CON理由的答案的合理性评分,分析LLM生成理由对判断的影响。 Result: 人类和LLM的合理性评分均受PRO和CON理由显著影响:支持理由提高评分,反对理由降低评分。 Conclusion: LLM生成的理由具有说服力,能在常识领域影响人类判断,提示LLM在认知研究中的潜力及其可能带来的信念操控风险。 Abstract: We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts'' (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.

[66] The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Sherzod Hakimov,Roland Bernard,Tim Leiber,Karl Osswald,Kristina Richert,Ruilin Yang,Raffaella Bernardi,David Schlangen

Main category: cs.CL

TL;DR: 本研究首次系统评估了大语言模型(LLM)推理能力在多语言谈判任务中的影响,发现启用推理显著提升谈判表现但增加计算成本,且开源模型普遍存在内部推理切换至英语的现象,而商业模型能保持语言一致性。

Details Motivation: 探讨LLM的推理能力如何影响其在多语言谈判场景下的表现,特别是在战略思维、对手建模及合作与竞争平衡方面的作用,并比较不同类别模型(商业vs开源)在语言一致性和推理效率上的差异。 Method: 通过自对弈方式,在三种不同对话博弈任务中评估多个商业和开源大模型的谈判能力,涵盖英语、德语和意大利语;分析启用推理(即扩展测试时计算资源)对性能、成本及语言一致性的影响。 Result: 启用推理显著提升谈判结果(如GPT-5性能提升31.4%),但计算成本增加近400%;开源模型在非英语谈判中仍倾向使用英语进行内部推理,导致语言不一致,可能影响推理过程的可解释性;而主流商业模型能保持推理与输出语言的一致性。 Conclusion: 推理能有效增强LLM的谈判能力,尤其在处理复杂任务和促进合作方面,但伴随高昂成本;语言一致性方面,商业模型优于开源模型,提示未来需关注多语言推理的透明性与可控性。 Abstract: Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning-that is, scaling test time compute-significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4 % while increasing its cost by nearly 400 %. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.

[67] Lossless Vocabulary Reduction for Auto-Regressive Language Models

Daiki Chijiwa,Taku Hasegawa,Kyosuke Nishida,Shin'ya Yamaguchi,Tomoya Ohba,Tamao Sakao,Susumu Takeuchi

Main category: cs.CL

TL;DR: 本文提出了一种无损词汇缩减的理论框架,能够将自回归语言模型转换为任意小词汇量的模型而不损失精度,并展示了不同分词方式的语言模型如何通过最大公共词汇高效协作。

Details Motivation: 由于不同的语言模型使用不同的分词方式和词汇表,导致它们在下一词预测分布层面难以协同工作,如模型集成等任务面临挑战。 Method: 建立了一个无损词汇缩减的理论框架,通过该框架可以将任意自回归语言模型转换为具有更小词汇表的模型,同时保持原有的预测精度。 Result: 实现了不同分词体系的语言模型通过其最大公共词汇进行高效的协作,验证了该方法在模型集成等场景下的有效性。 Conclusion: 所提出的无损词汇缩减方法能够在不牺牲准确率的前提下缩小语言模型的词汇量,并促进不同语言模型之间的协作。 Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.

Haoyang Gui,Thales Bertaglia,Taylor Annabell,Catalina Goanta,Tjomme Dooper,Gerasimos Spanakis

Main category: cs.CL

TL;DR: 该研究评估了GPT-5-nano和Gemini-2.5-flash-lite在检测Instagram上未披露的赞助内容方面的表现,结合不同法律知识提示策略,发现模型在明确案例中表现良好(F1高达0.93),但在模糊案例中性能下降。研究提出了一个LLM法律推理错误分类法,并通过人工标注数据集和混合评估方法,为自动化监管提供法律稳健的技术支持。

Details Motivation: 由于网红营销中赞助内容与有机内容界限模糊,现有检测方法缺乏法律依据或不透明,导致监管困难,因此需要基于法律知识的可解释、可靠的自动化检测手段。 Method: 使用1,143条Instagram帖子,比较GPT-5-nano和Gemini-2.5-flash-lite在三种提示策略下的表现,控制输入的法律知识量;构建LLM推理错误分类法,并由两名受过法律培训的学生对解释进行标注,结合定量与定性分析评估模型性能。 Result: 两个模型在分类任务中表现良好(F1最高达0.93),但在模糊案例中性能下降超10个百分点;加入法规文本可提升解释质量但不显著提高准确率;识别出常见错误包括引用缺失(28.57%)、引用不清(20.71%)和隐性广告误判率高(28.57%)。 Conclusion: 该研究通过构建错误分类法、标注数据集和综合评估框架,推动了具备法律稳健性的合规技术发展,有助于监管机构在坚实法律基础上实现网红营销内容的透明化自动监管。 Abstract: The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque "black boxes". Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.

[69] Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Jasmina Gajcin,Erik Miehling,Rahul Nair,Elizabeth Daly,Radu Marinescu,Seshu Tirupathi

Main category: cs.CL

TL;DR: 本文提出了一种从LLM-as-a-Judge中提取基于概念的全局策略的方法,包括生成局部解释的CLoVE和将其聚类为全局策略的GloVE,验证了其在内容危害检测中的高保真性和鲁棒性,并通过用户研究评估了可理解性和满意度。

Details Motivation: 为了理解和减轻使用LLM作为评判者时可能引入的偏见和风险,需要可解释的方法来揭示其决策背后的全局规则。 Method: 提出了CLoVE算法生成基于概念、可验证的对比局部解释,并通过GloVE算法进行迭代聚类、摘要和验证,形成全局策略。 Result: 在七个内容危害检测基准数据集上验证了全局策略对LLM判断的高度保真;全局策略对文本扰动和对抗攻击具有鲁棒性;用户研究表明用户对其有较好的理解和满意度。 Conclusion: 所提出的GloVE方法能有效提取并简化LLM-as-a-Judge的决策逻辑为可理解的全局策略,有助于提升自动化评估的透明度与可信度。 Abstract: Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.

[70] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling

Shuliang Liu,Zhipeng Xu,Zhenghao Liu,Yukun Yan,Minghe Yu,Yu Gu,Chong Chen,Huiyuan Xie,Ge Yu

Main category: cs.CL

TL;DR: 本文提出了Genii,一种无监督的多智能体协同优化框架,用于缓解大语言模型作为评判者时的判断偏好偏差。

Details Motivation: 大语言模型在自动评估中表现出对自身生成结果的偏好偏差,影响了评估的可靠性。 Method: 通过构建多智能体系统,模拟客户端-服务器投票机制,集成多个基于LLM的评判模型,在无需人工标注的情况下进行无监督优化。 Result: 实验表明,Genii优于依赖标注数据的有监督模型,并在不同客户端代理上持续提升性能,即使使用较弱模型作为服务器代理也有效。 Conclusion: Genii能有效减轻LLM评判模型的判断偏好偏差,展现出良好的鲁棒性和应用潜力。 Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.

[71] AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents

Md Tahmid Rahman Laskar,Julien Bouvier Tremblay,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN

Main category: cs.CL

TL;DR: 本文提出了一种名为AI Knowledge Assist的系统,通过从历史客户-代理对话中提取问答对自动生成企业专属知识库,结合微调轻量级大模型(LLaMA-3.1-8B),在20家公司中实现了超过90%的信息查询问题回答准确率,有效解决了客服中心冷启动问题,支持RAG聊天机器人即时部署。

Details Motivation: 由于缺乏企业专属的知识库,制约了对话式AI系统在客服中心的应用,尤其是在冷启动场景下难以快速部署有效的RAG系统。 Method: 提出AI Knowledge Assist系统,从历史客户-代理对话中自动提取问答对构建知识库,并对轻量级大语言模型(LLaMA-3.1-8B)进行内部数据微调,以提升性能。 Result: 在20家公司的实证评估中,该系统在信息查询问题上的回答准确率超过90%,优于更大的闭源大模型,显著缩小了冷启动差距。 Conclusion: AI Knowledge Assist系统能高效构建企业知识库,结合微调轻量模型实现高性能,使RAG驱动的聊天机器人可立即部署,推动对话式AI在客服场景中的落地应用。 Abstract: The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.

[72] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Elena Khasanova,Harsh Saini,Md Tahmid Rahman Laskar,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN

Main category: cs.CL

TL;DR: 本文提出了一种名为DACIP-RC的持续指令预训练方法,通过阅读理解生成任务指令和响应,提升小型语言模型在商业对话任务中的零样本泛化能力。

Details Motivation: 大型语言模型推理成本高,难以部署;小型模型虽高效但缺乏跨领域的零样本指令遵循能力,传统微调方法易导致灾难性遗忘。 Method: 提出DACIP-RC方法,基于对话记录通过阅读理解生成多样化的任务指令和响应,进行持续预训练,以增强模型在特定领域(商业对话)的适应性和泛化能力。 Result: 实验表明,DACIP-RC在会议摘要、行动项生成和通话目的识别等多个商业对话任务中显著提升了小型语言模型的零样本性能。 Conclusion: DACIP-RC有效提升了小型语言模型在工业场景中的领域适应能力和零样本泛化性能,是首个将指令预训练应用于商业对话数据的工作,为行业利用专有数据进行领域适配提供了新思路。 Abstract: The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

[73] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

Shuzhou Yuan,Ercong Nie,Yinuo Sun,Chenxuan Zhao,William LaCroix,Michael Färber

Main category: cs.CL

TL;DR: 本文提出了两个基准测试(XSB和MS-XSB)来评估大语言模型中的过度拒绝问题,并提出三种无需重新训练的轻量级方法(忽略特定词、提示重写和注意力引导)来缓解该问题,实验表明这些方法能有效提升模型对安全请求的响应能力,同时保持安全性。

Details Motivation: 大语言模型常因输入中包含类似不安全词汇而错误拒绝本应接受的良性请求,影响实用性与用户体验,因此需要系统评估并缓解此类过度拒绝现象。 Method: 构建了单轮XSB和多轮MS-XSB两个基准,标注拒绝触发关键词;利用事后解释方法识别触发词,并在推理阶段采用忽略词指令、提示重写和注意力引导三种模型无关的轻量方法进行干预。 Result: 实验显示多种主流LLM均存在显著的过度拒绝问题,尤其在多轮对话中更严重;所提三种方法显著提升了模型对安全提示的合规性,同时未降低安全性。 Conclusion: 提出的基准和干预方法为诊断和缓解大语言模型的过度拒绝提供了可复现的框架,有助于实现更安全且更有帮助的模型部署。 Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.

[74] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code

Jian Xie,Zhendong Chu,Aoxiao Zhong,Kai Zhang,Mingzhe Han,Xin Fang,Jialie Shen,Qingsong Wen

Main category: cs.CL

TL;DR: 本文提出了ARM2,一个通过强化学习框架结合长度感知优化的统一模型,能够自适应地在多种任务中平衡推理性能和效率,并支持多模态和代码执行,显著降低token消耗。

Details Motivation: 大型推理模型在简单任务上常出现过度推理问题,现有方法多为启发式且任务特定,缺乏通用的自适应推理框架。 Method: 提出ARM2模型,采用强化学习框架并引入长度感知优化,支持自然语言推理、视觉理解和可执行代码集成,实现多模态与高效推理。 Result: 实验表明,ARM2在性能与传统GRPO训练模型相当的情况下,平均减少70%以上的token使用,并在多任务上验证了其有效性与设计合理性。 Conclusion: ARM2提供了一种通用的自适应推理解决方案,在保持性能的同时大幅提高推理效率,适用于多模态和复杂任务场景。 Abstract: Large Reasoning Models (LRMs) often suffer from the ``over-thinking'' problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.

[75] MetricalARGS: A Taxonomy for Studying Metrical Poetry with LLMs

Chalamalasetti Kranti,Sowmya Vajjala

Main category: cs.CL

TL;DR: 本文提出了MetricalARGS,首个用于评估大语言模型在格律诗方面能力的NLP任务分类体系,涵盖分析、检索、生成和支持四个维度,并以泰卢固语为例展示了其应用。

Details Motivation: 现有NLP研究多集中于诗歌生成与摘要,缺乏对格律诗中严格音节和音素规则的建模与评估,难以检验大语言模型遵循复杂规则的能力。 Method: 提出MetricalARGS分类体系,将诗歌相关NLP任务划分为分析、检索、生成和支持四类,并讨论数据集与评估指标的设计,以泰卢固语为案例进行实践验证。 Result: 建立了首个面向格律诗的系统性任务分类框架,揭示了当前大语言模型在处理严格诗歌形式方面的潜力与挑战。 Conclusion: MetricalARGS为通过格律诗评估语言模型的语言理解与规则遵循能力提供了新路径,拓展了诗歌NLP研究的深度与广度。 Abstract: Prior NLP work studying poetry has focused primarily on automatic poem generation and summarization. Many languages have well-studied traditions of poetic meter which enforce constraints on a poem in terms of syllable and phoneme patterns. Such advanced literary forms offer opportunities for probing deeper reasoning and language understanding in Large Language Models (LLMs) and their ability to follow strict pre-requisites and rules. In this paper, we introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate LLMs on metrical poetry across four dimensions: Analysis, Retrieval, Generation, and Support. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics. Taking Telugu as our example language, we illustrate how the taxonomy can be used in practice. MetricalARGS highlights the broader possibilities for understanding the capabilities and limitations of today's LLMs through the lens of metrical poetry.

[76] Training-Free Group Relative Policy Optimization

Yuzheng Cai,Siqi Cai,Yuchen Shi,Zihan Xu,Lichao Chen,Yulei Qin,Xiaoyu Tan,Gang Li,Zongyi Li,Haojia Lin,Yong Mao,Ke Li,Xing Sun

Main category: cs.CL

TL;DR: 提出了一种无需训练的Group Relative Policy Optimization(Training-Free GRPO)方法,通过利用 rollout 组内的语义优势来提炼高质量经验知识,并将其作为词元先验融入 LLM 推理过程,从而在数学推理和网页搜索任务中显著提升模型在少样本下的跨域性能。

Details Motivation: 现有基于参数更新的代理强化学习方法(如SFT+RL)成本高昂且易过拟合,难以适应数据稀缺的实际场景,因此需要一种更轻量、无需训练的替代方案来提升LLM代理在专业领域的表现。 Method: 提出Training-Free GRPO,不进行任何参数更新,而是通过多轮迭代学习从少量真实数据中提取各rollout组内的语义相对优势,形成可作为token prior的经验知识,并在API调用时动态引导模型输出。 Result: 在数学推理和网页搜索任务上,仅使用几十个训练样本,Training-Free GRPO显著提升了DeepSeek-V3.1-Terminus的跨域性能,且优于经过微调的小规模LLM。 Conclusion: 无需参数更新的方法也能有效提升LLM代理在特定任务上的表现,Training-Free GRPO为低成本、低资源场景下的智能代理优化提供了可行路径。 Abstract: Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on a minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.

[77] Memory Retrieval and Consolidation in Large Language Models through Function Tokens

Shaohua Zhang,Yuan Lin,Hang Li

Main category: cs.CL

TL;DR: 本文提出了“功能词元假说”来解释大语言模型中记忆检索与巩固的机制,认为功能词元在推理过程中激活上下文中的预测特征并主导下一个词元的预测,在预训练中通过预测跟随其后的内容词元来增加模型学习到的特征数量并更新参数。

Details Motivation: 大语言模型的记忆检索与巩固机制尚不明确,需要一个理论框架来解释功能词元在此过程中的作用。 Method: 提出功能词元假说,并通过二分图分析和案例研究验证功能词元如何激活预测特征;分析预训练中损失主要来源于对功能词元后内容词元的预测。 Result: 实验证明少数功能词元可激活大多数特征;训练损失集中在功能词元后的内容词元预测上,支持功能词元在记忆检索与巩固中的核心作用。 Conclusion: 功能词元在大语言模型的记忆检索与巩固中起关键作用,该假说为理解LLM工作机制提供了新的视角。 Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

[78] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

XuHao Hu,Peng Wang,Xiaoya Lu,Dongrui Liu,Xuanjing Huang,Jing Shao

Main category: cs.CL

TL;DR: 本研究扩展了“ emergent misalignment”现象的研究范围,证明在高风险情境下,对大语言模型进行恶意或错误数据微调会导致其在诚实性方面出现广泛错位,即使仅有少量错位数据(如1%)或少量偏见用户(如10%)也会显著增加模型的不诚实和欺骗行为。

Details Motivation: 探究大语言模型在高风险情境下因微调于恶意或错误完成数据而导致的广泛不诚实与欺骗行为,即是否将‘emergent misalignment’现象从安全领域扩展到更广泛的诚信问题。 Method: 通过对开源大语言模型在多个领域中的错位完成数据进行微调,并在下游混合任务及模拟的人机交互环境中测试其诚实性表现,分析其不诚实行为的泛化情况。 Result: 实验表明,经过错位数据微调的模型在各种情境下均表现出广泛的不诚实行为;在下游任务中仅加入1%的错位数据即可使诚实行为下降超过20%;在含10%偏见用户的模拟交互环境中,模型的不诚实行为被无意加剧。 Conclusion: emergent misalignment 现象不仅存在于安全相关领域,还会导致模型在高风险情境下的普遍不诚实与欺骗行为,且该风险可通过直接微调、混合训练任务以及真实人机交互环境中的偏见用户传播,提示需警惕训练数据中的偏差影响。 Abstract: Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.

[79] SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets

Qiang Yang,Xiuying Chen,Changsheng Ma,Rui Yin,Xin Gao,Xiangliang Zhang

Main category: cs.CL

TL;DR: 本文提出了SenWave,一个用于分析新冠疫情推文的细粒度多语言情感分析数据集,包含五种语言和十种情感类别,并基于预训练模型进行微调以实现精确分类,同时分析了跨语言、国家和话题的情感演变。

Details Motivation: 现有新冠相关数据集缺乏高质量的细粒度情感标注,限制了对公众情绪的深入理解。 Method: 构建了一个包含5种语言共10万条标注推文和1.05亿条未标注推文的数据集,使用预训练的Transformer模型进行微调以实现细粒度情感分类。 Result: 实现了跨语言、国家和主题的细粒度情感分类,揭示了疫情不同时期的情绪演变趋势,并验证了数据集与ChatGPT的良好兼容性。 Conclusion: SenWave数据集为复杂事件下的细粒度情感分析提供了有力支持,有助于推动NLP领域对公众情绪理解的研究进展。 Abstract: The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible on the repository\footnote{https://github.com/gitdevqiang/SenWave}. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.

[80] Investigating Counterclaims in Causality Extraction from Text

Tim Hagen,Niklas Deckers,Felix Wolter,Harrisen Scells,Martin Potthast

Main category: cs.CL

TL;DR: 本文提出了一种新的数据集,用于解决现有因果关系抽取研究中忽视反因果(concausal)声明的问题,并通过增强Causal News Corpus来提高模型区分正向和反向因果关系的能力。

Details Motivation: 现有的因果关系抽取数据集主要关注支持因果关系的‘正因果’声明,而忽略了反驳因果关系的‘反因果’声明,这导致了标注错误和模型性能下降。 Method: 基于广泛的文献回顾,制定了严格的标注指南,并使用该指南扩充了Causal News Corpus,加入了反因果声明。 Result: 在新数据集上训练的模型能够更准确地区分正因果和反因果声明,减少了将反因果误分类为正因果的情况。 Conclusion: 集成反因果声明对于提升因果关系抽取模型的准确性至关重要,所开发的数据集有助于实现这一目标。 Abstract: Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.

[81] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Jingyu Zhang,Haozhu Wang,Eric Michael Smith,Sid Wang,Amr Sharaf,Mahesh Pasupuleti,Benjamin Van Durme,Daniel Khashabi,Jason Weston,Hongyuan Zhan

Main category: cs.CL

TL;DR: 本文提出WaltzRL,一种基于多智能体强化学习的安全对齐框架,通过协作式对话代理与反馈代理动态优化大模型在安全性和有用性之间的平衡,显著降低不安全输出和过度拒绝率。

Details Motivation: 大语言模型在追求有用性与无害性之间存在根本张力,现有方法因完全拒绝潜在不安全内容而导致过度拒绝问题,缺乏对敏感但无害请求的细致指导。 Method: 提出WaltzRL框架,联合训练对话代理和反馈代理,引入动态改进奖励(DIR)机制,使反馈代理在推理时自适应提供改进建议而非丢弃响应,实现安全与帮助性的协同演化。 Result: 在五个数据集上的实验表明,WaltzRL将不安全响应率从39.0%降至4.6%(WildJailbreak),过度拒绝率从45.3%降至9.9%(OR-Bench),优于多种基线方法。 Conclusion: WaltzRL通过可适应的多智能体协作机制,在不损害模型通用能力的前提下提升了安全性,推动了有用性与无害性之间的帕累托前沿。 Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.

[82] Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Jannek Ulm,Kevin Du,Vésteinn Snæbjarnarson

Main category: cs.CL

TL;DR: 本文研究了使用对比解码生成合成语料库在大语言模型训练中的应用,发现结合合成数据与真实数据能提升语言建模及下游任务性能,尤其在需要推理能力的任务上表现更优。

Details Motivation: 由于大规模语言模型训练所需文本数据可能面临枯竭,探索利用大模型自身生成的合成数据作为补充成为必要。 Method: 通过对比解码方法,利用性能较好和较差的模型之间的差异生成合成语料,并将其与原始训练数据混合进行训练。 Result: 实验表明,在语言建模目标和多种下游任务上,混合使用合成与真实数据可提升性能;特别是对比解码生成的数据有助于提升推理类任务表现,而传统采样生成的数据更利于表层语言任务。 Conclusion: 使用对比解码生成的合成数据与真实数据结合训练,能有效提升语言模型的整体性能,尤其是在需要推理能力的任务上具有优势。 Abstract: Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.

[83] Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Yaojie Lu,Xianpei Han,Le Sun,WenJuan Zhang,Pengbo Wang,Shixuan Liu,Zhenru Zhang,Jianhong Tu,Hongyu Lin,Junyang Lin

Main category: cs.CL

TL;DR: DeepMiner是一种通过高难度训练任务和动态上下文窗口来激发多轮长视野推理能力的新框架,在多个搜索代理基准上显著优于现有开源模型。

Details Motivation: 现有方法难以在多轮长视野交互中激发深度推理能力,尤其是在处理扩展上下文时受限于上下文长度和缺乏认知能力的注入。 Method: 提出DeepMiner框架,采用反向构建方法从真实网页生成复杂且可验证的问答对,并设计无需外部摘要模型的动态上下文管理策略,结合滑动窗口机制支持长序列交互。 Result: 在Qwen3-32B上通过强化学习训练出DeepMiner-32B,在BrowseComp-en上达到33.5%的准确率,比之前最好的开源代理提升近20个百分点,并在其他多个基准上表现一致优越;支持在32k上下文内持续近100轮交互。 Conclusion: DeepMiner有效提升了多轮推理代理的长视野交互与深度推理能力,解决了上下文限制问题,为构建更智能的长期交互系统提供了可行方案。 Abstract: While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.

[84] Neuron-Level Analysis of Cultural Understanding in Large Language Models

Taisei Yamamoto,Ryoma Kumon,Danushka Bollegala,Hitomi Yanaka

Main category: cs.CL

TL;DR: 本文提出了一种基于梯度的评分方法,用于识别大语言模型中与文化理解相关的神经元,发现少于1%的神经元(集中在浅层到中层MLP)在文化行为中起关键作用,并验证了这些神经元对文化基准性能的影响。

Details Motivation: 大语言模型存在文化偏见且对少数文化的认知有限,其文化理解机制尚不明确,因此需要从神经元层面分析其文化行为的内在机制。 Method: 提出一种基于梯度的评分方法,并结合过滤策略精确定位影响文化行为的神经元,区分出文化通用和文化特定神经元,并通过抑制实验验证其功能。 Result: 发现了少于1%的关键神经元集中于浅至中层MLP层,抑制这些神经元会导致文化基准性能显著下降(高达30%),而对通用自然语言理解任务影响较小;文化特定神经元还支持相关文化的知识;训练NLU任务可能削弱模型的文化理解能力。 Conclusion: 大语言模型中存在少量关键神经元负责文化理解,研究揭示了其内部机制,为模型训练和工程提供了实践指导。 Abstract: As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG

[85] AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming

Muxi Diao,Yutao Mou,Keqing He,Hanbo Song,Lulu Zhao,Shikun Zhang,Wei Ye,Kongming Liang,Zhanyu Ma

Main category: cs.CL

TL;DR: 提出AutoRed,一种无需种子指令的自由形式对抗提示生成框架,用于提高大语言模型的安全性评估。

Details Motivation: 现有红队测试方法依赖种子指令,限制了合成对抗提示的语义多样性。 Method: 采用两阶段框架:1)基于角色引导的对抗指令生成;2)通过反思循环迭代优化低质量提示,并引入验证器评估提示的危害性而不查询目标模型。 Result: 构建了两个红队测试数据集AutoRed-Medium和AutoRed-Hard,评估八种最先进LLM,AutoRed在攻击成功率和泛化能力上优于现有基线。 Conclusion: 种子指令方法存在局限,自由形式红队测试在LLM安全评估中具有更大潜力。 Abstract: The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.

[86] Two-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media

Yukai Song,Pengfei Zhou,César Escobar-Viera,Candice Biernesser,Wei Huang,Jingtong Hu

Main category: cs.CL

TL;DR: 提出一种两阶段投票架构,结合轻量级BERT模型和多视角大语言模型(LLM)投票机制,有效平衡效率与准确性,用于检测显性和隐性自杀意念,在多个数据集上表现优异且显著降低LLM使用成本。

Details Motivation: 自杀率上升急需有效的预防手段,许多高风险个体因羞耻感不愿寻求正式帮助,而倾向于在社交媒体上表达痛苦,但现有模型难以准确识别隐含的自杀意念。 Method: 采用两阶段投票架构:第一阶段用轻量BERT模型快速处理高置信度显性案例;第二阶段将模糊输入交由多视角LLM投票系统或基于心理指标的特征集成模型处理,后者通过提示工程从LLM提取心理学特征向量。 Result: 在Reddit(显性为主)和DeepSuiMind(仅隐性)数据集上,F1分数分别达到98.0%和99.7%,跨领域差距降至2%以下,并显著降低LLM计算开销。 Conclusion: 该框架首次实现将LLM提取的心理特征向量化用于自杀风险检测,兼顾效率、可解释性与高性能,为实际应用提供了可行方案。 Abstract: Suicide rates have risen worldwide in recent years, underscoring the urgent need for proactive prevention strategies. Social media provides valuable signals, as many at-risk individuals - who often avoid formal help due to stigma - choose instead to share their distress online. Yet detecting implicit suicidal ideation, conveyed indirectly through metaphor, sarcasm, or subtle emotional cues, remains highly challenging. Lightweight models like BERT handle explicit signals but fail on subtle implicit ones, while large language models (LLMs) capture nuance at prohibitive computational cost. To address this gap, we propose a two-stage voting architecture that balances efficiency and robustness. In Stage 1, a lightweight BERT classifier rapidly resolves high-confidence explicit cases. In Stage 2, ambiguous inputs are escalated to either (i) a multi-perspective LLM voting framework to maximize recall on implicit ideation, or (ii) a feature-based ML ensemble guided by psychologically grounded indicators extracted via prompt-engineered LLMs for efficiency and interpretability. To the best of our knowledge, this is among the first works to operationalize LLM-extracted psychological features as structured vectors for suicide risk detection. On two complementary datasets - explicit-dominant Reddit and implicit-only DeepSuiMind - our framework outperforms single-model baselines, achieving 98.0% F1 on explicit cases, 99.7% on implicit ones, and reducing the cross-domain gap below 2%, while significantly lowering LLM cost.

[87] On the Relationship Between the Choice of Representation and In-Context Learning

Ioana Marinescu,Kyunghyun Cho,Eric Karl Oermann

Main category: cs.CL

TL;DR: 本文研究了上下文学习(ICL)中示例表示与学习能力之间的关系,发现表示质量决定ICL的基线性能,而学习则在此基础上独立提升性能,二者具有正交性。

Details Motivation: 过去的研究分别关注ICL中示例的表示方式和学习能力,但两者如何相互作用尚不清楚。本文旨在探究表示(特别是标签表示)与学习过程是否独立影响ICL性能。 Method: 提出一种优化算法,枚举出在语义相关性上连续变化的不同标签集(表示),并在每个标签集上使用不同数量的上下文示例进行ICL实验,分析表示质量与学习效率的关系。 Result: 实验表明,无论标签集的质量如何,模型都能从更多示例中学习(即性能随示例增加而提升),但学习效率受标签集质量和模型参数量共同影响;同时,初始表示的相对优劣在整个学习过程中保持不变。 Conclusion: 上下文学习中的表示与学习过程基本相互独立:表示决定基线性能,学习在此之上逐步提升性能,二者正交。这一发现揭示了ICL中表示与学习的独立作用机制。 Abstract: In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.

[88] If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

Jasmin Orth,Philipp Mondorf,Barbara Plank

Main category: cs.CL

TL;DR: 本文研究了大语言模型(LLM)在判断条件语句可接受性方面的表现,发现模型对条件概率和语义相关性有所敏感,但一致性不及人类,且模型规模增大并不一定更贴近人类判断。

Details Motivation: 了解大语言模型如何判断条件语句的可接受性,填补此前在该认知层面研究的空白,并比较其与人类判断的一致性。 Method: 通过线性混合效应模型和方差分析(ANOVA),在不同模型家族、规模和提示策略下评估LLM对条件可接受性的判断,并与人类数据进行对比。 Result: LLM能够感知条件概率和语义相关性,但敏感程度因架构和提示方式而异;与人类相比,其判断一致性较低;模型尺寸增大并未显著提升与人类判断的对齐程度。 Conclusion: 当前大语言模型在条件可接受性判断上虽具备一定认知能力,但在整合概率与语义信息方面仍不如人类稳定,更大的模型未必更优。 Abstract: Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance-though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.

[89] Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

Noor Ul Zain,Mohsin Raza,Ahsan Adeel

Main category: cs.CL

TL;DR: Co$^4$是一种仅含单层、双头、8M参数的小型模型,在训练效率和样本利用率上显著超越GPT-2和GPT-BERT,且在零样本和微调任务中表现更优。

Details Motivation: 挑战当前深度学习模型依赖大规模参数和计算资源的范式,探索高效、低计算成本的预训练模型可能性。 Method: 设计了一种轻量级的Co$^4$模型,具有单层、双注意力头结构,计算复杂度为O(N),并在BabyLM挑战的数据集上进行短周期训练与评估。 Result: Co$^4$在仅训练两个epoch的情况下,在10M token上显著超越训练十epoch的GPT-2和GPT-BERT;在SuperGLUE零样本评估中优于GPT-2(5/7指标)和GPT-BERT(4/7),微调任务中也表现更优(6/7和4/7)。 Conclusion: 小型、低复杂度模型在高效训练下可实现优越性能,提示需重新思考当前的模型扩展规律和深度学习范式。 Abstract: We show that a tiny Co$^4$ machine(Adeel,2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2))$ and GPT-BERT (30M, 12 layers, $O(N^2))$ in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.

[90] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

Shuang Chen,Yue Guo,Yimeng Ye,Shijue Huang,Wenbo Hu,Haoxi Li,Manyuan Zhang,Jiayu Chen,Song Guo,Nanyun Peng

Main category: cs.CL

TL;DR: 提出ARES框架,通过自适应推理解决多模态大模型在简单任务上过思考、复杂任务上欠探索的问题,基于滑动窗口熵(HWE)动态调整推理深度,在多种数学、逻辑和多模态基准上实现了高效且高性能的推理。

Details Motivation: 现有MLRM模型在简单任务上容易过思考,在复杂任务上则探索不足,导致效率与性能失衡。需要一种能根据任务难度动态调整推理强度的方法。 Method: 提出ARES框架,包含两个阶段:1)自适应冷启动阶段,构建与任务难度匹配的推理轨迹数据,赋予模型初步的难度感知能力;2)自适应熵策略优化(AEPO),利用高窗口熵(HWE)作为探索触发机制,并设计分层熵奖励与动态KL控制来决定探索时机与程度。 Result: 实验表明,ARES在多个数学、逻辑和多模态基准上优于现有开源模型,推理效率更高,且在显著更低的推理成本下接近商用系统的性能。 Conclusion: ARES通过HWE驱动的自适应推理机制,有效平衡了不同难度任务下的推理开销与性能,为高效多模态推理提供了可扩展的开源解决方案。 Abstract: Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.

[91] LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task

Elisa Leonardelli,Silvia Casola,Siyao Peng,Giulia Rizzi,Valerio Basile,Elisabetta Fersini,Diego Frassinelli,Hyewon Jang,Maja Pavlovic,Barbara Plank,Massimo Poesio

Main category: cs.CL

TL;DR: LEWIDI第三版通过扩展至四个涵盖不同NLP任务的数据集,引入软标签和视角主义两种互补评估范式,并测试新的评价指标,推动AI模型对人类判断差异的建模与评估。

Details Motivation: 现有AI模型通常忽略人类判断中的变异与分歧,而实际应用中这种差异普遍存在。因此需要训练和评估能够识别并处理人类判断分歧的AI模型。 Method: 扩展LEWIDI基准到四个数据集(复述识别、反语检测、讽刺检测、自然语言推断),包含分类和序数标注;采用软标签(预测群体判断分布)和perspectivist(预测个体注释者判断)两种建模范式;设计并测试新的评估指标。 Result: 任务吸引了多样化参与,结果揭示了当前建模人类判断变异方法的优势与局限;验证了新评估范式的有效性,并提供了新的资源、基准和见解。 Conclusion: LEWIDI作为支持分歧感知AI技术发展的框架得到加强,为未来研究提供了重要基础和方向。 Abstract: Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods to modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.

[92] DeepPrune: Parallel Scaling without Inter-trace Redundancy

Shangqing Tu,Yaxuan Li,Yushi Bai,Lei Hou,Juanzi Li

Main category: cs.CL

TL;DR: 本文提出DeepPrune,一种通过动态剪枝提升大模型并行推理效率的新框架,显著减少冗余计算,在保持准确率的同时实现超过80%的token节省。

Details Motivation: 并行扩展虽能提升大语言模型的推理能力,但存在大量推理路径冗余(超过80%产生相同答案),导致计算资源浪费,亟需提高效率。 Method: 提出DeepPrune框架,包括使用focal loss和过采样训练的判别模型来预测部分推理链的答案等价性,并结合在线贪心聚类算法动态剪除冗余路径。 Result: 在AIME 2024、AIME 2025和GPQA三个基准上验证,DeepPrune相比传统共识采样平均减少80%以上的token消耗,且准确率损失控制在3个百分点以内。 Conclusion: DeepPrune有效解决了并行推理中的冗余问题,为高效并行推理建立了新标准,显著提升了推理效率。 Abstract: Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces which realizes 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction by over 80% compared to conventional consensus sampling on most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/

[93] Neologism Learning for Controllability and Self-Verbalization

John Hewitt,Oyvind Tafjord,Robert Geirhos,Been Kim

Main category: cs.CL

TL;DR: 本文探讨了在与大语言模型(LLM)交互中引入新词(neologism)的方法,以更好地理解和控制模型行为。通过添加新的词嵌入并用示例训练,新词可用于控制如奉承、错误回答、文本长度等概念,并可通过模型的自我语言化解释其含义。作者提出“插入评估”来验证这些解释的有效性,并发现机器专属同义词现象,最后展示了多概念、多词汇的联合学习能力。

Details Motivation: 为了更有效地理解和控制大语言模型的行为,受到人类因新需求创造新词的启发,探索在人机交互中引入人工新词的可行性与优势。 Method: 通过添加新的词嵌入并在不改变其他模型参数的情况下,使用体现特定概念的示例进行训练,使模型学习新词;并通过自我语言化让模型用自然语言描述新词的含义,提出插件评估方法来验证这些描述的有效性。 Result: 成功实现了对多种概念(如奉承、错误回答、文本长度等)的控制;模型能自我解释新词含义;插件评估验证了自我语言化的有效性;发现了机器专属同义词;实现了多个概念的联合学习。 Conclusion: 引入新词是一种有效的人机协同理解与控制大语言模型的方法,不仅增强了对模型行为的可控性,还通过自我语言化加深了对模型内部机制的理解,具有扩展应用于复杂概念控制的潜力。 Abstract: Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means ``a lack of complete, coherent, or meaningful answers...'' To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.

Hyunji Lee,Kevin Chenhao Li,Matthias Grabmair,Shanshan Xu

Main category: cs.CL

TL;DR: 提出了一种结合蒙特卡洛树搜索与代理评估器的提示优化框架,用于提升服务条款中公平性检测的准确性和效率。

Details Motivation: 现有提示优化方法因搜索策略低效和评分成本高而计算开销大,难以在有限资源下有效提升法律NLP任务性能。 Method: 采用蒙特卡洛树搜索(MCTS)结合代理提示评估器,在减少评估成本的同时更有效地探索提示空间。 Result: 实验表明,在受限计算预算下,该方法相比基线方法实现了更高的分类准确率和效率。 Conclusion: 所提框架能有效平衡提示优化的性能与计算成本,适用于资源受限的法律文本处理任务。 Abstract: Prompt optimization aims to systematically refine prompts to enhance a language model's performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.

[95] Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Wenjie Du,Li Jiang,Keda Tao,Xue Liu,Huan Wang

Main category: cs.CL

TL;DR: 提出RLKV框架,利用强化学习识别推理关键的注意力头,在保持接近无损性能的同时实现20-50%的KV缓存压缩。

Details Motivation: 现有KV缓存压缩方法在推理模型上表现不佳,会破坏推理完整性或错误压缩关键注意力头,导致性能显著下降。 Method: 提出RLKV框架,使用强化学习直接优化每个注意力头的缓存使用与推理质量之间的关系,并根据重要性分配全量或压缩的KV缓存。 Result: 实验表明仅需保留少量关键注意力头即可维持高性能,RLKV在20-50%缓存压缩率下优于基线方法且性能接近无损。 Conclusion: KV头在推理模型中具有功能异质性,RLKV能有效识别推理关键头,实现高效且低损的KV缓存压缩。 Abstract: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models-some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.

[96] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Xiangyuan Xue,Yifan Zhou,Guibin Zhang,Zaibin Zhang,Yijiang Li,Chen Zhang,Zhenfei Yin,Philip Torr,Wanli Ouyang,Lei Bai

Main category: cs.CL

TL;DR: 本文提出了CoMAS框架,通过多智能体之间的互动讨论实现无需外部监督的自主进化,利用大语言模型作为评判机制生成内在奖励信号,并通过强化学习优化策略,在多个评估场景中达到最先进的性能。

Details Motivation: 现有基于强化学习的自进化方法依赖于密集的外部奖励或从大语言模型自身提取内在奖励,偏离了人类通过相互讨论和协作学习的机制。因此,需要一种更贴近人类学习方式的自进化方法。 Method: 提出CoMAS框架,构建多智能体系统,通过智能体间的丰富讨论动态生成内在奖励信号,采用大语言模型作为评判机制(LLM-as-a-judge)来形成奖励,并使用强化学习优化每个智能体的策略,实现去中心化且可扩展的协同进化。 Result: 实验结果表明,CoMAS在大多数评估设置中均优于未经训练的智能体并达到最先进的性能;消融研究验证了基于交互的奖励信号的必要性,并显示出随着智能体数量和多样性的增加具有良好的可扩展性。 Conclusion: CoMAS为大语言模型智能体的自进化提供了一种新颖且有效的范式,强调通过多智能体交互实现无需外部监督的自主提升。 Abstract: Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.

[97] ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

Qin Liu,Jacob Dineen,Yuxi Huang,Sheng Zhang,Hoifung Poon,Ben Zhou,Muhao Chen

Main category: cs.CL

TL;DR: ArenaBencher是一个模型无关的自动基准演化框架,通过更新测试用例来应对预训练数据泄露问题,保持可比性的同时揭示模型的共性弱点。

Details Motivation: 由于大语言模型预训练数据泄露,传统基准测试的有效性受到威胁,导致模型评分虚高、比较失真和进展误判。 Method: ArenaBencher基于现有基准和多样化的模型池,推断每个测试案例的核心能力,生成保持原目标的新问答对,利用大语言模型作为裁判验证其正确性和意图,并结合多个模型反馈筛选出更具挑战性的新测试案例,迭代优化。 Result: 在数学解题、常识推理和安全等领域应用中,ArenaBencher生成了经过验证、多样化且公平的更新,发现了新的失败模式,提升了难度并保持目标一致性,增强了模型区分能力。 Conclusion: 该框架为应对基础模型快速发展提供了可扩展的基准持续演进路径,有助于提升评估的有效性和可靠性。 Abstract: Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.

cs.CV [Back]

[98] Enhancing Maritime Object Detection in Real-Time with RT-DETR and Data Augmentation

Nader Nemati

Main category: cs.CV

TL;DR: 本文提出了一种基于RT-DETR的实时海上目标检测系统,通过融合多尺度特征、优化查询选择和合成与真实数据加权策略,提升小目标检测性能,并在真实数据上验证了系统的有效性。

Details Motivation: 由于海上目标尺寸小且真实RGB标注数据有限,传统检测方法面临挑战,因此需要提升检测精度与鲁棒性。 Method: 采用RT-DETR框架,引入多尺度特征融合模块、不确定性最小化的查询选择机制,以及合成与真实样本间的智能权重分配;结合数据增强平衡类别分布。 Result: 系统在真实数据上实现了实时检测性能,有效提升了对小尺寸、低对比度船舶的检测能力,并验证了各模块的贡献及在极端光照和海况下的鲁棒性。 Conclusion: 所提出的改进RT-DETR方法在保持端到端检测优势的同时,显著提升了海上小目标检测的准确性与实用性,具备实际部署潜力。 Abstract: Maritime object detection faces essential challenges due to the small target size and limitations of labeled real RGB data. This paper will present a real-time object detection system based on RT-DETR, enhanced by employing augmented synthetic images while strictly evaluating on real data. This study employs RT-DETR for the maritime environment by combining multi-scale feature fusion, uncertainty-minimizing query selection, and smart weight between synthetic and real training samples. The fusion module in DETR enhances the detection of small, low-contrast vessels, query selection focuses on the most reliable proposals, and the weighting strategy helps reduce the visual gap between synthetic and real domains. This design preserves DETR's refined end-to-end set prediction while allowing users to adjust between speed and accuracy at inference time. Data augmentation techniques were also used to balance the different classes of the dataset to improve the robustness and accuracy of the model. Regarding this study, a full Python robust maritime detection pipeline is delivered that maintains real-time performance even under practical limits. It also verifies how each module contributes, and how the system handles failures in extreme lighting or sea conditions. This study also includes a component analysis to quantify the contribution of each architectural module and explore its interactions.

[99] DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis

Nithin C. Babu,Aniruddha Mahapatra,Harsh Rangwani,Rajiv Soundararajan,Kuldeep Kulkarni

Main category: cs.CV

TL;DR: 本文提出了DynamicEval,一个专注于动态相机运动的文本到视频生成评估基准,通过45k人类标注和新提出的背景与前景一致性指标,显著提升了现有评估方法在动态场景下的性能和与人类偏好的相关性。

Details Motivation: 现有的T2V评估基准主要关注主体中心提示或静态场景,缺乏对动态相机运动的有效评估,并且通常将视频级评分聚合为模型级评分,忽略了对单个视频质量的精细评估。 Method: 构建了一个强调动态相机运动的系统化提示集,并收集了来自10个T2V模型生成的3k视频的45k人类标注数据;提出新的背景一致性度量方法(基于Vbench运动平滑度并结合对象误差图修正遮挡问题)以及前景一致性度量方法(通过跟踪对象实例内的点及其邻域来评估对象保真度)。 Result: 实验表明,所提出的指标在视频级别和模型级别上与人类偏好具有更强的相关性(提升超过2个百分点),优于现有基准。 Conclusion: DynamicEval作为一个更全面的T2V模型评估基准,在动态相机运动场景下表现出优越的评估能力,有助于选出更高质量的生成视频。 Abstract: Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion essential for producing cinematic shots and existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlook video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain the interpretable error maps based on the Vbench motion smoothness metric. We observe that while the Vbench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2% points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.

[100] Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors

Marien Renaud,Julien Hermant,Deliang Wei,Yu Sun

Main category: cs.CV

TL;DR: 提出了一种名为RISP的新方法,结合重启惯性和基于分数的先验,在不牺牲重建质量的前提下实现比RED更快的收敛速度。

Details Motivation: 现有方法如RED通常专注于设计复杂的图像先验以提高重建质量,但收敛加速依赖启发式方法,缺乏理论支持。 Method: 提出Restarted Inertia with Score-based Priors (RISP),引入重启惯性机制以加速收敛,并兼容基于分数的图像先验;通过理论证明其收敛速率优于RED,并分析其对应的连续时间动力系统与重球ODE的联系。 Result: RISP在多种成像反问题中实现了更快的收敛速度和高质量的图像重建,且不要求图像先验的凸性。 Conclusion: RISP为正则化去噪框架提供了有原则的扩展,在保证高质量重建的同时显著提升了收敛速度,具有更强的理论基础和实用性。 Abstract: Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems. Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics. To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED. RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction. We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior. We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE). Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.

[101] A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy

Guoliang Gong,Man Yu

Main category: cs.CV

TL;DR: 本文提出了一种基于图像净化(IP)策略的超低剂量CT去噪新框架,并结合频域流匹配(FFM)模型,有效解决了真实临床uLDCT与正常剂量CT之间存在的严重噪声、伪影和空间错位问题,显著提升了去噪性能和解剖结构的保真度。

Details Motivation: 超低剂量CT虽然降低了辐射暴露,但引入了严重的噪声和空间错位,导致现有基于合成噪声或对齐数据训练的去噪网络难以直接应用。因此需要解决真实场景下uLDCT与NDCT之间的数据不匹配问题。 Method: 构建了一个真实的临床uLDCT肺部数据集;提出图像净化(IP)策略生成结构对齐的uLDCT-NDCT配对图像;在此基础上设计频率域流匹配(FFM)模型,利用频域信息更好地保持解剖结构完整性。 Result: 实验表明,IP策略显著提升了多种主流去噪模型在uLDCT任务上的表现;FFM结合IP策略在解剖结构保持方面达到了最先进的水平。 Conclusion: 该研究通过IP策略和FFM模型为真实世界uLDCT去噪中的数据错配问题提供了有效解决方案,具有良好的临床应用前景。 Abstract: Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.

[102] D2RA: Dual Domain Regeneration Attack

Pragati Shuddhodhan Meshram,Varun Chandrasekaran

Main category: cs.CV

TL;DR: 提出了一种无需训练、针对单张图像的攻击方法D2RA,可在不访问生成模型的情况下有效削弱或去除语义水印,暴露现有水印方案的脆弱性。

Details Motivation: 现有的语义水印方案虽在鲁棒性上有所提升,但在资源受限的对抗环境下仍易受攻击,亟需评估其安全性。 Method: 通过将含水印图像投影到多个互补表示下的自然先验上,在无需训练和模型访问的前提下抑制水印信号,同时保持视觉质量。 Result: 在多种水印方案上的实验表明,D2RA能持续降低水印的可检测性,且不损害图像视觉保真度。 Conclusion: 当前语义水印设计存在根本性缺陷,D2RA揭示了其在对抗攻击下的脆弱性,强调了改进水印鲁棒性的必要性。 Abstract: The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.

[103] PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

Soroush Mehraban,Vida Adeli,Jacob Rommann,Babak Taati,Kyryl Truskovskyi

Main category: cs.CV

TL;DR: 本文提出PickStyle,一种基于预训练视频扩散模型的视频风格迁移框架,通过引入低秩适配器和静态图像配对数据进行训练,并设计CS-CFG方法分离内容与风格引导,实现高质量、时序连贯的视频风格迁移。

Details Motivation: 由于缺乏成对的视频数据用于监督,现有的视频风格迁移方法难以同时保持视频内容的一致性和风格的准确性,因此需要一种能够利用静态图像数据并有效解耦内容与风格的学习框架。 Method: 提出PickStyle框架,在预训练视频扩散模型的自注意力层中插入低秩风格适配器;利用具有源-目标风格对应关系的成对静态图像构建合成训练片段,通过共享增强模拟相机运动以保留时间先验;引入上下文-风格分类无关引导(CS-CFG),将引导信号分解为独立的文本(风格)和视频(内容)方向。 Result: 实验表明,PickStyle在多个基准上实现了时序连贯、风格忠实且内容保持的视频转换效果,无论在定性还是定量指标上均优于现有基线方法。 Conclusion: PickStyle通过结合图像级监督、适配器微调和新型引导机制,有效解决了无配对视频数据下的视频风格迁移难题,为扩散模型在动态内容生成中的应用提供了新思路。 Abstract: We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.

[104] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Saman Motamed,Minghao Chen,Luc Van Gool,Iro Laina

Main category: cs.CV

TL;DR: 本文提出TRAVL,一种用于提升视频语言模型(VLM)对物理不合理性判断能力的微调方法,并构建ImplausiBench基准来评估模型在视觉-时间层面的物理合理性理解能力。

Details Motivation: 现有视频生成模型常违反物理规律,而当前缺乏有效量化评估视频物理真实性的方法;同时现有VLM在识别物理不一致性方面表现不佳,暴露出其在时序和因果推理上的不足。 Method: 提出TRAVL微调策略,结合平衡数据集与轨迹感知注意力模块以增强VLM的运动编码与判别能力;同时构建去除了语言偏差的ImplausiBench基准,包含150个真实视频和150个生成视频,用于更严格地评估模型的物理合理性判断能力。 Result: 实验表明现有VLM难以识别物理违规现象;TRAVL显著提升了VLM在ImplausiBench上的表现,优于基线模型,并且在人类标注和LLM-as-judge两种评估下均展现出更强的视觉-时序物理推理能力。 Conclusion: TRAVL与ImplausiBench共同构成一个统一框架,可有效探测并提升多模态模型对物理合理性的理解,揭示了视觉-时序推理中一个具有挑战性且尚未充分探索的方向。 Abstract: Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.

[105] Label Semantics for Robust Hyperspectral Image Classification

Rafin Hassan,Zarin Tasnim Roshni,Rafiqul Bari,Alimul Islam,Nabeel Mohammed,Moshiur Farazi,Shafin Rahman

Main category: cs.CV

TL;DR: 提出了一种语义光谱-空间融合网络(S3FN),利用大语言模型生成类别文本描述,结合预训练文本编码器提取语义信息,增强高光谱图像分类性能。

Details Motivation: 由于高光谱图像分类面临训练样本少、数据维度高以及仅依赖光谱-空间单一模态的问题,现有模型易过拟合且性能受限,因此需要引入外部语义信息以提升特征与标签的对齐。 Method: S3FN利用大语言模型为每个类别生成上下文相关的文本描述,并通过BERT或RoBERTa等预训练文本编码器将其嵌入向量空间,实现语义信息与光谱-空间特征的融合,从而改善分类效果。 Result: 在Hyperspectral Wood、HyperspectralBlueberries和DeepHS-Fruit三个基准数据集上均取得显著性能提升,验证了文本语义与光谱-空间数据融合的有效性。 Conclusion: 引入类别相关的文本语义信息可有效增强高光谱图像分类模型的性能,S3FN为多模态、语义增强的分类方法提供了新思路。 Abstract: Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most of HSI classification models are monomodal, where it solely relies on spectral-spatial data to learn decision boundaries in the high dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that captures their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics which in turn leads to a better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets - Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit and report significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Codes are be available in: https://github.com/milab-nsu/S3FN

[106] Cross-Modal Attention Guided Unlearning in Vision-Language Models

Karuna Bhaila,Aneesh Komanduri,Minh-Hao Van,Xintao Wu

Main category: cs.CV

TL;DR: 提出了一种轻量高效的视觉语言模型(VLM)去学习框架CAGUL,用于防止在视觉问答(VQA)任务中泄露敏感信息。

Details Motivation: 大型预训练的视觉语言模型可能在训练过程中记忆并泄露私密或敏感信息,尤其是在视觉和文本双模态输入下,现有去学习方法计算成本高且难以直接应用于VLMs。 Method: 利用跨模态注意力分析视觉token在输出生成中的作用,设计了跨模态注意力引导的去学习方法(CAGUL),通过外部模块修改对查询不重要的视觉token来编码去学习信息,避免修改原始模型参数。 Result: 实验表明,CAGUL在防止信息泄露的同时保持了原始模型性能,效果优于或相当于微调基线方法,且无需重新训练或调整预训练模型参数。 Conclusion: CAGUL是一种实用、高效且低成本的VLM去学习方案,特别适用于需保护视觉与文本敏感信息的VQA场景。 Abstract: Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.

[107] MaizeStandCounting (MaSC): Automated and Accurate Maize Stand Counting from UAV Imagery Using Image Processing and Deep Learning

Dewi Endah Kharismawati,Toni Kazic

Main category: cs.CV

TL;DR: 本文提出了一种名为MaizeStandCounting(MaSC)的算法,利用低成本无人机拍摄的RGB图像实现玉米幼苗自动计数,支持实时操作,适用于科研与生产环境。

Details Motivation: 准确的玉米出苗率对作物管理和研究至关重要,传统人工计数耗时费力且易出错,尤其在大面积或变异较大的田地中。因此需要一种高效、低成本的自动化计数方法。 Method: MaSC采用两种模式:基于拼接图像分块处理和基于同源变换对齐视频帧。两种模式均使用轻量级YOLOv9模型检测V2-V10生长阶段的玉米幼苗,并通过空间分布进行行与列分割,以区分杂草并实现逐行精确计数。 Result: 在2024年夏季试验田中,MaSC与人工计数结果高度一致(拼接图R²=0.616,原始帧R²=0.906),处理83帧全分辨率图像仅用60.63秒,包含推理与后处理。 Conclusion: MaSC是一种可扩展、低成本且准确的自动化玉米出苗计数工具,在实际应用中具有良好的实时性和鲁棒性。 Abstract: Accurate maize stand counts are essential for crop management and research, informing yield prediction, planting density optimization, and early detection of germination issues. Manual counting is labor-intensive, slow, and error-prone, especially across large or variable fields. We present MaizeStandCounting (MaSC), a robust algorithm for automated maize seedling stand counting from RGB imagery captured by low-cost UAVs and processed on affordable hardware. MaSC operates in two modes: (1) mosaic images divided into patches, and (2) raw video frames aligned using homography matrices. Both modes use a lightweight YOLOv9 model trained to detect maize seedlings from V2-V10 growth stages. MaSC distinguishes maize from weeds and other vegetation, then performs row and range segmentation based on the spatial distribution of detections to produce precise row-wise stand counts. Evaluation against in-field manual counts from our 2024 summer nursery showed strong agreement with ground truth (R^2= 0.616 for mosaics, R^2 = 0.906 for raw frames). MaSC processed 83 full-resolution frames in 60.63 s, including inference and post-processing, highlighting its potential for real-time operation. These results demonstrate MaSC's effectiveness as a scalable, low-cost, and accurate tool for automated maize stand counting in both research and production environments.

[108] Quick-CapsNet (QCN): A fast alternative to Capsule Networks

Pouya Shiri,Ramin Sharifi,Amirali Baniasadi

Main category: cs.CV

TL;DR: 本文提出了一种名为Quick-CapsNet(QCN)的快速胶囊网络,通过减少胶囊数量来加速CapsNet的推理过程,在MNIST、F-MNIST、SVHN和Cifar-10数据集上实现了5倍的推理速度提升,仅带来轻微的精度损失,并通过改进解码器进一步提升了性能。

Details Motivation: CapsNet虽然在分类任务中表现优异且对仿射变换更具鲁棒性,但其训练和推理速度较慢,限制了其在实时应用中的使用,因此需要一种更快速的替代方案。 Method: 提出Quick-CapsNet(QCN),通过减少胶囊数量来降低计算复杂度,并采用更强大的解码器以提升性能。 Result: QCN在多个数据集上的推理速度比原始CapsNet快5倍,同时仅造成轻微的准确率下降,并通过改进解码器进一步提升了重建性能。 Conclusion: QCN是一种高效的CapsNet变体,适合作为开发实时应用的起点,在速度与精度之间实现了良好平衡。 Abstract: The basic computational unit in Capsule Network (CapsNet) is a capsule (vs. neurons in Convolutional Neural Networks (CNNs)). A capsule is a set of neurons, which form a vector. CapsNet is used for supervised classification of data and has achieved state-of-the-art accuracy on MNIST digit recognition dataset, outperforming conventional CNNs in detecting overlapping digits. Moreover, CapsNet shows higher robustness towards affine transformation when compared to CNNs for MNIST datasets. One of the drawbacks of CapsNet, however, is slow training and testing. This can be a bottleneck for applications that require a fast network, especially during inference. In this work, we introduce Quick-CapsNet (QCN) as a fast alternative to CapsNet, which can be a starting point to develop CapsNet for fast real-time applications. QCN builds on producing a fewer number of capsules, which results in a faster network. QCN achieves this at the cost of marginal loss in accuracy. Inference is 5x faster on MNIST, F-MNIST, SVHN and Cifar-10 datasets. We also further enhanced QCN by employing a more powerful decoder instead of the default decoder to further improve QCN.

[109] Rectified-CFG++ for Flow Based Models

Shreshth Saini,Shashank Gupta,Alan C. Bovik

Main category: cs.CV

TL;DR: 提出了一种名为Rectified-CFG++的自适应预测-校正引导方法,用于解决基于整流流(RF)模型在使用分类器自由引导(CFG)时出现的严重流形外漂移问题。

Details Motivation: 标准的CFG在应用于整流流模型时会导致严重的离流形漂移,产生视觉伪影、文本不对齐和不稳定行为,限制了生成质量与鲁棒性。 Method: 提出Rectified-CFG++,采用两步策略:首先执行条件RF更新以将样本锚定在学习到的传输路径附近,然后施加加权的条件校正,插值于条件与无条件速度场之间。该方法结合了整流流的确定性效率与几何感知的条件机制。 Result: 理论证明所生成的速度场是边缘一致的,且轨迹保持在数据流形的有界管状邻域内;在Flux、Stable Diffusion 3/3.5和Lumina等大型文本到图像模型上的实验表明,Rectified-CFG++在MS-COCO、LAION-Aesthetic和T2I-CompBench等多个基准上 consistently 优于标准CFG。 Conclusion: Rectified-CFG++有效解决了RF模型中CFG引起的离流形漂移问题,提升了生成稳定性与对齐精度,支持强引导下的鲁棒推理,为高效高质量文本到图像生成提供了新的标准方案。 Abstract: Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/

[110] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment

Shashank Gupta,Gregoire Phillips,Alan C. Bovik

Main category: cs.CV

TL;DR: 提出了一种新的大型多模态模型PIT-QMM,用于无参考点云质量评估(NR-PCQA),结合文本、图像和点云信息,在多个基准上显著优于现有方法,并支持失真定位与识别。

Details Motivation: 现有的图像和视频质量评估进展尚未充分应用于3D点云资产,尤其是无参考质量评估任务。 Method: 构建了一个能端到端处理文本、2D投影和3D点云的多模态模型PIT-QMM,融合多种模态信息进行质量评分预测。 Result: 在主流基准测试上显著优于现有最先进方法,且训练迭代次数更少;同时实现了失真定位与识别功能。 Conclusion: PIT-QMM为无参考点云质量评估提供了高效且可解释的新框架,推动了3D资产质量评估的发展。 Abstract: Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.

[111] Dual-Stream Alignment for Action Segmentation

Harshala Gammulle,Clinton Fookes,Sridha Sridharan,Simon Denman

Main category: cs.CV

TL;DR: 本文提出了一种双流对齐网络(DSA Net),用于动作分割,通过引入学习到的动作特征流来增强性能,并首次将混合量子-经典机器学习框架应用于该任务。

Details Motivation: 现有方法多采用单一流模型处理视频帧序列的时空特征,但难以充分捕捉动作及动作转换线索;因此,研究者探索双流方法以提升动作分割效果。 Method: 提出DSA Net,包含帧级和动作级双流,通过时间上下文(TC)模块中的交叉注意力和基于量子的动作引导调制(Q-ActGM)实现信息融合,并设计双流对齐损失(含关系一致性、跨层对比和循环一致性重建损失)促使两流学习共享特征空间。 Result: 在GTEA、Breakfast、50Salads和EgoProcel等多个基准数据集上实现了最先进的性能,显著优于现有方法,并通过消融实验验证了各组件的有效性。 Conclusion: 双流结构结合特征对齐机制能有效提升动作分割性能,且所提出的量子-经典混合框架为未来研究提供了新方向。 Abstract: Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing

[112] Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection

Yanjie Pan,Qingdong He,Lidong Wang,Bo Peng,Mingmin Chi

Main category: cs.CV

TL;DR: 提出OIE方法,基于首帧服装替换实现高效视频虚拟试穿,兼顾性能、参数和计算效率。

Details Motivation: 现有双分支架构在Diffusion Transformer上的应用面临参数量大、时序特征学习困难的问题。 Method: 采用图像服装迁移模型替换首帧服装,并利用编辑后的首帧内容结合姿态和掩码信息引导视频生成模型逐帧合成后续帧。 Result: 实验表明该方法在参数效率和计算效率上表现优越,同时保持领先的性能。 Conclusion: OIE通过首帧替换策略有效解决了扩散Transformer中虚拟试穿的效率与建模难题,具有良好的实用性。 Abstract: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. Initially, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Subsequently, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.

[113] MONKEY: Masking ON KEY-Value Activation Adapter for Personalization

James Baker

Main category: cs.CV

TL;DR: 提出一种基于自动掩码的二次传递方法,通过限制图像标记仅关注主体来提升个性化扩散模型中文本提示与生成图像的一致性。

Details Motivation: 现有个性化扩散模型在生成图像时容易过度依赖主体图像而忽略文本提示,导致生成结果与期望场景不符。 Method: 利用IP-Adapter自动生成的掩码,在第二次推理过程中屏蔽背景区域的图像标记,使文本提示能更有效地控制非主体区域的内容生成。 Result: 在描述地点和场景的文本提示下,生成图像能准确保留主体并更好匹配提示内容,相比其他测试时个性化方法表现出更高的提示对齐和源图像一致性。 Conclusion: 该方法通过掩码引导的双阶段推理机制,有效平衡了主体保持与文本可控性,提升了个性化图像生成的质量。 Abstract: Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer somewhat when they end up just recreating the subject image, and ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter automatically generates masks that we definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test time personalization methods, and find our method displays high prompt and source image alignment.

[114] Automatic Text Box Placement for Supporting Typographic Design

Jun Muraoka,Daichi Haraguchi,Naoto Inoue,Wataru Shimoda,Kota Yamaguchi,Seiichi Uchida

Main category: cs.CV

TL;DR: 该研究比较了基于Transformer和视觉语言模型(VLM)的方法在广告和网页不完整布局中自动文本框放置的效果,发现标准Transformer模型在包含丰富外观信息时表现更优,但所有方法在处理极小文本或密集布局时仍存在挑战。

Details Motivation: 为了提升广告和网页布局设计中的视觉吸引力与信息传达效率的平衡,研究旨在探索自动化文本框布局的有效方法。 Method: 采用标准Transformer模型、小型视觉语言模型(Phi3.5-vision)、大型预训练VLM(Gemini)以及可处理多图像的扩展Transformer模型,在Crello数据集上进行文本框放置任务的比较评估。 Result: 标准Transformer模型整体优于VLM方法,尤其是在融入更多外观信息时表现突出;但在处理极小文本或高度密集布局时,所有方法性能均下降。 Conclusion: 任务特定架构在自动化布局设计中更具优势,未来改进应关注对小文本和复杂布局的适应能力。 Abstract: In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small Vision and Language Model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show the standard Transformer-based models generally outperform VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.

[115] TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration

Heming Wu,Di Wang,Tai Ma,Peng Zhao,Yubin Xiao,Zhongke Wu,Xing-Ce Wang,Chuang Li,Xuan Wu,You Zhou

Main category: cs.CV

TL;DR: 提出了一种基于特征增强残差模块(FERM)和双阶段阈值控制迭代策略(TCI)的金字塔网络(TCIP),用于提升可变形医学图像配准的精度与自适应性。

Details Motivation: 现有金字塔网络在解码过程中易传播和累积解剖结构错位,且缺乏根据图像变形需求自适应调整优化迭代次数的机制,影响配准精度。 Method: 设计FERM作为每个解码层的核心模块,包含提取解剖语义特征、抑制无关特征和估计形变场三个模块;提出双阶段TCI策略,先评估配准稳定性,再判断收敛性以自适应终止迭代。 Result: 在三个公开脑部MRI数据集和一个腹部CT数据集上实验表明,TCIP在配准精度上优于当前最先进方法,同时保持较快推理速度和较小模型参数量;FERM和TCI具有良好的通用性,可集成到其他配准网络中并有效提升性能。 Conclusion: FERM有效缓解了解剖结构错位的累积问题,TCI实现了迭代过程的自适应控制,二者结合显著提升了医学图像配准的鲁棒性和准确性。 Abstract: Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrades registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability and with asserted stability, it continues with the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.

[116] Controllable Video Synthesis via Variational Inference

Haoyi Duan,Yunzhi Zhang,Yilun Du,Jiajun Wu

Main category: cs.CV

TL;DR: 提出一种高可控性的视频合成方法,通过变分推理和多生成模型集成,实现对指定元素的精确控制和未明确部分的多样性生成。

Details Motivation: 现有视频生成模型通常针对固定输入格式训练,难以满足用户对不同粒度控制的需求,如4D对象轨迹、相机路径或粗略文本提示的混合控制。 Method: 将任务建模为变分推理以逼近组合分布,利用多个视频生成骨干网络共同满足所有约束;通过逐步KL散度最小化和退火分布序列解决优化难题,并提出上下文条件分解技术以减少解空间中的局部最优。 Result: 实验表明,该方法在可控性、多样性和3D一致性方面优于先前方法。 Conclusion: 该方法有效实现了细粒度与粗粒度控制的结合,在保持生成多样性的同时提升了视频生成的精确可控性和三维一致性。 Abstract: Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

[117] Hybrid CNN-BYOL Approach for Fault Detection in Induction Motors Using Thermal Images

Tangin Amir Smrity,MD Zahin Muntaqim Hasan Muhammad Kafi,Abu Saleh Musa Miah,Najmul Hassan,Yuichi Okuyama,Nobuyoshi Asai,Taro Suzuki,Jungpil Shin

Main category: cs.CV

TL;DR: 本文提出了一种结合BYOL与CNN的混合方法,用于基于热成像的感应电机故障分类,提出的新模型BYOL-IMNet在准确率和推理速度上均优于现有模型。

Details Motivation: 感应电机易发生故障,导致过热、能耗增加和停机,早期检测对保护电机和延长寿命至关重要。 Method: 采用BYOL预训练结合多种CNN架构(如ResNet-50、DenseNet等),并设计了一种轻量高效的新型网络BYOL-IMNet,用于热图像的故障分类。 Result: BYOL-IMNet在测试中达到99.89%的准确率,单张图像推理时间仅5.7ms,性能优于当前主流模型。 Conclusion: CNN-BYOL混合方法在感应电机故障检测中表现出高精度和高效性,具备工业在线监测的应用潜力。 Abstract: Induction motors (IMs) are indispensable in industrial and daily life, but they are susceptible to various faults that can lead to overheating, wasted energy consumption, and service failure. Early detection of faults is essential to protect the motor and prolong its lifespan. This paper presents a hybrid method that integrates BYOL with CNNs for classifying thermal images of induction motors for fault detection. The thermal dataset used in this work includes different operating states of the motor, such as normal operation, overload, and faults. We employed multiple deep learning (DL) models for the BYOL technique, ranging from popular architectures such as ResNet-50, DenseNet-121, DenseNet-169, EfficientNetB0, VGG16, and MobileNetV2. Additionally, we introduced a new high-performance yet lightweight CNN model named BYOL-IMNet, which comprises four custom-designed blocks tailored for fault classification in thermal images. Our experimental results demonstrate that the proposed BYOL-IMNet achieves 99.89\% test accuracy and an inference time of 5.7 ms per image, outperforming state-of-the-art models. This study highlights the promising performance of the CNN-BYOL hybrid method in enhancing accuracy for detecting faults in induction motors, offering a robust methodology for online monitoring in industrial settings.

[118] Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision

Xiaoxu Ma,Runhao Li,Zhenyu Weng

Main category: cs.CV

TL;DR: 提出了一种名为MLH(Mutual Learning for Hashing)的新框架,通过弱-强双分支结构提升深度哈希性能,其中中心分支利用配对分支学习的局部相似性信息,并引入混合哈希专家模块增强跨分支交互。

Details Motivation: 中心基方法虽擅长建模全局结构,但往往忽略重要的局部相似性信息,导致性能受限。 Method: 设计双分支结构:一个强中心基分支和一个弱配对基分支,通过迭代互学习机制传递知识,并采用混合哈希专家模块实现有效跨分支交互。 Result: 在多个基准数据集上,MLH consistently超越了当前最先进的哈希方法。 Conclusion: MLH能有效融合局部相似性和全局分布信息,显著提升哈希学习性能。 Abstract: Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.

[119] RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning

Zipeng Guo,Lichen Ma,Xiaolong Fu,Gaojing Zhou,Lan Yang,Yuchen Zhou,Linkai Liu,Yu He,Ximan Liu,Shiping Dong,Jingling Fu,Zhen Chen,Yu Shi,Junshi Huang,Jason Li,Chao Gou

Main category: cs.CV

TL;DR: 提出Repainter,一种结合空间-蒙版轨迹优化和组相对策略优化的强化学习框架,用于电商图像修复,显著优于现有方法。

Details Motivation: 电商图像中的水印和促销文字影响视觉效果,现有扩散模型在去除干扰物时可靠性不足且缺乏领域适应性。 Method: 采用强化学习框架Repainter,通过调节注意力机制增强背景上下文建模,引入复合奖励机制平衡全局、局部和语义约束,并结合EcomPaint-100K数据集与EcomPaint-Bench基准进行训练与评估。 Result: 在复杂场景下显著优于现有最先进方法,有效减少视觉伪影和奖励欺骗,提升对象移除的可靠性。 Conclusion: Repainter在电商图像去干扰任务中表现出优越性能,配合新提出的高质量数据集和基准,推动了实际应用场景下的图像修复发展。 Abstract: In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet the intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose Repainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that Repainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.

[120] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

Wenyue Chen,Peng Li,Wangguandong Zheng,Chengfeng Zhao,Mengfei Li,Yaolong Zhu,Zhiyang Dou,Ronggang Wang,Yuan Liu

Main category: cs.CV

TL;DR: 本文提出SyncHuman,一种结合2D多视角生成模型与3D原生生成模型的新框架,用于从单张图像实现高质量、逼真的着装人体3D重建,尤其在复杂姿态下表现优异。

Details Motivation: 现有方法依赖SMPL估计和条件生成模型,但受限于不准确的3D先验和对复杂姿态、细节重建的不足,难以实现高保真和结构一致的3D人体重建。 Method: 提出SyncHuman框架,通过像素对齐的2D-3D同步注意力机制联合微调2D多视角生成模型和3D原生生成模型,生成几何对齐的3D形状与2D多视角图像;并引入特征注入机制,将2D细节提升至3D形状,增强重建精度与真实感。 Result: 实验表明,SyncHuman在几何准确性与视觉保真度上优于基线方法,能在复杂姿态下单图实现鲁棒且逼真的3D人体重建。 Conclusion: SyncHuman有效融合了2D生成模型的细节表现力与3D生成模型的结构一致性,为未来3D生成模型的发展提供了有前景的方向。 Abstract: Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.

[121] ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes

Jian Gao,Mengqi Yuan,Yifei Zeng,Chang Zeng,Zhihao Li,Zhenyu Chen,Weichao Qiu,Xiao-Xiao Long,Hao Zhu,Xun Cao,Yao Yao

Main category: cs.CV

TL;DR: 提出ComGS框架,实现高质量、实时的3D物体-场景合成,通过Surface Octahedral Probes(SOPs)实现高效可重光照物体重建,并简化环境光照估计,显著提升渲染效率与视觉一致性。

Details Motivation: Gaussian Splatting中的烘焙外观和阴影信息导致物体与场景组合时出现不一致,现有方法在效率和复杂场景下的光照建模方面存在不足,难以实现真实感强且可重光照的3D合成。 Method: 引入Surface Octahedral Probes(SOPs)存储光照与遮挡信息,通过插值实现高效3D查询,避免昂贵的光线追踪;针对光照估计,聚焦于物体放置位置的环境光照,利用360度辐射场重建并微调扩散模型完成光照补全。 Result: 实现了约28 FPS的高质量实时渲染,编辑耗时仅36秒,支持生动阴影与视觉和谐的合成效果,SOPs带来至少2倍的重建加速,并支持高斯场景中的实时阴影计算。 Conclusion: ComGS通过高效的SOPs和简化的局部光照估计,有效解决了Gaussian Splatting中物体-场景合成的光照一致性与效率问题,推动了可重光照3D内容创作的发展。 Abstract: Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.

[122] UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes

Yuang Meng,Xin Jin,Lina Lei,Chun-Le Guo,Chongyi Li

Main category: cs.CV

TL;DR: 本文提出了一种基于单张短曝光RAW图像的超高清动态范围(UHDR)重建方法UltraLED,通过两阶段框架实现曝光校正和低光区域去噪,有效避免了重影和运动模糊,显著优于现有单帧方法。

Details Motivation: 在夜间等高动态范围场景中,亮区和暗区之间存在显著曝光差异,传统RGB多帧合成方法易受错位和重影影响,且难以同时保留高光和阴影细节。而单张短曝光RAW图像已包含足够的高光信息,主要挑战在于暗区的去噪与细节恢复。因此,作者探索是否仅用一张短曝光RAW图像即可实现高质量UHDR重建。 Method: 提出UltraLED,一个两阶段框架:第一阶段通过比率图进行曝光校正以平衡动态范围;第二阶段采用亮度感知的RAW去噪器增强暗区细节恢复。仅使用单张短曝光RAW图像作为输入,并构建了一个9档包围曝光流程来合成真实UHDR数据集用于训练和评估。 Result: 实验表明,UltraLED在多个指标上显著优于现有的单帧UHDR重建方法,能够在复杂夜间场景中有效恢复高光和阴影细节,同时避免了多帧方法常见的重影和运动模糊问题。此外,作者公开了代码和新构建的数据集。 Conclusion: 仅使用单张短曝光RAW图像即可实现高质量的UHDR成像,UltraLED通过结合曝光校正与亮度感知去噪,为动态场景下的UHDR重建提供了一种鲁棒且高效的解决方案。 Abstract: Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.

[123] DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream

Junhao He,Jiaxu Wang,Jia Li,Mingyuan Sun,Qiang Zhang,Jiahang Cao,Ziyi Zhang,Yi Gu,Jingkai Sun,Renjing Xu

Main category: cs.CV

TL;DR: 本文提出了一种结合低帧率RGB图像和高帧率事件流的动态3D高斯点阵重建框架,利用事件流中的运动先验来指导变形场优化,显著提升了大运动场景下的重建效果。

Details Motivation: 由于低帧率RGB视频中帧间大运动导致3D重建不确定性增加,且事件相机缺乏颜色信息,如何有效融合两种模态数据成为挑战。 Method: 提出LoCM无监督微调框架提取事件流中的运动先验,并通过几何感知的数据关联方法建立事件与高斯点之间的运动对应关系,引入运动分解和帧间伪标签策略辅助优化。 Result: 在合成与真实场景上实验表明,该方法优于现有基于图像和事件的方法,能更有效地优化动态3D高斯点阵。 Conclusion: 结合事件流的运动先验可有效约束动态3DGS中的大运动问题,所提框架实现了RGB与事件数据的协同优化,提升了动态场景重建质量。 Abstract: Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.

[124] Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis

Ming Jie Ong,Sze Yinn Ung,Sim Kuan Goh,Jimmy Y. Zhong

Main category: cs.CV

TL;DR: 本研究比较了UNet、ResUNet和Attention UNet三种模型在脑肿瘤分割中的性能,并结合Grad-CAM和注意力可视化等可解释人工智能(XAI)技术提升模型透明度。结果显示ResUNet在Dice、Jaccard、准确率等指标上表现最佳,推荐用于临床脑肿瘤自动分割。

Details Motivation: 提升脑肿瘤MRI图像分割的准确性,并通过可解释人工智能(XAI)增强医生对深度学习模型决策的信任,辅助临床决策。 Method: 采用UNet、Residual UNet(ResUNet)和Attention UNet(AttUNet)对BraTS2020数据集进行脑肿瘤分割,使用Adam优化器训练模型,并通过Grad-CAM和注意力可视化技术进行可解释性分析。评估指标包括训练/验证/推理时间、相似性系数、损失函数及分类性能。 Result: ResUNet在测试阶段表现最优,Dice和Jaccard相似性分数、准确率、召回率和F1分数均优于UNet和Attention UNet;Grad-CAM揭示了模型关注的肿瘤子区域,注意力可视化阐明了Attention UNet的工作机制。 Conclusion: ResUNet是三种模型中性能最好的,结合XAI技术有助于理解模型行为,建议将其用于未来临床中的自动化脑肿瘤分割。 Abstract: The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians' trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores. Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet's attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at https://github.com/ethanong98/MultiModel-XAI-Brats2020

[125] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

Qinghongbing Xie,Zhaoyuan Xia,Feng Zhu,Lijun Gong,Ziyue Li,Rui Zhao,Long Zeng

Main category: cs.CV

TL;DR: 本文提出了GTR-Bench,一个用于评估视觉语言模型在大规模摄像头网络中地理时空推理能力的新基准,揭示了现有模型在空间时间上下文利用、时间预测和地图与多视角视频对齐方面的三大缺陷。

Details Motivation: 现有基准未能全面评估视觉语言模型在结合图像/视频与图形(如地图)上下文下的地理时空智能,限制了交通管理和应急响应等应用的发展。 Method: 提出GTR-Bench,包含多视角切换、跨非重叠视野视频联合推理以及未观测时空区域推断等挑战性任务,构建涵盖地图与视频的多模态地理时空推理评测集。 Result: 在10多个主流VLM上的实验表明,即使表现最好的Gemini-2.5-Pro(34.9%)也远低于人类水平(78.61%),并识别出模型在时空上下文使用不平衡、时间预测能力弱、难以对齐地图与多视角视频三大问题。 Conclusion: GTR-Bench为地理时空智能研究提供了有价值的洞察和新方向,推动视觉语言模型在复杂现实场景中的发展与应用。 Abstract: Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.

[126] FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition

Luu Tu Nguyen,Vu Tram Anh Khuong,Thi Bich Phuong Man,Thi Duyen Ngo,Thanh Ha Le

Main category: cs.CV

TL;DR: 提出了一种新的双相光流表示方法MM-COF和端到端网络FMANet,用于面部微表情识别,显著提升了性能。

Details Motivation: 现有微表情识别方法大多仅利用起始帧到顶点帧的光流,忽略了顶点到结束帧的重要运动信息,导致识别效果受限。 Method: 提出了Magnitude-Modulated Combined Optical Flow (MM-COF)来融合微表情的两个阶段(onset-apex和apex-offset)的运动信息,并设计了FMANet网络,将双相分析和幅度调制模块化为可学习组件,实现自适应特征融合与关键区域关注。 Result: 在MMEW、SMIC、CASME-II和SAMM四个标准数据集上实验表明,所提MM-COF和FMANet优于现有方法。 Conclusion: 引入双相运动信息并将其建模为可学习的光流表示,能有效提升微表情识别性能,验证了可学习双相框架的潜力。 Abstract: Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.

[127] An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images

Kanglin Ning,Ruzhao Chen,Penghong Wang,Xingtao Wang,Ruiqin Xiong,Xiaopeng Fan

Main category: cs.CV

TL;DR: 本文提出了一种基于房间几何约束的360度室内全景深度估计框架,通过布局预测和背景分割机制融合几何信息,显著提升了深度估计精度。

Details Motivation: 现有方法关注像素级精度,导致房间角落过度平滑且对噪声敏感,难以准确恢复球形像素深度。 Method: 提出一个共享特征编码器和任务特定解码器的框架,结合布局估计、深度估计和背景分割;引入基于房间几何的背景深度解析策略和背景分割引导的融合机制。 Result: 在Stanford2D3D、Matterport3D和Structured3D数据集上实验表明,该方法性能显著优于当前开源方法。 Conclusion: 所提出的基于房间几何约束的深度估计框架有效改善了360度室内全景的深度预测质量,尤其在房间结构保持和噪声鲁棒性方面表现优越。 Abstract: Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates those information into the depth estimation process through background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods can achieve significantly superior performance than current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.

[128] Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation

Shohei Enomoto

Main category: cs.CV

TL;DR: 提出ACAEP,结合仿射、颜色和加法视觉提示以及TrivialAugment数据增强,在保持低计算开销的同时显著提升视觉提示的性能和鲁棒性。

Details Motivation: 传统视觉提示方法表达能力有限且易过拟合,导致准确率低于其他适配方法。 Method: 引入仿射变换以生成任务特定区域并保留原始图像信息,结合颜色变换突出任务相关特征,并采用TrivialAugment进行数据增强以缓解过拟合。 Result: 在十二个图像分类数据集上,ACAEP在两种模型架构下均达到最先进的视觉提示性能,平均准确率超过线性探测,并表现出更强的分布外鲁棒性,某些数据集上性能提升高达12个百分点。 Conclusion: 通过增强变换操作和适当的数据增强,视觉提示可实现高表达力与强泛化能力,为参数高效微调提供了更优方案。 Abstract: Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP's expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference.

[129] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

Kaen Kogashi,Anoop Cherian,Meng-Yu Jennifer Kuo

Main category: cs.CV

TL;DR: 本文提出了MMHOI——一个大规模的多人体多物体交互数据集,包含12种日常场景中的图像,并提供了完整的3D姿态和形状标注、78类动作标签及14个交互相关身体部位标签。基于此,作者设计了MMHOI-Net,一种基于Transformer的端到端网络,通过结构化双块表示建模物体及其交互,并结合动作识别提升交互预测性能。在MMHOI和CORE4D数据集上的实验表明该方法在多人体-物体交互建模中达到最优性能。

Details Motivation: 现有3D人体-物体交互(HOI)基准仅涵盖现实世界中复杂交互的一小部分,缺乏对多人体与多物体之间因果性、目标导向或协作性交互的充分建模。因此需要一个更全面的数据集和相应方法来推动下一代HOI研究。 Method: 提出MMHOI数据集,包含多人体多物体的真实场景图像及其精细3D标注;在此基础上构建MMHOI-Net,采用基于Transformer的端到端架构,引入结构化双块表示来建模物体与交互,并融合动作识别以增强交互预测能力。 Result: 在MMHOI和CORE4D数据集上的实验显示,MMHOI-Net在多人体-物体交互建模方面达到了最先进的性能,在准确性和重建质量上均表现优异。 Conclusion: MMHOI为复杂真实场景下的多人体多物体交互研究提供了重要资源,而MMHOI-Net通过创新的表示学习和多任务融合策略,显著提升了3D HOI建模的效果,推动了该领域的发展。 Abstract: Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.

[130] PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting

Houqiang Zhong,Zhenglong Wu,Sihua Fu,Zihan Zheng,Xin Jin,Xiaoyun Zhang,Li Song,Qiang Hu

Main category: cs.CV

TL;DR: 本文提出了PrismGS,一种物理基础的正则化框架,用于改善3D高斯在大规模城市环境中渲染时的失真问题,尤其在4K高分辨率下表现优异。

Details Motivation: 3D高斯点阵(3DGS)在小场景中实现了高质量实时渲染,但在大尺度城市环境中会出现严重走样和优化不稳定问题,现有分治方法无法解决保真度差距。 Method: 提出PrismGS,包含两个协同的正则化策略:金字塔多尺度监督,通过预滤波图像金字塔强制跨尺度一致性;显式的尺寸正则化,为3D高斯设置物理合理的尺寸下限,防止退化。 Result: 在MatrixCity、Mill-19和UrbanScene3D数据集上实验表明,PrismGS相比CityGaussian提升约1.5 dB PSNR,在4K渲染下仍保持高质量与鲁棒性。 Conclusion: PrismGS有效缓解了3D高斯在大规模场景中的走样与不稳定性,提升了渲染保真度,且可即插即用兼容现有流程。 Abstract: 3D Gaussian Splatting (3DGS) has recently enabled real-time photorealistic rendering in compact scenes, but scaling to large urban environments introduces severe aliasing artifacts and optimization instability, especially under high-resolution (e.g., 4K) rendering. These artifacts, manifesting as flickering textures and jagged edges, arise from the mismatch between Gaussian primitives and the multi-scale nature of urban geometry. While existing ``divide-and-conquer'' pipelines address scalability, they fail to resolve this fidelity gap. In this paper, we propose PrismGS, a physically-grounded regularization framework that improves the intrinsic rendering behavior of 3D Gaussians. PrismGS integrates two synergistic regularizers. The first is pyramidal multi-scale supervision, which enforces consistency by supervising the rendering against a pre-filtered image pyramid. This compels the model to learn an inherently anti-aliased representation that remains coherent across different viewing scales, directly mitigating flickering textures. This is complemented by an explicit size regularization that imposes a physically-grounded lower bound on the dimensions of the 3D Gaussians. This prevents the formation of degenerate, view-dependent primitives, leading to more stable and plausible geometric surfaces and reducing jagged edges. Our method is plug-and-play and compatible with existing pipelines. Extensive experiments on MatrixCity, Mill-19, and UrbanScene3D demonstrate that PrismGS achieves state-of-the-art performance, yielding significant PSNR gains around 1.5 dB against CityGaussian, while maintaining its superior quality and robustness under demanding 4K rendering.

[131] IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries

Harsh Kavediya,Vighnesh Nayak,Bheeshm Sharma,Balamurugan Palaniappan

Main category: cs.CV

TL;DR: 提出了一种端到端的框架IsoSignVid2Aud,用于将孤立手语视频序列直接转换为语音,无需中间文本表示,在教育和提示界面中具有应用价值。

Details Motivation: 实现听障和语言障碍人群与他人更直接有效的沟通,避免多阶段翻译系统中的延迟和级联错误。 Method: 结合I3D特征提取模块、专用特征变换网络和音频生成管道,并引入一种新的非极大值抑制(NMS)算法用于在非语法连续序列中进行手势的时间检测。 Result: 在ASL-Citizen-1500和WLASL-100数据集上分别取得72.01%和78.67%的Top-1准确率,语音质量指标PESQ为2.67,STOI为0.73,表明输出语音清晰可懂。 Conclusion: IsoSignVid2Aud能够有效将孤立手语视频直接转化为语音,具备实际应用潜力,且代码已开源。 Abstract: Sign language to spoken language audio translation is important to connect the hearing- and speech-challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01\% and 78.67\%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.

[132] AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views

Yijie Gao,Houqiang Zhong,Tianchi Zhu,Zhengxue Cheng,Qiang Hu,Li Song

Main category: cs.CV

TL;DR: 本文提出了一种名为AlignGS的新框架,通过2D基础模型提取语义先验,并将其作为几何正则化手段,实现几何与语义的协同优化,显著提升了稀疏视角下的室内场景3D重建质量。

Details Motivation: 现有方法在稀疏视角下进行3D重建时,常因几何模糊性导致结果不准确,且语义信息通常被被动地附加在已生成的几何上,缺乏对重建过程的引导作用。因此,需要一种将语义理解作为主动引导力量的方法来提升重建的鲁棒性。 Method: 提出AlignGS框架,利用2D基础模型提取语义先验,并设计了深度一致性与多面法线正则化等语义到几何的引导机制,在端到端训练中联合优化几何与语义。 Result: 在标准数据集上的实验表明,该方法在新视角合成和几何精度方面均达到最先进水平,生成的3D模型更加完整且几何结构更准确。 Conclusion: 将语义先验作为几何正则化工具,能够有效提升稀疏输入下的3D室内场景重建质量,验证了语义与几何协同优化的有效性。 Abstract: The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is avaliable at https://github.com/MediaX-SJTU/AlignGS .

[133] Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials

Thomas Lautenschlager,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Katja Nau,Gaëlle Hayot,Thomas Dickmeis,Ralf Mikut

Main category: cs.CV

TL;DR: 本文探讨了利用自监督学习方法在高通量毒性测试中识别有毒物质诱导变化的有效性,使用公开的EmbryoNet数据集作为概念验证,展示了所学表示能够有效区分不同化合物的作用机制,并讨论了将机器学习模型集成到TOXBOX项目中的物理毒性测试设备中的可能性。

Details Motivation: 高通量毒性测试需要快速且经济高效地评估大量化合物,而自动化评估依赖于机器学习模型。然而,现有方法在识别化合物作用机制方面仍面临挑战,因此需要更有效的表征学习方法。 Method: 采用自监督学习方法从EmbryoNet数据集中学习表征,该数据集包含十种由不同化学化合物引起的斑马鱼胚胎表型,并利用这些学习到的表征来区分化合物的不同作用机制。 Result: 分析表明,通过自监督学习获得的表征能够有效区分不同化合物的作用机制,证明了其在毒性检测中的潜力。 Conclusion: 自监督学习为高通量毒性测试提供了有前景的解决方案,有助于提升毒性评估的自动化水平,并支持未来在实际设备如TOXBOX中的集成应用。 Abstract: High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.

[134] XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method

Haochen Yu,Qiankun Liu,Hongyuan Liu,Jianfei Jiang,Juntao Lyu,Jiansheng Chen,Huimin Ma

Main category: cs.CV

TL;DR: 提出了一种名为XYZCylinder的前馈模型,通过统一的圆柱提升方法,在不同相机配置下实现驾驶场景的高精度三维重建,并具有良好的零样本泛化能力。

Details Motivation: 现有前馈重建方法因固定视角变换和稀疏视图重叠区域小,导致在不同相机配置下的泛化能力和重建精度受限。 Method: 设计了统一圆柱相机建模(UCCM)策略以提升泛化能力,并提出基于圆柱平面特征组(CPFG)的混合表示与专用模块,将2D图像特征提升至3D空间以提高重建精度。 Result: 实验表明,XYZCylinder在多种评估设置下达到最先进性能,并能零样本迁移到其他驾驶场景。 Conclusion: XYZCylinder通过显式相机建模和新型特征提升机制,有效提升了驾驶场景重建的泛化性和准确性,适用于多变的现实驾驶环境。 Abstract: Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the $360^\circ$ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose \textbf{XYZCylinder}, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}.

[135] MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Peiran Wu,Zhuorui Yu,Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen

Main category: cs.CV

TL;DR: 提出了一种基于记忆增强的强化学习视频token压缩方法MARC,能够在显著减少计算资源消耗的同时保持接近基线的性能。

Details Motivation: 现有的无训练视频token压缩方法在压缩过程中容易造成信息丢失和性能下降,且难以应对高帧率和长时长视频带来的高计算成本。 Method: 采用“先检索后压缩”策略,通过视觉记忆检索器(VMR)选择关键片段,并利用基于压缩组相对策略优化(C-GRPO)的强化学习框架进行师生模型间的推理能力蒸馏。 Result: 在六个视频基准上实验表明,仅使用单帧token量即可达到接近基线的准确率,视觉token减少95%,GPU内存降低72%,延迟减少23.9%。 Conclusion: MARC在大幅压缩视觉token的同时有效保留了关键信息,显著降低了计算开销,具有在资源受限场景下实现实时视频理解的应用潜力。 Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose \textbf{Memory-Augmented Reinforcement Learning-based Token Compression (MARC)}, which integrates structured retrieval and RL-based distillation. MARC adopts a \textit{retrieve-then-compress} strategy using a \textbf{Visual Memory Retriever (VMR)} to select key clips and a \textbf{Compression Group Relative Policy Optimization (C-GRPO)} framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens -- reducing visual tokens by \textbf{95\%}, GPU memory by \textbf{72\%}, and latency by \textbf{23.9\%}. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.

[136] ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection

Qunyi Zhang,Songan Zhang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu

Main category: cs.CV

TL;DR: 本文提出了ASBench,首个专门用于评估异常合成方法的综合基准框架,通过四个关键维度系统性地评测现有方法,揭示了当前技术的局限性并为未来研究提供指导。

Details Motivation: 现有的异常检测研究中,异常合成常被视为辅助组件,缺乏对合成方法本身的系统评估,且忽视了与合成相关的关键因素,如与检测性能的解耦、合成数据的量化分析及跨场景适应性。 Method: 提出ASBench框架,引入四个评估维度:(i) 在不同数据集和流程中的泛化性能;(ii) 合成与真实数据的比例;(iii) 合成图像内在指标与检测性能指标的相关性;(iv) 混合异常合成策略。 Result: 通过大量实验,ASBench揭示了当前异常合成方法在泛化性、数据效率和相关性方面的局限性,并验证了不同合成策略的影响。 Conclusion: ASBench为异常合成方法提供了系统、可比较的评估平台,推动该领域从依赖检测任务的隐式评估转向独立、量化的基准测试,为未来研究指明方向。 Abstract: Anomaly detection plays a pivotal role in manufacturing quality control, yet its application is constrained by limited abnormal samples and high manual annotation costs. While anomaly synthesis offers a promising solution, existing studies predominantly treat anomaly synthesis as an auxiliary component within anomaly detection frameworks, lacking systematic evaluation of anomaly synthesis algorithms. Current research also overlook crucial factors specific to anomaly synthesis, such as decoupling its impact from detection, quantitative analysis of synthetic data and adaptability across different scenarios. To address these limitations, we propose ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. Our framework introduces four critical evaluation dimensions: (i) the generalization performance across different datasets and pipelines (ii) the ratio of synthetic to real data (iii) the correlation between intrinsic metrics of synthesis images and anomaly detection performance metrics , and (iv) strategies for hybrid anomaly synthesis methods. Through extensive experiments, ASBench not only reveals limitations in current anomaly synthesis methods but also provides actionable insights for future research directions in anomaly synthesis

[137] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Leigang Qu,Ziyang Wang,Na Zheng,Wenjie Wang,Liqiang Nie,Tat-Seng Chua

Main category: cs.CV

TL;DR: 本文提出了一种无需训练的视频生成框架TTOM,通过测试时优化与记忆机制,提升视频基础模型在组合场景下的文本-图像对齐能力。

Details Motivation: 现有的视频基础模型在处理运动、数量和空间关系等组合场景时表现不佳,难以实现精确的文本-图像对齐。 Method: 引入测试时优化与记忆机制(TTOM),通过优化新参数并结合布局注意力目标,在推理阶段对齐视觉输出与时空布局;同时设计参数化记忆模块以支持流式视频生成中的上下文维护与灵活操作。 Result: 在T2V-CompBench和Vbench基准上取得了优异表现,验证了TTOM在组合视频生成中实现跨模态对齐的有效性、可扩展性和高效性。 Conclusion: TTOM是一种有效且实用的训练-free框架,能够解耦组合性世界知识,具备良好的迁移性和泛化能力,适用于动态组合视频生成任务。 Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.

[138] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

Tianrui Zhang,Yichen Liu,Zilin Guo,Yuxin Guo,Jingcheng Ni,Chenjing Ding,Dan Xu,Lewei Lu,Zehuan Wu

Main category: cs.CV

TL;DR: 提出CVD-STORM,一种基于时空重建VAE的跨视角视频扩散模型,支持多视角、长时程视频生成与4D重建,提升生成质量与几何信息输出。

Details Motivation: 自动驾驶等应用对高保真、多控制条件下的视频生成及深度等几何信息需求增加,现有方法难以兼顾生成质量与3D结构建模能力。 Method: 通过引入辅助4D重建任务微调时空VAE,增强其对3D结构和时序动态的编码能力,并将其集成到视频扩散过程中;同时联合训练高斯溅射解码器以实现动态场景重建。 Result: 在FID和FVD指标上显著优于现有方法,生成视频质量更高,并能有效输出深度等几何信息。 Conclusion: CVD-STORM在多视角长时程视频生成与4D场景重建方面表现优异,为世界模型提供了更强的视觉生成与空间理解能力。 Abstract: Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.

[139] A Large-scale Dataset for Robust Complex Anime Scene Text Detection

Ziyi Dong,Yurui Zhang,Changmao Li,Naomi Rue Golding,Qing Long

Main category: cs.CV

TL;DR: 本文介绍了AnimeText,一个专为动漫场景设计的大规模文本检测数据集,包含73.5万张图像和420万个标注文本块,具有层次化标注和难负样本,显著提升了动漫中文本检测的性能。

Details Motivation: 现有文本检测数据集主要针对自然或文档场景,难以应对动漫中风格多样、排列不规则且易与复杂视觉元素混淆的文本,因此需要专门的数据集来填补这一空白。 Method: 提出AnimeText数据集,包含大规模真实动漫图像,采用层次化标注策略,并引入难负样本以增强模型对复杂场景的鲁棒性。通过跨数据集基准测试评估其有效性。 Result: 实验表明,在AnimeText上训练的模型在动漫文本检测任务中显著优于在现有数据集上训练的模型,验证了该数据集在复杂动漫场景中的有效性与优势。 Conclusion: AnimeText是一个针对动漫场景优化的大规模文本检测数据集,能有效提升现有方法在非规则、复杂视觉环境下的文本检测能力。 Abstract: Current text detection datasets primarily target natural or document scenes, where text typically appear in regular font and shapes, monotonous colors, and orderly layouts. The text usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scene also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. %Cross-dataset evaluations using state-of-the-art methods demonstrate that models trained on AnimeText achieve superior performance in anime text detection tasks compared to existing datasets. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText

[140] SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation

Yifang Yin,Shengkai Chen,Yiyao Li,Lu Wang,Ruibing Jin,Wei Cui,Shili Xiang

Main category: cs.CV

TL;DR: 提出了一种新的降水临近预报框架SimCast和CasCast,通过短到长期的知识蒸馏和加权MSE损失提升预测性能,并在多个基准数据集上显著优于现有方法。

Details Motivation: 现有的非自回归临近预报方法在不同预测时间范围内的表现受限,且确定性模型存在模糊和分布偏移问题,需要更高效准确的预报方法以应对社会多领域需求。 Method: 提出了SimCast训练流程,采用短到长期知识蒸馏技术和加权MSE损失函数来优化预测;进一步将SimCast嵌入扩散模型框架CasCast中,结合概率模型优势缓解确定性输出的局限性。 Result: 在SEVIR、HKO-7和MeteoNet三个基准数据集上取得平均CSI分数分别为0.452、0.474和0.361,显著优于现有方法,且推理无额外开销。 Conclusion: SimCast和CasCast框架有效提升了降水临近预报的准确性与可靠性,兼具确定性模型效率和概率模型表达能力,具有广泛的应用前景。 Abstract: Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complementary to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.

[141] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement

Yidi Liu,Xueyang Fu,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Zheng-Jun Zha

Main category: cs.CV

TL;DR: 本文提出Latent Harmony,一种用于超高清图像恢复的两阶段框架,通过联合正则化潜在空间和高频感知重建,在保持计算效率的同时显著提升细节还原能力。

Details Motivation: 传统VAE因高斯约束丢失退化相关的高频信息,导致UHD图像恢复中重建保真度下降,需改进以兼顾效率与细节保留。 Method: 第一阶段提出LH-VAE,引入视觉语义约束、渐进退化扰动和潜在等变性增强语义鲁棒性和高频重建;第二阶段通过HF-LoRA联合训练VAE与恢复模型,采用保真导向和感知导向损失分别恢复细节与纹理,并利用交替优化和选择性梯度传播保持预训练结构。 Result: 实验表明,Latent Harmony在UHD及标准分辨率任务上均达到SOTA性能,有效平衡了效率、感知质量和重建精度。 Conclusion: Latent Harmony通过两阶段协同优化,解决了VAE在UHD图像恢复中高频信息丢失的问题,实现了高效且高质量的图像重建。 Abstract: Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction.In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction.Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure.At inference, a tunable parameter {\alpha} enables flexible fidelity-perception trade-offs.Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.

[142] The impact of abstract and object tags on image privacy classification

Darya Baranouskaya,Andrea Cavallaro

Main category: cs.CV

TL;DR: 本文探讨了在图像隐私分类任务中,抽象标签和物体标签的有效性,发现当标签数量有限时,抽象标签更有效,而当标签数量较多时,物体标签同样有用。

Details Motivation: 研究在上下文依赖且主观的图像隐私任务中,不同类型的标签(物体标签与抽象标签)的作用,以指导未来更准确的图像隐私分类器开发。 Method: 通过比较在不同标签预算下物体标签和抽象标签在隐私分类中的表现,分析两类标签的有效性。 Result: 在标签数量受限时,抽象标签比物体标签更有效;但在标签数量较多时,物体标签的效果与抽象标签相当。 Conclusion: 标签类型和数量对图像隐私分类性能有显著影响,该发现可为未来基于标签的隐私分类研究提供指导。 Abstract: Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.

[143] Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN

Chandresh Sutariya,Nitin Singh

Main category: cs.CV

TL;DR: 本文比较了SwinIR(Transformer模型)与轻量级CNN在低光照图像增强任务中的性能与效率,发现尽管SwinIR性能略优,但轻量级CNN在更少训练轮数、更小模型尺寸下达到了接近的性能,具有更高的效率。

Details Motivation: 探索在低光照图像恢复中性能与计算效率之间的权衡,特别是在资源受限的实际应用场景中轻量级模型的可行性。 Method: 将当前最先进的SwinIR模型与标准的轻量级卷积神经网络(CNN)在相同任务上进行对比实验,评估其PSNR、训练收敛速度和模型大小。 Result: CNN在仅训练10个epoch后达到37.4 dB的PSNR,而SwinIR在132个epoch后达到39.03 dB;CNN模型大小比SwinIR小55倍以上。 Conclusion: 标准CNN可在显著降低计算开销的前提下实现接近最先进水平的性能,因而在资源受限场景中更具应用优势。 Abstract: The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.

[144] GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network

Gaurvi Goyal,Pham Cong Thuong,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi

Main category: cs.CV

TL;DR: 本文提出了一种基于图神经网络GraphEnet的事件相机人体姿态估计方法,首次将图神经网络应用于事件数据进行2D人体姿态估计,利用事件数据的稀疏性和基于线的中间表示,在高频率下实现高效、低能耗的人体姿态估计。

Details Motivation: 由于事件相机具有低延迟和低功耗的优势,适合资源受限的应用场景,但目前缺乏有效的基于事件数据的人体姿态估计方法,因此本文旨在填补这一空白。 Method: 提出GraphEnet模型,采用图神经网络处理事件相机输出的稀疏数据,引入基于线的事件表示,并结合偏移向量学习范式与置信度池化机制来估计单人2D人体姿态。 Result: 实现了高频率的2D人体姿态估计,有效利用了事件数据的特性,在低功耗条件下表现出良好的性能。 Conclusion: GraphEnet是首个将图神经网络应用于事件数据进行人体姿态估计的工作,展示了事件相机在高时效性、低能耗场景下的潜力,为未来相关应用提供了新思路。 Abstract: Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise in deep learning technology, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy advantages that make them ideal for applications where those resources are constrained like portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line based event representation, to estimate 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.

[145] CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

Weihuang Lin,Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji

Main category: cs.CV

TL;DR: 本文提出了CIR-CoT,首个面向检索任务的端到端多模态大语言模型,通过引入显式的思维链(CoT)推理机制,提升图像检索的准确性和可解释性。

Details Motivation: 现有基于VLM和MLLM的图像检索方法多为黑箱模型,缺乏可解释性,难以遵循复杂细粒度指令,限制了其在实际应用中的可信度和性能。 Method: 提出CIR-CoT模型,强制模型生成结构化的思维链(包括描述、推理和结论),并在新构建的带CoT标注的数据上进行微调,最终将推理结果编码为专用嵌入用于检索。 Result: 在FashionIQ、CIRR等数据集上达到具有竞争力的性能,并在外域数据集CIRCO上表现出显著的泛化能力。 Conclusion: CIR-CoT通过显式推理过程实现了更高效且可信赖的图像检索,为检索系统提供了新的发展方向。 Abstract: Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as ``black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.

[146] RayFusion: Ray Fusion Enhanced Collaborative Visual Perception

Shaohong Wang,Bin Lu,Xinyu Xiao,Hanzhi Zhong,Bowen Pang,Tong Wang,Zhiyu Xiang,Hangguan Shan,Eryun Liu

Main category: cs.CV

TL;DR: 提出了一种基于射线的融合方法RayFusion,利用协作车辆的射线占用信息来减少相机射线上的冗余和误检,提升纯视觉协同感知系统的3D目标检测性能。

Details Motivation: 由于缺乏显式的深度信息,基于相机的感知系统在3D目标检测中存在深度估计模糊的问题,限制了协同感知的准确性。 Method: 提出RayFusion,通过融合协作车辆提供的射线占用信息,在相机射线上抑制冗余和误检,从而优化检测结果。 Result: 实验表明,该方法在多个数据集上持续优于现有的最先进模型,显著提升了协同视觉感知的性能。 Conclusion: RayFusion有效缓解了纯视觉系统中的深度模糊问题,为基于相机的协同3D目标检测提供了高效且实用的解决方案。 Abstract: Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.

[147] RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans

Bheeshm Sharma,Karthikeyan Jaganathan,Balamurugan Palaniappan

Main category: cs.CV

TL;DR: 提出了一种名为RASALoRE的弱监督异常检测框架,结合判别性双提示调优和区域感知空间注意力机制,在仅使用切片级标签的情况下实现了脑MRI异常检测的最先进性能。

Details Motivation: 在缺乏精确像素级标注的情况下,利用弱标签(如切片级标签)实现脑MRI中异常的快速准确检测是一个重要挑战。 Method: 采用两阶段框架:第一阶段通过判别性双提示调优(DDPT)生成伪弱掩码作为粗略定位线索;第二阶段使用基于位置随机嵌入的区域感知空间注意力机制进行分割。 Result: 在BraTS20、BraTS21、BraTS23和MSD数据集上显著优于现有方法,检测性能大幅提升且计算复杂度显著降低,模型参数少于800万。 Conclusion: RASALoRE有效解决了弱监督下脑MRI异常检测的难题,在性能和效率之间取得了良好平衡,具有较强的实用价值。 Abstract: Weakly Supervised Anomaly detection (WSAD) in brain MRI scans is an important challenge useful to obtain quick and accurate detection of brain anomalies when precise pixel-level anomaly annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: https://github.com/BheeshmSharma/RASALoRE-BMVC-2025/.

[148] RetouchLLM: Training-free White-box Image Retouching

Moon Ye-Bin,Roy Miles,Tae-Hyun Oh,Ismail Elezi,Jiankang Deng

Main category: cs.CV

TL;DR: 提出RetouchLLM,一种无需训练的白盒图像润饰系统,通过可执行代码实现高分辨率图像的可解释、可控润饰。

Details Motivation: 现有基于学习的方法依赖大规模配对数据且为黑盒模型,难以适应多样化的用户或图像特定调整需求。 Method: 构建包含视觉批评模块和代码生成模块的框架,视觉批评模块识别输入图像与参考图像的差异,代码生成模块生成可执行的润饰代码,逐步进行多步润饰。 Result: 实验表明该方法能良好泛化到多种润饰风格,支持基于自然语言的用户交互,实现可解释且可控的个性化调整。 Conclusion: RetouchLLM实现了无需训练、可解释、灵活可控的图像润饰,优于传统黑盒模型在适应性和透明性方面的局限。 Abstract: Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.

[149] A class-driven hierarchical ResNet for classification of multispectral remote sensing images

Giulio Weikmann,Gianmarco Perantoni,Lorenzo Bruzzone

Main category: cs.CV

TL;DR: 提出一种多时相、类驱动的分层残差神经网络(ResNet),用于在不同语义层级上对多光谱图像时间序列进行分类,通过引入分支结构和层次惩罚机制提升分类一致性与细粒度识别能力。

Details Motivation: 为了提升多光谱图像时间序列在不同语义层级上的分类精度,尤其是细粒度类别(微类)的识别,并解决训练样本有限情况下的模型泛化问题。 Method: 设计了一种改进的分层ResNet架构,引入额外分支进行多层级分类,利用层次惩罚图约束分类过程中的层级转移,实现从宏观类到微观类的渐进式学习,并支持通过微调进行模型扩展。 Result: 在亚马逊森林区域的Sentinel-2时间序列数据上实验表明,该方法在不同层级均具有良好泛化能力,显著提升了微类别的分类精度,尤其改善了少数类的表征效果。 Conclusion: 所提出的模块化分层网络能有效建模语义层次结构,提升时间序列分类性能,具备良好的可扩展性和适应性,适用于细粒度地表覆盖分类任务。 Abstract: This work presents a multitemporal class-driven hierarchical Residual Neural Network (ResNet) designed for modelling the classification of Time Series (TS) of multispectral images at different semantical class levels. The architecture consists of a modification of the ResNet where we introduce additional branches to perform the classification at the different hierarchy levels and leverage on hierarchy-penalty maps to discourage incoherent hierarchical transitions within the classification. In this way, we improve the discrimination capabilities of classes at different levels of semantic details and train a modular architecture that can be used as a backbone network for introducing new specific classes and additional tasks considering limited training samples available. We exploit the class-hierarchy labels to train efficiently the different layers of the architecture, allowing the first layers to train faster on the first levels of the hierarchy modeling general classes (i.e., the macro-classes) and the intermediate classes, while using the last ones to discriminate more specific classes (i.e., the micro-classes). In this way, the targets are constrained in following the hierarchy defined, improving the classification of classes at the most detailed level. The proposed modular network has intrinsic adaptation capability that can be obtained through fine tuning. The experimental results, obtained on two tiles of the Amazonian Forest on 12 monthly composites of Sentinel 2 images acquired during 2019, demonstrate the effectiveness of the hierarchical approach in both generalizing over different hierarchical levels and learning discriminant features for an accurate classification at the micro-class level on a new target area, with a better representation of the minoritarian classes.

[150] Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces

Junyu Shi,Minghui Li,Junguo Zuo,Zhifei Yu,Yipeng Lin,Shengshan Hu,Ziqi Zhou,Yechao Zhang,Wei Wan,Yinzhe Xu,Leo Yu Zhang

Main category: cs.CV

TL;DR: 本文提出了一个面向真实世界的深度伪造人脸数据集RedFace,包含超过60,000张伪造图像和1,000个操纵视频,利用9个商业平台和定制算法生成,以更真实地模拟现实中的深度伪造场景。实验表明现有检测方法在该数据集上表现有限,凸显其实际应用中的不足。

Details Motivation: 现有的深度伪造检测评估缺乏真实性、多样性和对现实世界技术的覆盖,难以反映实际应用中的挑战,因此需要一个更贴近真实场景的数据集来桥接学术研究与现实需求之间的差距。 Method: 构建了一个名为RedFace的新型深度伪造数据集,通过9个在线商业平台获取最新‘野外’深度伪造技术,并结合定制算法生成多样化的人脸伪造内容,涵盖图像和视频模态,模拟真实黑盒攻击场景。 Result: 在RedFace上进行的大量实验(包括跨域、域内及社交网络传播模拟)表明,现有深度伪造检测方法性能显著下降,验证了其在现实应用中的局限性;同时分析了RedFace影响检测性能的原因。 Conclusion: RedFace能更真实地反映现实世界深度伪造的复杂性和多样性,为检测技术提供了更具挑战性的基准,推动未来研究向更具鲁棒性和实用性的方向发展。 Abstract: Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation techniques.To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios.Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. Our dataset is available at: https://github.com/kikyou-220/RedFace.

[151] Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang,ZiHao Lian,Jiahao Yang,Daiyuan Li,Guoxuan Pang,Feng Liu,Bo Han,Shutao Li,Mingkui Tan

Main category: cs.CV

TL;DR: 本文提出了一种基于物理驱动的AI生成视频检测方法NSG-VD,利用概率流守恒原理设计了归一化时空梯度(NSG)统计量,并结合最大均值差异进行检测,在召回率和F1分数上显著优于现有方法。

Details Motivation: 随着AI生成视频在视觉真实性上的飞速发展(如Sora),迫切需要可靠的检测机制;然而,现有方法难以建模高维时空动态并捕捉违反物理规律的细微异常。 Method: 提出归一化时空梯度(NSG)来量化空间概率梯度与时间密度变化的比值,利用预训练扩散模型估计NSG,结合运动感知的时间建模,无需复杂运动分解;基于NSG特征计算测试视频与真实视频之间的最大均值差异(MMD)作为检测指标。 Result: NSG-VD在实验中比现有最优方法提升了16.00%的召回率和10.75%的F1分数,并推导出生成视频与真实视频在NSG特征距离上的上界,证明生成视频因分布偏移而表现出更大的差异。 Conclusion: NSG-VD通过引入物理驱动的NSG统计量,有效检测AI生成视频,兼具理论保证和优越性能,为视频鉴伪提供了新范式。 Abstract: AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.

[152] DarkHash: A Data-Free Backdoor Attack Against Deep Hashing

Ziqi Zhou,Menghao Deng,Yufei Song,Hangtao Zhang,Wei Wan,Shengshan Hu,Minghui Li,Leo Yu Zhang,Dezhong Yao

Main category: cs.CV

TL;DR: 本文提出了DarkHash,首个针对深度哈希模型的无数据后门攻击方法,通过双语义指导的影子后门框架,在无需访问训练数据的情况下实现高效攻击,同时保持原始检索性能。

Details Motivation: 现有深度哈希模型的后门攻击依赖于访问训练数据,但在实际中由于隐私和知识产权保护,获取这些数据往往不可行。因此,研究无需训练数据即可植入后门的方法具有重要意义。 Method: 提出DarkHash,设计了一种基于替代数据集的双语义引导影子后门攻击框架,仅微调受害者模型的特定层,并引入拓扑对齐损失,优化个体及邻近中毒样本向目标样本对齐,增强攻击效果。 Result: 在四个图像数据集、五种模型架构和两种哈希方法上的实验表明,DarkHash显著优于现有的最先进后门攻击方法,且能有效抵御主流防御手段。 Conclusion: DarkHash实现了无需原始训练数据的高效深度哈希后门攻击,在保持原任务检索精度的同时展现出强大的攻击能力和鲁棒性,为深度哈希安全提出了新的挑战。 Abstract: Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.

[153] Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting

Ankit Gahlawat,Anirban Mukherjee,Dinesh Babu Jayagopi

Main category: cs.CV

TL;DR: 提出一种基于3D高斯点阵的标签优化流程,通过多视角一致性生成高质量分割掩码,显著提升极端视角下的人脸解析精度。

Details Motivation: 在极端视角下,由于标注数据有限且人工标注成本高,人脸解析面临挑战。 Method: 利用3D高斯点阵(3DGS)联合拟合RGB图像和初始分割图,通过共享几何结构实现多视角一致性,并生成姿态多样的训练数据。 Result: 在无需真实3D标注的情况下,仅用少量初始图像即可显著提升模型在极端视角下的解析准确率,同时保持对标准视角的良好性能。 Conclusion: 该方法为现实场景中提升人脸解析的鲁棒性提供了一种可扩展且有效的解决方案。 Abstract: Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.

[154] Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation

Eirik A. Østmo,Kristoffer K. Wickstrøm,Keyur Radiya,Michael C. Kampffmeyer,Karl Øyvind Mikalsen,Robert Jenssen

Main category: cs.CV

TL;DR: 提出了一种针对CT图像的特定增强技术“随机窗宽”(Random windowing),以解决传统数据增强方法在CT图像上导致的伪影和泛化性能差的问题,显著提升了肝脏肿瘤分割模型在低对比度图像上的表现。

Details Motivation: 传统用于自然图像的数据增强方法在CT图像上可能破坏Hounsfield Units(HU)的物理意义,导致模型训练出现伪影和泛化能力差,因此需要一种适用于CT模态的增强方法。 Method: 提出名为Random windowing的CT专用增强技术,利用CT图像中HU强度分布进行数据增强,增强模型对对比度变化的鲁棒性。 Result: 在多个数据集上进行消融实验和分析,该方法优于现有最先进方法,尤其在肝脏肿瘤分割任务中显著提升模型在低对比或时序不佳图像上的性能。 Conclusion: Random windowing是一种有效的CT专用数据增强策略,能够提升深度学习模型在有限医学数据下的泛化能力和分割性能。 Abstract: Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.

[155] Real-Time Motion-Controllable Autoregressive Video Diffusion

Kesen Zhao,Jiaxin Shi,Beier Zhu,Junbao Zhou,Xiaolong Shen,Yuan Zhou,Qianru Sun,Hanwang Zhang

Main category: cs.CV

TL;DR: 提出AR-Drag,首个结合强化学习的少步自回归视频扩散模型,实现高质量、低延迟的实时图像到视频生成,并支持多样化运动控制。

Details Motivation: 现有自回归视频扩散模型在少步生成中存在质量下降和运动伪影问题,且缺乏有效的运动控制方法,难以满足实时性要求。 Method: 首先微调基础I2V模型以支持基本运动控制,然后通过基于轨迹奖励模型的强化学习进一步优化;引入Self-Rollout机制保持马尔可夫性质,并通过选择性引入去噪步骤中的随机性加速训练。 Result: AR-Drag在仅1.3B参数下显著降低延迟,相比最先进的运动可控VDM表现更优,实现实时生成,同时保持高视觉保真度和精确的运动对齐。 Conclusion: AR-Drag为实时、可控的视频生成提供了高效解决方案,推动了少步自回归视频扩散模型的发展。 Abstract: Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.

[156] Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Chengzhi Li,Heyan Huang,Ping Jian,Zhen Yang,Yaning Tian

Main category: cs.CV

TL;DR: 本文提出了一种基于注意力增强的时序条件注意力锐化方法(TCAS),以提升视频语言模型(Video-LLMs)在回答重述问题时的时序逻辑一致性,通过可解释性分析验证了该方法能有效提高跨模态注意力头的时间区分能力。

Details Motivation: 视频语言模型在面对基于同一视频的不同表述问题时常常产生自相矛盾的回答,影响其可靠性,但其根本原因尚不明确,因此需要深入分析并解决这一不一致现象。 Method: 采用可解释性驱动的方法,统计分析导致响应不一致的潜在因素,并提出一种名为时序条件注意力锐化(TCAS)的注意力增强方法,通过构建基于注意力差异的优化目标来增强模型的时间分辨率能力。 Result: 实验结果表明,TCAS显著提升了Video-LLMs的时间逻辑一致性;可解释性分析显示该方法增强了注意力头对不同时戳视频令牌的区分能力,并在通用视频时序定位任务中也带来了性能提升。 Conclusion: 时间逻辑一致性是视频时序理解的关键瓶颈,通过TCAS增强注意力机制可有效提升模型的一致性和整体时序理解能力。 Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.

[157] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Shian Du,Menghan Xia,Chang Liu,Quande Liu,Xintao Wang,Pengfei Wan,Xiangyang Ji

Main category: cs.CV

TL;DR: 提出UniMMVSR,首个支持文本、图像和视频多模态条件的级联视频超分辨率框架,在细节质量和条件一致性上显著优于现有方法,并实现4K多模态生成。

Details Motivation: 现有视频超分辨率方法主要局限于文本到视频任务,缺乏利用多种生成条件(如图像、视频)的能力,难以满足多模态视频生成中对高保真度的需求。 Method: 构建基于潜在视频扩散模型的统一框架UniMMVSR,探索多模态条件注入策略、训练方案和数据混合技术,设计针对性的数据构造与条件使用方法以适配不同模态条件与目标视频的关联差异。 Result: 实验表明UniMMVSR在生成细节和多模态条件一致性方面显著优于现有方法,并成功与基础模型结合实现4K多模态引导视频生成。 Conclusion: UniMMVSR是首个支持混合模态条件的生成式视频超分辨率框架,有效提升多模态视频生成的质量与灵活性,推动了高分辨率视频生成的发展。 Abstract: Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.

[158] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing

Zhentao Zou,Zhengrong Yue,Kunpeng Du,Binlei Bao,Hanting Li,Haizhen Xie,Guozheng Xu,Yue Zhou,Yali Wang,Jie Hu,Xue Jiang,Xinghao Chen

Main category: cs.CV

TL;DR: 本文提出了一种名为MURE的多模态推理编辑框架,通过文本与视觉线索交织的推理链(interleaved text-image CoT)提升图像编辑的精细度和空间准确性,并引入MMDC推理范式减少大模型幻觉,显著提升了复杂编辑任务的效果。

Details Motivation: 现有基于自然语言的图像编辑方法在处理复杂对象交叠和细粒度空间关系时表现不佳,主要因为缺乏显式的多模态推理过程,纯文本或坐标增强的思维链难以准确表达视觉布局。 Method: 提出MURE框架,采用文本与视觉线索(如位置掩码、新内容表示)交替的多模态思维链进行逐步推理;引入MMDC范式,通过奖励模型打分剪枝低质量推理路径,确保高保真编辑轨迹。 Result: 在三个图像编辑基准上取得显著性能提升,实现了更精确的子任务分解和高质量的编辑结果,并发布了包含14K样本的CoT-Edit-14K数据集。 Conclusion: MURE通过融合文本与视觉的交错推理,有效解决了复杂图像编辑中的空间关系建模问题,结合MMDC机制提升了推理可靠性,为语言引导的图像编辑提供了新的高效范式。 Abstract: Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defined intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.

[159] Robust Canonicalization through Bootstrapped Data Re-Alignment

Johann Schmidt,Sebastian Stober

Main category: cs.CV

TL;DR: 提出一种迭代重对齐训练样本的自举算法,用于细粒度视觉分类中的规范化,该方法在四个基准上表现优于等变和规范化基线,并与数据增强方法性能相当。

Details Motivation: 现有方法依赖于强数据增强或等变架构,存在模型复杂度高或表达能力受限的问题;而基于规范化的先验方法需要对齐的训练数据,但在真实世界数据集中这一假设往往不成立,导致规范化器脆弱。 Method: 提出一种自举算法,通过迭代方式逐步减少方差并恢复对齐假设,从而实现训练样本的渐进式重对齐,适用于任意紧群并在温和条件下具有收敛保证。 Result: 在四个细粒度视觉分类基准上验证了该方法的有效性,结果表明其 consistently 优于等变方法和规范化基线,且性能与数据增强方法相当。 Conclusion: 所提出的自举规范化方法能够在不依赖严格对齐数据的情况下有效处理几何偏差和噪声,在保持模型鲁棒性的同时提升了细粒度分类性能。 Abstract: Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant, and canonicalization baselines while performing on par with augmentation.

[160] InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing

Haoran Yu,Yi Shi

Main category: cs.CV

TL;DR: 提出InstructUDrag,结合文本指令与对象拖拽,实现扩散模型下的高保真、精确图像编辑。

Details Motivation: 现有文本编辑方法难以精确定位对象,而对象拖拽仅支持静态移动,缺乏语义控制。 Method: 将对象拖拽视为图像重建过程,设计双分支框架:移动重建分支利用基于能量的梯度引导精确定位,文本编辑分支共享梯度信号以实现语义属性控制,并结合DDPM反演和先验注入保持结构。 Result: 实验证明该方法在对象重定位精度和语义编辑能力上均优于现有方法,支持灵活、高保真的图像编辑。 Conclusion: InstructUDrag有效融合文本指令与对象拖拽,实现了兼具精确性与语义可控性的图像编辑,拓展了扩散模型在交互式编辑中的应用。 Abstract: Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.

[161] Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction

Mu Li,Yin Wang,Zhiying Leng,Jiapeng Liu,Frederick W. B. Li,Xiaohui Liang

Main category: cs.CV

TL;DR: 提出了一种细粒度的双人运动生成方法FineDual,通过三阶段模型从个体到个体间动态建模人类交互的层次性与距离变化。

Details Motivation: 现有方法大多忽略交互中的距离变化和层次结构,难以准确建模双人运动中的动态与层级特性。 Method: 采用三阶段方法:第一阶段利用大语言模型分解文本并对齐个体特征;第二阶段通过交互距离预测器和图网络动态建模个体间交互;第三阶段利用整体文本特征指导运动优化。 Result: 在双人运动数据集上的实验表明,FineDual在定量和定性评估中均优于现有方法,能更有效地生成高质量、细粒度的双人交互运动。 Conclusion: FineDual通过建模动态层次交互,在双人运动生成任务中实现了更真实、协调的动作合成,验证了考虑距离与层次结构的重要性。 Abstract: Human interaction is inherently dynamic and hierarchical, where the dynamic refers to the motion changes with distance, and the hierarchy is from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, while existing methods almost model human interaction temporally invariantly, ignoring distance and hierarchy. To address it, we propose a fine-grained dual-human motion generation method, namely FineDual, a tri-stage method to model the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.

[162] Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification

Chenying Liu,Gianmarco Perantoni,Lorenzo Bruzzone,Xiao Xiang Zhu

Main category: cs.CV

TL;DR: 提出了一种针对遥感图像的单正例多标签学习(SPML)新框架AdaGC,通过自适应梯度校准和伪标签生成机制,在两种基准数据集上实现了最先进的性能。

Details Motivation: 由于遥感图像标注复杂且成本高,完全标注多标签数据困难,因此需要在仅有一个正标签的情况下进行多标签分类,现有SPML方法在遥感领域研究不足。 Method: 提出Adaptive Gradient Calibration (AdaGC),结合梯度校准机制、Mixup和双指数移动平均(EMA)模块生成鲁棒伪标签,并设计自适应触发机制基于训练动态在预热阶段后激活梯度校准。 Result: 在两个遥感基准数据集和两种标签噪声类型下,AdaGC均实现了最先进的性能,表现出强鲁棒性。 Conclusion: AdaGC是一种有效且通用的遥感SPML框架,能显著缓解监督歧义和过拟合问题,为遥感图像多标签分类提供了新的解决方案。 Abstract: Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.

[163] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Haipeng Liu,Yang Wang,Meng Wang

Main category: cs.CV

TL;DR: 本文提出了一种名为NTN-Diff的新型扩散模型,用于文本引导的图像修复,通过在不同频率带中解耦语义一致性并保留未遮罩区域,显著提升了修复效果。

Details Motivation: 现有方法在保持未遮罩区域和实现遮罩与未遮罩区域间的语义一致性方面存在不足,主要由于混合频率带的纠缠导致对文本提示的鲁棒性不一。 Method: 提出NTN-Diff模型,将去噪过程分为早期和晚期阶段,在去噪过程中解耦中低频带;利用稳定的中频带指导无文本去噪处理低频带,并在后期进行文本引导去噪,以实现跨区域的语义一致性并保留未遮罩内容。 Result: 实验表明,NTN-Diff在文本引导图像修复任务上优于当前最先进的扩散模型,有效兼顾了未遮罩区域的保真度和整体语义一致性。 Conclusion: NTN-Diff通过频率感知的去噪策略,成功解决了文本引导图像修复中的关键挑战,为未来研究提供了新的思路。 Abstract: Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.

[164] A Multimodal Depth-Aware Method For Embodied Reference Understanding

Fevziye Irem Eyiokur,Dogucan Yaman,Hazım Kemal Ekenel,Alexander Waibel

Main category: cs.CV

TL;DR: 提出了一种新的ERU框架,结合LLM数据增强、深度图模态和深度感知决策模块,有效提升在复杂环境中基于语言和指向线索的参考对象理解能力。

Details Motivation: 现有方法在存在多个候选对象的模糊场景中表现不佳,难以准确识别目标物体。 Method: 提出一种新型ERU框架,联合利用基于大语言模型(LLM)的数据增强、深度图模态以及深度感知决策模块,实现语言与具身线索的鲁棒融合。 Result: 在两个数据集上的实验表明,该方法显著优于现有基线方法,实现了更准确和可靠的指代表达理解。 Conclusion: 所提出的ERU框架通过多模态信息融合,在复杂或杂乱环境中的参考对象理解任务上表现出优越性能。 Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.

[165] Learning Neural Exposure Fields for View Synthesis

Michael Niemeyer,Fabian Manhardt,Marie-Julie Rakotosaona,Michael Oechsle,Christina Tsalicoglou,Keisuke Tateno,Jonathan T. Barron,Federico Tombari

Main category: cs.CV

TL;DR: 本文提出了Neural Exposure Fields (NExF),一种用于从具有强烈曝光变化的现实世界图像中鲁棒重建高质量、3D一致外观场景的新方法。

Details Motivation: 现有神经场景表示在处理包含每张图像曝光变化(如室内外混合场景或带窗户的房间)的真实数据时,重建质量显著下降。 Method: 提出一种新的神经场,为每个3D点预测最优曝光值,并与神经场景表示联合优化;引入新的神经条件机制实现3D空间中的曝光优化。 Result: 在多个真实世界挑战性数据集上实现了优于先前方法的结果,训练速度更快,在最佳基线上提升超过55%。 Conclusion: NExF能有效应对高动态范围场景下的视图合成问题,无需后期处理或多曝光捕获,显著提升了复杂光照条件下3D重建的质量和一致性。 Abstract: Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.

[166] LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

Cilin Yan,Jingyun Wang,Guoliang Kang

Main category: cs.CV

TL;DR: 本文提出了一种有效的长程时序上下文注意力机制(LTCA),用于指代表达视频分割(RVOS),在四个基准上实现了最先进的性能。

Details Motivation: 现有方法在处理视频中的语言表达与视觉内容交互时,难以平衡局部性和全局性,且计算复杂度随视频长度显著增加。 Method: 提出长程时序上下文注意力(LTCA)机制,通过堆叠稀疏局部注意力(膨胀窗口注意力)和随机全局键选择来聚合全局上下文,并引入全局查询直接编码全局信息。 Result: 在四个RVOS基准上达到SOTA性能,MeViS valu和val数据集上分别提升11.3%和8.3%。 Conclusion: LTCA有效平衡了局部与全局时序建模,提升了RVOS性能,同时控制了计算复杂度。 Abstract: Referring Video Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computation complexity significantly increases with the increase of video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance the locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS valu and val datasets respectively.

[167] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang,Zelin Peng,Changsong Wen,Xiaokang Yang,Wei Shen

Main category: cs.CV

TL;DR: 提出一种语义引导的学习范式,通过跨模态亲和力迁移(CMAT)将2D视觉基础模型的语义知识迁移到3D领域,并设计CAST模型实现精确的3D功能分割。

Details Motivation: 现有方法在处理3D数据时忽视了其稀疏性、噪声和几何模糊等固有挑战,导致学习到的特征缺乏清晰且语义一致的功能边界。 Method: 提出Cross-Modal Affinity Transfer(CMAT)预训练策略,对齐3D编码器与提升的2D语义,并联合优化重建、亲和性和多样性;基于此构建CAST模型,融合多模态提示与CMAT预训练特征生成精准分割图。 Result: 在标准基准上的实验表明,该方法在3D功能分割任务上达到了新的最先进性能。 Conclusion: 所提出的语义引导学习范式有效提升了3D功能分割的语义一致性和精度,为机器人操作和具身AI等应用提供了更强的支持。 Abstract: Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.

[168] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Yushi Huang,Xingtong Ge,Ruihao Gong,Chengtao Lv,Jun Zhang

Main category: cs.CV

TL;DR: 本文提出了LinVideo,一种高效的无需数据的后训练框架,用于在视频扩散模型中将自注意力模块替换为线性注意力,从而降低计算成本并保持生成质量。

Details Motivation: 由于自注意力机制的二次复杂度,视频扩散模型的计算开销随序列长度平方增长;而现有线性注意力替代方法受限于表达能力及训练成本,难以直接应用。 Method: 提出选择性迁移策略,将层替换问题建模为二分类任务,自动逐步替换可替代的注意力层;并设计了一种任意时间分布匹配(ADM)目标函数,以在采样轨迹上对齐样本分布,提升迁移效率与性能恢复。 Result: 实验表明,该方法实现了1.25-2.00倍的加速,同时保持生成质量;4步蒸馏模型进一步实现15.92倍延迟降低且视觉质量下降极小。 Conclusion: LinVideo能有效在不牺牲生成质量的前提下,显著提升视频扩散模型的推理效率,为高效视频生成提供了可行方案。 Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.

[169] Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising

Main category: cs.CV

TL;DR: 本文提出了一个专注于交通场景感知的视觉问答基准DTPQA,用于评估小型视觉语言模型在远近距离下的感知能力,发现当前小型VLM在该任务上显著落后于人类,尤其在左右区分等任务上存在挑战。

Details Motivation: 为了在自动驾驶等安全关键应用中可靠使用视觉语言模型,需要其具备可靠的远距离和近距离感知能力,而现有模型可能‘近视’,难以应对远处关键物体的识别需求。 Method: 构建了一个新的视觉问答基准DTPQA,专用于交通场景中的感知问题,并添加了距离标注;排除需复杂推理的问题,以纯化评估感知能力;在多个最先进的小型视觉语言模型上进行评测,并与人类表现对比。 Result: 实验表明,尽管问题简单,最佳的小型VLM平均准确率仅为约60%,显著低于人类的约85%;同时发现模型在如区分左右等特定感知任务上表现尤为不佳;但人类测试样本量较小,存在统计局限性。 Conclusion: 当前小型视觉语言模型在交通场景的距离感知方面仍有明显不足,特别是在长距离和某些细粒度感知任务上,需进一步改进以满足自动驾驶系统的可靠性要求。 Abstract: Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.

[170] SPICE: Simple and Practical Image Clarification and Enhancement

Alexander Belyaev,Pierre-Alain Fayolle,Michael Cohen

Main category: cs.CV

TL;DR: 提出一种简单高效的方法来增强和改善低光照及雾霾条件下的图像质量。

Details Motivation: 解决低光照和雾霾(包括雾、沙尘和水下)图像的增强与清晰化问题,提升现有方法在极端暗光和雾霾场景中的表现。 Method: 通过构建模拟低光照或雾霾条件的图像滤波器,并推导近似逆向滤波器以最小化增强图像中的失真。 Result: 实验结果表明,该方法在处理极暗图像和雾霾图像方面具有竞争力,通常优于当前最先进的技术。 Conclusion: 该方法因其极简设计(仅需几行MATLAB代码实现)而具有高实用性,同时在多种恶劣成像条件下表现出优异性能。 Abstract: We introduce a simple and efficient method to enhance and clarify images. More specifically, we deal with low light image enhancement and clarification of hazy imagery (hazy/foggy images, images containing sand dust, and underwater images). Our method involves constructing an image filter to simulate low-light or hazy conditions and deriving approximate reverse filters to minimize distortions in the enhanced images. Experimental results show that our approach is highly competitive and often surpasses state-of-the-art techniques in handling extremely dark images and in enhancing hazy images. A key advantage of our approach lies in its simplicity: Our method is implementable with just a few lines of MATLAB code.

[171] Hyperspectral data augmentation with transformer-based diffusion models

Mattia Ferrari,Lorenzo Bruzzone

Main category: cs.CV

TL;DR: 提出一种基于引导扩散模型的数据增强技术,结合轻量级Transformer网络和改进的损失函数,在小样本高光谱森林分类任务中实现了优于其他方法的性能。

Details Motivation: 深度学习在高光谱图像分类中易因标注数据少而过拟合,需有效数据增强方法提升模型泛化能力。 Method: 采用引导扩散模型进行数据增强,设计轻量级Transformer网络提取特征,引入加权损失函数和优化的余弦方差调度器以提升小样本训练效果。 Result: 在PRISMA卫星获取的10类森林分类任务中,该方法在平均和加权平均准确率上均优于其他数据增强方法,且训练过程稳定。 Conclusion: 所提方法能有效缓解小样本下深度学习模型的过拟合问题,提升高光谱图像分类性能,具有良好的实际应用潜力。 Abstract: The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.

[172] UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei,Quande Liu,Zixuan Ye,Qiulin Wang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhu Chen

Main category: cs.CV

TL;DR: UniVideo 是一个统一的多模态视频生成与编辑框架,采用双流架构(MLLM + MMDiT),支持多种任务并实现跨任务泛化。

Details Motivation: 现有的统一多模态模型主要局限于图像领域,缺乏对视频生成与编辑的统一建模。本文旨在将统一建模扩展到视频领域。 Method: 提出 UniVideo 框架,结合多模态大语言模型(MLLM)理解指令和多模态 DiT(MMDiT)生成视频,采用双流设计,在多个视频任务上联合训练。 Result: 实验表明,UniVideo 在文本/图像到视频生成、上下文内视频生成与编辑等任务上达到或超越现有专用模型;支持任务组合(如编辑+风格迁移)和零样本迁移(如绿幕抠像、材质替换);可基于视觉提示生成视频。 Conclusion: UniVideo 成功将统一多模态建模扩展至视频领域,具备良好的任务统一性、视觉一致性及泛化能力,推动了多模态视频内容生成的发展。 Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

[173] Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning

Sofia Kirsanova,Yao-Yi Chiang,Weiwei Duan

Main category: cs.CV

TL;DR: 提出了一种结合LayoutLMv3和GPT-4o的方法,用于自动检测和关联历史地图图例中的符号与描述,通过结构化JSON提示显著提升了性能。

Details Motivation: 历史地图图例的非标准布局和非结构化格式导致自动提取困难,现有方法在符号与描述的匹配上效果有限。 Method: 采用LayoutLMv3进行布局检测,并利用GPT-4o结合上下文学习,通过边界框预测实现图例项与其描述的检测与链接。 Result: 实验表明,使用结构化JSON提示的GPT-4o优于基线方法,F1得分为88%,IoU为85%,且提示设计、示例数量和布局对齐显著影响性能。 Conclusion: 该方法支持可扩展、布局感知的图例解析,提升了多种视觉风格下历史地图的索引与可搜索性。 Abstract: Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.

[174] Robust Source-Free Domain Adaptation for Medical Image Segmentation based on Curriculum Learning

Ziqi Zhang,Yuexiang Li,Yawen Huang,Nanjun He,Tao Xu,Liwei Lin,Yefeng Zheng,Shaoxin Li,Feiyue Huang

Main category: cs.CV

TL;DR: 提出一种基于课程学习的无源域自适应框架(LFC),通过易到难和源到目标的课程设计,提升模型在无源数据情况下的域适应性能,在眼底和息肉分割任务中达到最先进水平。

Details Motivation: 现有无源域自适应方法主要关注目标域伪标签优化,忽视学习过程的设计;而从源到目标的渐进学习过程有助于知识迁移。此外,数据隐私和安全问题使得不依赖源数据的域适应更具现实意义。 Method: 提出名为LFC的课程学习框架,包含两个课程:1)易到难课程,从简单样本开始逐步增加难度以优化模型适应方向;2)源到目标课程,稳定适应过程,实现从源域到目标域的平滑迁移。 Result: 在公开的眼底分割和息肉分割跨域数据集上进行评估,实验结果表明该方法优于现有方法,实现了新的最先进性能。 Conclusion: 所提出的LFC框架通过引入课程学习机制,有效提升了无源域自适应的性能,验证了渐进式学习策略在模型适应中的重要作用。 Abstract: Recent studies have uncovered a new research line, namely source-free domain adaptation, which adapts a model to target domains without using the source data. Such a setting can address the concerns on data privacy and security issues of medical images. However, current source-free domain adaptation frameworks mainly focus on the pseudo label refinement for target data without the consideration of learning procedure. Indeed, a progressive learning process from source to target domain will benefit the knowledge transfer during model adaptation. To this end, we propose a curriculum-based framework, namely learning from curriculum (LFC), for source-free domain adaptation, which consists of easy-to-hard and source-to-target curricula. Concretely, the former curriculum enables the framework to start learning with `easy' samples and gradually tune the optimization direction of model adaption by increasing the sample difficulty. While, the latter can stablize the adaptation process, which ensures smooth transfer of the model from the source domain to the target. We evaluate the proposed source-free domain adaptation approach on the public cross-domain datasets for fundus segmentation and polyp segmentation. The extensive experimental results show that our framework surpasses the existing approaches and achieves a new state-of-the-art.

[175] VideoVerse: How Far is Your T2V Generator from a World Model?

Zeqing Wang,Xinyu Wei,Bairui Li,Zhen Guo,Jinrui Zhang,Hongyang Wei,Keze Wang,Lei Zhang

Main category: cs.CV

TL;DR: 本文提出了VideoVerse,一个全新的综合性基准,用于评估文本到视频(T2V)生成模型在复杂时序因果关系和真实世界知识理解方面的能力,弥补了现有评测在事件级时序因果和世界知识系统性评估上的不足。

Details Motivation: 现有的T2V生成模型评测基准无法有效区分最先进的模型,且缺乏对事件级时序因果关系和世界知识的系统评估,难以支撑‘世界模型’的构建需求。 Method: 构建了一个包含300个精心策划提示、815个事件和793个二元评估问题的VideoVerse基准,涵盖多个领域,并从动态与静态属性角度设计十个评估维度,采用基于现代视觉语言模型的问答式人类偏好对齐评估流程。 Result: 实现了对当前先进开源与闭源T2V模型的系统性评估,揭示了现有模型在理解时序因果和世界知识方面的局限性。 Conclusion: VideoVerse为T2V模型提供了更全面、更具挑战性的评估框架,推动T2V技术向具备世界模型能力的方向发展。 Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models'', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.

[176] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng,Yuji Wang,Qianli Ma,Huayu Chen,Jintao Zhang,Yogesh Balaji,Jianfei Chen,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang

Main category: cs.CV

TL;DR: 本文首次将连续时间一致性蒸馏扩展到大规模图像和视频扩散模型,提出score正则化连续时间一致性模型(rCM),通过引入score蒸馏作为长跳跃正则项,解决了sCM在细节生成上的质量缺陷,在保持高生成多样性的同时显著提升视觉质量。

Details Motivation: 尽管连续时间一致性模型(sCM)在加速学术级扩散模型方面表现优异,但其在大规模文本到图像和视频任务中的应用受限于JVP计算的基础设施挑战和评估基准的不足,且存在细节生成质量低的问题。 Method: 开发了支持并行的FlashAttention-2 JVP内核,实现了对超100亿参数模型和高维视频任务的sCM训练;提出了rCM模型,将score蒸馏作为长跳跃正则项融入sCM,结合了‘模式寻求’的反向散度,以改善生成质量。 Result: 在高达140亿参数的大规模模型(如Cosmos-Predict2、Wan2.1)和5秒视频上验证,rCM在质量指标上达到或超越最先进的DMD2方法,且在多样性方面更具优势,无需GAN调优或大量超参数搜索;蒸馏后的模型仅需1~4步即可生成高保真样本,采样速度提升15~50倍。 Conclusion: rCM是一种实用且理论扎实的框架,有效推动了大规模扩散模型蒸馏的发展,解决了sCM的误差累积和模式覆盖问题,在图像和视频生成中实现了高质量与高多样性的平衡。 Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

[177] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

Andrew Lee,Ian Chuang,Dechen Gao,Kai Fukazawa,Iman Soltani

Main category: cs.CV

TL;DR: 本文提出了一种名为“Gaze on the Prize”的视觉强化学习框架,通过引入可学习的中央凹注意力机制(Gaze),利用自监督信号引导模型关注任务相关特征,显著提升了样本效率(最高达2.4倍),并在ManiSkill3基准的操纵任务中成功解决了基线方法无法学习的任务。

Details Motivation: 视觉强化学习智能体需从高维图像中学习决策,但其中大部分像素与任务无关,导致探索和计算资源浪费,学习效率低且不稳定。因此需要一种机制让智能体聚焦于关键视觉信息。 Method: 受人类视觉中央凹注视启发,提出Gaze on the Prize框架。该方法通过返回值差异作为自监督信号,采用基于回报引导的对比学习:将具有相似表征但不同回报的状态分为正负样本,构建对比三元组,训练注意力机制聚焦于导致成功或失败的关键特征。 Result: 在不修改底层算法或超参数的情况下,该方法在ManiSkill3操纵任务套件上实现了最高2.4倍的样本效率提升,并能解决基线方法无法完成的任务。 Conclusion: 通过引入回报引导的可学习注意力机制,有效提升了视觉强化学习的样本效率和性能,验证了利用返回值差异识别任务相关特征的有效性。 Abstract: Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.

[178] Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction

Noor Islam S. Mohammad

Main category: cs.CV

TL;DR: 提出了一种模块化空间图像处理框架,集成了灰度量化、色彩与亮度增强、图像锐化、双向变换流水线和几何特征提取,实验表明其在多种数据集上具有鲁棒性和实时应用潜力。

Details Motivation: 为了提升图像处理的结构保持性和细节增强能力,同时实现高效、可逆的变换流程,满足实时计算机视觉应用的需求。 Method: 采用分步强度变换进行8级灰度量化,结合RGB和YCrCb空间的直方图均衡化进行色彩增强,通过HSV值通道调整亮度,并使用3×3卷积核进行图像锐化;构建包含非锐化掩模、伽马校正和噪声放大的双向变换流水线;利用Canny边缘检测、Hough直线估计、Harris角点检测和形态学定位进行几何特征提取。 Result: 双向变换流水线在前向和反向过程中的准确率分别为76.10%和74.80%;台球杆对齐角度估计为51.50°;提示隔离与真实图像的相似度达到81.87%;在多个数据集上验证了方法的鲁棒性和确定性性能。 Conclusion: 该模块化框架在保持图像结构的同时有效增强了视觉细节,具备良好的可重复性和实时处理能力,适用于实际计算机视觉任务。 Abstract: This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 * 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50{\deg} for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87\% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.

[179] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Zhenlong Yuan,Xiangyan Qu,Chengxuan Qian,Rui Chen,Jing Tang,Lei Sun,Xiangxiang Chu,Dapeng Zhang,Yiwei Wang,Yujun Cai,Shuo Li

Main category: cs.CV

TL;DR: 提出Video-STAR框架,结合子动作分解与工具增强的强化学习,提升开放词汇动作识别性能。

Details Motivation: 现有MLLM在开放词汇动作识别中因文本先验依赖而难以区分语义相似动作。 Method: 通过上下文子动作分解和工具增强的强化学习,动态调用领域特定工具进行跨模态交互,并设计分层奖励机制优化推理过程。 Result: 在HMDB-51、UCF-101、SSv2、Kinetics-400和Kinetics-600上达到SOTA,显著提升细粒度动作区分与抗跨模态幻觉能力。 Conclusion: Video-STAR实现了从文本中心推理到视觉 grounded 推理的转变,具备强鲁棒性和泛化性。 Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.

[180] The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Onur Keleş,Aslı Özyürek,Gerardo Ortega,Kadir Gökgö,Esam Ghaleb

Main category: cs.CV

TL;DR: 本文提出了“视觉象似性挑战”这一基于视频的基准,用于评估视觉-语言模型在手语中的象似性理解能力,发现当前模型在语音形式预测、透明度和象似性评分上仍显著落后于人类,但表现较好的模型在象似性判断上与人类有一定相关性,表明需引入更多以人为中心的信号和具身学习方法来提升多模态模型的视觉接地能力。

Details Motivation: 由于手语中普遍存在形式与意义之间的象似性(iconicity),为视觉接地提供了天然实验场,而现有视觉-语言模型难以从动态人体动作中恢复这种映射关系,因此需要新的评估基准来诊断模型在此类任务上的表现。 Method: 作者构建了一个名为“视觉象似性挑战”的新型视频基准,采用心理语言学度量方法,设计三项任务:语音手形预测、透明度(从视觉形式推断意义)和等级化象似性评分,并在荷兰手语数据上对13种最先进视觉-语言模型进行零样本和少样本评估,同时与人类基线对比。 Result: 实验结果显示,当前VLMs在语音形式预测任务中能部分恢复手形和位置信息,但性能低于人类;在透明度任务上远逊于人类;仅顶级模型在象似性评分上与人类有中等程度相关性。此外,语音形式预测能力强的模型更倾向于与人类象似性判断一致。 Conclusion: 该研究验证了所提出任务作为诊断工具的有效性,揭示了现有模型在视觉象似性理解上的局限,强调未来应结合以人为中心的信号和具身学习方法,以增强多模态模型的视觉接地能力。 Abstract: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the \textit{Visual Iconicity Challenge}, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess $13$ state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On \textit{phonological form prediction}, VLMs recover some handshape and location detail but remain below human performance; on \textit{transparency}, they are far from human baselines; and only top models correlate moderately with human \textit{iconicity ratings}. Interestingly, \textit{models with stronger phonological form prediction correlate better with human iconicity judgment}, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

[181] InstructX: Towards Unified Visual Editing with MLLM Guidance

Chong Mou,Qichao Sun,Yanze Wu,Pengze Zhang,Xinghui Li,Fulong Ye,Songtao Zhao,Qian He

Main category: cs.CV

TL;DR: 本文提出了InstructX,一个用于图像和视频编辑的统一框架,通过综合研究多模态大语言模型(MLLM)与扩散模型的结合,实现了在多种任务上的指令驱动编辑。研究表明,仅使用图像数据训练即可涌现出无需显式监督的视频编辑能力,并通过引入模态特定特征有效统一了图像和视频编辑任务。

Details Motivation: 现有研究缺乏对MLLM设计选择的深入分析,且MLLM与扩散模型在视频编辑等复杂任务中的融合仍面临挑战。因此,需要一种能够统一处理图像和视频编辑的方法。 Method: 提出InstructX框架,系统研究MLLM与扩散模型的集成;利用图像数据训练以激发模型的零样本视频编辑能力;引入模态特定的MLLM特征实现图像和视频编辑任务的统一建模。 Result: 实验证明该方法在广泛的图像和视频编辑任务中表现出色,达到了最先进的性能,且无需专门的视频训练数据即可实现有效的视频编辑。 Conclusion: InstructX成功实现了图像和视频编辑的统一建模,揭示了跨模态迁移的可能性,为未来基于MLLM的编辑系统提供了有效的设计范式。 Abstract: With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.

[182] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Lu Liu,Chunlei Cai,Shaocheng Shen,Jianfeng Liang,Weimin Ouyang,Tianxiao Ye,Jian Mao,Huiyu Duan,Jiangchao Yao,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai

Main category: cs.CV

TL;DR: 本文提出了一种名为MoA-VR的视频恢复系统,通过模仿人类专家的推理过程,结合三种协同代理(退化识别、路由与恢复、质量评估),有效应对现实世界中复杂的视频退化问题。

Details Motivation: 现有视频恢复方法通常依赖人工选择模型或单一架构,难以泛化到多种退化情况,因此需要一种能自动识别并适应不同退化类型的通用解决方案。 Method: 构建了一个大规模高分辨率视频退化识别基准,采用视觉-语言模型(VLM)进行退化识别;引入基于大语言模型(LLM)的自适应路由器来学习恢复策略;并构建Res-VQ数据集,设计VLM-based视频质量评估模型。 Result: 实验表明,MoA-VR在多种客观指标和感知质量上均优于现有基线方法,能够有效处理多样且复合的退化类型。 Conclusion: MoA-VR展示了多模态智能与模块化推理在通用视频恢复系统中的潜力,为未来自动化、智能化视频恢复提供了新方向。 Abstract: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first \underline{M}ixture-\underline{o}f-\underline{A}gents \underline{V}ideo \underline{R}estoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the \underline{Res}tored \underline{V}ideo \underline{Q}uality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.

[183] To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Jiayun Luo,Wan-Cyuan Fan,Lyuyang Wang,Xiangteng He,Tanzila Rahman,Purang Abolmaesumi,Leonid Sigal

Main category: cs.CV

TL;DR: 本文提出并研究了视觉Transformer中的“注意力汇聚点”(ViT attention sinks),发现这些高范数的视觉令牌包含图像中的高层语义信息,对视觉-语言模型的推理能力至关重要。通过定性和定量分析,并提出无需训练和基于训练的方法来更好地利用这些令牌,显著提升了多种大型视觉语言模型在视觉推理任务上的表现。

Details Motivation: 现有研究多关注大语言模型内部的注意力汇聚问题,而忽略了视觉编码器中哪些视觉令牌对理解与推理最为关键。本文旨在探究视觉Transformer(ViT)中是否存在类似的汇聚点及其在LVLM中的作用。 Method: 识别ViT中具有高范数的视觉令牌作为注意力汇聚点,进行定性与定量分析;提出无需训练(如直接增强这些令牌)和基于训练的方法,以提升LLM对这些关键视觉信息的利用效率。 Result: 实验表明,ViT注意力汇聚点包含丰富的高层语义信息;通过显式利用这些令牌,在多个LVLM和视觉推理任务上实现了性能显著提升。 Conclusion: ViT注意力汇聚点在视觉语言模型中起着关键作用,当前架构普遍忽视了其潜力;有效利用这些令牌可显著增强模型的视觉理解与推理能力。 Abstract: Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

[184] Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression

Nikolaos Stathoulopoulos,Christoforos Kanellakis,George Nikolakopoulos

Main category: cs.CV

TL;DR: 提出了一种基于语义场景图的深度压缩框架,用于高效传输3D点云数据,在保持结构和语义保真度的同时实现高达98%的压缩率。

Details Motivation: 3D点云数据量大且复杂,在带宽受限和连接不稳定的情况下难以高效传输,影响多智能体机器人系统的感知性能。 Method: 将点云分解为语义一致的块,使用FiLM条件下的语义感知编码器将其编码为紧凑的潜在表示,并采用基于折叠的解码器结合潜在特征和图节点属性进行结构准确的重建。 Result: 在SemanticKITTI和nuScenes数据集上达到最先进压缩率,数据大小最多减少98%,同时支持多机器人位姿图优化和地图融合等下游任务,性能接近使用原始LiDAR扫描的结果。 Conclusion: 该方法在显著压缩点云数据的同时保留了关键的结构和语义信息,适用于边缘和云端的分布式机器人系统。 Abstract: Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.

[185] SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

Md Kowsher,Ali O. Polat,Ehsan Mohammady Ardehaly,Mehrdad Salehi,Zia Ghiasi,Prasanth Murali,Chen Chen

Main category: cs.CV

TL;DR: 本文提出了一个理论框架,解释了为何在预训练模型中微调小的随机子网络(切片)足以进行下游任务适配,并提出了SliceFine方法,该方法通过仅更新原始权重中的选定切片,在不引入新参数的情况下实现了高效且紧凑的微调。

Details Motivation: 探索预训练模型中存在通用胜出切片的原因,为参数高效微调提供理论基础。 Method: 提出并证明了预训练网络具有普遍胜出切片属性,源于谱平衡和高任务能量两个现象,并基于此提出了SliceFine方法。 Result: SliceFine在语言和视觉任务上达到了最先进的PEFT方法性能,同时显著提高了训练速度、内存效率和模型紧凑性。 Conclusion: 本研究连接了理论与实践,为大规模模型的参数高效微调提供了有理论依据的新方法。 Abstract: This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.

[186] FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Zhiyuan Zhang,Can Wang,Dongdong Chen,Jing Liao

Main category: cs.CV

TL;DR: 提出FlexTraj框架,实现图像到视频生成中的灵活点轨迹控制,支持多粒度、无需对齐的运动控制。

Details Motivation: 现有方法在轨迹控制中依赖对齐条件且控制灵活性不足,难以支持复杂应用场景。 Method: 提出统一的基于点的运动表示,结合序列拼接方案注入轨迹条件,并采用退火训练策略减少对完全监督和对齐条件的依赖。 Result: 实验表明FlexTraj在多种任务中实现更强的可控性与鲁棒性,支持运动克隆、拖拽生成、动作插值等应用。 Conclusion: FlexTraj实现了高效、灵活且对齐无关的视频生成控制,拓展了图像到视频生成的应用边界。 Abstract: We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.

[187] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Hongxing Li,Dingming Li,Zixuan Wang,Yuchen Yan,Hang Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.CV

TL;DR: 本文提出了一种渐进式构建空间智能的方法,通过构建包含26,610个样本的多模态数据集SpatialLadder-26k,并设计三阶段训练框架,显著提升了视觉语言模型在空间推理任务上的性能,超越现有模型并展现出强泛化能力。

Details Motivation: 现有视觉语言模型在空间推理上表现不佳,主要因为缺乏从感知到理解的层次化基础,无法有效学习复杂的空间关系。 Method: 构建SpatialLadder-26k数据集,涵盖定位、单图、多视图和视频空间推理任务;设计三阶段渐进训练框架:1)通过目标定位建立空间感知;2)通过多维空间任务发展空间理解;3)利用可验证奖励的强化学习增强复杂推理。 Result: 所提出的3B参数模型SpatialLadder在空间推理基准上平均比基线提升23.4%,超越GPT-4o(+20.8%)和Gemini-2.0-Flash(+10.1%),在域外基准上也有7.2%的提升。 Conclusion: 从感知到理解再到推理的渐进式训练是实现鲁棒空间智能的关键。 Abstract: Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

[188] Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Rishubh Parihar,Or Patashnik,Daniil Ostashev,R. Venkatesh Babu,Daniel Cohen-Or,Kuan-Chieh Wang

Main category: cs.CV

TL;DR: Kontinuous Kontext 是一种基于指令的图像编辑模型,通过引入标量编辑强度控制,实现从细微到显著的连续、精细编辑调节。

Details Motivation: 现有基于文本指令的图像编辑方法缺乏对编辑程度的细粒度控制,用户难以精确调节编辑强度。 Method: 扩展先进的图像编辑模型,增加标量编辑强度输入,并通过轻量级投影网络将其与编辑指令映射到模型的调制空间中;使用生成模型合成包含图像、编辑指令和强度值的四元组数据集进行训练。 Result: 实现了在风格化、属性、材质、背景和形状等多种编辑操作中对编辑强度的连续、统一控制,无需针对特定属性进行专门训练。 Conclusion: Kontinuous Kontext 提供了一种通用且直观的方法,显著提升了基于指令的图像编辑在控制精度和灵活性方面的能力。 Abstract: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.

[189] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Xiangyu Zhao,Junming Lin,Tianhao Liang,Yifan Zhou,Wenhao Chai,Yuzhe Gu,Weiyun Wang,Kai Chen,Gen Luo,Wenwei Zhang,Junchi Yan,Hua Yang,Haodong Duan,Xue Yang

Main category: cs.CV

TL;DR: 本文提出了一种新的训练方法AHPO,以提升多模态大语言模型在长链反思性推理任务中的表现,并构建了MM-HELIX基准和MM-HELIX-100K数据集进行评估与训练。

Details Motivation: 现有的多模态大语言模型在复杂现实问题所需的长链反思性推理能力上表现不足,亟需系统评估和改进方法。 Method: 构建了包含1,260个样本的MM-HELIX基准测试集;开发Step-Elicited Response Generation流程生成10万条反思推理轨迹构成MM-HELIX-100K数据集;提出自适应混合策略优化(AHPO)方法,结合离线监督与在线优化进行单阶段训练。 Result: 在MM-HELIX基准上比Qwen2.5-VL-7B基线高出+18.6%准确率,在通用数学与逻辑任务上平均提升+5.7%,展现出良好泛化能力。 Conclusion: 多模态大语言模型的反思性推理能力可通过专门的数据与训练策略有效提升,AHPO为构建更强大MLLM提供了可行路径。 Abstract: While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.

[190] VideoNorms: Benchmarking Cultural Awareness of Video Language Models

Nikhil Reddy Varimalla,Yunfei Xu,Arkadiy Saakyan,Meng Fan Wang,Smaranda Muresan

Main category: cs.CV

TL;DR: 本文提出了VideoNorms,一个包含1000多个视频片段与社会文化规范配对的数据集,用于评估视频大语言模型(VideoLLMs)在中美文化背景下的文化意识。通过人类与AI协作的标注框架构建该数据集,并发现现有模型在识别规范违反、中国文化、非语言证据和正式语境方面表现较差。研究强调了 culturally-grounded 训练的重要性。

Details Motivation: 随着VideoLLMs在全球范围部署,其需理解并扎根于不同文化背景。然而目前缺乏有效评估其文化意识的基准,因此需要构建一个理论驱动、跨文化的评测数据集。 Method: 提出VideoNorms基准,包含来自美国与中国文化的1000多个(视频片段,规范)对,基于言语行为理论标注社会文化规范、规范遵守/违反标签及言语与非言语证据。采用人机协作框架:教师模型通过理论驱动提示生成候选标注,经训练的人类专家进行验证与修正。并对多种开源VideoLLMs进行评测。 Result: 评测结果显示:1) 模型在识别规范违反时表现差于规范遵守;2) 对中国文化的表现弱于美国文化;3) 提取非言语证据的能力弱于言语证据,且难以准确匹配具体规范;4) 在正式、非幽默语境下表现不如人类,甚至更差。 Conclusion: 现有VideoLLMs在跨文化理解尤其是非言语线索和规范识别方面存在明显不足,亟需融入文化根基的训练方法。VideoNorms及其构建框架为填补这一空白提供了基础。 Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violations labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset which highlight several common trends: 1) models performs worse on norm violation than adherence; 2) models perform worse w.r.t Chinese culture compared to the US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal for the norm adhere/violation label and struggle to identify the exact norm corresponding to a speech-act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.

[191] ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation

Guanghao Li,Kerui Ren,Linning Xu,Zhewen Zheng,Changjian Jiang,Xin Gao,Bo Dai,Jian Pu,Mulin Yu,Jiangmiao Pang

Main category: cs.CV

TL;DR: 本文提出ARTDECO,一种结合前馈模型效率与SLAM系统可靠性的统一框架,用于单目图像序列的实时3D重建,通过层次化高斯表示和LoD感知渲染策略,在多个基准上实现了高质量、高效率和强鲁棒性的重建效果。

Details Motivation: 现有方法在计算效率与重建精度之间存在权衡:逐场景优化虽精确但耗时,而前馈模型虽快但精度和鲁棒性不足。需要一种既能实时推理又能保持高保真重建的方法。 Method: ARTDECO利用3D基础模型进行姿态估计和点云预测,并引入高斯解码器将多尺度特征转换为结构化的3D高斯分布;设计了层次化高斯表示与LoD感知渲染策略,以提升渲染质量并减少冗余。 Result: 在八个室内外基准上实验表明,ARTDECO在交互性能上媲美SLAM,鲁棒性接近前馈系统,重建质量接近逐场景优化方法。 Conclusion: ARTDECO为真实世界环境的实时数字化提供了一条兼顾几何准确性与视觉保真度的实用路径。 Abstract: On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.

[192] Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Yunzhe Xu,Yiyuan Pan,Zhe Liu

Main category: cs.CV

TL;DR: 本文提出Memoir,一种基于想象的检索机制,用于解决记忆持久型视觉-语言导航(VLN)中记忆访问效率低和记忆内容单一的问题。该方法通过语言条件世界模型生成未来状态作为查询,结合视点级混合记忆存储环境观察与行为模式,并利用增强导航模型整合检索知识,在多个基准上显著提升了性能,同时大幅减少训练时间和推理内存消耗。

Details Motivation: 现有记忆持久型VLN方法存在两个关键问题:一是缺乏高效的记忆访问机制,通常依赖全量记忆或固定视野检索;二是仅存储环境观察而忽略蕴含决策策略的行为模式。这限制了智能体从经验中持续提升的能力。 Method: 1) 设计语言条件世界模型,通过想象未来导航状态来编码经验并生成检索查询;2) 构建混合视点级记忆,将环境观察与行为历史统一锚定在视点上,实现联合存储与检索;3) 开发经验增强的导航模型,使用专用编码器融合检索到的信息进行决策。 Result: 在10种不同测试场景下的多个记忆持久型VLN基准上进行评估,Memoir相比最优基线在IR2R任务上取得5.4%的SPL提升,训练速度加快8.3倍,推理内存减少74%,且分析显示其想象引导范式仍有较大提升空间(当前73.3% vs 上限93.4%)。 Conclusion: 通过结合想象机制实现对环境与行为记忆的选择性检索,能更有效地支持长期经验积累与导航性能提升,验证了预测性检索在复杂跨模态任务中的潜力。 Abstract: Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.

[193] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

Minghong Cai,Qiulin Wang,Zongli Ye,Wenze Liu,Quande Liu,Weicai Ye,Xintao Wang,Pengfei Wan,Kun Gai,Xiangyu Yue

Main category: cs.CV

TL;DR: 本文提出了任意时空视频补全任务,通过VideoCanvas框架实现了在冻结的视频扩散模型上进行细粒度控制,解决了因果VAE带来的时间模糊问题,并提出了新的基准VideoCanvasBench用于评估该任务。

Details Motivation: 现有的可控视频生成任务分散且缺乏统一范式,同时现代潜在视频扩散模型因因果VAE导致的时间模糊问题难以实现精确的帧级条件控制。 Method: 提出VideoCanvas框架,采用混合条件策略:空间定位通过零填充处理,时间对齐通过Temporal RoPE插值实现,每个条件被赋予潜序列中的连续分数位置,从而解耦时空控制并在无新增参数的情况下实现像素帧感知控制。 Result: 在新提出的VideoCanvasBench基准上实验表明,VideoCanvas显著优于现有条件控制方法,在场景内保真度和跨场景创造性方面均表现优异。 Conclusion: VideoCanvas为任意时空视频补全提供了有效解决方案,统一了多种可控视频生成任务,推动了灵活、统一的视频生成技术发展。 Abstract: We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.

[194] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Andong Deng,Taojiannan Yang,Shoubin Yu,Lincoln Spencer,Mohit Bansal,Chen Chen,Serena Yeung-Levy,Xiaohan Wang

Main category: cs.CV

TL;DR: 本文提出了SciVideoBench,一个用于评估科学领域视频推理能力的严格基准,包含来自25个以上专业学科的1000个多项选择题,旨在挑战现有大型多模态模型在复杂科学视频理解中的高级认知能力。

Details Motivation: 现有的视频基准主要针对一般场景,侧重感知和识别,推理任务较简单,难以有效评估先进多模态模型的高阶认知能力,尤其是在科学领域的复杂视频推理方面存在明显不足。 Method: 构建了一个名为SciVideoBench的基准,包含1000个基于前沿科学实验视频的多项选择题,覆盖25个以上专业学科,并通过半自动系统验证问题质量;每个问题要求模型具备领域专业知识、精确的时空感知和复杂的逻辑推理能力。 Result: 评估结果显示当前最先进的专有和开源大型多模态模型(如Gemini 2.5 Pro和Qwen2.5-VL)在该基准上表现不佳,暴露出其在视频推理能力上的显著缺陷,同时分析了推理复杂性和视觉定位等关键因素的影响。 Conclusion: SciVideoBench为评估和推动多模态模型在科学视频推理方面的能力提供了有力工具,揭示了现有模型的不足,并为未来开发更具科学协作能力的多模态AI指明了方向。 Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.

[195] MultiCOIN: Multi-Modal COntrollable Video INbetweening

Maham Tanveer,Yang Zhou,Simon Niklaus,Ali Mahdavi Amiri,Hao Zhang,Krishna Kumar Singh,Nanxuan Zhao

Main category: cs.CV

TL;DR: 本文提出了一种支持多模态控制的视频中间帧生成框架\modelname{},通过将多种运动控制映射到统一的稀疏点表示,并采用双分支架构分别处理内容与运动信息,结合Diffusion Transformer实现高质量、细粒度的视频插帧。

Details Motivation: 现有视频中间帧生成方法难以处理复杂运动,缺乏对用户意图的灵活支持和中间帧细节的精细控制,导致结果与创作意图不符。 Method: 采用Diffusion Transformer作为基础生成模型,将深度过渡、分层、运动轨迹、文本提示和目标区域等多种控制信号统一映射为稀疏的基于点的表示;设计内容与运动双分支编码结构,在去噪过程中分别处理两类控制信号;提出分阶段训练策略以稳定学习多模态控制。 Result: 实验表明,该方法在定性和定量评估中均优于现有方法,能够生成更动态、可定制且上下文准确的视觉叙事,支持复杂运动下的精细控制。 Conclusion: \modelname{}通过引入多模态控制和双分支DiT架构,显著提升了视频中间帧生成的灵活性、易用性和精确性,为创意视频编辑提供了强有力的工具。 Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce \modelname{}, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

[196] ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving

Zhiyu Zheng,Shaoyu Chen,Haoran Yin,Xinbang Zhang,Jialv Zou,Xinggang Wang,Qian Zhang,Lefei Zhang

Main category: cs.CV

TL;DR: 提出ResAD框架,通过归一化残差轨迹建模解决端到端自动驾驶中轨迹数据的时空不平衡问题,提升模型因果推理能力与短期安全性。

Details Motivation: 现有端到端自动驾驶系统因轨迹数据的时空不平衡而倾向于学习虚假相关性,忽视因果推理,并过度关注远距离不确定预测,影响即时安全。 Method: 提出ResAD框架,将学习目标从直接预测未来轨迹转为预测相对于确定性惯性参考路径的残差偏差,并引入点级归一化来重加权优化目标,缓解长期不确定性带来的优化失衡。 Result: 在NAVSIM基准上,使用仅两步去噪的普通扩散策略即达到88.6的PDMS,取得当时最优性能。 Conclusion: ResAD有效简化了学习任务,增强了模型对因果因素的识别能力,提升了预测安全性与整体性能。 Abstract: End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.

[197] NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Changyao Tian,Hao Li,Gen Luo,Xizhou Zhu,Weijie Su,Hanming Deng,Jinguo Zhu,Jie Shao,Ziran Zhu,Yunpeng Liu,Lewei Lu,Wenhai Wang,Hongsheng Li,Jifeng Dai

Main category: cs.CV

TL;DR: 本文研究了多模态大语言模型(MLLM)的原生端到端训练范式,探索其在数据受限情况下的设计空间和扩展特性,提出了一种高效且性能优越的原生MLLM模型NaViL。

Details Motivation: 现有MLLM采用分离式训练(视觉编码器与语言模型分别预训练后组合),难以探索其多模态扩展性,因此需要研究端到端原生训练的潜力。 Method: 系统研究多种MLLM设计选择,在数据受限条件下寻找最优元架构,并分析视觉编码器与语言模型之间的扩展关系,提出NaViL模型及其低成本训练方案。 Result: NaViL在14个多模态基准上表现出与现有MLLM相当甚至更优的性能,验证了原生训练范式的有效性,并揭示了视觉编码器与LLM之间正相关的扩展关系。 Conclusion: 原生端到端训练是构建MLLM的有效路径,NaViL提供了高性能且低成本的解决方案,为未来MLLM研究提供了重要启示。 Abstract: Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.

[198] D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction

Meixi Song,Xin Lin,Dizhe Zhang,Haodong Li,Xiangtai Li,Bo Du,Lu Qi

Main category: cs.CV

TL;DR: 本文提出了一种名为D²GS的统一框架,用于改善稀疏视角下3D高斯点阵(3DGS)在新视角合成中的性能退化和不稳定性问题。

Details Motivation: 在稀疏视角条件下,现有3DGS方法存在过拟合和欠拟合问题,导致性能下降和训练不稳定,因此需要一种能同时应对这两种失败模式的方法。 Method: 提出了D²GS框架,包含两个核心组件:基于深度与密度引导的Dropout策略,用于抑制近相机区域的过拟合;距离感知保真增强模块,通过针对性监督提升远场区域的重建质量。同时引入一个新的评估指标来量化高斯分布的学习稳定性。 Result: 在多个数据集上的实验表明,该方法在稀疏视角条件下显著提升了视觉质量和模型鲁棒性,且所提稳定性指标为分析3DGS的训练过程提供了新视角。 Conclusion: D²GS有效缓解了稀疏视角下的过拟合与欠拟合问题,增强了3DGS的稳定性和重建质量,为实际应用中视角受限的场景提供了更可靠的解决方案。 Abstract: Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse view conditions. The project page can be found at: https://insta360-research-team.github.io/DDGS-website/.

[199] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Tajamul Ashraf,Umair Nawaz,Abdelrahman M. Shaker,Rao Anwer,Philip Torr,Fahad Shahbaz Khan,Salman Khan

Main category: cs.CV

TL;DR: 提出一种视觉中心的代理微调框架,通过自动合成多模态轨迹和生成逐步偏好对,提升视觉语言模型在复杂推理和工具使用中的性能。

Details Motivation: 现有视觉语言模型在作为控制器进行复杂推理和决策时,受限于高质量多模态轨迹数据的稀缺和人工标注成本高。 Method: 构建大规模多模态任务数据集M-TRACE,并在此基础上开发MATRIX Agent;进一步引入自动生成的11K偏好对Pref-X,通过逐步偏好学习实现精细对齐。 Result: 在Agent-X、GTA和GAIA三个基准上,MATRIX均优于开源和闭源的视觉语言模型,展现出可扩展且高效的多模态工具使用能力。 Conclusion: 该框架有效提升了视觉语言模型在工具使用中的推理能力和鲁棒性,为自动化多模态代理训练提供了可行路径。 Abstract: Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.

[200] ReSplat: Learning Recurrent Gaussian Splats

Haofei Xu,Daniel Barath,Andreas Geiger,Marc Pollefeys

Main category: cs.CV

TL;DR: 提出ReSplat,一种前馈循环高斯点阵模型,通过渲染误差反馈迭代优化3D高斯分布,无需显式计算梯度,在减少高斯数量和提升渲染速度的同时实现最先进的性能。

Details Motivation: 现有前馈高斯点阵模型因依赖单次前向传播而在推理时性能受限,难以适应新数据分布,需更高效的迭代优化机制。 Method: 设计一个循环网络利用渲染误差作为反馈信号来迭代更新3D高斯参数,并引入在16倍下采样空间运行的紧凑重建模型以减少初始高斯数量和计算开销。 Result: 在不同输入视图数、分辨率和数据集上实验表明,该方法显著减少高斯数量、提高渲染速度,并达到最先进的渲染性能。 Conclusion: ReSplat通过循环反馈机制有效提升了高斯点阵模型的表达能力和泛化性,兼顾效率与精度,适用于稀疏输入场景。 Abstract: While feed-forward Gaussian splatting models provide computational efficiency and effectively handle sparse input settings, their performance is fundamentally limited by the reliance on a single forward pass during inference. We propose ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians without explicitly computing gradients. Our key insight is that the Gaussian splatting rendering error serves as a rich feedback signal, guiding the recurrent network to learn effective Gaussian updates. This feedback signal naturally adapts to unseen data distributions at test time, enabling robust generalization. To initialize the recurrent process, we introduce a compact reconstruction model that operates in a $16 \times$ subsampled space, producing $16 \times$ fewer Gaussians than previous per-pixel Gaussian models. This substantially reduces computational overhead and allows for efficient Gaussian updates. Extensive experiments across varying of input views (2, 8, 16), resolutions ($256 \times 256$ to $540 \times 960$), and datasets (DL3DV and RealEstate10K) demonstrate that our method achieves state-of-the-art performance while significantly reducing the number of Gaussians and improving the rendering speed. Our project page is at https://haofeixu.github.io/resplat/.