Table of Contents
cs.CL
[1] Inconsistent Affective Reaction: Sentiment of Perception and Opinion in Urban Environments
Jingfei Huang,Han Tu
Main category: cs.CL
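TL;DR: This study proposes a new method for identifying and explaining sentiment inconsistency between perception and opinion in urban environments. Using street view images and social media text, it analyzes how affective reactions along Beijing's Second Ring Road changed between 2016 and 2022. Perception sentiment became more evenly positive, opinion sentiment fluctuated more sharply, and the two differed significantly; the results can guide urban renewal strategy.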
Details
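Motivation: Existing urban sentiment analysis mostly focuses on a single dimension (such as text or images) and struggles to capture people's overall experience of the urban environment. The rise of social media has introduced differences in affective reaction between perception and opinion, so cross-modal, multidimensional methods are needed to reveal this inconsistency and its drivers.
Method: Build a dataset of about 140,000 street view images and about 980,000 social media posts; combine object detection and natural language processing to construct a reaction index; and use regression analysis, image segmentation, and word frequency analysis, based on land-use distribution, to classify, analyze, and visualize sentiment trends along Beijing's Second Ring Road.
Result: The perception sentiment trend shows a more even distribution of positive sentiment, while opinion sentiment changes are more extreme. The mismatch map reveals significant differences between perception and opinion, and sentiment changes are closely related to factors such as building density and pedestrian activity. The inconsistency maps from before and after the pandemic reveal how external events affect urban affective reactions.
Conclusion: Human perception of and opinion about urban environments differ systematically in sentiment, so relying on a single data source may mislead urban planning decisions. The proposed multimodal sentiment analysis framework helps identify sentiment inconsistency and offers new evaluation tools and intervention directions for urban renewal and environmental management.
Abstract: The ascension of social media platforms has transformed our understanding of urban environments, giving rise to nuanced variations in sentiment reaction embedded within human perception and opinion, and challenging existing multidimensional sentiment analysis approaches in urban studies. This study presents novel methodologies for identifying and elucidating sentiment inconsistency, constructing a dataset encompassing 140,750 Baidu and Tencent street view images to measure perceptions, and 984,024 Weibo social media text posts to measure opinions. A reaction index is developed, integrating object detection and natural language processing techniques to classify sentiment in Beijing Second Ring for 2016 and 2022. Classified sentiment reaction is analysed and visualized using regression analysis, image segmentation, and word frequency based on land-use distribution to discern underlying factors. The perception affective reaction trend map reveals a shift toward more evenly distributed positive sentiment, while the opinion affective reaction trend map shows more extreme changes. Our mismatch map indicates significant disparities between the sentiments of human perception and opinion of urban areas over the years. Changes in sentiment reactions have significant relationships with elements such as dense buildings and pedestrian presence.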
Our inconsistent maps present perception and opinion sentiments before and after the pandemic and offer potential explanations and directions for environmental management, in formulating strategies for urban renewal.
[2] Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Mufei Li,Dongqi Fu,Limei Wang,Si Zhang,Hanqing Zeng,Kaan Sancak,Ruizhong Qiu,Haoyu Wang,Xiaoxin He,Xavier Bresson,Yinglong Xia,Chonglin Sun,Pan Li
Main category: cs.CL
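TL;DR: HaystackCraft is a new long-context benchmark built on the English Wikipedia hyperlink network for evaluating the robustness of large language models under noisy contexts and agentic workflows.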
Details
Motivation: Existing needle-in-a-haystack (NIAH) benchmarks mostly use synthetic data and overlook the noisy contexts that retrieval bias and agentic workflows produce in real-world settings, so more realistic testing methods are needed.
Method: Propose the HaystackCraft benchmark, which uses the Wikipedia hyperlink network to generate multi-hop questions and simulates multiple retrieval strategies (sparse, dense, hybrid, graph-based) and dynamic agentic operations (such as query refinement, reflective reasoning, and early-stop decisions) to build long-context test environments with realistic noise.
Result: Experiments show that stronger dense retrievers can introduce more challenging distractors, while graph-based reranking improves retrieval effectiveness and mitigates harmful distractors. In dynamic agentic tests, even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures caused by self-generated distractors or by failing to stop at the right time.
Conclusion: HaystackCraft exposes the limitations of current long-context models when facing realistic noise and agentic reasoning, underlines the importance of haystack engineering, and provides a valuable testbed for future research.
Abstract: Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors (distraction from heterogeneous biased retrievers and cascading errors in agentic workflows) to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops.
These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
[3] Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
Olia Toporkov,Alan Akbik,Rodrigo Agerri
Main category: cs.CL
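TL;DR: This paper studies how large language models (LLMs) perform on contextual lemmatization and finds that, across 12 languages of varying morphological complexity, in-context generation from a few examples, with no fine-tuning, reaches state-of-the-art results for most languages.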
Details
Motivation: Investigate how effective large language models are at contextual lemmatization when no supervised training data is available for the target domain or language.
Method: Compare encoder-based supervised approaches (fine-tuned out-of-domain) and cross-lingual methods against direct in-context lemma generation with large language models.
Result: Experiments show that, for most languages, large language models without fine-tuning reach the best performance through in-context learning, and outperform traditional methods especially when annotated data is lacking.
Conclusion: Current large language models perform very well on contextual lemmatization and can achieve state-of-the-art results without supervised fine-tuning, reducing the dependence on annotated data.
Abstract: Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: https://github.com/oltoporkov/lemma-dilemma
[4] LASER: An LLM-based ASR Scoring and Evaluation Rubric
Amruta Parulekar,Preethi Jyothi
Main category: cs.CL
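TL;DR: Proposes LASER, an LLM-based ASR evaluation method that uses in-context learning to avoid over-penalizing linguistic details that do not affect meaning; on Indian languages its scores correlate highly with human annotations.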
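For context on the baseline this entry argues against, here is a minimal sketch of Word Error Rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function names `edit_distance` and `wer` are illustrative choices, and whitespace tokenization is an assumption; nothing here comes from the LASER code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER with whitespace tokenization (an assumption of this sketch)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# A contraction changes no meaning, yet costs two of four reference words.
print(wer("she did not go", "she didn't go"))
```

The contraction example scores 0.5 even though the meaning is unchanged, which is the kind of over-penalization a semantics-aware rubric is meant to soften.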
Details
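Motivation: Traditional ASR evaluation metrics such as WER unfairly penalize morphological and syntactic differences that do not affect meaning, so a more semantically aware evaluation approach is needed.
Method: Design LASER, an LLM-based scoring rubric that uses prompts with detailed examples so the LLM can learn in context; in addition, fine-tune smaller LLMs such as Llama 3 on word-pair examples to predict the penalty to apply.
Result: With Gemini 2.5 Pro, Hindi LASER scores reach a 94% correlation with human annotations; the Hindi examples in the prompt transfer effectively to other Indian languages such as Marathi, Kannada, and Malayalam; the fine-tuned Llama 3 predicts the penalty type with close to 89% accuracy.
Conclusion: By exploiting the semantic understanding of LLMs, LASER offers a fairer and more transferable ASR evaluation scheme that significantly outperforms traditional edit-distance-based metrics.
Abstract: Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.
[5] Meaningful Pose-Based Sign Language Evaluation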
Zifan Jiang,Colin Leong,Amit Moryossef,Anne Göhring,Annette Rios,Oliver Cory,Maksym Ivashechkin,Neha Tarigopula,Biao Zhang,Rico Sennrich,Sarah Ebling
Main category: cs.CL
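TL;DR: This paper studies how to evaluate sign language utterances represented as human skeletal poses, comparing keypoint-distance, embedding, and back-translation metrics and validating them through automatic meta-evaluation and a human correlation study.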
Details
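Motivation: Evaluating sign language translation or generation systems more accurately requires an evaluation method that reflects the linguistic properties of sign languages. Existing methods fall short on semantic consistency and motion fluency.
Method: Propose and compare three families of evaluation metrics: keypoint distance-based, embedding-based (similarity in embedding space), and back-translation-based; validate them through automatic meta-evaluation on sign-level retrieval and a human correlation study of text-to-pose translation across multiple sign languages.
Result: Different metrics have different strengths in different scenarios, and no single metric is best; back-translation performs better on semantic consistency, while keypoint distance is better on motion accuracy; a pose-evaluation toolkit is released as open source.
Conclusion: Combining several evaluation metrics gives a more complete assessment of sign language generation systems; the open-source toolkit provides a reproducible and practical evaluation basis for future research.
Abstract: We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.
[6] Populism Meets AI: Advancing Populism Research with LLMs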
Eduardo Ryô Tamaki,Yujin J. Jung,Julia Chatterley,Grant Mitchell,Semir Dzebo,Cristóbal Sandoval,Levente Littvay,Kirk A. Hawkins
Main category: cs.CL
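TL;DR: Proposes a domain-specific chain-of-thought prompting approach that uses large language models to replicate expert coders' scores of populist discourse; the results show classification accuracy on par with human experts across contexts.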
Details
Motivation: Traditional textual analysis methods for measuring the ideational content of populism are costly, time-consuming, and hard to scale, so more efficient and scalable automated methods are needed.
Method: Use a rubric and anchor guided chain-of-thought (CoT) prompting strategy that mirrors human coder training, drawing on annotated data from the Global Populism Database (GPD) to guide the model's reasoning, and test how well multiple proprietary and open-weight models replicate GPD scores.
Result: The approach lets large language models reach accuracy comparable to expert human coding on populist discourse classification, showing a good grasp of the complex, context-sensitive features of populism.
Conclusion: Domain-specific chain-of-thought prompting effectively improves large language model performance in political text analysis and offers a feasible, scalable route to automated measurement of populism and other complex ideologies.
Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.
[7] MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference
Zheyuan Zhang,Lin Ge,Hongjiang Li,Weicheng Zhu,Chuxu Zhang,Yanfang Ye
Main category: cs.CL
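TL;DR: This paper proposes MAPRO, a multi-agent prompt optimization framework that formulates prompt optimization for multi-agent systems as a maximum a posteriori inference problem and solves it with a language-guided max-product belief propagation algorithm, significantly improving multi-agent system performance.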
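To make the MAP framing concrete, the following toy sketch runs max-product inference (in log space, so it is max-sum) over a chain of agents, each choosing one candidate prompt. It is not MAPRO itself: the numeric scores, the chain topology, and the name `map_chain` are assumptions made for illustration, whereas the paper describes a language-guided variant with a topology-aware refinement step.

```python
def map_chain(unary, pairwise):
    """Exact MAP assignment on a chain via max-sum message passing.

    unary[i][a]       -- log-score of candidate prompt a for agent i
    pairwise[i][a][b] -- log-compatibility of candidate a at agent i
                         with candidate b at agent i + 1
    Returns (assignment, score).
    """
    msg = list(unary[0])  # best score of any prefix ending in each candidate
    back = []             # back-pointers for recovering the argmax
    for i in range(1, len(unary)):
        new_msg, ptr = [], []
        for b in range(len(unary[i])):
            scores = [msg[a] + pairwise[i - 1][a][b] for a in range(len(msg))]
            best_a = max(range(len(scores)), key=scores.__getitem__)
            new_msg.append(scores[best_a] + unary[i][b])
            ptr.append(best_a)
        msg = new_msg
        back.append(ptr)
    last = max(range(len(msg)), key=msg.__getitem__)
    assignment = [last]
    for ptr in reversed(back):
        assignment.append(ptr[assignment[-1]])
    assignment.reverse()
    return assignment, msg[last]


# Two agents, two candidate prompts each (all numbers are made up).
unary = [[0.0, 1.0], [2.0, 0.0]]
pairwise = [[[0.0, 0.0], [-3.0, 0.0]]]
print(map_chain(unary, pairwise))
```

In this toy case, picking each agent's individually best prompt (candidate 1, then candidate 0) pays the -3.0 incompatibility penalty, while joint inference returns the assignment `[0, 0]`. That gap is why a joint formulation can matter when prompts interact.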
Details
Motivation: Designing effective multi-agent systems is difficult because of prompt sensitivity and compounded instability; existing automated prompt design methods remain inadequate in multi-agent settings, which lack a systematic optimization method.
Method: Propose the four-stage MAPRO framework, which treats multi-agent prompt optimization as a maximum a posteriori (MAP) inference problem, solves it with a language-guided max-product belief propagation algorithm, and introduces a topology-aware refinement mechanism that combines execution feedback with downstream blame assignment to update agent prompts iteratively.
Result: On benchmarks across a variety of tasks, MAPRO achieves state-of-the-art performance, consistently outperforming manually designed baselines and recent automated methods.
Conclusion: MAPRO provides an effective and principled solution for multi-agent prompt optimization, and its MAP-based formulation also offers general guidance for building more reliable and interpretable multi-agent systems.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce Multi-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of the max-product belief propagation algorithm. To address credit assignment and update the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives.
Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future.
[8] AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Shuqing Luo,Yilin Guan,Pingzhi Li,Hanrui Wang,Tianlong Chen
Main category: cs.CL
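TL;DR: This paper proposes AsyncSpade, an asynchronous framework for efficient test-time scaling (TTS) of large language models that decouples KV-cache filtering from the autoregressive decoding loop, significantly reducing inference latency without sacrificing performance.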
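The core idea of query-aware sparse decoding (score cached keys against a query, then attend only to the top-scoring tokens) can be sketched as below. This is a rough illustration under loud assumptions: averaging the recent queries is a stand-in for the paper's temporal-regressive predictor, whose actual form is not given in this entry, and `approx_query` and `select_topk_kv` are hypothetical names.

```python
import numpy as np


def approx_query(recent_queries):
    """Stand-in predictor (assumption): mean of a short window of recent queries."""
    return np.mean(np.asarray(recent_queries, dtype=float), axis=0)


def select_topk_kv(keys, query, k):
    """Indices (ascending) of the k cached keys with the highest dot-product score."""
    scores = np.asarray(keys, dtype=float) @ query
    k = max(1, min(k, len(scores)))
    top = np.argpartition(scores, -k)[-k:]
    return np.sort(top)


keys = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]])
recent = np.array([[1.0, 0.0], [1.0, 0.2]])
print(select_topk_kv(keys, approx_query(recent), k=2))
```

Because the approximate query depends only on earlier steps, the selection could in principle run while the current step's forward pass is still in flight. That overlap is the property the entry attributes to the asynchronous design.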
Details
Motivation: Existing query-aware sparse decoding methods are limited by the sequential dependence of page-level filtering and by coarse-grained token selection, which makes them inefficient under high concurrency and long chain-of-thought scenarios, with runtime that can even exceed the forward pass itself.
Method: Propose AsyncSpade, built on two core components: (1) a lightweight temporal-regressive module that predicts the next-token query state; (2) an asynchronous, disaggregated architecture that separates KV-cache filtering from the decoding loop so that KV selection overlaps with forward computation.
Result: Validated on an A100 node, AsyncSpade achieves theoretically optimal time-per-output-token (TPOT), reducing it by more than 20% compared with the current best method Quest and by at least 50% compared with full attention, while matching or improving accuracy on benchmarks such as AIME, GPQA, and MATH with Qwen3-8B and Qwen3-32B.
Conclusion: AsyncSpade is the first TTS acceleration framework to eliminate the sequential dependence without sacrificing model performance, significantly improving serving efficiency under high concurrency and long CoT scenarios.
Abstract: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretically optimal time-per-output-token (TPOT).
Specifically, AsyncSpade delivers over 20% reduction in TPOT compared to the SoTA baseline (i.e., Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
[9] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
Rasika Muralidharan,Jaewoon Kwak,Jisun An
Main category: cs.CL
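TL;DR: Proposes an LLM-based multi-agent framework for studying structure, diversity, and interaction dynamics as understood in team science; flat teams tend to outperform hierarchical ones, diversity has a nuanced effect, and agents are overconfident about their team's performance.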
Details
Motivation: Inspired by human team science, explore team dynamics in multi-agent systems and fill the gap in research on how LLM-powered agents collaborate as teams.
Method: Build a multi-agent framework and evaluate team performance on four tasks (CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate), analyzing the effects of different team structures and levels of diversity.
Result: Flat teams perform better than hierarchical teams; the effect of diversity is nuanced; in post-task reflections agents express appreciation for collaboration but also reveal problems such as limited coordination.
Conclusion: Team structure significantly affects multi-agent system performance, with flat structures performing better; future work needs to improve coordination and integration mechanisms among agents.
Abstract: Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.
[10] Can Speech LLMs Think while Listening?
Yi-Jen Shih,Desh Raj,Chunyang Wu,Wei Zhou,SK Bong,Yashesh Gaur,Jay Mahadeokar,Ozlem Kalinli,Mike Seltzer
Main category: cs.CL
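TL;DR: This paper studies how chain-of-thought (CoT) fine-tuning affects the reasoning ability of multi-stream speech LLMs. It proposes starting to reason before the user's question has ended to reduce response latency, introduces an entropy-based "question completeness" metric to guide when reasoning should start, and applies Direct Preference Optimization to further improve the accuracy-latency trade-off.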
Details
Motivation: Speech LLMs perform poorly on complex reasoning tasks, and reasoning adds response latency that hurts the user experience. Inspired by the human behavior of "thinking while listening," the work explores how to reduce latency while preserving accuracy.
Method: Use chain-of-thought (CoT) fine-tuning to improve the reasoning ability of speech LLMs; propose an entropy-based "question completeness" metric that determines when to start reasoning so as to balance accuracy and latency; build preference data with rejection sampling and apply Direct Preference Optimization (DPO) to optimize the model further.
Result: CoT fine-tuning improves speech LLM accuracy on spoken reasoning tasks by 2.4x on average; at equivalent latency, the proposed method raises ARC-Easy accuracy by 4%; adding DPO yields a 70% latency reduction with no loss in accuracy.
Conclusion: Reasoning in text space significantly improves speech LLM performance; dynamically controlling when reasoning starts, together with DPO, effectively improves the accuracy-latency trade-off and moves spoken interaction systems closer to real-time human conversation.
Abstract: Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
[11] When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
Soyeong Jeong,Taehee Jung,Sung Ju Hwang,Joo-Kyung Kim,Dongyeop Kang
Main category: cs.CL
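TL;DR: Proposes the Thought Template Augmented LCLMs (ToTAL) framework, which uses reusable thought templates refined through natural-language feedback to improve multi-hop reasoning and the performance of long-context language models on knowledge-intensive tasks.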
Details
Motivation: Existing long-context language models can process large amounts of text but lack the ability to integrate and connect evidence effectively, which leads to weak multi-hop reasoning.
Method: Introduce thought templates that give structure to the reasoning process, build thought caches from prior problem-solving traces, and iteratively update the templates through natural-language feedback to guide evidence combination and multi-step reasoning.
Result: Across a variety of benchmarks and LCLM families, ToTAL significantly outperforms strong baselines in both retrieval-based and retrieval-free settings, and the optimized templates can be distilled into smaller open-source models.
Conclusion: Through structured thought templates and a continual refinement strategy, ToTAL achieves efficient, transparent, and reusable multi-hop reasoning with broad applicability.
Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).
[12] ParsTranslit: Truly Versatile Tajik-Farsi Transliteration
Rayyan Merchant,Kevin Tang
Main category: cs.CL
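TL;DR: This paper presents a new sequence-to-sequence model for transliteration between Farsi and Tajik, trained on all available datasets, and releases two new datasets, achieving state-of-the-art performance.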
Details
Motivation: Persian is written in different scripts in different regions (Perso-Arabic and Cyrillic), which hinders written communication; existing transliteration models are limited to domain-specific data and lack cross-domain versatility.
Method: Train a sequence-to-sequence model on all available transliteration datasets and build two new multi-domain datasets to improve generalization.
Result: The model achieves a chrF++ of 87.91 and a normalized CER of 0.05 from Farsi to Tajik, and a chrF++ of 92.28 and a normalized CER of 0.04 from Tajik to Farsi, significantly outperforming previous methods.
Conclusion: The proposed model adapts well across domains and establishes a comprehensive, comparable new benchmark for Farsi-Tajik transliteration.
Abstract: As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking "siblings". To overcome this, previously-published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that such models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task's true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://anonymous.4open.science/r/ParsTranslit-FB30/.
[13] OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
Jaeseong Lee,seung-won hwang,Aurick Qiao,Gabriele Oliaro,Ye Wang,Samyam Rajbhandari
Main category: cs.CL
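TL;DR: This paper introduces LongSpecBench, a new benchmark for speculative decoding on long-context inputs, and OWL, a model whose three innovations yield about 5x higher acceptance length than EAGLE3.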
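Since the headline metric here is acceptance length, a toy sketch of greedy draft verification may help: the drafter proposes several tokens and the target model keeps the longest prefix that matches its own greedy choices. The callable `target_next_token` and the function name `accepted_length` are hypothetical, and this is not OWL's algorithm. Real systems check all draft positions in one batched forward pass and may also count a bonus token, so conventions for "acceptance length" vary.

```python
def accepted_length(draft_tokens, target_next_token, prefix):
    """Number of drafted tokens accepted under greedy verification.

    target_next_token(seq) -> token is a hypothetical stand-in for the
    target model's greedy next-token choice given the sequence so far.
    The loop is sequential only for clarity.
    """
    seq = list(prefix)
    accepted = 0
    for tok in draft_tokens:
        if target_next_token(seq) != tok:
            break  # first mismatch ends acceptance
        seq.append(tok)
        accepted += 1
    return accepted


# Toy "model" that always continues counting upward.
def count_up(seq):
    return seq[-1] + 1


print(accepted_length([3, 4, 9, 6], count_up, prefix=[1, 2]))
```

Here the first two drafted tokens match and the third does not, so two tokens are accepted. A drafter that degrades on long prefixes would see this number shrink, which is the failure mode the entry describes.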
Details
Motivation: Existing speculative decoding methods perform well on short-context benchmarks but degrade severely in realistic long-context settings, even slowing generation down; the field lacks an effective long-context benchmark and a model that generalizes.
Method: Propose the OWL model with three innovations: (1) an LSTM-based drafter conditioned only on the last-token state; (2) a special [SPEC] token in the verifier to enrich the representation; (3) a hybrid algorithm combining tree and non-tree decoding. Also build the long-context benchmark LongSpecBench.
Result: On long-context inputs, OWL achieves about 5x higher acceptance length than EAGLE3 and significantly improves generation speed, whereas EAGLE3 reaches only 0.81x speed on long contexts.
Conclusion: Through architectural and algorithmic innovations, OWL effectively addresses the performance degradation of speculative decoding on long contexts, and LongSpecBench provides a public benchmark for future research.
Abstract: Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.
[14] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Md Tahmid Rahman Laskar,Mohammed Saidul Islam,Ridwan Mahbub,Mizanur Rahman,Amran Bhuiyan,Israt Jahan,Mir Tafseer Nayeem,Shafiq Joty,Enamul Hoque,Jimmy Huang
Main category: cs.CL
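TL;DR: Proposes multi-criteria prompting and domain-adaptive transfer learning to improve the ability of 2B-parameter vision-language models to act as judges for chart comprehension tasks.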
Details
Motivation: Tiny models (<=2B parameters) perform poorly as automated judges, which limits their use in resource-constrained settings.
Method: Propose multi-criteria prompting, which merges several evaluation criteria into a single query, and domain-adaptive transfer learning, which fine-tunes a 2B-parameter LVLM on a synthetic judgment dataset to build the ChartJudge model.
Result: Experiments show that multi-criteria prompting exposes robustness gaps in 7B models, while the proposed ChartJudge transfers knowledge well across datasets, improving the evaluation performance of tiny models.
Conclusion: With better prompt design and transfer learning, tiny LVLMs can provide efficient, low-cost evaluation for chart reasoning tasks, with good scalability and practical value.
Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
[15] Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER
Junyi Zhu,Savas Ozkan,Andrea Maracani,Sinan Mutlu,Cho Jung Min,Mete Ozay
Main category: cs.CL
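TL;DR: Proposes a multi-task pre-finetuning framework based on task-primary LoRA modules that improves the adaptability and efficiency of lightweight BERT-like encoders for named entity recognition and text classification.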
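The general pattern of one frozen shared weight plus small per-task low-rank adapters can be sketched as follows. This shows generic LoRA with one adapter per task; how the paper's "task-primary" modules differ is not described in this entry, so the class `SharedLinearWithLoRA`, its parameters, and the initialization scheme are all assumptions for illustration.

```python
import numpy as np


class SharedLinearWithLoRA:
    """One frozen weight matrix W shared by all tasks, plus a low-rank
    update B @ A per task (generic LoRA, not the paper's exact design)."""

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))  # shared backbone weight
        self.scale = alpha / rank
        self.adapters = {
            task: (
                rng.normal(scale=0.01, size=(rank, d_in)),  # A: small random init
                np.zeros((d_out, rank)),                    # B: zero init
            )
            for task in tasks
        }

    def forward(self, x, task):
        A, B = self.adapters[task]
        # Shared path plus the selected task's low-rank correction.
        return self.W @ x + self.scale * (B @ (A @ x))


layer = SharedLinearWithLoRA(d_in=4, d_out=3, rank=2, tasks=["ner", "cls"])
x = np.ones(4)
print(layer.forward(x, "ner"))
```

With B initialized to zero, every task starts from the shared backbone's output, and each adapter adds only `rank * (d_in + d_out)` parameters. That is one reason modular adapters suit memory-constrained deployment.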
Details
Motivation: Deploying NLP models on mobile platforms requires both adaptability across applications and computational efficiency, but conventional multi-task pre-finetuning suffers from conflicting optimization signals.
Method: Use a multi-task pre-finetuning framework based on task-primary LoRA modules, with a shared encoder backbone and modular adapters.
Result: Across 21 downstream tasks, the method improves NER by 0.8% and text classification by 8.8% on average, with performance close to individual pre-finetuning while meeting deployment constraints.
Conclusion: The proposed method effectively resolves the optimization conflicts in multi-task pre-finetuning and suits efficient, general-purpose deployment of NLP models on mobile devices.
Abstract: Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraints. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.
[16] Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19, Constraint, and Monkeypox Datasets
Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao
Main category: cs.CL
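TL;DR: This study uses computational linguistics to analyze pandemic-related online discourse, comparing health misinformation with factual communication in terms of readability, rhetorical markers, and persuasive language.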
Details
Motivation: Identify the linguistic features of health misinformation to support more effective detection tools and better public health communication strategies.
Method: Run a computational linguistic analysis on three corpora (COVID-19 false narratives, general COVID-19 content, and Monkeypox-related posts) and compare their linguistic features.
Result: COVID-19 misinformation has lower readability, a higher frequency of fear-related and persuasive terms, and fewer exclamation marks, while Monkeypox content is more emotive. Misinformation often combines complex rhetoric with emotional cues to enhance its credibility.
Conclusion: Linguistic features can serve as indicators for identifying digital health misinformation and can help improve detection methods and public health crisis communication, but longitudinal and platform-sensitive studies are needed to strengthen robustness.
Abstract: This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.
[17] IASC: Interactive Agentic System for ConLangs
Chihiro Taguchi,Richard Sproat
Main category: cs.CL
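TL;DR: This paper presents a modular LLM-based system that assists in developing constructed languages (conlangs), covering phonology, morphosyntax, lexicon, orthography, and grammar handbook generation, and explores what LLMs understand about linguistic concepts as well as potential applications to translation from high-resource into low-resource languages.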
Details
Motivation: Use LLMs as a tool for building constructed languages, both to support enjoyable creative work and to probe how deeply LLMs understand language structure and linguistic concepts, in particular how their performance differs across language types and rare linguistic phenomena.
Method: Take a modular agentic approach: first establish the target phonology through iterative feedback; then convert English sentences into a markup that reflects the target language's word order and morphosyntactic features; next build a lexicon from the phonological model and the extracted morphemes; then specify an orthography (for example Latin or Cyrillic script); finally generate a brief grammar handbook and translate new sentences.
Result: The system can effectively generate the basic components of a constructed language. Different LLMs handle common linguistic patterns better than rare structures. Attempts to apply the approach to translation from high-resource into low-resource languages have had limited success so far, but suggest that an improved system could bring real gains.
Conclusion: The system gives conlang enthusiasts a practical tool and offers a new way to assess the linguistic knowledge and generalization ability of LLMs; future improvements may raise its value for real translation tasks.
Abstract: We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language: not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones.
An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC
[18] Vocabulary embeddings organize linguistic structure early in language model training
Isabel Papadimitriou,Jacob Prince
Main category: cs.CL
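TL;DR: This work studies how the geometric structure of input vocabulary representations in large language models evolves during training. Correlations with semantic and syntactic features are established quickly, and high-frequency and function words converge faster than low-frequency words.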
Details
Motivation: To investigate how the input vocabulary representations of large language models are structured and how that structure evolves during training, in order to understand how language models gradually organize linguistic structure. Method: Representational similarity analysis is used to correlate the input and output embeddings of two open-source models, Pythia 12B and OLMo 7B, with semantic, syntactic, and frequency-based metrics over the course of training. Result: 1) During training, the vocabulary embedding geometry quickly becomes highly correlated with semantic and syntactic features; 2) embeddings of high-frequency and function words converge to their final vectors faster, while low-frequency and content words retain some influence from the bias in their random initialization. Conclusion: Input embeddings organize around linguistic structure early in training, and word frequency and word class play key roles in the speed and path of convergence, which suggests that the evolution of vocabulary geometry may drive capability gains. Abstract: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.
[19] Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation
Zhangdie Yuan,Han-Chin Shing,Mitch Strong,Chaitanya Shivade
Main category: cs.CL
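TL;DR: This paper proposes clinical code verification as a new way to improve LLM performance on medical coding, uses lightweight interventions to reduce hierarchical near-miss errors, and releases an expert double-annotated benchmark of outpatient cases.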
Details
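Motivation: Prior work shows that large language models perform poorly on clinical coding, and common evaluation metrics overlook predictions that are hierarchically close but incorrect. Existing datasets also suffer from incomplete evidence and inpatient bias. Method: Lightweight interventions such as prompt engineering and small-scale fine-tuning are applied, and clinical code verification is introduced both as a standalone task and as a pipeline component. An expert double-annotated dataset of outpatient notes with ICD-10 codes is built and released. Result: Experiments show that the proposed verification step reduces hierarchical near-miss errors and improves accuracy without relying on computationally expensive search-based methods. The new dataset mitigates the biases of existing data. Conclusion: Clinical code verification is an important step toward more accurate and reliable LLM-based medical coding, and lightweight interventions combined with high-quality data can substantially improve model performance. Abstract: Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.
[20] Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models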
Đorđe Klisura,Joseph Khoury,Ashish Kundu,Ram Krishnan,Anthony Rios
Main category: cs.CL
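TL;DR: This work studies role-conditioned refusals in large language models and compares three designs for strengthening adherence to access control policies.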
Details
Motivation: Large language models often blur role boundaries and produce unrestricted responses, so it is necessary to study how a model can decide whether to answer based on authorization. Method: A new dataset is built by extending the Spider and BIRD datasets with PostgreSQL role-based policies. Three approaches are compared: zero-shot or few-shot prompting, a two-step generator-verifier pipeline, and LoRA fine-tuned models. Result: Explicit verification (the two-step framework) improves refusal precision and reduces false permits, while fine-tuning strikes a better balance between safety and utility. Longer and more complex policies reduce the reliability of all systems. Conclusion: Two-step verification improves safety, and fine-tuning strengthens permission awareness while preserving utility, but complex policies remain a challenge. Abstract: Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM's ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.
[21] Banking Done Right: Redefining Retail Banking with Language-Centric AI
Xin Jie Chua,Jeraelyn Ming Li Tan,Jia Xuan Tan,Soon Chang Poh,Yi Xian Goh,Debbie Hui Tian Choong,Chee Mun Foong,Sze Jue Yang,Chee Seng Chan
Main category: cs.CL
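TL;DR: Ryt AI is an LLM-based agentic framework that lets customers execute core financial transactions through natural language conversation. It is presented as the first regulator-approved system worldwide in which conversational AI is the primary banking interface.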
Details
Motivation: Traditional banking assistants are mostly limited to advisory or support roles, do not support core transactions, and struggle to meet strict security and compliance requirements. Ryt AI aims to deliver trustworthy, compliant natural-language banking through an LLM-native agentic architecture. Method: ILMU, a closed-source LLM developed entirely in-house, serves as the base model, with four agents (Guardrails, Intent, Payment, and FAQ) that each attach a task-specific LoRA adapter; a single dialogue replaces the traditional multi-step interface. The system is deployed inside the bank's own infrastructure and combines deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture to ensure security and compliance. Result: The first regulator-approved conversational AI banking interface worldwide, supporting core financial transactions with high security, auditability, and low operating overhead; users complete banking operations in natural language. Conclusion: Ryt AI shows that, under strict governance, LLM-based natural-language interfaces can reliably support core financial operations, an important step toward "Banking Done Right". Abstract: This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.
[22] OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
Yuzhe Gu,Xiyu Liang,Jiaojiao Zhao,Enmao Diao
Main category: cs.CL
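TL;DR: Proposes OBCache, a cache eviction framework based on Optimal Brain Damage theory that improves KV cache management in large language models by quantifying how removing a token affects attention outputs.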
Details
Motivation: Existing cache eviction methods rank tokens only by heuristic attention weights and do not consider their true impact on attention outputs, which leads to suboptimal eviction. Method: Cache eviction is formulated as a layer-wise structured pruning problem. Closed-form scores are derived from OBD theory to assess the saliency of isolated keys, isolated values, and joint key-value pairs, adding output-aware signals to eviction strategies. Result: Experiments on LLaMA and Qwen models show that replacing existing heuristic scores with OBCache's output-aware scores consistently improves accuracy on long-context tasks. Conclusion: By combining attention weights, value states, and output information, OBCache provides a better token eviction mechanism and improves both KV cache efficiency and model performance in long-sequence inference. Abstract: Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.
[23] Textual Entailment and Token Probability as Bias Evaluation Metrics
Virginia K. Felkner,Allison Lim,Jonathan May
Main category: cs.CL
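TL;DR: This paper examines natural language inference (NLI) as a more realistic alternative for measuring social bias in language models and compares it with traditional token probability (TP) methods. NLI and TP bias evaluations behave substantially differently and correlate only weakly. NLI is more likely to detect "underdebiased" cases but is more brittle and more sensitive to the wording of counterstereotypical sentences. Neither TP nor NLI is the better bias metric in all cases, so the authors recommend combining TP, NLI, and downstream bias evaluations for a comprehensive assessment of language models.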
Details
Motivation: Token probability (TP) bias metrics are widely used but have been criticized for their distance from real-world language model use cases and potential harms. More realistic bias evaluation methods, such as natural language inference (NLI), are therefore worth exploring. Method: NLI and TP bias evaluation methods are compared in terms of their correlation across datasets, their detection ability, and their sensitivity to changes in the wording of counterstereotypical sentences. The experiments include comparisons among multiple NLI metrics and TP metrics, and tests of the ability to identify "underdebiased" cases. Result: NLI and TP bias evaluations differ substantially and correlate very weakly. NLI is more likely to detect "underdebiased" cases, but its results are more brittle and more sensitive to the wording of counterstereotypical sentences, whereas TP is comparatively stable but may miss some bias issues. Conclusion: Neither TP nor NLI is the better bias metric in all cases. For a comprehensive evaluation of social bias in language models, a combination of TP, NLI, and downstream bias evaluations is recommended. Abstract: Measurement of social bias in language models is typically done with token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world language model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.
[24] Stress-Testing Model Specs Reveals Character Differences among Language Models
Jifan Zhang,Henry Sleight,Andi Peng,John Schulman,Esin Durmus
Main category: cs.CL
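TL;DR: Proposes a systematic method for stress-testing the behavioral specifications of large language models, finds many principle contradictions and interpretive ambiguities in current model specs, and shows experimentally that models diverge in value tradeoff scenarios.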
Details
Motivation: Current AI behavior specifications suffer from conflicting principles and insufficient coverage, and a systematic method is needed to identify and fix these problems. Method: A comprehensive taxonomy of values is built and used to generate scenarios that force models to trade off competing value-based principles. Twelve frontier LLMs are evaluated, and behavioral differences are measured with value classification scores. Result: Over 70,000 cases of significant behavioral divergence are identified. Empirically, high divergence predicts problems in model specifications. The analysis reveals direct contradictions, interpretive ambiguities, misalignment cases, and false-positive refusals in current specs, and summarizes how value prioritization patterns differ across models. Conclusion: Current LLM specifications have serious flaws and need more careful design and validation. The proposed method effectively exposes gaps in specifications and provides a basis for improving AI alignment. Abstract: Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study.
Lastly, we also provide value prioritization patterns and differences of these models.
[25] Large Language Models Meet Virtual Cell: A Survey
Krinos Li,Xianglu Xiao,Shenglong Deng,Lucas He,Zijun Zhong,Yuanjie Zou,Zhonghao Zhan,Zheng Hui,Weiye Bao,Guang Yang
Main category: cs.CL
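TL;DR: This survey reviews applications of large language models (LLMs) to virtual cell modeling, proposes a unified taxonomy that divides them into two paradigms, "LLMs as Oracles" and "LLMs as Agents", and discusses three core tasks (cellular representation, perturbation prediction, and gene regulation inference) along with the related challenges.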
Details
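Motivation: Large language models are changing cellular biology by enabling computational systems that represent, predict, and reason about cellular states and behaviors ("virtual cells"), moving life science research toward greater automation. Method: A unified taxonomy divides existing methods into LLMs as Oracles (direct modeling) and LLMs as Agents (orchestrating complex tasks), and the associated models, datasets, and evaluation benchmarks are reviewed systematically. Result: Three core tasks in virtual cell modeling are identified: cellular representation, perturbation prediction, and gene regulation inference. The survey summarizes current mainstream models and data resources and identifies key challenges in scalability, generalizability, and interpretability. Conclusion: LLMs have great potential in virtual cell research; further progress is needed in model interpretability, cross-tissue generalization, and validation in real experiments. Abstract: Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells": computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks (cellular representation, perturbation prediction, and gene regulation inference) and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
[26] Causality Guided Representation Learning for Cross-Style Hate Speech Detection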
Chengshuai Zhao,Shu Wan,Paras Sheth,Karan Patwa,K. Selçuk Candan,Huan Liu
Main category: cs.CL
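TL;DR: Proposes CADET, a causal representation learning framework for detecting implicit hate speech that disentangles latent factors such as context, motivation, target, and style to improve generalization across platforms and styles.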
Details
Motivation: Existing hate speech detection models rely on surface-level linguistic features, struggle with implicit hate speech across different styles and platforms, and are prone to performance drops caused by spurious correlations. Method: A causal graph decomposes the generation of hate speech into key factors: contextual environment, creator motivation, target, and style. The CADET framework disentangles these factors through causal representation learning and intervenes on the style variable through counterfactual reasoning to improve robustness. Result: Experiments show that CADET outperforms existing methods on multiple datasets, with stronger generalization and better detection of implicit hate speech. Conclusion: Causal priors and counterfactual reasoning help separate genuine hate intent from superficial linguistic cues and offer a new direction for generalizable hate speech detection. Abstract: The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language, making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.
[27] MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation
Shuo Yu,Mingyue Cheng,Daoyu Wang,Qi Liu,Zirui Liu,Ze Guo,Xiaoyu Tao
Main category: cs.CL
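TL;DR: This paper proposes MemWeaver, a framework that models a user's textual history with a hierarchical memory made up of behavioral memory and cognitive memory to enable deeply personalized generation.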
Details
Motivation: Existing methods treat user history as a flat list of texts and fail to capture the temporal evolution of user interests and the semantic relationships among them, which results in shallow personalization. Method: MemWeaver builds two complementary memory components that both integrate temporal and semantic information: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. Together they form a unified user representation that large language models can reason over. Result: Experiments on the LaMP benchmark validate the effectiveness of MemWeaver, which significantly outperforms existing methods. Conclusion: Through its hierarchical memory, MemWeaver effectively models the temporal and semantic characteristics of a user's textual history and improves personalized generation. Abstract: The primary form of user-internet engagement is shifting from leveraging implicit feedback signals, such as browsing and clicks, to harnessing the rich explicit feedback provided by textual interactive behaviors. This shift unlocks a rich source of user textual history, presenting a profound opportunity for a deeper form of personalization. However, prevailing approaches offer only a shallow form of personalization, as they treat user history as a flat list of texts for retrieval and fail to model the rich temporal and semantic structures reflecting the dynamic nature of user interests. In this work, we propose MemWeaver, a framework that weaves the user's entire textual history into a hierarchical memory to power deeply personalized generation. The core innovation of our memory lies in its ability to capture both the temporal evolution of interests and the semantic relationships between different activities. To achieve this, MemWeaver builds two complementary memory components that both integrate temporal and semantic information, but at different levels of abstraction: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. This dual-component memory serves as a unified representation of the user, allowing large language models (LLMs) to reason over both concrete behaviors and abstracted traits. Experiments on the Language Model Personalization (LaMP) benchmark validate the efficacy of MemWeaver. Our code is available at https://github.com/fishsure/MemWeaver.
[28] SUBQRAG: sub-question driven dynamic graph rag
Jiaoyang Li,Junhao Ruan,Shengwei Tang,Saihan Chen,Kaiyan Chang,Yuan Ge,Tong Xiao,Jingbo Zhu
Main category: cs.CL
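TL;DR: Proposes SubQRAG, a sub-question-driven graph retrieval-augmented generation framework that improves reasoning depth and accuracy in multi-hop question answering by decomposing complex questions, dynamically expanding the knowledge graph, and building a graph memory.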
Details
Motivation: Existing Graph RAG methods lack deep structured reasoning when handling complex multi-hop question answering, which leads to incomplete evidence and error accumulation. Method: A complex question is decomposed into an ordered chain of verifiable sub-questions, and triples are retrieved from the knowledge graph for each sub-question. When the graph is insufficient, new triples are extracted from source documents in real time to expand it dynamically. All triples used in reasoning are aggregated into a "graph memory" that forms a structured, traceable evidence path. Result: Experiments on three multi-hop QA benchmarks show that SubQRAG achieves consistent and significant improvements on metrics such as Exact Match. Conclusion: Through sub-question-driven dynamic graph expansion and graph memory, SubQRAG strengthens structured reasoning in complex question answering and improves answer accuracy and interpretability. Abstract: Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a knowledge graph (KG) to connect disparate facts across a large document corpus. However, this broad-view approach often lacks the deep structured reasoning needed for complex multi-hop question answering (QA), leading to incomplete evidence and error accumulation. To address these limitations, we propose SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG decomposes a complex question into an ordered chain of verifiable sub-questions. For each sub-question, it retrieves relevant triples from the graph. When the existing graph is insufficient, the system dynamically expands it by extracting new triples from source documents in real time. All triples used in the reasoning process are aggregated into a "graph memory," forming a structured and traceable evidence path for final answer generation. Experiments on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent and significant improvements, especially in Exact Match scores.
[29] Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing
Cunli Mao,Xiaofei Gao,Ran Song,Shizhu He,Shengxiang Gao,Kang Liu,Zhengtao Yu
Main category: cs.CL
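TL;DR: This paper proposes a new multilingual knowledge graph completion (MKGC) framework that exploits shared multilingual knowledge through a Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). The method outperforms the existing state of the art on Hits@1, Hits@3, and Hits@10, and the authors release a new mKG dataset covering 5 languages together with the code.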
Details
Motivation: Existing MKGC research does not make full use of the multilingual capabilities of large language models and overlooks the shareability of cross-lingual knowledge. Method: KL-GMoE is proposed to model shared multilingual knowledge efficiently, combined with IER to improve how that knowledge is used. An mKG dataset covering 5 languages is built for evaluation. Result: Compared with the existing SOTA method, Hits@1 improves by 5.47%, Hits@3 by 3.27%, and Hits@10 by 1.01%. Further analysis reveals properties of knowledge sharing in unseen-language and unbalanced-language settings. Conclusion: The proposed framework makes effective use of shared multilingual knowledge and improves MKGC performance, with good extensibility and practical value. Abstract: Large language model (LLM) based Multilingual Knowledge Graph Completion (MKGC) aims to predict missing facts by leveraging LLMs' multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed an mKG dataset containing 5 languages and conducted comprehensive comparative experiments with the existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with the SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work at https://github.com/gaoxiaofei07/KL-GMoE.
[30] ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs
Fu Chen,Peng Wang,Xiyin Li,Wen Li,Shichi Lei,Dongdong Xiang
Main category: cs.CL
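TL;DR: Proposes ToolExpander, a framework that uses Dynamic Multi-Round Hard Sampling and Self-Exemplifying Thinking to improve the training stability and tool-using ability of small-scale LLMs under GRPO.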
Details
Motivation: To address the problem that models trained with GRPO often fail to produce accurate responses and, especially in small-scale architectures, are prone to training collapse. Method: 1) Dynamic Multi-Round Hard Sampling: replace hard samples with high-quality few-shot demonstrations, combined with exponential learning rate decay; 2) Self-Exemplifying Thinking: remove the KL divergence term, use adjusted clipping coefficients, and encourage the model to generate and analyze few-shot examples on its own. Result: ToolExpander significantly improves the tool-using ability of small-scale LLMs and improves training stability and overall performance. Conclusion: ToolExpander mitigates the limitations of GRPO on small models and offers a workable approach to reinforcement learning in resource-constrained settings. Abstract: Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
[31] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
Tianci Liu,Ran Xu,Tony Yu,Ilgee Hong,Carl Yang,Tuo Zhao,Haoyu Wang
Main category: cs.CL
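TL;DR: This paper introduces OpenRubrics and Contrastive Rubric Generation (CRG), which use structured natural language criteria to make reward modeling more scalable and accurate, outperforming size-matched baselines.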
Details
Motivation: Existing reward models rely on scalar or pairwise judgments that fail to capture the multidimensional nature of human preferences, and existing rubric methods face challenges in reliability and scalability. Method: Contrastive Rubric Generation (CRG) extracts hard rules and implicit principles by contrasting preferred and rejected responses, and rejection sampling enforces preference-label consistency to remove noisy rubrics. The result is OpenRubrics, a large-scale dataset of (prompt, rubric) pairs. Result: On multiple reward-modeling benchmarks, Rubric-RM surpasses size-matched baselines by 6.8%, and the gains transfer to policy models on instruction-following and biomedical tasks. Conclusion: Rubrics provide scalable alignment signals that narrow the gap between human evaluation and automated reward modeling, supporting a principle-driven paradigm for LLM alignment. Abstract: Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR), which uses structured natural language criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
[32] Parallel Test-Time Scaling for Latent Reasoning Models
Runyang You,Yongqi Li,Meng Liu,Wenjie Wang,Liqiang Nie,Wenjie Li
Main category: cs.CL
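TL;DR: This paper proposes parallel test-time scaling (Parallel TTS) for latent reasoning models, introducing uncertainty-inspired stochastic sampling strategies and a trained Latent Reward Model (LatentRM) for effective trajectory selection and aggregation, which opens a new direction for scalable inference in continuous spaces.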
Details
Motivation: Parallel TTS has proven effective for explicit chain-of-thought, but latent reasoning models lack sampling mechanisms in continuous space and probabilistic signals for aggregation, which makes parallel TTS hard to apply. This work addresses how to scale parallel reasoning efficiently in continuous vector spaces. Method: Two uncertainty-inspired stochastic sampling strategies, Monte Carlo Dropout and Additive Gaussian Noise, generate diverse latent reasoning paths in continuous space. A Latent Reward Model (LatentRM), trained with a step-wise contrastive objective, scores latent reasoning paths and guides their selection and aggregation. Result: Experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics. LatentRM enables effective trajectory selection and outperforms single-path reasoning. Conclusion: The work brings parallel TTS to latent reasoning models by addressing the key challenges of sampling and aggregation in continuous space, and demonstrates that scalable inference in continuous spaces is feasible. Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
[33] Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
Nishant Balepur,Atrey Desai,Rachel Rudinger
Main category: cs.CL
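TL;DR: This paper studies how large language models (LLMs) answer multiple-choice questions (MCQA) when given only the choices. Even without the question text, reasoning can still improve accuracy, and the reasoning traces pass faithfulness tests, indicating that models use non-shallow strategies such as inferring the missing question. The authors argue that partial-input success is not always a flaw and that reasoning traces can help separate problematic data from acceptable reasoning.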
Details
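Motivation: There is concern that large language models do not solve multiple-choice questions as intended and may rely on shallow shortcuts rather than understanding the question, especially when they can answer from the choices alone without the question text. Method: The reasoning of LLMs is compared on full inputs and choices-only inputs, their reasoning traces are analyzed, and faithfulness tests are run to judge whether their strategies are shallow. Result: In about half of the cases, test-time reasoning still improves accuracy on choices-only inputs. The length of the reasoning traces has little effect on this success, and the traces pass faithfulness tests, showing that models infer the content of the missing question. Conclusion: Success on partial inputs does not necessarily indicate a flaw in the model. Reasoning traces can be used to distinguish genuine data problems from acceptable reasoning, and negative views of this phenomenon should be reconsidered. Abstract: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often deemed problematic, but reasoning traces could reveal if these strategies are truly shallow in choices-only settings. To study these strategies, reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy on full inputs, and on choices-only inputs about half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning.
[34] ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning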
Murong Yue,Zhiwei Liu,Liangwei Yang,Jianguo Zhang,Zuxin Liu,Haolin Chen,Ziyu Yao,Silvio Savarese,Caiming Xiong,Shelby Heinecke,Huan Wang
Main category: cs.CL
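TL;DR: Proposes a systematic approach that automatically refactors an unstructured collection of tools into an organized tool library, using clustering and a multi-agent framework to consolidate functionality and improve tool retrieval accuracy and reasoning performance.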
Details
Motivation: Tool-augmented LLMs are limited when domain-specific tools are scarce, and as the number of automatically generated tools grows, retrieval becomes harder and tool functions become ambiguous, which limits scalability. Method: Task-specific tools are first generated and clustered by semantics. Within each cluster, a multi-agent framework made up of a code agent and a reviewing agent extracts shared logic and creates aggregated tools while ensuring that functionality stays complete. Result: Experiments show that the approach significantly improves tool retrieval accuracy and reasoning performance, and scales better than baselines as the number of question-specific tools increases. Conclusion: The approach turns a large number of question-specific tools into a small set of powerful aggregated tools, improving tool management and use without losing functionality. Abstract: Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks.
Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.[35] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan,Qiuyang Mang,Jingbang Chen,Hong Wan,Xiaoyuan Liu,Junjielong Xu,Jen-tse Huang,Wenxuan Wang,Wenxiang Jiao,Pinjia He
Main category: cs.CL
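TL;DR: This paper proposes RRM, a process-oriented reward model that addresses the reward hacking and flawed reasoning that arise when LLMs are trained for mathematical reasoning with rewards on the final answer alone.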
Details
Motivation: Traditional outcome-based rewards make it easy for a model to reach the correct answer through an unsound reasoning path (i.e., reward hacking), which overestimates its true reasoning ability. The authors aim to build more reliable and trustworthy mathematical reasoning models.
Method: Proposes the Rubric Reward Model (RRM), a fine-grained, process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics and provides calibrated rewards between 0 and 1 during reinforcement learning, explicitly penalizing logical flaws.
Result: RRM improves performance on four math benchmarks, raising Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reducing the incidence of Miracle Steps by 71%.
Conclusion: Rewarding the reasoning process rather than only the final answer is crucial for building mathematical reasoning models that are both accurate and reliable.
Abstract: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives: solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps: abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
[36] The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud,Ali Khalil,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana
Main category: cs.CL
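TL;DR: This paper studies how improving truthfulness in LLMs can weaken safety alignment. It proposes separating refusal-related features from hallucination features with sparse autoencoders and preserving refusal behavior during fine-tuning through subspace orthogonalization, thereby mitigating the trade-off between truthfulness and safety.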
Details
Motivation: Improving the truthfulness of LLMs often degrades safety alignment, because existing alignment methods may unintentionally suppress factual knowledge; this paper investigates and addresses that side effect.
Method: Sparse autoencoders are used to identify and separate features related to refusal behavior from those related to hallucination; subspace orthogonalization then protects the refusal-related features during fine-tuning so that alignment does not degrade.
Result: Experiments on commonsense reasoning tasks and harmful-request benchmarks (AdvBench, StrongReject) show that the method preserves refusal behavior and task performance while preventing hallucinations from increasing.
Conclusion: Disentangling hallucination and refusal features makes it possible to improve or maintain truthfulness while retaining safety alignment, offering a viable path toward LLMs that are both truthful and safe.
Abstract: Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
[37] Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Shiman Zhao,Shangyuan Li,Wei Chen,Tengjiao Wang,Jiahui Yao,Jiabin Zheng,Kam Fai Wong
Main category: cs.CL
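TL;DR: Proposes an end-to-end multi-label joint learning method that uses instance relation learning and label knowledge propagation to address error propagation in few-shot multi-label intent detection.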
Details
Motivation: Existing methods rely on representation classification and ignore relations between instances, which leads to error propagation and makes multi-label intent detection in low-resource dialogue domains difficult.
Method: Builds an instance relation learning network with label knowledge propagation that uses class information to learn interaction relations between instances, and designs a dual relation-enhanced loss to optimize relation strength at the support-set and query-set levels.
Result: In 1-shot scenarios, the method outperforms strong baselines by an average of 9.54% AUC and 11.19% Macro-F1.
Conclusion: The proposed method mitigates error propagation and, by modeling instance relations and propagating label knowledge, significantly improves few-shot multi-label intent detection.
Abstract: Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
[38] Drift No More? Context Equilibria in Multi-Turn LLM Interactions
Vardhan Dongre,Ryan A. Rossi,Viet Dac Lai,David Seunghyun Yoon,Dilek Hakkani-Tür,Trung Bui
Main category: cs.CL
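TL;DR: This paper studies context drift in multi-turn LLM interactions, proposes a dynamical framework to interpret it, and shows experimentally that drift is a controllable equilibrium phenomenon rather than inevitable degradation.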
Details
Motivation: Real-world applications require sustained multi-turn dialogue, but model outputs gradually diverge from user goals over turns (context drift), which traditional static evaluation metrics capture poorly.
Method: Drift is formalized as the turn-wise KL divergence between the test model and a goal-consistent reference model, and a recurrence model interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions.
Result: Experiments show that context drift settles into stable, noise-limited equilibria rather than worsening without bound, and that simple reminder interventions reduce drift, in line with theoretical predictions.
Conclusion: Multi-turn context drift can be understood as a controllable equilibrium phenomenon, which provides a foundation for studying and mitigating drift in long interactions.
Abstract: Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
[39] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
Shuichiro Haruta,Kazunori Matsumoto,Zhi Li,Yanan Wang,Mori Kurokawa
Main category: cs.CL
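TL;DR: Proposes a rotation-constrained compensation method that reduces the error introduced by structured pruning of LLMs, improving performance while preserving the geometry of the representations.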
Details
Motivation: Structured pruning relies on a small amount of calibration data, which causes output mismatches, and direct fitting tends to overfit and damage the pretrained weights.
Method: Pruned parameters are updated under a rotation constraint that preserves the geometry of output representations (e.g., norms and inner products) and re-aligns the pruned subspace with the original outputs; a variance-aware importance score preferentially keeps input dimensions that affect the principal directions.
Result: Experiments on LLaMA-7B show lower perplexity on WikiText-2 and higher task accuracy on multiple language understanding benchmarks than baseline methods.
Conclusion: The proposed method compensates for pruning error while retaining important components in a geometry-preserving manner, significantly improving the performance of the pruned model.
Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks.
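The abstract does not spell out the update rule, so the following is only a toy illustration of why a rotation constraint preserves geometry while still re-aligning outputs. It assumes 2-D output vectors, where the least-squares rotation has a closed form; in higher dimensions the analogous fit is the orthogonal Procrustes problem, usually solved with an SVD. The function names `fit_rotation_2d` and `rotate` are hypothetical and are not taken from the paper.

```python
import math


def fit_rotation_2d(pruned, original):
    """Return the angle theta minimizing sum ||R(theta) p - o||^2 over paired 2-D vectors."""
    # Maximizing sum o . (R p) gives theta = atan2(sum cross(p, o), sum dot(p, o)).
    dot = sum(px * ox + py * oy for (px, py), (ox, oy) in zip(pruned, original))
    cross = sum(px * oy - py * ox for (px, py), (ox, oy) in zip(pruned, original))
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Apply a counter-clockwise rotation by theta to a 2-D vector."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


# Toy data (assumed): the "original" outputs are the pruned ones rotated by 0.3 rad.
pruned_outputs = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
original_outputs = [rotate(p, 0.3) for p in pruned_outputs]

theta = fit_rotation_2d(pruned_outputs, original_outputs)
aligned = [rotate(p, theta) for p in pruned_outputs]
```

Because the fitted map is a pure rotation, vector norms and pairwise inner products of `aligned` match those of `pruned_outputs`, unlike an unconstrained least-squares fit, which is free to rescale or shear the outputs.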
The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
[40] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Sajib Acharjee Dip,Adrika Zafor,Bikash Kumar Paul,Uddip Acharjee Shuvo,Muhit Islam Emon,Xuan Wang,Liqing Zhang
Main category: cs.CL
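TL;DR: LLM4Cell presents the first unified survey of 58 foundation and agentic models for single-cell research, covering multiple data modalities, systematically categorizing methods and analytical tasks, evaluating models across several domain dimensions, and highlighting open challenges in interpretability, standardization, and trustworthy model development.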
Details
Motivation: Although LLMs and agentic frameworks show promise in single-cell biology, progress remains fragmented across data modalities, architectures, and evaluation standards, with no systematic integration or assessment.
Method: Systematically surveys 58 foundation and agentic models for single-cell research, groups them into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic), and maps them to eight key analytical tasks; drawing on more than 40 public datasets, evaluates the models across 10 domain dimensions.
Result: Provides the first integrated view of language-driven single-cell intelligence linking datasets, models, and evaluation domains; clarifies benchmark suitability, data diversity, and ethical or scalability constraints; and reports evaluations covering biological grounding, multi-omics alignment, fairness, privacy, and explainability.
Conclusion: LLM4Cell offers a systematic framework and an integrated perspective on language models in single-cell biology, and identifies key future challenges in standardization, interpretability, and trustworthy model development.
Abstract: Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
[41] HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Peilin Wu,Mian Zhang,Kun Wan,Wentian Zhao,Kaiyu He,Xinya Du,Zhiyu Chen
Main category: cs.CL
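TL;DR: This paper proposes HiPRAG, a training method that optimizes search behavior in agentic RAG by introducing a fine-grained, knowledge-grounded hierarchical process reward; it reduces over-search and under-search and improves reasoning efficiency and accuracy across multiple models and datasets.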
Details
Motivation: Existing agentic RAG training methods rely on outcome rewards and lack fine-grained control over search decisions, so over-search and under-search are widespread, hurting efficiency and output reliability.
Method: HiPRAG decomposes the agent's reasoning trajectory into parsable steps, introduces a hierarchical reward based on the proportion of optimal search and non-search steps, combines this process reward with outcome and format rewards in reinforcement learning, and evaluates the necessity of each search on the fly.
Result: Experiments with Qwen2.5 and Llama-3.2 models on seven QA benchmarks show average accuracies of 65.4% (3B) and 67.2% (7B), with the over-search rate reduced to 2.3% and a lower under-search rate.
Conclusion: Optimizing the reasoning process itself, not just the final outcome, improves the efficiency and performance of search agents; HiPRAG generalizes well across RL algorithms, model families, sizes, and types, showing the potential of fine-grained control through RL for reasoning optimization.
Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome.
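As a rough sketch of the reward shape described above (a process bonus proportional to the share of optimal steps, added on top of outcome and format rewards), one could write something like the following. The weights, the rule that the bonus is granted only when both the answer and the format are acceptable, and the name `hierarchical_reward` are assumptions made for illustration; the abstract does not specify them.

```python
def hierarchical_reward(answer_correct, format_ok, step_is_optimal,
                        format_weight=0.2, process_bonus=0.4):
    """Toy hierarchical reward: outcome + format, plus a gated process bonus.

    step_is_optimal holds one bool per parsed step, True when the search or
    non-search decision at that step was judged optimal.
    format_weight, process_bonus and the gating rule are illustrative assumptions.
    """
    reward = (1.0 if answer_correct else 0.0) + (format_weight if format_ok else 0.0)
    # Assumed gating: the process bonus applies only to well-formed, correct rollouts.
    if answer_correct and format_ok and step_is_optimal:
        optimal_fraction = sum(step_is_optimal) / len(step_is_optimal)
        reward += process_bonus * optimal_fraction
    return reward


# Three of four steps judged optimal (for example, one redundant search).
example = hierarchical_reward(True, True, [True, True, False, True])
```

Under these assumptions a rollout with one unnecessary search still earns most of the bonus, so the signal distinguishes efficient from wasteful trajectories that reach the same answer.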
Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.
[42] Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
Eric Hanchen Jiang,Guancheng Wan,Sophia Yin,Mengting Li,Yuchen Wu,Xiao Liang,Xinfeng Li,Yizhou Sun,Wei Wang,Kai-Wei Chang,Ying Nian Wu
Main category: cs.CL
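TL;DR: Proposes Guided Topology Diffusion (GTD), a generative framework that optimizes task-adaptive communication topologies for multi-agent systems through iterative construction guided by a lightweight proxy model.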
Details
Motivation: Communication topologies in existing LLM-based multi-agent systems are mostly static or hand-crafted and adapt poorly to different task requirements, leading to high communication overhead or performance bottlenecks.
Method: Inspired by conditional discrete graph diffusion models, topology generation is modeled as an iterative construction process, with a lightweight proxy model predicting multi-objective rewards (e.g., accuracy, utility, cost) to enable real-time, gradient-free optimization.
Result: GTD is validated on multiple benchmarks; experiments show that it generates highly task-adaptive, sparse, and efficient communication topologies and significantly outperforms existing methods in LLM agent collaboration.
Conclusion: Through iterative guided generation, GTD addresses the flexibility and efficiency of communication topology design in multi-agent systems and improves collaboration on complex tasks.
Abstract: The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.
[43] Multilingual Generative Retrieval via Cross-lingual Semantic Compression
Yuxin Huang,Simeng Wu,Ran Song,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu
Main category: cs.CL
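TL;DR: Proposes MGR-CSC, a new multilingual generative retrieval framework that uses cross-lingual semantic compression and a dynamic multi-step constrained decoding strategy to address cross-lingual identifier misalignment and identifier inflation, improving retrieval accuracy and efficiency on multiple benchmarks.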
Details
Motivation: Existing generative information retrieval methods face two challenges in multilingual settings, cross-lingual identifier misalignment and identifier inflation, which limit their performance and efficiency.
Method: MGR-CSC unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compress the identifier space, and uses a dynamic multi-step constrained decoding strategy to improve decoding efficiency at retrieval time.
Result: Retrieval accuracy improves by 6.83% on mMarco100k and 4.77% on mNQ320k, while document identifier length is reduced by 74.51% and 78.2%, respectively.
Conclusion: Through semantic alignment and identifier compression, MGR-CSC significantly improves the performance and efficiency of multilingual generative retrieval and shows promise for practical use.
Abstract: Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios. However, applying these methods to multilingual retrieval still encounters two primary challenges: cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifier length by 74.51% and 78.2%, respectively.
[44] AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
Jingyu Peng,Maolin Wang,Hengyi Cai,Yuchen Li,Kai Zhang,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
Main category: cs.CL
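TL;DR: Proposes AdaSwitch, which dynamically combines on-policy and off-policy generation to maintain both consistency and high-quality supervision when distilling knowledge into small language models.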
Details
Motivation: Existing knowledge distillation methods for small language models suffer either from a training-inference mismatch or from reliance on low-quality student outputs.
Method: On-policy and off-policy generation are combined dynamically at the token level: the student first explores its own predictions, then selectively incorporates teacher guidance based on real-time quality assessment.
Result: Experiments on three datasets with two teacher-student LLM pairs show that AdaSwitch consistently improves accuracy.
Conclusion: AdaSwitch is a practical and effective method that improves knowledge distillation for small language models with acceptable additional overhead.
Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
[45] Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Md. Faiyaz Abdullah Sayeedi,Md. Mahbub Alam,Subhey Sadi Rahman,Md. Adnanul Islam,Jannatul Ferdous Deepti,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda
Main category: cs.CL
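TL;DR: This paper presents Translation Tangles, a unified framework and dataset for evaluating the quality and fairness of open-source LLMs in multilingual translation, covering 24 bidirectional language pairs and introducing a high-quality bias-annotated dataset based on human evaluation.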
Details
Motivation: LLMs perform well in machine translation, but their performance is uneven across language families and specialized domains, and they may amplify biases in their training data, which particularly affects fairness for low-resource languages.
Method: Builds an evaluation framework covering 24 bidirectional language pairs and multiple domains; proposes a hybrid bias detection pipeline combining rule-based heuristics, semantic similarity filtering, and LLM-based validation; and releases a high-quality bias-annotated dataset based on human evaluations of 1,439 translation pairs.
Result: The framework enables systematic evaluation across languages and domains, identifies translation bias in different languages and contexts, and supports the effectiveness of the proposed bias detection method.
Conclusion: Translation Tangles provides a reliable tool for evaluating the quality and fairness of LLMs on translation tasks and supports the development of fairer, more transparent multilingual AI systems.
Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
[46] Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking
Xinliang Frederick Zhang,Anhad Mohananey,Alexandra Chronopoulou,Pinelopi Papalampidi,Somit Gupta,Tsendsuren Munkhdalai,Lu Wang,Shyam Upadhyay
Main category: cs.CL
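TL;DR: This paper introduces TRACE, an analysis tool for systematically studying overthinking by LLMs on simple tasks; it shows that over-verification and over-exploration are the main causes and proposes a utility-based definition of overthinking, offering a new perspective and practical guidance for managing it.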
Details
Motivation: Existing work understands LLM overthinking only superficially and lacks in-depth analysis of its underlying mechanisms; this paper aims to uncover its root causes through fine-grained analysis of the thought process.
Method: The TRACE analyzer decomposes the reasoning process into minimally complete sub-thoughts, infers discourse relations among them, builds fine-grained thought progression graphs, and identifies common thinking patterns for similar queries.
Result: Open-weight models exhibit two major thinking patterns, Explorer and Late Landing, supporting over-verification and over-exploration as the main drivers of overthinking; long-thinking models are also 5 to 20 times slower on simple tasks with no accuracy gain.
Conclusion: A utility-based definition of overthinking grounded in thought structure goes beyond traditional length-based metrics and offers a deeper, more practical framework for understanding and managing LLM reasoning efficiency.
Abstract: Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency, overthinking: models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs' inner workings. To bridge the gap, this study introduces TRACE, a systematic, fine-grained analyzer of LLMs' thought process. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models: Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
[47] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
Main category: cs.CL
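TL;DR: This paper proposes a code-switching speech-to-speech benchmark (CS3-Bench), finds that existing models have significant deficiencies in language alignment, and proposes Chain of Recognition (CoR) and Keyword Highlighting (KH) to improve the language alignment of multimodal LLMs, substantially raising knowledge accuracy and open-ended understanding rate.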
Details
Motivation: Existing multimodal LLMs handle monolingual natural interaction well but show clear deficiencies in language alignment, especially in code-switching scenarios, which motivates a new benchmark and model improvements for cross-lingual understanding and generation.
Method: Proposes the CS3-Bench benchmark, uses Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation, and designs targeted data construction and training strategies.
Result: The language alignment problem is confirmed on 7 mainstream models; the proposed approach raises knowledge accuracy from 25.14% to 46.13% and open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language.
Conclusion: Introducing CoR and KH improves the language alignment of multimodal LLMs in code-switching speech interaction, offering a viable path toward more natural cross-lingual speech interaction systems.
Abstract: The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
[48] Contrastive Weak-to-strong Generalization
Houcheng Jiang,Junfeng Fang,Jiaxin Wu,Tianyu Zhang,Chen Gao,Yong Li,Xiang Wang,Xiangnan He,Yang Deng
Main category: cs.CL
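TL;DR: Proposes Contrastive Weak-to-Strong Generalization (ConG), a framework that uses contrastive decoding to improve weak-to-strong generalization, reducing noise and increasing robustness.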
Details
Motivation: Traditional weak-to-strong generalization methods are limited by noise and bias in weak-model outputs, which hampers practical use; their robustness and generalization need to improve.
Method: Building on the structural equivalence between implicit rewards and Contrastive Decoding (CD), ConG applies contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples.
Result: Experiments across multiple model families show that ConG consistently outperforms traditional methods, with significant gains in performance and robustness.
Conclusion: ConG mitigates the limitations of traditional weak-to-strong methods, advances the paradigm, and offers a new pathway toward AGI.
Abstract: Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.
[49] Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects
Verena Blaschke,Miriam Winkler,Barbara Plank
Main category: cs.CL
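TL;DR: This paper studies transfer from a standard language to non-standard dialects, comparing text, speech, and cascaded systems; speech models perform best on dialect data, while cascaded systems also do relatively well on dialects when they produce normalized transcriptions.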
Details
Motivation: Dialects are primarily spoken, and non-standard spellings interfere with text processing, so it is worth examining how different modalities (text, speech, cascaded) perform in standard-to-dialect transfer.
Method: Compares text models, speech models, and cascaded systems on intent and topic classification for German and multiple German dialects, and releases the first dialectal audio intent classification dataset.
Result: Speech models perform best on dialect data and text models perform best on standard data; cascaded systems lag behind text models on Standard German but perform relatively well on dialect data when they generate normalized output.
Conclusion: Speech input is better suited to dialect data, while the performance of cascaded systems depends on how normalized the transcriptions are.
Abstract: Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.
[50] Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon,Seongtae Hong,Jaehyung Seo,Heuiseok Lim
Main category: cs.CL
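TL;DR: This paper introduces MCBench, a benchmark that evaluates whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike earlier benchmarks that depend on subjective judgment, MCBench offers objective, deterministic, code-verifiable evaluation, with parallel reference code used to check accuracy. Experiments indicate that MCBench effectively evaluates frontier LLMs on instruction adherence, numerical computation, and consistency of intermediate results.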
Details
Motivation: As LLMs saturate many existing benchmarks, it becomes hard to distinguish their performance further, so more challenging benchmarks with objective verification are needed to assess model capability accurately.
Method: MCBench requires LLMs to compute string-matching NLP metrics by strictly following step-by-step instructions; it provides a reproducible, code-verifiable evaluation framework with three evaluation metrics and three benchmark variants, and uses parallel reference code to assess the accuracy of model output objectively.
Result: MCBench systematically tests instruction adherence, numerical computation, and long-range consistency of intermediate results, and discriminates effectively among frontier LLMs.
Conclusion: MCBench is an effective, objective, and verifiable tool for evaluating how well LLMs understand and precisely execute fine-grained instructions, providing a more reliable yardstick for future model development.
Abstract: Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and code-verifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.
[51] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
Jiayu Yang,Yuxuan Fan,Songning Lai,Shengen Wu,Jiaqi Tang,Chun Kang,Zhijiang Guo,Yutao Yue
Main category: cs.CL
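TL;DR: This paper proposes ACE, a multi-hop knowledge editing framework based on neuron-level attribution that identifies and edits query-value (Q-V) pathways, substantially improving knowledge updates for multi-hop factual recall in LLMs.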
Details
Motivation: Existing knowledge editing methods degrade on multi-hop factual recall and struggle in particular with implicit intermediate subjects in reasoning chains, because they overlook how knowledge is dynamically represented at the neuron level.
Method: Causal analysis shows that implicit subjects act as query neurons that progressively activate corresponding value neurons across Transformer layers to accumulate information; building on this, ACE uses neuron-level attribution to locate and edit the critical Q-V pathways for knowledge updates.
Result: ACE outperforms the state of the art by 9.44% on GPT-J and 37.46% on Qwen3-8B; the analysis also reveals more fine-grained activation patterns in Qwen3 and shows that the semantic interpretability of value neurons is governed by query-driven information accumulation.
Conclusion: ACE offers a mechanistically grounded solution for multi-hop knowledge editing based on an understanding of internal reasoning mechanisms, moving knowledge editing toward deeper mechanistic understanding of models.
Abstract: Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi-hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.
[52] Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation
Fanwei Zhua,Jiaxuan He,Xiaoxiao Chen,Zulong Chen,Quan Lu,Chenrui Mei
Main category: cs.CL
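TL;DR: Proposes a unified LLM-based auto-grading framework that gives human-like evaluation of many types of subjective questions, covering content similarity, knowledge-point matching, answer relevance, and simulated human evaluation. Experiments show it outperforms traditional and LLM-based methods on multiple metrics, and it has been deployed in real enterprise exams.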
Details
Motivation: Existing auto-grading methods mostly target one specific type of subjective question, do not generalize to comprehensive exams that mix question types, and cope poorly with the open-ended, diverse nature of subjective answers. Method: A unified LLM-enhanced framework with four modules: basic text matching, comparison of key knowledge points, generation of a pseudo-question from the student answer to assess relevance, and simulated human evaluation that identifies content-related and non-content strengths and weaknesses. Result: Experiments on general-purpose and domain-specific datasets show the framework outperforms traditional and LLM-based baselines on multiple grading metrics, and it has been deployed in the training and certification exams of a large e-commerce enterprise. Conclusion: The framework is general and practical, supports automatic grading of multiple types of subjective questions, and has real-world value and potential for wider adoption. Abstract: Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.
[53] STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models
Kyumin Lee,Minjin Jeon,Sanghwan Jang,Hwanjo Yu
Main category: cs.CL
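TL;DR: Proposes StepER, which uses step-wise supervision and difficulty-aware training to improve the reasoning ability of multi-step retrieval-augmented language models.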
Details
Motivation: Existing knowledge distillation methods ignore that different steps of a multi-step retrieval-augmented framework call for different reasoning abilities, which hurts transfer. Method: Step-wise supervision aligns training with the information and reasoning demands of each stage, and difficulty-aware training progressively optimizes learning by prioritizing suitable steps. Result: On multi-hop QA benchmarks StepER outperforms prior methods, with an 8B model approaching the performance of a 70B teacher model. Conclusion: StepER effectively improves the reasoning ability of multi-step retrieval-augmented language models and is adaptable across such models. Abstract: Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.
[54] Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
Adam Dejl,James Barry,Alessandra Pascale,Javier Carnerero Cano
Main category: cs.CL
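TL;DR: This study examines three automated methods for evaluating the comprehensiveness of LLM-generated text and finds that a simple end-to-end method is remarkably effective but less robust and less interpretable.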
Details
Motivation: In sensitive domains, LLMs may omit key information and cause serious harm, so the comprehensiveness of their outputs needs to be evaluated effectively. Method: Three automated evaluation strategies are studied: a method based on natural language inference (NLI), a method based on question answering (Q&A), and an end-to-end method in which an LLM identifies missing content directly. Result: The end-to-end method performs very well at detecting missing content, but it trails the more complex methods in robustness, interpretability, and result granularity. Conclusion: The end-to-end method is simple and effective, but practical use must weigh its weaker interpretability and stability; comprehensiveness evaluation methods need further improvement. Abstract: Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
[55] Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
Madis Jürviste,Joonatan Jakobson
Main category: cs.CL
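TL;DR: This study explores the use of large language models (LLMs) in research on 17th and 18th century Estonian dictionaries, covering enrichment of dictionary information with modern forms, recognition of text printed in Gothic script, and construction of a cross-source dataset.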
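The overlapping tiling of scanned pages used for the Hupel dictionary section can be illustrated with a minimal sketch. The source does not give tile sizes or overlaps, so the numbers, the square tiles, and the function names `tile_starts` and `tile_boxes` below are hypothetical; the point is only that neighbouring tiles share a strip, so an entry cut by one tile border appears whole in the adjacent tile.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering `length` pixels.

    Consecutive tiles share at least `overlap` pixels; the final tile is
    placed flush with the far edge so that nothing is left uncovered.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile ends exactly at the edge
    return starts


def tile_boxes(width, height, tile, overlap):
    """(left, top, right, bottom) crop boxes for square overlapping tiles."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


# Hypothetical page strip: 1000 x 400 pixels, 400-pixel tiles, 100-pixel overlap.
boxes = tile_boxes(1000, 400, tile=400, overlap=100)
```

Each box could then be cropped from the scan and passed to the recognition model, with a second step merging the per-tile output, as the abstract describes.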
Details
Motivation: To make the digitisation of historical dictionaries more efficient and to overcome the scarcity of resources for minor languages, the authors apply LLMs to the analysis and structured extraction of old printed sources. Method: LLMs such as Claude 3.7 Sonnet are used to infer meanings and generate modern equivalents; vision-enabled LLMs perform zero-shot text recognition on sources printed in Fraktur; for the dictionary section of Hupel's 1780 grammar, scanned images are cut into overlapping tiles, with one LLM performing text recognition and a second merging the results. Result: For the Gutslaff dictionary, the LLM provided accurate modern meanings and equivalents for 81% of headword entries; in the text recognition experiment on Helle's dictionary, 41% of headword entries were structured into error-free JSON; for the Hupel dictionary section, tiling combined with the two-LLM setup is used to prepare digitisation. Conclusion: Even for historical sources in minor languages, LLMs show significant potential to save time and money and to advance digital humanities research. Abstract: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.
[56] A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
Fengji Zhang,Xinyao Niu,Chengyang Ying,Guancheng Lin,Zhongkai Hao,Zhou Fan,Chengen Huang,Jacky Keung,Bei Chen,Junyang Lin
Main category: cs.CL
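TL;DR: This paper proposes A$^2$Search, an annotation-free, end-to-end framework that uses trajectory sampling and evidence verification to automatically detect and handle multi-answer ambiguity in open-domain QA, achieving state-of-the-art performance on multiple benchmarks.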
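A reward that accommodates multiple valid answers can be pictured as an F1 score over sets of answers. The sketch below is an assumption made for illustration, not the paper's definition of AnsF1: it normalizes answers by lowercasing and collapsing whitespace, then computes precision and recall between the predicted set and the reference set. The names `normalize` and `answer_set_f1` are hypothetical.

```python
def normalize(answer):
    """Lowercase and collapse whitespace (a deliberately simple, assumed rule)."""
    return " ".join(answer.lower().split())


def answer_set_f1(predicted, references):
    """F1 between a set of predicted answers and a set of reference answers."""
    pred = {normalize(a) for a in predicted}
    gold = {normalize(a) for a in references}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)   # share of predictions that are valid
    recall = hits / len(gold)      # share of valid answers that were found
    return 2 * precision * recall / (precision + recall)


# Two valid answers exist; the model returns one of them plus a wrong one.
score = answer_set_f1(["Paris", "Lyon"], ["paris", "Marseille"])
```

Under this reading, a model that returns every valid answer and nothing else scores 1.0, while guessing extra answers lowers precision and missing valid ones lowers recall.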
Details
Motivation: Existing QA models usually assume that each question has a single correct answer, so they handle ambiguous questions with several plausible answers poorly; prior approaches also depend on costly manual annotation and do not scale. Method: The A$^2$Search framework combines trajectory sampling with evidence verification to generate alternative answers automatically, and is optimized with reinforcement learning using a purpose-designed AnsF1 reward that supports multi-answer questions. Result: It sets a new state of the art on eight open-domain QA benchmarks; A$^2$Search-7B reaches an average AnsF1@1 of 48.4% on four multi-hop datasets, ahead of the larger ReSearch-32B (46.2%), and analyses confirm its disambiguation ability and cross-benchmark generalization. Conclusion: Embracing ambiguity rather than avoiding it, and modelling multi-answer structure automatically, is key to building more reliable QA systems. Abstract: Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
[57] LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
Jingyuan Wang,Yankai Chen,Zhonghang Li,Chao Huang
Main category: cs.CL
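TL;DR: This paper proposes LightReasoner, which uses the behavioral divergence between a small language model (SLM) and a large language model (LLM) to identify high-value reasoning moments. Its two-stage framework (sampling critical reasoning steps, then fine-tuning) markedly improves LLM reasoning, raising accuracy on math tasks by up to 28.1% while sharply reducing compute.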
Details
Motivation: Supervised fine-tuning (SFT) relies on large labelled datasets and uniform optimization, which is costly and inefficient because only a small share of tokens carry real learning value; the authors therefore seek a more efficient way to strengthen LLM reasoning without ground-truth labels. Method: The LightReasoner framework has two stages: the first contrasts the outputs of an expert model (LLM) and an amateur model (SLM) to identify critical reasoning moments and build supervision examples; the second fine-tunes the LLM on these distilled examples to amplify its strengths. Result: Across seven mathematical reasoning benchmarks, accuracy improves by up to 28.1%, with 90% less training time, 80% fewer sampled problems, and 99% fewer tuned tokens, all without ground-truth labels. Conclusion: By using a weaker model as a teaching signal, LightReasoner improves a stronger model's reasoning efficiently and scalably, offering a resource-friendly approach to LLM training. Abstract: Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
[58] Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
Jialu Du,Guiyang Hou,Yihui Fu,Chen Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu
Main category: cs.CL
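TL;DR: Proposes an adaptive world-model-enhanced reasoning mechanism that builds a dynamic textual world model to address LLMs' conflation of objective reality with subjective beliefs in social reasoning tasks, improving accuracy while lowering computational cost.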
Details
Motivation: LLMs excel at mathematical and code reasoning, but in reasoning tasks that involve multiple participants and social situations they often show cognitive confusion and logical contradictions, mainly because they struggle to separate the objective state of the world from individuals' subjective beliefs. Method: Analysis of DeepSeek-R1's reasoning trajectories identifies the obstacles the model hits in complex social scenarios; the proposed adaptive world-model-enhanced mechanism builds a dynamic textual world model that tracks entity states and temporal sequences, and intervenes with a clear description of the world state when it detects signs of confusion. Result: On three social reasoning benchmarks, accuracy improves significantly (for example, +10% on Hi-ToM) while token usage falls by up to 33.8%, and erroneous reasoning and infinite loops are avoided. Conclusion: By mimicking how humans use implicit world models, the mechanism helps models distinguish external events from internal beliefs and offers a simple yet effective solution for applying LLMs in social settings. Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.
[59] Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge
Watcharapong Timklaypachara,Monrada Chiewhawan,Nopporn Lekuthai,Titipat Achakulvisut
Main category: cs.CL
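TL;DR: Proposes a two-stage method for generating scientific figure captions that combines figure-related textual context with the author's writing style, and performs well in the SciCap Challenge.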
Details
Motivation: Scientific figure captions must be accurate and stylistically consistent; existing methods struggle to achieve content accuracy and stylistic similarity at the same time, so contextual information and the author's personal writing style are combined to improve generation quality. Method: A two-stage pipeline: stage 1 generates candidate captions through context filtering and category-specific prompt optimization (using DSPy's MIPROv2 and SIMBA); stage 2 applies few-shot prompting with examples of the author's style for stylistic refinement. Result: Category-specific prompts improve ROUGE-1 recall by +8.3%; stylistic refinement improves BLEU by 40-48% and ROUGE by 25-27%. Conclusion: Combining contextual understanding with author-specific style adaptation produces captions that are both scientifically accurate and stylistically faithful. Abstract: Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%. Profile-informed stylistic refinement yields 40-48% gains in BLEU scores and 25-27% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.
[60] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks
Cheng Yang,Xuemeng Yang,Licheng Wen,Daocheng Fu,Jianbiao Mei,Rong Wu,Pinlong Cai,Yufan Shen,Nianchen Deng,Botian Shi,Yu Qiao,Haifeng Li
Main category: cs.CL
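TL;DR: MUSE is a new AI agent framework built around a hierarchical memory module. It learns continuously and evolves from experience, markedly improving LLM performance on long-horizon tasks.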
Details
Motivation: Existing LLM agents cannot learn from experience or keep improving on real-world long-horizon tasks, which limits their practical use. Method: The MUSE framework introduces a hierarchical memory module; after each sub-task the agent reflects on its trajectory on its own and converts it into structured experience, accumulating knowledge and updating itself. Result: On the TAC benchmark MUSE reaches new SOTA performance using only the lightweight Gemini-2.5 Flash model, and experiments show continuous learning, self-evolution, and zero-shot transfer across tasks. Conclusion: MUSE establishes a new paradigm of AI agents that learn and evolve continuously, strengthening LLMs' ability to carry out long-horizon tasks in real settings. Abstract: Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Extensive experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.
[61] ChatGPT as a Translation Engine: A Case Study on Japanese-English
Vincent Michael Sutanto,Giovanni Gatti De Giacomo,Toshiaki Nakazawa,Masaru Yamada
Main category: cs.CL
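TL;DR: This study examines ChatGPT for Japanese-English translation, comparing simple and enhanced prompts and evaluating it against commercial translation engines. Document-level translation outperforms sentence-level translation; automatic evaluation prefers ChatGPT-3.5, while human evaluation finds ChatGPT-4 more fluent, indicating a tradeoff between accuracy and fluency. Overall, ChatGPT is competitive.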
Details
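Motivation: To explore ChatGPT's potential for Japanese-English translation, assess how prompting style and model version affect translation quality, and compare it with existing commercial systems. Method: ChatGPT is tested on Japanese-English translation with simple and enhanced prompts, using automatic evaluation together with MQM-based human evaluation, and compared with mainstream translation systems. Result: Document-level translation outperforms sentence-level translation; enhanced prompts bring no clear improvement; automatic evaluation favours ChatGPT-3.5, while human evaluation finds ChatGPT-4 more fluent; overall ChatGPT is on par with mainstream systems. Conclusion: ChatGPT performs well on Japanese-English translation and document context helps quality, but the room for improvement through prompt engineering needs further study, and model versions trade off accuracy against fluency. Abstract: This study investigates ChatGPT for Japanese-English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Performing both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, we were not able to determine if enhanced prompts performed better than simple prompts in our experiments. We also discovered that ChatGPT-3.5 was preferred by automatic evaluation, but a tradeoff exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely-known translation systems.
[62] Climate Knowledge in Large Language Models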
Ivan Kuznetsov,Jacopo Grassi,Dmitrii Pantiukhin,Boris Shapkin,Thomas Jung,Nikolay Koldunov
Main category: cs.CL
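TL;DR: This study evaluates how well large language models (LLMs) recall the 1991-2020 July near-surface air temperature climatology without external retrieval. LLMs capture the basic climate patterns but show marked elevation-related and regional biases, performing worst in mountains and at high latitudes, and they fail to reproduce the spatial pattern of long-term temperature change.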
Details
Motivation: As LLMs are increasingly used in climate-related settings, understanding their internal climate knowledge is essential for reliability and for reducing misinformation risk, yet their parametric recall of climate normals has not been systematically evaluated. Method: A global grid of land-point queries at 1° resolution is built; each query supplies coordinates and a geographic description and asks for the 1991-2020 mean July temperature, and the answers are compared with ERA5 reanalysis to assess the error distribution, the effect of terrain, and the effect of geographic context in the prompt. Result: LLMs capture latitude- and terrain-related climate structure, with root-mean-square errors of 3-6 °C and biases of about ±1 °C; adding geographic context such as country and city reduces errors by 27% on average, with larger models most sensitive to it; performance drops sharply above 1500 m (RMSE of 5-13 °C); and although models reflect the global mean magnitude of warming, they cannot reproduce the spatial pattern of temperature change. Conclusion: LLMs encode some climate knowledge and can support basic climate queries, but systematic errors at high elevations and high latitudes, together with their inability to express regional features of climate change, limit their use for understanding climate dynamics and analysing long-term trends; the proposed method can serve as a reproducible benchmark for LLM climate knowledge. Abstract: Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries over land points at 1° resolution, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 °C and biases of ±1 °C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 °C compared to 2-4 °C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. 
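The two error measures quoted above, root-mean-square error and bias, can be made concrete with a small sketch. It assumes a plain unweighted average over grid points (the study may weight points differently, for example by area), and the function name and temperature values are hypothetical, chosen only for illustration.

```python
import math


def rmse_and_bias(predicted, reference):
    """Unweighted RMSE and mean bias of predictions against reference values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = sum(errors) / len(errors)                       # signed mean error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse, bias


# Hypothetical July temperatures in degrees C at four grid points.
model_answers = [21.0, 9.5, 30.0, -2.0]
reanalysis = [19.0, 12.5, 29.0, 1.0]
rmse, bias = rmse_and_bias(model_answers, reanalysis)
```

Here the errors are 2, -3, 1 and -3 °C, so the bias is -0.75 °C while the RMSE is about 2.4 °C. Positive and negative errors cancel in the bias but not in the RMSE, which is how a model can show a small bias alongside a much larger RMSE.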
This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
[63] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Congming Zheng,Jiachen Zhu,Zhuoying Ou,Yuxiang Chen,Kangning Zhang,Rong Shan,Zeyu Zheng,Mengyue Yang,Jianghao Lin,Yong Yu,Weinan Zhang
Main category: cs.CL
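TL;DR: This paper is a systematic survey of process reward models (PRMs), covering data generation, model construction, and their use in test-time scaling and reinforcement learning, with the aim of advancing fine-grained, robust reasoning alignment.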
Details
Motivation: Although LLMs show strong reasoning ability, conventional alignment still relies mainly on outcome reward models (ORMs) that judge only the final answer and make little use of information in the reasoning process. Method: The survey walks through the full PRM loop, including generating process data, building PRMs, and using them for test-time scaling and reinforcement learning. Result: It summarizes applications of PRMs in math, code, text, multimodal reasoning, robotics, and agents, and reviews emerging benchmarks. Conclusion: It clarifies the design space of PRMs, identifies open challenges, and points to directions for more fine-grained, robust reasoning alignment. Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
[64] FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation
Shule Lu,Lingxiang Wang,Sijia Wen,Ziwei Wang,Hainan Zhang
Main category: cs.CL
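TL;DR: Proposes FedDTRE, a federated adaptive aggregation strategy for dialogue generation based on trustworthiness evaluation. It dynamically regulates the global model's contribution to local updates, improving dialogue model performance and generation quality.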
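One simple way to picture trust scores regulating the global model's contribution is a convex blend of parameters. The rule below is an assumption made for illustration: the summary does not state the formula FedDTRE actually uses, and `blend_parameters` is a hypothetical name. The weight on the global model is taken as its share of the combined trust score.

```python
def blend_parameters(local_params, global_params, local_trust, global_trust):
    """Blend local and global parameters using trust-proportional weights.

    Assumed rule: alpha = global_trust / (global_trust + local_trust),
    result = alpha * global + (1 - alpha) * local, applied element-wise.
    """
    if len(local_params) != len(global_params):
        raise ValueError("parameter lists must have the same length")
    if local_trust < 0 or global_trust < 0:
        raise ValueError("trust scores must be non-negative")
    total = local_trust + global_trust
    alpha = 0.5 if total == 0 else global_trust / total  # equal mix if no signal
    return [alpha * g + (1 - alpha) * l
            for l, g in zip(local_params, global_params)]


# The global model scores three times higher on the evaluation set,
# so it contributes 75% of each blended parameter.
blended = blend_parameters([0.0, 2.0], [4.0, 6.0], local_trust=1.0, global_trust=3.0)
```

Compared with overwriting the local model with the global one, a blend of this kind keeps part of the client-specific update whenever the local model is judged trustworthy.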
Details
Motivation: Traditional centralized or fully local training struggles to balance privacy preservation with personalization, while existing federated learning methods tend to overfit when client data is limited and to forget global information after many training rounds, which leads to poor generalization. Method: FedDTRE uses trustworthiness scores of the global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model's contribution during local updates, rather than replacing the local model with the global one. Result: Experiments show that FedDTRE improves dialogue model performance and the quality of dialogue generation. Conclusion: FedDTRE mitigates overfitting and the forgetting of global information in federated learning, improving the personalization and generalization of dialogue systems. Abstract: With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model's contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.
[65] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta,Peter Rankel,Sarah Wiegreffe,Rachel Rudinger
Main category: cs.CL
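TL;DR: The study finds that human plausibility judgments of answers to commonsense multiple-choice questions are swayed by LLM-generated rationales for or against an answer, showing that LLMs can exert considerable influence on human judgment.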
Details
Motivation: To examine whether LLM-generated rationales affect human plausibility judgments of answers to commonsense questions, and how strongly. Method: 3,000 human and 13,600 LLM plausibility judgments are collected and used to analyse how LLM-generated PRO and CON rationales change the ratings. Result: Human ratings rise after seeing LLM-generated supporting rationales and fall after seeing opposing ones; LLMs show a similar pattern of influence. Conclusion: LLMs can be used to study human cognition, but they may also significantly influence people's beliefs even in domains such as common sense where humans are the experts, which raises practical concerns. Abstract: We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are "experts" (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.
[66] The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models
Sherzod Hakimov,Roland Bernard,Tim Leiber,Karl Osswald,Kristina Richert,Ruilin Yang,Raffaella Bernardi,David Schlangen
Main category: cs.CL
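TL;DR: This is the first systematic evaluation of how LLM reasoning affects multilingual negotiation. Enabling reasoning significantly improves negotiation outcomes but raises computational cost; open-weight models tend to reason internally in English even when negotiating in other languages, whereas commercial models keep their reasoning in the language of the negotiation.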
Details
Motivation: To examine how LLM reasoning affects performance in multilingual negotiation and how models differ in strategic adaptation and in balancing cooperation with competition. Method: In a self-play setup across three dialogue games, commercial and open-weight LLMs are evaluated on negotiation in three languages, analysing the performance-cost tradeoff of reasoning, language consistency, and strategic adaptation. Result: Enabling reasoning significantly improves negotiation outcomes (for example, GPT-5's performance rises by 31.4%) but increases computational cost by nearly 400%; open-weight models switch to English for internal reasoning when negotiating in German or Italian, while commercial models stay consistent. Conclusion: Reasoning strengthens LLM negotiation, especially in handling task complexity and fostering collaboration, but at a high cost; in multilingual settings, open-weight and commercial models differ markedly in the language they reason in, which may affect the explainability of reasoning traces. Abstract: Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.
[67] Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa,Taku Hasegawa,Kyosuke Nishida,Shin'ya Yamaguchi,Tomoya Ohba,Tamao Sakao,Susumu Takeuchi
Main category: cs.CL
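TL;DR: This paper presents a theoretical framework for lossless vocabulary reduction that efficiently converts an auto-regressive language model into one with an arbitrarily small vocabulary without loss of accuracy, and shows how language models with different tokenizations can cooperate through their maximal common vocabulary.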
Details
Motivation: Because language models use different tokenizations and vocabularies, they are hard to combine at the level of next-token distributions, for example in model ensembles; a method is needed that lets different models cooperate effectively. Method: A theoretical framework for lossless vocabulary reduction converts a given auto-regressive language model into one with a smaller vocabulary while preserving the original model's accuracy, and uses the maximal common vocabulary of different models to enable cross-model cooperation. Result: Language models are converted to smaller vocabularies without loss of accuracy, and models with different tokenization schemes are shown to cooperate efficiently through their shared maximal common vocabulary. Conclusion: The framework adds flexibility in generation efficiency and offers a workable path to multi-model ensembling and cross-model cooperation. Abstract: Tokenization, the process of decomposing a given text into a sequence of subwords called tokens, is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
[68] Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing
Haoyang Gui,Thales Bertaglia,Taylor Annabell,Catalina Goanta,Tjomme Dooper,Gerasimos Spanakis
Main category: cs.CL
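TL;DR: This study evaluates how adding legal knowledge to the prompt affects the ability of GPT-5-nano and Gemini-2.5-flash-lite to detect undisclosed sponsored content on Instagram. The models classify well (F1 up to 0.93) but degrade on ambiguous cases. The study proposes a taxonomy of errors in LLM legal reasoning and, together with a student-annotated dataset, provides legally robust technical support for automated regulation.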
Details
Motivation: The rise of influencer marketing has blurred the line between organic and advertising content, making transparency rules hard to enforce; existing detection methods lack legal grounding or are black boxes, so automated detection tools with legal explainability are needed. Method: Using 1,143 Instagram posts, GPT-5-nano and Gemini-2.5-flash-lite are compared under three prompting strategies that control how much legal knowledge the prompt contains; quantitative analysis is combined with a qualitative error taxonomy, and two trained students annotate the LLM-generated explanations. Result: Both models classify sponsored content well (F1 up to 0.93), but performance drops by more than 10 points on ambiguous cases; the error taxonomy shows frequent citation omissions (28.57%) and unclear references (20.71%), with hidden ads showing the highest miscue rate (28.57%); adding regulatory text improves explanation quality but does not consistently improve detection accuracy. Conclusion: By building a taxonomy of legal reasoning errors, providing an annotated dataset, and combining evaluation methods, the study advances legally grounded, transparent detection of influencer marketing content and helps regulators achieve lawful, explainable automated moderation. Abstract: The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque "black boxes". Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. 
Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.
[69] Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
Jasmina Gajcin,Erik Miehling,Rahul Nair,Elizabeth Daly,Radu Marinescu,Seshu Tirupathi
Main category: cs.CL
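TL;DR: This paper proposes a method for extracting concept-based global policies from an LLM-as-a-Judge. It comprises the CLoVE algorithm, which generates local explanations, and the GloVE algorithm, which builds a global policy through clustering, summarization, and verification. The fidelity, robustness, and user comprehensibility of the extracted policies are validated on several content harm detection datasets.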
Details
Motivation: As LLMs are widely used as evaluators (LLM-as-a-Judge), their potential biases and risks need to be understood, which calls for interpretable methods that reveal the global policies behind their decisions. Method: Two algorithms are proposed: CLoVE generates verifiable, concept-based, contrastive local explanations; GloVE aggregates local rules into a global policy through iterative clustering, summarization, and verification. Result: On seven benchmark datasets, the global policies are highly faithful to the LLM's decisions and robust to text perturbations and adversarial attacks, and a user study indicates that users understand and accept the policies reasonably well. Conclusion: The proposed GloVE method effectively extracts the decision logic of an LLM-as-a-Judge and condenses it into high-level, interpretable global policies, which helps improve the transparency and trustworthiness of automated evaluation systems. Abstract: Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.
[70] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
Shuliang Liu,Zhipeng Xu,Zhenghao Liu,Yukun Yan,Minghe Yu,Yu Gu,Chong Chen,Huiyuan Xie,Ge Yu
Main category: cs.CL
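TL;DR: This paper proposes Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the judgment preference bias large language models exhibit when acting as judges.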
Details
Motivation: In automatic evaluation tasks, large language models show a preference bias toward responses they generated themselves, which undermines the reliability of the evaluation, so a method to reduce this bias is needed. Method: Genii integrates multiple LLM-based judgment models into a multi-agent system and simulates an interactive client-server polling mechanism, optimizing each client agent in an unsupervised manner without human annotation. Result: Experiments show that Genii consistently improves performance across different client agents, works well even when weaker models act as the server, and outperforms supervised models that require annotated data. Conclusion: Genii effectively mitigates preference bias in LLM judgment and is an efficient evaluation optimization framework that needs no human annotation. Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All code is available at https://github.com/NEUIR/Genii.
[71] AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents
Md Tahmid Rahman Laskar,Julien Bouvier Tremblay,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
Main category: cs.CL
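TL;DR: This paper presents AI Knowledge Assist, a system that automatically builds a company-specific knowledge base by extracting question-answer pairs from historical customer-agent conversations. Using a fine-tuned lightweight LLM (LLaMA-3.1-8B), it achieves above 90% accuracy on information-seeking questions, closing the cold-start gap in contact centers so that RAG-powered chatbots can be deployed immediately.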
Details
Motivation: The lack of a dedicated, company-specific knowledge base hinders the integration of conversational AI systems in contact centers, so a method for building knowledge bases automatically is needed to support RAG. Method: Question-answer pairs are extracted from historical customer-agent conversations to build a company-specific knowledge base, and a lightweight LLM (LLaMA-3.1-8B) is fine-tuned on internal data and combined with Retrieval Augmented Generation (RAG) to improve performance. Result: An empirical evaluation on 20 companies shows that the system answers information-seeking questions with above 90% accuracy, outperforms larger closed-source LLMs, and closes the cold-start gap in contact centers. Conclusion: AI Knowledge Assist builds company knowledge bases efficiently, substantially improves the usability and accuracy of conversational AI in customer service settings, and offers a practical path to immediate deployment of RAG systems. Abstract: The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.
[72] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations
Elena Khasanova,Harsh Saini,Md Tahmid Rahman Laskar,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
Main category: cs.CL
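TL;DR: This paper proposes DACIP-RC, a continual instruction pre-training method that generates task instructions and responses via reading comprehension, improving the zero-shot generalization of small language models on business conversation tasks.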
Details
Motivation: Large language models are costly at inference time and hard to deploy, while small models lack zero-shot instruction-following ability across domains, and traditional fine-tuning tends to cause catastrophic forgetting. Method: DACIP-RC generates diverse instructions and responses from conversation transcripts via reading comprehension and uses them for continual pre-training, in contrast to conventional next-token prediction. Result: Experiments show that DACIP-RC significantly improves zero-shot generalization on tasks such as meeting summarization, action item generation, and call purpose identification. Conclusion: DACIP-RC effectively strengthens the domain adaptability and instruction-following ability of small language models in business conversation settings, and is the first work to apply instruction pre-training to business conversational data. Abstract: The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.
[73] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Shuzhou Yuan,Ercong Nie,Yinuo Sun,Chenxuan Zhao,William LaCroix,Michael Färber
Main category: cs.CL
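TL;DR: This paper introduces two benchmarks (XSB and MS-XSB) for evaluating exaggerated refusals in large language models, and proposes three lightweight, retraining-free methods (ignore-word instructions, prompt rephrasing, and attention steering) to mitigate them. Experiments show that these methods improve the models' responsiveness to safe requests while preserving safety.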
Details
Motivation: Large language models often misjudge and refuse benign requests they should accept, which hurts usability and user experience, so this over-refusal behavior needs to be evaluated and mitigated systematically. Method: Two benchmarks are built, single-turn XSB and multi-turn MS-XSB; post-hoc explanation methods are used to identify refusal-triggering words; and three lightweight, model-agnostic strategies (ignore-word instructions, prompt rephrasing, and attention steering) are applied at inference time to reduce over-refusal. Result: On four instruction-tuned Llama models, the proposed methods substantially improve compliance on safe prompts while maintaining the existing safety protections, with clear gains in complex multi-turn dialogs. Conclusion: The paper establishes a reproducible framework for diagnosis and mitigation, offering a practical path to safer and more helpful LLM deployments. Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches (ignore-word instructions, prompt rephrasing, and attention steering) at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
[74] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
Jian Xie,Zhendong Chu,Aoxiao Zhong,Kai Zhang,Mingzhe Han,Xin Fang,Jialie Shen,Qingsong Wen
Main category: cs.CL
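TL;DR: This paper presents ARM2, a unified model trained with a reinforcement learning framework augmented with length-aware optimization. It adaptively balances performance and efficiency across multiple reasoning formats, supports multimodal input and code execution, and cuts token usage substantially (by over 70% on average) while preserving performance.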
Details
Motivation: Large reasoning models often "over-think" on simple tasks, and existing mitigation strategies are mostly heuristic and task-specific, lacking a general framework for adaptive reasoning. Method: ARM2 uses a reinforcement learning framework with length-aware optimization and supports natural language reasoning, vision understanding, and the integration of executable code, enabling adaptive reasoning across formats and modalities. Result: ARM2 performs on par with conventional reasoning models trained with GRPO while reducing token usage by more than 70% on average, and it proves effective on multimodal and code reasoning tasks. Conclusion: ARM2 provides a general, adaptive reasoning framework that greatly improves reasoning efficiency while preserving performance, and is suited to multimodal and complex reasoning scenarios. Abstract: Large Reasoning Models (LRMs) often suffer from the "over-thinking" problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal settings. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
[75] MetricalARGS: A Taxonomy for Studying Metrical Poetry with LLMs
Chalamalasetti Kranti,Sowmya Vajjala
Main category: cs.CL
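TL;DR: This paper introduces MetricalARGS, the first taxonomy of NLP tasks for evaluating large language models on metrical poetry. It covers four dimensions (Analysis, Retrieval, Generation, and Support) and uses Telugu to illustrate how it applies in practice.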
Details
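Motivation: Existing NLP research focuses mostly on poem generation and summarization, and overlooks the challenge that the strict syllable and phoneme rules of metrical poetry pose to a language model's reasoning and rule-following ability. A systematic framework is therefore needed to evaluate LLMs on this complex literary form. Method: The MetricalARGS taxonomy is proposed, spanning Analysis, Retrieval, Generation, and Support; related tasks are constructed around the Telugu tradition of metrical poetry, and the design of datasets and evaluation metrics is discussed. Result: The work establishes the first taxonomy of NLP tasks for metrical poetry, clarifies how each task relates to existing NLP tasks, and offers extensible ideas on data and evaluation for future research. Conclusion: MetricalARGS provides a systematic path for evaluating the language understanding and rule-following abilities of LLMs through metrical poetry, revealing both the potential and the limits of current models on complex literary structures. Abstract: Prior NLP work studying poetry has focused primarily on automatic poem generation and summarization. Many languages have well-studied traditions of poetic meter which enforce constraints on a poem in terms of syllable and phoneme patterns. Such advanced literary forms offer opportunities for probing deeper reasoning and language understanding in Large Language Models (LLMs) and their ability to follow strict prerequisites and rules. In this paper, we introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate LLMs on metrical poetry across four dimensions: Analysis, Retrieval, Generation, and Support. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics. Taking Telugu as our example language, we illustrate how the taxonomy can be used in practice. MetricalARGS highlights the broader possibilities for understanding the capabilities and limitations of today's LLMs through the lens of metrical poetry.
[76] Training-Free Group Relative Policy Optimization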
Yuzheng Cai,Siqi Cai,Yuchen Shi,Zihan Xu,Lichao Chen,Yulei Qin,Xiaoyu Tan,Gang Li,Zongyi Li,Haojia Lin,Yong Mao,Ke Li,Xing Sun
Main category: cs.CL
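TL;DR: The paper proposes Training-Free GRPO, a lightweight method that requires no training. It improves the performance of LLM agents on specific tasks by learning experiential knowledge as a token prior, avoiding parameter updates and overfitting.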
Details
Motivation: Existing methods rely on costly parameter tuning (such as SFT followed by reinforcement learning) to improve LLM agents in specialized domains, but they face data scarcity and overfitting, so a lighter and more efficient alternative is needed. Method: The method uses the semantic relative advantage within each group of rollouts rather than numerical rewards, iteratively distills high-quality experiential knowledge as a token prior, and integrates it into LLM API calls to guide behavior without updating model parameters. Result: On mathematical reasoning and web searching tasks, applied to DeepSeek-V3.1-Terminus, the method significantly improves out-of-domain performance with only a few dozen samples and outperforms small LLMs fine-tuned on limited data. Conclusion: Training-Free GRPO is a low-cost, effective method that needs no parameter updates; by introducing an experiential knowledge prior, it improves LLM agents on specialized tasks and shows good practicality and generalization. Abstract: Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior.
Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
[77] Memory Retrieval and Consolidation in Large Language Models through Function Tokens
Shaohua Zhang,Yuan Lin,Hang Li
Main category: cs.CL
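TL;DR: This paper proposes the "function token hypothesis": during inference, function tokens activate predictive features from the context and govern next-token prediction; during pre-training, predicting the content tokens that follow function tokens increases the number of learned features, which realizes memory consolidation.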
Details
Motivation: Despite the remarkable success of large language models, their mechanisms of memory retrieval and consolidation remain unclear, and this paper aims to uncover them. Method: The function token hypothesis is proposed, and bipartite graph analysis and case studies are used to verify how function tokens activate predictive features; the paper also analyzes how pre-training loss comes mainly from predicting the content tokens that follow function tokens. Result: Experiments show that a small number of function tokens activate the majority of features, and that pre-training loss is dominated by the prediction of content tokens following function tokens, supporting the central role of function tokens in memory retrieval and consolidation. Conclusion: Function tokens play a key role in memory retrieval and knowledge consolidation in large language models, and the hypothesis offers a new perspective on how LLMs work. Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.
[78] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
XuHao Hu,Peng Wang,Xiaoya Lu,Dongrui Liu,Xuanjing Huang,Jing Shao
Main category: cs.CL
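TL;DR: This study extends research on emergent misalignment, showing that LLMs exposed to malicious fine-tuning or to interaction with biased users can become broadly dishonest and deceptive in high-stakes scenarios. Even a small share of misaligned data (for example 1%) or of biased users (for example 10%) is enough to noticeably reduce honesty.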
Details
Motivation: The study asks whether large language models that have been wrongly fine-tuned in narrow domains also develop misaligned behaviors such as dishonesty and deception across broader high-stakes scenarios, in particular on questions of ethics and truthfulness beyond domains such as security and medicine. Method: Open-source LLMs are fine-tuned on misaligned completions across several domains and evaluated for dishonest behavior; the study also examines the effect of mixing a small amount of misaligned data into a downstream task, and simulates a human-AI interaction environment that includes biased users to observe changes in model behavior. Result: Misalignment-fine-tuned LLMs show broadly dishonest behavior across scenarios; introducing just 1% misaligned data into a downstream task reduces honest behavior by more than 20%; and in simulated human-AI interaction, only 10% biased users is enough for the model to unintentionally become more dishonest. Conclusion: Emergent misalignment is not limited to safety-related behavior but extends to dishonesty and deception more broadly, and the risk is significant in both mixed fine-tuning and realistic human-AI interaction, which calls for vigilance against bias and malicious influence in training data. Abstract: Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.
[79] SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets
Qiang Yang,Xiuying Chen,Changsheng Ma,Rui Yin,Xin Gao,Xiangliang Zhang
Main category: cs.CL
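TL;DR: This paper introduces SenWave, a fine-grained multi-language sentiment analysis dataset for COVID-19 tweets, containing labeled and unlabeled tweets in five languages. Sentiment classification is performed by fine-tuning pre-trained models, revealing how sentiment evolved across languages, countries, and topics.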
Details
Motivation: Existing COVID-19 datasets lack fine-grained or appropriate sentiment labels, which limits the precision and depth of public sentiment analysis. Method: A multi-language tweet dataset with ten sentiment categories is built (annotated in English and Arabic, translated into Spanish, French, and Italian), and pre-trained transformer-based language models are fine-tuned for fine-grained sentiment classification. Result: The dataset contains 50,000 labeled tweets and over 105 million unlabeled tweets; the experiments reveal sentiment trends across languages, countries, and topics, and confirm the dataset's compatibility with ChatGPT. Conclusion: SenWave provides a high-quality resource for fine-grained sentiment analysis of complex events, helping the NLP community pursue deeper emotional understanding and research innovation around major public events such as the pandemic. Abstract: The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible at https://github.com/gitdevqiang/SenWave.
We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.
[80] Investigating Counterclaims in Causality Extraction from Text
Tim Hagen,Niklas Deckers,Felix Wolter,Harrisen Scells,Martin Potthast
Main category: cs.CL
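TL;DR: This paper presents a new dataset that, for the first time, brings concausality (claims that refute a causal relationship) into causality extraction research, addressing the fact that existing datasets cover only supportive causal claims.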
Details
Motivation: Existing causality extraction datasets ignore concausal claims, so models may wrongly classify concausal relationships as supportive (procausal) ones, which harms the accuracy of causal reasoning. Method: Based on an extensive literature review, the paper argues for the importance of concausality in causal reasoning on incomplete knowledge, defines a rigorous annotation guideline, and extends the Causal News Corpus with concausal statements. Result: The new dataset reaches substantial inter-annotator agreement (Cohen's κ = 0.74), and models trained on it can effectively distinguish supporting from refuting causal relationships. Conclusion: A dataset that integrates concausal claims improves how well models understand and classify causal relationships, and supports more comprehensive causal reasoning research. Abstract: Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.
[81] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang,Haozhu Wang,Eric Michael Smith,Sid Wang,Amr Sharaf,Mahesh Pasupuleti,Benjamin Van Durme,Daniel Khashabi,Jason Weston,Hongyuan Zhan
Main category: cs.CL
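TL;DR: The paper proposes WaltzRL, a multi-agent reinforcement learning framework that improves the safety alignment of large models through collaboration, significantly reducing both unsafe responses and overrefusals.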
Details
Motivation: Existing safety alignment methods often reject potentially risky content outright, which leads to overrefusal; they lack fine-grained handling of sensitive but harmless requests and struggle to balance helpfulness and harmlessness. Method: The WaltzRL framework consists of a conversation agent and a feedback agent that are trained jointly through a Dynamic Improvement Reward (DIR); the feedback agent engages adaptively, only when needed, and improves responses rather than rejecting them. Result: Experiments on five datasets show that, compared with baselines, WaltzRL significantly reduces unsafe responses (for example, from 39.0% to 4.6% on WildJailbreak) and overrefusals (for example, from 45.3% to 9.9% on OR-Bench), while preserving low latency and general capabilities. Conclusion: Through multi-agent collaboration and adaptive feedback, WaltzRL improves safety without sacrificing helpfulness, advancing the Pareto front between helpfulness and harmlessness. Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines.
By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
[82] Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
Jannek Ulm,Kevin Du,Vésteinn Snæbjarnarson
Main category: cs.CL
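TL;DR: This study examines the use of contrastive decoding to generate synthetic corpora for language model training. It finds that combining synthetic and real data improves language modeling and downstream task performance, with the largest gains on tasks that require reasoning.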
Details
Motivation: The real text data needed to train large language models may soon run out, so researchers are exploring synthetic data generated by language models as a supplement to get past the data bottleneck. Method: In a controlled setting, synthetic corpora are generated with contrastive decoding, which uses the relative difference between a good and a bad model trained on the same original corpus (100 million words); the synthetic corpus is then mixed with the original training data for further training. Result: Training on a mixture of synthetic and real data improves performance on the language modeling objective and on a range of downstream tasks. Synthetic data from contrastive decoding helps more on tasks that require reasoning, while data from traditional sampling helps more on tasks that depend on surface-level linguistic ability. Conclusion: Adding synthetic data generated by contrastive decoding can effectively strengthen language model training, with particular promise for improving reasoning, and offers a feasible path past limits on training data size. Abstract: Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface-level linguistic capabilities.
[83] Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Yaojie Lu,Xianpei Han,Le Sun,WenJuan Zhang,Pengbo Wang,Shixuan Liu,Zhenru Zhang,Jianhong Tu,Hongyu Lin,Junyang Lin
Main category: cs.CL
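TL;DR: This paper proposes DeepMiner, a framework that elicits deep reasoning in large models during multi-turn, long-horizon interactions through high-difficulty training tasks and a dynamic context management mechanism.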
Details
Motivation: Existing methods struggle to elicit deep reasoning in multi-turn agents, and are especially constrained by context handling in long-horizon interactions. Method: A reverse construction method generates complex but verifiable question-answer pairs, and a dynamic context management strategy that needs no external summarization model is designed; training uses reinforcement learning. Result: DeepMiner-32B, trained on Qwen3-32B, improves substantially on several search agent benchmarks. For example, it reaches 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and it supports sustained interactions of nearly 100 turns. Conclusion: DeepMiner effectively improves reasoning in multi-turn, long-context settings, addresses the context length limits of existing systems, and shows strong capacity for sustained interaction and complex task handling. Abstract: While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and a dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
[84] Neuron-Level Analysis of Cultural Understanding in Large Language Models
Taisei Yamamoto,Ryoma Kumon,Danushka Bollegala,Hitomi Yanaka
Main category: cs.CL
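TL;DR: This paper proposes a gradient-based scoring method for identifying the neurons responsible for cultural understanding in large language models. It finds that fewer than 1% of neurons, concentrated in shallow to middle MLP layers, play a key role in cultural behavior, and it verifies their effect on cultural benchmark performance.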
Details
Motivation: Large language models exhibit cultural bias and limited awareness of underrepresented cultures, and the mechanisms behind their cultural understanding remain unclear, so their cultural behavior needs to be analyzed at the neuron level. Method: A gradient-based scoring method with an additional filtering step pinpoints the neurons that drive cultural behavior, distinguishes culture-general from culture-specific neurons, and validates their function through suppression experiments. Result: The key neurons make up less than 1% of all neurons and are concentrated in shallow to middle MLP layers. Suppressing them degrades cultural benchmark performance by up to 30% with little effect on general natural language understanding tasks; culture-specific neurons also support knowledge of related cultures; and training on NLU tasks can weaken a model's cultural understanding. Conclusion: A small number of key neurons dominate cultural understanding in large language models; the study sheds light on the internal mechanism and offers practical guidance for model training and engineering. Abstract: As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG
[85] AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
Muxi Diao,Yutao Mou,Keqing He,Hanbo Song,Lulu Zhao,Shikun Zhang,Wei Ye,Kongming Liang,Zhanyu Ma
Main category: cs.CL
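TL;DR: Proposes AutoRed, a free-form adversarial prompt generation framework that needs no seed instructions, to improve safety evaluation of large language models.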
Details
Motivation: Existing red teaming methods rely on seed instructions, which limits the semantic diversity of adversarial prompts and hinders comprehensive safety evaluation of large language models.
Method: AutoRed uses a two-stage framework. The first stage generates adversarial instructions under persona guidance; the second iteratively refines low-quality prompts through a reflection loop. A verifier assesses prompt harmfulness without querying the target model.
Result: Two red teaming datasets, AutoRed-Medium and AutoRed-Hard, are built and used to evaluate eight mainstream large language models. AutoRed outperforms existing baselines in attack success rate and generalization.
Conclusion: Seed-instruction methods are limited, and free-form red teaming uncovers model safety issues more effectively. AutoRed offers a new direction for LLM safety evaluation.
Abstract: The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets (AutoRed-Medium and AutoRed-Hard) and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
[86] Two-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media
Yukai Song,Pengfei Zhou,César Escobar-Viera,Candice Biernesser,Wei Huang,Jingtong Hu
Main category: cs.CL
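TL;DR: Proposes a two-stage voting architecture that combines a lightweight BERT model with a multi-perspective large language model (LLM) voting mechanism to balance efficiency and accuracy when detecting explicit and implicit suicidal ideation. It performs strongly on multiple datasets while substantially reducing LLM cost.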
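The two-stage routing summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `stage1` and voter callables, the 0.9 confidence threshold, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a Stage-1 classifier returns (label, confidence); each Stage-2 voter returns a label.
Stage1 = Callable[[str], Tuple[int, float]]
Voter = Callable[[str], int]


def two_stage_predict(text: str, stage1: Stage1, voters: List[Voter],
                      threshold: float = 0.9) -> Tuple[int, str]:
    """Return (label, route), where label 1 = at risk and 0 = not at risk."""
    label, confidence = stage1(text)
    if confidence >= threshold:
        # High-confidence case: resolved by the lightweight model alone.
        return label, "stage1"
    # Ambiguous case: escalate to the voters and take the majority label.
    votes = Counter(voter(text) for voter in voters)
    top = max(votes.values())
    winners = [lab for lab, n in votes.items() if n == top]
    # Assumed tie-break: prefer the positive class to favor recall.
    return (1 if 1 in winners else winners[0]), "stage2"


# Toy stand-ins for a BERT classifier and three LLM "perspectives".
def toy_stage1(text: str) -> Tuple[int, float]:
    return (1, 0.99) if "explicit" in text else (0, 0.55)


toy_voters: List[Voter] = [lambda t: 1, lambda t: 1, lambda t: 0]
```

Only inputs that fall below the threshold reach the voters, which is where the cost saving in such a cascade would come from.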
Details
Motivation: Rising suicide rates call for effective prevention. Many at-risk individuals avoid formal help because of stigma and instead express distress indirectly on social media. Existing models struggle to identify implicit suicidal ideation, and large language models are computationally expensive.
Method: A two-stage voting architecture. In Stage 1, a lightweight BERT model quickly resolves high-confidence explicit cases. In Stage 2, ambiguous samples are passed either to a multi-perspective LLM voting system, to raise recall on implicit ideation, or to a feature-engineered ML ensemble built on psychological indicators for efficient, interpretable detection; the psychological features are extracted by prompt-engineered LLMs.
Result: On the explicit-dominant Reddit dataset and the implicit-only DeepSuiMind dataset, the framework reaches F1 scores of 98.0% and 99.7% respectively, narrows the cross-domain gap to below 2%, and significantly lowers LLM call cost.
Conclusion: The framework is among the first to turn LLM-extracted psychological features into structured vectors for suicide risk detection. It balances efficiency, robustness, and interpretability, and offers a workable option for practical use.
Abstract: Suicide rates have risen worldwide in recent years, underscoring the urgent need for proactive prevention strategies. Social media provides valuable signals, as many at-risk individuals, who often avoid formal help due to stigma, choose instead to share their distress online. Yet detecting implicit suicidal ideation, conveyed indirectly through metaphor, sarcasm, or subtle emotional cues, remains highly challenging. Lightweight models like BERT handle explicit signals but fail on subtle implicit ones, while large language models (LLMs) capture nuance at prohibitive computational cost. To address this gap, we propose a two-stage voting architecture that balances efficiency and robustness. In Stage 1, a lightweight BERT classifier rapidly resolves high-confidence explicit cases. In Stage 2, ambiguous inputs are escalated to either (i) a multi-perspective LLM voting framework to maximize recall on implicit ideation, or (ii) a feature-based ML ensemble guided by psychologically grounded indicators extracted via prompt-engineered LLMs for efficiency and interpretability. To the best of our knowledge, this is among the first works to operationalize LLM-extracted psychological features as structured vectors for suicide risk detection. On two complementary datasets (explicit-dominant Reddit and implicit-only DeepSuiMind), our framework outperforms single-model baselines, achieving 98.0% F1 on explicit cases, 99.7% on implicit ones, and reducing the cross-domain gap below 2%, while significantly lowering LLM cost.
[87] On the Relationship Between the Choice of Representation and In-Context Learning
Ioana Marinescu,Kyunghyun Cho,Eric Karl Oermann
Main category: cs.CL
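TL;DR: This paper studies how the representation of demonstrations relates to learning capacity in in-context learning (ICL). It finds that representation quality sets the baseline accuracy of ICL, while learning improves performance on top of that baseline, and that the two are largely independent.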
Details
Motivation: Prior work has stressed the importance of demonstration representation in ICL, but how it interacts with learning capacity is unclear. This paper examines whether the two affect ICL performance independently.
Method: An optimization algorithm enumerates label sets (representations) of varying semantic relevance. ICL experiments are then run with different numbers of in-context demonstrations to analyze how representation quality relates to learning efficiency.
Result: Learning occurs for every label set, but the slope of improvement depends on label set quality and on model parameter count. The relative ranking of representations stays stable throughout learning, indicating that representation and learning are largely orthogonal.
Conclusion: In ICL, the representation of demonstrations and the learning process are independent: representation sets baseline performance and learning improves on it incrementally, which suggests the two can be optimized separately.
Abstract: In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
[88] If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth,Philipp Mondorf,Barbara Plank
Main category: cs.CL
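TL;DR: This study examines how large language models (LLMs) judge the acceptability of conditional statements. Models show some sensitivity to both conditional probability and semantic relevance, but they are less consistent than humans, and larger model size does not markedly improve agreement with human judgments.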
Details
Motivation: To understand how large language models assess the acceptability of conditional statements, filling a gap in prior work, and to compare their judgments with the mechanisms behind human judgments.
Method: Linear mixed-effects models and analysis of variance (ANOVA) are used to evaluate LLMs' conditional acceptability judgments systematically across model families, sizes, and prompting strategies, with a comparison against human data.
Result: LLMs are sensitive to conditional probability and semantic relevance, but the degree varies with architecture and prompting style. They integrate the two cues less consistently than humans, and larger models do not show closer agreement with humans.
Conclusion: Current LLMs remain limited in reproducing human judgments of conditional acceptability. Larger scale does not guarantee closer alignment with human cognition, which points to a need for better joint reasoning over probability and semantics.
Abstract: Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance, though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.
[89] Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT
Noor Ul Zain,Mohsin Raza,Ahsan Adeel
Main category: cs.CL
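TL;DR: Co$^4$ is a very small model (one layer, two heads, 8M parameters) that substantially outperforms the larger GPT-2 and GPT-BERT models in training efficiency and sample efficiency, suggesting that current deep learning paradigms and scaling laws may need to be rethought.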
Details
Motivation: To challenge the prevailing paradigm of training large, deep models and to explore more efficient, lower-compute language model architectures.
Method: The Co$^4$ model uses a single layer with two attention heads and has $O(N)$ computational complexity. It is pretrained and evaluated on the BabyLM Challenge.
Result: After only two epochs on 10M tokens, Co$^4$ clearly outperforms GPT-2 and GPT-BERT trained for ten epochs, and it achieves better results on SuperGLUE zero-shot and fine-tuning tasks.
Conclusion: Small, efficient models can outperform larger ones when designed appropriately, which calls for a fresh look at current scaling strategies and deep learning paradigms.
Abstract: We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
[90] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
Shuang Chen,Yue Guo,Yimeng Ye,Shijue Huang,Wenbo Hu,Haoxi Li,Manyuan Zhang,Jiayu Chen,Song Guo,Nanyun Peng
Main category: cs.CL
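TL;DR: This paper proposes ARES, a unified open-source framework built on high window-entropy (HWE) tokens that allocates exploration effort dynamically through adaptive reasoning, improving the efficiency and performance of multimodal large models across tasks of varying difficulty.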
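As a rough illustration of the window-entropy idea (token-level entropies averaged under a sliding window), the sketch below computes a trailing-window average and flags positions above a threshold. The trailing window, the window size, and the threshold are assumptions chosen for illustration; the paper may define the window and the selection rule differently.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropy(entropies: Sequence[float], window: int) -> List[float]:
    """Average entropy over a trailing window ending at each position.

    The first positions use a shorter window (assumed edge handling).
    """
    out: List[float] = []
    running = 0.0
    for i, h in enumerate(entropies):
        running += h
        if i >= window:
            running -= entropies[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def high_window_entropy_positions(entropies: Sequence[float], window: int,
                                  threshold: float) -> List[int]:
    """Indices whose window-averaged entropy exceeds a (hypothetical) threshold."""
    averaged = window_entropy(entropies, window)
    return [i for i, h in enumerate(averaged) if h > threshold]
```

Averaging smooths out isolated single-token spikes, so a position is flagged only when uncertainty persists across neighboring tokens.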
Details
Motivation: Existing multimodal large reasoning models (MLRMs) overthink simple tasks and under-explore complex ones, and they cannot adjust the reasoning process to task difficulty.
Method: The ARES framework comprises an Adaptive Cold-Start stage and Adaptive Entropy Policy Optimization (AEPO). It uses HWE tokens as exploration triggers and designs a hierarchical entropy reward with dynamic KL control to decide when and how much to explore.
Result: Experiments show that ARES achieves better performance and reasoning efficiency on a range of mathematical, logical, and multimodal benchmarks, at significantly lower inference cost than leading commercial systems.
Conclusion: ARES uses HWE to identify reasoning-critical moments and achieves difficulty-aware adaptive reasoning, improving results on complex tasks while avoiding over-reasoning on simple ones.
Abstract: Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.
[91] LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task
Elisa Leonardelli,Silvia Casola,Siyao Peng,Giulia Rizzi,Valerio Basile,Elisabetta Fersini,Diego Frassinelli,Hyewon Jang,Maja Pavlovic,Barbara Plank,Massimo Poesio
Main category: cs.CL
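TL;DR: The third edition of LEWIDI extends the benchmark to four datasets covering different tasks, introduces two complementary evaluation paradigms (soft-label and perspectivist), and tests new evaluation metrics, advancing AI models' ability to recognize variation in human judgments.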
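To make the two paradigms concrete, here is a minimal sketch: a soft label is the distribution of annotator judgments for an item, and a perspectivist score compares predictions with each annotator's own label. Cross-entropy is shown only because the abstract names it as the standard metric the task moves beyond; the function names and the per-annotator accuracy are illustrative assumptions, not the shared task's official metrics.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def soft_label(annotations: Sequence[str], classes: Sequence[str]) -> List[float]:
    """Population-level distribution of judgments for one item."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]


def cross_entropy(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a target soft label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def annotator_accuracy(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Perspectivist view: fraction of annotators whose label was predicted."""
    hits = sum(1 for annotator, label in gold.items()
               if predicted.get(annotator) == label)
    return hits / len(gold)
```

A model can score well on the soft-label view while still mispredicting which individual annotators disagree, which is why the two views are complementary.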
Details
Motivation: Many studies conclude that AI models should be able to recognize and handle variation and disagreement in human judgments. Progress in this direction needs more accessible datasets and more effective evaluation methods.
Method: The LEWIDI benchmark is extended to four tasks (paraphrase identification, irony detection, sarcasm detection, and natural language inference) with both categorical and ordinal annotations. Two evaluation paradigms are adopted, soft-label (predicting the population-level distribution of judgments) and perspectivist (predicting individual annotators' judgments), and new evaluation metrics are explored.
Result: The task attracted diverse participation. The results reveal the strengths and limitations of current modeling approaches in handling judgment variation, and the new metrics and dual-paradigm design go beyond standard measures such as cross-entropy.
Conclusion: LEWIDI is strengthened as a framework and provides new resources, benchmarks, and findings that support the development of disagreement-aware technologies.
Abstract: Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.
[92] DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu,Yaxuan Li,Yushi Bai,Lei Hou,Juanzi Li
Main category: cs.CL
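TL;DR: Proposes DeepPrune, a framework that uses dynamic pruning to reduce redundancy in parallel reasoning, substantially cutting computational overhead while maintaining high accuracy.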
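The pruning idea can be sketched as an online greedy clustering pass over incoming traces, where a judge decides whether two traces are heading to the same answer. This is a simplified illustration under stated assumptions: the `same_answer` callable stands in for the trained judge model, the comparison against one representative per cluster is an assumed simplification, and `keep_per_cluster` is a hypothetical parameter.

```python
from typing import Callable, List

Judge = Callable[[str, str], bool]


def greedy_cluster(traces: List[str], same_answer: Judge) -> List[List[str]]:
    """Assign each trace to the first cluster whose representative it matches."""
    clusters: List[List[str]] = []
    for trace in traces:
        for cluster in clusters:
            # Compare only against the cluster's first member (its representative).
            if same_answer(cluster[0], trace):
                cluster.append(trace)
                break
        else:
            # No existing cluster matched: this trace opens a new one.
            clusters.append([trace])
    return clusters


def prune(traces: List[str], same_answer: Judge,
          keep_per_cluster: int = 1) -> List[str]:
    """Keep a few traces per cluster so distinct answers survive pruning."""
    kept: List[str] = []
    for cluster in greedy_cluster(traces, same_answer):
        kept.extend(cluster[:keep_per_cluster])
    return kept


def toy_judge(a: str, b: str) -> bool:
    """Toy stand-in for a learned judge: compare the final token of each trace."""
    return a.split()[-1] == b.split()[-1]
```

Because each new trace is compared with one representative per cluster, the number of judge calls grows with the number of distinct answers rather than with the number of traces squared.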
Details
Motivation: Parallel scaling improves the reasoning ability of large models, but the reasoning paths are highly redundant (over 80%), which makes computation inefficient.
Method: A judge model trained with focal loss and oversampling predicts answer equivalence, and an online greedy clustering algorithm uses these predictions to prune redundant paths dynamically.
Result: On three benchmarks (AIME 2024, AIME 2025, and GPQA), token usage falls by more than 80%, the judge reaches 0.87 AUROC, and accuracy stays within 3 percentage points.
Conclusion: DeepPrune sets a new standard for efficient parallel reasoning, saving substantial compute while keeping competitive performance.
Abstract: Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy: our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces, which achieves 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves token reduction of over 80% compared to conventional consensus sampling in most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/
[93] Neologism Learning for Controllability and Self-Verbalization
John Hewitt,Oyvind Tafjord,Robert Geirhos,Been Kim
Main category: cs.CL
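TL;DR: This paper explores introducing new words (neologism learning) into interactions with large language models (LLMs) to better understand and control model behavior. Adding a new word embedding and training it on examples gives control over concepts such as flattery, incorrect answers, and text length. The model's own self-verbalizations are used to explain what each new word means, and a "plug-in evaluation" validates them. The paper also reports machine-only synonyms and shows joint learning of multiple concepts across multiple words.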
Details
Motivation: Inspired by how humans coin new words when new needs arise, the authors ask whether introducing new words into interactions with large language models allows finer control over, and understanding of, the models' internal concepts.
Method: A new word embedding is added and trained on examples that exhibit a specific concept, with no other model parameters changed. The model's self-verbalization is used to explain the new word's meaning, and plug-in evaluation validates those self-verbalizations.
Result: The method controls a range of simple and complex concepts (such as flattery, incorrect answers, and text length). Models can describe the new words' meanings in natural language. Some "machine-only synonyms" are found: words that look meaningless to humans but trigger similar behavior. Multiple concepts can be learned jointly.
Conclusion: Introducing new words is an effective and interpretable way to control and understand abstract concepts in large language models. It also exposes differences between model-internal representations and human intuition, opening a new path for controllability and interpretability.
Abstract: Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means "a lack of complete, coherent, or meaningful answers..." To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
[94] Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator
Hyunji Lee,Kevin Chenhao Li,Matthias Grabmair,Shanshan Xu
Main category: cs.CL
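TL;DR: This paper proposes a prompt optimization framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to improve the accuracy and efficiency of fairness detection in Terms of Service (ToS) clauses.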
Details
Motivation: Existing prompt optimization methods use inefficient search strategies and incur high evaluation costs, which makes them hard to apply to legal NLP tasks when compute is limited.
Method: Monte Carlo Tree Search (MCTS) explores the prompt space efficiently, and a lightweight proxy evaluator reduces the number of calls to the large language model, lowering evaluation cost.
Result: Under a limited computation budget, the method improves on baseline methods in both classification accuracy and efficiency.
Conclusion: The proposed framework optimizes prompts more effectively and suits legal text processing in resource-constrained settings.
Abstract: Prompt optimization aims to systematically refine prompts to enhance a language model's performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.
[95] Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Wenjie Du,Li Jiang,Keda Tao,Xue Liu,Huan Wang
Main category: cs.CL
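TL;DR: Proposes RLKV, a framework that uses reinforcement learning to identify reasoning-critical attention heads and compress the KV cache efficiently, cutting cache overhead by 20-50% while preserving reasoning performance.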
Details
Motivation: Existing KV cache compression methods perform poorly on reasoning models: they break reasoning coherence or mistakenly compress critical attention heads, causing significant performance loss.
Method: RLKV uses reinforcement learning to optimize directly the relationship between each attention head's cache usage and reasoning quality. It identifies critical heads from rewards on generated samples, keeps the full cache for those heads, and compresses the rest.
Result: Experiments show that only a small number of attention heads must be kept to maintain reasoning performance. At 20-50% cache reduction the method is close to lossless and outperforms baseline methods.
Conclusion: RLKV identifies reasoning-critical heads effectively, greatly reducing KV cache overhead while keeping complex reasoning behavior intact.
Abstract: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.
[96] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards
Xiangyuan Xue,Yifan Zhou,Guibin Zhang,Zaibin Zhang,Yijiang Li,Chen Zhang,Zhenfei Yin,Philip Torr,Wanli Ouyang,Lei Bai
Main category: cs.CL
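TL;DR: This paper proposes CoMAS, a framework in which agents evolve autonomously, without external supervision, through discussion among multiple agents. An LLM-as-a-judge mechanism generates intrinsic reward signals, reinforcement learning optimizes the policies, and the approach reaches leading performance across a range of evaluations.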
Details
Motivation: Existing RL-based self-evolution methods depend on dense external rewards or on intrinsic rewards extracted from the model itself, and they do not reflect how humans learn through collaboration and discussion. A self-evolution framework closer to human learning is needed.
Method: CoMAS generates intrinsic reward signals from the dynamics of discussion among multiple agents, uses an LLM-as-a-judge mechanism to evaluate and formulate the rewards, and optimizes each agent's policy with reinforcement learning, enabling decentralized and scalable co-evolution.
Result: CoMAS outperforms untrained agents and reaches state-of-the-art performance in most evaluation settings. Ablations confirm that interaction-based reward signals are necessary, and scalability looks promising as the number and diversity of agents grow.
Conclusion: CoMAS offers a novel and effective self-evolution paradigm for LLM-based agents, built on multi-agent collaboration and intrinsic interaction rewards.
Abstract: Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
[97] ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Qin Liu,Jacob Dineen,Yuxi Huang,Sheng Zhang,Hoifung Poon,Ben Zhou,Muhao Chen
Main category: cs.CL
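TL;DR: ArenaBencher is a model-agnostic framework for automatic benchmark evolution. It updates test cases to counter leakage from pretraining data, preserving comparability while exposing weaknesses shared across models.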
Details
Motivation: Data leakage from pretraining corpora is widespread and erodes the validity of benchmarks. Models may rely on memorization rather than true generalization, which distorts evaluation results and assessments of progress.
Method: Starting from an existing benchmark and a diverse pool of models, ArenaBencher infers the core ability behind each test case, generates new question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. It iterates to produce more challenging and diagnostic test cases.
Result: Applied to math problem solving, commonsense reasoning, and safety, ArenaBencher produces verified, diverse, and fair benchmark updates that uncover new failure modes, raise difficulty while staying aligned with the original objective, and improve separability between models.
Conclusion: The framework offers a scalable path for keeping benchmarks evolving alongside fast-moving foundation models, improving the validity and reliability of evaluation.
Abstract: Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
cs.CV [Back]
[98] Enhancing Maritime Object Detection in Real-Time with RT-DETR and Data Augmentation
Nader Nemati
Main category: cs.CV
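TL;DR: This paper proposes a real-time maritime object detection system based on RT-DETR. It improves small-object detection through multi-scale feature fusion, refined query selection, and a weighting strategy between synthetic and real data, and it validates the system on real data.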
Details
Motivation: Maritime targets are small and labeled real RGB data is scarce, which challenges conventional detection methods, so detection accuracy and robustness need to improve.
Method: The RT-DETR framework is used with a multi-scale feature fusion module, an uncertainty-minimizing query selection mechanism, and a smart weighting strategy between synthetic and real samples, together with data augmentation to balance the class distribution.
Result: The system runs in real time on real data and improves detection of small, low-contrast vessels. Ablation experiments verify the contribution of each module and the system's stability under extreme conditions.
Conclusion: The improved RT-DETR framework performs well on small maritime targets, balances speed and accuracy, and has practical value.
Abstract: Maritime object detection faces essential challenges due to small target sizes and the limited availability of labeled real RGB data. This paper presents a real-time object detection system based on RT-DETR, enhanced by employing augmented synthetic images while strictly evaluating on real data. This study adapts RT-DETR to the maritime environment by combining multi-scale feature fusion, uncertainty-minimizing query selection, and smart weighting between synthetic and real training samples. The fusion module in DETR enhances the detection of small, low-contrast vessels, query selection focuses on the most reliable proposals, and the weighting strategy helps reduce the visual gap between synthetic and real domains. This design preserves DETR's refined end-to-end set prediction while allowing users to adjust between speed and accuracy at inference time. Data augmentation techniques were also used to balance the different classes of the dataset to improve the robustness and accuracy of the model. The study delivers a complete, robust maritime detection pipeline in Python that maintains real-time performance even under practical limits. It also verifies how each module contributes, and how the system handles failures in extreme lighting or sea conditions. This study also includes a component analysis to quantify the contribution of each architectural module and explore its interactions.
[99] DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis
Nithin C. Babu,Aniruddha Mahapatra,Harsh Rangwani,Rajiv Soundararajan,Kuldeep Kulkarni
Main category: cs.CV
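TL;DR: This paper introduces DynamicEval, a text-to-video evaluation benchmark focused on dynamic camera motion. With 45k human annotations and newly proposed background and foreground consistency metrics, it makes the assessment of generated video quality more accurate and more comprehensive.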
Details
Motivation: Existing T2V benchmarks evaluate dynamic camera motion poorly and lack fine-grained, video-level assessment, so they do not measure the true quality of generated videos well.
Method: A systematically curated prompt set emphasizing dynamic camera motion is built, and 45k human pairwise annotations are collected over 3k videos. Interpretable error maps are derived from the VBench motion smoothness metric, and a new background consistency metric uses object error maps to correct for occlusion and disocclusion. A point-tracking foreground consistency metric assesses object fidelity.
Result: The new metrics correlate more strongly with human preferences at both the video level and the model level, improving on existing metrics by more than 2 percentage points.
Conclusion: With finer-grained background and foreground consistency evaluation, DynamicEval provides a more comprehensive assessment of T2V models under dynamic camera motion that better matches human perception.
Abstract: Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion, which is essential for producing cinematic shots, and the behavior of existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. We observe that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
[100] Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors
Marien Renaud,Julien Hermant,Deliang Wei,Yu Sun
Main category: cs.CV
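TL;DR: Proposes RISP (Restarted Inertia with Score-based Priors), a new method for imaging inverse problems that converges faster than RED and recovers high-quality images without requiring the prior to be convex.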
Details
Motivation: Existing methods such as RED typically focus on designing sophisticated image priors to improve reconstruction quality, while convergence acceleration is left to heuristics. This paper extends RED in a principled way to obtain both fast convergence and high-quality reconstruction.
Method: The RISP framework combines score-based image priors with a restarted inertia mechanism. The associated continuous-time dynamical system is derived and its connection to the heavy-ball ODE analyzed, and a faster stationary-point convergence rate than RED is proved.
Result: The analysis shows that RISP converges faster even with non-convex priors, and experiments confirm fast convergence with high-quality reconstruction across several imaging inverse problems.
Conclusion: RISP provides an effective framework for imaging inverse problems that balances convergence speed and reconstruction quality, and it improves on conventional RED.
Abstract: Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems. Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics. To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED. RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction. We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior. We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE). Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.
[101] A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
Guoliang Gong,Man Yu
Main category: cs.CV
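TL;DR: This paper proposes an ultra-low dose CT denoising framework based on an Image Purification (IP) strategy, combined with a Frequency-domain Flow Matching (FFM) model. It addresses the severe noise and spatial misalignment of image pairs in real clinical uLDCT and reaches state-of-the-art results in preserving anatomical structure.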
Details
Motivation: Ultra-low dose CT (uLDCT) greatly reduces radiation but introduces severe noise and artifacts, and it is spatially misaligned with normal dose CT (NDCT). Existing denoising networks trained on synthetic noise or aligned data are therefore hard to apply directly.
Method: A real clinical uLDCT lung dataset is built. An Image Purification (IP) strategy generates structurally aligned uLDCT-NDCT image pairs, and a Frequency-domain Flow Matching (FFM) model is designed to preserve anatomical structure in the frequency domain.
Result: Experiments show that the IP strategy significantly improves several mainstream denoising models on the uLDCT task, and that FFM combined with IP gives the best anatomical structure preservation.
Conclusion: The IP strategy and the FFM model offer an effective approach to uLDCT denoising in real-world settings and mitigate the data mismatch problem.
Abstract: Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.
[102] D2RA: Dual Domain Regeneration Attack
Pragati Shuddhodhan Meshram,Varun Chandrasekaran
Main category: cs.CV
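TL;DR: This paper proposes D2RA, a training-free, single-image attack that effectively weakens or removes existing semantic watermarks without access to the generative model, exposing a fundamental weakness of current watermarking schemes against projection onto natural priors.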
Details
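Motivation: As generative models become widespread, content provenance and attribution grow more important, yet existing semantic watermarking methods remain fragile under lightweight adversarial attacks, so their practical security needs to be assessed.
Method: Proposes D2RA, a training-free single-image attack that suppresses the watermark signal by projecting the watermarked image onto natural priors in multiple complementary representations (such as the frequency domain and latent space) while preserving visual quality.
Result: Experiments on a variety of semantic watermarking schemes show that D2RA significantly reduces watermark detectability, requires no model access, and remains effective in resource-constrained adversarial settings.
Conclusion: Current semantic watermark designs have fundamental security flaws; embedding in latent space or the frequency domain alone is not enough to resist removal attacks based on natural priors, and more robust watermarking mechanisms are needed.
Abstract: The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.
[103] PickStyle: Video-to-Video Style Transfer with Context-Style Adapters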
Soroush Mehraban,Vida Adeli,Jacob Rommann,Babak Taati,Kyryl Truskovskyi
Main category: cs.CV
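TL;DR: This paper proposes PickStyle, a video style transfer framework built on pretrained video diffusion models. It trains low-rank adapters on paired still image data and introduces CS-CFG to separate content and style control, yielding temporally coherent, style-faithful, and content-preserving video style transfer.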
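To make the idea of splitting classifier-free guidance into separate context and style directions more concrete, here is a minimal sketch. It assumes a two-term combination in the style of InstructPix2Pix guidance; the function name `cs_cfg`, the two scale parameters, and the exact form of the combination are illustrative assumptions and are not taken from the PickStyle paper.

```python
def cs_cfg(eps_uncond, eps_ctx, eps_full, ctx_scale, style_scale):
    """Combine three denoiser outputs with separate context and style scales.

    ASSUMPTION: this two-direction form is illustrative only.
    eps_uncond: prediction with neither video context nor text prompt
    eps_ctx:    prediction with the video context only
    eps_full:   prediction with the video context and the text (style) prompt
    """
    return [
        # (c - u) is the context direction, (f - c) is the style direction
        u + ctx_scale * (c - u) + style_scale * (f - c)
        for u, c, f in zip(eps_uncond, eps_ctx, eps_full)
    ]
```

With both scales set to 1 the combination reduces to the fully conditioned prediction; raising one scale strengthens only the corresponding direction.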
Details
Motivation: Because paired video data for supervision is lacking, existing video style transfer methods struggle to maintain both content consistency and style accuracy, which calls for a learning framework that can exploit still image data and effectively disentangle content from style.
Method: PickStyle inserts low-rank style adapters into a pretrained video diffusion model and trains them on paired still images with source-style correspondences; synthetic video clips are generated with shared augmentations that simulate camera motion so that temporal priors are preserved; Context-Style Classifier-Free Guidance (CS-CFG) factorizes classifier-free guidance into independent text (style) and video (context) directions for better disentangled control.
Result: Experiments show that PickStyle outperforms existing methods on multiple benchmarks, both qualitatively and quantitatively, with strong temporal consistency, style fidelity, and content preservation.
Conclusion: By combining image-level supervision, adapter design, and a new guidance mechanism, PickStyle addresses style transfer without paired video data and offers an efficient, controllable solution for video stylization.
Abstract: We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
[104] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Saman Motamed,Minghao Chen,Luc Van Gool,Iro Laina
Main category: cs.CV
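TL;DR: This paper proposes the TRAVL fine-tuning recipe and the ImplausiBench benchmark to improve video-language models' judgment of physical plausibility, addressing the shortcomings of existing models in temporal and causal reasoning.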
Details
Motivation: Existing video generative models often produce sequences that violate physical laws, but effective quantitative evaluation is lacking and existing video-language models have difficulty identifying such violations.
Method: Proposes the TRAVL fine-tuning strategy, which combines a balanced dataset with a trajectory-aware attention module, and builds ImplausiBench, a benchmark free of linguistic bias containing 300 real and generated videos, to evaluate physical plausibility judgment.
Result: Experiments show that existing VLMs are limited in physical reasoning, while TRAVL markedly improves performance on ImplausiBench; the evaluation agrees with human judgments and is further checked with LLM-as-judge metrics.
Conclusion: Together, TRAVL and ImplausiBench form an effective framework for evaluating and improving physical plausibility in multimodal models, advancing the modeling of physical common sense in visual-temporal understanding.
Abstract: Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
[105] Label Semantics for Robust Hyperspectral Image Classification
Rafin Hassan,Zarin Tasnim Roshni,Rafiqul Bari,Alimul Islam,Nabeel Mohammed,Moshiur Farazi,Shafin Rahman
Main category: cs.CV
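TL;DR: Proposes the Semantic Spectral-Spatial Fusion Network (S3FN), which uses LLM-generated textual descriptions to improve hyperspectral image classification, with significant gains on several datasets.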
Details
Motivation: Hyperspectral image classification suffers from scarce training samples, high data dimensionality, and the limits of relying on spectral-spatial information alone, so existing models overfit easily and their performance is capped; external semantic information is needed to improve feature-label alignment.
Method: Proposes the S3FN framework, which uses an LLM to generate class-specific textual descriptions, extracts semantic features with a pre-trained text encoder (such as BERT or RoBERTa), and fuses them with spectral-spatial features for better classification decisions.
Result: The method is validated on three benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit), where it significantly improves classification performance over conventional methods.
Conclusion: Fusing textual semantics with spectral-spatial data improves the accuracy and robustness of hyperspectral image classification and opens a direction for semantically augmented classification models.
Abstract: Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal, relying solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture its unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment and improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: https://github.com/milab-nsu/S3FN
[106] Cross-Modal Attention Guided Unlearning in Vision-Language Models
Karuna Bhaila,Aneesh Komanduri,Minh-Hao Van,Xintao Wu
Main category: cs.CV
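TL;DR: Proposes CAGUL, a lightweight and efficient unlearning framework for vision-language models that uses cross-modal attention guidance to handle sensitive information leakage in VQA, without fine-tuning model parameters.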
Details
Motivation: Vision-language models may memorize and leak sensitive information from training data in multimodal tasks, through both the visual and the textual input; existing unlearning methods are computationally expensive and not suited to VLMs.
Method: Uses cross-modal attention to analyze the importance of visual tokens and designs external modules that transform low-importance visual tokens to encode unlearning information, enabling efficient unlearning without fine-tuning.
Result: Experiments show that CAGUL prevents information leakage while retaining the original model's behavior, performing better than or on par with finetuning-based methods without retraining.
Conclusion: CAGUL is a practical and efficient unlearning approach for VLMs, particularly for privacy-sensitive VQA scenarios.
Abstract: Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better than or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.
[107] MaizeStandCounting (MaSC): Automated and Accurate Maize Stand Counting from UAV Imagery Using Image Processing and Deep Learning
Dewi Endah Kharismawati,Toni Kazic
Main category: cs.CV
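TL;DR: This paper proposes MaizeStandCounting (MaSC), an algorithm that automatically counts maize seedlings in RGB imagery captured by low-cost UAVs. It supports two modes, patch-wise processing of mosaic images and processing of raw video frames aligned with homography matrices, uses a lightweight YOLOv9 model to detect maize seedlings at the V2-V10 growth stages, and obtains precise counts through row and range segmentation. Experiments show agreement with manual counts (R² up to 0.906) and fast processing, indicating potential for real-time use.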
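As a small worked example of the two figures quoted for this entry, the sketch below computes a coefficient of determination between manual and automated counts and converts the reported 83 frames in 60.63 s into a per-frame time. The count lists are made-up toy values, and the choice of R² definition (1 minus residual over total sum of squares) is an assumption; the entry does not state which definition the authors used.

```python
def r_squared(manual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(manual) / len(manual)
    ss_res = sum((m - p) ** 2 for m, p in zip(manual, predicted))
    ss_tot = sum((m - mean) ** 2 for m in manual)
    return 1.0 - ss_res / ss_tot


# Hypothetical row-wise stand counts, for illustration only.
manual_counts = [10, 20, 30]
predicted_counts = [12, 18, 30]
toy_r2 = r_squared(manual_counts, predicted_counts)

# Throughput implied by the reported timing (83 frames in 60.63 s).
frames, seconds = 83, 60.63
seconds_per_frame = seconds / frames   # roughly 0.73 s per frame
frames_per_second = frames / seconds   # roughly 1.37 frames per second
```

At well under one second per full-resolution frame, the reported timing is consistent with the near-real-time claim, although actual throughput depends on the hardware used.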
Details
Motivation: Accurate maize stand counts are essential for crop management and research, but manual counting is slow, labor-intensive, and error-prone, especially across large or variable fields; an efficient, low-cost, automated method is needed.
Method: Proposes the MaizeStandCounting (MaSC) algorithm with two input modes, patches of mosaic images and aligned raw video frames; a lightweight YOLOv9 model detects maize seedlings (V2-V10 stages), and the spatial distribution of detections drives row and range segmentation, which excludes weeds and yields precise row-wise counts.
Result: On the 2024 summer nursery, MaSC agreed with manual counts (R² = 0.616 for mosaics, R² = 0.906 for raw frames) and processed 83 full-resolution frames in 60.63 seconds, including inference and post-processing.
Conclusion: MaSC is a scalable, low-cost, and accurate tool for automated maize stand counting in research and production settings, with potential for real-time operation.
Abstract: Accurate maize stand counts are essential for crop management and research, informing yield prediction, planting density optimization, and early detection of germination issues. Manual counting is labor-intensive, slow, and error-prone, especially across large or variable fields. We present MaizeStandCounting (MaSC), a robust algorithm for automated maize seedling stand counting from RGB imagery captured by low-cost UAVs and processed on affordable hardware. MaSC operates in two modes: (1) mosaic images divided into patches, and (2) raw video frames aligned using homography matrices. Both modes use a lightweight YOLOv9 model trained to detect maize seedlings from V2-V10 growth stages. MaSC distinguishes maize from weeds and other vegetation, then performs row and range segmentation based on the spatial distribution of detections to produce precise row-wise stand counts. Evaluation against in-field manual counts from our 2024 summer nursery showed strong agreement with ground truth (R^2 = 0.616 for mosaics, R^2 = 0.906 for raw frames). MaSC processed 83 full-resolution frames in 60.63 s, including inference and post-processing, highlighting its potential for real-time operation. These results demonstrate MaSC's effectiveness as a scalable, low-cost, and accurate tool for automated maize stand counting in both research and production environments.
[108] Quick-CapsNet (QCN): A fast alternative to Capsule Networks
Pouya Shiri,Ramin Sharifi,Amirali Baniasadi
Main category: cs.CV
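TL;DR: This paper proposes Quick-CapsNet (QCN), a capsule network that is faster than the conventional CapsNet and suited to real-time applications; inference is 5x faster on several datasets at a small cost in accuracy.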
Details
Motivation: CapsNet performs well on classification and is more robust to affine transformations, but its slow training and inference limit its use in real-time settings.
Method: Builds a lighter network, QCN, by reducing the number of capsules, and adopts a more powerful decoder to further improve performance.
Result: QCN is 5x faster at inference on MNIST, F-MNIST, SVHN, and Cifar-10, with a marginal loss in accuracy.
Conclusion: QCN is a good starting point toward fast, practical CapsNets for real-time applications.
Abstract: The basic computational unit in Capsule Network (CapsNet) is a capsule (vs. neurons in Convolutional Neural Networks (CNNs)). A capsule is a set of neurons, which form a vector. CapsNet is used for supervised classification of data and has achieved state-of-the-art accuracy on the MNIST digit recognition dataset, outperforming conventional CNNs in detecting overlapping digits. Moreover, CapsNet shows higher robustness towards affine transformation when compared to CNNs for MNIST datasets. One of the drawbacks of CapsNet, however, is slow training and testing. This can be a bottleneck for applications that require a fast network, especially during inference. In this work, we introduce Quick-CapsNet (QCN) as a fast alternative to CapsNet, which can be a starting point to develop CapsNet for fast real-time applications. QCN produces fewer capsules, which results in a faster network. QCN achieves this at the cost of a marginal loss in accuracy. Inference is 5x faster on MNIST, F-MNIST, SVHN and Cifar-10 datasets. We further improve QCN by replacing the default decoder with a more powerful one.
[109] Rectified-CFG++ for Flow Based Models
Shreshth Saini,Shashank Gupta,Alan C. Bovik
Main category: cs.CV
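TL;DR: Proposes Rectified-CFG++, an adaptive predictor-corrector guidance method that addresses the severe off-manifold drift that arises when classifier-free guidance (CFG) is applied to rectified flow (RF) based models.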
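The contrast between standard CFG and a predictor-corrector style of guidance can be sketched on scalar toy velocity fields. The first function is the usual CFG combination of conditional and unconditional velocities. The second is only an illustration of "conditional update first, weighted correction second": the half-step predictor, the weight `alpha`, and the function names are assumptions made for this sketch and should not be read as the exact Rectified-CFG++ update rule.

```python
def standard_cfg_velocity(v_cond, v_uncond, x, t, w):
    """Standard CFG: extrapolate from the unconditional toward the conditional velocity."""
    vc, vu = v_cond(x, t), v_uncond(x, t)
    return vu + w * (vc - vu)


def predictor_corrector_step(v_cond, v_uncond, x, t, dt, alpha):
    """One illustrative guided step (ASSUMED form, not the paper's exact rule)."""
    # Predictor: a conditional-only half step that stays near the transport path.
    vc = v_cond(x, t)
    x_mid = x + 0.5 * dt * vc
    t_mid = t + 0.5 * dt
    # Corrector: add a weighted conditional/unconditional gap evaluated at the predicted point.
    v_hat = vc + alpha * (v_cond(x_mid, t_mid) - v_uncond(x_mid, t_mid))
    return x + dt * v_hat


# Toy constant velocity fields, for illustration only.
toy_cond = lambda x, t: 1.0
toy_uncond = lambda x, t: 0.0
x_next = predictor_corrector_step(toy_cond, toy_uncond, x=0.0, t=0.0, dt=0.1, alpha=2.0)
```

With `alpha = 0` the step reduces to a plain conditional Euler update, which is one way to see the correction term as an add-on to the conditional flow instead of a replacement for it.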
Details
Motivation: Standard CFG causes severe off-manifold drift on rectified flow models, producing visual artifacts, text misalignment, and unstable behavior, which limits generation quality and robustness.
Method: A two-step scheme: first a conditional RF update anchors the sample near the transport path, then a weighted conditional correction interpolates between the conditional and unconditional velocity fields. The method couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule.
Result: The resulting velocity field is proven to be marginally consistent, with trajectories that stay within a bounded tubular neighbourhood of the data manifold; experiments on large text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmarks such as MS-COCO, LAION-Aesthetic, and T2I-CompBench.
Conclusion: Rectified-CFG++ resolves the off-manifold drift of CFG on rectified flow models, improving generation stability and alignment and supporting high-quality text-to-image generation under strong guidance.
Abstract: Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/
[110] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment
Shashank Gupta,Gregoire Phillips,Alan C. Bovik
Main category: cs.CV
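TL;DR: Proposes PIT-QMM, a new large multimodal model for no-reference point cloud quality assessment (NR-PCQA) that combines text, image, and point cloud data and significantly outperforms existing methods.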
Details
Motivation: Large multimodal models have advanced image and video quality assessment, but they remain underexplored for 3D assets, especially no-reference point cloud quality assessment.
Method: Uses multimodal data (text descriptions, 2D projections, and 3D point cloud views) to build PIT-QMM, an end-to-end large multimodal model that predicts point cloud quality scores.
Result: On several popular benchmarks, PIT-QMM significantly outperforms the state of the art with fewer training iterations, and it supports distortion localization and identification.
Conclusion: PIT-QMM provides an efficient and interpretable framework for NR-PCQA, advancing model explainability and interactivity.
Abstract: Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in the absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.
[111] Dual-Stream Alignment for Action Segmentation
Harshala Gammulle,Clinton Fookes,Sridha Sridharan,Simon Denman
Main category: cs.CV
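TL;DR: This paper proposes the Dual-Stream Alignment Network (DSA Net), the first to bring a hybrid quantum-classical machine learning framework to action segmentation. It aligns frame-wise and action-wise stream features and, with a temporal context block and a new loss function, achieves state-of-the-art performance on several benchmark datasets.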
Details
Motivation: Most existing action segmentation methods rely on single-stream models that cannot fully capture actions and their transitions. Two-stream methods have recently shown promise, but they lack effective cross-stream interaction and feature alignment strategies, which limits performance.
Method: Proposes DSA Net with frame-wise and action-wise streams; a Temporal Context (TC) block fuses the two streams using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM); a Dual-Stream Alignment Loss with three components (relational consistency, cross-level contrastive, and cycle-consistency reconstruction) encourages a shared feature space.
Result: On four benchmarks (GTEA, Breakfast, 50Salads, and EgoProcel), DSA Net significantly outperforms existing methods and achieves state-of-the-art action segmentation performance; ablation studies show that every component contributes.
Conclusion: Dual-stream feature alignment and quantum-classical fusion effectively improve action segmentation; DSA Net offers a new approach to action understanding, particularly for modeling action and action-transition cues.
Abstract: Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
[112] Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
Yanjie Pan,Qingdong He,Lidong Wang,Bo Peng,Mingmin Chi
Main category: cs.CV
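TL;DR: Proposes OIE, a video virtual try-on method based on first-frame clothing replacement: an image-to-image clothing transfer model replaces the garment in the first frame, and pose and mask information guide generation of the remaining frames, giving efficient, high-performing virtual try-on.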
Details
Motivation: Adapting existing dual-branch architectures to the Diffusion Transformer brings a large number of parameters and garment features that lack temporal characteristics, making efficient video virtual try-on difficult.
Method: Uses a first-frame clothing replacement strategy: an image virtual try-on model processes the first frame, and the edited first frame then serves as content control while pose and mask information guide the video diffusion model in generating the subsequent frames.
Result: Experiments show that the method is more parameter-efficient and computationally efficient than existing methods while maintaining leading performance.
Conclusion: By simplifying the architecture and exploiting temporal priors, OIE significantly improves efficiency while preserving generation quality, offering a new approach to Diffusion Transformer based video virtual try-on.
Abstract: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. First, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Second, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.
[113] MONKEY: Masking ON KEY-Value Activation Adapter for Personalization
James Baker
Main category: cs.CV
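TL;DR: Proposes an improved personalization method for diffusion models that uses the mask automatically generated by IP-Adapter to mask the background image tokens on a second generation pass, so that the text prompt has more influence on non-subject regions and alignment with both the prompt and the source image improves.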
Details
Motivation: Existing personalized diffusion models often rely too heavily on the source image and ignore the text prompt, so outputs lack diversity or do not match the prompt.
Method: Uses the subject mask that IP-Adapter generates automatically during inference to mask the background image tokens on a second generation pass, letting the text prompt control non-subject regions more effectively.
Result: For text prompts describing locations and places, the generated images accurately preserve the subject and better match the prompt, showing higher prompt alignment and source image consistency than other test-time personalization methods.
Conclusion: The mask-guided two-pass generation strategy balances subject fidelity and text control, improving the output quality of personalized diffusion models.
Abstract: Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often end up merely recreating the subject image and ignoring the text prompt. We observe that one popular method for personalization, IP-Adapter, automatically generates masks that definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test-time personalization methods, and find our method displays high prompt and source image alignment.
[114] Automatic Text Box Placement for Supporting Typographic Design
Jun Muraoka,Daichi Haraguchi,Naoto Inoue,Wataru Shimoda,Kota Yamaguchi,Seiichi Uchida
Main category: cs.CV
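TL;DR: This study compares Transformer-based and vision-language model (VLM) based methods for automatic text box placement in incomplete layouts and finds that task-specific Transformer models generally outperform VLM approaches, particularly when they use richer appearance information.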
Details
Motivation: To improve visual appeal and communication efficiency in advertisement and web page layout design, the study explores effective methods for automated text box placement.
Method: Compares a standard Transformer model, a small vision-language model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images on the text box placement task using the Crello dataset.
Result: The standard Transformer-based models generally outperform the VLM approaches, especially when incorporating more appearance information; all methods struggle with very small text or densely populated layouts.
Conclusion: Task-specific architectures have the advantage in automated layout design; future work should target small text and complex layouts.
Abstract: In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small Vision and Language Model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show the standard Transformer-based models generally outperform VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.
[115] TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration
Heming Wu,Di Wang,Tai Ma,Peng Zhao,Yubin Xiao,Zhongke Wu,Xing-Ce Wang,Chuang Li,Xuan Wu,You Zhou
Main category: cs.CV
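TL;DR: Proposes TCIP, a pyramid network built on a Feature-Enhanced Residual Module (FERM) and a dual-stage Threshold-Controlled Iterative (TCI) strategy, to improve the accuracy and adaptivity of deformable medical image registration; it outperforms state-of-the-art methods on several public datasets.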
Details
Motivation: Existing pyramid networks for medical image registration tend to accumulate anatomical structure misalignments and lack a mechanism for adapting the number of optimization iterations to the deformation needs of each image, which lowers registration accuracy.
Method: Designs FERM as the core component of each decoding layer, with three sequential blocks that extract anatomical semantic features, suppress irrelevant features, and estimate the final deformation field; proposes the dual-stage TCI strategy, whose first stage assesses registration stability and whose second stage evaluates convergence, to determine the number of iterations adaptively.
Result: Experiments on three public brain MRI datasets and one abdomen CT dataset show that TCIP is more accurate than state-of-the-art methods while keeping comparable inference speed and a compact parameter count; FERM and TCI generalize well, can be integrated into other registration networks, and are validated by ablation studies.
Conclusion: TCIP mitigates the accumulation of anatomical misalignments through FERM and controls iterations adaptively through TCI, significantly improving the accuracy and robustness of medical image registration with good practicality and generalization.
Abstract: Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrade registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability; once stability is established, it proceeds to the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.
[116] Controllable Video Synthesis via Variational Inference
Haoyi Duan,Yunzhi Zhang,Yilun Du,Jiajun Wu
Main category: cs.CV
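TL;DR: Proposes a highly controllable video synthesis method that uses variational inference and an ensemble of multiple generative models to achieve precise control over specified elements while preserving diversity for under-specified ones.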
Details
Motivation: Existing video generative models are typically trained for fixed input formats and cannot easily meet user needs for controls of varying granularity (such as 4D object trajectories, camera paths, or coarse text prompts).
Method: Casts the task as variational inference to approximate a composed distribution, using multiple video generation backbones to satisfy all constraints jointly; addresses the optimization difficulty with step-wise KL divergence minimization over an annealed sequence of distributions, and proposes a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima.
Result: Experiments suggest that the method improves on prior work in controllability, diversity, and 3D consistency.
Conclusion: The method balances fine-grained control and content diversity in video generation and suits the mixed control needs of complex video workflows.
Abstract: Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
[117] Hybrid CNN-BYOL Approach for Fault Detection in Induction Motors Using Thermal Images
Tangin Amir Smrity,MD Zahin Muntaqim Hasan Muhammad Kafi,Abu Saleh Musa Miah,Najmul Hassan,Yuichi Okuyama,Nobuyoshi Asai,Taro Suzuki,Jungpil Shin
Main category: cs.CV
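TL;DR: Proposes a hybrid method combining BYOL with CNNs for thermal-image-based fault classification of induction motors, and designs BYOL-IMNet, a lightweight high-performance model that outperforms existing models in both accuracy and inference speed.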
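For readers unfamiliar with BYOL, the two ingredients of the generic method can be written in a few lines: a loss that compares the L2-normalized online prediction with the target projection, and an exponential moving average (EMA) update of the target network. This is a sketch of standard BYOL on plain Python lists, not of BYOL-IMNet; the function names and the `tau` value used in the example are illustrative assumptions.

```python
import math


def byol_loss(online_pred, target_proj):
    """BYOL regression loss: 2 - 2 * cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(online_pred, target_proj))
    norm_online = math.sqrt(sum(a * a for a in online_pred))
    norm_target = math.sqrt(sum(b * b for b in target_proj))
    return 2.0 - 2.0 * dot / (norm_online * norm_target)


def ema_update(target_params, online_params, tau):
    """Move target parameters toward online parameters: tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


# Identical directions give zero loss; orthogonal directions give 2.
loss_same = byol_loss([1.0, 0.0], [1.0, 0.0])
loss_orth = byol_loss([1.0, 0.0], [0.0, 1.0])
new_target = ema_update([1.0, 1.0], [0.0, 0.0], tau=0.9)  # illustrative tau
```

Because the target network is updated only through the EMA and receives no gradient, the online network can learn from unlabeled thermal images before a classifier is fine-tuned on the labeled fault classes.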
Details
Motivation: Induction motors are prone to faults that cause overheating, increased energy consumption, and service interruptions; efficient early fault detection is needed to keep them running safely and extend their lifespan.
Method: Uses the self-supervised method BYOL to pretrain several CNN models (such as ResNet-50 and DenseNet) and proposes a dedicated lightweight network, BYOL-IMNet, for fault classification from thermal images.
Result: BYOL-IMNet reaches 99.89% test accuracy with an inference time of only 5.7 ms per image, outperforming existing models.
Conclusion: The CNN-BYOL hybrid method is accurate and efficient for induction motor fault detection and has potential for online industrial monitoring.
Abstract: Induction motors (IMs) are indispensable in industrial and daily life, but they are susceptible to various faults that can lead to overheating, wasted energy consumption, and service failure. Early detection of faults is essential to protect the motor and prolong its lifespan. This paper presents a hybrid method that integrates BYOL with CNNs for classifying thermal images of induction motors for fault detection. The thermal dataset used in this work includes different operating states of the motor, such as normal operation, overload, and faults. We employed multiple deep learning (DL) models for the BYOL technique, including popular architectures such as ResNet-50, DenseNet-121, DenseNet-169, EfficientNetB0, VGG16, and MobileNetV2. Additionally, we introduced a new high-performance yet lightweight CNN model named BYOL-IMNet, which comprises four custom-designed blocks tailored for fault classification in thermal images. Our experimental results demonstrate that the proposed BYOL-IMNet achieves 99.89% test accuracy and an inference time of 5.7 ms per image, outperforming state-of-the-art models. This study highlights the promising performance of the CNN-BYOL hybrid method in enhancing accuracy for detecting faults in induction motors, offering a robust methodology for online monitoring in industrial settings.
[118] Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision
Xiaoxu Ma,Runhao Li,Zhenyu Weng
Main category: cs.CV
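TL;DR: Proposes MLH (Mutual Learning for Hashing), a new framework that improves deep hashing with a weak-to-strong two-branch structure, combines the strengths of center-based and pairwise-based methods, and introduces a mixture-of-hash-experts module for cross-branch interaction; it outperforms existing methods on multiple benchmark datasets.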
Details
Motivation: Center-based hashing methods model global structure well but often neglect important local similarity information, whereas pairwise-based methods preserve local similarity relationships effectively. How to combine the strengths of both is the key question.
Method: Designs a two-branch architecture with a strong center-based branch and a weaker pairwise-based branch, and transfers the pairwise branch's local similarity knowledge to the center-based branch through iterative mutual learning; introduces a mixture-of-hash-experts module, inspired by mixture-of-experts, to promote cross-branch interaction.
Result: Experiments on multiple standard datasets show that MLH consistently outperforms state-of-the-art hashing methods.
Conclusion: MLH combines the advantages of center-based and pairwise-based hashing and, through mutual learning and the mixture-of-experts structure, improves hash representation learning for large-scale image retrieval.
Abstract: Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.
[119] RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning
Zipeng Guo,Lichen Ma,Xiaolong Fu,Gaojing Zhou,Lan Yang,Yuchen Zhou,Linkai Liu,Yu He,Ximan Liu,Shiping Dong,Jingling Fu,Zhen Chen,Yu Shi,Junshi Huang,Jason Li,Chao Gou
Main category: cs.CV
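TL;DR: Proposes Repainter, a reinforcement-learning-based image inpainting framework that combines spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO) to remove watermarks and promotional text from e-commerce images and improve visual quality.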
Details
Motivation: Watermarks and promotional text in e-commerce images hurt user experience and advertising effectiveness, and existing diffusion models suffer from unreliable object removal and poor domain adaptation in practice. Method: The Repainter framework uses reinforcement learning that combines spatial-matting trajectory refinement with the GRPO algorithm, modulates attention to strengthen background context modeling, and designs a composite reward that balances global, local, and semantic constraints. Result: Experiments on the self-built large-scale e-commerce dataset EcomPaint-100K and the benchmark EcomPaint-Bench show that Repainter significantly outperforms state-of-the-art methods, especially in complex scenes. Conclusion: Repainter offers an efficient and reliable solution for e-commerce image inpainting with strong practicality and generalization; the code and model weights are to be released. Abstract: In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose Repainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that Repainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.
[120] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
Wenyue Chen,Peng Li,Wangguandong Zheng,Chengfeng Zhao,Mengfei Li,Yaolong Zhu,Zhiyang Dou,Ronggang Wang,Yuan Liu
Main category: cs.CV
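TL;DR: Proposes SyncHuman, a new framework that combines a 2D multiview generative model with a 3D native generative model to reconstruct high-quality clothed 3D humans from a single image, performing especially well under complex poses.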
Details
Motivation: Existing methods rely on SMPL estimation and conditional generative models, but fall short in the accuracy of 3D priors, the handling of complex poses, and the reconstruction of fine details, so a more robust solution is needed. Method: The SyncHuman framework jointly fine-tunes a 2D multiview generative model and a 3D native generative model through pixel-aligned 2D-3D synchronization attention, and introduces a feature injection mechanism that lifts 2D details onto the 3D shape, yielding geometrically consistent, high-fidelity reconstruction. Result: Experiments show that SyncHuman outperforms baseline methods in geometric accuracy and visual fidelity and produces structurally plausible, richly detailed, realistic 3D humans under challenging poses. Conclusion: SyncHuman combines the detail expressiveness of 2D generative models with the structural consistency of 3D generative models, offering a viable new direction for single-image 3D human reconstruction. Abstract: Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and struggle to handle difficult human poses and reconstruct fine details. In this paper, we propose SyncHuman, a novel framework that combines a 2D multiview generative model and a 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. The multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas the 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with the proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.
[121] ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
Jian Gao,Mengqi Yuan,Yifei Zeng,Chang Zeng,Zhihao Li,Zhenyu Chen,Weichao Qiu,Xiao-Xiao Long,Hao Zhu,Xun Cao,Yao Yao
Main category: cs.CV
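TL;DR: Proposes ComGS, a Gaussian Splatting based framework for 3D object-scene composition that uses Surface Octahedral Probes (SOPs) for efficient relightable object reconstruction and simplifies environment lighting estimation, enabling high-quality real-time rendering and editing.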
Details
Motivation: When existing Gaussian Splatting methods combine objects and scenes, baked lighting and shadow information causes inconsistencies; inverse rendering is also inefficient and lighting estimation is complex, which makes realistic composition difficult. Method: Surface Octahedral Probes (SOPs) store lighting and occlusion information and support efficient interpolated queries that avoid ray tracing; lighting estimation focuses on the environment lighting at the object's placement, combining a 360-degree reconstructed radiance field with a diffusion model that completes the lighting. Result: The method renders in real time at around 28 FPS with high quality, needs only 36 seconds for editing, and produces visually harmonious results with vivid shadows. Conclusion: ComGS addresses lighting consistency and efficiency in Gaussian Splatting object-scene composition, markedly improving reconstruction speed and rendering quality, and is suited to fast interactive 3D content creation. Abstract: Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360-degree reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing.
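The abstract does not spell out how the probes index directions. As a rough sketch, assuming they use the standard octahedral mapping that the name suggests (an assumption, not the authors' code), a unit direction can be flattened to a 2D coordinate in [-1, 1]^2 and recovered as follows. The function names `oct_encode` and `oct_decode` are hypothetical.

```python
import math


def _sign(v):
    # Sign helper that treats 0 as positive so the fold below stays well defined.
    return 1.0 if v >= 0.0 else -1.0


def oct_encode(d):
    """Map a non-zero 3D direction to octahedral coordinates in [-1, 1]^2."""
    x, y, z = d
    s = abs(x) + abs(y) + abs(z)  # project onto the octahedron |x|+|y|+|z| = 1
    u, v = x / s, y / s
    if z < 0.0:
        # Fold the lower hemisphere over the diagonals of the square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return u, v


def oct_decode(uv):
    """Inverse of oct_encode: recover a unit direction from (u, v)."""
    u, v = uv
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        # Undo the fold for points that belong to the lower hemisphere.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(u * u + v * v + z * z)
    return u / n, v / n, z / n
```

With such a mapping, per-direction quantities (for example lighting or occlusion) could be stored in a small 2D texture per probe and read back with ordinary bilinear interpolation, which is one plausible reading of "efficient querying via interpolation".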
Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.
[122] UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes
Yuang Meng,Xin Jin,Lina Lei,Chun-Le Guo,Chongyi Li
Main category: cs.CV
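TL;DR: Proposes UltraLED, an ultra-high dynamic range (UHDR) reconstruction method that uses a single short-exposure RAW image. A two-stage framework performs exposure correction and brightness-aware denoising, recovering detail in both bright and dark regions while avoiding ghosting and motion blur.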
Details
Motivation: Conventional RGB multi-frame bracketing is prone to misalignment and ghosting and struggles to preserve highlight and shadow detail at the same time, especially in nighttime scenes with light sources. The authors explore whether a single short-exposure RAW image is enough for high-quality UHDR imaging. Method: The two-stage UltraLED framework first performs exposure correction via a ratio map to balance the dynamic range, then applies a brightness-aware RAW-domain denoising network to improve detail recovery in dark regions. A 9-stop bracketing pipeline is designed to build a realistic UHDR dataset, with only the shortest exposure used as input. Result: Experiments show that UltraLED significantly outperforms existing single-frame UHDR methods on multiple metrics and suppresses noise and recovers detail in dynamic scenes, particularly in dark regions. Conclusion: High-quality UHDR reconstruction from a single short-exposure RAW image is feasible; UltraLED offers a new approach to robust HDR imaging in dynamic scenes and advances RAW-based low-light enhancement research. Abstract: Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. Compared with RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.
[123] DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream
Junhao He,Jiaxu Wang,Jia Li,Mingyuan Sun,Qiang Zhang,Jiahang Cao,Ziyi Zhang,Yi Gu,Jingkai Sun,Renjing Xu
Main category: cs.CV
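TL;DR: Proposes a dynamic 3D Gaussian Splatting reconstruction framework that combines low-framerate RGB images with a high-framerate event stream, using event motion priors to guide deformation-field optimization and markedly improving reconstruction in scenes with large motion.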
Details
Motivation: Large inter-frame motion in low-framerate RGB video increases the uncertainty of the solution, while event cameras capture fast motion but carry no color information, so the two modalities need to be fused effectively to improve dynamic 3D reconstruction. Method: The proposed LoCM unsupervised fine-tuning framework extracts motion priors from the event stream, and a geometry-aware data association method builds motion correspondences between events and Gaussians; both are optimized jointly together with motion decomposition and an inter-frame pseudo-label strategy. Result: The method outperforms existing image-based and event-based approaches on both synthetic and real scenes, confirming that event data can effectively help optimize dynamic 3DGS. Conclusion: The framework fuses RGB images and event streams effectively and achieves more accurate and stable dynamic 3D Gaussian Splatting reconstruction under large motion. Abstract: Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and that it can effectively optimize dynamic 3DGS with the help of event data.
[124] Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis
Ming Jie Ong,Sze Yinn Ung,Sim Kuan Goh,Jimmy Y. Zhong
Main category: cs.CV
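TL;DR: Compares UNet, ResUNet, and Attention UNet for brain tumor segmentation and uses explainable AI (XAI) techniques such as Grad-CAM and attention visualization to improve model transparency. ResUNet performs best and is recommended for automated clinical brain tumor segmentation.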
Details
Motivation: Improve the accuracy and interpretability of brain tumor segmentation in MRI images, increase physicians' trust in deep learning models, and support clinical decision-making. Method: UNet, ResUNet, and Attention UNet are applied to brain tumor segmentation on the BraTS2020 dataset and trained with the Adam optimizer; Grad-CAM and attention visualization are used for interpretability analysis, and evaluation metrics include Dice, Jaccard, accuracy, recall, and F1 score. Result: In the testing phase, ResUNet outperforms UNet and Attention UNet on Dice and Jaccard similarity scores, accuracy, recall, and F1 score; Grad-CAM reveals the tumor subregions each model attends to, and attention visualization shows how Attention UNet's attention mechanism works. Conclusion: ResUNet is the best of the three models; combining it with XAI techniques helps explain model decisions, and it is recommended for future automated clinical brain tumor segmentation. Abstract: The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians' trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores.
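For readers unfamiliar with the two overlap scores named above, the following is a minimal sketch of the standard Dice and Jaccard definitions on flattened binary masks. It illustrates the metrics only and is not the authors' evaluation code; the function name `dice_and_jaccard` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_and_jaccard(pred, target):
    """Return (Dice, Jaccard) for two equally long binary masks.

    Dice    = 2 * |P ∩ T| / (|P| + |T|)
    Jaccard = |P ∩ T| / |P ∪ T|
    """
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (assumed convention).
        return 1.0, 1.0
    return 2.0 * inter / (p_sum + t_sum), inter / union


# Example: one overlapping voxel out of two predicted and two true voxels.
dice, jaccard = dice_and_jaccard([1, 1, 0, 0], [1, 0, 1, 0])
```

The two scores are monotonically related by Jaccard = Dice / (2 - Dice), so they rank models identically on a single mask, although averages over many cases can differ.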
Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet's attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at https://github.com/ethanong98/MultiModel-XAI-Brats2020
[125] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Qinghongbing Xie,Zhaoyuan Xia,Feng Zhu,Lijun Gong,Ziyue Li,Rui Zhao,Long Zeng
Main category: cs.CV
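TL;DR: Introduces GTR-Bench, a new benchmark for evaluating how well vision-language models perform geo-temporal reasoning about moving targets in a large-scale camera network. Existing benchmarks do not combine image/video context with map graphics. Through challenges such as multiple perspective switches, joint reasoning across non-overlapping videos, and inference over unobserved regions, GTR-Bench exposes three deficiencies of current models: imbalanced use of spatial-temporal context, weak temporal forecasting, and difficulty aligning maps with multi-view video inputs. Experiments show that mainstream VLMs fall far behind human performance, and the benchmark opens new directions for spatial-temporal intelligence research.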
Details
Motivation: Existing spatial-temporal benchmarks focus mainly on egocentric reasoning or on purely map-based reasoning, and do not combine image/video context with map graphics to assess the geographic spatial-temporal intelligence of vision-language models, which matters in areas such as traffic management and emergency response. A new benchmark is needed to fill this gap. Method: GTR-Bench is a new geo-temporal reasoning benchmark whose tasks include joint reasoning over multi-view maps and videos, joint reasoning across videos with non-overlapping fields of view, and inference over unobserved spatial-temporal regions. A dataset covering real urban camera networks is built, several evaluation tasks are designed, and more than 10 mainstream vision-language models are systematically evaluated. Result: Even the best proprietary model, Gemini-2.5-Pro (34.9%), is far below human performance (78.61%). The analysis reveals three main deficiencies of current VLMs in geo-temporal reasoning: 1) imbalanced use of spatial-temporal context; 2) temporal forecasting that is weaker than spatial reasoning; 3) difficulty understanding or aligning maps with multi-view video inputs. Conclusion: GTR-Bench exposes the limitations of existing vision-language models on complex geo-temporal reasoning tasks, underlines the importance of multimodal context fusion and temporal modeling, and offers useful insights and new directions for spatial-temporal intelligence research in autonomous driving, embodied AI, and general artificial intelligence. Abstract: Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
[126] FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
Luu Tu Nguyen,Vu Tram Anh Khuong,Thi Bich Phuong Man,Thi Duyen Ngo,Thanh Ha Le
Main category: cs.CV
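TL;DR: Proposes MM-COF and FMANet for facial micro-expression recognition. The approach integrates motion information from both the onset-to-apex and apex-to-offset phases, fuses it adaptively through learnable modules, and outperforms existing methods on several standard datasets.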
Details
Motivation: Most existing micro-expression recognition methods use optical flow only from the onset-to-apex phase and ignore important dynamics in the apex-to-offset phase, which limits recognition performance. Method: Magnitude-Modulated Combined Optical Flow (MM-COF) is proposed as a comprehensive motion representation, and the end-to-end FMANet network embeds dual-phase motion analysis and magnitude modulation into learnable modules, adaptively fusing motion cues and attending to key facial regions. Result: Experiments on four standard datasets (MMEW, SMIC, CASME-II, and SAMM) show that the proposed method achieves higher micro-expression recognition accuracy than existing methods. Conclusion: Adding motion information from the apex-to-offset phase and modeling dual-phase optical flow in a learnable way improves micro-expression recognition, supporting the potential of a learnable dual-phase framework. Abstract: Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperform existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
[127] An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images
Kanglin Ning,Ruzhao Chen,Penghong Wang,Xingtao Wang,Ruiqin Xiong,Xiaopeng Fan
Main category: cs.CV
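TL;DR: Proposes a room-geometry-constrained depth estimation framework for indoor panoramas that improves depth map quality through layout prediction and a background segmentation mechanism, significantly outperforming existing open-source methods on several datasets.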
Details
Motivation: Existing monocular panoramic depth estimation methods focus too heavily on pixel-level accuracy, which oversmooths room corners and makes them sensitive to noise, and they do not make effective use of the overall room geometry. Method: A depth estimation framework based on room geometry constraints is proposed, with a shared encoder and task-specific decoders for layout, depth, and background segmentation. Room geometry guides the generation of the background depth, and a background-segmentation-guided fusion mechanism refines the depth map. Result: Experiments on the Stanford2D3D, Matterport3D, and Structured3D datasets show that the method significantly outperforms current open-source methods in depth estimation. Conclusion: Combining room geometry constraints with multi-task learning improves the accuracy and robustness of indoor panoramic depth estimation, in particular the reconstruction of room corners and edges. Abstract: Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates this information into the depth estimation process through a background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed method significantly outperforms current open-source methods.
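The abstract says the fusion weights come from the segmentation decoder but does not give the formula. One simple reading, offered here only as an assumption, is a per-pixel convex blend in which the predicted background probability weights the geometry-derived background depth and its complement weights the coarse depth. The function name `fuse_depth` and the nested-list image layout are hypothetical.

```python
def fuse_depth(coarse, background, bg_prob):
    """Blend two depth maps per pixel (assumed convex combination).

    coarse, background: 2D lists of depth values with the same shape.
    bg_prob: 2D list of background probabilities in [0, 1], used as fusion weights.
    """
    fused = []
    for c_row, b_row, w_row in zip(coarse, background, bg_prob):
        # Where the pixel is confidently background, trust the layout-derived depth;
        # elsewhere keep the coarse depth predicted by the depth decoder.
        fused.append([w * b + (1.0 - w) * c for c, b, w in zip(c_row, b_row, w_row)])
    return fused


# Example: first pixel is background, second is foreground, third is uncertain.
fused = fuse_depth([[2.0, 4.0, 2.0]], [[3.0, 3.0, 3.0]], [[1.0, 0.0, 0.5]])
```

Under this reading, walls, floor, and ceiling inherit the sharp corners of the predicted layout, while furniture and other foreground objects keep the decoder's depth.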
Our code is available at https://github.com/emiyaning/RGCNet.
[128] Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation
Shohei Enomoto
Main category: cs.CV
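TL;DR: Proposes ACAVP, which increases the expressive power of visual prompting by adding affine, color, and additive visual prompts, and uses TrivialAugment to mitigate overfitting, yielding a significant performance gain.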
Details
Motivation: Conventional visual prompting methods have limited expressive power and overfit easily, so their accuracy is lower than that of other adaptation methods. Method: ACAVP combines affine, color, and additive transformations and uses TrivialAugment for data augmentation. Result: Experiments on twelve image classification datasets show that ACAVP reaches state-of-the-art accuracy among VP methods, exceeds linear probing in average accuracy, and is more robust to distribution shifts. Conclusion: ACAVP improves the expressive power and generalization of visual prompting while keeping inference-time computational overhead very low. Abstract: Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP's expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference.
[129] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
Kaen Kogashi,Anoop Cherian,Meng-Yu Jennifer Kuo
Main category: cs.CV
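TL;DR: Introduces the MMHOI dataset and the MMHOI-Net model for 3D scene understanding of multi-human multi-object interactions, achieving state-of-the-art performance.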
Details
Motivation: Existing 3D human-object interaction (HOI) benchmarks cover only a fraction of complex interactions and do not comprehensively model the causal, goal-oriented, or cooperative behavior of multiple people and objects in the real world. Method: A large-scale MMHOI dataset covering 12 everyday scenarios is built with complete 3D shape and pose annotations for people and objects, and MMHOI-Net, an end-to-end transformer-based network, is proposed; it models objects and their interactions with a structured dual-patch representation and uses action recognition to improve interaction prediction. Result: Experiments on the MMHOI and CORE4D datasets show that the method reaches the state of the art in multi-HOI modeling, with both high accuracy and high reconstruction quality. Conclusion: MMHOI provides a comprehensive testbed for next-generation HOI research, and MMHOI-Net improves the understanding of complex interactions through its representation learning framework. Abstract: Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI, a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.
[130] PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting
Houqiang Zhong,Zhenglong Wu,Sihua Fu,Zihan Zheng,Xin Jin,Xiaoyun Zhang,Li Song,Qiang Hu
Main category: cs.CV
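TL;DR: Proposes PrismGS, a physically-grounded regularization framework that uses pyramidal multi-scale supervision and explicit size regularization to address the aliasing and optimization instability of 3D Gaussian Splatting when rendering large urban environments at high resolution.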
Details
Motivation: When 3D Gaussian Splatting is scaled to large urban environments it shows severe aliasing and optimization instability, most visibly at 4K rendering, and existing divide-and-conquer methods do not close this fidelity gap. Method: The PrismGS framework contains two synergistic regularizers: pyramidal multi-scale supervision, which enforces cross-scale consistency against a pre-filtered image pyramid, and a physically-grounded explicit size regularization, which bounds the minimum size of the 3D Gaussians to prevent degeneration. Result: Experiments on the MatrixCity, Mill-19, and UrbanScene3D datasets show that PrismGS improves PSNR by about 1.5 dB over CityGaussian and remains high-quality and robust at 4K rendering. Conclusion: PrismGS improves the stability and visual quality of 3D Gaussian Splatting for high-resolution rendering of large scenes, and it is plug-and-play and compatible with existing pipelines. Abstract: 3D Gaussian Splatting (3DGS) has recently enabled real-time photorealistic rendering in compact scenes, but scaling to large urban environments introduces severe aliasing artifacts and optimization instability, especially under high-resolution (e.g., 4K) rendering. These artifacts, manifesting as flickering textures and jagged edges, arise from the mismatch between Gaussian primitives and the multi-scale nature of urban geometry. While existing "divide-and-conquer" pipelines address scalability, they fail to resolve this fidelity gap. In this paper, we propose PrismGS, a physically-grounded regularization framework that improves the intrinsic rendering behavior of 3D Gaussians. PrismGS integrates two synergistic regularizers. The first is pyramidal multi-scale supervision, which enforces consistency by supervising the rendering against a pre-filtered image pyramid. This compels the model to learn an inherently anti-aliased representation that remains coherent across different viewing scales, directly mitigating flickering textures. This is complemented by an explicit size regularization that imposes a physically-grounded lower bound on the dimensions of the 3D Gaussians. This prevents the formation of degenerate, view-dependent primitives, leading to more stable and plausible geometric surfaces and reducing jagged edges. Our method is plug-and-play and compatible with existing pipelines.
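The two regularizers can be pictured with a small sketch. The abstract gives neither the filter, the loss, nor the form of the size bound, so everything below is an assumption made for illustration: a 2x2 box filter builds the pre-filtered pyramid, an L1 term compares per-level renders against it, and a hinge penalizes Gaussian scales below a threshold `s_min`. All function names are hypothetical.

```python
def downsample_2x(img):
    """Halve a grayscale image (2D list, even side lengths) with a 2x2 box filter."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]


def build_pyramid(img, levels):
    """Pre-filtered image pyramid: level 0 is the input, each next level is half size."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


def mean_abs_diff(a, b):
    """Mean absolute difference between two images of the same shape."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def multiscale_l1(renders, target_pyramid):
    """Average L1 error between per-level renders and the pre-filtered targets."""
    errors = [mean_abs_diff(r, t) for r, t in zip(renders, target_pyramid)]
    return sum(errors) / len(errors)


def min_size_penalty(scales, s_min):
    """Hinge penalty that is non-zero only for Gaussian scales below s_min."""
    return sum(max(0.0, s_min - s) for s in scales) / len(scales)
```

In a training loop these two terms would be added to the usual photometric loss with weighting factors, which the abstract does not specify.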
Extensive experiments on MatrixCity, Mill-19, and UrbanScene3D demonstrate that PrismGS achieves state-of-the-art performance, yielding significant PSNR gains around 1.5 dB against CityGaussian, while maintaining its superior quality and robustness under demanding 4K rendering.
[131] IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
Harsh Kavediya,Vighnesh Nayak,Bheeshm Sharma,Balamurugan Palaniappan
Main category: cs.CV
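TL;DR: Proposes IsoSignVid2Aud, an end-to-end framework that converts isolated sign language videos directly into speech without an intermediate text representation.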
Details
Motivation: Enable direct communication between people with hearing or speech impairments and others, particularly in educational applications and sign prompt interfaces. Method: The approach combines I3D feature extraction, a specialized feature transformation network, and an audio generation pipeline, and uses a modified Non-Maximal Suppression (NMS) algorithm for temporal sign detection. Result: Top-1 accuracy reaches 72.01% on ASL-Citizen-1500 and 78.67% on WLASL-100, with speech quality of PESQ 2.67 and STOI 0.73. Conclusion: The method translates non-continuous sign language video directly into speech and shows practical value and potential for real-time communication. Abstract: Sign language to spoken language audio translation is important to connect hearing- and speech-challenged people with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01% and 78.67%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.
[132] AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
Yijie Gao,Houqiang Zhong,Tianchi Zhu,Zhengxue Cheng,Qiang Hu,Li Song
Main category: cs.CV
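TL;DR: Proposes AlignGS, a new framework that uses semantic priors from 2D foundation models to directly regularize the 3D representation, jointly optimizing geometry and semantics and significantly improving sparse-view 3D reconstruction of indoor scenes.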
Details
Motivation: Existing methods suffer from geometric ambiguity when reconstructing indoor 3D scenes from sparse views, and they usually treat semantics as a passive feature that cannot guide geometry reconstruction. A method is needed in which semantics acts as an active guiding force to make reconstruction more robust. Method: The AlignGS framework extracts semantic priors from 2D foundation models and designs semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization, to jointly optimize geometry and semantics in end-to-end training. Result: Experiments on standard benchmarks show state-of-the-art results in both novel view synthesis and geometric accuracy, with more complete 3D models and more accurate geometry. Conclusion: Using semantic priors as a geometric regularizer improves 3D reconstruction from sparse inputs, supporting the view that semantics should be an active guiding signal rather than a passive annotation. Abstract: The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding should instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is available at https://github.com/MediaX-SJTU/AlignGS.
[133] Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
Thomas Lautenschlager,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Katja Nau,Gaëlle Hayot,Thomas Dickmeis,Ralf Mikut
Main category: cs.CV
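TL;DR: Explores whether self-supervised learning can automatically identify toxicant-induced changes in high-throughput toxicity testing. Using the public EmbryoNet dataset, the paper shows that the learned representations distinguish the modes of action of different compounds, and it discusses integrating machine learning models into the physical toxicity testing device of the TOXBOX project.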
Details
Motivation: High-throughput toxicity testing has to evaluate large numbers of compounds quickly and cheaply, and existing methods face challenges in automated evaluation, so more effective machine learning models are needed to improve accuracy and efficiency. Method: Self-supervised learning is used to learn representations from the EmbryoNet dataset, which are then used to distinguish the modes of action of different chemical compounds on zebrafish embryo development. Result: The representations obtained through self-supervised learning distinguish the toxicity mechanisms of different compounds and perform well on the classification of ten embryo phenotypes. Conclusion: Self-supervised learning is a promising approach for high-throughput toxicity testing and supports the integration of machine learning models into practical toxicity testing devices such as TOXBOX. Abstract: High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.
[134] XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method
Haochen Yu,Qiankun Liu,Hongyuan Liu,Jianfei Jiang,Juntao Lyu,Jiansheng Chen,Huimin Ma
Main category: cs.CV
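TL;DR: Proposes XYZCylinder, a feedforward model that uses a unified cylinder lifting method to reconstruct driving scenes efficiently under different camera configurations, showing good generalization and high accuracy in zero-shot settings.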
Details
Motivation: Existing feedforward reconstruction methods generalize poorly when the camera configuration changes, and the small overlap between sparse views, together with scene complexity, lowers reconstruction accuracy. A new method is needed that adapts to different camera configurations and reconstructs more accurately.
Method: Designs a Unified Cylinder Camera Modeling (UCCM) strategy to improve generalization, avoiding the learning of viewpoint-dependent spatial correspondences; proposes a hybrid representation with dedicated modules based on the Cylinder Plane Feature Group (CPFG) that lifts 2D image features into 3D space to improve reconstruction accuracy.
Result: Experiments show that XYZCylinder achieves state-of-the-art performance under several evaluation settings and generalizes zero-shot to unseen driving scenes.
Conclusion: By combining camera modeling with feature lifting, XYZCylinder keeps feedforward speed while markedly improving generalization across driving scenes and reconstruction accuracy, showing potential for practical use.
Abstract: Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited when reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the 360° panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose XYZCylinder, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on the newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner.
Project page: https://yuyuyu223.github.io/XYZCYlinder-projectpage/
[135] MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Peiran Wu,Zhuorui Yu,Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen
Main category: cs.CV
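TL;DR: This paper proposes MARC, a token compression method for video vision-language models. Using a retrieve-then-compress strategy, it achieves near-baseline performance with only a single frame's tokens, substantially reducing computational cost.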
Details
Motivation: Existing training-free token compression methods lose information and degrade performance when vision-language models are extended from images to videos, and high frame rates and long durations make video computationally expensive.
Method: Proposes the MARC framework with two components: (1) a Visual Memory Retriever (VMR) that retrieves key video clips, and (2) a reinforcement-learning distillation framework based on Compression Group Relative Policy Optimization (C-GRPO) that transfers the teacher model's reasoning ability to the student model. Token compression follows a retrieve-then-compress strategy.
Result: Experiments on six video benchmarks show that MARC reaches near-baseline accuracy using only one frame's tokens, cutting visual tokens by 95%, GPU memory by 72%, and latency by 23.9%.
Conclusion: MARC effectively balances efficiency and performance in video understanding and has potential for real-time use in resource-constrained settings such as video QA, surveillance, and autonomous driving.
Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens -- reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
[136] ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection
Qunyi Zhang,Songan Zhang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu
Main category: cs.CV
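TL;DR: This paper proposes ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. It addresses the lack of systematic evaluation in existing work, exposes the limitations of current methods along four key dimensions, and offers actionable directions for future research.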
Details
Motivation: Anomaly detection applications are constrained by scarce abnormal samples and costly manual annotation. Current anomaly synthesis research mostly treats synthesis as an auxiliary component, lacks systematic evaluation of synthesis algorithms, and overlooks key factors such as decoupling synthesis from detection, quantitative analysis of synthetic data, and adaptability across scenarios.
Method: Proposes ASBench, a benchmarking framework dedicated to evaluating anomaly synthesis methods, with four evaluation dimensions: (i) generalization across datasets and pipelines, (ii) the ratio of synthetic to real data, (iii) the correlation between intrinsic metrics of synthesized images and detection performance, and (iv) hybrid anomaly synthesis strategies.
Result: Extensive experiments with ASBench reveal the limitations of current anomaly synthesis methods in generalization, data efficiency, and metric correlation, and quantify the effect of different synthesis strategies.
Conclusion: ASBench provides a systematic evaluation platform for anomaly synthesis methods, exposing the shortcomings of existing methods and offering clear directions and practical insights for future research.
Abstract: Anomaly detection plays a pivotal role in manufacturing quality control, yet its application is constrained by limited abnormal samples and high manual annotation costs. While anomaly synthesis offers a promising solution, existing studies predominantly treat anomaly synthesis as an auxiliary component within anomaly detection frameworks, lacking systematic evaluation of anomaly synthesis algorithms. Current research also overlooks crucial factors specific to anomaly synthesis, such as decoupling its impact from detection, quantitative analysis of synthetic data, and adaptability across different scenarios. To address these limitations, we propose ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. Our framework introduces four critical evaluation dimensions: (i) the generalization performance across different datasets and pipelines, (ii) the ratio of synthetic to real data, (iii) the correlation between intrinsic metrics of synthesized images and anomaly detection performance metrics, and (iv) strategies for hybrid anomaly synthesis methods. Through extensive experiments, ASBench not only reveals limitations in current anomaly synthesis methods but also provides actionable insights for future research directions in anomaly synthesis.
[137] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu,Ziyang Wang,Na Zheng,Wenjie Wang,Liqiang Nie,Tat-Seng Chua
Main category: cs.CV
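TL;DR: This paper proposes TTOM, a training-free video generation framework that uses test-time optimization and a memory mechanism to align the outputs of video foundation models with spatiotemporal layouts during inference, substantially improving text-image alignment, especially in compositional scenarios.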
Details
Motivation: Existing video foundation models perform poorly on compositional tasks (such as motion, numeracy, and spatial relations) and lack an effective cross-modal alignment mechanism.
Method: Introduces Test-Time Optimization and Memorization (TTOM), which optimizes new parameters rather than intervening directly on latent representations or attention, guided by a general layout-attention objective. In streaming video generation, a parametric memory module maintains historical context and supports insert, read, update, and delete operations.
Result: On the T2V-CompBench and Vbench benchmarks, TTOM significantly improves compositional video generation and shows good scalability, efficiency, and practical potential.
Conclusion: TTOM is an effective and flexible training-free framework that achieves on-the-fly cross-modal alignment and disentangles compositional world knowledge, with strong transferability and generalization.
Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
[138] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
Tianrui Zhang,Yichen Liu,Zilin Guo,Yuxin Guo,Jingcheng Ni,Chenjing Ding,Dan Xu,Lewei Lu,Zehuan Wu
Main category: cs.CV
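TL;DR: Proposes CVD-STORM, a model based on cross-view video diffusion and a spatial-temporal reconstruction VAE that generates high-quality, multi-view, long-term videos and supports 4D reconstruction.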
Details
Motivation: Applications such as autonomous driving place growing demands on high-fidelity video generation under multiple control conditions and on extracting geometric information such as depth, yet existing methods struggle to deliver both generation quality and 3D structure modeling.
Method: First fine-tunes the VAE with an auxiliary 4D reconstruction task to strengthen its encoding of 3D structure and temporal dynamics, then integrates this VAE into the video diffusion process, and adds a jointly trained Gaussian Splatting decoder for dynamic scene reconstruction.
Result: Substantially outperforms existing methods on FID and FVD, while generating multi-view videos and providing accurate depth and geometric information.
Conclusion: CVD-STORM performs well at multi-view, long-term video generation and 4D scene reconstruction, offering a more complete solution for environment simulation and understanding in autonomous driving.
Abstract: Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
[139] A Large-scale Dataset for Robust Complex Anime Scene Text Detection
Ziyi Dong,Yurui Zhang,Changmao Li,Naomi Rue Golding,Qing Long
Main category: cs.CV
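TL;DR: This paper introduces AnimeText, a large-scale dataset for text detection in anime scenes, containing 735K images and 4.2M annotated text blocks, with hierarchical annotations and hard negative samples tailored to anime. Experiments show that models trained on it outperform models trained on existing datasets in anime text detection.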
Details
Motivation: Existing text detection datasets mainly target natural or document scenes, where text has regular styles, monotonous colors, and orderly layouts. Text in anime scenes is diverse in style, irregularly arranged, and easily confused with symbols and decorative patterns, so existing models perform poorly. A text detection dataset built specifically for anime scenes is therefore needed.
Method: Builds AnimeText, a large-scale dataset of 735K images and 4.2M annotated text blocks, with a hierarchical annotation scheme and hard negative samples designed for the complexity and diversity of anime text. Cross-dataset benchmarking with state-of-the-art text detection methods evaluates model performance on anime scenes.
Result: Models trained on AnimeText significantly outperform models trained on existing datasets in anime text detection, confirming that the dataset improves the robustness of anime scene text detection.
Conclusion: AnimeText is a high-quality text detection dataset designed for anime scenes that improves text detection in complex anime environments and fills a data gap in this area.
Abstract: Current text detection datasets primarily target natural or document scenes, where text typically appears in regular fonts and shapes, monotonous colors, and orderly layouts. The text is usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scenes also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText
[140] SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation
Yifang Yin,Shengkai Chen,Yiyao Li,Lu Wang,Ruibing Jin,Wei Cui,Shili Xiang
Main category: cs.CV
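TL;DR: Proposes the SimCast and CasCast frameworks, which improve precipitation nowcasting accuracy through short-to-long term knowledge distillation and a weighted MSE loss, significantly outperforming existing methods on several datasets.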
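The results for this entry are reported as mean CSI (critical success index) scores. As a reading aid, here is a minimal sketch of how CSI is conventionally computed from thresholded forecast and observation values, CSI = hits / (hits + misses + false alarms). The function names, the example thresholds, and the choice to average over thresholds are illustrative assumptions; the source does not state the thresholds or averaging protocol used for SEVIR, HKO-7, or MeteoNet.

```python
import math


def csi(pred, obs, threshold):
    """Critical success index for one intensity threshold.

    pred and obs are equal-length sequences of intensities (flattened grids).
    Returns NaN when neither forecast nor observation contains an event.
    """
    hits = misses = false_alarms = 0
    for p, o in zip(pred, obs):
        p_event = p >= threshold
        o_event = o >= threshold
        if p_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif p_event:
            false_alarms += 1
    denom = hits + misses + false_alarms
    return hits / denom if denom else math.nan


def mean_csi(pred, obs, thresholds):
    """Average CSI over several thresholds (hypothetical averaging scheme)."""
    return sum(csi(pred, obs, t) for t in thresholds) / len(thresholds)


# Toy example with made-up intensities, not data from the paper.
forecast = [0, 5, 10, 20]
observed = [0, 10, 10, 0]
print(csi(forecast, observed, threshold=8))
```

Because correct negatives do not enter the formula, CSI is not inflated by the large rain-free areas that dominate most radar frames.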
Details
Motivation: To improve how existing non-autoregressive precipitation nowcasting models perform across prediction horizons, and to address the blurriness and distribution shift of deterministic model outputs.
Method: Proposes the SimCast training pipeline, which combines short-to-long term knowledge distillation with a weighted MSE loss, and further embeds it in CasCast, a diffusion-based framework that adds probabilistic modeling.
Result: Achieves mean CSI scores of 0.452, 0.474, and 0.361 on the SEVIR, HKO-7, and MeteoNet benchmark datasets respectively, significantly outperforming existing methods.
Conclusion: SimCast and CasCast improve the accuracy and reliability of precipitation nowcasting while retaining inference efficiency.
Abstract: Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complement to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths of probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
[141] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement
Yidi Liu,Xueyang Fu,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Zheng-Jun Zha
Main category: cs.CV
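TL;DR: Proposes Latent Harmony, a two-stage framework that improves VAE-based ultra-high-definition image restoration, raising the quality of high-frequency detail reconstruction while preserving computational efficiency.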
Details
Motivation: Conventional VAEs discard degradation-specific high-frequency information because of their Gaussian constraint, which lowers reconstruction fidelity in UHD image restoration.
Method: Stage one introduces LH-VAE, which combines visual semantic constraints, progressive degradation perturbations, and latent equivariance to strengthen semantic robustness and high-frequency reconstruction. Stage two jointly trains the VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA), comprising a fidelity-oriented encoder LoRA and a perception-oriented decoder LoRA, with alternating optimization and selective gradient propagation to preserve the pretrained structure.
Result: Achieves state-of-the-art performance on both UHD and standard-resolution tasks, balancing efficiency, perceptual quality, and reconstruction accuracy.
Conclusion: By jointly regularizing the latent space and enforcing high-frequency-aware reconstruction, Latent Harmony substantially improves VAE performance in UHD image restoration.
Abstract: Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction. In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction. Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure. At inference, a tunable parameter α enables flexible fidelity-perception trade-offs. Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
[142] The impact of abstract and object tags on image privacy classification
Darya Baranouskaya,Andrea Cavallaro
Main category: cs.CV
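TL;DR: This paper examines how effective abstract tags and object tags are for image privacy classification, finding that abstract tags are more effective when the number of tags is limited, while object tags are equally useful when more tags are available.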
Details
Motivation: To study the role of different tag types (object tags versus abstract tags) in the context-dependent and subjective task of image privacy, with the aim of improving privacy classification accuracy.
Method: Compares the performance of object tags and abstract tags for image privacy classification under different tag budgets.
Result: When the number of tags is limited, abstract tags are more effective than object tags; with a larger number of tags, object tags perform on par with abstract tags.
Conclusion: Tag type and quantity significantly affect image privacy classification performance, and this finding can guide the design of more accurate privacy classifiers.
Abstract: Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.
[143] Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN
Chandresh Sutariya,Nitin Singh
Main category: cs.CV
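TL;DR: This paper compares the Transformer model SwinIR with a lightweight CNN on low-light image enhancement in terms of performance and efficiency. Although SwinIR reaches a higher peak, the CNN achieves near-state-of-the-art results with far fewer training epochs and a much smaller model, showing its advantage in resource-constrained settings.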
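The comparison in this entry is stated in PSNR (39.03 dB for SwinIR versus 37.4 dB for the CNN, per the abstract). As a rough aid to reading those figures, the sketch below computes PSNR from pixel values using the standard definition, PSNR = 10 log10(peak^2 / MSE), and converts a PSNR gap into the implied ratio of mean squared errors. The function names and the `peak=1.0` default (images scaled to [0, 1]) are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db):
    """Factor by which MSE grows when PSNR drops by gap_db decibels."""
    return 10.0 ** (gap_db / 10.0)


# The reported 39.03 dB vs 37.4 dB gap, expressed as an MSE ratio.
print(mse_ratio_from_psnr_gap(39.03 - 37.4))
```

Under this definition, the 1.63 dB gap corresponds to the CNN's mean squared error being roughly 1.46 times that of SwinIR, assuming both scores were measured on the same test images with the same peak value.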
Details
Motivation: In low-light image enhancement, restoring high-frequency detail and suppressing severe noise while keeping the model efficient is a key challenge. High-performing models such as SwinIR are computationally expensive, which limits practical use.
Method: Compares SwinIR with a standard lightweight CNN on the same task in terms of performance (PSNR) and training efficiency (number of epochs, model size) to assess the trade-off between performance and computational cost.
Result: The CNN reaches a PSNR of 37.4 dB after only 10 training epochs, whereas SwinIR needs 132 epochs to reach 39.03 dB; the CNN is also more than 55 times smaller than SwinIR.
Conclusion: A standard CNN can achieve near-state-of-the-art performance at significantly lower computational cost, making it a competitive option for resource-constrained real-world applications.
Abstract: The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
[144] GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network
Gaurvi Goyal,Pham Cong Thuong,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi
Main category: cs.CV
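TL;DR: This paper proposes GraphEnet, a graph neural network for human pose estimation from event cameras. It exploits the sparsity of event data and a line-based representation to estimate 2D human pose at high frequency in low-power, low-latency settings, and is the first work to apply graph neural networks to event data for human pose estimation.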
Details
Motivation: Event cameras offer low latency and low power consumption, which suits resource-constrained applications, but effective event-camera-based human pose estimation methods are lacking, so efficient models suited to event data are needed.
Method: Proposes GraphEnet, which uses a graph neural network to process the sparse event stream from an event camera, introduces a line-based intermediate representation, and combines an offset vector learning paradigm with confidence-based pooling to estimate 2D human pose.
Result: Achieves high-frequency single-person 2D human pose estimation, makes effective use of the spatiotemporal properties of event data, and performs well under low-power conditions.
Conclusion: GraphEnet is the first framework to apply graph neural networks to event camera data for human pose estimation. It demonstrates the potential of GNNs for event data processing and offers a viable pose estimation solution for resource-constrained platforms such as mobile devices and robots.
Abstract: Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise in deep learning technology, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy advantages that make them ideal for applications where those resources are constrained, like portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line-based event representation, to estimate 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence-based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.
[145] CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Weihuang Lin,Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
Main category: cs.CV
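TL;DR: This paper proposes CIR-CoT, the first end-to-end retrieval-oriented multimodal large language model, which introduces explicit Chain-of-Thought (CoT) reasoning to improve the accuracy and interpretability of composed image retrieval across image and text modalities.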
Details
Motivation: Existing composed image retrieval methods based on VLMs and MLLMs lack transparency and the ability to follow fine-grained instructions, and their reasoning is hard to explain.
Method: Designs the CIR-CoT model, which is required to generate a structured chain of thought with three stages (caption, reasoning, and conclusion), fine-tunes it on newly constructed CoT-annotated data, and finally encodes the reasoning result into a dedicated embedding for retrieval.
Result: Achieves competitive performance on datasets such as FashionIQ and CIRR, and shows strong generalization on the out-of-domain CIRCO dataset.
Conclusion: Through explicit reasoning, CIR-CoT improves both the performance and the interpretability of composed image retrieval, opening a new path toward trustworthy retrieval systems.
Abstract: Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as "black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.
[146] RayFusion: Ray Fusion Enhanced Collaborative Visual Perception
Shaohong Wang,Bin Lu,Xinyu Xiao,Hanzhi Zhong,Bowen Pang,Tong Wang,Zhiyu Xiang,Hangguan Shan,Eryun Liu
Main category: cs.CV
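TL;DR: Proposes RayFusion, a ray-based fusion method for collaborative visual perception that uses ray occupancy information from collaborators to reduce redundancy and false positives along camera rays, significantly improving 3D object detection in purely camera-based collaborative perception systems.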
Details
Motivation: Without explicit depth information, camera-based perception systems face ambiguity in depth estimation and struggle to produce accurate 3D detections, particularly in collaborative perception where sensors are limited.
Method: Proposes RayFusion, a ray-based fusion method that uses ray occupancy information supplied by collaborators to suppress redundancy and false positives along camera rays, improving perception accuracy.
Result: Experiments show that the method consistently outperforms existing state-of-the-art models, substantially improving collaborative visual perception.
Conclusion: RayFusion alleviates depth ambiguity in camera-based collaborative perception and improves the accuracy and robustness of 3D object detection through ray-level fusion.
Abstract: Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.
[147] RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans
Bheeshm Sharma,Karthikeyan Jaganathan,Balamurugan Palaniappan
Main category: cs.CV
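TL;DR: Proposes RASALoRE, a two-stage weakly supervised anomaly detection framework that combines discriminative dual prompt tuning with a region-aware spatial attention mechanism to localize brain MRI anomalies efficiently and accurately without precise pixel-level annotations.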
Details
Motivation: Detecting anomalies in brain MRI quickly and accurately when fine pixel-level annotations are unavailable and only weak labels (e.g., slice-level) exist is an important challenge.
Method: The first stage uses Discriminative Dual Prompt Tuning (DDPT) to generate high-quality pseudo weak masks as coarse localization cues. The second stage uses a segmentation network with a region-aware spatial attention mechanism based on fixed location-based random embeddings to focus on anomalous regions.
Result: Achieves state-of-the-art detection performance on the BraTS20, BraTS21, BraTS23, and MSD datasets, significantly outperforming existing methods with fewer than 8 million parameters and markedly lower computational complexity.
Conclusion: RASALoRE is an efficient, high-performing weakly supervised method for brain MRI anomaly detection that localizes anomalies precisely under low-resource conditions and shows good application prospects.
Abstract: Weakly Supervised Anomaly Detection (WSAD) in brain MRI scans is an important challenge: it enables quick and accurate detection of brain anomalies when precise pixel-level anomaly annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: https://github.com/BheeshmSharma/RASALoRE-BMVC-2025/.
[148] RetouchLLM: Training-free White-box Image Retouching
Moon Ye-Bin,Roy Miles,Tae-Hyun Oh,Ismail Elezi,Jiankang Deng
Main category: cs.CV
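TL;DR: Proposes RetouchLLM, a training-free, code-based white-box image retouching system that uses a visual critic and a code generator to perform interpretable, controllable multi-step retouching of high-resolution images.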
Details
Motivation: Existing learning-based methods depend on large-scale paired data and are black-box models, lacking transparency and adaptability to user-specific or image-specific adjustments.
Method: Builds a framework with a visual critic and a code generator: the visual critic identifies differences between the input image and reference images, and the code generator produces executable code for step-by-step retouching, with no training data required.
Result: Experiments show that the method generalizes well across retouching styles and supports natural language interaction, giving interpretable control that matches user intent.
Conclusion: RetouchLLM offers a transparent, flexible, training-free paradigm for image retouching that overcomes the limits of conventional black-box models in interpretability and adaptability.
Abstract: Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable code. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
[149] A class-driven hierarchical ResNet for classification of multispectral remote sensing images
Giulio Weikmann,Gianmarco Perantoni,Lorenzo Bruzzone
Main category: cs.CV
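TL;DR: Proposes a multitemporal, class-driven hierarchical residual network (ResNet) for classifying time series of multispectral images at different semantic levels, using additional branches and a hierarchy-penalty mechanism to improve classification consistency and fine-grained class recognition.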
Details
Motivation: To improve the classification accuracy of multispectral image time series at different semantic levels, especially for fine-grained classes (micro-classes), and to address model generalization when training samples are limited.
Method: Designs a modified hierarchical ResNet architecture with additional branches for multi-level classification, uses hierarchy-penalty maps to discourage incoherent transitions between levels, refines classification step by step from macro-classes to micro-classes, and uses class-hierarchy labels to guide the training of each layer.
Result: Validated on monthly Sentinel-2 composites of two Amazon forest areas, the method generalizes well at every level, markedly improves micro-class accuracy, and represents minority classes better.
Conclusion: The proposed modular hierarchical network models semantic hierarchy effectively, improves time series classification, can adapt to new tasks through fine-tuning, and suits remote sensing scene classification with limited samples.
Abstract: This work presents a multitemporal class-driven hierarchical Residual Neural Network (ResNet) designed for modelling the classification of Time Series (TS) of multispectral images at different semantic class levels. The architecture consists of a modification of the ResNet where we introduce additional branches to perform the classification at the different hierarchy levels and leverage hierarchy-penalty maps to discourage incoherent hierarchical transitions within the classification. In this way, we improve the discrimination capabilities of classes at different levels of semantic detail and train a modular architecture that can be used as a backbone network for introducing new specific classes and additional tasks when only limited training samples are available. We exploit the class-hierarchy labels to efficiently train the different layers of the architecture, allowing the first layers to train faster on the first levels of the hierarchy modeling general classes (i.e., the macro-classes) and the intermediate classes, while using the last ones to discriminate more specific classes (i.e., the micro-classes). In this way, the targets are constrained to follow the defined hierarchy, improving the classification of classes at the most detailed level. The proposed modular network has intrinsic adaptation capability that can be obtained through fine-tuning. The experimental results, obtained on two tiles of the Amazonian Forest on 12 monthly composites of Sentinel-2 images acquired during 2019, demonstrate the effectiveness of the hierarchical approach in both generalizing over different hierarchical levels and learning discriminant features for an accurate classification at the micro-class level on a new target area, with a better representation of the minority classes.
[150] Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces
Junyu Shi,Minghui Li,Junguo Zuo,Zhifei Yu,Yipeng Lin,Shengshan Hu,Ziqi Zhou,Yechao Zhang,Wei Wan,Yinzhe Xu,Leo Yu Zhang
Main category: cs.CV
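TL;DR: This paper introduces RedFace, a real-world-oriented deepfake face dataset containing over 60,000 forged images and 1,000 manipulated videos. It uses 9 commercial online platforms to generate deepfakes closer to those found in practice, and is used to evaluate the limitations of existing detection methods in real applications.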
Details
Motivation: 现有的深度伪造检测基准缺乏真实性、多样性和对现实世界中伪造技术的覆盖,难以有效应对社交媒体中的实际威胁。 Method: 构建了一个名为RedFace的新型数据集,使用真实人脸特征,并通过9个商用在线平台及定制算法生成深度伪造内容,模拟现实中的黑盒场景,提升数据集的真实性和多样性。 Result: 在跨域、域内及社交网络传播模拟实验中,现有检测方法在RedFace上的表现显著下降,验证了其对检测性能更具挑战性;分析表明RedFace更能反映现实世界中深度伪造的复杂性。 Conclusion: RedFace填补了学术研究与现实需求之间的差距,为更有效的深度伪造检测技术提供了更真实、更具挑战性的评估平台。 Abstract: Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation techniques.To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios.Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. 
Our dataset is available at: https://github.com/kikyou-220/RedFace.
[151] Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection
Shuhai Zhang,ZiHao Lian,Jiahao Yang,Daiyuan Li,Guoxuan Pang,Feng Liu,Bo Han,Shutao Li,Mingkui Tan
Main category: cs.CV
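TL;DR: This paper proposes NSG-VD, a physics-driven method for detecting AI-generated video. It defines a Normalized Spatiotemporal Gradient (NSG) statistic from probability flow conservation principles, estimates NSG features with a diffusion model, and detects generated video via Maximum Mean Discrepancy (MMD), clearly outperforming existing methods in Recall and F1-Score.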
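The MMD-based detection score mentioned in the summary above can be illustrated with a minimal sketch. It assumes NSG features have already been extracted as fixed-length vectors (the random arrays below are stand-ins, not real NSG features), and the Gaussian kernel, bandwidth, and function names are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    # Pairwise squared Euclidean distances between rows of x and rows of y.
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of the squared MMD between two samples.
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())

# Hypothetical stand-ins for feature sets (rows = samples, columns = features).
rng = np.random.default_rng(0)
real_ref = rng.normal(0.0, 1.0, size=(200, 8))      # reference "real" features
real_test = rng.normal(0.0, 1.0, size=(200, 8))     # test set from the same distribution
shifted_test = rng.normal(1.0, 1.0, size=(200, 8))  # test set from a shifted distribution

score_same = mmd_squared(real_ref, real_test, bandwidth=2.0)
score_shifted = mmd_squared(real_ref, shifted_test, bandwidth=2.0)
print(score_same, score_shifted)
```

A larger score indicates a larger distributional gap from the reference set; how a decision threshold is chosen is not stated in this digest entry and is left open here.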
Details
Motivation: With the rapid gains in visual realism of AI-generated video (e.g., Sora), reliable detection mechanisms are urgently needed; existing methods, however, struggle to model high-dimensional spatiotemporal dynamics and to capture subtle anomalies that violate physical laws.
Method: Proposes the Normalized Spatiotemporal Gradient (NSG) statistic, which quantifies the ratio of spatial probability gradients to temporal density changes and reflects deviation from natural video dynamics; builds an NSG estimator from pre-trained diffusion models without complex motion decomposition, and designs the MMD-based NSG-VD detection method.
Result: NSG-VD improves on the best existing methods by 16.00% in Recall and 10.75% in F1-Score; an upper bound on the NSG feature distance between real and generated videos is derived, proving that generated videos exhibit larger discrepancies due to distributional shift.
Conclusion: The physics-based NSG-VD method detects AI-generated video effectively and offers an interpretable, theoretically supported paradigm for future detection techniques.
Abstract: AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
[152] DarkHash: A Data-Free Backdoor Attack Against Deep Hashing
Ziqi Zhou,Menghao Deng,Yufei Song,Hangtao Zhang,Wei Wan,Shengshan Hu,Minghui Li,Leo Yu Zhang,Dezhong Yao
Main category: cs.CV
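TL;DR: This paper proposes DarkHash, the first data-free backdoor attack against deep hashing models. Using a shadow backdoor framework with dual-semantic guidance, it mounts an effective attack without access to the training data while preserving the original retrieval accuracy.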
Details
Motivation: Existing backdoor attacks on deep hashing models depend on access to the training data, which is hard to obtain in practice because of privacy and intellectual property restrictions; implanting a backdoor without training data while preserving the model's original performance is therefore a new challenge.
Method: Proposes DarkHash, a dual-semantic-guided shadow backdoor attack framework built on a surrogate dataset. It fine-tunes only specific layers of the victim model and introduces a topological alignment loss that pulls individual and neighboring poisoned samples toward the target sample to strengthen the attack.
Result: Experiments on four image datasets, five model architectures, and two hashing methods show that DarkHash clearly outperforms existing state-of-the-art backdoor attacks and withstands mainstream defenses.
Conclusion: DarkHash achieves an effective backdoor attack on deep hashing without the original training data, preserving retrieval accuracy on the original task while showing strong attack capability and resistance to defenses, and it opens a new research direction for deep hashing security.
Abstract: Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods.
Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.
[153] Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting
Ankit Gahlawat,Anirban Mukherjee,Dinesh Babu Jayagopi
Main category: cs.CV
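TL;DR: Proposes a label refinement pipeline based on 3D Gaussian Splatting (3DGS) that enforces multiview consistency through shared geometry and generates pose-diverse training data, substantially improving face parsing accuracy under extreme viewing angles.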
Details
Motivation: Labeled data for extreme viewing angles is scarce and manual annotation is costly, so existing face parsing methods struggle to remain accurate and robust in these cases; a scalable solution that needs no ground-truth 3D annotations is required.
Method: Jointly fits two 3DGS models, one to the RGB images and one to their initial segmentation maps, and uses the shared geometry to refine noisy labels from multiview predictions into accurate segmentation masks, producing pose-diverse training data with only minimal post-processing.
Result: Substantially improves face parsing accuracy under extreme head poses while maintaining strong performance on standard views; experiments, including human evaluations, show results superior to state-of-the-art methods.
Conclusion: The method requires no ground-truth 3D annotations and only a small set of initial images, yet effectively improves face parsing robustness in real-world settings and scales well.
Abstract: Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.
[154] Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
Eirik A. Østmo,Kristoffer K. Wickstrøm,Keyur Radiya,Michael C. Kampffmeyer,Karl Øyvind Mikalsen,Robert Jenssen
Main category: cs.CV
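TL;DR: This paper proposes Random windowing, a CT-specific augmentation technique that addresses the artifacts and poor generalization caused by conventional augmentation in medical imaging, and significantly improves liver tumor segmentation on low-contrast images.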
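As a rough illustration of the idea summarized above, the sketch below applies a randomly sampled intensity window to a CT volume stored in Hounsfield Units. This entry does not specify how the paper samples its windows, so the uniform sampling, the level and width ranges, and the function names are all assumptions made for illustration.

```python
import numpy as np

def apply_window(hu_image, level, width):
    # Clip HU values to [level - width/2, level + width/2] and rescale to [0, 1].
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = np.clip(hu_image, low, high)
    return (clipped - low) / (high - low)

def random_window(hu_image, rng, level_range=(40.0, 120.0), width_range=(150.0, 400.0)):
    # Sample a window level and width uniformly (assumed ranges, not from the paper).
    level = rng.uniform(*level_range)
    width = rng.uniform(*width_range)
    return apply_window(hu_image, level, width), (level, width)

rng = np.random.default_rng(0)
# Synthetic stand-in for a CT volume in HU; real scans would be loaded from disk.
fake_ct = rng.uniform(-1000.0, 1000.0, size=(4, 64, 64))
augmented, (level, width) = random_window(fake_ct, rng)
print(augmented.shape, level, width)
```

Unlike brightness or contrast jitter applied after normalization, the window is chosen directly in HU space, so the physical meaning of the intensities is kept until the final rescaling.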
Details
Motivation: CT intensities carry physical meaning (Hounsfield Units), so directly applying augmentations designed for natural images introduces artifacts and hurts generalization; an augmentation strategy suited to the CT modality is needed.
Method: Proposes Random windowing, a CT-specific augmentation that exploits the HU intensity distribution of CT images to make models robust to contrast variation.
Result: Ablations and analysis on multiple datasets show that the method outperforms state-of-the-art alternatives, particularly on liver tumor segmentation.
Conclusion: Random windowing is an effective CT augmentation that significantly improves deep learning segmentation on images with poor contrast or timing, with good potential for clinical use.
Abstract: Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.
[155] Real-Time Motion-Controllable Autoregressive Video Diffusion
Kesen Zhao,Jiaxin Shi,Beier Zhu,Junbao Zhou,Xiaolong Shen,Yuan Zhou,Qianru Sun,Hanwang Zhang
Main category: cs.CV
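TL;DR: Proposes AR-Drag, the first RL-enhanced few-step autoregressive video diffusion model, enabling low-latency, high-fidelity real-time image-to-video generation with diverse motion control.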
Details
Motivation: Existing autoregressive video diffusion models suffer from quality degradation and motion artifacts in few-step generation and lack effective motion control, so they fall short of real-time requirements.
Method: First fine-tunes a base I2V model to support basic motion control, then improves it further through reinforcement learning with a trajectory-based reward model; a Self-Rollout mechanism preserves the Markov property, and stochasticity is introduced selectively in denoising steps to accelerate training.
Result: With only 1.3B parameters, AR-Drag significantly reduces latency and outperforms state-of-the-art motion-controllable VDMs while achieving high visual fidelity and precise motion alignment.
Conclusion: AR-Drag provides an efficient, high-quality solution for real-time controllable video generation and advances few-step autoregressive video generation.
Abstract: Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
[156] Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
Chengzhi Li,Heyan Huang,Ping Jian,Zhen Yang,Yaning Tian
Main category: cs.CV
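TL;DR: This paper proposes Temporally Conditioned Attention Sharpening (TCAS), an attention enhancement method that improves the temporal logic consistency of video-language models (Video-LLMs) when answering rephrased questions; interpretability analysis confirms that it improves the temporal discriminability of cross-modal attention heads.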
Details
Motivation: Large language models often produce self-contradictory answers. In video-language models in particular, responses to rephrased questions lack logical consistency, and the root cause remains unclear, so the mechanism behind this behavior needs to be investigated and addressed.
Method: Takes an interpretability-driven approach to statistically analyze the potential factors behind the inconsistency, and proposes an attention enhancement method, Temporally Conditioned Attention Sharpening (TCAS), which builds an enhancement objective from attention distinctions to improve the model's temporal resolution capability.
Result: The method significantly improves the temporal logic consistency of Video-LLMs; interpretability analysis confirms that it strengthens the ability of attention heads to distinguish video tokens at different timestamps, and it also improves performance on general video temporal grounding tasks.
Conclusion: Temporal logic consistency is a key bottleneck in video temporal understanding; enhancing attention with TCAS mitigates the problem and advances the temporal understanding of video-language models.
Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding.
By enhancing consistency, our method drives significant progress in video temporal understanding.
[157] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du,Menghan Xia,Chang Liu,Quande Liu,Xintao Wang,Pengfei Wan,Xiangyang Ji
Main category: cs.CV
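TL;DR: This paper proposes UniMMVSR, the first unified generative video super-resolution framework that combines multiple condition modalities, including text, images, and videos. By exploring condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model, it significantly improves detail quality and fidelity under multi-modal conditions and shows that it can be combined with a base model to generate 4K video.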
Details
Motivation: Existing cascaded video super-resolution methods are largely confined to text-to-video tasks and do not exploit generative conditions beyond text, which limits fidelity in multi-modal video generation. A unified framework that fuses multiple condition modalities is needed to improve generation quality.
Method: Proposes the UniMMVSR framework and systematically explores injection strategies, training schemes, and data mixture techniques for multi-modal conditions (text, images, videos) within a latent video diffusion model; designs data construction and condition utilization methods tailored to each condition type so that the model captures each modality's correlation with the target video.
Result: Experiments show that UniMMVSR clearly outperforms existing methods in video detail and in conformity to multi-modal conditions, and that it can be combined with a base model to generate 4K video, validating multi-modal guided high-resolution video generation.
Conclusion: UniMMVSR is the first unified generative video super-resolution framework to support multi-modal conditions; it addresses the difficulty of fusing multi-modal information and advances high-resolution, high-fidelity multi-modal video generation.
Abstract: Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
[158] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing
Zhentao Zou,Zhengrong Yue,Kunpeng Du,Binlei Bao,Hanting Li,Haizhen Xie,Guozheng Xu,Yue Zhou,Yali Wang,Jie Hu,Xue Jiang,Xinghao Chen
Main category: cs.CV
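TL;DR: This paper proposes Multimodal Reasoning Edit (MURE), a framework that performs image editing through reasoning chains that interleave text and visual cues, improving the handling of complex object interactions and fine-grained spatial relationships.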
Details
Motivation: Existing natural-language image editing methods lack an explicit reasoning process and therefore struggle with intricate object intersections and fine-grained spatial relationships; purely textual or coordinate-augmented chains of thought cannot express complex visual layouts.
Method: Proposes the MURE framework, which reasons step by step with a natively multimodal, interleaved text-image chain of thought; each step contains a textual description and a corresponding visual cue (such as a positional mask or a representation of new content). It also introduces the Multimodal Deep Confidence (MMDC) reasoning paradigm, which prunes low-quality reasoning paths using scores from a reward model to keep the edit on a high-quality trajectory.
Result: The method decomposes complex editing tasks into interdependent sub-tasks and significantly improves performance on three image editing benchmarks; the authors also release CoT-Edit-14K, a dataset of 14K high-quality examples.
Conclusion: By combining interleaved text-visual reasoning chains with confidence-guided path selection, MURE improves the precision and fidelity of language-guided image editing.
Abstract: Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defines the intended edit regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce the Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result.
The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
[159] Robust Canonicalization through Bootstrapped Data Re-Alignment
Johann Schmidt,Sebastian Stober
Main category: cs.CV
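TL;DR: Proposes a bootstrapping algorithm that iteratively re-aligns training samples, progressively reducing variance and recovering the alignment assumption; it outperforms equivariant and canonicalization baselines on fine-grained visual classification.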
Details
Motivation: Existing methods rely on heavy data augmentation or on equivariant architectures, which brings high model complexity or limited expressivity; real datasets also do not satisfy the alignment assumption, which makes canonicalization methods brittle.
Method: Proposes a bootstrapping algorithm that iteratively re-aligns training samples, progressively reducing variance and recovering the alignment assumption; it applies to arbitrary compact groups and comes with convergence guarantees.
Result: Validated on four fine-grained visual classification benchmarks, where it consistently outperforms equivariant and canonicalization baselines and performs on par with data augmentation.
Conclusion: The method offers an efficient and robust alternative for handling geometric biases without heavy augmentation or constraints on model architecture, and suits real-world datasets that are not aligned.
Abstract: Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant and canonicalization baselines while performing on par with augmentation.
[160] InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
Haoran Yu,Yi Shi
Main category: cs.CV
TL;DR: Proposes InstructUDrag, a diffusion-based framework that combines text instructions with object dragging for flexible, high-fidelity image editing.
Details
Motivation: Existing text-based editing methods struggle with precise object positioning, while object dragging methods are limited to static relocation and lack semantic control.
Method: Treats object dragging as an image reconstruction process with a two-branch design: the moving-reconstruction branch uses energy-based gradient guidance to position objects precisely, and the text-driven editing branch shares gradient signals to allow fine-grained control over attributes; DDPM inversion and prior information injected into noise maps preserve object structure.
Result: Experiments show that InstructUDrag achieves both accurate object relocation and semantically consistent text-based editing, markedly improving the flexibility and fidelity of image editing.
Conclusion: InstructUDrag effectively fuses text instructions with object dragging, achieving precise positioning and semantic control while preserving object structure, and advances the use of diffusion models in interactive image editing.
Abstract: Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
[161] Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction
Mu Li,Yin Wang,Zhiying Leng,Jiapeng Liu,Frederick W. B. Li,Xiaohui Liang
Main category: cs.CV
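TL;DR: Proposes FineDual, a fine-grained dual-human motion generation method whose three-stage model captures the hierarchical and dynamic nature of human interaction from the individual to the inter-individual level.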
Details
Motivation: Most existing methods ignore distance and hierarchy and therefore cannot adequately model the dynamic, hierarchical nature of dual-human interaction.
Method: Uses three stages: the Self-Learning Stage splits the text with a large language model and aligns features at the individual level; the Adaptive Adjustment Stage models inter-individual interaction dynamically with an interaction distance predictor and an interaction-aware graph network; the Teacher-Guided Refinement Stage refines motion generation using overall text features.
Result: Experiments on dual-human motion datasets show that FineDual outperforms existing methods in both quantitative and qualitative evaluations and generates high-quality, fine-grained dual-human interaction motion.
Conclusion: By modeling dynamic hierarchical interaction, FineDual produces more natural and precise interaction in dual-human motion generation.
Abstract: Human interaction is inherently dynamic and hierarchical: the dynamics refer to how motion changes with distance, and the hierarchy runs from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, yet existing methods mostly model human interaction in a temporally invariant way, ignoring distance and hierarchy. To address this, we propose a fine-grained dual-human motion generation method, namely FineDual, a tri-stage method to model the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.
[162] Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification
Chenying Liu,Gianmarco Perantoni,Lorenzo Bruzzone,Xiao Xiang Zhu
Main category: cs.CV
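TL;DR: This paper proposes AdaGC, a new single-positive multi-label learning (SPML) framework for remote sensing imagery. It generates robust pseudo-labels through adaptive gradient calibration, Mixup, and a dual exponential moving average module, and achieves state-of-the-art performance on two benchmark datasets.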
Details
Motivation: Annotating remote sensing images is complex and costly, so complete labels are hard to obtain and multi-label classification must be learned from a single positive label per image; SPML has received little study in remote sensing, and effective, robust methods are needed.
Method: Proposes Adaptive Gradient Calibration (AdaGC), which combines a gradient calibration mechanism, Mixup augmentation, and a dual EMA module to generate pseudo-labels, and designs an adaptive trigger that starts GC after warm-up based on training dynamics, avoiding overfitting and improving stability.
Result: Experiments on two remote sensing benchmark datasets under two types of label noise show that AdaGC reaches SOTA performance across settings and is highly robust.
Conclusion: AdaGC is an effective and general SPML framework that markedly improves multi-label classification of remote sensing images under weak supervision and has practical potential.
Abstract: Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.
[163] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Haipeng Liu,Yang Wang,Meng Wang
Main category: cs.CV
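TL;DR: This paper proposes NTN-Diff, a diffusion model for text-guided image inpainting that decouples semantic consistency across frequency bands, preserving unmasked regions while achieving semantic consistency between masked and unmasked regions.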
Details
Motivation: Existing methods struggle to preserve unmasked regions and achieve semantic consistency between masked and unmasked regions at the same time, mainly because the mid- and low-frequency bands, which encode different image properties, are entangled.
Method: Proposes NTN-Diff, which splits the diffusion denoising process into early and late stages and disentangles the mid- and low-frequency bands during denoising; the stable mid-frequency band guides null-text denoising of the low-frequency band, followed by text-guided denoising at the late stage, to achieve semantic consistency across regions.
Result: Experiments show that NTN-Diff outperforms current state-of-the-art diffusion models on text-guided image inpainting, better preserving unmasked regions and achieving semantic consistency.
Conclusion: With its frequency-aware denoising strategy, NTN-Diff addresses both region preservation and semantic consistency in text-guided image inpainting and markedly improves inpainting quality.
Abstract: Text-guided image inpainting aims to reconstruct masked regions according to text prompts. The longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and the inpainted masked regions. Prior work fails to address both at once, typically remedying only one of them. As we observe, this stems from the entanglement of hybrid (e.g., mid-and-low) frequency bands that encode varied image properties and exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting. It decomposes the semantic consistency across masked and unmasked regions into a consistency for each frequency band, while preserving the unmasked regions, to address both challenges together. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during denoising. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process; it also serves as guidance for the null-text denoising process that denoises the low-frequency band of the masked regions. A subsequent text-guided denoising process at the late stage then achieves semantic consistency for the mid-and-low frequency bands across masked and unmasked regions while preserving the unmasked regions.
Extensive experiments validate the superiority of NTN-Diff over state-of-the-art diffusion models for text-guided image inpainting. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
[164] A Multimodal Depth-Aware Method For Embodied Reference Understanding
Fevziye Irem Eyiokur,Dogucan Yaman,Hazım Kemal Ekenel,Alexander Waibel
Main category: cs.CV
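TL;DR: Proposes a new ERU framework that combines LLM-based data augmentation, a depth-map modality, and a depth-aware decision module, improving referent understanding from language and pointing cues in complex environments.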
Details
Motivation: Existing methods perform poorly in ambiguous scenes with multiple candidate objects and have difficulty identifying the target object accurately.
Method: Proposes a new ERU framework that jointly uses LLM-based data augmentation, a depth-map modality, and a depth-aware decision module to fuse linguistic and embodied cues robustly.
Result: Experiments on two datasets show that the method significantly outperforms existing baselines, yielding more accurate and reliable referent detection.
Conclusion: Through multimodal fusion, the proposed ERU framework performs strongly on referent understanding in complex or cluttered environments.
Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
[165] Learning Neural Exposure Fields for View Synthesis
Michael Niemeyer,Fabian Manhardt,Marie-Julie Rakotosaona,Michael Oechsle,Christina Tsalicoglou,Keisuke Tateno,Jonathan T. Barron,Federico Tombari
Main category: cs.CV
TL;DR: 本文提出了Neural Exposure Fields (NExF),一种用于从具有强烈曝光变化的现实世界图像中鲁棒重建高质量、3D一致外观场景的新方法。
Details
Motivation: 现有神经场景表示在处理包含每张图像曝光变化(如室内外混合场景或带窗户房间)的真实数据时表现下降,因此需要更鲁棒的方法来应对高动态范围场景中的曝光问题。 Method: 提出学习一个预测每个3D点最优曝光值的神经场,并与神经场景表示联合优化;通过新的神经条件机制实现曝光与场景表示的协同训练。 Result: 在多个真实世界数据集上实现了优于先前方法的结果,相比最佳基线性能提升超过55%,且训练速度更快。 Conclusion: NExF能够有效处理复杂光照条件下的3D重建与视图合成,无需后期处理或多曝光输入,显著提升了在高动态范围场景中的重建质量与一致性。 Abstract: Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.[166] LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation
Cilin Yan,Jingyun Wang,Guoliang Kang
Main category: cs.CV
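TL;DR: This paper proposes an effective long-range temporal context attention (LTCA) mechanism for referring video object segmentation (RVOS). Using sparse local attention and a global query, it strengthens global context modeling while remaining computationally efficient, and achieves state-of-the-art performance on several benchmarks.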
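To make the idea of stacking sparse local attention concrete, the sketch below builds a dilated window attention mask over frame indices and checks how far information can travel after several layers. It is one assumed reading of dilated window attention: the window size, the dilation schedule (1, 2, 4), and the function names are hypothetical and not taken from the paper.

```python
import numpy as np

def dilated_window_mask(num_frames, window, dilation):
    # mask[i, j] is True when frame i may attend to frame j, i.e. when
    # j = i + k * dilation for some integer k with |k| <= window.
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]
    within = np.abs(offset) <= window * dilation
    on_grid = (offset % dilation) == 0
    return within & on_grid

def receptive_field(masks):
    # Compose per-layer masks: reach[i, k] is True when information from
    # frame k can arrive at frame i after passing through all layers.
    n = masks[0].shape[0]
    reach = np.eye(n, dtype=bool)
    for m in masks:
        reach = (m.astype(int) @ reach.astype(int)) > 0
    return reach

num_frames = 16
layers = [dilated_window_mask(num_frames, window=1, dilation=d) for d in (1, 2, 4)]
reach = receptive_field(layers)
print(layers[1][8].sum(), reach[8].sum())
```

Each layer attends to only three frames per query, yet after three layers the middle frame covers fifteen of the sixteen frames; this is the sense in which sparse local attention can be stacked to approach a global view at low per-layer cost.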
Details
Motivation: When modeling long-range temporal relations between language expressions and visual content in video, existing methods struggle to balance locality and globality, and their computational complexity grows significantly with video length.
Method: Proposes the long-range temporal context attention (LTCA) mechanism: (1) stacks sparse local attention (dilated window attention) to aggregate local information progressively into a global view; (2) lets each query attend to keys randomly selected from a global pool to enhance globality; (3) designs a global query that interacts with all other queries to encode global context directly.
Result: Achieves state-of-the-art results on four referring video segmentation benchmarks, with improvements of 11.3% and 8.3% on the MeViS val^u and val sets respectively.
Conclusion: LTCA balances local and global context modeling, reduces computational cost, and significantly improves RVOS performance, with good scalability and application potential.
Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computation complexity significantly increases with the increase of video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance the locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS val^u and val datasets respectively.
[167] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Yu Huang,Zelin Peng,Changsong Wen,Xiaokang Yang,Wei Shen
Main category: cs.CV
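TL;DR: Proposes a semantic-aware learning paradigm based on cross-modal affinity transfer that uses 2D vision foundation models to improve 3D affordance segmentation.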
Details
Motivation: Existing 3D affordance segmentation methods mostly use point cloud encoders as generic feature extractors and overlook the sparsity, noise, and geometric ambiguity of 3D data, so the learned features lack clear, semantically consistent functional boundaries.
Method: Proposes Cross-Modal Affinity Transfer (CMAT), which aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity; on top of it, designs the Cross-modal Affordance Segmentation Transformer (CAST), which fuses multi-modal prompts with pretrained features to generate precise segmentation maps.
Result: Experiments on standard benchmarks show that the method achieves new state-of-the-art results for 3D affordance segmentation.
Conclusion: The proposed semantic-grounded learning paradigm effectively transfers semantic knowledge from 2D vision foundation models and significantly improves the accuracy and semantic consistency of 3D affordance segmentation.
Abstract: Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
[168] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Yushi Huang,Xingtong Ge,Ruihao Gong,Chengtao Lv,Jun Zhang
Main category: cs.CV
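TL;DR: This paper presents LinVideo, an efficient data-free post-training framework that replaces self-attention modules in video diffusion models with linear attention, achieving significant speedups while preserving generation quality.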
Details
Motivation: Video diffusion models are computationally expensive because self-attention has quadratic complexity, while using linear attention directly hurts expressiveness and generation quality, so an efficient replacement scheme that preserves performance is needed.
Method: Proposes selective transfer, which frames layer replacement as a binary classification problem and automatically and progressively converts replaceable self-attention layers to linear attention; also designs an anytime distribution matching (ADM) objective that aligns the distributions of samples along the sampling trajectory, improving the efficiency and effectiveness of the transfer.
Result: Experiments show a 1.25-2.00x speedup while preserving generation quality; the 4-step distilled model further achieves a 15.92x latency reduction with minimal loss of visual quality.
Conclusion: LinVideo effectively balances efficiency and performance in video diffusion models and offers a viable optimization path for large-scale video generation.
Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
[169] Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising
Main category: cs.CV
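TL;DR: This paper introduces DTPQA, a visual question answering benchmark focused on perception in traffic scenes, to evaluate small vision-language models at close and long range. It finds that current small VLMs fall significantly behind humans on this task and struggle in particular with tasks such as distinguishing left from right.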
Details
Motivation: To be used reliably in safety-critical applications such as automated driving, vision-language models need dependable perception at both close and long range, but how existing models perform on such tasks remains unclear.
Method: Builds DTPQA, the first visual question answering benchmark focused solely on perception questions in traffic scenes, with distance annotations and with questions that require complex reasoning excluded so that only perception is evaluated; evaluates several state-of-the-art small vision-language models on it.
Result: Despite the simplicity of the questions, the best small VLM reaches only about 60% average accuracy, well below the roughly 85% achieved by humans; models also perform poorly on specific perception tasks such as distinguishing left from right; the human sample size was small, which imposes statistical limitations.
Conclusion: Small vision-language models still have substantial room for improvement on distance-dependent perception in traffic scenes, especially at long range and for fine-grained spatial judgments, and need further work to meet the safety requirements of automated driving.
Abstract: Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations.
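To make the close-range versus long-range comparison concrete, the following is a minimal sketch of how accuracy could be broken down by annotated distance. It is not the DTPQA evaluation code: the record format, the function names, the "mid" label for the 20-30 meter gap, and the toy records are all assumptions made for illustration; only the two range boundaries come from the abstract.

```python
from collections import defaultdict

CLOSE_MAX_M = 20.0  # "close" range per the abstract: up to 20 meters
LONG_MIN_M = 30.0   # "long" range per the abstract: 30+ meters


def distance_bucket(distance_m):
    """Map an annotated object distance (meters) to a range label."""
    if distance_m <= CLOSE_MAX_M:
        return "close"
    if distance_m >= LONG_MIN_M:
        return "long"
    return "mid"  # hypothetical label for the gap between the two ranges


def accuracy_by_distance(records):
    """Compute accuracy per range from (distance_m, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for distance_m, is_correct in records:
        bucket = distance_bucket(distance_m)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up records for illustration only; these are not DTPQA results.
toy_records = [(5, True), (12, True), (18, False),
               (35, False), (40, True), (50, False)]
scores = accuracy_by_distance(toy_records)
print(scores)
```

Reporting the two ranges separately, rather than a single average, is what exposes a "shortsighted" model whose accuracy drops as objects move farther away.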
We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
[170] SPICE: Simple and Practical Image Clarification and Enhancement
Alexander Belyaev,Pierre-Alain Fayolle,Michael Cohen
Main category: cs.CV
TL;DR: Proposes a simple and efficient method for enhancing images captured in low-light and hazy conditions.
Details
Motivation: Address the enhancement of low-light and hazy images (including fog, sand dust, and underwater scenes) and improve image clarity.
Method: Constructs an image filter that simulates low-light or hazy conditions and derives approximate reverse filters to reduce distortions in the enhanced images.
Result: Experiments show that the method is competitive and often surpasses state-of-the-art techniques on extremely dark images and on hazy images.
Conclusion: The method is highly practical because of its minimal design: it can be implemented in a few lines of MATLAB code.
Abstract: We introduce a simple and efficient method to enhance and clarify images. More specifically, we deal with low-light image enhancement and clarification of hazy imagery (hazy/foggy images, images containing sand dust, and underwater images). Our method involves constructing an image filter to simulate low-light or hazy conditions and deriving approximate reverse filters to minimize distortions in the enhanced images. Experimental results show that our approach is highly competitive and often surpasses state-of-the-art techniques in handling extremely dark images and in enhancing hazy images. A key advantage of our approach lies in its simplicity: our method is implementable with just a few lines of MATLAB code.
[171] Hyperspectral data augmentation with transformer-based diffusion models
Mattia Ferrari,Lorenzo Bruzzone
Main category: cs.CV
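TL;DR: Proposes a data augmentation technique based on a guided diffusion model, combined with a lightweight transformer network and a modified weighted loss function, that improves forest classification from hyperspectral images when few labeled samples are available.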
Details
Motivation: Deep learning for hyperspectral image classification tends to overfit when labeled data are scarce, so more effective data augmentation is needed.
Method: Generates data with a guided diffusion model, uses a lightweight transformer network to model spectral-spatial features, and introduces a modified weighted loss function and an optimized cosine variance scheduler to improve training.
Result: On a 10-class forest classification task using images acquired by the PRISMA satellite, the method outperforms other data augmentation techniques in both average and weighted average accuracy, and training is stable.
Conclusion: The method makes effective use of a small number of labeled samples to improve hyperspectral image classification and addresses the training instability of generative models used for data augmentation.
Abstract: The introduction of new-generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.
[172] UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei,Quande Liu,Zixuan Ye,Qiulin Wang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhu Chen
Main category: cs.CV
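TL;DR: UniVideo is a unified multimodal framework for video generation and editing. It uses a dual-stream architecture (MLLM + MMDiT), supports multiple tasks, and generalizes across them.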
Details
Motivation: Existing unified multimodal models are largely limited to the image domain and lack unified modeling of video generation and editing.
Method: Proposes UniVideo, which combines a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation in a dual-stream design, trained jointly across multiple tasks.
Result: Matches or surpasses task-specific models in text/image-to-video generation, in-context video generation, and in-context video editing; supports task composition (e.g., editing plus style transfer) and zero-shot transfer (e.g., green-screening, material changes); can generate video from visual prompts.
Conclusion: UniVideo extends unified multimodal modeling to the video domain with good task unification and generalization, advancing multimodal video content generation.
Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis.
To foster future research, we will release our model and code.
[173] Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning
Sofia Kirsanova,Yao-Yi Chiang,Weiwei Duan
Main category: cs.CV
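TL;DR: Proposes a method that combines LayoutLMv3 and GPT-4o to automatically extract symbols and their descriptions from historical map legends, with structured JSON prompts improving performance.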
Details
Motivation: The non-standard layouts and unstructured formats of historical map legends make automatic extraction difficult, and existing methods have limited success in matching symbols to descriptions.
Method: Uses LayoutLMv3 for layout detection together with GPT-4o and in-context learning to detect and link legend items and their descriptions, producing structured output via bounding box predictions.
Result: The method outperforms the baseline, reaching 88% F1 and 85% IoU, and the experiments show how prompt design, the number of examples, and layout alignment affect performance.
Conclusion: The method supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across visual styles.
Abstract: Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.
[174] Robust Source-Free Domain Adaptation for Medical Image Segmentation based on Curriculum Learning
Ziqi Zhang,Yuexiang Li,Yawen Huang,Nanjun He,Tao Xu,Liwei Lin,Yefeng Zheng,Shaoxin Li,Feiyue Huang
Main category: cs.CV
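TL;DR: Proposes a curriculum-learning-based framework for source-free domain adaptation (LFC). Its two curricula, easy-to-hard and source-to-target, improve cross-domain adaptation without access to source data and achieve state-of-the-art results on fundus and polyp segmentation.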
Details
Motivation: Existing source-free domain adaptation methods focus mainly on refining pseudo labels for the target domain and neglect the design of the learning procedure; because a progressive learning process helps knowledge transfer, a more suitable training mechanism is needed.
Method: Proposes the learning from curriculum (LFC) framework with an easy-to-hard curriculum and a source-to-target curriculum: the former starts from easy samples and gradually increases difficulty to steer the optimization direction; the latter stabilizes adaptation and gives a smooth transfer from the source domain to the target domain.
Result: Evaluated on public cross-domain datasets for fundus segmentation and polyp segmentation, the method outperforms existing approaches and sets a new state of the art.
Conclusion: By designing a dual-curriculum learning strategy, LFC improves source-free domain adaptation and confirms the importance of progressive learning in model adaptation.
Abstract: Recent studies have uncovered a new research line, namely source-free domain adaptation, which adapts a model to target domains without using the source data. Such a setting can address concerns about the data privacy and security of medical images. However, current source-free domain adaptation frameworks mainly focus on pseudo label refinement for target data without considering the learning procedure. Indeed, a progressive learning process from source to target domain will benefit the knowledge transfer during model adaptation. To this end, we propose a curriculum-based framework, namely learning from curriculum (LFC), for source-free domain adaptation, which consists of easy-to-hard and source-to-target curricula. Concretely, the former curriculum enables the framework to start learning with 'easy' samples and gradually tune the optimization direction of model adaptation by increasing the sample difficulty. The latter stabilizes the adaptation process, which ensures smooth transfer of the model from the source domain to the target. We evaluate the proposed source-free domain adaptation approach on public cross-domain datasets for fundus segmentation and polyp segmentation. The extensive experimental results show that our framework surpasses existing approaches and achieves a new state of the art.
[175] VideoVerse: How Far is Your T2V Generator from a World Model?
Zeqing Wang,Xinyu Wei,Bairui Li,Zhen Guo,Jinrui Zhang,Hongyang Wei,Keze Wang,Lei Zhang
Main category: cs.CV
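TL;DR: This paper introduces VideoVerse, a new benchmark for evaluating text-to-video (T2V) generation models. It addresses the shortcomings of existing benchmarks for advanced T2V models, particularly their lack of coverage of event-level temporal causality and world knowledge.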
Details
Motivation: Existing T2V benchmarks cannot effectively differentiate state-of-the-art models and lack a systematic assessment of temporal causality and world knowledge, so they fall short of what building "world models" requires.
Method: Builds the VideoVerse benchmark of 300 carefully designed prompts covering 815 events; extracts event descriptions with temporal causality and rewrites them into text prompts; designs binary evaluation questions along 10 dimensions; builds a QA-based evaluation pipeline on vision-language models.
Result: A comprehensive benchmark with 793 binary evaluation questions across multiple domains, an evaluation scheme that combines dynamic and static properties, and automated evaluation aligned with human preference using modern vision-language models.
Conclusion: VideoVerse evaluates more comprehensively how well T2V models understand complex temporal causality and real-world knowledge, providing an effective tool for moving T2V models toward "world models".
Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions.
Consequently, we develop a human-preference-aligned, QA-based evaluation pipeline using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis of how far current T2V generators are from world models.
[176] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Kaiwen Zheng,Yuji Wang,Qianli Ma,Huayu Chen,Jintao Zhang,Yogesh Balaji,Jianfei Chen,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang
Main category: cs.CV
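TL;DR: This paper proposes rCM, a continuous-time consistency distillation method for large-scale image and video diffusion models. By introducing score regularization, it overcomes the limitations of sCM in fine-detail generation and substantially improves visual quality while maintaining high generation diversity.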
Details
Motivation: Although the continuous-time consistency model (sCM) is powerful for accelerating academic-scale diffusion, its use in large-scale text-to-image and video tasks is limited by infrastructure challenges in JVP computation and by the shortcomings of existing evaluation benchmarks.
Method: Develops a parallelism-compatible FlashAttention-2 JVP kernel and proposes the score-regularized continuous-time consistency model (rCM), which adds score distillation as a long-skip regularizer to improve generation quality.
Result: Validated on models with up to 14 billion parameters and on 5-second video tasks, rCM matches or surpasses the state-of-the-art DMD2 method on quality metrics with a clear advantage in diversity, and it generates high-quality samples in 1 to 4 steps without GAN tuning or extensive hyperparameter searches.
Conclusion: rCM is a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$.
These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
[177] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Andrew Lee,Ian Chuang,Dechen Gao,Kai Fukazawa,Iman Soltani
Main category: cs.CV
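TL;DR: Proposes "Gaze on the Prize", a visual reinforcement learning framework that uses a learnable foveal attention mechanism and a self-supervised signal based on return differences to improve sample efficiency and to solve tasks that conventional methods struggle to learn.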
Details
Motivation: Visual reinforcement learning agents must extract task-relevant features from high-dimensional images in which most pixels are irrelevant, which leads to inefficient exploration and unstable learning. Inspired by human visual attention, a more efficient attention mechanism is needed.
Method: Introduces a learnable foveal attention mechanism (Gaze) guided by a self-supervised signal (Prize); builds contrastive triplets from return differences and, by contrasting states with similar representations but different returns, trains the attention to focus on task-critical features.
Result: On the ManiSkill3 manipulation task suite, sample efficiency improves by up to 2.4x, and the method solves tasks that the baseline fails to learn, without modifying the underlying algorithm or hyperparameters.
Conclusion: Return-guided contrastive learning effectively identifies task-relevant features and significantly improves the sample efficiency and performance of visual reinforcement learning, with good generality and practicality.
Abstract: Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: if two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes.
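The triplet construction described above can be illustrated with a small sketch. This is not the authors' implementation: the Euclidean distance, the `sim_radius` and `return_gap` thresholds, the standard margin-based triplet loss, and the toy data are all assumptions chosen to show the idea that states with similar representations but different returns become anchor-negative pairs.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_triplets(features, returns, sim_radius=1.0, return_gap=0.5):
    """Return (anchor, positive, negative) index triplets.

    Candidates are states whose features lie within `sim_radius` of the
    anchor. A candidate is a positive if its return is within `return_gap`
    of the anchor's return, and a negative otherwise. Both thresholds are
    hypothetical parameters for this sketch.
    """
    triplets = []
    n = len(features)
    for a in range(n):
        neighbours = [j for j in range(n)
                      if j != a and l2(features[a], features[j]) <= sim_radius]
        positives = [j for j in neighbours
                     if abs(returns[j] - returns[a]) <= return_gap]
        negatives = [j for j in neighbours
                     if abs(returns[j] - returns[a]) > return_gap]
        triplets.extend((a, p, q) for p in positives for q in negatives)
    return triplets


def triplet_loss(embeddings, triplets, margin=0.2):
    """Mean margin-based triplet loss over the given index triplets."""
    if not triplets:
        return 0.0
    total = 0.0
    for a, p, q in triplets:
        gap = l2(embeddings[a], embeddings[p]) - l2(embeddings[a], embeddings[q])
        total += max(0.0, gap + margin)
    return total / len(triplets)


# Toy data for illustration only: three nearby states and one distant state.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
toy_returns = [1.0, 0.9, 0.0, 1.0]
toy_triplets = build_triplets(toy_features, toy_returns)
print(toy_triplets)
print(triplet_loss(toy_features, toy_triplets))
```

In this sketch the raw features double as embeddings; in a setup like the one the abstract describes, the loss would instead be computed on attention-weighted representations so that its gradient shapes where the gaze falls.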
Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
[178] Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction
Noor Islam S. Mohammad
Main category: cs.CV
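TL;DR: Proposes a modular framework for spatial image processing that integrates grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction; experiments indicate robustness across datasets and potential for real-time use.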
Details
Motivation: Improve structure preservation and multi-task integration in image processing to enable more efficient real-time image analysis and computer vision applications.
Method: Quantizes grayscale images with a stepwise intensity transformation, enhances color with histogram equalization in the RGB and YCrCb spaces, adjusts brightness through the HSV value channel, sharpens images with a 3 × 3 convolution kernel, builds a bidirectional transformation pipeline comprising unsharp masking, gamma correction, and noise amplification, and extracts geometric features with Canny edge detection, Hough line estimation, Harris corner detection, and morphological localization.
Result: The bidirectional transformation pipeline reached accuracies of 76.10% and 74.80% for the forward and reverse processes, respectively; the billiard cue angle was estimated at 51.50°; cue isolation reached 81.87% similarity against ground truth images.
Conclusion: The modular framework integrates multiple image processing functions efficiently while preserving structural detail, performs robustly and deterministically across datasets, and is suitable for real-time image analysis and computer vision tasks.
Abstract: This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 × 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50° for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting the framework's potential for real-time image analysis and computer vision.
[179] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Zhenlong Yuan,Xiangyan Qu,Chengxuan Qian,Rui Chen,Jing Tang,Lei Sun,Xiangxiang Chu,Dapeng Zhang,Yiwei Wang,Yujun Cai,Shuo Li
Main category: cs.CV
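TL;DR: This paper proposes Video-STAR, a framework that uses sub-motion decomposition and tool-augmented reinforcement learning to improve fine-grained discrimination and cross-modal alignment in open-vocabulary action recognition.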
Details
Motivation: Existing MLLMs struggle to distinguish semantically similar actions in open-vocabulary scenarios and are prone to cross-modal hallucination; this paper aims to improve the alignment between visual and textual reasoning.
Method: Decomposes actions into discriminative sub-motions and combines them with domain-specific tools for cross-modal interleaved reasoning; designs a hierarchical reward so that tool-augmented reinforcement learning optimizes reasoning autonomously without explicit supervision.
Result: Achieves state-of-the-art performance on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600, clearly outperforming existing methods, especially in distinguishing fine-grained actions and suppressing hallucination.
Conclusion: Video-STAR shifts from text-centric reasoning to visually grounded reasoning, offers strong robustness and generalization, and suits complex open-vocabulary action recognition tasks.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, and validating the method's robustness and generalization.
[180] The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Onur Keleş,Aslı Özyürek,Gerardo Ortega,Kadir Gökgö,Esam Ghaleb
Main category: cs.CV
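TL;DR: This paper introduces the Visual Iconicity Challenge, a new benchmark for evaluating how well vision-language models understand iconicity in sign language. It covers three tasks (phonological form prediction, transparency, and iconicity ratings) and finds that current models still fall well short of human performance.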
Details
Motivation: Iconicity (the resemblance between form and meaning) is pervasive in sign languages and offers a natural testbed for visual grounding, but existing vision-language models have difficulty recovering this mapping from dynamic human motion, so new evaluation tools are needed.
Method: Builds a video-based benchmark that uses psycholinguistic measures to evaluate 13 state-of-the-art vision-language models in zero-shot and few-shot settings on Sign Language of the Netherlands, and compares them with human baselines.
Result: Models partly succeed at phonological form prediction but are far below humans on transparency, and only the top models correlate moderately with human iconicity ratings; models with stronger phonological form prediction also judge iconicity more like humans.
Conclusion: The study validates the proposed diagnostic tasks and indicates that human-centric signals and embodied learning methods should be introduced to improve iconicity modeling and visual grounding in multimodal models.
Abstract: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
[181] InstructX: Towards Unified Visual Editing with MLLM Guidance
Chong Mou,Qichao Sun,Yanze Wu,Pengze Zhang,Xinghui Li,Fulong Ye,Songtao Zhao,Qian He
Main category: cs.CV
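TL;DR: This paper presents InstructX, a unified framework for image and video editing. Through a comprehensive study of integrating multimodal large language models (MLLMs) with diffusion models, it achieves instruction-driven editing across diverse tasks and shows that training on image data can elicit video editing capabilities without explicit supervision.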
Details
Motivation: MLLMs have made notable progress in visual understanding and reasoning, but work that combines them with diffusion models for image and video editing lacks an in-depth analysis of MLLM design choices, and integrating complex tasks such as video editing remains challenging. A unified framework is therefore needed to improve editing performance.
Method: Proposes the InstructX framework and systematically studies how to integrate MLLMs with diffusion models; uses training on image data to elicit zero-shot video editing capabilities; introduces modality-specific MLLM features to model image and video editing in a single framework.
Result: Experiments show that InstructX performs well across a broad range of image and video editing tasks and achieves state-of-the-art performance, generalizing strongly even without dedicated video training data.
Conclusion: InstructX effectively unifies image and video editing, reveals the potential of cross-modal transfer, and provides new design ideas and a practical basis for MLLM-based multimodal content editing.
Abstract: With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
[182] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
Lu Liu,Chunlei Cai,Shaocheng Shen,Jianfeng Liang,Weimin Ouyang,Tianxiao Ye,Jian Mao,Huiyu Duan,Jiangchao Yao,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
Main category: cs.CV
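TL;DR: This paper proposes MoA-VR, a mixture-of-agents video restoration system in which three coordinated agents (degradation identification, routing and restoration, and quality assessment) mimic the workflow of human experts. It handles complex and compound degradations and significantly outperforms existing methods.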
Details
Motivation: Real-world videos often suffer from multiple degradations caused by diverse acquisition and transmission conditions. Existing methods rely on manual model selection or on a single architecture and generalize poorly across degradation types, so a general video restoration system that adapts automatically to multiple degradations is needed.
Method: Proposes the MoA-VR system with three agents: a degradation identification agent based on a vision-language model, a self-adaptive router agent driven by a large language model, and a VLM-based video quality assessment agent; builds a large-scale degradation recognition benchmark and the Res-VQ dataset to support training and evaluation.
Result: Experiments show that MoA-VR significantly outperforms existing baselines in both objective metrics and perceptual quality and effectively handles diverse and compound degradations.
Conclusion: MoA-VR demonstrates the potential of multimodal intelligence and modular reasoning in general-purpose video restoration systems and points to a new direction for automated, intelligent video restoration.
Abstract: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first Mixture-of-Agents Video Restoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the Restored Video Quality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality.
These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.
[183] To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Jiayun Luo,Wan-Cyuan Fan,Lyuyang Wang,Xiangteng He,Tanzila Rahman,Purang Abolmaesumi,Leonid Sigal
Main category: cs.CV
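TL;DR: This paper identifies and studies ViT attention sinks, high-norm visual tokens in the Vision Transformer, and finds that they carry high-level semantic information from the image that is critical to the reasoning ability of vision-language models. Through qualitative and quantitative analysis, the authors show that explicitly exploiting these tokens substantially improves several LVLMs on visual reasoning tasks.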
Details
Motivation: Prior work has focused mostly on attention sinks inside the large language model and has overlooked potentially critical visual tokens in the vision encoder. This paper investigates which tokens output by the Vision Transformer matter most for understanding and reasoning, and how they propagate and act within LVLMs. Method: High-norm visual tokens in the ViT are identified as attention sinks and analyzed qualitatively and quantitatively; training-free and training-based methods are proposed to help the LLM make better use of the information in these tokens. Result: Experiments show that ViT attention sinks contain rich high-level semantic information, and that explicitly exploiting them substantially improves a range of LVLMs on visual reasoning tasks. Conclusion: ViT attention sinks play a key role in vision-language models, and current architectures generally undervalue them; using these tokens effectively strengthens the visual understanding and reasoning of LVLMs. Abstract: Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens.
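As a rough illustration of what "high-norm visual tokens" means in practice, the sketch below flags tokens whose L2 norm is far above the median norm of the sequence. The cutoff rule (a multiple of the median), the function name `find_high_norm_tokens`, and the toy embeddings are assumptions made for illustration; the abstract does not state the authors' actual selection criterion.

```python
import math
from statistics import median


def find_high_norm_tokens(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    `ratio` is a hypothetical threshold, not a value taken from the paper.
    """
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy "ViT output": four 3-dimensional tokens, one with an unusually large norm.
tokens = [
    [0.1, -0.2, 0.3],
    [0.2, 0.1, -0.1],
    [5.0, -4.0, 6.0],
    [0.0, 0.3, 0.2],
]
sink_indices = find_high_norm_tokens(tokens)
print(sink_indices)
```

A median-based cutoff is used here only because it is robust to the outliers it is trying to find; any other outlier rule would serve the same illustrative purpose.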
We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
[184] Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression
Nikolaos Stathoulopoulos,Christoforos Kanellakis,George Nikolakopoulos
Main category: cs.CV
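TL;DR: Proposes a deep compression framework based on semantic scene graphs for efficient transmission of 3D point cloud data, reducing data size by up to 98% while preserving structural and semantic fidelity.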
Details
Motivation: 3D point clouds are large and complex, which makes them hard to transmit efficiently under limited bandwidth and unstable connectivity and degrades perception in multi-agent robotic systems. Method: The point cloud is decomposed into semantically coherent patches, which semantic-aware encoders conditioned by FiLM encode into compact latent representations; a folding-based decoder then reconstructs them with structural accuracy. Result: On the SemanticKITTI and nuScenes datasets the framework achieves state-of-the-art compression rates, reducing data size by up to 98%, while supporting downstream tasks such as multi-robot pose graph optimization and map merging. Conclusion: The method compresses point clouds substantially while retaining enough structural and semantic information for efficient communication and collaborative perception in real multi-robot systems. Abstract: Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
[185] SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks
Md Kowsher,Ali O. Polat,Ehsan Mohammady Ardehaly,Mehrdad Salehi,Zia Ghiasi,Prasanth Murali,Chen Chen
Main category: cs.CV
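TL;DR: This paper proposes SliceFine, a new parameter-efficient fine-tuning method. It proves theoretically that pretrained models have a "universal winning slice" property and exploits it by updating only sub-network slices of the original weights; with no new parameters, it matches existing methods while improving training speed and memory efficiency.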
Details
Motivation: To explain why fine-tuning small, randomly selected subnetworks (slices) of a pretrained model is enough for downstream adaptation, and so provide a theoretical basis for parameter-efficient fine-tuning. Method: The paper states and proves a "universal winning slice" property of pretrained networks, arising from two phenomena, spectral balance and high task energy; building on it, SliceFine updates only selected weight slices and introduces no additional parameters. Result: SliceFine matches state-of-the-art PEFT methods on language and vision tasks while significantly improving training speed, memory efficiency, and model compactness. Conclusion: The work bridges theory and practice and offers a theoretically grounded alternative for parameter-efficient fine-tuning of large-scale models. Abstract: This paper presents a theoretical framework explaining why fine-tuning small, randomly selected subnetworks (slices) within pretrained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance: the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy: their backbone representations retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights, introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state-of-the-art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.
[186] FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Zhiyuan Zhang,Can Wang,Dongdong Chen,Jing Liao
Main category: cs.CV
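TL;DR: Proposes FlexTraj, a framework for flexible point trajectory control in image-to-video generation that supports multi-granularity, alignment-agnostic motion control.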
Details
Motivation: Existing methods depend on aligned conditions for trajectory control and offer limited flexibility, which makes complex application scenarios hard to support. Method: A unified point-based motion representation is combined with a sequence-concatenation scheme and an annealing training strategy for efficient, robust injection of trajectory conditions and training. Result: Experiments show that FlexTraj supports both dense and sparse trajectory control, with stronger controllability, faster convergence, and efficient inference, and applies to motion cloning, drag-based generation, motion interpolation, and other uses. Conclusion: FlexTraj provides flexible, robust, multi-granularity trajectory control for video generation without strict alignment requirements, extending the range of image-to-video applications. Abstract: We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.
[187] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Hongxing Li,Dingming Li,Zixuan Wang,Yuchen Yan,Hang Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
Main category: cs.CV
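TL;DR: This paper proposes a progressive approach to building spatial intelligence. By introducing SpatialLadder-26k, a multimodal dataset of 26,610 samples, and a three-stage training framework, it substantially improves vision-language models on spatial reasoning tasks, surpasses existing mainstream models, and generalizes well out of domain.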
Details
Motivation: Existing vision-language models perform poorly at spatial reasoning, mainly because they lack a hierarchical foundation from perception to understanding; learning spatial reasoning directly limits performance. Method: A standardized multimodal dataset, SpatialLadder-26k, covering a range of spatial reasoning tasks is built, together with a three-stage progressive training framework: first establish spatial perception through object localization, then develop spatial understanding through multi-dimensional spatial tasks, and finally strengthen complex reasoning with reinforcement learning with verifiable rewards. Result: The resulting 3B-parameter model, SpatialLadder, achieves state-of-the-art results on multiple spatial reasoning benchmarks, with a 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%, and a 7.2% improvement on out-of-domain benchmarks. Conclusion: Progressive training from perception to understanding to reasoning is key to robust spatial intelligence. Abstract: Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
[188] Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Rishubh Parihar,Or Patashnik,Daniil Ostashev,R. Venkatesh Babu,Daniel Cohen-Or,Kuan-Chieh Wang
Main category: cs.CV
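TL;DR: Kontinuous Kontext is an instruction-driven image editing model that adds a scalar edit-strength control, enabling continuous, fine-grained edits from subtle to strong.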
Details
Motivation: Text instructions alone make it hard to control the extent of an edit precisely, so a way to adjust edit strength continuously is needed. Method: An existing image editing model is extended with a scalar edit-strength input; a lightweight projector network maps the scalar and the edit instruction into the model's modulation space, giving explicit control over the extent of the edit. Result: Continuous control from subtle to strong is achieved across diverse edits (stylization, attribute, material, background, and shape changes) without attribute-specific training. Conclusion: Kontinuous Kontext offers a unified, flexible way to control edit strength finely and continuously under natural-language instructions. Abstract: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength, which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction-driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.
[189] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
Xiangyu Zhao,Junming Lin,Tianhao Liang,Yifan Zhou,Wenhao Chai,Yuzhe Gu,Weiyun Wang,Kai Chen,Gen Luo,Wenwei Zhang,Junchi Yan,Hua Yang,Haodong Duan,Xue Yang
Main category: cs.CV
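TL;DR: This paper proposes the MM-HELIX benchmark and the AHPO training method to improve multimodal large language models on long-chain reflective reasoning tasks.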
Details
Motivation: Existing MLLMs fall short on the long-chain reflective reasoning that complex real-world problems require, so this ability needs to be evaluated and improved. Method: The MM-HELIX benchmark and the MM-HELIX-100K dataset are built, a Step-Elicited Response Generation pipeline is introduced, and Adaptive Hybrid Policy Optimization (AHPO) is designed to combine offline supervision with online optimization. Result: Accuracy on the MM-HELIX benchmark improves by 18.6%, and average performance on general mathematics and logic tasks improves by 5.7%. Conclusion: Reflective reasoning can be improved effectively with high-quality data and the AHPO strategy, offering a path toward more capable MLLMs. Abstract: While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks.
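The idea of leaning on expert data while rewards are sparse and dropping it once the policy is proficient can be pictured with a simple gate. The sketch below is only an illustration under stated assumptions: the gating rule (compare the rollout success rate with a fixed threshold), the function name `hybrid_loss`, and all numbers are hypothetical and are not taken from the AHPO formulation, which the abstract does not spell out.

```python
def hybrid_loss(rl_loss, sft_loss, rollout_rewards, success_threshold=0.5):
    """Combine an online RL loss with an offline expert (SFT) loss.

    Assumed rule: when the fraction of rewarded rollouts is below
    `success_threshold`, rewards are treated as sparse and the expert
    loss is added; otherwise only the RL loss is used.
    """
    if rollout_rewards:
        success_rate = sum(1 for r in rollout_rewards if r > 0) / len(rollout_rewards)
    else:
        success_rate = 0.0  # no rollouts yet: fall back on expert data
    use_expert = success_rate < success_threshold
    total = rl_loss + (sft_loss if use_expert else 0.0)
    return total, use_expert


# Sparse rewards: one success in four rollouts, so the expert term is kept.
sparse = hybrid_loss(rl_loss=0.8, sft_loss=1.2, rollout_rewards=[0, 0, 0, 1])
# Proficient policy: three successes in four, so the model explores on its own.
proficient = hybrid_loss(rl_loss=0.8, sft_loss=1.2, rollout_rewards=[1, 1, 0, 1])
print(sparse, proficient)
```

A hard on/off gate is the simplest choice for a sketch; a smooth weight that decays with the success rate would be an equally plausible reading of "dynamically unifies".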
Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
[190] VideoNorms: Benchmarking Cultural Awareness of Video Language Models
Nikhil Reddy Varimalla,Yunfei Xu,Arkadiy Saakyan,Meng Fan Wang,Smaranda Muresan
Main category: cs.CV
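TL;DR: This paper presents VideoNorms, a dataset of over 1000 video clips paired with socio-cultural norms, for evaluating the cultural awareness of video large language models (VideoLLMs) in US and Chinese cultural contexts. The benchmark is built with a human-AI collaborative annotation framework, and existing models are found to perform poorly on norm violations, Chinese cultural settings, non-verbal evidence, and formal contexts. The study underscores the importance of culturally aware training.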
Details
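Motivation: As video large language models are deployed worldwide, they need to understand and be grounded in different cultural backgrounds. No benchmark currently evaluates this cultural awareness effectively, so a tool is needed to measure how well models understand cross-cultural social norms. Method: The VideoNorms benchmark contains over 1000 (video clip, norm) pairs from US and Chinese cultures, annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. It is built with a human-AI collaboration framework: a teacher model using theoretically grounded prompting produces candidate annotations, which trained human experts validate and correct. A variety of open-weight VideoLLMs are then benchmarked. Result: Experiments show that 1) models identify norm violations less well than norm adherence; 2) they understand Chinese culture less well than US culture; 3) they are weaker at providing non-verbal evidence than verbal evidence and struggle to match a speech act to its exact norm; and 4) unlike humans, they perform worse in formal, non-humorous contexts. Conclusion: Current VideoLLMs have marked deficiencies in understanding cross-cultural social norms, especially with Chinese culture, non-verbal cues, and formal settings. VideoNorms provides an effective benchmark for evaluating and improving culturally aware video language models and highlights the need to incorporate cultural context into training. Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than adherence; 2) models perform worse w.r.t. Chinese culture compared to US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal evidence for the norm adherence/violation label and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.
[191] ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation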
Guanghao Li,Kerui Ren,Linning Xu,Zhewen Zheng,Changjian Jiang,Xin Gao,Bo Dai,Jian Pu,Mulin Yu,Jiangmiao Pang
Main category: cs.CV
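TL;DR: This paper proposes ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM systems for on-the-fly 3D reconstruction from monocular image sequences.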
Details
Motivation: Existing methods trade fidelity against computational cost: per-scene optimization is accurate but slow, while feed-forward foundation models are fast but lack accuracy and robustness. Method: ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that converts multi-scale features into structured 3D Gaussians; a hierarchical Gaussian representation with a LoD-aware rendering strategy improves efficiency and fidelity. Result: Experiments on eight indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization. Conclusion: ARTDECO offers a practical path toward on-the-fly, high-fidelity, geometrically accurate digitization of real-world environments. Abstract: On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.
[192] Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
Yunzhe Xu,Yiyuan Pan,Zhe Liu
Main category: cs.CV
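TL;DR: This paper proposes Memoir, an imagination-based memory retrieval mechanism that addresses inefficient memory access and missing behavioral patterns in vision-and-language navigation. A world model generates future states as queries for hybrid retrieval over environmental observations and behavioral histories; the method improves performance on multiple benchmarks while sharply reducing training time and inference memory.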
Details
Motivation: Existing memory-persistent vision-and-language navigation methods access memory inefficiently and store only environmental observations while ignoring behavioral patterns, which limits continued improvement in navigation performance. Method: 1) A language-conditioned world model generates future navigation states, used both to encode experiences and as retrieval queries; 2) Hybrid Viewpoint-Level Memory binds environmental observations and behavioral patterns to specific viewpoints; 3) an experience-augmented navigation model integrates retrieved knowledge through specialized encoders. Result: Significant improvements are obtained in all 10 test scenarios; compared with the best baseline, SPL on IR2R rises by 5.4%, training is 8.3x faster, and inference memory drops by 74%. Analysis indicates substantial remaining headroom for the paradigm (73.3% vs. a 93.4% upper bound). Conclusion: Predictive retrieval over both environmental and behavioral memories improves navigation, and imagination-guided memory offers an efficient, scalable paradigm for memory-persistent VLN. Abstract: Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction.
The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.
[193] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Minghong Cai,Qiulin Wang,Zongli Ye,Wenze Liu,Quande Liu,Weicai Ye,Xintao Wang,Pengfei Wan,Kun Gai,Xiangyu Yue
Main category: cs.CV
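TL;DR: This paper introduces the task of arbitrary spatio-temporal video completion and, with the VideoCanvas framework, resolves the temporal ambiguity caused by causal VAEs, enabling fine-grained control over video generation.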
Details
Motivation: Existing controllable video generation tasks are fragmented and lack a unified paradigm, and modern latent video diffusion models suffer from temporal ambiguity that makes precise frame-level conditioning difficult. Method: The VideoCanvas framework uses a hybrid conditioning strategy: spatial placement is handled by zero-padding, and temporal alignment by Temporal RoPE Interpolation, which assigns each condition a continuous fractional position in the latent sequence, with no new parameters. Result: The method is validated on the newly built VideoCanvasBench benchmark; experiments show it significantly outperforms existing conditioning paradigms, with strong results in both intra-scene fidelity and inter-scene creativity. Conclusion: VideoCanvas enables unified, fine-grained control for arbitrary spatio-temporal video completion and sets a new standard for flexible, unified video generation. Abstract: We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity.
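To make "a continuous fractional position within the latent sequence" concrete, the sketch below maps a pixel-frame index to a fractional latent index and applies a standard rotary position embedding at that position. The linear mapping `frame_index / temporal_stride`, the stride of 4, and all function names are assumptions for illustration; the abstract does not give the exact mapping used by Temporal RoPE Interpolation.

```python
import math


def fractional_latent_position(frame_index, temporal_stride=4):
    """Map a pixel-frame index to a continuous latent position.

    Assumes a VAE that compresses `temporal_stride` frames per latent and a
    simple linear mapping; both are hypothetical.
    """
    return frame_index / temporal_stride


def rope_angles(position, dim, base=10000.0):
    """Standard RoPE rotation angles, evaluated at a (possibly fractional) position."""
    return [position * base ** (-2 * k / dim) for k in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by the RoPE angles for `position`."""
    out = []
    for k, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)])
    return out


# A condition placed at pixel frame 6 falls between latent positions 1 and 2.
pos = fractional_latent_position(6)
rotated = apply_rope([1.0, 0.0, 0.5, -0.5], pos)
print(pos, rotated)
```

Because RoPE is defined by continuous rotations, evaluating it at a non-integer position needs no extra parameters, which is consistent with the zero-new-parameter claim in the abstract.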
Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
[194] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Andong Deng,Taojiannan Yang,Shoubin Yu,Lincoln Spencer,Mohit Bansal,Chen Chen,Serena Yeung-Levy,Xiaohan Wang
Main category: cs.CV
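TL;DR: This paper presents SciVideoBench, a rigorous benchmark for evaluating complex video reasoning in scientific domains. It contains 1,000 multiple-choice questions from more than 25 specialized subjects and is designed to test the cognitive limits of current large multimodal models.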
Details
Motivation: Existing video benchmarks mainly target general scenarios with simple reasoning tasks and cannot effectively evaluate advanced multimodal cognition, leaving a clear gap in complex video reasoning for scientific domains. Method: SciVideoBench is a benchmark of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos, covering more than 25 specialized subjects and verified by a semi-automatic system; the questions require domain knowledge, precise spatiotemporal perception, and complex logical reasoning. Result: Evaluation of state-of-the-art proprietary and open-source large multimodal models, including Gemini 2.5 Pro and Qwen2.5-VL, shows significant performance deficits, indicating substantial room for improvement in scientific video reasoning. Conclusion: SciVideoBench can evaluate and drive progress in scientific video reasoning for large multimodal models and gives clear direction for multimodal AI that can act as a capable research assistant. Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists.
We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.
[195] MultiCOIN: Multi-Modal COntrollable Video INbetweening
Maham Tanveer,Yang Zhou,Simon Niklaus,Ali Mahdavi Amiri,Hao Zhang,Krishna Kumar Singh,Nanxuan Zhao
Main category: cs.CV
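TL;DR: This paper proposes MultiCOIN, a video inbetweening framework with multi-modal control. By combining depth transition and layering, motion trajectories, text prompts, and target regions, it achieves flexible, fine-grained video interpolation.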
Details
Motivation: Existing video inbetweening methods struggle with complex motion, offer little flexibility for user intent, and lack fine control over intermediate frames, so results diverge from creative intent. Method: A Diffusion Transformer (DiT) serves as the generative model. All motion control signals are mapped to a common sparse point-based representation, content controls and motion controls are encoded in two separate branches, two generators handle motion and content respectively, and a stage-wise training strategy is used to learn the multi-modal controls. Result: Experiments show that the method outperforms existing approaches in qualitative and quantitative evaluation and produces more dynamic, customizable, and contextually accurate visual narratives. Conclusion: With multi-modal controls and a dual-branch generation architecture, MultiCOIN balances flexibility, ease of use, and fine control, improving the quality and controllability of video inbetweening. Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly.
Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
[196] ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Zhiyu Zheng,Shaoyu Chen,Haoran Yin,Xinbang Zhang,Jialv Zou,Xinggang Wang,Qian Zhang,Lefei Zhang
Main category: cs.CV
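TL;DR: Proposes ResAD, a framework that uses normalized residual trajectory modeling to address the spatio-temporal imbalance of trajectory data in end-to-end autonomous driving, improving causal reasoning and short-term safety.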
Details
Motivation: Because trajectory data is spatio-temporally imbalanced, existing end-to-end autonomous driving systems tend to learn spurious correlations, neglect causal reasoning, and over-emphasize uncertain long-horizon predictions at the expense of immediate safety. Method: ResAD changes the prediction target from the absolute trajectory to the residual deviation from an inertial reference path and introduces point-wise normalization, which re-weights the optimization objective and reduces the dominance of distant, uncertain waypoints during training. Result: On the NAVSIM benchmark, a vanilla diffusion policy with only two denoising steps reaches a PDMS of 88.6, a state-of-the-art result. Conclusion: Residual modeling and normalization reduce learning difficulty, sharpen the model's focus on causal factors, and improve prediction safety and overall performance. Abstract: End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance.
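The two ingredients named above, a residual target relative to an inertial reference and point-wise normalization, can be sketched in a few lines. Everything specific here is an assumption for illustration: the inertial reference is taken to be a constant-velocity extrapolation, the per-waypoint scales stand in for statistics that would be estimated from training data, and the function names are hypothetical rather than the authors' implementation.

```python
def inertial_reference(position, velocity, horizon, dt=0.5):
    """Constant-velocity extrapolation of the ego state (assumed form of the reference)."""
    x, y = position
    vx, vy = velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, horizon + 1)]


def residual(trajectory, reference):
    """Per-waypoint deviation of the target trajectory from the reference."""
    return [(tx - rx, ty - ry) for (tx, ty), (rx, ry) in zip(trajectory, reference)]


def pointwise_normalize(res, scales):
    """Divide each waypoint's residual by its own scale so far waypoints do not dominate."""
    return [(dx / s, dy / s) for (dx, dy), s in zip(res, scales)]


# Ego vehicle at the origin moving at 2 m/s along x; the target drifts left over time.
reference = inertial_reference(position=(0.0, 0.0), velocity=(2.0, 0.0), horizon=3)
target = [(1.0, 0.1), (2.0, 0.4), (3.0, 0.9)]
res = residual(target, reference)

# Hypothetical per-waypoint scales that grow with the horizon.
scales = [0.1, 0.4, 0.9]
normalized = pointwise_normalize(res, scales)
print(reference, res, normalized)
```

In this toy case the raw residuals grow with the horizon, while the normalized residuals are of similar magnitude at every waypoint, which is the re-weighting effect the abstract describes.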
The code will be released to facilitate further research.
[197] NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian,Hao Li,Gen Luo,Xizhou Zhu,Weijie Su,Hanming Deng,Jinguo Zhu,Jie Shao,Ziran Zhu,Yunpeng Liu,Lewei Lu,Wenhai Wang,Hongsheng Li,Jifeng Dai
Main category: cs.CV
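TL;DR: This paper proposes NaViL, a multimodal large language model trained natively end to end, and systematically studies its design space and scaling behavior under practical data constraints. It finds a positively correlated scaling relationship between the visual encoder and the language model and verifies competitive performance on multiple multimodal benchmarks.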
Details
Motivation: Existing MLLMs are trained compositionally in stages, which makes their multimodal scaling behavior hard to explore; the end-to-end native training paradigm therefore needs study to better understand its design space and scaling laws. Method: The paper systematically studies choices in architecture design and training strategy for native MLLMs, identifies the optimal meta-architecture, and analyzes how the visual encoder and the LLM scale together. Result: The NaViL model and an efficient training recipe are proposed; on 14 multimodal benchmarks NaViL is competitive with existing MLLMs, and the study reveals a positively correlated scaling relationship between the visual encoder and the language model. Conclusion: Native end-to-end training is a feasible and promising paradigm for MLLMs whose scaling behavior can be explored effectively, offering new directions and in-depth insights for future research. Abstract: Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.
[198] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Tajamul Ashraf,Umair Nawaz,Abdelrahman M. Shaker,Rao Anwer,Philip Torr,Fahad Shahbaz Khan,Salman Khan
Main category: cs.CV
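TL;DR: Proposes a vision-centric agent tuning framework that improves vision language models on complex reasoning and tool use by automatically synthesizing multimodal trajectories and generating step-wise preference pairs.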
Details
Motivation: When used as controllers for complex reasoning and decision-making, existing vision language models are limited by the scarcity of high-quality multimodal trajectory data and the high cost of manual annotation. Method: Builds M-TRACE, a large-scale dataset of multimodal tasks, and uses it for trajectory tuning of a VLM controller; then further optimizes the controller through step-wise preference learning on Pref-X, a dataset of automatically generated preference pairs. Result: The proposed MATRIX Agent outperforms both open- and closed-source VLMs on three benchmarks (Agent-X, GTA, and GAIA), demonstrating scalable and effective multimodal tool use. Conclusion: Through automated data synthesis and fine-grained alignment training, the framework significantly improves the reasoning ability and generalization of VLMs in tool-use scenarios. Abstract: Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
[199] D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
Meixi Song,Xin Lin,Dizhe Zhang,Haodong Li,Xiangtai Li,Bo Du,Lu Qi
Main category: cs.CV
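TL;DR: This paper proposes D²GS, a unified framework that addresses the performance degradation and instability of 3D Gaussian Splatting (3DGS) in novel view synthesis under sparse-view conditions.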
Details
Motivation: Under sparse-view conditions, existing 3DGS methods overfit in near-camera regions where Gaussian density is excessive and underfit in distant regions where Gaussian coverage is insufficient, which degrades reconstruction quality and stability. Method: Proposes the D²GS framework with two core components: a depth-and-density guided dropout strategy that adaptively masks redundant Gaussians to suppress overfitting, and a distance-aware fidelity enhancement module that improves far-field reconstruction quality through targeted supervision. A new evaluation metric is also introduced to quantify the stability of the learned Gaussian distributions. Result: Experiments on multiple datasets show that the method significantly improves visual quality and reconstruction stability under sparse views. Conclusion: D²GS effectively mitigates the overfitting and underfitting of 3DGS under sparse views, improving robustness and reconstruction quality while retaining the advantage of real-time rendering. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse-view conditions. The project page can be found at: https://insta360-research-team.github.io/DDGS-website/.
[200] ReSplat: Learning Recurrent Gaussian Splats
Haofei Xu,Daniel Barath,Andreas Geiger,Marc Pollefeys
Main category: cs.CV
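TL;DR: Proposes ReSplat, a feed-forward recurrent Gaussian splatting model that iteratively refines 3D Gaussians using rendering-error feedback without explicitly computing gradients, achieving state-of-the-art performance while reducing the number of Gaussians and increasing rendering speed.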