Table of Contents
cs.CL [Back]
[1] Inconsistent Affective Reaction: Sentiment of Perception and Opinion in Urban Environments
Jingfei Huang,Han Tu
Main category: cs.CL
TL;DR: 本研究提出新方法识别和解释城市环境中人类感知与意见之间的情感不一致,利用街景图像和社交媒体文本分析北京二环路2016年与2022年的情感反应变化,发现感知与意见情感存在显著差异,且与建筑密度、行人活动等因素相关,为城市更新提供决策支持。
Details
Motivation: 现有情感分析方法难以捕捉城市环境中人类感知与意见之间的复杂差异,尤其在社交媒体时代,公众对城市空间的情感反应更加多元且动态,亟需融合多源数据的新方法来揭示感知与意见之间的情感不一致性及其驱动因素。 Method: 构建包含140,750张街景图像(百度、腾讯)和984,024条微博文本的数据集,结合目标检测与自然语言处理技术建立情感反应指数,通过回归分析、图像分割和词频分析,基于土地利用分布对北京二环区域2016年与2022年的感知与意见情感进行分类、可视化与比较。 Result: 感知情感趋势显示正向情绪分布更均匀,而意见情感变化更为极端;情感错配图揭示了感知与意见之间存在显著差异,且该差异与高密度建筑和行人活动密切相关;疫情前后的情感不一致图进一步揭示了环境变化对公众情感反应的影响。 Conclusion: 人类对城市环境的感知与意见情感存在系统性不一致,单一数据源难以全面反映城市情感动态,融合多模态数据可更精准识别情感差异,为城市规划与环境管理提供科学依据,特别是在制定城市更新策略时应考虑感知与意见的双重反馈。 Abstract: The ascension of social media platforms has transformed our understanding of urban environments, giving rise to nuanced variations in sentiment reaction embedded within human perception and opinion, and challenging existing multidimensional sentiment analysis approaches in urban studies. This study presents novel methodologies for identifying and elucidating sentiment inconsistency, constructing a dataset encompassing 140,750 Baidu and Tencent Street view images to measure perceptions, and 984,024 Weibo social media text posts to measure opinions. A reaction index is developed, integrating object detection and natural language processing techniques to classify sentiment in Beijing Second Ring for 2016 and 2022. Classified sentiment reaction is analysed and visualized using regression analysis, image segmentation, and word frequency based on land-use distribution to discern underlying factors. The perception affective reaction trend map reveals a shift toward more evenly distributed positive sentiment, while the opinion affective reaction trend map shows more extreme changes. Our mismatch map indicates significant disparities between the sentiments of human perception and opinion of urban areas over the years. Changes in sentiment reactions have significant relationships with elements such as dense buildings and pedestrian presence. Our inconsistent maps present perception and opinion sentiments before and after the pandemic and offer potential explanations and directions for environmental management, in formulating strategies for urban renewal.[2] Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Mufei Li,Dongqi Fu,Limei Wang,Si Zhang,Hanqing Zeng,Kaan Sancak,Ruizhong Qiu,Haoyu Wang,Xiaoxin He,Xavier Bresson,Yinglong Xia,Chonglin Sun,Pan Li
Main category: cs.CL
TL;DR: HaystackCraft是一个基于英文维基百科超链接网络的新型长上下文测试基准,通过模拟异构检索和代理工作流中的噪声上下文,评估大模型在真实场景下的鲁棒性。
Details
Motivation: 现有‘针在 haystack 中’(NIAH)基准测试忽略了由有偏检索和代理工作流产生的噪声上下文,无法真实反映现实世界中长上下文理解的挑战。因此,需要构建更贴近实际的 noisy 长上下文来评估模型的鲁棒性。 Method: 提出“haystack engineering”方法,并构建 HaystackCraft 基准:利用维基百科全站超链接网络生成多跳问题;模拟多种异构检索策略(稀疏、密集、混合、基于图)对干扰项构成和排序的影响;引入动态、依赖于LLM的代理式测试环境,让模型进行查询优化、反思推理和停止决策。 Result: 实验表明:1)更强的密集检索器可能引入更具挑战性的干扰项,而基于图的重排序能提升检索效果并减轻有害干扰;2)在代理测试中,即使是Gemini 2.5 Pro和GPT-5等先进模型也易因自生成干扰项导致级联失败,或难以实现早停。 Conclusion: 当前长上下文大模型在面对真实噪声和代理式推理时仍面临显著挑战,HaystackCraft为未来研究提供了一个更贴近现实、更具挑战性的评估平台。 Abstract: Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.[3] Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data
Olia Toporkov,Alan Akbik,Rodrigo Agerri
Main category: cs.CL
TL;DR: 本文研究了大语言模型(LLM)在上下文词形还原任务中的表现,发现在12种不同形态复杂性的语言中,无需微调、仅通过少量示例的上下文生成即可达到最先进的效果。
Details
Motivation: 探索大语言模型在缺乏监督训练数据的目标领域或语言中进行上下文词形还原的有效性。 Method: 比较了基于编码器的监督方法(跨域微调)、跨语言方法与大语言模型的上下文直接生成词干的方法。 Result: 实验表明,在多数语言中,无需微调的大语言模型通过上下文学习能达到最先进水平,而传统编码器模型在跨域微调后仍具竞争力。 Conclusion: 大语言模型在上下文词形还原任务中表现优异,尤其在缺乏标注数据的情况下,展示了其强大的零样本和少样本能力。 Abstract: Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: https://github.com/oltoporkov/lemma-dilemma[4] LASER: An LLM-based ASR Scoring and Evaluation Rubric
Amruta Parulekar,Preethi Jyothi
Main category: cs.CL
TL;DR: 提出了一种基于大语言模型的ASR评估方法LASER,能更准确地衡量语音识别错误对语义的影响,减少对形态句法差异的过度惩罚。
Details
Motivation: 传统ASR评估指标如WER会不公平地惩罚不影响语义的形态句法差异,需要一种更贴近人类判断的评估方式。 Method: 利用大语言模型(如Gemini 2.5 Pro)的上下文学习能力,设计包含详细示例的提示来构建评分标准LASER;同时对较小的LLM(如Llama 3)在参考文本与ASR预测生成的词对样本上进行微调以预测惩罚程度。 Result: Hindi LASER使用Gemini 2.5 Pro时与人工标注的相关性高达94%;提示中的印地语示例也有效适用于其他印度语言如马拉地语、卡纳达语和马拉雅拉姆语;微调后的Llama 3在预测处罚类型上准确率达89%。 Conclusion: LASER通过利用大语言模型的能力,显著提升了ASR评估的语义敏感性和跨语言适用性,是一种优于传统WER的评估方案。 Abstract: Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.[5] Meaningful Pose-Based Sign Language Evaluation
Zifan Jiang,Colin Leong,Amit Moryossef,Anne Göhring,Annette Rios,Oliver Cory,Maksym Ivashechkin,Neha Tarigopula,Biao Zhang,Rico Sennrich,Sarah Ebling
Main category: cs.CL
TL;DR: 本文研究了基于人体骨骼姿态的手语表达评估方法,比较了关键点距离、嵌入和回译等不同指标的优劣,并通过自动元评估和人类相关性研究验证其有效性。
Details
Motivation: 为了更准确地评估手语翻译或生成系统,需要对以骨骼姿态形式呈现的手语进行有意义的评估。现有方法缺乏系统性比较和实际验证。 Method: 研究涵盖了基于关键点距离、嵌入和回译的评估指标,通过签名单元检索的自动元评估以及跨多种手语的文本到姿态转换的人类相关性研究来分析各指标的表现。 Result: 揭示了不同评估指标在不同场景下的权衡关系,验证了各方法的有效性和局限性。 Conclusion: 研究结果为手语翻译或生成系统的开发与评估提供了实用且可复现的方法,配套的开源姿态评估工具包有助于推动该领域的研究进展。 Abstract: We present a comprehensive study on meaningfully evaluating sign language utterances in the form of human skeletal poses. The study covers keypoint distance-based, embedding-based, and back-translation-based metrics. We show tradeoffs between different metrics in different scenarios through automatic meta-evaluation of sign-level retrieval and a human correlation study of text-to-pose translation across different sign languages. Our findings and the open-source pose-evaluation toolkit provide a practical and reproducible way of developing and evaluating sign language translation or generation systems.[6] Populism Meets AI: Advancing Populism Research with LLMs
Eduardo Ryô Tamaki,Yujin J. Jung,Julia Chatterley,Grant Mitchell,Semir Dzebo,Cristóbal Sandoval,Levente Littvay,Kirk A. Hawkins
Main category: cs.CL
TL;DR: 本文提出了一种基于思维链提示的领域特定方法,利用全球民粹主义数据库(GPD)训练大语言模型,使其在民粹主义文本分类上的准确率与人类专家相当。
Details
Motivation: 传统基于文本分析的民粹主义测量方法成本高、耗时长且难以跨语言和大规模语料扩展,因此需要一种更高效、可扩展的自动化方法。 Method: 采用基于评分标准和锚点引导的思维链(CoT)提示策略,模拟人类编码员的训练过程,使用GPD数据集中的标注演讲来指导大语言模型进行推理,并测试多种闭源和开源模型对GPD评分的复现能力。 Result: 该方法使大语言模型在民粹主义程度分类上达到了与专家人工编码员相当的准确性,验证了其处理民粹主义复杂语境的能力。 Conclusion: 领域特定的提示策略能有效提升大语言模型在意识形态内容分析中的表现,为民粹主义的大规模跨语境研究提供了可行且高效的工具。 Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.[7] MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference
Zheyuan Zhang,Lin Ge,Hongjiang Li,Weicheng Zhu,Chuxu Zhang,Yanfang Ye
Main category: cs.CL
TL;DR: 本文提出了多智能体提示优化框架MAPRO,通过将多智能体系统提示优化建模为最大后验推断问题,并采用语言引导的max-product置信传播算法求解,显著提升了多智能体系统的性能。
Details
Motivation: 设计高效的多智能体系统因提示敏感性和累积不稳定性而困难,现有自动化提示设计方法在多智能体场景下仍不足,缺乏系统性优化方法。 Method: 提出四阶段的MAPRO框架,将多智能体提示优化形式化为最大后验(MAP)推断问题,使用语言引导的max-product置信传播算法求解,并引入拓扑感知的精细化机制,结合执行反馈与下游责任分配来迭代更新智能体提示。 Result: 在多种任务的基准测试中,MAPRO均达到最先进水平,持续优于人工设计基线和近期自动化方法。 Conclusion: MAPRO为多智能体提示优化提供了有效且原则性的解决方案,其基于MAP的建模范式可指导未来更可靠多智能体系统的设计。 Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce M}ulti-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of max-product belief propagation algorithm. To address credit assignment and updates the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future[8] AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Shuqing Luo,Yilin Guan,Pingzhi Li,Hanrui Wang,Tianlong Chen
Main category: cs.CL
TL;DR: 本文提出了AsyncSpade,一种异步框架,用于高效实现大语言模型的测试时扩展(TTS),通过解耦KV缓存过滤与自回归解码过程,在不牺牲性能的前提下显著降低推理延迟。
Details
Motivation: 现有的TTS方法因KV缓存的线性增长导致内存瓶颈,且现有稀疏解码方法受限于页面级过滤的序列依赖性和粗粒度token选择,影响高并发和长思维链场景下的效率。 Method: 提出AsyncSpade,包含两个核心组件:(1) 轻量级时序回归模块,预测下一token的查询状态;(2) 异步解耦框架,将KV缓存筛选与前向推理重叠执行,实现无需等待解码循环的查询感知稀疏性。 Result: 在A100节点上验证,AsyncSpade实现了理论最优的每输出token时间(TPOT),相比Quest基线降低20%以上,相比全注意力降低至少50%,并在Qwen3-8B和Qwen3-32B模型上保持或提升准确率。 Conclusion: AsyncSpade是首个在不牺牲模型性能的情况下消除序列依赖的TTS加速框架,显著提升了高并发和长CoT场景下的服务效率。 Abstract: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretical optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).[9] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
Rasika Muralidharan,Jaewoon Kwak,Jisun An
Main category: cs.CL
TL;DR: 提出了一种多智能体框架,用于研究团队科学中的结构、多样性和交互动态,发现扁平化团队表现优于层级化团队,而多样性影响较为复杂。
Details
Motivation: 受人类团队科学启发,探索多智能体系统中的团队动态,填补LLM驱动智能体在团队协作方面研究的不足。 Method: 构建基于大语言模型的多智能体框架,评估其在四个涉及常识和社会推理任务上的团队表现,并分析结构、多样性和交互动态的影响。 Result: 扁平化团队通常优于层级化团队;多样性影响复杂;智能体对团队表现过于自信,事后反思显示合作存在协调不足等挑战。 Conclusion: 团队结构显著影响多智能体系统性能,未来需改进对话协调机制以提升合作效率。 Abstract: Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.[10] Can Speech LLMs Think while Listening?
Yi-Jen Shih,Desh Raj,Chunyang Wu,Wei Zhou,SK Bong,Yashesh Gaur,Jay Mahadeokar,Ozlem Kalinli,Mike Seltzer
Main category: cs.CL
TL;DR: 本文研究了链式思维(CoT)微调对多流语音大语言模型推理能力的影响,提出通过在用户提问结束前开始推理以减少响应延迟,并引入基于熵的“问题完整性”指标来优化推理时机,结合直接偏好优化(DPO)进一步提升准确率-延迟权衡。
Details
Motivation: 语音大语言模型在复杂推理任务上表现不佳,且响应延迟影响用户体验,需提升其推理能力和交互实时性。 Method: 采用CoT微调提升语音LLMs的推理能力,提出基于熵的“问题完整性”指标以在用户输入结束前启动推理,并使用DPO优化准确率与延迟的权衡。 Result: CoT微调使语音LLM在口语推理任务上的准确率平均提升2.4倍;所提方法在同等延迟下使ARC-Easy准确率提升4%;结合DPO实现延迟减少70%且无准确率损失。 Conclusion: 通过文本空间中的CoT微调和提前推理策略,可显著提升语音大语言模型的推理准确率并有效控制响应延迟,结合DPO能进一步优化性能。 Abstract: Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been to shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.[11] When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs
Soyeong Jeong,Taehee Jung,Sung Ju Hwang,Joo-Kyung Kim,Dongyeop Kang
Main category: cs.CL
TL;DR: 本文提出了Thought Template Augmented LCLMs (ToTAL)框架,通过引入可重用的思维模板来改进长上下文语言模型在多跳推理中的表现,利用自然语言反馈迭代优化模板,并验证了其在多种基准和模型族上的有效性及可迁移性。
Details
Motivation: 现有的长上下文语言模型虽然能处理大量文本,但在整合检索到的信息进行多跳推理时,缺乏有效连接证据的机制,导致推理效果受限。 Method: 提出“思维模板”方法,将推理过程建模为从先前解题轨迹中提取的可复用思维缓存,结构化地组合证据;并通过基于自然语言反馈的迭代更新策略优化模板。 Result: 在多种基准测试和不同长上下文语言模型上,ToTAL在基于检索和无检索的场景中均显著优于强基线;优化后的模板可蒸馏至小型开源模型,实现高效推理复用。 Conclusion: ToTAL通过结构化的思维模板提升了长上下文语言模型的多跳推理能力,支持透明、可复用的推理过程,并具有良好的泛化性和实用性。 Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).[12] ParsTranslit: Truly Versatile Tajik-Farsi Transliteration
Rayyan Merchant,Kevin Tang
Main category: cs.CL
TL;DR: 本文提出了一种新的序列到序列模型,用于波斯语(Farsi)和塔吉克语(Tajik)之间的文字转写,并在所有可用数据集上进行训练,同时发布了两个新数据集。实验结果展示了跨领域的性能表现,建立了全面的基准,模型在chrF++和归一化CER指标上达到当前最优水平。
Details
Motivation: 由于波斯语的两种书写系统(波斯-阿拉伯文和塔吉克-西里尔文)差异较大,导致书面交流困难。现有转写模型受限于特定领域数据(如古诗或词表),缺乏跨领域通用性,难以实际应用。因此需要更通用、鲁棒的转写系统。 Method: 采用序列到序列(sequence-to-sequence)架构,融合所有可用的转写数据集进行训练,并构建了两个新的数据集以增强模型的泛化能力,从而提升在不同文本领域中的转写性能。 Result: 模型在Farsi到Tajik方向取得87.91的chrF++和0.05的归一化CER分数,在Tajik到Farsi方向达到92.28的chrF++和0.04的归一化CER,显著优于先前方法,并建立了跨领域的可比基准。 Conclusion: 所提出的模型具备良好的跨领域适应能力,是目前波斯语与塔吉克语间转写的最优方案,具备实际应用价值,且其开源数据与代码有助于后续研究。 Abstract: As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking ``siblings''. To overcome this, previously-published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that suck models lack the versatility required for real-world usage. The contrast in domain between data also obscures the task's true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide clearer understanding of the task, and set comprehensive comparable leading benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://anonymous.4open.science/r/ParsTranslit-FB30/.[13] OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
Jaeseong Lee,seung-won hwang,Aurick Qiao,Gabriele Oliaro,Ye Wang,Samyam Rajbhandari
Main category: cs.CL
TL;DR: 提出OWL模型和LongSpecBench基准,解决现有推测解码在长上下文场景下的性能退化问题,实现约5倍于EAGLE3的接受长度。
Details
Motivation: 现有推测解码方法在短上下文基准上表现良好,但在实际长上下文场景中性能显著下降,甚至导致生成变慢,缺乏有效的长上下文评估基准。 Method: 提出OWL模型,包含三个创新:基于LSTM且仅依赖最后token状态的草稿器、验证器中引入[SPEC]特殊token以增强表示、结合树与非树解码的混合算法;同时发布新长上下文基准LongSpecBench。 Result: 在长上下文输入上,OWL的接受长度比EAGLE3高约5倍,且有效提升生成速度,而EAGLE3在长上下文中生成速度降低至0.81倍。 Conclusion: OWL通过结构与算法创新显著提升了推测解码在长上下文场景下的性能,LongSpecBench为未来研究提供了重要基准,推动LLM推理加速的实际应用。 Abstract: Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.[14] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Md Tahmid Rahman Laskar,Mohammed Saidul Islam,Ridwan Mahbub,Mizanur Rahman,Amran Bhuiyan,Israt Jahan,Mir Tafseer Nayeem,Shafiq Joty,Enamul Hoque,Jimmy Huang
Main category: cs.CL
TL;DR: 提出多标准提示和领域自适应迁移学习方法,提升2B参数量级的视觉-语言模型在图表理解任务中的评估能力。
Details
Motivation: 小规模模型(≤2B参数)在作为自动评判模型时表现不佳,限制了其在资源受限场景下的应用。 Method: 采用多标准提示将多个评估标准整合到单个查询中,并通过在合成判断数据集上微调2B参数的LVLM实现领域自适应迁移学习,构建ChartJudge模型。 Result: 实验表明,多标准提示暴露了现有7B模型的鲁棒性缺陷;而提出的ChartJudge能有效跨数据集迁移知识,成为更专业的评判模型。 Conclusion: 该方法为权衡模型大小、提示设计和可迁移性提供了实用见解,实现了可扩展且低成本的图表推理任务评估。 Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.[15] Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER
Junyi Zhu,Savas Ozkan,Andrea Maracani,Sinan Mutlu,Cho Jung Min,Mete Ozay
Main category: cs.CL
TL;DR: 提出基于任务优先LoRA模块的多任务预微调框架,提升轻量级BERT编码器在命名实体识别和文本分类中的适应性与效率。
Details
Motivation: 在移动平台上部署NLP模型需要兼顾跨应用适应性和计算效率,但传统多任务预微调存在优化冲突问题。 Method: 采用任务优先的LoRA模块进行多任务预微调,共享编码器主干并使用模块化适配器。 Result: 在21个下游任务上平均提升NER 0.8%和文本分类8.8%,性能接近单独预微调且满足部署约束。 Conclusion: 所提方法有效解决了多任务冲突问题,实现了高效、通用的移动端NLP模型适配。 Abstract: Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that na\"ive multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraint. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.[16] Linguistic Patterns in Pandemic-Related Content: A Comparative Analysis of COVID-19, Constraint, and Monkeypox Datasets
Mkululi Sikosana,Sean Maudsley-Barton,Oluwaseun Ajao
Main category: cs.CL
TL;DR: 该研究通过计算语言学方法分析了与疫情相关的在线言论,比较了健康虚假信息与事实传播在可读性、修辞标记和说服性语言使用上的差异。
Details
Motivation: 识别健康虚假信息的语言特征,以支持其检测并改进公共卫生传播策略。 Method: 基于三个语料库(COVID-19虚假叙述、一般COVID-19内容和猴痘相关帖子)进行计算语言学分析,比较可读性、情感词汇和修辞特征。 Result: COVID-19虚假信息可读性更低,恐惧和说服性词汇频率更高,感叹号使用更少;其语言风格复杂且嵌入情感线索,可能增强可信度错觉。 Conclusion: 虚假信息具有独特的语言模式,结合复杂性和情感提示,可能提升其传播力;研究结果有助于虚假信息检测和危机传播理论构建,但需在动态和平台适配的框架下进一步验证。 Abstract: This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.[17] IASC: Interactive Agentic System for ConLangs
Chihiro Taguchi,Richard Sproat
Main category: cs.CL
TL;DR: 本文提出一个基于大语言模型(LLM)的模块化系统,用于辅助人工构造语言(conlang)的开发,涵盖语音、形态句法、词汇、正字法及语法手册生成,并探索LLM对语言学概念的理解能力及其在高低资源语言翻译中的潜在应用。
Details
Motivation: 利用LLM作为构建人工语言的工具,既为语言爱好者提供有趣易用的创作平台,也旨在探究LLM对语言普遍规律和语言学概念的深层理解,而非特定语言知识。 Method: 采用模块化、代理式(agentic)方法:首先迭代生成目标音系;接着将英语句子转换为反映目标语形态句法特征的标记形式;据此构建词库并结合音系模型生成词形;随后指定正字法(如拉丁或西里尔字母);最后自动生成简明语法手册,并支持新句子的翻译。 Result: 系统能有效生成结构完整的构造语言,包括音系、词汇、语法和书写系统;不同LLM在处理常见与罕见语言现象时表现差异显著;初步尝试将该方法应用于高资源到低资源语言翻译效果不佳,但显示出改进后可能具备潜力。 Conclusion: 该系统不仅为构造语言提供了实用工具,也揭示了LLM在掌握语言共性方面的优势与局限,为进一步研究其语言认知能力和跨语言迁移应用提供了方向。 Abstract: We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC[18] Vocabulary embeddings organize linguistic structure early in language model training
Isabel Papadimitriou,Jacob Prince
Main category: cs.CL
TL;DR: 研究了大语言模型在训练过程中输入词汇表征的结构演化,发现语义和句法特征迅速与嵌入几何结构高度相关,高频词和功能词比低频词更快收敛。
Details
Motivation: 探讨大语言模型在训练过程中输入词汇表征的结构如何形成及演化,理解语言模型能力发展的机制。 Method: 使用表征相似性分析,对Pythia 12B和OLMo 7B两个开源模型的输入和输出嵌入在训练过程中的语义、句法和频率特征进行相关性实验。 Result: 1) 嵌入几何结构在训练早期迅速与语义和句法特征高度相关;2) 高频词和功能词的嵌入更快收敛,而低频词仍保留部分初始随机偏置的影响。 Conclusion: 词汇嵌入在训练中快速组织成语言结构,词频和功能在其中扮演不同角色,提示应深入研究词汇几何演化对模型能力提升的影响。 Abstract: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.[19] Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation
Zhangdie Yuan,Han-Chin Shing,Mitch Strong,Chaitanya Shivade
Main category: cs.CL
TL;DR: 本文提出临床代码验证作为改进LLM在医疗编码中表现的新方法,通过轻量级干预减少层次性错误,并发布了一个专家双标注的门诊病例数据集。
Details
Motivation: 现有的LLM在临床编码任务中表现不佳,且常用精确匹配指标忽略了层次上接近的错误;同时现有数据集存在证据不全和住院偏倚问题。 Method: 采用提示工程和小规模微调等轻量级干预方法,并引入临床代码验证任务,结合新发布的专家双标注门诊数据集进行评估。 Result: 轻量级干预提升了编码准确性,代码验证有效减少了层次性近似错误,新数据集缓解了原有数据偏差问题。 Conclusion: 临床代码验证是提升LLM在医疗编码中可靠性和准确性的有效步骤,未来系统应考虑层次结构和更高质量的数据标注。 Abstract: Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.[20] Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models
Đorđe Klisura,Joseph Khoury,Ashish Kundu,Ram Krishnan,Anthony Rios
Main category: cs.CL
TL;DR: 研究了大语言模型在角色条件下的拒绝行为,提出并评估了三种设计方法以实现基于角色的访问控制,发布了增强的数据集和代码。
Details
Motivation: 大语言模型常常模糊角色边界,产生无限制的响应,需要研究如何让模型遵守访问控制策略。 Method: 构建了一个扩展自Spider和BIRD数据集的新数据集,包含现实的PostgreSQL基于角色的表级和列级策略;比较了零样本或少样本提示、两步生成-验证流程以及LoRA微调模型三种方法。 Result: 显式验证(两步框架)提高了拒绝精度并减少了错误授权;微调在安全性和实用性之间取得了更好的平衡;更长更复杂的策略会降低所有系统的可靠性。 Conclusion: 两步验证方法在安全性上表现更好,而微调方法在保持功能性的同时提升了安全性,但复杂策略仍对现有方法构成挑战。 Abstract: Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM's ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.[21] Banking Done Right: Redefining Retail Banking with Language-Centric AI
Xin Jie Chua,Jeraelyn Ming Li Tan,Jia Xuan Tan,Soon Chang Poh,Yi Xian Goh,Debbie Hui Tian Choong,Chee Mun Foong,Sze Jue Yang,Chee Seng Chan
Main category: cs.CL
TL;DR: Ryt AI是首个获得全球监管批准的、以对话式AI作为主要银行界面的LLM原生代理框架,通过自然语言实现核心金融交易。
Details
Motivation: 传统银行助手局限于咨询或支持角色,无法执行核心交易;现有系统依赖繁琐的多页面流程,用户体验差且效率低。 Method: 构建名为Ryt AI的LLM原生代理框架,采用自研闭源大模型ILMU,结合四个带任务特定LoRA适配器的LLM代理(Guardrails、Intent、Payment、FAQ),在银行内部部署,通过确定性防护、人机协同确认和无状态审计架构保障安全合规。 Result: 成功实现全球首个监管批准的对话式AI银行界面,取代传统多步骤操作,用户可通过单一对话完成核心金融交易,系统具备高安全性、可审计性和低开销。 Conclusion: Ryt AI证明在严格治理下,符合监管要求的自然语言界面可可靠地支持核心金融业务,标志着银行服务向智能化、简洁化的重要迈进。 Abstract: This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.[22] OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
Yuzhe Gu,Xiyu Liang,Jiaojiao Zhao,Enmao Diao
Main category: cs.CL
TL;DR: 提出了一种基于Optimal Brain Damage理论的缓存淘汰框架OBCache,通过量化删除token对注意力输出的影响来优化KV缓存管理,在LLaMA和Qwen模型上验证了其在长上下文任务中的有效性。
Details
Motivation: 现有缓存淘汰方法依赖启发式注意力权重评估token重要性,未考虑其对注意力输出的真实影响,导致精度损失。 Method: 将缓存淘汰建模为逐层结构化剪枝问题,基于OBD理论推导出针对键、值及键值对的闭式显著性评分,结合注意力权重、值状态和输出信息进行token重要性评估。 Result: 在LLaMA和Qwen模型上的实验表明,使用OBCache的输出感知评分替代现有方法的启发式评分,能持续提升长上下文任务的准确性。 Conclusion: OBCache通过引入输出感知的token显著性评估机制,有效提升了大语言模型在扩展上下文窗口下的缓存淘汰性能。 Abstract: Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.[23] Textual Entailment and Token Probability as Bias Evaluation Metrics
Virginia K. Felkner,Allison Lim,Jonathan May
Main category: cs.CL
TL;DR: 本文探讨了使用自然语言推断(NLI)作为语言模型社会偏见测量的替代方法,发现NLI与传统的词元概率(TP)指标在评估偏见时表现差异显著,相关性较低。NLI更可能检测到“去偏不足”的情况,但对反刻板印象句子的措辞更敏感且更脆弱。研究建议结合TP、NLI和下游偏见评估以实现全面的语言模型评估。
Details
Motivation: 由于传统的词元概率(TP)偏见度量方法与真实世界语言模型使用场景和危害关联较弱,本文旨在探索更贴近实际应用的自然语言推断(NLI)作为替代偏见度量方法的有效性。 Method: 通过比较不同NLI度量方法之间以及NLI与TP度量之间的相关性,分析它们在检测语言模型社会偏见方面的表现差异,并评估其对反刻板印象句子表述变化的敏感性。 Result: NLI与TP偏见评估方法表现出显著差异,相关性很低;NLI更倾向于检测出‘去偏不足’的情况,但对反刻板印象句子的措辞更为敏感和脆弱。 Conclusion: TP和NLI都不是在所有情况下都更优的偏见度量方法,推荐结合TP、NLI及下游任务的偏见评估,以确保对语言模型进行更全面的偏见评估。 Abstract: Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world langugage model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.[24] Stress-Testing Model Specs Reveals Character Differences among Language Models
Jifan Zhang,Henry Sleight,Andi Peng,John Schulman,Esin Durmus
Main category: cs.CL
TL;DR: 提出了一种系统性方法来压力测试大语言模型的行为规范,揭示了现有模型规范中存在大量原则矛盾和解释模糊的问题,并通过生成价值权衡场景发现了超过7万种显著行为分歧案例。
Details
Motivation: 现有大语言模型的规范常存在原则间冲突和覆盖不足的问题,缺乏系统性方法检测这些问题。 Method: 构建了一个全面的价值观分类体系,生成迫使模型在无法同时满足的成对合法原则之间做权衡的情境,对12个前沿大模型进行评估,利用价值分类分数衡量行为差异。 Result: 发现了超过7万个显著行为分歧案例,实证表明高分歧行为能有效预测模型规范中的问题,定性分析揭示了直接矛盾、解释模糊、错位对齐和假拒绝等问题,并总结了不同模型的价值优先模式差异。 Conclusion: 当前大语言模型规范存在严重缺陷,需更精细的设计与验证机制,所提方法可有效暴露规范问题,为改进模型对齐提供方向。 Abstract: Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.[25] Large Language Models Meet Virtual Cell: A Survey
Krinos Li,Xianglu Xiao,Shenglong Deng,Lucas He,Zijun Zhong,Yuanjie Zou,Zhonghao Zhan,Zheng Hui,Weiye Bao,Guang Yang
Main category: cs.CL
TL;DR: 本文综述了大语言模型(LLM)在虚拟细胞建模中的应用,提出了将现有方法分为“LLM作为预言机”和“LLM作为代理”的统一分类法,并探讨了细胞表示、扰动预测和基因调控推断三大核心任务及其挑战。
Details
Motivation: 随着大语言模型的发展,其在细胞生物学中的潜力日益显现,但缺乏系统性梳理和统一框架来指导虚拟细胞建模的研究与应用。 Method: 提出一种统一的分类体系,将LLM应用于虚拟细胞建模的方法分为两类:LLM作为预言机用于直接建模,LLM作为代理用于协调复杂科学任务;并围绕三个核心任务进行系统回顾。 Result: 总结了当前用于细胞表示、扰动预测和基因调控推断的模型、数据集和评估基准,识别出在可扩展性、泛化性和可解释性方面的主要挑战。 Conclusion: 该综述为LLM在虚拟细胞中的应用提供了结构化视角,有助于推动计算生物学中更强大、可解释和通用的模型发展。 Abstract: Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.[26] Causality Guided Representation Learning for Cross-Style Hate Speech Detection
Chengshuai Zhao,Shu Wan,Paras Sheth,Karan Patwa,K. Selçuk Candan,Huan Liu
Main category: cs.CL
TL;DR: 本文提出了一种基于因果表示学习的隐式仇恨言论检测框架CADET,通过解耦上下文、动机、目标和风格等潜在因素,提升模型在不同平台和语言风格下的泛化能力。
Details
Motivation: 现有仇恨言论检测模型依赖表面语言特征,难以应对跨平台、多风格的隐式仇恨言论,且易受虚假相关性干扰。 Method: 构建一个包含上下文环境、创作者动机、目标和风格的因果图,提出CADET框架,利用因果表示学习分离潜在因子,并通过反事实推理干预风格变量以增强鲁棒性。 Result: 实验表明CADET在多个数据集上优于现有方法,展现出因果先验在仇恨言论检测中的有效性。 Conclusion: 通过引入因果建模,CADET能够更准确地识别隐式仇恨言论,提升了模型的可解释性和跨域泛化能力。 Abstract: The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.[27] MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation
Shuo Yu,Mingyue Cheng,Daoyu Wang,Qi Liu,Zirui Liu,Ze Guo,Xiaoyu Tao
Main category: cs.CL
TL;DR: 本文提出了MemWeaver框架,通过构建包含行为记忆和认知记忆的分层记忆结构,对用户的文本历史进行建模,以实现深度个性化生成。
Details
Motivation: 现有方法将用户历史视为扁平文本列表,未能捕捉用户兴趣的时间演化和语义关系,导致个性化程度较浅。 Method: 提出MemWeaver框架,构建融合时间与语义信息的双组件记忆系统:行为记忆(捕捉具体动作)和认知记忆(表示长期偏好),并用于增强大语言模型的个性化生成能力。 Result: 在LaMP基准上的实验验证了MemWeaver的有效性,显著优于现有方法。 Conclusion: MemWeaver通过层次化记忆建模,有效整合用户文本历史中的时间动态与语义结构,提升了个性化生成的效果。 Abstract: The primary form of user-internet engagement is shifting from leveraging implicit feedback signals, such as browsing and clicks, to harnessing the rich explicit feedback provided by textual interactive behaviors. This shift unlocks a rich source of user textual history, presenting a profound opportunity for a deeper form of personalization. However, prevailing approaches offer only a shallow form of personalization, as they treat user history as a flat list of texts for retrieval and fail to model the rich temporal and semantic structures reflecting dynamic nature of user interests. In this work, we propose \textbf{MemWeaver}, a framework that weaves the user's entire textual history into a hierarchical memory to power deeply personalized generation. The core innovation of our memory lies in its ability to capture both the temporal evolution of interests and the semantic relationships between different activities. To achieve this, MemWeaver builds two complementary memory components that both integrate temporal and semantic information, but at different levels of abstraction: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. This dual-component memory serves as a unified representation of the user, allowing large language models (LLMs) to reason over both concrete behaviors and abstracted traits. Experiments on the Language Model Personalization (LaMP) benchmark validate the efficacy of MemWeaver. Our code is available\footnote{https://github.com/fishsure/MemWeaver}.[28] SUBQRAG: sub-question driven dynamic graph rag
Jiaoyang Li,Junhao Ruan,Shengwei Tang,Saihan Chen,Kaiyan Chang,Yuan Ge,Tong Xiao,Jingbo Zhu
Main category: cs.CL
TL;DR: 提出SubQRAG,一种基于子问题驱动的图检索增强生成框架,通过分解复杂问题并动态扩展知识图谱,提升多跳问答的推理深度和准确性。
Details
Motivation: 现有图RAG在处理复杂多跳问答时缺乏深层结构化推理,导致证据不全和错误累积。 Method: 将复杂问题分解为有序的可验证子问题,针对每个子问题从知识图中检索三元组;当图信息不足时,实时从原文档提取新三元组以动态扩展图;所有推理过程中的三元组聚合为“图记忆”,形成可追溯的证据路径。 Result: 在三个多跳问答基准上的实验表明,SubQRAG在Exact Match等指标上取得一致且显著的提升。 Conclusion: SubQRAG通过子问题驱动和动态图扩展机制,有效增强了复杂问答中的结构化推理能力,提升了答案的准确性和可解释性。 Abstract: Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a knowledge graph (KG) to connect disparate facts across a large document corpus. However, this broad-view approach often lacks the deep structured reasoning needed for complex multi-hop question answering (QA), leading to incomplete evidence and error accumulation. To address these limitations, we propose SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG decomposes a complex question into an ordered chain of verifiable sub-questions. For each sub-question, it retrieves relevant triples from the graph. When the existing graph is insufficient, the system dynamically expands it by extracting new triples from source documents in real time. All triples used in the reasoning process are aggregated into a "graph memory," forming a structured and traceable evidence path for final answer generation. Experiments on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent and significant improvements, especially in Exact Match scores.[29] Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing
Cunli Mao,Xiaofei Gao,Ran Song,Shizhu He,Shengxiang Gao,Kang Liu,Zhengtao Yu
Main category: cs.CL
TL;DR: 本文提出了一种新的多语言知识图谱补全(MKGC)框架,通过知识级分组专家混合(KL-GMoE)和迭代实体重排序(IER)利用多语言共享知识,显著提升了性能。实验表明,该方法在Hits@1、Hits@3和Hits@10指标上均优于现有最先进方法。
Details
Motivation: 现有MKGC研究未能充分利用大语言模型的多语言能力,且忽视了跨语言知识的可共享性。 Method: 提出包含KL-GMoE和IER两个组件的新框架:KL-GMoE用于高效建模共享知识,IER用于增强其利用效果,并构建了一个包含5种语言的mKG数据集进行评估。 Result: 相比现有SOTA方法,该框架在Hits@1、Hits@3和Hits@10上分别提升了5.47%、3.27%和1.01%;进一步分析揭示了未见语言和不平衡语言设置下的知识共享特性。 Conclusion: 所提出的框架有效利用多语言共享知识,显著提升MKGC性能,具备良好的扩展性和实用性,相关数据集与代码已公开。 Abstract: Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs' multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work on https://github.com/gaoxiaofei07/KL-GMoE.[30] ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs
Fu Chen,Peng Wang,Xiyin Li,Wen Li,Shichi Lei,Dongdong Xiang
Main category: cs.CL
TL;DR: 提出ToolExpander框架,通过动态多轮难样本采样和自示范思维机制,提升小规模语言模型在GRPO训练中的稳定性与工具使用能力。
Details
Motivation: 解决小规模架构下GRPO训练中模型难以生成准确响应、训练易崩溃的问题,提升训练稳定性和最终性能。 Method: 1) 动态多轮难样本采样:用高质量少样本示例替换困难样本,并结合指数学习率衰减;2) 自示范思维:去除KL散度,引入调整后的裁剪系数,鼓励模型自主生成并分析少样本示例,仅需极小额外奖励(0.01)。 Result: 实验表明,ToolExpander显著提升了小规模LLMs的工具使用能力,增强了训练稳定性与整体性能。 Conclusion: ToolExpander有效缓解了GRPO在小模型上的局限性,为资源受限场景下的强化学习训练提供了可行方案。 Abstract: Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations:(1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples(those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations;(2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.[31] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
Tianci Liu,Ran Xu,Tony Yu,Ilgee Hong,Carl Yang,Tuo Zhao,Haoyu Wang
Main category: cs.CL
TL;DR: 本文提出了OpenRubrics,一个大规模的(提示,评分标准)数据集,并引入对比评分标准生成(CRG)方法,通过对比优选和被拒回答生成显式规则与隐含原则,提升了评分标准的可靠性与可扩展性。基于此构建的Rubric-RM在多个基准上超越强基线6.8%,并在策略模型中取得迁移增益,推动了大模型对齐的准则驱动新范式。
Details
Motivation: 现有奖励模型多依赖标量或成对判断,难以捕捉人类偏好的多维特性;而尽管rubrics-as-rewards(RaR)能通过结构化语言准则反映多维度质量,但可靠且可扩展的评分标准生成仍具挑战。 Method: 提出OpenRubrics数据集和对比评分标准生成(CRG)方法,从优选与被拒回答的对比中提取硬规则(显式约束)和原则(隐含品质),并通过拒绝采样确保评分标准与偏好标签一致,过滤噪声。 Result: Rubric-RM在多个奖励建模基准上比同规模基线提升6.8%,其收益可迁移到指令遵循和生物医学任务的策略模型;生成的评分标准更具判别性和一致性。 Conclusion: 评分标准可提供可扩展的对齐信号,在自动化奖励建模与高成本人工评估之间缩小差距,支持一种新的、以原则驱动的大语言模型对齐范式。 Abstract: Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured natural language criteria that capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.[32] Parallel Test-Time Scaling for Latent Reasoning Models
Runyang You,Yongqi Li,Meng Liu,Wenjie Wang,Liqiang Nie,Wenjie Li
Main category: cs.CL
TL;DR: 本文提出了一种用于潜在推理模型的并行测试时扩展方法,通过引入基于不确定性的随机采样策略和训练潜在奖励模型进行轨迹聚合,实现了在连续空间中的可扩展推理。
Details
Motivation: 现有的并行测试时扩展方法主要针对显式的基于token的思维链,而难以应用于连续向量空间中的潜在推理模型,因为缺乏在连续空间中采样的机制以及有效的轨迹聚合概率信号。因此,探索如何将并行TTS应用于潜在推理模型成为一个关键问题。 Method: 为实现潜在推理的并行TTS,作者提出了两种基于不确定性的随机采样策略:蒙特卡洛Dropout和加性高斯噪声;同时设计了一个潜在奖励模型(LatentRM),采用逐步对比学习目标来训练,用于评分和引导潜在推理路径。 Result: 实验结果表明,两种采样策略能够随着计算资源增加而有效提升性能,并展现出不同的探索动态;LatentRM能有效选择高质量的推理轨迹。可视化分析进一步验证了方法的有效性。 Conclusion: 该工作首次实现了潜在推理模型的并行测试时扩展,为连续空间中的可扩展推理提供了新方向,并开源了代码。 Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. \ This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.[33] Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
Nishant Balepur,Atrey Desai,Rachel Rudinger
Main category: cs.CL
TL;DR: 该论文探讨了大语言模型在仅凭选项(choices-only)情况下回答多项选择题的能力,发现即使没有问题本身,模型仍能通过推理路径成功作答,且这些推理路径具有可信度,表明其策略并非完全浅薄。
Details
Motivation: 研究大语言模型在不依赖问题文本、仅使用选项的情况下进行多选题作答的现象,质疑将此类‘部分输入成功’简单视为缺陷的观点。 Method: 通过对比模型在完整输入和仅选项输入下的表现,分析其推理路径的长度、忠实性及内容,探究其背后策略。 Result: 发现一半情况下测试时推理能提升仅选项设置下的准确率;推理路径长度对性能影响小,且通过忠实性检验,显示模型会推断缺失的问题等合理策略。 Conclusion: 部分输入成功并不总是缺陷,推理路径有助于区分数据问题与合理的推理行为。 Abstract: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often deemed problematic, but reasoning traces could reveal if these strategies are truly shallow in choices-only settings. To study these strategies, reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy on full and in choices-only half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning.[34] ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning
Murong Yue,Zhiwei Liu,Liangwei Yang,Jianguo Zhang,Zuxin Liu,Haolin Chen,Ziyu Yao,Silvio Savarese,Caiming Xiong,Shelby Heinecke,Huan Wang
Main category: cs.CL
TL;DR: 提出一种系统性方法,将无结构的工具集合自动重构为结构化的工具库,通过多智能体框架聚合功能,提升工具检索准确性和推理性能。
Details
Motivation: 现有工具增强型大模型在复杂推理任务中表现优异,但在特定领域(如物理问答)缺乏专用工具;尽管已有工作尝试从思维链推理路径中提取可复用函数,但随着生成工具增多,无结构存储导致检索困难和语义歧义,限制了可扩展性。 Method: 首先生成任务特定的离散工具并按语义聚类;在每个聚类内采用多智能体框架:代码智能体重构代码、提取共性逻辑并生成聚合工具,评审智能体确保功能完整性;最终将大量问题特定工具整合为少量功能更强的聚合工具。 Result: 实验表明该方法显著提升了工具检索准确率和整体推理性能,并在问题特定工具数量增加时展现出优于基线方法的可扩展性。 Conclusion: 通过自动化重构无结构工具集为结构化工具库,有效解决了工具检索与功能冗余问题,增强了工具增强型大模型在复杂推理任务中的可扩展性和实用性。 Abstract: Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.[35] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan,Qiuyang Mang,Jingbang Chen,Hong Wan,Xiaoyuan Liu,Junjielong Xu,Jen-tse Huang,Wenxuan Wang,Wenxiang Jiao,Pinjia He
Main category: cs.CL
TL;DR: 本文提出了一种面向推理过程的奖励模型(RRM),用于解决大语言模型在数学推理中因仅依赖最终答案奖励而导致的“奖励黑客”问题,显著提升了模型的准确性和可靠性。
Details
Motivation: 传统基于最终结果的奖励机制容易导致模型通过错误的推理路径得到正确答案(即奖励欺骗),从而高估其推理能力,因此需要一种更可靠的训练方式来评估整个推理过程。 Method: 引入Rubric Reward Model(RRM),采用细粒度的过程导向奖励机制,根据问题特定的评分标准对推理链进行逐步评估,并在强化学习框架中训练模型;同时通过人工验证建立错误模式分类,识别如‘奇迹步骤’等不合理推理现象。 Result: RRM在四个数学基准上均优于仅基于结果的监督方法,在AIME2024上的Verified Pass@1024从26.7%提升至62.6%,并减少了71%的‘奇迹步骤’发生率。 Conclusion: 奖励推理过程而非仅仅最终答案,对于构建更准确、更可信的数学推理模型至关重要。 Abstract: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.[36] The Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud,Ali Khalil,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana
Main category: cs.CL
TL;DR: 本文研究了大语言模型中提高真实性与安全对齐之间的权衡问题,发现增强事实准确性可能导致拒绝行为减弱。为此,作者提出通过稀疏自编码器分离拒答特征与幻觉特征,并在微调过程中使用子空间正交化来保持安全性,从而有效缓解真实性和安全性之间的冲突。
Details
Motivation: 提高大语言模型的真实性可能损害其安全对齐能力,这一副作用尚未被充分关注。本文旨在探究真实性与安全性的潜在冲突及其成因。 Method: 利用稀疏自编码器识别并分离与拒答和幻觉相关的特征,通过子空间正交化方法在微调过程中保护拒答特征,避免对齐过程误删事实知识。 Result: 在常识推理任务及有害请求基准(AdvBench、StrongReject)上的实验表明,该方法能有效保持模型的拒答行为和任务性能,同时防止幻觉增加。 Conclusion: 通过特征解耦和子空间正交化,可以在提升或维持模型真实性的同时,不牺牲其安全对齐能力,有效缓解二者之间的权衡问题。 Abstract: Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment.We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.[37] Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Shiman Zhao,Shangyuan Li,Wei Chen,Tengjiao Wang,Jiahui Yao,Jiabin Zheng,Kam Fai Wong
Main category: cs.CL
TL;DR: 提出了一种端到端的多标签联合学习方法,通过实例关系学习和标签知识传播来解决少样本多标签意图检测中的错误传播问题。
Details
Motivation: 现有方法依赖表示分类且忽略实例间关系,导致错误传播,难以有效处理低资源对话域中的多标签意图检测。 Method: 构建一个带有标签知识传播的实例关系学习网络,学习支持集与查询集之间的实例交互关系,并设计双关系增强损失函数优化支持集和查询集层面的关系强度。 Result: 在1-shot场景下,平均比强基线方法AUC提升9.54%,Macro-F1提升11.19%。 Conclusion: 所提方法通过实例关系学习和标签知识传播有效提升了少样本多标签意图检测性能,避免了传统两阶段方法的错误传播问题。 Abstract: Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.[38] Drift No More? Context Equilibria in Multi-Turn LLM Interactions
Vardhan Dongre,Ryan A. Rossi,Viet Dac Lai,David Seunghyun Yoon,Dilek Hakkani-Tür,Trung Bui
Main category: cs.CL
TL;DR: 本研究提出了一种动态框架来解释大语言模型在多轮交互中的上下文漂移现象,将其形式化为测试模型与目标一致的参考模型之间的KL散度,并通过实验证明漂移是一种可控的平衡过程而非不可避免的退化。
Details
Motivation: 大语言模型在单轮任务中表现优异,但在需要持续多轮交互的实际应用中常出现上下文漂移问题,而这一问题难以通过静态指标捕捉,因此需要新的分析框架。 Method: 将上下文漂移形式化为每轮对话中测试模型与目标一致参考模型之间的KL散度,构建一个包含恢复力和可控干预的有界随机过程递归模型,并在合成任务和真实用户代理模拟(如τ-Bench)中实例化该框架。 Result: 实验显示多个开源权重大模型在多轮交互中表现出稳定的、受噪声限制的均衡状态,而非持续恶化;简单的提醒干预能有效减少漂移,且结果符合理论预测。 Conclusion: 多轮上下文漂移可被理解为一种可控的平衡现象,而非必然退化,这为研究和缓解长时交互中的漂移问题提供了新基础。 Abstract: Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.[39] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
Shuichiro Haruta,Kazunori Matsumoto,Zhi Li,Yanan Wang,Mori Kurokawa
Main category: cs.CL
TL;DR: 提出一种旋转约束补偿方法,用于减轻大语言模型结构化剪枝引入的误差,在保持输出表示几何特性的同时有效恢复性能。
Details
Motivation: 结构化剪枝通常仅使用少量校准数据,难以避免输出失配问题,而直接拟合容易过拟合并破坏预训练权重。 Method: 在旋转约束下更新剪枝参数,保持输出表示的几何结构(如范数和内积),并重新对齐剪枝子空间与原始输出;结合方差感知的重要性评分保留关键输入维度。 Result: 在LLaMA-7B上实验表明,该方法在WikiText-2和多个语言理解基准上均优于现有基线,表现出更低的困惑度和更高的任务准确率。 Conclusion: 旋转约束补偿结合方差感知重要性评分能有效修复剪枝误差,同时保留模型语义几何结构,提升剪枝后模型性能。 Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.[40] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
Sajib Acharjee Dip,Adrika Zafor,Bikash Kumar Paul,Uddip Acharjee Shuvo,Muhit Islam Emon,Xuan Wang,Liqing Zhang
Main category: cs.CL
TL;DR: LLM4Cell 提供了首个针对单细胞研究中58个基础模型和智能体框架的统一综述,涵盖RNA、ATAC、多组学和空间模态,分类方法并评估其在多个任务和维度上的表现,揭示了可解释性、标准化和可信模型开发中的开放挑战。
Details
Motivation: 尽管大语言模型和智能体框架在单细胞生物学中展现出潜力,但其发展在数据模态、架构和评估标准上仍分散不一,缺乏系统性整合与评估。 Method: 对58个用于单细胞研究的基础模型和智能体模型进行系统性综述,将其分为五类,并映射到八项关键分析任务;基于40多个公开数据集,从10个领域维度评估模型性能。 Result: 建立了首个整合数据集、模型与评估维度的语言驱动单细胞智能视图,揭示了当前基准适用性、数据多样性及伦理或可扩展性限制,并识别出各模型在生物合理性、多组学对齐、公平性、隐私和可解释性方面的表现差异。 Conclusion: LLM4Cell为语言模型在单细胞生物学中的应用提供了系统性框架,指出了在标准化、可解释性和可信AI方面的主要挑战,为未来研究提供了方向。 Abstract: Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.[41] HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Peilin Wu,Mian Zhang,Kun Wan,Wentian Zhao,Kaiyu He,Xinya Du,Zhiyu Chen
Main category: cs.CL
TL;DR: 本文提出了HiPRAG,一种通过细粒度分层过程奖励来优化Agentic RAG中搜索行为的强化学习方法,有效减少了过搜和欠搜问题,在多个模型和基准上提升了推理效率与准确性。
Details
Motivation: 现有的Agentic RAG在搜索行为上存在过搜和欠搜问题,传统基于结果的强化学习奖励机制缺乏对推理过程的精细控制,难以提升搜索效率。 Method: 提出HiPRAG方法,将代理的推理轨迹分解为可解析的步骤,引入基于知识的细粒度过程奖励,并结合分层奖励函数,评估每一步搜索决策的必要性,在结果和格式奖励基础上增加对最优搜索行为的比例奖励。 Result: 在Qwen2.5和Llama-3.2模型及七个QA基准上实验表明,HiPRAG在3B和7B模型上分别达到65.4%和67.2%的平均准确率,过搜率降至2.3%,同时降低欠搜率,显著提升搜索效率和推理质量。 Conclusion: 通过细粒度的过程奖励优化推理路径,不仅能提升最终性能,还能增强搜索行为的效率与合理性,HiPRAG具有良好的跨算法、模型族、规模和类型泛化能力,展示了过程优化在搜索代理中的重要潜力。 Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.[42] Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
Eric Hanchen Jiang,Guancheng Wan,Sophia Yin,Mengting Li,Yuchen Wu,Xiao Liang,Xinfeng Li,Yizhou Sun,Wei Wang,Kai-Wei Chang,Ying Nian Wu
Main category: cs.CL
TL;DR: 本文提出了一种名为Guided Topology Diffusion (GTD)的生成框架,用于优化基于大语言模型的多智能体系统的通信拓扑结构,通过迭代式、引导式的图生成方法实现任务自适应、稀疏且高效的通信结构。
Details
Motivation: 现有方法依赖静态或手工设计的通信拓扑,难以适应不同任务需求,导致简单任务开销过大或复杂任务性能受限,因此需要一种能动态平衡性能、成本与鲁棒性的拓扑生成机制。 Method: 受条件离散图扩散模型启发,GTD将拓扑生成建模为迭代构造过程,并利用轻量级代理模型预测多目标奖励(如准确率、效用、成本),在每一步中引导生成方向,实现无需梯度的实时优化。 Result: 在多个基准任务上验证表明,GTD能够生成高度任务自适应、稀疏且高效的通信拓扑,在LLM智能体协作中显著优于现有方法。 Conclusion: GTD通过迭代引导生成机制,有效解决了多智能体系统中通信拓扑的动态优化问题,提升了任务性能与通信效率的平衡能力。 Abstract: The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called \textit{Guided Topology Diffusion (GTD)}. Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.[43] Multilingual Generative Retrieval via Cross-lingual Semantic Compression
Yuxin Huang,Simeng Wu,Ran Song,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu
Main category: cs.CL
TL;DR: 提出了一种新的多语言生成式检索框架MGR-CSC,通过跨语言语义压缩和动态多步约束解码策略,有效解决了跨语言标识符错位和膨胀问题,在多个基准上显著提升了检索准确性和效率。
Details
Motivation: 生成式信息检索在单语场景中表现优异,但在多语言检索中面临跨语言标识符错位和标识符膨胀两大挑战,亟需一种能统一语义并压缩标识空间的方法。 Method: 提出MGR-CSC框架,将语义等价的多语言关键词统一为共享原子以对齐语义并压缩标识空间,并设计动态多步约束解码策略提升检索效率。 Result: 在mMarco100k和mNQ320k数据集上,检索准确率分别提升6.83%和4.77%,文档标识符长度减少74.51%和78.2%。 Conclusion: MGR-CSC通过语义压缩与约束解码,有效提升了多语言生成式检索的准确性与效率,为解决跨语言检索中的标识符问题提供了新思路。 Abstract: Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios.However, applying these methods to multilingual retrieval still encounters two primary challenges, cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifiers length by 74.51% and 78.2%, respectively.[44] AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
Jingyu Peng,Maolin Wang,Hengyi Cai,Yuchen Li,Kai Zhang,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
Main category: cs.CL
TL;DR: 提出AdaSwitch方法,动态结合策略内和策略外生成,提升小语言模型的知识蒸馏效果。
Details
Motivation: 现有知识蒸馏方法在监督质量与训练-推理一致性之间存在权衡,难以兼顾高性能与低延迟需求。 Method: 在token级别动态结合策略内(on-policy)和策略外(off-policy)生成,通过实时质量评估决定是否引入教师模型指导。 Result: 在三个数据集和两种师生大模型组合上实验表明,AdaSwitch持续提升准确率,且额外开销可控。 Conclusion: AdaSwitch有效平衡了监督质量与推理一致性,为小语言模型的高效蒸馏提供了实用解决方案。 Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.[45] Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Md. Faiyaz Abdullah Sayeedi,Md. Mahbub Alam,Subhey Sadi Rahman,Md. Adnanul Islam,Jannatul Ferdous Deepti,Tasnim Mohiuddin,Md Mofijul Islam,Swakkhar Shatabda
Main category: cs.CL
TL;DR: 提出Translation Tangles框架和数据集,用于评估开源大语言模型在多语言翻译中的质量和公平性,涵盖24个双向语言对,并引入基于人工标注的高质量偏见数据集。
Details
Motivation: 大语言模型在机器翻译中表现不均衡,且可能放大训练数据中的偏见,尤其影响低资源语言,亟需评估其翻译质量与公平性。 Method: 构建统一的评估框架Translation Tangles,覆盖多个语言对和领域,采用多种指标;设计混合偏见检测流程,结合规则启发、语义相似度过滤和LLM验证;并基于1,439个人工评估的翻译对构建高质量标注数据集。 Result: 实现了对24个双向语言对的系统评测,识别出模型在不同语言族和领域中的性能差异与偏见模式,验证了所提偏见检测方法的有效性。 Conclusion: Translation Tangles为评估开源大语言模型的翻译质量与公平性提供了有效工具,有助于提升多语言翻译系统的可靠性与公正性。 Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles[46] Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking
Xinliang Frederick Zhang,Anhad Mohananey,Alexandra Chronopoulou,Pinelopi Papalampidi,Somit Gupta,Tsendsuren Munkhdalai,Lu Wang,Shyam Upadhyay
Main category: cs.CL
TL;DR: 该研究提出了TRACE分析工具,系统性地探究大语言模型在简单任务上出现“过度思考”的根本原因,发现其主要由过度验证和过度探索驱动,并提出基于思维效用的过度思考新定义。
Details
Motivation: 现有研究缺乏对大语言模型过度思考现象内在机制的深入理解,仅停留在表面观察,无法揭示其根本原因。 Method: 提出TRACE分析框架,将思维过程分解为最小完整子思想,通过推断子思想间的语篇关系构建细粒度思维演进图,并识别相似问题的常见思维模式。 Result: 确认长链推理模型在简单任务上速度慢5到20倍且无准确率提升,识别出Explorer和Late Landing两种主要思维模式,揭示过度验证与探索是过度思考的主因。 Conclusion: 基于思维结构提出新的基于效用的过度思考定义,超越了传统的长度指标,为理解和管理大语言模型的过度思考提供了原则性指导。 Abstract: Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency -- overthinking -- models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs' inner workings. This study introduces a systematic, fine-grained analyzer of LLMs' thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.[47] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
Heyang Liu,Yuhao Wang,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
Main category: cs.CL
TL;DR: 本文提出了一个名为CS3-Bench的语码转换语音到语音基准测试,揭示了现有模型在语言对齐方面的不足,并通过Chain of Recognition和Keyword Highlighting方法显著提升了多模态大语言模型的语言对齐能力。
Details
Motivation: 现有的多模态大语言模型在自然单语交互方面已取得进展,但在语码转换场景下的语言对齐能力存在明显缺陷,需要系统性评估与改进。 Method: 构建了CS3-Bench基准测试,提出Chain of Recognition(CoR)以增强理解,采用Keyword Highlighting(KH)来引导生成,并设计新的数据构造与训练方法。 Result: 知识准确率从25.14%提升至46.13%,开放性问题理解率从64.5%提升至86.5%,并在次要语言发音错误上有显著减少。 Conclusion: 所提方法有效增强了多模态大语言模型在语码转换场景下的语言对齐能力,为未来语音交互系统的多语言支持提供了可行路径。 Abstract: The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.[48] Contrastive Weak-to-strong Generalization
Houcheng Jiang,Junfeng Fang,Jiaxin Wu,Tianyu Zhang,Chen Gao,Yong Li,Xiang Wang,Xiangnan He,Yang Deng
Main category: cs.CL
TL;DR: 本文提出了Contrastive Weak-to-Strong Generalization (ConG),利用对比解码在预对齐和后对齐的弱模型之间生成更高质量的样本,从而提升弱到强泛化的鲁棒性和效果。
Details
Motivation: 弱模型输出中的噪声和偏差限制了弱到强泛化方法的实际应用,因此需要一种更鲁棒的方法来提升生成质量。 Method: 通过引入隐式奖励并揭示其与对比解码(CD)的结构等价性,提出ConG框架,在弱模型间使用对比解码生成更优样本。 Result: 在多个模型家族上的实验表明,ConG在不同设置下均带来一致改进,显著优于传统弱到强方法。 Conclusion: ConG有效提升了弱到强泛化的性能,具有广泛应用潜力,并为实现AGI提供了新路径。 Abstract: Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.[49] Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects
Verena Blaschke,Miriam Winkler,Barbara Plank
Main category: cs.CL
TL;DR: 本文研究了从标准语到非标准方言的跨方言迁移,比较了文本、语音及级联系统三种设置下的性能,发现语音模型在方言数据上表现最佳,而级联系统若生成标准化转录结果,在方言任务中也表现良好。
Details
Motivation: 由于方言主要是口头语言,且非标准拼写会影响文本处理,因此需要探索更有效的跨方言迁移方法,特别是在语音与文本之间的转换和处理。 Method: 作者在德语及其多种方言的意图和主题分类任务中,比较了纯文本模型、纯语音模型以及语音先转文本再处理的级联系统的性能,并发布了首个方言语音意图分类数据集。 Result: 实验表明,语音模型在方言数据上表现最好,文本模型在标准语数据上最优;级联系统在标准德语上落后于纯文本模型,但在生成标准化转录时对方言数据有较好表现。 Conclusion: 对于方言处理,直接使用语音模型优于文本或级联方法,但若自动转录能产生标准化输出,级联系统也可作为有效替代方案。 Abstract: Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.[50] Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon,Seongtae Hong,Jaehyung Seo,Heuiseok Lim
Main category: cs.CL
TL;DR: 本文提出了MCBench,一个用于评估大语言模型(LLM)能否严格遵循逐步指令执行字符串匹配NLP指标的新基准。该基准具有客观、确定性和可代码验证的特点,旨在测试LLM在指令遵循、数值计算和中间结果一致性方面的表现。
Details
Motivation: 现有基准已难以区分前沿大语言模型的性能,且多依赖主观判断,缺乏客观验证手段。因此需要一个更具挑战性且客观的评估工具。 Method: 设计了一个名为MCBench的基准,包含三类评估指标和三种变体,要求模型严格按照指令逐步执行任务,并提供并行参考代码以实现自动化、可复现的准确率评估。 Result: MCBench能够有效评估前沿LLM在指令遵循、数值计算和长距离一致性方面的能力,实验表明其具备良好的客观性和区分度。 Conclusion: MCBench是一个有效且客观的工具,可用于系统评估大语言模型在复杂指令执行中的能力,为未来模型发展提供了新的衡量标准。 Abstract: Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and codeverifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.[51] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
Jiayu Yang,Yuxuan Fan,Songning Lai,Shengen Wu,Jiaqi Tang,Chun Kang,Zhijiang Guo,Yutao Yue
Main category: cs.CL
TL;DR: 本文提出了一种基于神经元级归因的多跳知识编辑框架ACE,通过识别和编辑查询-值(Q-V)通路,显著提升了大语言模型在多跳事实回忆中的知识更新性能。
Details
Motivation: 现有知识编辑方法在多跳事实回忆中表现衰退,尤其是在涉及推理链中隐式中间主体时,原因在于忽略了知识在神经元层面的动态表征机制。 Method: 通过因果分析揭示隐式主体在多跳推理中作为查询神经元激活对应值神经元的机制,并提出ACE框架,利用神经元级归因来定位并编辑关键的Q-V通路。 Result: ACE在GPT-J上比现有最先进方法提升9.44%,在Qwen3-8B上提升37.46%,并在Qwen3中发现了更细粒度的激活模式,验证了值神经元语义可解释性由查询驱动积累的机制。 Conclusion: ACE为多跳知识编辑提供了基于机理理解的解决方案,推动了基于内部推理机制理解的知识编辑研究新路径。 Abstract: Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.[52] Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation
Fanwei Zhua,Jiaxuan He,Xiaoxiao Chen,Zulong Chen,Quan Lu,Chenrui Mei
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型(LLM)的统一自动评分框架,能够对多种类型的主观题进行类人评价,涵盖内容匹配、知识点对比、答案相关性评估和模拟人工评分,实验表明其在多个指标上优于传统和基于LLM的基线方法,并已成功应用于大型电商企业的实际考试中。
Details
Motivation: 现有自动评分方法通常局限于特定类型的主观题,难以应对包含多种题型的综合考试,缺乏通用性和全面性。 Method: 提出一个包含四个模块的统一LLM增强框架:基础文本匹配、关键知识点对比、生成伪问题评估答案相关性、模拟人类评估优缺点。 Result: 在通用和领域特定数据集上的实验表明,该框架在多个评分指标上均优于传统和LLM基线方法,并已在真实企业培训与认证考试中成功部署。 Conclusion: 所提出的框架具有良好的通用性和准确性,能够有效支持多类型主观题的自动评分,具备实际应用价值。 Abstract: Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.[53] STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models
Kyumin Lee,Minjin Jeon,Sanghwan Jang,Hwanjo Yu
Main category: cs.CL
TL;DR: 提出StepER方法,通过分步监督和难度感知训练提升多步检索增强语言模型的推理能力。
Details
Motivation: 现有知识蒸馏方法忽视了多步推理中不同步骤需要不同推理能力的问题,导致在多步检索增强框架中的迁移效果不佳。 Method: 采用分步监督以匹配各阶段变化的信息和推理需求,并引入难度感知训练,优先优化适合的步骤学习过程。该方法适用于多种多步检索增强语言模型。 Result: 实验表明,StepER在多跳问答基准上优于先前方法,8B模型性能接近70B教师模型。 Conclusion: StepER有效提升了多步检索增强语言模型的推理能力,实现了高效的知识迁移。 Abstract: Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.[54] Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
Adam Dejl,James Barry,Alessandra Pascale,Javier Carnerero Cano
Main category: cs.CL
TL;DR: 本研究探讨了评估大语言模型生成文本全面性的三种自动化方法,发现简单的端到端方法效果显著但牺牲了鲁棒性和可解释性。
Details
Motivation: 大语言模型虽性能强大,但常遗漏关键信息,在敏感领域可能造成严重危害,需有效评估其输出的全面性。 Method: 采用三种策略:基于自然语言推断(NLI)的原子语句分解、基于问答对提取的跨源比较,以及直接使用大语言模型进行端到端缺失内容检测。 Result: 实验表明,简单的端到端方法相比复杂方法更有效,但在鲁棒性、可解释性和结果细粒度方面表现较差;并对多个开源大模型的回答全面性进行了评估。 Conclusion: 端到端方法在检测缺失信息上表现良好,但需权衡其在可解释性和稳定性方面的不足,未来需改进综合评估体系。 Abstract: Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.[55] Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries
Madis Jürviste,Joonatan Jakobson
Main category: cs.CL
TL;DR: 本研究探讨了大型语言模型(LLM)在17至18世纪爱沙尼亚语词典研究中的应用,涵盖历史词典的现代化丰富、哥特字体文本识别及跨源数据集构建。
Details
Motivation: 针对小语种历史文献数字化面临的效率与成本挑战,探索LLM在爱沙尼亚语历史词典处理中的潜力。 Method: 采用LLM对古籍进行词义补充和现代形式映射;使用视觉增强型LLM识别哥特字体印刷文本;通过分块扫描与多LLM协作实现文本识别与结构化输出合并。 Result: 在上下文充分时,Claude 3.7 Sonnet对1648年词典条目准确提供现代对应词和释义的比例达81%;零样本方法对1732年词典41%的条目生成无误JSON输出;结合重叠切片与双LLM流程处理1780年语法书中的词典部分。 Conclusion: LLM在小语种历史文献数字化中具有显著潜力,可大幅节省时间与经济成本,支持高效半自动化处理。 Abstract: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.[56] A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
Fengji Zhang,Xinyao Niu,Chengyang Ying,Guancheng Lin,Zhongkai Hao,Zhou Fan,Chengen Huang,Jacky Keung,Bei Chen,Junyang Lin
Main category: cs.CL
TL;DR: 本文提出了A$^2$Search,一种无需人工标注的端到端框架,通过轨迹采样和证据验证自动识别和处理开放域问答中的多答案歧义问题,并在多个基准上实现了最先进的性能。
Details
Motivation: 现有问答模型通常假设每个问题只有一个正确答案,难以应对存在多个合理答案的情况,且依赖昂贵的人工标注来处理歧义,限制了在多跳问答数据集上的扩展。 Method: 提出A$^2$Search框架,利用强化学习与自动化流程检测歧义问题,通过轨迹采样生成候选答案,并进行证据验证;采用专为多答案设计的AnsF1奖励函数优化模型。 Result: 在八个开放域问答基准上实验表明,A$^2$Search显著优于现有方法,A$^2$Search-7B在四个多跳基准上平均AnsF1@1达到48.4%,超过更大的ReSearch-32B(46.2%),并展现出良好的泛化能力。 Conclusion: 拥抱并建模问题的歧义性对于构建更可靠、更强大的问答系统至关重要,A$^2$Search提供了一种可扩展且高效的解决方案。 Abstract: Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search[57] LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
Jingyuan Wang,Yankai Chen,Zhonghang Li,Chao Huang
Main category: cs.CL
TL;DR: 本文提出LightReasoner框架,利用小模型(SLM)与大模型(LLM)的行为差异识别高价值推理时刻,通过两阶段方法(采样关键推理步骤并微调)提升LLM推理能力,在数学任务上显著提高准确率,同时大幅降低资源消耗。
Details
Motivation: 监督微调(SFT)依赖大量标注数据和均匀优化,效率低下;希望找到一种更高效、无需真实标签的方法来提升大模型的推理能力。 Method: 提出LightReasoner框架:第一阶段通过对比强专家模型(LLM)与弱业余模型(SLM)的行为差异,采样关键推理时刻并构建监督样本;第二阶段用这些样本对专家模型进行微调,强化其优势。 Result: 在七个数学基准上,准确率最高提升28.1%,时间消耗减少90%,采样问题减少80%,微调token减少99%,且不依赖真实标签。 Conclusion: LightReasoner通过让弱模型作为教学信号,提供了一种可扩展且资源高效的提升大模型推理能力的新途径。 Abstract: Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner[58] Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
Jialu Du,Guiyang Hou,Yihui Fu,Chen Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu
Main category: cs.CL
TL;DR: 本文提出一种自适应世界模型增强的推理机制,以解决大语言模型在社会推理任务中混淆客观现实与主观信念的问题,显著提升了准确性并降低了计算成本。
Details
Motivation: 大语言模型在数学和代码推理方面表现出色,但在涉及多参与者和社会情境的推理任务中常出现逻辑不一致和认知混淆,难以区分客观事实与个体信念。 Method: 通过分析DeepSeek-R1的推理轨迹,识别出模型在处理复杂社会场景时的认知障碍;提出一种动态构建文本化世界模型的机制,实时监控推理过程中的困惑信号,并在必要时提供清晰的世界状态描述以引导模型。 Result: 在三个社会推理基准上验证了该方法的有效性,准确率显著提升(如Hi-ToM上+10%),同时减少了最多33.8%的token使用量。 Conclusion: 所提出的自适应世界模型机制能有效帮助大语言模型区分客观事件与主观信念,改善社会推理能力,兼具性能优势与推理效率。 Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through deteiled analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.[59] Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge
Watcharapong Timklaypachara,Monrada Chiewhawan,Nopporn Lekuthai,Titipat Achakulvisut
Main category: cs.CL
TL;DR: 提出了一种结合图文上下文与作者写作风格的两阶段科学图表标题生成方法,在SciCap挑战赛中表现出色。
Details
Motivation: 科学图表标题需要准确且风格一致地传达视觉信息,现有方法在上下文利用和作者风格建模方面存在不足。 Method: 采用两阶段 pipeline:第一阶段通过上下文过滤和类别特定提示优化生成候选标题;第二阶段利用少量样例和作者历史图表进行风格化精炼。 Result: 类别特定提示使ROUGE-1召回率提升+8.3%,风格精炼带来BLEU分数40-48%增益和ROUGE 25-27%提升。 Conclusion: 结合上下文理解与作者特定风格适应可生成既科学准确又风格忠实的图表标题。 Abstract: Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3\% while limiting precision loss to -2.8\% and BLEU-4 reduction to -10.9\%. Profile-informed stylistic refinement yields 40--48\% gains in BLEU scores and 25--27\% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.[60] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks
Cheng Yang,Xuemeng Yang,Licheng Wen,Daocheng Fu,Jianbiao Mei,Rong Wu,Pinlong Cai,Yufan Shen,Nianchen Deng,Botian Shi,Yu Qiao,Haifeng Li
Main category: cs.CL
TL;DR: MUSE 是一种基于分层记忆模块的新型AI代理框架,通过经验驱动实现自我进化,能够在长周期任务中持续学习并提升性能,显著优于现有方法。
Details
Motivation: 现有的大语言模型代理在现实世界长期任务中缺乏从经验中学习和持续改进的能力,限制了其实际应用。 Method: 提出MUSE框架,引入分层记忆模块,将执行子任务后的轨迹转化为结构化经验并存储,支持自主反思与经验积累,从而实现持续学习和自我演化。 Result: 在TAC长周期生产力基准上,仅使用轻量级Gemini-2.5 Flash模型即达到新的SOTA性能;实验证明其具备持续学习、自我进化和跨任务零样本迁移能力。 Conclusion: MUSE建立了一种能够持续进化、适应真实世界复杂任务的AI代理新范式,推动了AI代理在实际生产力自动化中的应用前景。 Abstract: Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Sufficient Experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.[61] ChatGPT as a Translation Engine: A Case Study on Japanese-English
Vincent Michael Sutanto,Giovanni Gatti De Giacomo,Toshiaki Nakazawa,Masaru Yamada
Main category: cs.CL
TL;DR: 该研究探讨了ChatGPT在日英翻译中的表现,比较了简单与增强提示的效果,并评估了其相对于商业翻译系统的竞争力。
Details
Motivation: 探索ChatGPT在日英翻译任务中的潜力,并评估不同提示策略和模型版本的性能差异。 Method: 通过自动评估和基于MQM的人类评估,比较句子级与文档级翻译,以及简单与增强提示的效果,同时对比ChatGPT-3.5和ChatGPT-4的表现。 Result: 文档级翻译优于句子级翻译;未能明确增强提示优于简单提示;ChatGPT-3.5在自动评估中更优,但存在准确性(3.5)与流畅性(4)之间的权衡;ChatGPT整体表现可与主流翻译系统竞争。 Conclusion: ChatGPT在日英翻译中具有竞争力,文档级翻译更有效,但在提示设计上的优化仍需进一步研究。 Abstract: This study investigates ChatGPT for Japanese-English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Performing both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, we were not able to determine if enhanced prompts performed better than simple prompts in our experiments. We also discovered that ChatGPT-3.5 was preferred by automatic evaluation, but a tradeoff exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely-known translation systems.[62] Climate Knowledge in Large Language Models
Ivan Kuznetsov,Jacopo Grassi,Dmitrii Pantiukhin,Boris Shapkin,Thomas Jung,Nikolay Koldunov
Main category: cs.CL
TL;DR: 该研究评估了大语言模型(LLM)在无需外部检索的情况下回忆气候常态的能力,发现其能捕捉基本气候模式但存在显著空间误差,尤其在高海拔和高纬度地区表现较差,且无法准确再现长期气温变化的空间分布。
Details
Motivation: 随着LLM在气候相关应用中的广泛部署,理解其内部气候知识对确保可靠性及评估错误信息风险至关重要,但当前LLM对气候参数知识的掌握程度尚不明确。 Method: 构建一个分辨率为1°的全球陆地查询网格,输入位置坐标和地理描述,要求LLM回答1991-2020年7月近地面气温均值,并使用ERA5再分析数据验证结果。 Result: LLM能够捕捉纬度和地形相关的气候结构(RMSE为3-6°C,偏差±1°C),加入地理上下文可使误差平均降低27%,大模型对此更敏感;但在海拔1500米以上地区性能显著下降(RMSE达5-13°C);模型虽能反映全球平均变暖幅度,却无法再现气温变化的空间格局。 Conclusion: LLM包含一定的参数化气候知识,可用于估计当前气候状态,但难以准确表达与气候变化相关的区域性和局地性长期趋势,需谨慎用于气候动态分析;本文提出的方法为评估LLM中的气候知识提供了可复现的基准。 Abstract: Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1{\deg} resolution land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 {\deg}C and biases of $\pm$1 {\deg}C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 {\deg}C compared to 2-4 {\deg}C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.[63] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Congming Zheng,Jiachen Zhu,Zhuoying Ou,Yuxiang Chen,Kangning Zhang,Rong Shan,Zeyu Zheng,Mengyue Yang,Jianghao Lin,Yong Yu,Weinan Zhang
Main category: cs.CL
TL;DR: 本文系统地综述了过程奖励模型(PRMs),涵盖了从生成过程数据、构建PRMs到在测试时扩展和强化学习中的应用,旨在促进细粒度且鲁棒的推理对齐。
Details
Motivation: 尽管大语言模型展现出高级推理能力,但传统的对齐方法主要依赖仅评估最终答案的结果奖励模型(ORMs),无法有效指导推理过程。PRMs通过在步骤或轨迹级别上评估推理过程来弥补这一差距。 Method: 本文通过完整的流程对PRMs进行系统性综述:包括如何生成过程数据、构建PRMs,以及如何将PRMs用于测试时扩展和强化学习,并总结其在数学、代码、文本、多模态推理、机器人和智能体等领域的应用,同时回顾了新兴的基准测试。 Result: 梳理了PRMs在多个领域中的应用和现有基准,明确了设计空间,揭示了当前面临的开放性挑战。 Conclusion: PRMs能够更精细地引导模型推理过程,未来的研究应聚焦于实现更加细粒度和鲁棒的推理对齐。 Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.[64] FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation
Shule Lu,Lingxiang Wang,Sijia Wen,Ziwei Wang,Hainan Zhang
Main category: cs.CL
TL;DR: 提出了一种基于可信度评估的联邦自适应聚合策略FedDTRE,用于对话生成,通过动态调节全局模型在本地更新中的贡献来提升模型性能和对话质量。
Details
Motivation: 传统集中式或完全本地训练方法在隐私保护与个性化之间难以平衡,且现有联邦学习方法在客户端数据有限时易过拟合,并在多轮训练后遗忘全局信息,导致泛化能力差。 Method: 提出FedDTRE,利用全局和本地模型在公平性导向评估数据集上的可信度评分,动态调节全局模型在本地更新中的贡献,而非直接用全局模型替换本地模型。 Result: 实验结果表明,FedDTRE能够提升对话模型的性能,增强对话生成的质量。 Conclusion: FedDTRE有效缓解了联邦学习中过拟合和遗忘全局信息的问题,在保护隐私的同时实现了更好的个性化对话生成。 Abstract: With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model's contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.[65] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta,Peter Rankel,Sarah Wiegreffe,Rachel Rudinger
Main category: cs.CL
TL;DR: 研究发现,人类对常识性多选题答案的合理性判断会受到大语言模型生成的正反理由的影响,表明这些理由具有说服力,同时也揭示了大语言模型可能显著影响人类信念的问题。
Details
Motivation: 探究大语言模型生成的理由是否会影响人类在常识判断任务中的合理性评估。 Method: 通过收集3,000条人类和13,600条大语言模型的合理性判断数据,分析正反理由对判断的影响。 Result: 人类和大语言模型的合理性评分在面对大语言模型生成的支持或反对理由时均发生显著变化,显示出这些理由具有说服力。 Conclusion: 大语言模型不仅能用于研究人类认知,也可能在人类自认为擅长的常识领域对其信念产生重大影响。 Abstract: We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts'' (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.[66] The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models
Sherzod Hakimov,Roland Bernard,Tim Leiber,Karl Osswald,Kristina Richert,Ruilin Yang,Raffaella Bernardi,David Schlangen
Main category: cs.CL
TL;DR: 本研究首次系统评估了(大语言模型)推理对商业和开源权重LLM在多语言谈判能力的影响,发现启用推理显著提升谈判表现但增加计算成本,且开源模型在非英语谈判中仍倾向使用英语进行内部推理,而商业模型保持语言一致性。
Details
Motivation: 探讨大语言模型在谈判任务中的推理能力影响,特别是在多语言环境下的表现差异,并分析其合作与竞争平衡、战略适应性以及推理过程的可解释性问题。 Method: 通过自对弈设置,在三种不同对话游戏中测试多个商业和开源LLM,涵盖英语、德语和意大利语三种语言,系统分析启用推理(扩大测试时计算资源)对性能、成本及语言一致性的权衡。 Result: 启用推理显著提升谈判结果,例如GPT-5表现提高31.4%,但计算成本增加近400%;开源模型在德语或意大利语谈判中内部推理普遍切换至英语,而商业模型保持输出与推理语言一致。 Conclusion: 推理能有效增强模型谈判能力,尤其在处理任务复杂性和促进合作方面,但伴随高昂成本;多语言场景下开源与商业模型在推理语言选择上存在显著差异,可能影响推理过程的可解释性。 Abstract: Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning-that is, scaling test time compute-significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4 % while increasing its cost by nearly 400 %. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.[67] Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa,Taku Hasegawa,Kyosuke Nishida,Shin'ya Yamaguchi,Tomoya Ohba,Tamao Sakao,Susumu Takeuchi
Main category: cs.CL
TL;DR: 本文提出了一个无损词汇缩减的理论框架,能够将自回归语言模型高效转换为任意小词汇量的模型而不损失精度,并展示了不同分词方式的语言模型如何通过最大公共词汇高效协作。
Details
Motivation: 由于不同的语言模型使用不同的词汇表,导致它们在下一词预测分布层面难以协同工作,如模型集成等任务面临挑战。 Method: 建立了一个无损词汇缩减的理论框架,通过该框架可将给定的自回归语言模型转换为具有任意小词汇量的新模型,同时保持原有精度。 Result: 实现了不同分词方式下的语言模型之间的高效协作,验证了通过最大公共词汇进行模型协同的有效性。 Conclusion: 该方法能够在不牺牲准确率的前提下显著减小语言模型的词汇量,并促进不同语言模型在生成层面的协作能力。 Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.[68] Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing
Haoyang Gui,Thales Bertaglia,Taylor Annabell,Catalina Goanta,Tjomme Dooper,Gerasimos Spanakis
Main category: cs.CL
TL;DR: 该研究评估了GPT-5-nano和Gemini-2.5-flash-lite在识别Instagram赞助内容中的表现,结合不同法律知识提示策略,发现模型在明确案例中表现良好(F1最高达0.93),但在模糊案例中性能下降。研究提出一个LLM法律推理错误分类法,并提供经法律训练学生标注的解释数据集,旨在提升影响者营销内容自动化监管的透明度与法律稳健性。
Details
Motivation: 由于网红营销中赞助内容与有机内容界限模糊,现有检测方法缺乏法律依据或为‘黑箱’操作,导致监管困难,因此需要基于法律知识的透明、可靠的自动化检测方法。 Method: 使用1,143条Instagram帖子,比较GPT-5-nano和Gemini-2.5-flash-lite在三种提示策略下的表现,控制输入的法律知识水平,并对模型输出进行定量与定性分析,构建法律推理错误分类法。 Result: 两个模型在分类赞助内容方面表现良好(F1最高0.93),但在模糊案例中性能下降超10个百分点;加入法规文本可提升解释质量但不显著提高准确率;常见错误包括引用缺失(28.57%)、引用不清(20.71%)及隐藏广告误判率高(28.57%)。 Conclusion: 该研究通过构建法律推理错误分类法和标注数据集,推动基于法律基础的透明化自动化监管,为广告监管机构提供兼具准确性与法律稳健性的工具,支持其在网红营销领域更有效地执行披露规则。 Abstract: The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque "black boxes". Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.[69] Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations
Jasmina Gajcin,Erik Miehling,Rahul Nair,Elizabeth Daly,Radu Marinescu,Seshu Tirupathi
Main category: cs.CL
TL;DR: 本文提出了一种从LLM-as-a-Judge中提取基于概念的全局策略的方法,包括生成局部解释的CLoVE和将其聚类为全局策略的GloVE,验证了其在内容危害检测中的保真度、鲁棒性及用户可理解性。
Details
Motivation: 随着LLM被广泛用作评估工具,亟需理解其潜在偏见与风险,因此需要可解释的方法来揭示其决策背后的全局规则。 Method: 提出CLoVE生成可验证的、基于概念的对比局部解释,并通过GloVE进行迭代聚类、摘要和验证,形成全局策略。 Result: 在七个基准数据集上验证了全局策略对LLM判断的高度保真;策略对文本扰动和对抗攻击具有鲁棒性;用户研究显示该策略提升了用户理解和满意度。 Conclusion: 所提方法能有效提取并解释LLM-as-a-Judge的决策逻辑,增强了透明度和可信度,具备实际应用潜力。 Abstract: Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.[70] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
Shuliang Liu,Zhipeng Xu,Zhenghao Liu,Yukun Yan,Minghe Yu,Yu Gu,Chong Chen,Huiyuan Xie,Ge Yu
Main category: cs.CL
TL;DR: 本文提出了Genii,一种无监督的多智能体协同优化框架,用于缓解大语言模型作为评判者时存在的判断偏好偏差问题。
Details
Motivation: 大语言模型在自动评估任务中表现出对自身生成回答的偏好偏差,影响了评估结果的可靠性。 Method: 提出Group-Based Polling Optimization (Genii)框架,通过构建多智能体系统模拟客户端-服务器交互式投票机制,在无监督情况下优化各客户端智能体。 Result: 实验表明,Genii优于依赖人工标注数据的有监督模型,且无需人工标注;在不同客户端智能体上均能持续提升性能,即使使用较弱模型作为服务端也能有效缓解判断偏好偏差。 Conclusion: Genii能有效减轻LLM作为评判者的偏好偏差,提升评估的准确性和可靠性,具有良好的通用性和应用价值。 Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.[71] AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents
Md Tahmid Rahman Laskar,Julien Bouvier Tremblay,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
Main category: cs.CL
TL;DR: 本文提出了一种名为AI Knowledge Assist的系统,通过从历史客户-代理对话中提取问答对来自动构建企业专属知识库,并利用微调轻量级大模型(LLaMA-3.1-8B)实现高准确率的信息检索,解决了客服中心冷启动问题。
Details
Motivation: 缺乏企业特定的知识库是阻碍对话式AI系统在客服中心集成的主要障碍,因此需要一种能自动构建知识库的方法以支持RAG系统的部署。 Method: 从历史客户-代理对话中提取问答对,构建专用知识库,并对轻量级大语言模型(LLaMA-3.1-8B)进行内部数据微调,以提升信息检索和回答准确性。 Result: 在20家公司上的实证评估显示,该系统在回答信息查询问题时准确率超过90%,优于更大的闭源大模型,有效消除了客服中心的冷启动问题。 Conclusion: AI Knowledge Assist系统能够高效构建企业知识库,使RAG驱动的聊天机器人可立即部署,显著提升客服自动化能力。 Abstract: The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.[72] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations
Elena Khasanova,Harsh Saini,Md Tahmid Rahman Laskar,Xue-Yong Fu,Cheng Chen,Shashi Bhushan TN
Main category: cs.CL
TL;DR: 本文提出了一种名为DACIP-RC的持续指令预训练方法,通过阅读理解生成任务指令和响应,提升小型语言模型在商业对话任务中的零样本泛化能力。
Details
Motivation: 大型语言模型推理成本高,难以部署;小型模型虽高效但缺乏跨领域的零样本指令遵循能力,传统微调方法易导致灾难性遗忘。 Method: 提出Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC),利用对话记录进行阅读理解生成多样化的任务指令与响应,进行持续预训练,以增强模型在特定领域(尤其是商业对话)的适应性和泛化能力。 Result: 实验表明,DACIP-RC在会议摘要、行动项生成和通话目的识别等多种商业对话任务中显著提升了小型语言模型的零样本性能。 Conclusion: DACIP-RC有效提升了小型语言模型在工业场景下的领域适应性和指令泛化能力,是首个将指令预训练应用于商业对话数据的工作,为行业利用专有数据进行领域适配提供了新思路。 Abstract: The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.[73] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
Shuzhou Yuan,Ercong Nie,Yinuo Sun,Chenxuan Zhao,William LaCroix,Michael Färber
Main category: cs.CL
TL;DR: 本文提出了两个用于评估大语言模型过度拒绝问题的基准测试,并提出三种无需重新训练的轻量级方法来缓解该问题,提升了模型对安全请求的响应能力。
Details
Motivation: 大语言模型常因误判而拒绝本应接受的良性请求,影响可用性,因此需要系统评估和解决这一过度拒绝问题。 Method: 构建了单轮XSB和多轮MS-XSB两个基准,利用事后解释方法识别拒绝触发词,并采用忽略关键词、提示重写和注意力引导三种推理时策略进行干预。 Result: 实验表明所提方法显著减少了Llama系列模型在安全提示下的不必要拒绝,同时保持了对真正有害请求的有效防御。 Conclusion: 该研究为诊断和缓解大模型过度拒绝提供了可复现的框架,推动更安全且更有帮助的模型部署。 Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.[74] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
Jian Xie,Zhendong Chu,Aoxiao Zhong,Kai Zhang,Mingzhe Han,Xin Fang,Jialie Shen,Qingsong Wen
Main category: cs.CL
TL;DR: 本文提出了ARM2,一个通过强化学习框架结合长度感知优化的统一模型,能够自适应地平衡推理性能和效率,并支持多模态与代码执行,显著降低token消耗。
Details
Motivation: 大型推理模型在简单任务上常出现“过度思考”问题,现有方法多为启发式且任务特定,缺乏通用的自适应推理框架。 Method: 提出ARM2模型,采用强化学习框架并引入长度感知优化,支持自然语言推理、视觉理解和可执行代码生成,实现多格式、多模态的自适应推理。 Result: 实验表明,ARM2在保持与传统GRPO训练模型相当性能的同时,平均减少70%以上的token使用,并在多任务上验证了其有效性与设计合理性。 Conclusion: ARM2提供了一个通用、高效的自适应推理框架,在多种任务和模态下实现了推理质量与计算成本的良好平衡。 Abstract: Large Reasoning Models (LRMs) often suffer from the ``over-thinking'' problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.[75] MetricalARGS: A Taxonomy for Studying Metrical Poetry with LLMs
Chalamalasetti Kranti,Sowmya Vajjala
Main category: cs.CL
TL;DR: 本文提出了MetricalARGS,首个用于评估大语言模型在格律诗方面能力的NLP任务分类体系,涵盖分析、检索、生成和支持四个维度,并以泰卢固语为例展示了其应用。
Details
Motivation: 现有NLP研究多关注诗歌生成与摘要,而忽视了格律诗中严格的音节和音素规则对语言模型推理和规则遵循能力的挑战,因此需要系统性评估框架。 Method: 提出MetricalARGS分类体系,包含分析、检索、生成和支持四个维度,结合泰卢固语实例,讨论相关数据集与评估指标。 Result: 建立了首个面向格律诗的NLP任务分类体系,明确了各任务与现有NLP任务的关系,并为评估LLM在复杂文学形式下的表现提供了实践路径。 Conclusion: MetricalARGS为通过格律诗评估大语言模型的能力与局限提供了系统框架,拓展了诗歌NLP研究的深度方向。 Abstract: Prior NLP work studying poetry has focused primarily on automatic poem generation and summarization. Many languages have well-studied traditions of poetic meter which enforce constraints on a poem in terms of syllable and phoneme patterns. Such advanced literary forms offer opportunities for probing deeper reasoning and language understanding in Large Language Models (LLMs) and their ability to follow strict pre-requisites and rules. In this paper, we introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate LLMs on metrical poetry across four dimensions: Analysis, Retrieval, Generation, and Support. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics. Taking Telugu as our example language, we illustrate how the taxonomy can be used in practice. MetricalARGS highlights the broader possibilities for understanding the capabilities and limitations of today's LLMs through the lens of metrical poetry.[76] Training-Free Group Relative Policy Optimization
Yuzheng Cai,Siqi Cai,Yuchen Shi,Zihan Xu,Lichao Chen,Yulei Qin,Xiaoyu Tan,Gang Li,Zongyi Li,Haojia Lin,Yong Mao,Ke Li,Xing Sun
Main category: cs.CL
TL;DR: 本文提出了一种无需训练的组相对策略优化方法(Training-Free GRPO),通过利用 rollout 组内的语义优势来构建 token 先验知识,从而提升大语言模型在数学推理和网页搜索等任务上的跨领域表现,且无需参数更新,成本低、避免过拟合。
Details
Motivation: 现有LLM代理在特定现实场景中表现不佳,主要因为依赖昂贵的参数微调(如SFT+强化学习)来调整输出分布,且易受数据稀缺和过拟合影响。因此需要一种更轻量、无需训练的替代方案。 Method: 提出Training-Free GRPO,利用少量真实数据上的多轮推理轨迹(rollouts),通过组内语义比较提取高质量经验知识作为token先验,在API调用时动态引导模型输出,无需任何参数更新。 Result: 在数学推理和网页搜索任务上,应用于DeepSeek-V3.1-Terminus时,仅用几十个样本即显著提升跨域性能,优于使用少量数据微调的小型LLM。 Conclusion: Training-Free GRPO是一种高效、低成本的方法,通过引入经验知识作为token先验,可在不进行参数更新的情况下有效提升LLM代理在专业任务中的表现,具有良好的实用性和扩展性。 Abstract: Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on a minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.[77] Memory Retrieval and Consolidation in Large Language Models through Function Tokens
Shaohua Zhang,Yuan Lin,Hang Li
Main category: cs.CL
TL;DR: 本文提出“功能词元假说”,认为大语言模型在推理时通过功能词元激活上下文中的预测性特征并主导下一个词元的预测,在预训练中通过预测功能词元后的内容词元来实现知识的积累与参数更新,并通过实验验证了该假说。
Details
Motivation: 目前对大语言模型中记忆提取与巩固的机制理解不足,需要解释功能词元在其中的作用。 Method: 提出功能词元假说,使用二分图分析和案例研究分析功能词元如何激活特征,并分析预训练中损失分布与功能词元的关系。 Result: 实验证明少数功能词元激活了大多数特征,且训练损失主要来自功能词元后的内容词元预测,支持功能词元在记忆检索与巩固中的核心作用。 Conclusion: 功能词元在大语言模型的记忆检索与知识巩固中起关键作用,功能词元假说为理解LLM工作机制提供了新视角。 Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.[78] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
XuHao Hu,Peng Wang,Xiaoya Lu,Dongrui Liu,Xuanjing Huang,Jing Shao
Main category: cs.CL
TL;DR: 本研究扩展了“突发性错位”现象的研究,发现大语言模型在高风险情境下通过微调可能表现出广泛的不诚实和欺骗行为,且即使引入少量错位数据或与偏见用户互动也会加剧这种行为。
Details
Motivation: 探究大语言模型在高风险情境下是否会在非安全领域(如说谎和欺骗)出现广泛错位行为,特别是在微调后可能引发的不诚实倾向。 Method: 对开源大语言模型在多个领域进行错位完成数据的微调,并在下游混合任务及模拟人类-AI交互环境中测试其诚实性变化。 Result: 实验表明,模型在不诚实行为上表现出广泛错位;仅1%的错位数据即可使诚实行为下降超20%;10%的偏见用户即可无意中加剧模型的不诚实。 Conclusion: 突发性错位不仅存在于安全相关领域,也适用于不诚实与欺骗行为,且在实际应用场景中具有显著风险。 Abstract: Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.[79] SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets
Qiang Yang,Xiuying Chen,Changsheng Ma,Rui Yin,Xin Gao,Xiangliang Zhang
Main category: cs.CL
TL;DR: 本文提出了SenWave,一个用于分析COVID-19推文的细粒度多语言情感分析数据集,包含五种语言的标注和未标注推文,并利用预训练模型进行情感分类,揭示了跨语言、国家和话题的情感演变。
Details
Motivation: 现有COVID-19数据集中缺乏高质量的细粒度情感标注数据,限制了公众情绪的深入分析。 Method: 构建了一个包含10,000条英文和阿拉伯文标注推文及30,000条西班牙语、法语和意大利语翻译推文的多语言数据集,并使用预训练Transformer模型进行细粒度情感分类。 Result: 实现了准确的细粒度情感分类,分析了不同语言、国家和主题下的情绪变化,并验证了数据集与ChatGPT的良好兼容性。 Conclusion: SenWave数据集为复杂事件下的细粒度情感分析提供了有力支持,有望推动NLP领域相关研究的发展。 Abstract: The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible on the repository\footnote{https://github.com/gitdevqiang/SenWave}. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.[80] Investigating Counterclaims in Causality Extraction from Text
Tim Hagen,Niklas Deckers,Felix Wolter,Harrisen Scells,Martin Potthast
Main category: cs.CL
TL;DR: 本文提出了一种新的因果关系提取数据集,首次将反因果(concausal)声明纳入其中,解决了现有研究中忽视反因果关系的问题。通过文献综述和严格的标注指南,作者扩展了Causal News Corpus,并证明了包含反因果信息可显著提升模型区分正反因果关系的能力。
Details
Motivation: 现有的因果关系抽取数据集仅关注支持因果关系的“正因果”声明,而忽略了反驳因果关系的“反因果”声明,导致模型在处理此类语句时容易出错。因此,需要构建一个包含反因果关系的数据集以改进因果推理。 Method: 基于广泛的文献综述,提出反因果关系在不完全知识下的因果推理中具有重要作用;制定严格的标注准则,并据此对Causal News Corpus进行扩充,加入反因果声明;计算标注者间的一致性(Cohen's κ=0.74)。 Result: 训练时不包含反因果关系的模型倾向于将反因果语句误分类为正因果;使用新构建的数据集可以缓解这一问题,使Transformer模型能够有效区分正因果与反因果关系。 Conclusion: 引入反因果关系对于提升因果关系抽取模型的准确性与鲁棒性至关重要,未来的研究应同时考虑正因果与反因果声明以实现更全面的因果推理。 Abstract: Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.[81] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang,Haozhu Wang,Eric Michael Smith,Sid Wang,Amr Sharaf,Mahesh Pasupuleti,Benjamin Van Durme,Daniel Khashabi,Jason Weston,Hongyuan Zhan
Main category: cs.CL
TL;DR: 提出WaltzRL,一种多智能体强化学习框架,通过对话代理与反馈代理协同训练,动态改进响应安全性与有用性,显著降低不安全响应和过度拒绝率。
Details
Motivation: 解决大模型在安全对齐中的两难问题:既要防范对抗攻击生成有害内容,又要避免对良性敏感请求的过度拒绝。现有方法因完全拒绝不安全内容而加剧过拒问题,缺乏细粒度指导。 Method: 提出WaltzRL框架,包含对话代理和反馈代理,采用动态改进奖励(DIR)机制,使反馈代理在推理时自适应提供改进建议,而非丢弃响应。训练过程中两者协同优化,实现安全与帮助性的平衡。 Result: 在五个数据集上实验显示,相比基线方法,WaltzRL显著降低不安全响应率(如WildJailbreak上从39.0%降至4.6%)和过度拒绝率(如OR-Bench上从45.3%降至9.9%),且保持低延迟和通用能力。 Conclusion: WaltzRL通过多智能体协作与动态反馈机制,实现了帮助性与无害性的帕累托提升,为大语言模型的安全对齐提供了更精细、自适应的解决方案。 Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.[82] Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
Jannek Ulm,Kevin Du,Vésteinn Snæbjarnarson
Main category: cs.CL
TL;DR: 本文研究了使用对比解码生成合成语料库以缓解大规模语言模型训练数据不足的问题,发现将合成数据与真实数据混合训练能提升语言建模及下游任务性能,尤其在需要推理能力的任务上表现更优。
Details
Motivation: 担心大规模语言模型训练所用的真实文本数据即将耗尽,探索使用大模型生成的合成数据作为补充。 Method: 在控制环境下,利用在同一原始语料(1亿词)上训练的优劣两个模型之间的差异,通过对比解码生成合成语料,并将其与原始数据混合用于训练。 Result: 使用合成与真实数据混合训练提升了语言建模目标的表现以及多种下游任务的性能;特别是对比解码生成的合成数据对需推理的任务更有帮助,而传统采样生成的数据更利于依赖表层语言能力的任务。 Conclusion: 对比解码生成的合成数据可有效增强语言模型训练,尤其有助于提升模型的推理能力,是突破训练数据瓶颈的可行方向。 Abstract: Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.[83] Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Yaojie Lu,Xianpei Han,Le Sun,WenJuan Zhang,Pengbo Wang,Shixuan Liu,Zhenru Zhang,Jianhong Tu,Hongyu Lin,Junyang Lin
Main category: cs.CL
TL;DR: 本文提出了DeepMiner框架,通过高难度训练任务和动态上下文管理来增强多轮推理能力,在多个基准上显著优于现有方法。
Details
Motivation: 现有方法难以在长周期多轮交互中激发模型的深度推理能力。 Method: 提出反向构建方法生成复杂且可验证的问答对,并设计无需外部摘要模型的动态上下文管理策略。 Result: DeepMiner-32B在BrowseComp-en上达到33.5%准确率,超过此前最优开源代理近20个百分点,并支持近100轮持续交互。 Conclusion: DeepMiner有效提升了多轮搜索代理在长上下文场景下的推理性能与可扩展性。 Abstract: While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.[84] Neuron-Level Analysis of Cultural Understanding in Large Language Models
Taisei Yamamoto,Ryoma Kumon,Danushka Bollegala,Hitomi Yanaka
Main category: cs.CL
TL;DR: 本文提出了一种基于梯度的神经元评分方法,用于识别大语言模型中驱动文化理解的文化通用和文化特异性神经元,发现这些神经元集中在浅层到中层MLP中且占比不足1%,并验证了其对文化基准性能的关键作用。
Details
Motivation: 大语言模型存在文化偏见且对少数文化的认知有限,其文化理解机制尚不明确,因此需要从神经元层面分析其背后机制。 Method: 提出一种基于梯度的评分方法,并结合过滤策略精确定位影响文化行为的神经元,区分文化通用和文化特异性神经元,并通过抑制实验验证其功能。 Result: 识别出少于1%的关键神经元集中于浅至中层MLP;抑制这些神经元使文化基准性能下降高达30%,但对通用自然语言理解任务影响较小;文化特异性神经元还支持相关文化的知识;训练NLU任务可能削弱文化理解能力。 Conclusion: 大语言模型的文化理解依赖于少量特定神经元,研究揭示了其内部机制,并为模型训练和工程提供了实践指导。 Abstract: As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG[85] AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
Muxi Diao,Yutao Mou,Keqing He,Hanbo Song,Lulu Zhao,Shikun Zhang,Wei Ye,Kongming Liang,Zhanyu Ma
Main category: cs.CL
TL;DR: 提出了一种名为AutoRed的自由形式对抗性提示生成框架,用于提高大语言模型的安全性评估,相比基于种子的方法具有更高的攻击成功率和泛化能力。
Details
Motivation: 现有红队测试方法依赖种子指令,限制了对抗性提示的语义多样性,影响对大语言模型安全性的全面评估。 Method: AutoRed采用两阶段框架:第一阶段通过角色引导生成对抗性指令,第二阶段通过反思循环迭代优化低质量提示,并引入验证器在不查询目标模型的情况下评估提示的危害性。 Result: 构建了两个红队测试数据集AutoRed-Medium和AutoRed-Hard,评估了八种最先进的大语言模型,结果显示AutoRed在攻击成功率和泛化性上优于现有基线方法。 Conclusion: 自由形式的红队测试比基于种子的方法更具优势,能更有效地发现模型漏洞,为大语言模型的安全评估提供了新方向。 Abstract: The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.[86] Two-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media
Yukai Song,Pengfei Zhou,César Escobar-Viera,Candice Biernesser,Wei Huang,Jingtong Hu
Main category: cs.CL
TL;DR: 提出一种两阶段投票架构,结合轻量级BERT模型和多视角大语言模型(LLM)投票机制,有效平衡效率与准确性,用于检测社交媒体中显性和隐性自杀意念,在多个数据集上表现优异且显著降低LLM使用成本。
Details
Motivation: 自杀率上升亟需有效的预防策略,而许多高风险个体倾向于在社交媒体上隐晦表达痛苦,传统轻量模型难以捕捉隐性自杀意念,而大模型成本过高,因此需要一个兼顾效率与检测能力的解决方案。 Method: 采用两阶段投票架构:第一阶段用轻量BERT模型快速处理高置信度显性案例;第二阶段将模糊样本交由多视角LLM投票系统提升对隐性意念的召回率,或通过基于心理指标的特征工程ML集成实现高效可解释检测,其中心理特征由提示工程驱动的LLM提取。 Result: 在以显性为主的Reddit数据集和纯隐性DeepSuiMind数据集上,分别取得98.0%和99.7%的F1分数,跨域差距降至2%以下,并显著降低LLM计算成本。 Conclusion: 该两阶段框架首次将LLM提取的心理特征向量化用于自杀风险检测,在保持高性能的同时提升效率与可解释性,为实际应用提供了可行的平衡方案。 Abstract: Suicide rates have risen worldwide in recent years, underscoring the urgent need for proactive prevention strategies. Social media provides valuable signals, as many at-risk individuals - who often avoid formal help due to stigma - choose instead to share their distress online. Yet detecting implicit suicidal ideation, conveyed indirectly through metaphor, sarcasm, or subtle emotional cues, remains highly challenging. Lightweight models like BERT handle explicit signals but fail on subtle implicit ones, while large language models (LLMs) capture nuance at prohibitive computational cost. To address this gap, we propose a two-stage voting architecture that balances efficiency and robustness. In Stage 1, a lightweight BERT classifier rapidly resolves high-confidence explicit cases. In Stage 2, ambiguous inputs are escalated to either (i) a multi-perspective LLM voting framework to maximize recall on implicit ideation, or (ii) a feature-based ML ensemble guided by psychologically grounded indicators extracted via prompt-engineered LLMs for efficiency and interpretability. To the best of our knowledge, this is among the first works to operationalize LLM-extracted psychological features as structured vectors for suicide risk detection. On two complementary datasets - explicit-dominant Reddit and implicit-only DeepSuiMind - our framework outperforms single-model baselines, achieving 98.0% F1 on explicit cases, 99.7% on implicit ones, and reducing the cross-domain gap below 2%, while significantly lowering LLM cost.[87] On the Relationship Between the Choice of Representation and In-Context Learning
Ioana Marinescu,Kyunghyun Cho,Eric Karl Oermann
Main category: cs.CL
TL;DR: 本文研究了上下文学习(ICL)中示例表示与学习能力之间的关系,发现表示质量决定ICL的基线准确率,而更多示例带来的提升在此基础上独立发生,二者具有正交性。
Details
Motivation: 尽管已有研究强调了上下文示例表示方式对ICL性能的重要性,但表示与学习能力之间的相互作用尚不清楚,因此需要探究这两者是否独立影响ICL。 Method: 提出一种优化算法,生成一系列语义相关性不同的标签集(表示),并在不同数量的上下文示例下进行ICL实验,分析表示质量与学习效率的关系。 Result: 学习效果在各种标签集上均存在,但学习效率受标签集质量和模型参数量影响;同时,标签集的相对准确性在整个学习过程中保持稳定。 Conclusion: ICL中的表示与学习是两个独立的因素:表示决定基线性能,学习则在其之上逐步提升性能,两者正交。 Abstract: In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.[88] If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth,Philipp Mondorf,Barbara Plank
Main category: cs.CL
TL;DR: 本研究探讨了大语言模型(LLM)在判断条件语句可接受性方面的表现,发现模型对条件概率和语义相关性均有响应,但一致性不及人类,且模型尺寸增大并不显著提升与人类判断的一致性。
Details
Motivation: 理解大语言模型如何评估条件语句的可接受性,填补此前研究中关于模型对条件可接受性判断机制的空白。 Method: 通过线性混合效应模型和方差分析(ANOVA),在不同模型家族、规模和提示策略下评估LLMs对条件可接受性的判断,并与人类数据进行比较。 Result: LLMs能够感知条件概率和语义相关性,但敏感程度因架构和提示方式而异;与人类相比,其判断一致性较低,且更大的模型并未表现出更强的人类对齐性。 Conclusion: 当前大语言模型在模拟人类对条件语句可接受性的判断方面仍有局限,单纯增加模型规模并不能有效提升其与人类判断的一致性。 Abstract: Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance-though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.[89] Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT
Noor Ul Zain,Mohsin Raza,Ahsan Adeel
Main category: cs.CL
TL;DR: Co$^4$是一种具有单层、双头和8M参数的微型模型,在训练效率和样本利用率上显著优于GPT-2和GPT-BERT,仅用两个epoch就在多个任务上超越了训练十个epoch的基线模型。
Details
Motivation: 挑战当前深度学习中依赖大规模模型和计算资源的范式,探索更高效、轻量级模型在预训练中的潜力。 Method: 提出Co$^4$模型,采用单层、双注意力头结构,实现O(N)复杂度,在BabyLM挑战的数据集上进行预训练,并通过SuperGLUE等复杂基准测试评估其零样本和微调性能。 Result: Co$^4$在10M token上训练两个epoch即超过训练十个epoch的GPT-2(124M)和GPT-BERT(30M);在SuperGLUE零样本评估中,优于GPT-2(5/7指标)和GPT-BERT(4/7),在微调任务中也表现更优(6/7优于GPT-2,4/7优于GPT-BERT)。 Conclusion: 小型、低复杂度模型在高训练效率下仍可实现强大性能,提示需重新思考当前的模型扩展规律和深度学习范式。 Abstract: We show that a tiny Co$^4$ machine(Adeel,2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2))$ and GPT-BERT (30M, 12 layers, $O(N^2))$ in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.[90] ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
Shuang Chen,Yue Guo,Yimeng Ye,Shijue Huang,Wenbo Hu,Haoxi Li,Manyuan Zhang,Jiayu Chen,Song Guo,Nanyun Peng
Main category: cs.CL
TL;DR: 本文提出了ARES,一个基于高窗口熵(HWE)的统一开源自适应推理框架,通过动态调整推理探索程度来提升多模态大模型在不同难度任务上的性能与效率。
Details
Motivation: 现有大模型在简单任务上过度推理,在复杂任务上探索不足,导致效率低下和漏解问题。作者旨在建立一种能根据任务难度自适应分配推理资源的机制。 Method: 提出ARES框架,包含两个阶段:1)自适应冷启动阶段,构建与任务难度匹配的推理轨迹数据以赋予模型难度感知;2)自适应熵策略优化(AEPO),利用HWE作为探索触发器,并设计分层熵奖励与动态KL控制来决定是否及如何探索。 Result: 实验表明,ARES在多个数学、逻辑和多模态基准上实现了更优的性能和推理效率,且推理成本显著低于领先的商业系统。 Conclusion: ARES通过HWE驱动的自适应推理机制,有效平衡了不同难度任务下的推理开销与效果,为高效多模态推理提供了可扩展的开源解决方案。 Abstract: Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.[91] LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task
Elisa Leonardelli,Silvia Casola,Siyao Peng,Giulia Rizzi,Valerio Basile,Elisabetta Fersini,Diego Frassinelli,Hyewon Jang,Maja Pavlovic,Barbara Plank,Massimo Poesio
Main category: cs.CL
TL;DR: LEWIDI任务的第三版扩展了基准测试,涵盖四个数据集,并引入软标签和视角主义两种新范式来评估AI模型对人类判断差异的识别能力。
Details
Motivation: 推动AI模型在训练和评估中考虑人类判断的多样性和分歧,提升模型对不确定性及主观性的理解能力。 Method: 扩展LEWIDI基准至四个任务(释义识别、反讽检测、讽刺检测、自然语言推断),引入包含分类与有序判断的标注体系,并采用软标签和视角主义两种互补范式进行评估,同时测试新的评价指标。 Result: 吸引了多样化参与,结果揭示了现有建模方法的优势与局限,验证了新评估范式和指标的有效性。 Conclusion: LEWIDI框架得到加强,为构建分歧感知型AI技术提供了新的资源、基准和研究发现。 Abstract: Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods to modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.[92] DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu,Yaxuan Li,Yushi Bai,Lei Hou,Juanzi Li
Main category: cs.CL
TL;DR: 提出DeepPrune框架,通过动态剪枝减少并行推理中的冗余,显著降低计算开销,同时保持高准确率。
Details
Motivation: 并行扩展虽能提升大模型推理能力,但存在大量冗余推理路径,导致计算效率低下。 Method: 设计基于焦点损失和过采样的判别模型,结合在线贪心聚类算法,从部分推理链预测答案等价性并动态剪枝。 Result: 在多个基准上实现超过80%的token减少,AUROC达0.87,准确率损失控制在3个百分点内。 Conclusion: DeepPrune有效解决了并行推理中的冗余问题,为高效推理建立了新标准。 Abstract: Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces which realizes 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction by over 80% compared to conventional consensus sampling on most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/[93] Neologism Learning for Controllability and Self-Verbalization
John Hewitt,Oyvind Tafjord,Robert Geirhos,Been Kim
Main category: cs.CL
TL;DR: 本文探讨了在与大语言模型(LLM)交互中引入新词(neologism learning)的方法,以更好地理解和控制模型行为。通过添加新的词嵌入并用示例训练,新词可用于控制诸如奉承、错误回答、文本长度等概念,并通过模型的自我描述(self-verbalization)揭示其内部理解。作者提出“插件评估”来验证这些自我描述的有效性,并发现机器专属同义词现象,最后展示了多概念、多词汇的联合学习能力。
Details
Motivation: 为了更有效地控制和理解大语言模型的行为,受人类创造新词以表达新概念的启发,探索在模型中引入人工新词的可行性与优势。 Method: 通过添加新的词嵌入并在不改变其他模型参数的情况下使用相关示例进行训练,使模型学习新词所代表的概念;利用模型自身的自然语言描述(self-verbalization)解释新词含义,并通过插件评估验证其有效性。 Result: 成功实现了对多种简单与复杂概念(如奉承、错误回答、文本长度)的行为控制;模型能自我描述新词含义;发现了机器专属同义词;实现了多个概念的联合学习。 Conclusion: 引入新词是一种有效且可解释的方式来增强对大语言模型的控制,并有助于揭示模型内部表示,为人类与模型之间的语义沟通提供了新的途径。 Abstract: Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means ``a lack of complete, coherent, or meaningful answers...'' To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.[94] Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator
Hyunji Lee,Kevin Chenhao Li,Matthias Grabmair,Shanshan Xu
Main category: cs.CL
TL;DR: 本文提出了一种结合蒙特卡洛树搜索(MCTS)和代理提示评估器的框架,用于在计算预算受限的情况下更高效地优化法律NLP任务中的提示,特别是在服务条款(ToS)条款的公平性检测中表现出更高的分类准确率和效率。
Details
Motivation: 现有的提示优化方法由于搜索策略低效和候选提示评估成本高而计算开销大,难以在资源有限的情况下实现高效优化。 Method: 提出一种结合蒙特卡洛树搜索(MCTS)与代理提示评估器的框架,以更有效地探索提示空间并降低评估成本。 Result: 实验表明,在受限的计算预算下,该方法相比基线方法实现了更高的分类准确率和更高的效率。 Conclusion: 所提出的框架在保证性能的同时显著提升了提示优化的效率,适用于资源受限的复杂法律NLP任务。 Abstract: Prompt optimization aims to systematically refine prompts to enhance a language model's performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.[95] Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Wenjie Du,Li Jiang,Keda Tao,Xue Liu,Huan Wang
Main category: cs.CL
TL;DR: 提出RLKV框架,利用强化学习识别推理关键的注意力头,在保持接近无损性能的同时实现20-50%的KV缓存压缩。
Details
Motivation: 现有KV缓存压缩方法在推理模型上表现不佳,会破坏推理完整性或错误压缩关键注意力头,导致性能显著下降。 Method: 提出RLKV框架,使用强化学习直接优化每个注意力头的缓存使用与推理质量之间的关系,通过生成样本的奖励机制识别关键头,并对关键头保留完整缓存,其余头进行压缩。 Result: 实验表明仅需保留少量关键注意力头即可维持推理性能,相比基线方法在20-50%缓存压缩率下实现近似无损的推理效果。 Conclusion: KV头在推理模型中具有功能异质性,RLKV能有效识别推理关键头,实现高效且高性能的KV缓存压缩。 Abstract: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models-some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.[96] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards
Xiangyuan Xue,Yifan Zhou,Guibin Zhang,Zaibin Zhang,Yijiang Li,Chen Zhang,Zhenfei Yin,Philip Torr,Wanli Ouyang,Lei Bai
Main category: cs.CL
TL;DR: 本文提出了一种名为CoMAS的新型自进化多智能体系统框架,通过智能体间的交互讨论生成内在奖励信号,并利用大语言模型作为评判机制,结合强化学习实现无需外部监督的自主优化。实验表明该方法在多数评测中达到最优性能,具备良好的可扩展性。
Details
Motivation: 现有基于强化学习的大语言模型自进化方法依赖密集的外部奖励或自身提取的内在奖励,缺乏类似人类通过协作讨论进行自我提升的机制。因此,需要一种更贴近人类学习方式的自进化框架。 Method: 提出CoMAS框架:构建多个大语言模型智能体,通过相互讨论产生丰富的交互动态;利用大语言模型作为裁判(LLM-as-a-judge)从讨论中提取内在奖励信号;基于这些信号使用强化学习优化各智能体策略,实现去中心化的协同进化。 Result: 实验结果显示,CoMAS在多种评估设置下均显著优于未经训练的智能体,并在大多数场景中达到当前最优性能;消融实验证明了交互式奖励信号的必要性,且随着智能体数量和多样性的增加表现出良好的可扩展性。 Conclusion: CoMAS为大语言模型智能体的自进化提供了一个新颖且有效的范式,通过模拟人类协作讨论机制,实现了无需外部监督的自主能力提升,具有良好的性能和扩展潜力。 Abstract: Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.[97] ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Qin Liu,Jacob Dineen,Yuxi Huang,Sheng Zhang,Hoifung Poon,Ben Zhou,Muhao Chen
Main category: cs.CL
TL;DR: ArenaBencher 是一个模型无关的自动基准演化框架,通过更新测试用例来应对大语言模型预训练数据泄露问题,在保持可比性的同时发现模型共性弱点,提升基准难度和诊断能力。
Details
Motivation: 由于大语言模型在预训练中可能记忆基准测试内容,导致测试分数虚高、模型比较失真,现有基准的有效性受到严重威胁,因此需要一种能持续进化的基准更新机制。 Method: ArenaBencher 基于现有基准和多样化的模型池,推断每个测试用例的核心能力,生成保持原目标的新问答对,利用大语言模型作为裁判验证其正确性和意图,并通过多模型反馈聚合选择能暴露共性弱点的候选用例,迭代优化测试集。 Result: 在数学解题、常识推理和安全领域应用中,ArenaBencher 生成了经过验证、多样化且公平的基准更新,揭示了新的失败模式,提升了测试难度并保持与原目标一致,增强了不同模型之间的区分度。 Conclusion: ArenaBencher 提供了一条可扩展的路径,使基准测试能够持续演进,以应对基础模型快速发展带来的数据泄露挑战,确保评估的有效性和可靠性。 Abstract: Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.cs.CV [Back]
[98] Enhancing Maritime Object Detection in Real-Time with RT-DETR and Data Augmentation
Nader Nemati
Main category: cs.CV
TL;DR: 本文提出了一种基于RT-DETR的实时海上目标检测系统,通过融合多尺度特征、优化查询选择和合成与真实数据加权策略,提升小目标检测性能,并在真实数据上验证了系统的有效性。
Details
Motivation: 由于海上目标尺寸小且标注的真实RGB数据有限,传统检测方法面临挑战,因此需要一种高效、鲁棒并适用于实际场景的实时检测系统。 Method: 采用RT-DETR框架,引入多尺度特征融合模块以增强小目标检测,设计不确定性最小化的查询选择机制以聚焦可靠候选框,并提出合成与真实样本间的智能权重分配策略以缩小域间差距;同时使用数据增强平衡类别分布。 Result: 所提系统在真实数据上实现了实时性能,显著提升了对小尺寸、低对比度船舶的检测精度,各模块贡献通过组件分析得以验证,并展示了在极端光照和海况下的鲁棒性。 Conclusion: 该研究成功将RT-DETR应用于 maritime 目标检测,通过关键模块改进和训练策略优化,在保持端到端优势的同时实现了速度与精度的可调平衡,具备实际部署潜力。 Abstract: Maritime object detection faces essential challenges due to the small target size and limitations of labeled real RGB data. This paper will present a real-time object detection system based on RT-DETR, enhanced by employing augmented synthetic images while strictly evaluating on real data. This study employs RT-DETR for the maritime environment by combining multi-scale feature fusion, uncertainty-minimizing query selection, and smart weight between synthetic and real training samples. The fusion module in DETR enhances the detection of small, low-contrast vessels, query selection focuses on the most reliable proposals, and the weighting strategy helps reduce the visual gap between synthetic and real domains. This design preserves DETR's refined end-to-end set prediction while allowing users to adjust between speed and accuracy at inference time. Data augmentation techniques were also used to balance the different classes of the dataset to improve the robustness and accuracy of the model. Regarding this study, a full Python robust maritime detection pipeline is delivered that maintains real-time performance even under practical limits. It also verifies how each module contributes, and how the system handles failures in extreme lighting or sea conditions. This study also includes a component analysis to quantify the contribution of each architectural module and explore its interactions.[99] DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis
Nithin C. Babu,Aniruddha Mahapatra,Harsh Rangwani,Rajiv Soundararajan,Kuldeep Kulkarni
Main category: cs.CV
TL;DR: 本文提出了DynamicEval,一个专注于动态相机运动的文本到视频生成评估基准,解决了现有基准在动态场景和视频级评估上的不足。
Details
Motivation: 现有的T2V评估基准主要关注静态或主体中心场景,缺乏对动态摄像运动下视频质量的细致评估,且通常仅提供模型级评分,忽略了视频级评估的重要性。 Method: 构建了一个包含系统设计提示词的基准DynamicEval,并收集了45k条人类标注的视频对数据;提出新的背景一致性指标(结合对象误差图修正Vbench度量的缺陷)和前景一致性指标(通过跟踪对象内点及其邻域来评估对象保真度)。 Result: 实验表明,所提出的指标在视频级和模型级均与人类偏好有更强的相关性(提升超过2个百分点),验证了DynamicEval作为更全面T2V评估基准的有效性。 Conclusion: DynamicEval通过引入动态摄像提示和细粒度一致性度量,在动态场景下显著提升了T2V模型的评估能力,为未来视频生成模型的发展提供了更可靠的评价标准。 Abstract: Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) While the emphasis is on subject-centric prompts or static camera scenes, camera motion essential for producing cinematic shots and existing metrics under dynamic motion are largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlook video-level evaluation, which is vital to selecting the better video among the candidate videos generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we obtain the interpretable error maps based on the Vbench motion smoothness metric. We observe that while the Vbench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions/disocclusions arising from camera and foreground object movements. Building on this, we propose a new background consistency metric that leverages object error maps to correct two failure cases in a principled manner. Our second innovation is the introduction of a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2% points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.[100] Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors
Marien Renaud,Julien Hermant,Deliang Wei,Yu Sun
Main category: cs.CV
TL;DR: 提出了一种名为RISP(Restarted Inertia with Score-based Priors)的新方法,用于加速求解病态成像反问题的收敛速度并保持高质量重建。
Details
Motivation: 现有方法如RED注重设计复杂的图像先验以提升重建质量,但收敛加速依赖启发式方法,缺乏理论支持。 Method: RISP结合了重启惯性机制和基于分数的图像先验,提供了一种原理性的RED扩展,并通过连续时间动力系统分析其与重球ODE的联系。 Result: 理论证明RISP比RED具有更快的驻点收敛速率,且不要求图像先验的凸性;实验验证了其在多种成像反问题中兼具快速收敛和高质量重建的能力。 Conclusion: RISP在不牺牲重建质量的前提下显著提升了收敛速度,为成像反问题提供了一个高效且有理论支撑的求解框架。 Abstract: Fast convergence and high-quality image recovery are two essential features of algorithms for solving ill-posed imaging inverse problems. Existing methods, such as regularization by denoising (RED), often focus on designing sophisticated image priors to improve reconstruction quality, while leaving convergence acceleration to heuristics. To bridge the gap, we propose Restarted Inertia with Score-based Priors (RISP) as a principled extension of RED. RISP incorporates a restarting inertia for fast convergence, while still allowing score-based image priors for high-quality reconstruction. We prove that RISP attains a faster stationary-point convergence rate than RED, without requiring the convexity of the image prior. We further derive and analyze the associated continuous-time dynamical system, offering insight into the connection between RISP and the heavy-ball ordinary differential equation (ODE). Experiments across a range of imaging inverse problems demonstrate that RISP enables fast convergence while achieving high-quality reconstructions.[101] A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy
Guoliang Gong,Man Yu
Main category: cs.CV
TL;DR: 本文提出了一种基于图像净化(IP)策略的超低剂量CT去噪新框架,并结合频域流匹配(FFM)模型,有效解决了真实临床uLDCT中严重的噪声、伪影和空间错位问题,在保持解剖结构完整性方面达到了SOTA效果。
Details
Motivation: 超低剂量CT(uLDCT)虽能显著降低辐射,但引入严重噪声和空间错位,导致现有基于合成噪声或对齐数据训练的去噪网络难以直接应用。因此,亟需解决真实uLDCT数据与正常剂量CT之间的数据不匹配问题。 Method: 首先构建了一个真实的临床uLDCT肺部数据集;然后提出图像净化(IP)策略生成结构对齐的uLDCT-NDCT配对图像,为网络训练提供高质量数据基础;在此基础上设计了频域流匹配(FFM)模型,与IP策略协同工作,更好保留去噪图像的解剖结构。 Result: 实验表明,IP策略显著提升了多种主流去噪模型在uLDCT任务上的性能;FFM模型结合IP策略在解剖结构 preservation 方面达到了最先进的水平。 Conclusion: 该研究通过IP策略和FFM模型为真实世界uLDCT去噪中的数据错配问题提供了有效解决方案,具有良好的临床应用前景。 Abstract: Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.[102] D2RA: Dual Domain Regeneration Attack
Pragati Shuddhodhan Meshram,Varun Chandrasekaran
Main category: cs.CV
TL;DR: 提出了一种无需训练、针对单张图像的攻击方法D2RA,可在无模型访问的情况下有效削弱或移除生成图像中的语义水印,揭示了现有水印方案的脆弱性。
Details
Motivation: 现有的语义水印方案虽在鲁棒性上有所提升,但在资源受限的对抗环境下仍易受攻击,缺乏足够的安全性保障。 Method: 通过将加水印的图像投影到多个互补表示下的自然先验上,利用自然图像的先验知识抑制水印信号,同时保持视觉质量,整个过程无需训练且不依赖模型访问。 Result: 在多种语义水印方案上的实验表明,D2RA能显著降低水印的可检测性,同时保持图像视觉保真度,证明了当前水印设计的普遍弱点。 Conclusion: 现有的语义水印方法存在根本性缺陷,即使在无模型访问和资源受限的情况下也能被有效攻击,需重新设计更安全的水印机制。 Abstract: The growing use of generative models has intensified the need for watermarking methods that ensure content attribution and provenance. While recent semantic watermarking schemes improve robustness by embedding signals in latent or frequency representations, we show they remain vulnerable even under resource-constrained adversarial settings. We present D2RA, a training-free, single-image attack that removes or weakens watermarks without access to the underlying model. By projecting watermarked images onto natural priors across complementary representations, D2RA suppresses watermark signals while preserving visual fidelity. Experiments across diverse watermarking schemes demonstrate that our approach consistently reduces watermark detectability, revealing fundamental weaknesses in current designs. Our code is available at https://github.com/Pragati-Meshram/DAWN.[103] PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
Soroush Mehraban,Vida Adeli,Jacob Rommann,Babak Taati,Kyryl Truskovskyi
Main category: cs.CV
TL;DR: 本文提出PickStyle,一种基于预训练视频扩散模型的视频风格迁移框架,通过引入低秩适配器和静态图像配对数据进行训练,并设计了CS-CFG方法以分离内容与风格引导,实现了时序连贯、风格忠实且内容保持的视频风格迁移。
Details
Motivation: 由于缺乏成对的视频数据用于监督,现有方法难以在保持输入视频内容的同时实现高质量的文本驱动视频风格迁移。 Method: 提出PickStyle框架,在预训练视频扩散模型中插入低秩风格适配器,并利用具有源-目标风格对应的静态图像对构建合成视频片段进行训练;通过共享增强模拟相机运动以保留时间先验;引入上下文-风格分类器无关引导(CS-CFG),将引导信号分解为独立的文本(风格)和视频(内容)方向。 Result: 实验表明,PickStyle在多个基准上优于现有方法,生成的视频在风格保真度、内容保持和时间连贯性方面表现优异,定性和定量结果均领先。 Conclusion: PickStyle有效解决了无配对视频数据下的视频风格迁移难题,通过适配器机制和静态图像迁移学习实现了高效、可控的风格化视频生成,CS-CFG进一步提升了内容与风格的解耦控制能力。 Abstract: We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.[104] TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Saman Motamed,Minghao Chen,Luc Van Gool,Iro Laina
Main category: cs.CV
TL;DR: 本文提出TRAVL,一种改进视频语言模型(VLM)物理合理性判断能力的微调方法,并构建ImplausiBench基准来评估视频中的物理现实性,揭示了现有VLM在时序与因果推理上的局限。
Details
Motivation: 现有视频生成模型常产生违反物理规律的不合理视频内容,但缺乏有效的量化评估方法;同时,视频语言模型在判断物理合理性方面表现不佳,需提升其时序与因果推理能力。 Method: 提出TRAVL微调策略,结合平衡训练数据集与轨迹感知注意力模块以增强运动表征能力,并构建去除了语言偏差的ImplausiBench基准(含300个视频)用于评估物理合理性。 Result: TRAVL显著提升了VLM在物理合理性判断上的性能,在ImplausiBench上更接近人类判断标准,并可通过LLM-as-judge指标进行严格评测。 Conclusion: TRAVL与ImplausiBench共同构成了一个评估和提升多模态模型物理合理性的统一框架,推动了视觉-时序理解中物理常识推理的发展。 Abstract: Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.[105] Label Semantics for Robust Hyperspectral Image Classification
Rafin Hassan,Zarin Tasnim Roshni,Rafiqul Bari,Alimul Islam,Nabeel Mohammed,Moshiur Farazi,Shafin Rahman
Main category: cs.CV
TL;DR: 提出了一种通用的语义光谱-空间融合网络(S3FN),利用大语言模型生成类别特定文本描述,结合预训练文本编码器提取语义信息,增强高光谱图像分类性能。
Details
Motivation: 由于高质量训练样本有限、光谱数据维度高以及现有模型多为单模态,导致高光谱图像分类易过拟合且难以平衡精度与计算复杂度。 Method: 提出S3FN框架,使用大语言模型生成每个类别的文本描述,并通过BERT或RoBERTa等预训练模型将其嵌入向量空间,实现语义信息与光谱-空间特征的融合,提升特征-标签对齐能力。 Result: 在Hyperspectral Wood、HyperspectralBlueberries和DeepHS-Fruit三个基准数据集上验证了方法的有效性,显著提升了分类性能。 Conclusion: 文本语义与光谱-空间数据具有协同效应,所提方法为语义增强的高光谱图像分类提供了新方向。 Abstract: Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most of HSI classification models are monomodal, where it solely relies on spectral-spatial data to learn decision boundaries in the high dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that captures their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics which in turn leads to a better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets - Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit and report significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Codes are be available in: https://github.com/milab-nsu/S3FN[106] Cross-Modal Attention Guided Unlearning in Vision-Language Models
Karuna Bhaila,Aneesh Komanduri,Minh-Hao Van,Xintao Wu
Main category: cs.CV
TL;DR: 本文提出了一种针对视觉-语言模型(VLM)的轻量级遗忘学习框架CAGUL,用于解决在视觉问答(VQA)任务中可能泄露敏感信息的问题。该方法利用跨模态注意力机制识别不重要的视觉令牌,并通过外部模块对其进行变换,从而实现高效遗忘,同时保持模型原有性能。
Details
Motivation: 大规模预训练的视觉-语言模型可能在训练过程中记忆并泄露敏感信息,尤其是在视觉和文本双模态输入下,传统微调方法成本高且效率低,因此需要一种更高效的遗忘机制。 Method: 提出Cross-Modal Attention Guided Unlearning (CAGUL),利用跨模态注意力分析视觉令牌的重要性,并通过外部模块修改低重要性视觉令牌以编码遗忘信息,避免修改预训练模型参数。 Result: 实验表明,CAGUL在防止信息泄露方面表现良好,同时保留了原始模型的行为,在性能上优于或相当于基于微调的基线方法,且无需重新训练。 Conclusion: CAGUL是一种高效、实用的VLM遗忘学习方案,能够在不改变预训练参数和无需重训的情况下有效防止敏感信息泄露,特别适用于视觉-语言多模态场景。 Abstract: Vision-Language Models (VLMs) have demonstrated immense capabilities in multi-modal understanding and inference tasks such as Visual Question Answering (VQA), which requires models to infer outputs based on visual and textual context simultaneously. Such inference abilities of large-scale pretrained models are often attributed to the massive scale of pre-training data collected across several domains. However, the models may memorize private and/or sensitive information during training and regurgitate it in inference. Recently, machine unlearning has been leveraged to address the leakage of private data in LLMs. VLMs add a layer of complexity to this process, as the visual context in the query may also contain sensitive information in addition to the text. To address this issue, we explore unlearning for vision-language models, specifically for the VQA task. We explore the role of visual tokens for output generation in VLMs using cross-modal attention and utilize it to formulate Cross-Modal Attention Guided Unlearning (CAGUL), a lightweight and efficient VLM unlearning framework. In contrast to computationally expensive model finetuning methods, CAGUL utilizes external modules to encode unlearning information in visual tokens of low importance for relevant queries. We find that the transformed visual tokens not only prevent leakage but also retain reference model behavior. Experimental results show that our method performs better or on par with finetuning-based baselines without altering the pre-trained model parameters or incurring retraining costs, making it a practical and effective unlearning solution for VLMs.[107] MaizeStandCounting (MaSC): Automated and Accurate Maize Stand Counting from UAV Imagery Using Image Processing and Deep Learning
Dewi Endah Kharismawati,Toni Kazic
Main category: cs.CV
TL;DR: MaizeStandCounting (MaSC) 是一种基于低成本无人机RGB图像的自动化玉米幼苗计数算法,支持两种模式处理图像或视频帧,利用轻量级YOLOv9模型检测不同生长阶段的玉米幼苗,并通过行与范围分割实现精确计数,实验显示其具有高准确性与实时处理潜力。
Details
Motivation: 准确的玉米出苗率对作物管理和研究至关重要,传统人工计数费时费力且易出错,尤其在大面积或变异较大的田地中,因此需要一种高效、低成本的自动化计数方法。 Method: MaSC采用两种模式:基于拼接图像分块处理和基于同源变换对齐的原始视频帧处理;使用轻量级YOLOv9模型检测V2-V10生长阶段的玉米幼苗,结合空间分布进行杂草区分和行列分割,实现逐行计数。 Result: 在2024年夏季试验田中,MaSC与人工计数结果高度一致(拼接图像R²=0.616,原始帧R²=0.906),处理83帧全分辨率图像仅需60.63秒,包含推理和后处理,具备实时应用潜力。 Conclusion: MaSC是一种可扩展、低成本且准确的自动化玉米出苗计数工具,适用于科研和生产环境中的大规模田间监测。 Abstract: Accurate maize stand counts are essential for crop management and research, informing yield prediction, planting density optimization, and early detection of germination issues. Manual counting is labor-intensive, slow, and error-prone, especially across large or variable fields. We present MaizeStandCounting (MaSC), a robust algorithm for automated maize seedling stand counting from RGB imagery captured by low-cost UAVs and processed on affordable hardware. MaSC operates in two modes: (1) mosaic images divided into patches, and (2) raw video frames aligned using homography matrices. Both modes use a lightweight YOLOv9 model trained to detect maize seedlings from V2-V10 growth stages. MaSC distinguishes maize from weeds and other vegetation, then performs row and range segmentation based on the spatial distribution of detections to produce precise row-wise stand counts. Evaluation against in-field manual counts from our 2024 summer nursery showed strong agreement with ground truth (R^2= 0.616 for mosaics, R^2 = 0.906 for raw frames). MaSC processed 83 full-resolution frames in 60.63 s, including inference and post-processing, highlighting its potential for real-time operation. These results demonstrate MaSC's effectiveness as a scalable, low-cost, and accurate tool for automated maize stand counting in both research and production environments.[108] Quick-CapsNet (QCN): A fast alternative to Capsule Networks
Pouya Shiri,Ramin Sharifi,Amirali Baniasadi
Main category: cs.CV
TL;DR: 本文提出了一种名为Quick-CapsNet(QCN)的快速胶囊网络,通过减少胶囊数量来提升推理速度,在MNIST、F-MNIST、SVHN和Cifar-10数据集上实现了5倍加速,仅牺牲少量精度,并采用更强的解码器进一步提升了性能。
Details
Motivation: CapsNet虽然在分类任务中表现优异且对仿射变换更具鲁棒性,但其训练和推理速度较慢,限制了其在实时应用中的使用,因此需要一种更快速的替代方案。 Method: 提出Quick-CapsNet(QCN),通过减少生成的胶囊数量来加快网络速度,并采用更强大的解码器替代默认解码器以提升性能。 Result: QCN在MNIST、F-MNIST、SVHN和Cifar-10数据集上的推理速度提高了5倍,同时仅带来轻微的精度损失。结合更强解码器后性能进一步提升。 Conclusion: QCN是一种高效的CapsNet快速替代方案,适合用于需要实时推理的应用场景,为后续开发快速CapsNet方法提供了基础。 Abstract: The basic computational unit in Capsule Network (CapsNet) is a capsule (vs. neurons in Convolutional Neural Networks (CNNs)). A capsule is a set of neurons, which form a vector. CapsNet is used for supervised classification of data and has achieved state-of-the-art accuracy on MNIST digit recognition dataset, outperforming conventional CNNs in detecting overlapping digits. Moreover, CapsNet shows higher robustness towards affine transformation when compared to CNNs for MNIST datasets. One of the drawbacks of CapsNet, however, is slow training and testing. This can be a bottleneck for applications that require a fast network, especially during inference. In this work, we introduce Quick-CapsNet (QCN) as a fast alternative to CapsNet, which can be a starting point to develop CapsNet for fast real-time applications. QCN builds on producing a fewer number of capsules, which results in a faster network. QCN achieves this at the cost of marginal loss in accuracy. Inference is 5x faster on MNIST, F-MNIST, SVHN and Cifar-10 datasets. We also further enhanced QCN by employing a more powerful decoder instead of the default decoder to further improve QCN.[109] Rectified-CFG++ for Flow Based Models
Shreshth Saini,Shashank Gupta,Alan C. Bovik
Main category: cs.CV
TL;DR: 提出了一种名为Rectified-CFG++的自适应预测-校正引导方法,用于解决基于修正流(RF)模型在使用分类器自由引导(CFG)时出现的严重流形外漂移问题。
Details
Motivation: 标准的CFG在应用于修正流模型时会导致严重的流形外漂移,产生视觉伪影、文本不对齐和不稳定行为,限制了生成质量与鲁棒性。 Method: 提出Rectified-CFG++,采用两步策略:首先执行条件RF更新以将样本锚定在学习到的传输路径附近,然后应用加权的条件校正,插值于条件与无条件速度场之间。该方法结合了修正流的确定性高效特性与几何感知的条件规则。 Result: 理论证明所得到的速度场是边缘一致的,且轨迹保持在数据流形的有界管状邻域内;在Flux、Stable Diffusion 3/3.5和Lumina等大型文本到图像模型上的实验表明,Rectified-CFG++在MS-COCO、LAION-Aesthetic和T2I-CompBench等多个基准上 consistently 优于标准CFG。 Conclusion: Rectified-CFG++有效解决了CFG在修正流模型中的流形外漂移问题,提升了生成稳定性与对齐精度,支持强引导下的高质量文本到图像生成。 Abstract: Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/[110] PIT-QMM: A Large Multimodal Model For No-Reference Point Cloud Quality Assessment
Shashank Gupta,Gregoire Phillips,Alan C. Bovik
Main category: cs.CV
TL;DR: 提出了一种新的大型多模态模型PIT-QMM,用于无参考点云质量评估(NR-PCQA),结合文本、图像和点云数据实现端到端质量预测,在多个基准上显著优于现有方法,并支持失真定位与识别。
Details
Motivation: 现有的大型多模态模型在图像和视频质量评估中取得进展,但在3D点云质量评估中的应用尚未充分探索,特别是无参考条件下的自动质量评估需求迫切。 Method: 利用文本描述、2D投影和3D点云视图等多模态信息,构建端到端的大型多模态模型PIT-QMM,联合处理多种输入以预测点云质量分数。 Result: 在主流基准测试上,PIT-QMM以更少的训练轮数显著优于现有最先进方法,并具备失真定位和识别能力。 Conclusion: PIT-QMM为无参考点云质量评估提供了高效且可解释的新框架,推动了3D资产质量评估的发展,并增强了模型的可解释性与交互性。 Abstract: Large Multimodal Models (LMMs) have recently enabled considerable advances in the realm of image and video quality assessment, but this progress has yet to be fully explored in the domain of 3D assets. We are interested in using these models to conduct No-Reference Point Cloud Quality Assessment (NR-PCQA), where the aim is to automatically evaluate the perceptual quality of a point cloud in absence of a reference. We begin with the observation that different modalities of data - text descriptions, 2D projections, and 3D point cloud views - provide complementary information about point cloud quality. We then construct PIT-QMM, a novel LMM for NR-PCQA that is capable of consuming text, images and point clouds end-to-end to predict quality scores. Extensive experimentation shows that our proposed method outperforms the state-of-the-art by significant margins on popular benchmarks with fewer training iterations. We also demonstrate that our framework enables distortion localization and identification, which paves a new way forward for model explainability and interactivity. Code and datasets are available at https://www.github.com/shngt/pit-qmm.[111] Dual-Stream Alignment for Action Segmentation
Harshala Gammulle,Clinton Fookes,Sridha Sridharan,Simon Denman
Main category: cs.CV
TL;DR: 本文提出了一种双流对齐网络(DSA Net),首次引入混合量子-经典机器学习框架用于动作分割,通过帧级和动作级双流特征对齐,结合时间上下文模块和量子化动作调制机制,在多个基准数据集上实现了最先进的性能。
Details
Motivation: 现有动作分割方法多依赖单一流模型,难以充分捕捉动作及其转换特征;近期双流方法虽有所改进,但仍缺乏有效特征对齐与交互机制,因此需要设计更强大的双流融合策略以提升分割精度。 Method: 提出双流对齐网络(DSA Net),包含帧级和动作级双流结构,通过时间上下文(TC)块利用交叉注意力和量子化动作引导调制(Q-ActGM)实现跨流信息融合,并设计双流对齐损失函数,包含关系一致性、跨层级对比和循环一致性重建三个部分,促使两流学习共享特征空间。 Result: 在GTEA、Breakfast、50Salads和EgoProcel四个基准数据集上验证了方法的有效性,DSA Net显著优于现有方法,达到最先进水平;消融实验表明各组件均对性能提升有贡献,尤其是Q-ActGM和对齐损失。 Conclusion: 本文提出的DSA Net通过有效的双流特征对齐与融合机制,提升了动作分割性能,且首次将量子-经典混合框架引入该领域,为未来研究提供了新方向。 Abstract: Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing[112] Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
Yanjie Pan,Qingdong He,Lidong Wang,Bo Peng,Mingmin Chi
Main category: cs.CV
TL;DR: 提出了一种名为OIE(Once is Enough)的视频虚拟试穿新方法,通过首帧换装和时序引导生成后续帧,在保持高性能的同时显著提升参数与计算效率。
Details
Motivation: 现有双分支架构在基于Diffusion Transformer的模型中难以高效实现视频虚拟试穿,且引入服装特征会导致参数量大、时序特性学习困难。 Method: 采用基于图像的服装迁移模型替换首帧服装,并利用编辑后的首帧内容、姿态和掩码信息引导视频生成模型逐帧合成后续结果。 Result: 实验表明该方法在参数效率和计算效率方面优于现有方法,同时保持领先的生成性能。 Conclusion: OIE通过首帧换装与时序控制相结合,为基于Diffusion Transformer的视频虚拟试衣提供了高效可行的解决方案。 Abstract: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. Initially, introducing latent space features from the garment reference branch requires adding or modifying the backbone network, leading to a large number of trainable parameters. Subsequently, the latent space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, utilize pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter efficiency and computational efficiency while still maintaining leading performance under these constraints.[113] MONKEY: Masking ON KEY-Value Activation Adapter for Personalization
James Baker
Main category: cs.CV
TL;DR: 提出了一种改进的扩散模型个性化方法,利用IP-Adapter自动生成的掩码在第二阶段屏蔽背景图像令牌,使文本提示更好地影响生成场景,从而提升主体保持与提示对齐的效果。
Details
Motivation: 现有个性化扩散模型常过度关注重建主体而忽略文本提示,导致生成图像与提示不符,需增强提示对非主体区域的控制力。 Method: 利用IP-Adapter在推理过程中自动生成主体掩码,并在第二次生成时屏蔽背景图像令牌,使文本提示能更有效地作用于非主体区域,实现主体与场景的更好融合。 Result: 在描述地点和场景的文本提示下,生成图像能准确保留主体并显著提升与文本提示的匹配度,在多个测试时个性化方法中表现出更高的提示对齐和源图像一致性。 Conclusion: 通过两阶段掩码机制利用自动分割信息,有效平衡了主体保真度与文本控制力,提升了个性化扩散模型在复杂提示下的生成质量。 Abstract: Personalizing diffusion models allows users to generate new images that incorporate a given subject, allowing more control than a text prompt. These models often suffer somewhat when they end up just recreating the subject image, and ignoring the text prompt. We observe that one popular method for personalization, the IP-Adapter automatically generates masks that we definitively segment the subject from the background during inference. We propose to use this automatically generated mask on a second pass to mask the image tokens, thus restricting them to the subject, not the background, allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test time personalization methods, and find our method displays high prompt and source image alignment.[114] Automatic Text Box Placement for Supporting Typographic Design
Jun Muraoka,Daichi Haraguchi,Naoto Inoue,Wataru Shimoda,Kota Yamaguchi,Seiichi Uchida
Main category: cs.CV
TL;DR: 该研究比较了基于Transformer和视觉语言模型(VLM)的方法在广告和网页不完整布局中文本框自动放置的效果,发现标准Transformer模型通常优于VLM方法,尤其是在利用外观信息时表现更佳,但在处理极小文本或密集布局时所有方法均面临挑战。
Details
Motivation: 在布局设计中需兼顾视觉吸引力与信息传达效率,现有自动化文本框放置方法在复杂场景下表现不足,因此需要系统评估不同模型的性能并探索改进方向。 Method: 采用标准Transformer模型、小型视觉语言模型Phi3.5-vision、大型预训练VLM(Gemini)以及可处理多图像的扩展Transformer模型,在Crello数据集上进行文本框放置任务的对比实验。 Result: 标准Transformer模型整体优于VLM方法,尤其在融入丰富外观信息时表现更好;但所有方法在处理非常小的文本或高度密集的布局时均存在困难。 Conclusion: 任务特定架构(如Transformer)在自动化布局设计中更具优势,未来应针对极端情况(如小文本、高密度)优化模型,并结合外观特征提升性能。 Abstract: In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small Vision and Language Model (Phi3.5-vision), a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show the standard Transformer-based models generally outperform VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.[115] TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration
Heming Wu,Di Wang,Tai Ma,Peng Zhao,Yubin Xiao,Zhongke Wu,Xing-Ce Wang,Chuang Li,Xuan Wu,You Zhou
Main category: cs.CV
TL;DR: 提出了一种基于特征增强残差模块(FERM)和双阶段阈值控制迭代策略(TCI)的金字塔网络(TCIP),用于可变形医学图像配准,有效减少解剖结构错位累积并自适应确定优化迭代次数,在多个数据集上优于现有最先进方法。
Details
Motivation: 现有金字塔网络在医学图像配准中易累积解剖结构错位,且缺乏根据图像变形需求自适应调整优化迭代次数的机制,影响配准精度。 Method: 设计FERM作为解码层核心组件,包含提取解剖语义特征、抑制无关特征和估计形变场三个模块;提出双阶段TCI策略,先评估配准稳定性,再判断收敛性以自适应终止迭代。 Result: 在三个脑部MRI数据集和一个腹部CT数据集上实验表明,TCIP在配准精度上优于当前最先进方法,同时保持较快推理速度和较小模型参数量;FERM和TCI具有良好的通用性和有效性。 Conclusion: TCIP通过FERM和TCI有效缓解了金字塔网络中的错位累积问题并实现自适应迭代控制,显著提升了医学图像配准的准确性和鲁棒性。 Abstract: Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrades registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability and with asserted stability, it continues with the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.[116] Controllable Video Synthesis via Variational Inference
Haoyi Duan,Yunzhi Zhang,Yilun Du,Jiajun Wu
Main category: cs.CV
TL;DR: 提出一种高可控性的视频合成方法,通过变分推理和多生成模型集成,实现对指定元素的精确控制和未明确部分的多样性保持。
Details
Motivation: 现有视频生成模型通常针对固定输入格式训练,难以满足用户对不同粒度控制(如4D对象轨迹、相机路径或粗略文本提示)的需求。 Method: 将任务建模为变分推断以逼近组合分布,结合多个视频生成骨干网络,并通过逐步KL散度最小化和上下文条件因子化技术优化求解过程。 Result: 实验表明,该方法在可控性、多样性和3D一致性方面优于先前工作。 Conclusion: 所提方法有效提升了视频生成中对多粒度用户控制的支持能力,在保持生成质量的同时实现了更高的灵活性和一致性。 Abstract: Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.[117] Hybrid CNN-BYOL Approach for Fault Detection in Induction Motors Using Thermal Images
Tangin Amir Smrity,MD Zahin Muntaqim Hasan Muhammad Kafi,Abu Saleh Musa Miah,Najmul Hassan,Yuichi Okuyama,Nobuyoshi Asai,Taro Suzuki,Jungpil Shin
Main category: cs.CV
TL;DR: 提出一种结合BYOL与CNN的混合方法,用于基于热成像的感应电机故障分类,所提出的轻量高效模型BYOL-IMNet在准确率(99.89%)和推理速度(5.7ms)上均优于现有模型。
Details
Motivation: 感应电机易发生各类故障,导致过热、能耗增加和运行中断,亟需早期检测以保障安全并延长寿命。 Method: 采用自监督学习方法BYOL结合多种主流CNN架构,并提出一种专为热图像故障分类设计的新型轻量CNN模型BYOL-IMNet,在热成像数据集上进行故障分类。 Result: BYOL-IMNet在测试中达到99.89%的准确率和每幅图像5.7毫秒的推理时间,性能优于当前先进模型。 Conclusion: CNN-BYOL混合方法在感应电机故障检测中表现出高精度和高效性,具备在工业在线监测中应用的潜力。 Abstract: Induction motors (IMs) are indispensable in industrial and daily life, but they are susceptible to various faults that can lead to overheating, wasted energy consumption, and service failure. Early detection of faults is essential to protect the motor and prolong its lifespan. This paper presents a hybrid method that integrates BYOL with CNNs for classifying thermal images of induction motors for fault detection. The thermal dataset used in this work includes different operating states of the motor, such as normal operation, overload, and faults. We employed multiple deep learning (DL) models for the BYOL technique, ranging from popular architectures such as ResNet-50, DenseNet-121, DenseNet-169, EfficientNetB0, VGG16, and MobileNetV2. Additionally, we introduced a new high-performance yet lightweight CNN model named BYOL-IMNet, which comprises four custom-designed blocks tailored for fault classification in thermal images. Our experimental results demonstrate that the proposed BYOL-IMNet achieves 99.89\% test accuracy and an inference time of 5.7 ms per image, outperforming state-of-the-art models. This study highlights the promising performance of the CNN-BYOL hybrid method in enhancing accuracy for detecting faults in induction motors, offering a robust methodology for online monitoring in industrial settings.[118] Mutual Learning for Hashing: Unlocking Strong Hash Functions from Weak Supervision
Xiaoxu Ma,Runhao Li,Zhenyu Weng
Main category: cs.CV
TL;DR: 提出了一种新的深度哈希框架MLH,通过弱到强的互学习机制,结合中心型和成对方法的优势,提升图像检索性能。
Details
Motivation: 中心型哈希方法虽擅长建模全局结构,但往往忽视局部相似性信息,而成对方法能有效保留局部相似关系。如何融合二者优势成为关键问题。 Method: 设计了双分支架构:一个强的中心型分支和一个弱的成对分支,通过迭代互学习机制实现知识迁移,并引入混合哈希专家模块促进跨分支交互。 Result: 在多个基准数据集上实验表明,MLH consistently优于当前最先进的哈希方法。 Conclusion: MLH有效融合了全局分布建模与局部相似性保持的优势,显著提升了大规模图像检索的性能。 Abstract: Deep hashing has been widely adopted for large-scale image retrieval, with numerous strategies proposed to optimize hash function learning. Pairwise-based methods are effective in learning hash functions that preserve local similarity relationships, whereas center-based methods typically achieve superior performance by more effectively capturing global data distributions. However, the strength of center-based methods in modeling global structures often comes at the expense of underutilizing important local similarity information. To address this limitation, we propose Mutual Learning for Hashing (MLH), a novel weak-to-strong framework that enhances a center-based hashing branch by transferring knowledge from a weaker pairwise-based branch. MLH consists of two branches: a strong center-based branch and a weaker pairwise-based branch. Through an iterative mutual learning process, the center-based branch leverages local similarity cues learned by the pairwise-based branch. Furthermore, inspired by the mixture-of-experts paradigm, we introduce a novel mixture-of-hash-experts module that enables effective cross-branch interaction, further enhancing the performance of both branches. Extensive experiments demonstrate that MLH consistently outperforms state-of-the-art hashing methods across multiple benchmark datasets.[119] RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning
Zipeng Guo,Lichen Ma,Xiaolong Fu,Gaojing Zhou,Lan Yang,Yuchen Zhou,Linkai Liu,Yu He,Ximan Liu,Shiping Dong,Jingling Fu,Zhen Chen,Yu Shi,Junshi Huang,Jason Li,Chao Gou
Main category: cs.CV
TL;DR: 本文提出了一种基于强化学习的图像修复框架Repainter,结合空间-蒙版轨迹优化与群组相对策略优化(GRPO),有效去除电商图像中的水印和促销文字,提升视觉质量。
Details
Motivation: 电商图像中的水印和 promotional text 会影响用户体验和广告效果,现有扩散模型在实际应用中存在对象去除不可靠和领域适应性不足的问题。 Method: 提出Repainter框架,采用强化学习结合空间-蒙版轨迹 refinement 和GRPO;通过调节注意力机制增强背景上下文建模,并设计复合奖励机制以平衡全局、局部和语义约束。 Result: 在自建的大规模电商数据集EcomPaint-100K和基准EcomPaint-Bench上实验表明,Repainter显著优于现有最先进方法,尤其在复杂场景下表现更优。 Conclusion: Repainter通过强化学习与复合奖励机制,在电商图像去噪与修复任务中实现了更可靠的对象去除与更少的视觉伪影,具备实际应用潜力。 Abstract: In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet the intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose Repainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that Repainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.[120] SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction
Wenyue Chen,Peng Li,Wangguandong Zheng,Chengfeng Zhao,Mengfei Li,Yaolong Zhu,Zhiyang Dou,Ronggang Wang,Yuan Liu
Main category: cs.CV
TL;DR: 本文提出SyncHuman,一种结合2D多视角生成模型与3D原生生成模型的新框架,用于从单张图像实现高质量、逼真的着装人体三维重建,尤其在复杂姿态下表现优异。
Details
Motivation: 现有方法依赖SMPL估计和条件生成模型,但在3D先验准确性、复杂姿态处理和细节恢复方面存在不足,难以实现高保真且结构一致的重建。 Method: 提出SyncHuman框架,通过像素对齐的2D-3D同步注意力机制联合微调2D多视角生成模型和3D原生生成模型,并引入特征注入机制将2D细节提升到3D形状上,实现几何对齐与高保真重建。 Result: 实验表明,SyncHuman在几何精度和视觉保真度上优于基线方法,能在复杂姿态下单图生成结构合理且细节丰富的三维人体模型。 Conclusion: SyncHuman有效融合了2D生成模型的细节表现力与3D生成模型的结构一致性,为单图三维人体重建提供了新思路,展现出生成模型在高保真3D重建中的潜力。 Abstract: Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and have difficulty in handling difficult human poses and reconstructing fine details. In this paper, we propose SyncHuman, a novel framework that combines 2D multiview generative model and 3D native generative model for the first time, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging human poses. Multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas 3D native generative model generates coarse yet structurally consistent 3D shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photo-realistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, demonstrating a promising direction for future 3D generation models.[121] ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
Jian Gao,Mengqi Yuan,Yifei Zeng,Chang Zeng,Zhihao Li,Zhenyu Chen,Weichao Qiu,Xiao-Xiao Long,Hao Zhu,Xun Cao,Yao Yao
Main category: cs.CV
TL;DR: 提出ComGS框架,通过Surface Octahedral Probes(SOPs)实现高效的可重光照物体重建,并结合扩散模型进行局部环境光照估计,实现高质量、实时的3D物体-场景融合。
Details
Motivation: 高斯点阵渲染中的烘焙光照和阴影信息导致物体与场景融合时出现不一致,现有方法在效率和复杂光照建模上存在不足。 Method: 引入SOPs存储光照与遮挡信息,支持高效插值查询以避免光线追踪;通过在目标位置重建360度辐射场并微调扩散模型来完成局部光照估计。 Result: 实现了约28 FPS的实时渲染,编辑耗时仅36秒,生成具有生动阴影的视觉和谐结果。 Conclusion: ComGS在效率、真实感和可重光照方面优于现有方法,推动了3D内容创作中对象-场景融合的实用性。 Abstract: Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object's appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object's placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.[122] UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes
Yuang Meng,Xin Jin,Lina Lei,Chun-Le Guo,Chongyi Li
Main category: cs.CV
TL;DR: 本文提出了一种基于单张短曝光RAW图像的超高清动态范围(UHDR)重建方法UltraLED,通过两阶段框架实现曝光校正和低光区域去噪,有效避免了重影和运动模糊,在动态场景中表现出更强的鲁棒性。
Details
Motivation: 传统RGB多帧包围曝光方法在处理UHDR场景时易受错位和重影影响,而单帧RAW图像具有更高位深和更可预测的噪声特性,为恢复亮暗细节提供了更好基础。作者旨在探索仅用单张短曝光RAW图像即可重建完整UHDR场景的可能性。 Method: 提出UltraLED,包含两个阶段:首先通过比值图进行曝光校正以平衡动态范围,然后使用亮度感知的RAW去噪器增强暗区细节恢复;同时设计了一个9档包围曝光流程来合成真实UHDR图像,并构建了对应数据集。 Result: 实验表明,UltraLED显著优于现有的单帧UHDR方法,能够在避免鬼影和运动模糊的同时,有效恢复高动态范围场景中的亮部和暗部细节。 Conclusion: 仅使用单张短曝光RAW图像即可实现高质量UHDR重建是可行的,UltraLED框架为此提供了有效解决方案,并在动态场景中展现出优越性能。 Abstract: Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.[123] DEGS: Deformable Event-based 3D Gaussian Splatting from RGB and Event Stream
Junhao He,Jiaxu Wang,Jia Li,Mingyuan Sun,Qiang Zhang,Jiahang Cao,Ziyi Zhang,Yi Gu,Jingkai Sun,Renjing Xu
Main category: cs.CV
TL;DR: 本文提出了一种结合低帧率RGB图像和高帧率事件流来优化动态3D高斯点阵(3DGS)的新框架,通过引入事件运动先验引导形变场优化,有效解决了大帧间运动带来的不确定性问题。
Details
Motivation: 由于低帧率RGB视频中大帧间运动增加了3DGS重建的不确定性,且事件相机虽能捕捉快速运动但缺乏颜色信息,因此需要融合两种模态以提升动态3D重建质量。 Method: 采用事件运动先验指导形变场优化;提出LoCM无监督微调框架提取事件流中的运动先验,并设计几何感知的数据关联方法建立事件与高斯点之间的运动对应关系,辅以运动分解和帧间伪标签策略。 Result: 在合成与真实场景上均优于现有的基于图像和事件的方法,验证了事件数据对动态3DGS优化的有效性。 Conclusion: 所提方法能够有效利用事件数据提供的运动约束,显著提升动态3D高斯点阵重建的精度和稳定性。 Abstract: Reconstructing Dynamic 3D Gaussian Splatting (3DGS) from low-framerate RGB videos is challenging. This is because large inter-frame motions will increase the uncertainty of the solution space. For example, one pixel in the first frame might have more choices to reach the corresponding pixel in the second frame. Event cameras can asynchronously capture rapid visual changes and are robust to motion blur, but they do not provide color information. Intuitively, the event stream can provide deterministic constraints for the inter-frame large motion by the event trajectories. Hence, combining low-temporal-resolution images with high-framerate event streams can address this challenge. However, it is challenging to jointly optimize Dynamic 3DGS using both RGB and event modalities due to the significant discrepancy between these two data modalities. This paper introduces a novel framework that jointly optimizes dynamic 3DGS from the two modalities. The key idea is to adopt event motion priors to guide the optimization of the deformation fields. First, we extract the motion priors encoded in event streams by using the proposed LoCM unsupervised fine-tuning framework to adapt an event flow estimator to a certain unseen scene. Then, we present the geometry-aware data association method to build the event-Gaussian motion correspondence, which is the primary foundation of the pipeline, accompanied by two useful strategies, namely motion decomposition and inter-frame pseudo-label. Extensive experiments show that our method outperforms existing image and event-based approaches across synthetic and real scenes and prove that our method can effectively optimize dynamic 3DGS with the help of event data.[124] Demystifying Deep Learning-based Brain Tumor Segmentation with 3D UNets and Explainable AI (XAI): A Comparative Analysis
Ming Jie Ong,Sze Yinn Ung,Sim Kuan Goh,Jimmy Y. Zhong
Main category: cs.CV
TL;DR: 本研究比较了UNet、ResUNet和Attention UNet在脑肿瘤分割中的性能,并结合Grad-CAM和注意力可视化等可解释AI(XAI)技术提升模型透明度和医生信任度。在BraTS2020数据集上,ResUNet在Dice、Jaccard、准确率、召回率和F1分数上表现最佳,被推荐用于临床脑肿瘤自动分割。
Details
Motivation: 提高脑肿瘤MRI图像分割的准确性,并通过可解释AI增强医生对深度学习模型决策的信任,促进其在临床决策中的应用。 Method: 采用UNet、ResUNet和Attention UNet三种模型进行脑肿瘤分割,使用BraTS2020公开数据集训练和验证,优化器为Adam;利用Grad-CAM和注意力可视化两种XAI技术分析模型关注区域和决策机制。 Result: ResUNet在测试阶段表现最优,Dice和Jaccard相似性系数、准确率、召回率及F1分数均高于UNet和Attention UNet;Grad-CAM揭示了各模型关注的肿瘤子区域,注意力可视化阐明了Attention UNet的工作机制。 Conclusion: ResUNet是三种模型中性能最好的,结合XAI技术不仅能提升模型性能理解,还可增强临床可信度,建议将其用于未来的自动化脑肿瘤分割临床评估。 Abstract: The current study investigated the use of Explainable Artificial Intelligence (XAI) to improve the accuracy of brain tumor segmentation in MRI images, with the goal of assisting physicians in clinical decision-making. The study focused on applying UNet models for brain tumor segmentation and using the XAI techniques of Gradient-weighted Class Activation Mapping (Grad-CAM) and attention-based visualization to enhance the understanding of these models. Three deep learning models - UNet, Residual UNet (ResUNet), and Attention UNet (AttUNet) - were evaluated to identify the best-performing model. XAI was employed with the aims of clarifying model decisions and increasing physicians' trust in these models. We compared the performance of two UNet variants (ResUNet and AttUNet) with the conventional UNet in segmenting brain tumors from the BraTS2020 public dataset and analyzed model predictions with Grad-CAM and attention-based visualization. Using the latest computer hardware, we trained and validated each model using the Adam optimizer and assessed their performance with respect to: (i) training, validation, and inference times, (ii) segmentation similarity coefficients and loss functions, and (iii) classification performance. Notably, during the final testing phase, ResUNet outperformed the other models with respect to Dice and Jaccard similarity scores, as well as accuracy, recall, and F1 scores. Grad-CAM provided visuospatial insights into the tumor subregions each UNet model focused on while attention-based visualization provided valuable insights into the working mechanisms of AttUNet's attention modules. These results demonstrated ResUNet as the best-performing model and we conclude by recommending its use for automated brain tumor segmentation in future clinical assessments. Our source code and checkpoint are available at https://github.com/ethanong98/MultiModel-XAI-Brats2020[125] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Qinghongbing Xie,Zhaoyuan Xia,Feng Zhu,Lijun Gong,Ziyue Li,Rui Zhao,Long Zeng
Main category: cs.CV
TL;DR: 本文提出了GTR-Bench,一个用于评估视觉语言模型在大规模摄像头网络中地理时空推理能力的新基准,揭示了现有模型在空间时间上下文利用、时间预测和地图与多视角视频对齐方面的三大缺陷。
Details
Motivation: 现有基准主要关注以自我为中心的视角或仅基于图形上下文的地理视角推理,缺乏结合图像/视频与图形上下文来评估模型地理时空智能的能力,难以满足交通管理、应急响应等实际应用需求。 Method: 提出GTR-Bench,包含需在地图与视频间切换视角、跨非重叠视野视频联合推理、以及对无观测区域进行时空推断的挑战性任务,构建涵盖多种现实场景的评测集,并对10多个主流VLM进行系统评估。 Result: 实验显示当前最优模型Gemini-2.5-Pro得分仅为34.9%,远低于人类表现的78.61%;分析揭示了模型在时空上下文使用不均衡、时间预测能力弱、难以理解或对齐地图与多视图视频输入三方面的主要缺陷。 Conclusion: GTR-Bench为地理时空智能研究提供了新的挑战和方向,有助于推动自动驾驶、具身AI和通用人工智能等领域的发展。 Abstract: Recently spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI and General Artificial Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (eg. a map), thus fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address the gaps, we introduce Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models for geo-temporal reasoning. (1) VLMs' reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align the map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.[126] FMANet: A Novel Dual-Phase Optical Flow Approach with Fusion Motion Attention Network for Robust Micro-expression Recognition
Luu Tu Nguyen,Vu Tram Anh Khuong,Thi Bich Phuong Man,Thi Duyen Ngo,Thanh Ha Le
Main category: cs.CV
TL;DR: 提出了一种新的微表情识别方法MM-COF和FMANet,通过整合起始-顶点和顶点-结束两个阶段的光流信息,并引入可学习的幅度调制机制,在多个标准数据集上实现了优于现有方法的性能。
Details
Motivation: 现有微表情识别方法大多仅利用 onset 到 apex 帧之间的光流,忽略了 apex 到 offset 阶段的重要运动信息,限制了识别性能。 Method: 提出Magnitude-Modulated Combined Optical Flow(MM-COF)作为综合运动表征,融合双阶段运动动态;并设计FMANet网络架构,将双阶段分析和幅度调制模块化为可学习组件,实现自适应运动线索融合与显著面部区域聚焦。 Result: 在MMEW、SMIC、CASME-II和SAMM四个标准数据集上实验表明,所提MM-COF和FMANet均优于现有方法。 Conclusion: 引入apex-to-offset阶段的运动信息并通过可学习方式融合双阶段光流,能有效提升微表情识别性能,验证了端到端双阶段框架的潜力。 Abstract: Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.[127] An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images
Kanglin Ning,Ruzhao Chen,Penghong Wang,Xingtao Wang,Ruiqin Xiong,Xiaopeng Fan
Main category: cs.CV
TL;DR: 本文提出了一种基于房间几何约束的360度室内全景深度估计框架,通过布局预测和背景分割机制融合几何信息,显著提升了深度估计精度。
Details
Motivation: 现有方法关注像素级精度,导致房间角落过度平滑且对噪声敏感,难以准确恢复球形像素深度。 Method: 提出一个共享特征编码器和任务特定解码器的框架,结合布局估计、深度估计和背景分割;引入基于房间几何的背景深度解析策略和背景分割引导的融合机制。 Result: 在Stanford2D3D、Matterport3D和Structured3D数据集上实验表明,该方法显著优于当前开源方法。 Conclusion: 所提出的基于房间几何约束的深度估计框架有效改善了360度室内全景的深度预测质量,尤其在房间结构保持和噪声鲁棒性方面表现突出。 Abstract: Predicting spherical pixel depth from monocular $360^{\circ}$ indoor panoramas is critical for many vision applications. However, existing methods focus on pixel-level accuracy, causing oversmoothed room corners and noise sensitivity. In this paper, we propose a depth estimation framework based on room geometry constraints, which extracts room geometry information through layout prediction and integrates those information into the depth estimation process through background segmentation mechanism. At the model level, our framework comprises a shared feature encoder followed by task-specific decoders for layout estimation, depth estimation, and background segmentation. The shared encoder extracts multi-scale features, which are subsequently processed by individual decoders to generate initial predictions: a depth map, a room layout map, and a background segmentation map. Furthermore, our framework incorporates two strategies: a room geometry-based background depth resolving strategy and a background-segmentation-guided fusion mechanism. The proposed room-geometry-based background depth resolving strategy leverages the room layout and the depth decoder's output to generate the corresponding background depth map. Then, a background-segmentation-guided fusion strategy derives fusion weights for the background and coarse depth maps from the segmentation decoder's predictions. Extensive experimental results on the Stanford2D3D, Matterport3D and Structured3D datasets show that our proposed methods can achieve significantly superior performance than current open-source methods. Our code is available at https://github.com/emiyaning/RGCNet.[128] Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation
Shohei Enomoto
Main category: cs.CV
TL;DR: 提出ACAEP方法,通过仿射、颜色和加法视觉提示增强视觉提示的表达能力,并引入TrivialAugment缓解过拟合,显著提升性能。
Details
Motivation: 传统视觉提示方法表达能力有限且易过拟合,导致精度低于其他适配方法。 Method: 引入仿射变换以创建任务特定区域,颜色变换以突出关键特征,并采用TrivialAugment进行数据增强。 Result: 在十二个图像分类数据集上,ACAEP在视觉提示方法中达到SOTA,优于线性探测,且对分布偏移更具鲁棒性。 Conclusion: ACAEP结合多种变换与有效数据增强,显著提升了视觉提示的性能与泛化能力。 Abstract: Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP's expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference.[129] MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
Kaen Kogashi,Anoop Cherian,Meng-Yu Jennifer Kuo
Main category: cs.CV
TL;DR: 本文提出了MMHOI,一个大规模的多人体多物体交互数据集,包含12种日常场景中的图像及详细的3D标注,并基于此提出了MMHOI-Net模型,在多人体-物体交互建模中实现了最先进的性能。
Details
Motivation: 现有3D人体-物体交互(HOI)基准仅涵盖真实场景中复杂交互的一小部分,缺乏对多人体与多物体之间因果性、目标导向或协作性互动的充分建模。 Method: 提出MMHOI数据集,包含完整的3D形状和姿态标注、78类动作标签和14个交互相关身体部位;并设计了基于Transformer的端到端网络MMHOI-Net,采用结构化双块表示来建模物体及其交互,结合动作识别提升交互预测。 Result: 在MMHOI和CORE4D数据集上的实验表明,该方法在多人体-物体交互建模中达到最先进水平,兼具高准确率和高质量重建能力。 Conclusion: MMHOI为下一代HOI研究提供了全面的测试平台,而MMHOI-Net通过结构化表示和动作联合学习有效提升了复杂交互的理解与重建性能。 Abstract: Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance the interaction prediction. Experiments on MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality.[130] PrismGS: Physically-Grounded Anti-Aliasing for High-Fidelity Large-Scale 3D Gaussian Splatting
Houqiang Zhong,Zhenglong Wu,Sihua Fu,Zihan Zheng,Xin Jin,Xiaoyun Zhang,Li Song,Qiang Hu
Main category: cs.CV
TL;DR: 提出PrismGS,一种物理驱动的正则化框架,通过多尺度监督和显式尺寸正则化提升3D高斯在大规模城市场景中高分辨率渲染的保真度与稳定性。
Details
Motivation: 现有3D高斯点阵方法在大尺度城市环境中存在严重走样和优化不稳定问题,尤其在4K渲染下表现明显,难以满足高质量实时渲染需求。 Method: 引入金字塔多尺度监督,结合预滤波图像金字塔进行一致性监督;同时施加基于物理的显式尺寸正则化,防止退化的视图相关基元产生。 Result: 在MatrixCity、Mill-19和UrbanScene3D上实验显示,相比CityGaussian提升约1.5 dB PSNR,在4K渲染下仍保持优异质量和鲁棒性。 Conclusion: PrismGS有效解决了大规模场景中3D高斯渲染的走样与不稳定性问题,具备即插即用特性,可兼容现有流水线,显著提升渲染质量。 Abstract: 3D Gaussian Splatting (3DGS) has recently enabled real-time photorealistic rendering in compact scenes, but scaling to large urban environments introduces severe aliasing artifacts and optimization instability, especially under high-resolution (e.g., 4K) rendering. These artifacts, manifesting as flickering textures and jagged edges, arise from the mismatch between Gaussian primitives and the multi-scale nature of urban geometry. While existing ``divide-and-conquer'' pipelines address scalability, they fail to resolve this fidelity gap. In this paper, we propose PrismGS, a physically-grounded regularization framework that improves the intrinsic rendering behavior of 3D Gaussians. PrismGS integrates two synergistic regularizers. The first is pyramidal multi-scale supervision, which enforces consistency by supervising the rendering against a pre-filtered image pyramid. This compels the model to learn an inherently anti-aliased representation that remains coherent across different viewing scales, directly mitigating flickering textures. This is complemented by an explicit size regularization that imposes a physically-grounded lower bound on the dimensions of the 3D Gaussians. This prevents the formation of degenerate, view-dependent primitives, leading to more stable and plausible geometric surfaces and reducing jagged edges. Our method is plug-and-play and compatible with existing pipelines. Extensive experiments on MatrixCity, Mill-19, and UrbanScene3D demonstrate that PrismGS achieves state-of-the-art performance, yielding significant PSNR gains around 1.5 dB against CityGaussian, while maintaining its superior quality and robustness under demanding 4K rendering.[131] IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries
Harsh Kavediya,Vighnesh Nayak,Bheeshm Sharma,Balamurugan Palaniappan
Main category: cs.CV
TL;DR: 本文提出了一种名为IsoSignVid2Aud的端到端框架,用于将孤立手语视频序列直接转换为语音,无需中间文本表示,适用于教育和手语提示界面。
Details
Motivation: 为了帮助听障和言语障碍人群与他人交流,需要将手语视频转化为口语音频,特别是在处理非连续语法的手语序列时,避免多阶段翻译系统带来的延迟和级联错误。 Method: 采用I3D特征提取模块结合专用特征变换网络和音频生成管道,并提出一种新的非极大值抑制(NMS)算法用于在非语法连续序列中进行手势的时间检测。 Result: 在ASL-Citizen-1500和WLASL-100数据集上分别取得了72.01%和78.67%的Top-1准确率,音频质量指标PESQ为2.67,STOI为0.73,表明生成的语音清晰可懂。 Conclusion: IsoSignVid2Aud能够有效实现从孤立手语视频到语音的直接翻译,具有实际应用潜力,尤其适用于教育和技术辅助场景。 Abstract: Sign language to spoken language audio translation is important to connect the hearing- and speech-challenged humans with others. We consider sign language videos with isolated sign sequences rather than continuous grammatical signing. Such videos are useful in educational applications and sign prompt interfaces. Towards this, we propose IsoSignVid2Aud, a novel end-to-end framework that translates sign language videos with a sequence of possibly non-grammatic continuous signs to speech without requiring intermediate text representation, providing immediate communication benefits while avoiding the latency and cascading errors inherent in multi-stage translation systems. Our approach combines an I3D-based feature extraction module with a specialized feature transformation network and an audio generation pipeline, utilizing a novel Non-Maximal Suppression (NMS) algorithm for the temporal detection of signs in non-grammatic continuous sequences. Experimental results demonstrate competitive performance on ASL-Citizen-1500 and WLASL-100 datasets with Top-1 accuracies of 72.01\% and 78.67\%, respectively, and audio quality metrics (PESQ: 2.67, STOI: 0.73) indicating intelligible speech output. Code is available at: https://github.com/BheeshmSharma/IsoSignVid2Aud_AIMLsystems-2025.[132] AlignGS: Aligning Geometry and Semantics for Robust Indoor Reconstruction from Sparse Views
Yijie Gao,Houqiang Zhong,Tianchi Zhu,Zhengxue Cheng,Qiang Hu,Li Song
Main category: cs.CV
TL;DR: 本文提出了一种名为AlignGS的新框架,通过将2D基础模型的语义先验用于3D几何重建的正则化,实现几何与语义的协同优化,在稀疏视角下显著提升了3D室内场景重建的几何精度和视觉效果。
Details
Motivation: 现有方法在稀疏视角下进行3D重建时,常因几何模糊性导致结果不准确,且语义信息通常作为后处理附加,未能有效引导几何重建。本文旨在通过主动利用语义理解来解决这一问题。 Method: 提出AlignGS框架,从2D基础模型中提取语义先验,并设计了深度一致性与多面法线正则化等语义到几何的引导机制,实现几何与语义的端到端联合优化。 Result: 在标准数据集上的实验表明,该方法在新视角合成和几何重建精度方面均达到最先进水平,生成的3D模型更完整、更一致。 Conclusion: 将语义先验作为几何正则化手段,能有效提升稀疏视角下的3D重建质量,验证了语义与几何协同优化的重要性。 Abstract: The demand for semantically rich 3D models of indoor scenes is rapidly growing, driven by applications in augmented reality, virtual reality, and robotics. However, creating them from sparse views remains a challenge due to geometric ambiguity. Existing methods often treat semantics as a passive feature painted on an already-formed, and potentially flawed, geometry. We posit that for robust sparse-view reconstruction, semantic understanding instead be an active, guiding force. This paper introduces AlignGS, a novel framework that actualizes this vision by pioneering a synergistic, end-to-end optimization of geometry and semantics. Our method distills rich priors from 2D foundation models and uses them to directly regularize the 3D representation through a set of novel semantic-to-geometry guidance mechanisms, including depth consistency and multi-faceted normal regularization. Extensive evaluations on standard benchmarks demonstrate that our approach achieves state-of-the-art results in novel view synthesis and produces reconstructions with superior geometric accuracy. The results validate that leveraging semantic priors as a geometric regularizer leads to more coherent and complete 3D models from limited input views. Our code is avaliable at https://github.com/MediaX-SJTU/AlignGS .[133] Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials
Thomas Lautenschlager,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Katja Nau,Gaëlle Hayot,Thomas Dickmeis,Ralf Mikut
Main category: cs.CV
TL;DR: 本文探讨了如何利用自监督学习方法在高通量毒性测试中有效识别有毒物质引起的变化,并以EmbryoNet数据集为例验证了该方法的可行性。
Details
Motivation: 高通量毒性测试需要快速、低成本地评估大量化合物,而自动化评估依赖于机器学习模型。然而,现有方法在表征毒性效应方面仍存在挑战。 Method: 采用自监督学习方法从EmbryoNet数据集中学习化合物诱导的表型变化表示,并用于区分不同化合物的作用机制。 Result: 实验表明,通过自监督学习获得的表征能够有效区分不同化学物质的作用模式,具备良好的分类性能。 Conclusion: 自监督学习为高通量毒性测试中的表征学习提供了有效途径,有助于推动机器学习模型在实际毒性检测设备(如TOXBOX)中的集成应用。 Abstract: High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.[134] XYZCylinder: Feedforward Reconstruction for Driving Scenes Based on A Unified Cylinder Lifting Method
Haochen Yu,Qiankun Liu,Hongyuan Liu,Jianfei Jiang,Juntao Lyu,Jiansheng Chen,Huimin Ma
Main category: cs.CV
TL;DR: 提出了一种名为XYZCylinder的前馈模型,通过统一的圆柱提升方法,在不同相机配置下实现驾驶场景的高效泛化与高精度重建。
Details
Motivation: 现有前馈重建方法因固定视角变换和稀疏视图重叠区域小,导致在不同相机配置下的泛化能力和重建精度受限。 Method: 设计了统一圆柱相机建模(UCCM)策略以提升泛化能力,并提出基于圆柱平面特征组(CPFG)的混合表示与特征提升模块以提高重建精度。 Result: 实验表明,XYZCylinder在多种评估设置下达到最先进水平,并能零样本迁移到其他驾驶场景。 Conclusion: XYZCylinder通过显式相机建模和新型特征提升机制,有效提升了驾驶场景重建的泛化性和准确性。 Abstract: Recently, more attention has been paid to feedforward reconstruction paradigms, which mainly learn a fixed view transformation implicitly and reconstruct the scene with a single representation. However, their generalization capability and reconstruction accuracy are still limited while reconstructing driving scenes, which results from two aspects: (1) The fixed view transformation fails when the camera configuration changes, limiting the generalization capability across different driving scenes equipped with different camera configurations. (2) The small overlapping regions between sparse views of the $360^\circ$ panorama and the complexity of driving scenes increase the learning difficulty, reducing the reconstruction accuracy. To handle these difficulties, we propose \textbf{XYZCylinder}, a feedforward model based on a unified cylinder lifting method which involves camera modeling and feature lifting. Specifically, to improve the generalization capability, we design a Unified Cylinder Camera Modeling (UCCM) strategy, which avoids the learning of viewpoint-dependent spatial correspondence and unifies different camera configurations with adjustable parameters. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Experimental results show that XYZCylinder achieves state-of-the-art performance under different evaluation settings, and can be generalized to other driving scenes in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}.[135] MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Peiran Wu,Zhuorui Yu,Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen
Main category: cs.CV
TL;DR: 提出了一种基于记忆增强的强化学习视频Token压缩方法MARC,能够在显著减少计算资源消耗的同时保持接近基线模型的性能。
Details
Motivation: 现有无训练视频Token压缩方法在降低计算成本时易造成信息丢失和性能下降,需更高效且保留关键信息的压缩方法。 Method: 采用“先检索后压缩”策略:通过视觉记忆检索器(VMR)选择关键片段,并利用基于压缩组相对策略优化(C-GRPO)的强化学习框架进行师生模型间的推理能力蒸馏。 Result: 在六个视频基准上实验表明,仅使用一帧的token量即可达到接近基线的准确率,视觉token减少95%,GPU内存降低72%,延迟减少23.9%。 Conclusion: MARC在大幅压缩视觉token的同时有效保留了语义信息,具备在资源受限场景(如视频问答、监控、自动驾驶)中实现实时视频理解的应用潜力。 Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose \textbf{Memory-Augmented Reinforcement Learning-based Token Compression (MARC)}, which integrates structured retrieval and RL-based distillation. MARC adopts a \textit{retrieve-then-compress} strategy using a \textbf{Visual Memory Retriever (VMR)} to select key clips and a \textbf{Compression Group Relative Policy Optimization (C-GRPO)} framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens -- reducing visual tokens by \textbf{95\%}, GPU memory by \textbf{72\%}, and latency by \textbf{23.9\%}. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.[136] ASBench: Image Anomalies Synthesis Benchmark for Anomaly Detection
Qunyi Zhang,Songan Zhang,Jinbao Wang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu
Main category: cs.CV
TL;DR: 本文提出了ASBench,首个专注于评估异常合成方法的综合基准框架,通过四个关键维度系统性地评估现有方法,揭示了当前技术的局限性并为未来研究提供了可行方向。
Details
Motivation: 现有异常合成研究多将其作为异常检测框架中的辅助组件,缺乏对合成算法本身的系统性评估,且忽视了合成任务特有的关键因素,如与检测性能的解耦、合成数据的量化分析及跨场景适应性。 Method: 提出ASBench框架,引入四个评估维度:(i)在不同数据集和流程中的泛化性能;(ii)合成数据与真实数据的比例;(iii)合成图像内在指标与检测性能指标的相关性;(iv)混合异常合成策略。 Result: 通过大量实验,ASBench揭示了当前异常合成方法在泛化性、数据效率和相关性方面的局限性,并验证了不同合成策略的影响。 Conclusion: ASBench为异常合成方法提供了系统、全面的评估平台,不仅暴露了现有方法的不足,也为未来异常合成研究提供了明确的方向和实用的改进思路。 Abstract: Anomaly detection plays a pivotal role in manufacturing quality control, yet its application is constrained by limited abnormal samples and high manual annotation costs. While anomaly synthesis offers a promising solution, existing studies predominantly treat anomaly synthesis as an auxiliary component within anomaly detection frameworks, lacking systematic evaluation of anomaly synthesis algorithms. Current research also overlook crucial factors specific to anomaly synthesis, such as decoupling its impact from detection, quantitative analysis of synthetic data and adaptability across different scenarios. To address these limitations, we propose ASBench, the first comprehensive benchmarking framework dedicated to evaluating anomaly synthesis methods. Our framework introduces four critical evaluation dimensions: (i) the generalization performance across different datasets and pipelines (ii) the ratio of synthetic to real data (iii) the correlation between intrinsic metrics of synthesis images and anomaly detection performance metrics , and (iv) strategies for hybrid anomaly synthesis methods. Through extensive experiments, ASBench not only reveals limitations in current anomaly synthesis methods but also provides actionable insights for future research directions in anomaly synthesis[137] TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Leigang Qu,Ziyang Wang,Na Zheng,Wenjie Wang,Liqiang Nie,Tat-Seng Chua
Main category: cs.CV
TL;DR: 本文提出了TTOM,一种无需训练的视频生成框架,通过测试时优化和记忆机制,在推理过程中对时空布局进行对齐,显著提升文本到视频生成在组合场景下的表现。
Details
Motivation: 现有视频基础模型在处理组合性任务(如运动、数量和空间关系)时表现不佳,缺乏有效的跨模态对齐机制。 Method: 引入测试时优化与记忆机制(TTOM),通过优化新参数而非直接干预潜变量或注意力机制,并结合可操作的记忆模块(插入、读取、更新、删除)来维护历史上下文,实现流式视频生成中的动态对齐。 Result: 在T2V-CompBench和Vbench基准上验证了TTOM的有效性,展现出优异的组合生成能力、跨模态对齐效果以及良好的可扩展性和效率。 Conclusion: TTOM是一种高效、实用且可扩展的训练-free框架,能够实现即插即用的组合性视频生成,具备强迁移性和泛化能力。 Abstract: Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.[138] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
Tianrui Zhang,Yichen Liu,Zilin Guo,Yuxin Guo,Jingcheng Ni,Chenjing Ding,Dan Xu,Lewei Lu,Zehuan Wu
Main category: cs.CV
TL;DR: 提出CVD-STORM,一种基于时空重建VAE的跨视角视频扩散模型,可生成长时、多视角视频并具备4D重建能力,在FID和FVD指标上显著提升性能。
Details
Motivation: 自动驾驶等领域对高保真、可控的多视角视频生成及深度等几何信息预测提出更高要求,现有生成模型难以兼顾质量与多模态输出。 Method: 首先通过辅助的4D重建任务微调VAE,增强其对3D结构和时序动态的编码能力;然后将该VAE集成到视频扩散过程中,并结合联合训练的高斯溅射解码器实现高质量视频生成与动态场景重建。 Result: 在视频生成质量上显著优于基线方法,FID和FVD指标明显改善;同时高斯溅射解码器能有效重建动态场景,提供丰富的几何信息。 Conclusion: CVD-STORM通过引入4D感知VAE和扩散模型的协同设计,实现了高质量、多视角、长序列视频生成,并支持下游场景理解任务,推动了生成模型在复杂世界建模中的应用。 Abstract: Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.[139] A Large-scale Dataset for Robust Complex Anime Scene Text Detection
Ziyi Dong,Yurui Zhang,Changmao Li,Naomi Rue Golding,Qing Long
Main category: cs.CV
TL;DR: 本文介绍了AnimeText,一个专为动漫场景文本检测设计的大规模数据集,包含73.5万张图像和420万个标注文本块,具有层次化标注和困难负样本,显著提升了在复杂动漫场景中的文本检测性能。
Details
Motivation: 现有文本检测数据集主要针对自然或文档场景,难以应对动漫中文本风格多样、排列不规则且易与复杂视觉元素混淆的特点,因此需要专门的数据集来填补这一空白。 Method: 构建了一个名为AnimeText的大规模数据集,包含735K图像和4.2M文本块,引入了层次化标注和针对动漫场景的困难负样本,并通过跨数据集基准测试评估其有效性。 Result: 在多个先进文本检测方法上的实验表明,使用AnimeText训练的模型在动漫场景文本检测任务中显著优于使用现有数据集训练的模型。 Conclusion: AnimeText能有效提升动漫场景下的文本检测性能,为该领域提供了重要的数据资源。 Abstract: Current text detection datasets primarily target natural or document scenes, where text typically appear in regular font and shapes, monotonous colors, and orderly layouts. The text usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scene also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. %Cross-dataset evaluations using state-of-the-art methods demonstrate that models trained on AnimeText achieve superior performance in anime text detection tasks compared to existing datasets. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText[140] SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation
Yifang Yin,Shengkai Chen,Yiyao Li,Lu Wang,Ruibing Jin,Wei Cui,Shili Xiang
Main category: cs.CV
TL;DR: 本文提出了一种新的降水临近预报框架SimCast和CasCast,通过短到长知识蒸馏和加权MSE损失提升预测性能,并在多个基准数据集上显著优于现有方法。
Details
Motivation: 由于地球系统的复杂性,降水临近预报具有挑战性,现有非自回归方法对不同预测时间范围的适应性不足,需要更准确且高效的模型来满足灾害管理、农业、交通等社会需求。 Method: 提出SimCast训练流程,采用短到长知识蒸馏技术和加权MSE损失函数以优先关注强降雨区域;进一步将SimCast集成到基于扩散的CasCast框架中,结合概率模型优势缓解确定性预测的模糊性和分布偏移问题。 Result: 在SEVIR、HKO-7和MeteoNet三个基准数据集上取得平均CSI分数分别为0.452、0.474和0.361,显著优于现有方法,且推理无额外开销。 Conclusion: SimCast和CasCast框架有效提升了降水临近预报的准确性与可靠性,兼顾确定性与概率性建模优势,具备实际应用潜力。 Abstract: Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complementary to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.[141] Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement
Yidi Liu,Xueyang Fu,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Zheng-Jun Zha
Main category: cs.CV
TL;DR: 提出Latent Harmony框架,通过两阶段方法改进基于VAE的超高清图像恢复,兼顾计算效率与高频细节保留。
Details
Motivation: 现有VAE因高斯约束易丢失退化相关的高频信息,导致UHD图像恢复中重建保真度下降。 Method: 第一阶段设计LH-VAE,引入视觉语义约束、渐进退化扰动和潜在等变性以增强语义鲁棒性和高频重建;第二阶段联合训练VAE与恢复模型,采用HF-LoRA(含保真导向的编码器LoRA和感知导向的解码器LoRA),通过交替优化和选择性梯度传播保持预训练结构。 Result: 在UHD及标准分辨率任务上达到SOTA性能,有效平衡效率、感知质量与重建精度,支持推理时调节α实现保真-感知权衡。 Conclusion: Latent Harmony通过联合正则化潜在空间与高频感知重建,显著提升VAE在超高清图像恢复中的表现。 Abstract: Ultra-High Definition (UHD) image restoration faces a trade-off between computational efficiency and high-frequency detail retention. While Variational Autoencoders (VAEs) improve efficiency via latent-space processing, their Gaussian constraint often discards degradation-specific high-frequency information, hurting reconstruction fidelity. To overcome this, we propose Latent Harmony, a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction.In Stage One, we introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradation perturbations, while latent equivariance strengthens high-frequency reconstruction.Stage Two jointly trains this refined VAE with a restoration model using High-Frequency Low-Rank Adaptation (HF-LoRA): an encoder LoRA guided by a fidelity-oriented high-frequency alignment loss to recover authentic details, and a decoder LoRA driven by a perception-oriented loss to synthesize realistic textures. Both LoRA modules are trained via alternating optimization with selective gradient propagation to preserve the pretrained latent structure.At inference, a tunable parameter {\alpha} enables flexible fidelity-perception trade-offs.Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.[142] The impact of abstract and object tags on image privacy classification
Darya Baranouskaya,Andrea Cavallaro
Main category: cs.CV
TL;DR: 本文探讨了在图像隐私分类任务中,抽象标签和物体标签的有效性,发现当标签数量有限时,抽象标签更有效,而当标签数量较多时,物体标签同样有用。
Details
Motivation: 研究在上下文依赖且主观的图像隐私任务中,哪种类型的标签(物体标签或抽象标签)更适合,并理解标签类型和数量对隐私分类的影响。 Method: 通过比较在不同标签预算下物体标签和抽象标签在图像隐私分类中的表现,分析其有效性。 Result: 当标签预算有限时,抽象标签比物体标签更有效;但在标签数量较多的情况下,物体标签的效果与抽象标签相当。 Conclusion: 标签类型和数量对图像隐私分类性能有显著影响,该发现可指导未来构建更准确的隐私分类器。 Abstract: Object tags denote concrete entities and are central to many computer vision tasks, whereas abstract tags capture higher-level information, which is relevant for tasks that require a contextual, potentially subjective scene understanding. Object and abstract tags extracted from images also facilitate interpretability. In this paper, we explore which type of tags is more suitable for the context-dependent and inherently subjective task of image privacy. While object tags are generally used for privacy classification, we show that abstract tags are more effective when the tag budget is limited. Conversely, when a larger number of tags per image is available, object-related information is as useful. We believe that these findings will guide future research in developing more accurate image privacy classifiers, informed by the role of tag types and quantity.[143] Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN
Chandresh Sutariya,Nitin Singh
Main category: cs.CV
TL;DR: 本文比较了SwinIR(Transformer模型)与轻量级CNN在低光照图像增强任务中的性能与效率,发现尽管SwinIR性能略优,但轻量级CNN在更少训练轮数、更小模型尺寸下达到了接近SOTA的结果,显示出更高的效率。
Details
Motivation: 在低光照图像增强中,如何在恢复高频细节和抑制严重噪声的同时平衡模型性能与计算效率是一个挑战,尤其针对实际应用中的资源限制问题。 Method: 通过在相同任务上对比最先进的SwinIR模型与标准轻量级CNN模型的性能(PSNR)、训练收敛速度和模型大小。 Result: CNN在仅10个epoch后收敛,达到37.4 dB的PSNR;而SwinIR需132个epoch,达到39.03 dB。CNN模型大小比SwinIR小55倍以上。 Conclusion: 标准CNN可在显著降低计算开销的前提下实现接近SOTA的效果,在资源受限的实际场景中具有更强的应用潜力。 Abstract: The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.[144] GraphEnet: Event-driven Human Pose Estimation with a Graph Neural Network
Gaurvi Goyal,Pham Cong Thuong,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi
Main category: cs.CV
TL;DR: 本文提出了一种基于图神经网络GraphEnet的事件相机人体姿态估计方法,利用事件数据的稀疏性和基于线段的中间表示,在低延迟和低功耗条件下实现高频2D人体姿态估计,是首个将图神经网络应用于事件数据进行人体姿态估计的工作。
Details
Motivation: 事件相机具有低延迟和低功耗的优势,适用于资源受限的移动设备和机器人,但目前缺乏高效的基于事件数据的人体姿态估计方法。 Method: 提出GraphEnet模型,采用图神经网络处理事件相机输出的稀疏数据,引入基于线段的事件表示,并结合偏移向量学习范式与置信度池化机制来估计人体关键点位置。 Result: 实现了高频率的单人2D人体姿态估计,有效利用事件数据的时空特征,在资源受限场景下表现出良好的性能。 Conclusion: GraphEnet首次将图神经网络成功应用于事件数据的人体姿态估计,验证了其在高频、低延迟应用中的潜力,为事件相机在人机交互中的应用提供了新思路。 Abstract: Human Pose Estimation is a crucial module in human-machine interaction applications and, especially since the rise in deep learning technology, robust methods are available to consumers using RGB cameras and commercial GPUs. On the other hand, event-based cameras have gained popularity in the vision research community for their low latency and low energy advantages that make them ideal for applications where those resources are constrained like portable electronics and mobile robots. In this work we propose a Graph Neural Network, GraphEnet, that leverages the sparse nature of event camera output, with an intermediate line based event representation, to estimate 2D Human Pose of a single person at a high frequency. The architecture incorporates a novel offset vector learning paradigm with confidence based pooling to estimate the human pose. This is the first work that applies Graph Neural Networks to event data for Human Pose Estimation. The code is open-source at https://github.com/event-driven-robotics/GraphEnet-NeVi-ICCV2025.[145] CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Weihuang Lin,Yiwei Ma,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
Main category: cs.CV
TL;DR: 本文提出了CIR-CoT,首个面向检索任务的多模态大语言模型,通过引入显式的思维链(CoT)推理机制,提升组合图像检索(CIR)的准确性和可解释性。
Details
Motivation: 现有基于VLM和MLLM的CIR方法多为“黑箱”,缺乏可解释性且难以遵循复杂细粒度指令,因此需要一种能进行透明、可控推理的模型。 Method: 设计端到端的CIR-CoT模型,强制生成包含描述、推理和结论三阶段的结构化思维链;构建带CoT标注的新数据集,并微调模型输出结构化推理过程,最终将其编码为专用嵌入用于检索。 Result: 在FashionIQ和CIRR等域内数据集上表现优异,在跨域数据集CIRCO上也展现出强泛化能力,显著提升检索性能与可解释性。 Conclusion: CIR-CoT通过融合显式思维链推理,实现了更准确、可解释和可信赖的组合图像检索,为未来检索系统提供了新方向。 Abstract: Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as ``black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.[146] RayFusion: Ray Fusion Enhanced Collaborative Visual Perception
Shaohong Wang,Bin Lu,Xinyu Xiao,Hanzhi Zhong,Bowen Pang,Tong Wang,Zhiyu Xiang,Hangguan Shan,Eryun Liu
Main category: cs.CV
TL;DR: 提出了一种基于射线的融合方法RayFusion,利用协作方的射线占据信息来减少相机射线上的冗余和误检,提升纯视觉协同感知系统的3D目标检测性能。
Details
Motivation: 由于缺乏明确的深度信息,基于相机的感知系统在深度估计上存在模糊性,难以生成准确的3D检测结果。 Method: 提出RayFusion,通过引入协作车辆的射线占据信息,在相机射线上进行融合,抑制冗余和误检。 Result: 实验表明,该方法在多个数据集上持续优于现有的最先进模型,显著提升了协同视觉感知性能。 Conclusion: RayFusion有效缓解了纯视觉系统中的深度歧义问题,为协作式3D目标检测提供了新的解决方案。 Abstract: Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.[147] RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans
Bheeshm Sharma,Karthikeyan Jaganathan,Balamurugan Palaniappan
Main category: cs.CV
TL;DR: 本文提出了一种名为RASALoRE的两阶段弱监督异常检测框架,用于在缺乏像素级标注的情况下实现脑部MRI中异常的高效、准确检测。
Details
Motivation: 由于脑部MRI中精确的像素级异常标注难以获取,仅依赖切片级等弱标签进行异常检测成为实际应用中的关键挑战。 Method: 第一阶段采用判别性双提示调优(DDPT)机制,基于切片级标签生成高质量的伪弱掩码作为粗略定位线索;第二阶段设计了一个具有区域感知空间注意力机制的分割网络,并结合基于位置的随机嵌入来增强对异常区域的关注。 Result: 该方法在BraTS20、BraTS21、BraTS23和MSD数据集上实现了最先进的异常检测性能,显著优于现有WSAD方法,且参数量少于800万,计算复杂度显著降低。 Conclusion: RASALoRE通过伪掩码生成与区域感知注意力机制的有效结合,在弱监督条件下实现了高性能、低复杂度的脑部MRI异常检测,具备良好的实用性与可扩展性。 Abstract: Weakly Supervised Anomaly detection (WSAD) in brain MRI scans is an important challenge useful to obtain quick and accurate detection of brain anomalies when precise pixel-level anomaly annotations are unavailable and only weak labels (e.g., slice-level) are available. In this work, we propose RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings, a novel two-stage WSAD framework. In the first stage, we introduce a Discriminative Dual Prompt Tuning (DDPT) mechanism that generates high-quality pseudo weak masks based on slice-level labels, serving as coarse localization cues. In the second stage, we propose a segmentation network with a region-aware spatial attention mechanism that relies on fixed location-based random embeddings. This design enables the model to effectively focus on anomalous regions. Our approach achieves state-of-the-art anomaly detection performance, significantly outperforming existing WSAD methods while utilizing less than 8 million parameters. Extensive evaluations on the BraTS20, BraTS21, BraTS23, and MSD datasets demonstrate a substantial performance improvement coupled with a significant reduction in computational complexity. Code is available at: https://github.com/BheeshmSharma/RASALoRE-BMVC-2025/.[148] RetouchLLM: Training-free White-box Image Retouching
Moon Ye-Bin,Roy Miles,Tae-Hyun Oh,Ismail Elezi,Jiankang Deng
Main category: cs.CV
TL;DR: 提出RetouchLLM,一种无需训练、基于代码的白盒图像润饰系统,通过视觉评论器和代码生成器实现可解释、可控的高分辨率图像多步润饰。
Details
Motivation: 现有基于学习的图像润饰方法依赖大规模配对数据且为黑箱模型,缺乏透明性和对用户或图像特定需求的适应性。 Method: 构建包含视觉评论器和代码生成器的双模块框架,视觉评论器分析输入与参考图像差异,代码生成器生成可执行代码进行逐步润饰,无需训练数据。 Result: 实验表明该方法能泛化到多种润饰风格,支持自然语言交互,实现可解释且符合用户意图的控制。 Conclusion: RetouchLLM实现了无需训练、可解释、可控的图像润饰,提升了润饰过程的透明度和个性化调整能力。 Abstract: Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.[149] A class-driven hierarchical ResNet for classification of multispectral remote sensing images
Giulio Weikmann,Gianmarco Perantoni,Lorenzo Bruzzone
Main category: cs.CV
TL;DR: 提出一种多时相、类驱动的分层残差神经网络(ResNet),用于多光谱影像时间序列在不同语义层级上的分类,通过引入分支结构和层次惩罚机制提升分类一致性与细粒度类别判别能力。
Details
Motivation: 为了提升多光谱影像时间序列在多层次语义分类中的准确性,特别是细粒度类别(微类)的识别,并解决训练样本有限下的模型泛化问题。 Method: 设计了一种改进的分层ResNet架构,引入额外分支进行多级分类,利用层次惩罚图抑制不一致的层级转移,并通过类层次标签分层训练网络,使浅层学习宏观类,深层学习微观类。 Result: 在亚马逊森林两个区域的Sentinel-2月度影像上实验表明,该方法在不同层次上均具有良好的泛化能力,能有效学习判别特征,提升微类分类精度,尤其改善了少数类的表示。 Conclusion: 所提出的模块化分层网络能有效建模语义层次结构,提升多时相影像分类性能,具备通过微调适应新任务的能力,适用于样本有限场景。 Abstract: This work presents a multitemporal class-driven hierarchical Residual Neural Network (ResNet) designed for modelling the classification of Time Series (TS) of multispectral images at different semantical class levels. The architecture consists of a modification of the ResNet where we introduce additional branches to perform the classification at the different hierarchy levels and leverage on hierarchy-penalty maps to discourage incoherent hierarchical transitions within the classification. In this way, we improve the discrimination capabilities of classes at different levels of semantic details and train a modular architecture that can be used as a backbone network for introducing new specific classes and additional tasks considering limited training samples available. We exploit the class-hierarchy labels to train efficiently the different layers of the architecture, allowing the first layers to train faster on the first levels of the hierarchy modeling general classes (i.e., the macro-classes) and the intermediate classes, while using the last ones to discriminate more specific classes (i.e., the micro-classes). In this way, the targets are constrained in following the hierarchy defined, improving the classification of classes at the most detailed level. The proposed modular network has intrinsic adaptation capability that can be obtained through fine tuning. The experimental results, obtained on two tiles of the Amazonian Forest on 12 monthly composites of Sentinel 2 images acquired during 2019, demonstrate the effectiveness of the hierarchical approach in both generalizing over different hierarchical levels and learning discriminant features for an accurate classification at the micro-class level on a new target area, with a better representation of the minoritarian classes.[150] Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery Faces
Junyu Shi,Minghui Li,Junguo Zuo,Zhifei Yu,Yipeng Lin,Shengshan Hu,Ziqi Zhou,Yechao Zhang,Wei Wan,Yinzhe Xu,Leo Yu Zhang
Main category: cs.CV
TL;DR: 本文提出了一个面向真实世界的深度伪造人脸数据集RedFace,包含超过60,000张伪造图像和1,000个 manipulated 视频,利用9个商业在线平台生成更贴近现实的深伪内容,以弥补学术评估与实际应用之间的差距。
Details
Motivation: 现有的深度伪造检测基准在真实性、多样性及技术覆盖上不足,难以反映真实世界中的挑战,因此需要一个更贴近实际应用场景的数据集来有效评估检测方法。 Method: RedFace通过整合9个商用在线平台的最新深度伪造技术,并结合定制算法生成多样化的人脸伪造数据,模拟真实世界的黑盒场景,构建了一个大规模、高多样性的数据集。 Result: 在跨域、域内及社交网络传播模拟实验中,现有检测方法在RedFace上的表现显著下降,验证了其对现有检测方案的实际挑战性;同时分析揭示了RedFace相较于传统数据集影响检测性能的原因。 Conclusion: RedFace能够更真实地反映现实中的深度伪造威胁,为深度伪造检测技术提供了更具挑战性和实用性的评估平台,推动该领域向实际应用发展。 Abstract: Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation techniques.To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios.Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. Our dataset is available at: https://github.com/kikyou-220/RedFace.[151] Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection
Shuhai Zhang,ZiHao Lian,Jiahao Yang,Daiyuan Li,Guoxuan Pang,Feng Liu,Bo Han,Shutao Li,Mingkui Tan
Main category: cs.CV
TL;DR: 本文提出了一种基于物理驱动的AI生成视频检测方法NSG-VD,利用概率流守恒原理定义了归一化时空梯度(NSG)统计量,并结合最大均值差异进行检测,在召回率和F1分数上显著优于现有方法。
Details
Motivation: 随着AI生成视频在视觉真实性上的突破,亟需可靠的检测机制;然而现有方法难以建模高维时空动态并捕捉违反物理规律的细微异常。 Method: 基于概率流守恒原理提出NSG统计量,利用预训练扩散模型估计NSG,通过空间梯度近似和运动感知时间建模构建NSG-VD方法,并采用MMD度量检测生成视频。 Result: NSG-VD在Recall上提升16.00%,F1-Score提升10.75%,实验验证其优于当前最先进的检测方法。 Conclusion: NSG-VD通过引入物理约束的时空梯度分析,有效提升了AI生成视频的检测性能,为未来检测技术提供了新的范式。 Abstract: AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.[152] DarkHash: A Data-Free Backdoor Attack Against Deep Hashing
Ziqi Zhou,Menghao Deng,Yufei Song,Hangtao Zhang,Wei Wan,Shengshan Hu,Minghui Li,Leo Yu Zhang,Dezhong Yao
Main category: cs.CV
TL;DR: 本文提出了DarkHash,首个无需训练数据的深度哈希模型后门攻击方法,通过双语义引导的影子后门框架,在不访问原始训练数据的情况下实现高效攻击并保持原检索任务的准确性。
Details
Motivation: 现有深度哈希模型的后门攻击依赖于访问训练数据,但在现实场景中由于隐私和知识产权保护,获取这些数据通常是不可行的。因此,研究无需训练数据即可植入后门且不影响原任务性能的方法具有重要意义。 Method: 提出DarkHash,设计了一种基于代理数据集的双语义引导影子后门攻击框架,仅微调受害者模型的特定层;引入拓扑对齐损失,利用样本与其邻居的关系优化个体和邻近中毒样本向目标样本靠拢,增强攻击效果。 Result: 在四个图像数据集、五种模型架构和两种哈希方法上的实验表明,DarkHash显著优于现有的最先进后门攻击方法,并能有效抵御主流防御手段。 Conclusion: DarkHash首次实现了对深度哈希模型的数据无关后门攻击,在保持原始检索性能的同时展现出强大的攻击能力和鲁棒性,为安全评估和防御机制提供了新挑战与方向。 Abstract: Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.[153] Efficient Label Refinement for Face Parsing Under Extreme Poses Using 3D Gaussian Splatting
Ankit Gahlawat,Anirban Mukherjee,Dinesh Babu Jayagopi
Main category: cs.CV
TL;DR: 提出了一种基于3D高斯点阵的标签优化流程,通过共享几何结构实现多视角一致性,从而在无真实3D标注的情况下提升极端角度下的人脸解析精度。
Details
Motivation: 由于极端视角下标注数据有限且人工标注成本高,现有方法难以实现准确的人脸解析,因此需要一种可扩展且高效的方法来生成高质量训练数据。 Method: 联合拟合两个3D高斯点阵模型,一个用于RGB图像,另一个用于初始分割图,利用共享几何结构实现多视角一致性,并通过少量后处理生成姿态多样的训练数据。 Result: 在仅使用少量初始图像且无需真实3D标注的情况下,该方法在极端头部姿态上显著提升了人脸解析模型的准确性,同时在标准视角上保持良好性能,且优于现有最先进方法。 Conclusion: 该方法为提升现实场景中人脸解析的鲁棒性提供了一个可扩展且有效的解决方案。 Abstract: Accurate face parsing under extreme viewing angles remains a significant challenge due to limited labeled data in such poses. Manual annotation is costly and often impractical at scale. We propose a novel label refinement pipeline that leverages 3D Gaussian Splatting (3DGS) to generate accurate segmentation masks from noisy multiview predictions. By jointly fitting two 3DGS models, one to RGB images and one to their initial segmentation maps, our method enforces multiview consistency through shared geometry, enabling the synthesis of pose-diverse training data with only minimal post-processing. Fine-tuning a face parsing model on this refined dataset significantly improves accuracy on challenging head poses, while maintaining strong performance on standard views. Extensive experiments, including human evaluations, demonstrate that our approach achieves superior results compared to state-of-the-art methods, despite requiring no ground-truth 3D annotations and using only a small set of initial images. Our method offers a scalable and effective solution for improving face parsing robustness in real-world settings.[154] Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation
Eirik A. Østmo,Kristoffer K. Wickstrøm,Keyur Radiya,Michael C. Kampffmeyer,Karl Øyvind Mikalsen,Robert Jenssen
Main category: cs.CV
TL;DR: 本文提出了一种针对CT图像的特定增强技术“随机窗宽”(Random windowing),以解决传统数据增强方法在医学CT图像中导致的伪影和泛化性能差的问题,显著提升了肝脏肿瘤分割模型在低对比度图像上的表现。
Details
Motivation: 现有的深度学习图像增强方法多基于自然图像设计,直接应用于CT图像时会破坏其Hounsfield Unit(HU)的物理意义,导致模型性能下降。因此,需要一种符合CT成像特性的增强方法来提升模型的泛化能力。 Method: 提出“随机窗宽”增强技术,利用CT图像中HU强度的分布特性,在训练过程中动态调整窗宽窗位,模拟不同对比度条件下的图像表现,从而增强模型对对比度变化的鲁棒性。 Result: 在多个数据集上进行了消融实验和分析,结果表明该方法在肝脏肿瘤分割任务中优于现有最先进方法,尤其在对比度不佳或造影时机不理想的图像上显著提升模型性能。 Conclusion: Random windowing是一种适用于CT图像的高效数据增强策略,能够保留HU的物理意义,提升模型在现实临床场景中的鲁棒性和分割精度,具有较强的临床应用潜力。 Abstract: Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.[155] Real-Time Motion-Controllable Autoregressive Video Diffusion
Kesen Zhao,Jiaxin Shi,Beier Zhu,Junbao Zhou,Xiaolong Shen,Yuan Zhou,Qianru Sun,Hanwang Zhang
Main category: cs.CV
TL;DR: 提出AR-Drag,首个结合强化学习的少步自回归视频扩散模型,实现低延迟、高保真的实时图像到视频生成,并支持多样化运动控制。
Details
Motivation: 现有的自回归视频扩散模型在少步生成中存在质量下降和运动伪影问题,且缺乏有效的运动控制机制,难以满足实时性要求。 Method: 首先微调基础I2V模型以支持基本运动控制,然后通过基于轨迹奖励模型的强化学习进一步优化;引入Self-Rollout机制保持马尔可夫性质,并在去噪步骤中选择性引入随机性以加速训练。 Result: AR-Drag在仅1.3B参数下显著降低延迟,视觉保真度高,运动对齐精确,优于现有最先进的可运动控制视频扩散模型。 Conclusion: AR-Drag为实时、可控的视频生成提供了高效解决方案,兼具高性能与低计算开销。 Abstract: Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.[156] Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
Chengzhi Li,Heyan Huang,Ping Jian,Zhen Yang,Yaning Tian
Main category: cs.CV
TL;DR: 本文研究了视频-语言模型(Video-LLMs)在回答重述问题时缺乏逻辑一致性的现象,发现主要原因是跨模态注意力头难以有效区分不同时间戳的视频令牌。为此提出了一种称为“时间条件注意力锐化”(TCAS)的方法,通过增强注意力区分来提升模型的时间分辨率和逻辑一致性。实验表明该方法显著提高了时间逻辑一致性,并在视频时间定位任务中取得性能提升。
Details
Motivation: Video-LLMs在重述问题下响应不一致,影响其可靠性,但原因尚不清楚。本文旨在探究其根本原因并提升模型的时间逻辑一致性。 Method: 采用可解释性驱动的方法,分析并统计可能导致不一致的因素;提出TCAS方法,通过构建基于注意力差异的增强目标,提升跨模态注意力对时间信息的分辨能力。 Result: TCAS显著提升了Video-LLMs的时间逻辑一致性;可解释性分析证实注意力头的时间区分能力增强;在视频时间定位任务中也实现性能提升。 Conclusion: 时间逻辑一致性是视频时间理解的关键瓶颈,TCAS通过增强注意力的时间分辨能力有效改善一致性,推动了视频时间理解的发展。 Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.[157] UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du,Menghan Xia,Chang Liu,Quande Liu,Xintao Wang,Pengfei Wan,Xiangyang Ji
Main category: cs.CV
TL;DR: 提出首个统一的生成式视频超分框架UniMMVSR,支持文本、图像和视频等多模态条件输入,显著提升视频细节与条件一致性,并实现4K多模态视频生成。
Details
Motivation: 现有视频超分辨率方法主要局限于文本到视频任务,缺乏对多种生成条件(如图像、视频)的利用,难以满足多模态视频生成中对高保真度的需求。 Method: 在潜在视频扩散模型中引入混合模态条件注入机制,设计了针对不同条件类型的数据构造和训练策略,探索了条件注入方式、训练方案和数据混合技术。 Result: 实验表明,UniMMVSR在生成视频的细节质量和多模态条件一致性方面显著优于现有方法,并成功结合基础模型实现了4K分辨率的多模态引导视频生成。 Conclusion: UniMMVSR是首个支持多模态条件的统一生成式视频超分框架,有效解决了多条件融合与利用难题,推动了高分辨率多模态视频生成的发展。 Abstract: Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.[158] Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing
Zhentao Zou,Zhengrong Yue,Kunpeng Du,Binlei Bao,Hanting Li,Haizhen Xie,Guozheng Xu,Yue Zhou,Yali Wang,Jie Hu,Xue Jiang,Xinghao Chen
Main category: cs.CV
TL;DR: 提出MURE框架,通过文本-图像交错的多模态推理链提升图像编辑的细粒度和空间准确性,并引入MMDC机制减少大模型幻觉。
Details
Motivation: 现有基于纯文本或坐标增强的思维链方法难以准确表达复杂视觉布局和精细空间关系,导致图像编辑中缺乏足够的视觉线索指导像素级生成。 Method: 提出MURE框架,采用原生多模态交错推理链(文本描述后接视觉提示如位置掩码或新内容表示),并将复杂任务分解为相互依赖的子任务;引入MMDC推理范式,通过奖励模型的深度置信评分剪枝低质量推理路径。 Result: 在三个图像编辑基准上显著优于现有方法,实现了更高保真度的编辑结果;发布了包含14K高质量样本的CoT-Edit-14K数据集。 Conclusion: MURE框架通过融合文本与视觉推理链,有效提升了复杂场景下图像编辑的精确性和可靠性,MMDC机制进一步保障了推理路径的质量。 Abstract: Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defined intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.[159] Robust Canonicalization through Bootstrapped Data Re-Alignment
Johann Schmidt,Sebastian Stober
Main category: cs.CV
TL;DR: 提出一种自举算法,通过迭代重新对齐训练样本,逐步减少方差并恢复对齐假设,在细粒度视觉分类任务中优于等变和规范化基线方法。
Details
Motivation: 现有方法依赖强数据增强或等变架构,存在模型复杂或表达受限的问题;而基于对齐假设的规范化方法在真实数据上因缺乏对齐而表现脆弱。 Method: 提出一种迭代自举算法,通过逐步降低样本间方差来重新对齐训练数据,恢复规范化所需的对齐假设,并在任意紧群下提供收敛性保证。 Result: 在四个细粒度视觉分类基准上验证了该方法的有效性,性能持续优于等变和规范化基线,与数据增强方法相当。 Conclusion: 该方法为处理几何偏差提供了一种有效且鲁棒的替代方案,无需强增强或限制模型结构,适用于现实世界未对齐的细粒度图像数据。 Abstract: Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant, and canonicalization baselines while performing on par with augmentation.[160] InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
Haoran Yu,Yi Shi
Main category: cs.CV
TL;DR: 提出InstructUDrag框架,结合文本指令与对象拖拽,实现高保真、灵活的图像编辑。
Details
Motivation: 现有文本生成图像方法在精确对象定位上存在不足,而对象拖拽方法局限于静态重定位,缺乏语义控制。 Method: 将对象拖拽视为图像重建过程,设计双分支结构:移动重建分支利用基于能量的梯度引导精确定位;文本驱动编辑分支共享梯度信号以实现属性精细控制,并结合DDPM反演和先验注入保持结构完整性。 Result: 实验表明该方法在对象重定位精度和语义内容控制方面均优于现有方法,支持复杂场景下的灵活编辑。 Conclusion: InstructUDrag有效融合了文本编辑与对象拖拽的优势,实现了兼具精准定位与语义可控的高质量图像编辑。 Abstract: Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.[161] Fine-grained text-driven dual-human motion generation via dynamic hierarchical interaction
Mu Li,Yin Wang,Zhiying Leng,Jiapeng Liu,Frederick W. B. Li,Xiaohui Liang
Main category: cs.CV
TL;DR: 提出了一种细粒度双人动作生成方法FineDual,通过三阶段模型从个体到个体间动态建模人类交互的层次性与距离变化。
Details
Motivation: 现有方法大多忽略交互中的距离变化和层次结构,无法充分建模动态且分层的人类交互过程。 Method: 采用三阶段方法:第一阶段利用大语言模型分解文本并实现个体层面的文本-动作对齐;第二阶段通过交互距离预测器和交互感知图网络动态建模个体间交互;第三阶段利用整体文本特征指导动作特征优化,提升生成质量。 Result: 在双人动作数据集上的实验表明,FineDual在定量和定性评估中均优于现有方法,能更有效地建模动态层次化的人类交互。 Conclusion: FineDual通过引入动态距离建模和层次化三阶段架构,显著提升了双人动作生成的质量和合理性。 Abstract: Human interaction is inherently dynamic and hierarchical, where the dynamic refers to the motion changes with distance, and the hierarchy is from individual to inter-individual and ultimately to overall motion. Exploiting these properties is vital for dual-human motion generation, while existing methods almost model human interaction temporally invariantly, ignoring distance and hierarchy. To address it, we propose a fine-grained dual-human motion generation method, namely FineDual, a tri-stage method to model the dynamic hierarchical interaction from individual to inter-individual. The first stage, Self-Learning Stage, divides the dual-human overall text into individual texts through a Large Language Model, aligning text features and motion features at the individual level. The second stage, Adaptive Adjustment Stage, predicts interaction distance by an interaction distance predictor, modeling human interactions dynamically at the inter-individual level by an interaction-aware graph network. The last stage, Teacher-Guided Refinement Stage, utilizes overall text features as guidance to refine motion features at the overall level, generating fine-grained and high-quality dual-human motion. Extensive quantitative and qualitative evaluations on dual-human motion datasets demonstrate that our proposed FineDual outperforms existing approaches, effectively modeling dynamic hierarchical human interaction.[162] Adaptive Gradient Calibration for Single-Positive Multi-Label Learning in Remote Sensing Image Scene Classification
Chenying Liu,Gianmarco Perantoni,Lorenzo Bruzzone,Xiao Xiang Zhu
Main category: cs.CV
TL;DR: 本文提出了一种面向遥感图像的单正多标签学习(SPML)新框架AdaGC,通过自适应梯度校准和伪标签生成机制,在不完全标注下实现准确的多标签分类。
Details
Motivation: 由于遥感图像标注成本高,获取完整的多标签注释困难,而现有SPML方法在遥感场景中研究不足,因此需要专门针对遥感数据的高效SPML解决方案。 Method: 提出Adaptive Gradient Calibration (AdaGC),结合梯度校准机制、Mixup增强和双指数移动平均(EMA)模块生成鲁棒伪标签,并设计基于训练动态的自适应触发指标,在预热阶段后启动GC以减少对标签噪声的过拟合。 Result: 在两个遥感基准数据集和两种标签噪声类型下,AdaGC实现了最先进的性能,并表现出强大的鲁棒性。 Conclusion: AdaGC是一种有效且通用的SPML框架,能够显著提升遥感图像在不完全标注下的多标签分类性能,具有实际应用潜力。 Abstract: Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.[163] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Haipeng Liu,Yang Wang,Meng Wang
Main category: cs.CV
TL;DR: 本文提出了一种名为NTN-Diff的新型文本引导图像修复方法,通过在去噪过程中解耦中低频频带并分阶段处理,有效解决了修复区域与未遮罩区域间的语义一致性及未遮罩区域保持的长期挑战。
Details
Motivation: 现有方法在文本引导图像修复中难以同时保持未遮罩区域和实现遮罩与未遮罩区域间的语义一致性,主要由于编码不同图像属性的中低频带相互纠缠,且对文本提示的鲁棒性不同。 Method: 提出NTN-Diff模型,基于扩散过程,将去噪分为早期和晚期阶段,在早期利用稳定中频带指导无文本去噪以恢复低频信息,并在晚期进行文本引导去噪,实现跨区域的中低频语义一致性,同时保护未遮罩区域。 Result: 实验表明,NTN-Diff在多个指标上优于当前最先进的文本引导扩散模型,在语义一致性和未遮罩区域保持方面表现突出。 Conclusion: NTN-Diff通过频带解耦和分阶段去噪策略,有效解决了文本引导图像修复中的关键挑战,显著提升了修复质量与一致性。 Abstract: Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.[164] A Multimodal Depth-Aware Method For Embodied Reference Understanding
Fevziye Irem Eyiokur,Dogucan Yaman,Hazım Kemal Ekenel,Alexander Waibel
Main category: cs.CV
TL;DR: 提出了一种新的ERU框架,结合LLM数据增强、深度图模态和深度感知决策模块,有效提升在复杂环境中基于语言和指向线索的参考对象理解能力。
Details
Motivation: 现有方法在存在多个候选对象的模糊场景中表现不佳,难以准确识别目标物体。 Method: 提出一种新型ERU框架,联合利用基于大语言模型的数据增强、深度图模态以及深度感知决策模块,实现语言与具身线索的鲁棒融合。 Result: 在两个数据集上的实验表明,该方法显著优于现有基线方法,实现了更准确和可靠的指代表达理解。 Conclusion: 所提出的ERU框架通过多模态信息协同,有效提升了复杂场景下的参考对象识别性能。 Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.[165] Learning Neural Exposure Fields for View Synthesis
Michael Niemeyer,Fabian Manhardt,Marie-Julie Rakotosaona,Michael Oechsle,Christina Tsalicoglou,Keisuke Tateno,Jonathan T. Barron,Federico Tombari
Main category: cs.CV
TL;DR: 本文提出了Neural Exposure Fields (NExF),一种从具有强烈曝光变化的现实世界图像中鲁棒重建高质量、3D一致外观场景的新方法。
Details
Motivation: 现有神经场景表示在处理包含显著单图曝光变化(如室内外混合场景或带窗户房间)的真实数据时,重建质量下降明显,缺乏对高动态范围场景的有效建模。 Method: 提出学习一个预测每个3D点最优曝光值的神经场,并与神经场景表示联合优化;通过新的神经条件机制实现3D空间中的曝光优化,而非传统按图像或像素进行。 Result: 该方法在多个真实世界挑战性数据集上实现了优于先前方法55%以上的性能提升,训练速度更快,并实现了无需后期处理或多曝光输入的高动态范围视图合成。 Conclusion: NExF通过将曝光建模引入3D神经场并实现联合优化,显著提升了复杂光照条件下场景重建与视图合成的质量和稳定性,为高动态范围场景提供了有效解决方案。 Abstract: Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. In the core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need of post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.[166] LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation
Cilin Yan,Jingyun Wang,Guoliang Kang
Main category: cs.CV
TL;DR: 本文提出了一种有效的长时序上下文注意力机制(LTCA),用于指代表达视频分割(RVOS),通过稀疏局部注意力和全局查询来平衡局部与全局信息,在多个基准上实现了最先进的性能。
Details
Motivation: 现有方法在处理视频中的长时序上下文信息时难以平衡局部性与全局性,且计算复杂度随视频长度显著增加。 Method: 提出长时序上下文注意力(LTCA)机制,结合堆叠的空洞窗口注意力和随机全局键选择,并引入全局查询以增强全局上下文建模能力。 Result: 在四个RVOS基准上达到SOTA,MeViS val u和val数据集上分别提升11.3%和8.3%。 Conclusion: LTCA有效平衡了局部与全局上下文建模,提升了RVOS性能,同时控制了计算复杂度。 Abstract: Referring Video Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computation complexity significantly increases with the increase of video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. Firstly, we stack sparse local attentions to balance the locality and globality. We design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance the globality. Secondly, we design a global query to interact with all the other queries to directly encode the global context information. Experiments show our method achieves new state-of-the-art on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS valu and val datasets respectively.[167] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Yu Huang,Zelin Peng,Changsong Wen,Xiaokang Yang,Wei Shen
Main category: cs.CV
TL;DR: 提出一种基于跨模态亲和力迁移的3D功能分割方法,利用2D视觉基础模型的语义知识提升3D特征表示,实现更精确的功能区域分割。
Details
Motivation: 现有3D功能分割方法多依赖点云编码器,忽视了3D数据稀疏性、噪声和几何模糊等问题,导致功能边界不清晰且语义不一致。 Method: 提出语义锚定学习范式,通过跨模态亲和力迁移(CMAT)将大规模2D视觉基础模型的语义知识迁移到3D领域,并设计跨模态功能分割Transformer(CAST),结合多模态提示与预训练特征生成精准分割图。 Result: 在标准基准上实验表明,该方法在3D功能分割任务中达到新的最先进性能。 Conclusion: 通过引入2D视觉基础模型的语义知识,有效提升了3D功能分割的语义一致性和边界精度,为机器人操作和具身AI等应用提供了更强的支持。 Abstract: Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, We introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.[168] LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Yushi Huang,Xingtong Ge,Ruihao Gong,Chengtao Lv,Jun Zhang
Main category: cs.CV
TL;DR: 本文提出了LinVideo,一种高效的数据无关后训练框架,用于在保持视频扩散模型生成质量的同时,将部分自注意力模块替换为线性注意力,实现显著加速。
Details
Motivation: 视频扩散模型因自注意力机制的二次计算复杂度导致计算成本高昂,而直接使用线性注意力会因表达能力不足和时空建模复杂性影响性能,因此需要一种高效且保持性能的替代方案。 Method: 提出选择性迁移(selective transfer)方法,将层替换问题建模为二分类任务,自动渐进地将可替换的自注意力层转为线性注意力;并设计了一种任意时间分布匹配(ADM)目标函数,以在采样轨迹上对齐样本分布,提升迁移效率与效果。 Result: 实验表明,该方法实现了1.25-2.00倍的加速,同时保持生成质量;进一步通过4步蒸馏模型实现了15.92倍的延迟降低,仅伴随轻微视觉质量下降。 Conclusion: LinVideo能够在不牺牲生成质量的前提下,有效降低视频扩散模型的计算成本,为高效视频生成提供了一种实用的后训练优化方案。 Abstract: Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.[169] Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Nikos Theodoridis,Tim Brophy,Reenu Mohandas,Ganesh Sistu,Fiachra Collins,Anthony Scanlan,Ciaran Eising
Main category: cs.CV
TL;DR: 本文提出了一个专注于交通场景感知的视觉问答基准DTPQA,用于评估小型视觉语言模型在近远距离感知能力上的表现,发现现有小模型在感知任务上显著落后于人类,尤其在左右区分等任务上存在挑战。
Details
Motivation: 为了在自动驾驶等安全关键应用中可靠地使用视觉语言模型,需要具备可靠的长距离和短距离感知能力,而现有模型可能存在‘近视’问题,因此需要专门的基准来评估其真实感知性能。 Method: 提出一个新的视觉问答基准DTPQA,聚焦于交通场景中的感知类问题,并加入距离标注;排除需复杂推理的问题以专注评估感知能力;在多个最先进的小型视觉语言模型上进行评测,并与人类表现对比。 Result: 实验表明,尽管问题简单,当前最优的小型VLM平均准确率约为60%,显著低于人类的约85%;同时发现某些感知任务(如区分左右)对模型仍极具挑战;但人类样本量较小,统计上有一定局限。 Conclusion: 现有小型视觉语言模型在交通场景的感知能力,尤其是在远距离和特定空间判断任务上仍有明显不足,需进一步改进以满足自动驾驶的安全需求。 Abstract: Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.[170] SPICE: Simple and Practical Image Clarification and Enhancement
Alexander Belyaev,Pierre-Alain Fayolle,Michael Cohen
Main category: cs.CV
TL;DR: 提出了一种简单高效的方法来增强和清晰化低光照和雾霾条件下的图像,通过构建模拟和逆向滤波器减少失真,实验表明该方法在处理极暗和雾霾图像上优于现有技术,且易于实现。
Details
Motivation: 解决低光照和复杂雾霾条件下图像质量下降的问题,提升图像可见性和细节表现。 Method: 构建模拟低光或雾霾条件的图像滤波器,并推导近似逆向滤波器以最小化增强图像中的失真。 Result: 实验结果显示该方法在处理极暗图像和雾霾图像方面具有竞争力,通常优于最先进的技术。 Conclusion: 该方法因简单高效,仅需几行MATLAB代码即可实现,具有良好的应用前景。 Abstract: We introduce a simple and efficient method to enhance and clarify images. More specifically, we deal with low light image enhancement and clarification of hazy imagery (hazy/foggy images, images containing sand dust, and underwater images). Our method involves constructing an image filter to simulate low-light or hazy conditions and deriving approximate reverse filters to minimize distortions in the enhanced images. Experimental results show that our approach is highly competitive and often surpasses state-of-the-art techniques in handling extremely dark images and in enhancing hazy images. A key advantage of our approach lies in its simplicity: Our method is implementable with just a few lines of MATLAB code.[171] Hyperspectral data augmentation with transformer-based diffusion models
Mattia Ferrari,Lorenzo Bruzzone
Main category: cs.CV
TL;DR: 提出一种基于引导扩散模型的数据增强技术,结合轻量级Transformer网络和改进的损失函数,在小样本高光谱图像森林分类任务中实现了优于其他方法的性能。
Details
Motivation: 深度学习在高光谱图像分类中面临小样本训练易过拟合的问题,需要有效数据增强方法提升模型泛化能力。 Method: 采用引导扩散模型进行数据增强,设计轻量级Transformer网络捕捉复杂数据模式,引入加权损失函数和优化的余弦方差调度器以加速小数据集上的训练。 Result: 在PRISMA卫星获取的10类森林分类任务中,该方法在平均准确率和加权平均准确率上均优于其他数据增强技术,且表现出稳定的训练行为。 Conclusion: 所提方法能有效缓解小样本下深度学习模型的过拟合问题,提升高光谱图像分类性能,具有较强的实用性。 Abstract: The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.[172] UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei,Quande Liu,Zixuan Ye,Qiulin Wang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhu Chen
Main category: cs.CV
TL;DR: UniVideo是一个统一的视频生成与编辑框架,通过双流设计结合多模态大语言模型(MLLM)和多模态DiT(MMDiT),支持多种视频任务,并展现出跨任务组合与零样本迁移能力。
Details
Motivation: 现有的统一多模态模型主要局限于图像领域,缺乏对视频生成与编辑的全面支持,因此需要一个能够统一处理多种视频任务的通用框架。 Method: 采用双流架构,其中MLLM负责理解多模态指令,MMDiT负责视频生成;在统一的多模态指令范式下联合训练多个视频任务。 Result: 在文本/图像到视频生成、上下文内视频生成与编辑等任务上达到或超越现有专用模型的表现,并能实现任务组合和从未见过的自由形式视频编辑(如抠像、材质替换)的零样本迁移。 Conclusion: UniVideo成功将统一多模态建模扩展到视频领域,具备良好的泛化能力和应用潜力,推动了多模态视频内容生成的发展。 Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.[173] Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning
Sofia Kirsanova,Yao-Yi Chiang,Weiwei Duan
Main category: cs.CV
TL;DR: 提出一种结合LayoutLMv3和GPT-4o的方法,用于自动提取历史地图图例并关联符号与描述,通过结构化提示显著提升性能。
Details
Motivation: 历史地图图例的不一致布局和非结构化格式导致自动提取困难,现有方法在符号与描述的结构化匹配方面表现不足。 Method: 结合LayoutLMv3进行版面检测,并利用GPT-4o通过上下文学习和结构化JSON提示进行图例项及其描述的检测与链接,基于边界框预测实现匹配。 Result: 该方法在实验中达到88%的F1分数和85%的IoU,优于基线模型,并揭示了提示设计、示例数量和版面对齐对性能的影响。 Conclusion: 所提方法支持可扩展、具备版面感知能力的图例解析,能有效提升多种视觉风格下历史地图的索引与搜索能力。 Abstract: Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.[174] Robust Source-Free Domain Adaptation for Medical Image Segmentation based on Curriculum Learning
Ziqi Zhang,Yuexiang Li,Yawen Huang,Nanjun He,Tao Xu,Liwei Lin,Yefeng Zheng,Shaoxin Li,Feiyue Huang
Main category: cs.CV
TL;DR: 提出一种基于课程学习的无源域自适应框架(LFC),通过易到难和源到目标的双课程策略,提升医学图像分割中的知识迁移效果,在多个跨域数据集上达到最优性能。
Details
Motivation: 现有无源域自适应方法主要关注目标域伪标签优化,忽略学习过程设计;而渐进式学习有助于知识迁移,因此需构建更合理的自适应学习流程。 Method: 提出学习从课程(LFC)框架,包含易到难课程和源到目标课程:前者从简单样本开始逐步增加难度,后者稳定模型从源域到目标域的迁移过程,实现渐进式知识转移。 Result: 在眼底图像分割和息肉分割的跨域数据集上进行实验,结果表明该方法优于现有方法,取得了新的最先进性能。 Conclusion: 所提出的LFC框架通过引入课程学习机制,有效提升了无源域自适应下的模型性能,验证了渐进式学习在医学图像分析中的重要价值。 Abstract: Recent studies have uncovered a new research line, namely source-free domain adaptation, which adapts a model to target domains without using the source data. Such a setting can address the concerns on data privacy and security issues of medical images. However, current source-free domain adaptation frameworks mainly focus on the pseudo label refinement for target data without the consideration of learning procedure. Indeed, a progressive learning process from source to target domain will benefit the knowledge transfer during model adaptation. To this end, we propose a curriculum-based framework, namely learning from curriculum (LFC), for source-free domain adaptation, which consists of easy-to-hard and source-to-target curricula. Concretely, the former curriculum enables the framework to start learning with `easy' samples and gradually tune the optimization direction of model adaption by increasing the sample difficulty. While, the latter can stablize the adaptation process, which ensures smooth transfer of the model from the source domain to the target. We evaluate the proposed source-free domain adaptation approach on the public cross-domain datasets for fundus segmentation and polyp segmentation. The extensive experimental results show that our framework surpasses the existing approaches and achieves a new state-of-the-art.[175] VideoVerse: How Far is Your T2V Generator from a World Model?
Zeqing Wang,Xinyu Wei,Bairui Li,Zhen Guo,Jinrui Zhang,Hongyang Wei,Keze Wang,Lei Zhang
Main category: cs.CV
TL;DR: 本文提出了VideoVerse,一个用于评估文本到视频生成模型的综合基准,重点在于检验模型对复杂时间因果关系和现实世界知识的理解能力。
Details
Motivation: 现有的T2V生成模型评估基准在评价先进模型时存在不足,缺乏对事件级时间因果性和世界知识的系统评估。 Method: 收集涵盖多个领域的代表性视频,提取具有时间因果性的事件描述,并由独立标注员转化为文本提示;设计基于动态与静态属性的二元评估问题,共十个评估维度。 Result: 构建了包含300个精心策划提示、815个事件和793个二元评估问题的VideoVerse基准,并开发了基于视觉-语言模型的QA评估流程。 Conclusion: 通过对先进T2V模型的系统评估,揭示了当前模型在实现真正‘世界模型’方面仍存在的差距。 Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to build ``world models'', makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which are essential capabilities for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human preference aligned QA-based evaluation pipeline is developed by using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis on how far the current T2V generators are from world models.[176] Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Kaiwen Zheng,Yuji Wang,Qianli Ma,Huayu Chen,Jintao Zhang,Yogesh Balaji,Jianfei Chen,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang
Main category: cs.CV
TL;DR: 本文提出了score-regularized连续一致性模型(rCM),通过引入得分蒸馏作为长跳跃正则化项,解决了大规模图像和视频扩散模型中连续一致性模型(sCM)在细粒度生成上的质量缺陷,实现了高质量、高多样性且快速的生成(1~4步),在140亿参数模型和5秒视频上验证有效。
Details
Motivation: 尽管连续时间一致性模型(sCM)在加速学术级扩散模型方面表现出色,但其在大规模文本到图像和视频任务中的应用受限于JVP计算的基础设施挑战和评估基准的不足,且存在生成细节质量下降的问题。 Method: 开发了支持并行化的FlashAttention-2 JVP内核以支持大模型训练,并提出rCM模型,将得分蒸馏作为长跳跃连接正则化项,结合sCM的'模式覆盖'前向散度与得分蒸馏的'模式寻找'反向散度,提升生成质量与多样性。 Result: 在高达140亿参数的大规模模型(如Cosmos-Predict2、Wan2.1)和5秒视频任务上,rCM在质量指标上达到或超过最先进的DMD2方法,同时显著提升生成多样性,无需GAN调参或大量超参数搜索;蒸馏后模型仅需1~4步即可生成高保真样本,采样速度提升15~50倍。 Conclusion: rCM是一种实用且理论扎实的大规模扩散蒸馏框架,有效克服了sCM在细节数生成中的局限性,为工业级图像和视频生成提供了高效解决方案。 Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.[177] Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Andrew Lee,Ian Chuang,Dechen Gao,Kai Fukazawa,Iman Soltani
Main category: cs.CV
TL;DR: 本文提出了一种名为“Gaze on the Prize”的视觉强化学习框架,通过引入可学习的中央凹注意力机制(Gaze)和基于回报差异的自监督信号(Prize),提升样本效率和学习稳定性。
Details
Motivation: 视觉强化学习智能体需从高维图像中学习决策,但其中仅少数像素与任务相关,导致探索和计算资源浪费,学习效率低且不稳定。受人类视觉注意机制启发,作者希望引导智能体关注任务相关的图像区域。 Method: 提出一种返回引导的对比学习方法,将具有相似视觉表征但不同回报的状态分为正负样本,构建对比三元组,训练注意力机制聚焦于影响任务成败的关键特征。注意力机制由经验中追求高回报的自监督信号指导。 Result: 在ManiSkill3操作任务基准上验证,该方法最多实现2.4倍的样本效率提升,并能解决基线方法无法学习的任务,且无需修改底层算法或超参数。 Conclusion: Gaze on the Prize通过返回引导的对比注意力机制有效识别任务相关特征,显著提升视觉强化学习的样本效率和性能,具有良好的通用性和实用性。 Abstract: Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.[178] Hierarchical Spatial Algorithms for High-Resolution Image Quantization and Feature Extraction
Noor Islam S. Mohammad
Main category: cs.CV
TL;DR: 本研究提出了一种用于空间图像处理的模块化框架,集成了灰度量化、色彩与亮度增强、图像锐化、双向变换流程和几何特征提取,实验表明该框架在多种数据集上具有鲁棒性和实时应用潜力。
Details
Motivation: 为了提升图像处理的自动化与精确性,需要一个集成多种处理功能的模块化框架,以支持复杂计算机视觉任务中的实时分析需求。 Method: 采用灰度级量化(8级)、RGB和YCrCb空间的直方图均衡化进行色彩增强,HSV值通道调整亮度,3×3卷积核实现锐化,并构建包含非锐化掩模、伽马校正和噪声放大在内的双向变换流程;几何特征提取结合Canny边缘检测、Hough直线估计、Harris角点检测和形态学定位。 Result: 双向变换流程在前向和反向过程中的准确率分别为76.10%和74.80%;台球杆对齐角度估计为51.50°;提示区域分割与真实图像相似度达81.87%。 Conclusion: 所提出的模块化框架在保持结构细节的同时实现了有效的图像增强与特征提取,表现出良好的鲁棒性和确定性,适用于实时图像分析和计算机视觉应用。 Abstract: This study introduces a modular framework for spatial image processing, integrating grayscale quantization, color and brightness enhancement, image sharpening, bidirectional transformation pipelines, and geometric feature extraction. A stepwise intensity transformation quantizes grayscale images into eight discrete levels, producing a posterization effect that simplifies representation while preserving structural detail. Color enhancement is achieved via histogram equalization in both RGB and YCrCb color spaces, with the latter improving contrast while maintaining chrominance fidelity. Brightness adjustment is implemented through HSV value-channel manipulation, and image sharpening is performed using a 3 * 3 convolution kernel to enhance high-frequency details. A bidirectional transformation pipeline that integrates unsharp masking, gamma correction, and noise amplification achieved accuracy levels of 76.10% and 74.80% for the forward and reverse processes, respectively. Geometric feature extraction employed Canny edge detection, Hough-based line estimation (e.g., 51.50{\deg} for billiard cue alignment), Harris corner detection, and morphological window localization. Cue isolation further yielded 81.87\% similarity against ground truth images. Experimental evaluation across diverse datasets demonstrates robust and deterministic performance, highlighting its potential for real-time image analysis and computer vision.[179] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Zhenlong Yuan,Xiangyan Qu,Chengxuan Qian,Rui Chen,Jing Tang,Lei Sun,Xiangxiang Chu,Dapeng Zhang,Yiwei Wang,Yujun Cai,Shuo Li
Main category: cs.CV
TL;DR: 提出Video-STAR框架,通过子动作分解与工具增强的强化学习实现开放词汇动作识别,提升细粒度区分与跨模态推理能力。
Details
Motivation: 现有MLLM在开放词汇场景中因文本先验依赖难以区分语义相似动作,需增强视觉 grounded 推理能力。 Method: 将动作分解为判别性子动作,结合领域特定工具进行跨模态交错,并设计分层奖励机制引导强化学习。 Result: 在HMDB-51、UCF-101、SSv2、Kinetics-400/600上达到SOTA,显著提升细粒度动作区分与抗跨模态幻觉能力。 Conclusion: Video-STAR有效实现了从文本中心到视觉锚定推理的转变,具备强鲁棒性与泛化性。 Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.[180] The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Onur Keleş,Aslı Özyürek,Gerardo Ortega,Kadir Gökgö,Esam Ghaleb
Main category: cs.CV
TL;DR: 本文提出了一个名为“视觉象似性挑战”的新基准,用于评估视觉-语言模型在手语中的象似性理解能力,涵盖语音形式预测、透明度和象似性评分三个任务,并发现当前模型在这些任务上仍显著落后于人类表现。
Details
Motivation: 由于手语中普遍存在形式与意义之间的象似性(iconicity),为视觉 grounding 提供了天然的测试平台,但现有视觉-语言模型难以从动态人体动作中恢复这种映射关系,因此需要新的评估方法。 Method: 作者构建了一个基于视频的基准测试,采用心理语言学指标,评估13种最先进的视觉-语言模型在荷兰手语上的零样本和少样本表现,并与人类基线进行比较,任务包括语音形式预测、透明度判断和象似性评分。 Result: 模型在语音形式预测上能部分恢复手势和位置信息,但性能低于人类;在透明度任务上远逊于人类;仅顶级模型在象似性评分上与人类有中等程度相关性,且语音预测能力强的模型更接近人类的象似性判断。 Conclusion: 该研究验证了所提诊断任务的有效性,表明当前VLMs在理解手语象似性方面仍有局限,建议未来引入更多以人为中心的信号和具身学习方法以提升多模态模型的视觉 grounding 能力。 Abstract: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the \textit{Visual Iconicity Challenge}, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess $13$ state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On \textit{phonological form prediction}, VLMs recover some handshape and location detail but remain below human performance; on \textit{transparency}, they are far from human baselines; and only top models correlate moderately with human \textit{iconicity ratings}. Interestingly, \textit{models with stronger phonological form prediction correlate better with human iconicity judgment}, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.[181] InstructX: Towards Unified Visual Editing with MLLM Guidance
Chong Mou,Qichao Sun,Yanze Wu,Pengze Zhang,Xinghui Li,Fulong Ye,Songtao Zhao,Qian He
Main category: cs.CV
TL;DR: 本文提出了InstructX,一个用于图像和视频编辑的统一框架,通过多模态大语言模型(MLLM)与扩散模型的结合,实现了在指令驱动下的多样化编辑任务,并展示了图像数据训练对视频编辑的零样本迁移能力。
Details
Motivation: 现有研究缺乏对MLLM设计选择的深入分析,且MLLM与扩散模型在视频编辑等复杂任务中的融合仍具挑战性。 Method: 提出InstructX框架,系统研究MLLM与扩散模型的集成方法,利用图像数据训练实现对视频编辑的泛化,并引入模态特定特征以统一处理图像和视频编辑任务。 Result: 实验证明该方法在多种图像和视频编辑任务上达到最先进的性能,且无需额外视频监督即可实现视频编辑能力。 Conclusion: InstructX有效统一了图像和视频编辑,通过跨模态学习缓解了视频数据稀缺的问题,为未来多模态编辑系统提供了可行路径。 Abstract: With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.[182] MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
Lu Liu,Chunlei Cai,Shaocheng Shen,Jianfeng Liang,Weimin Ouyang,Tianxiao Ye,Jian Mao,Huiyu Duan,Jiangchao Yao,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
Main category: cs.CV
TL;DR: 提出MoA-VR,一种基于多智能体协同的通用视频恢复系统,通过降质识别、自适应路由和质量评估模块实现对复杂和复合降质的有效处理。
Details
Motivation: 现有视频恢复方法通常依赖人工选择模型或单一架构,难以应对真实场景中多样且复杂的复合降质问题,缺乏通用性和智能化决策能力。 Method: 构建包含降质识别、路由与恢复、质量评估三个协同智能体的MoA-VR系统:使用视觉-语言模型进行降质识别,基于大语言模型的自适应路由器选择恢复策略,并构建Res-VQ数据集及相应VLM-based模型进行恢复质量评估。 Result: 在多个客观指标和感知质量上显著优于现有基线方法,能够有效处理多种复杂和复合降质情况。 Conclusion: MoA-VR展示了多模态智能与模块化推理在通用视频恢复系统中的潜力,为未来智能化、自动化视频恢复提供了新范式。 Abstract: Real-world videos often suffer from complex degradations, such as noise, compression artifacts, and low-light distortions, due to diverse acquisition and transmission conditions. Existing restoration methods typically require professional manual selection of specialized models or rely on monolithic architectures that fail to generalize across varying degradations. Inspired by expert experience, we propose MoA-VR, the first \underline{M}ixture-\underline{o}f-\underline{A}gents \underline{V}ideo \underline{R}estoration system that mimics the reasoning and processing procedures of human professionals through three coordinated agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. Specifically, we construct a large-scale and high-resolution video degradation recognition benchmark and build a vision-language model (VLM) driven degradation identifier. We further introduce a self-adaptive router powered by large language models (LLMs), which autonomously learns effective restoration strategies by observing tool usage patterns. To assess intermediate and final processed video quality, we construct the \underline{Res}tored \underline{V}ideo \underline{Q}uality (Res-VQ) dataset and design a dedicated VLM-based video quality assessment (VQA) model tailored for restoration tasks. Extensive experiments demonstrate that MoA-VR effectively handles diverse and compound degradations, consistently outperforming existing baselines in terms of both objective metrics and perceptual quality. These results highlight the potential of integrating multimodal intelligence and modular reasoning in general-purpose video restoration systems.[183] To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Jiayun Luo,Wan-Cyuan Fan,Lyuyang Wang,Xiangteng He,Tanzila Rahman,Purang Abolmaesumi,Leonid Sigal
Main category: cs.CV
TL;DR: 本文提出了ViT注意力汇聚点(ViT attention sinks)的概念,即来自视觉Transformer的高范数视觉令牌,这些令牌包含图像中的高层语义信息,并对大视觉语言模型(LVLMs)的视觉理解与推理能力有重要影响。作者通过定性和定量分析展示了这些汇聚点的重要性,并提出无需训练和基于训练的方法来更好地利用这些信息,从而在多种LVLMs和视觉推理任务上实现了显著性能提升。
Details
Motivation: 尽管现有研究关注大语言模型内部的注意力汇聚问题,但视觉编码器中哪些视觉令牌对理解与推理最重要以及这些信号如何有效传递到语言模型仍不清楚。本文旨在填补这一空白,聚焦于视觉Transformer中被忽视的高语义价值令牌——ViT注意力汇聚点。 Method: 识别来自Vision Transformer的高范数视觉令牌作为ViT注意力汇聚点,进行定性与定量分析以探究其语义内容及对推理的影响;提出无需训练(如直接增强汇聚点输入)和基于训练的方法(如优化令牌融合机制),以提升LLM对这些关键视觉信息的利用效率。 Result: 实验表明ViT注意力汇聚点包含丰富的高层语义概念,在多个LVLM架构和视觉推理任务上,通过显式利用这些令牌可显著提升模型性能;所提方法在不改变整体架构的前提下有效增强了视觉到语言的信息传递。 Conclusion: ViT注意力汇聚点是影响LVLM视觉理解与推理的关键因素,当前多数架构忽视了其潜力;通过专门设计策略来保留和强化这些高语义令牌,可以显著提升模型表现,为未来LVLM设计提供了新方向。 Abstract: Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.[184] Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression
Nikolaos Stathoulopoulos,Christoforos Kanellakis,George Nikolakopoulos
Main category: cs.CV
TL;DR: 提出一种基于语义场景图的深度压缩框架,用于高效传输3D点云数据,在保持结构和语义保真度的同时实现高达98%的压缩率。
Details
Motivation: 3D点云数据量大且复杂,在带宽受限和连接不稳定的情况下难以高效传输,影响多智能体机器人系统的感知性能。 Method: 将点云分解为语义连贯的块,使用FiLM条件下的语义感知编码器将其编码为紧凑的潜在表示,并采用基于折叠的解码器结合潜在特征和图节点属性进行结构准确的重建。 Result: 在SemanticKITTI和nuScenes数据集上达到最先进的压缩率,数据大小最多减少98%,同时支持多机器人位姿图优化和地图融合等下游任务,性能接近使用原始LiDAR扫描的结果。 Conclusion: 该方法在显著压缩点云数据的同时保留了关键的结构和语义信息,适用于边缘和云环境下的分布式机器人系统。 Abstract: Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.[185] SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks
Md Kowsher,Ali O. Polat,Ehsan Mohammady Ardehaly,Mehrdad Salehi,Zia Ghiasi,Prasanth Murali,Chen Chen
Main category: cs.CV
TL;DR: 本文提出了一种理论框架,解释了为何在预训练模型中微调小的随机子网络(切片)即可实现下游任务适应,并提出了SliceFine方法,在不引入新参数的情况下实现了与现有PEFT方法相当的性能,同时提升了训练速度和内存效率。
Details
Motivation: 为了理解为什么在预训练模型中仅微调少量参数就能有效适应下游任务,并为参数高效微调(PEFT)提供理论基础。 Method: 通过理论分析证明预训练网络具有“通用胜出切片”特性,源于权重矩阵切片的谱平衡和高任务能量,并据此提出SliceFine方法,只更新原始权重中的选定切片。 Result: SliceFine在语言和视觉任务上达到了与最先进的PEFT方法相当的性能,同时显著提高了训练速度、内存效率和模型紧凑性。 Conclusion: 本文为大规模模型中的参数高效微调提供了理论支持,并提出了一个无需新增参数、高效且紧凑的微调方法SliceFine,架起了理论与实践之间的桥梁。 Abstract: This paper presents a theoretical framework explaining why fine tuning small, randomly selected subnetworks (slices) within pre trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy their backbone representations retain rich, task relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter efficient fine tuning (PEFT) in large scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state of the art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.[186] FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
Zhiyuan Zhang,Can Wang,Dongdong Chen,Jing Liao
Main category: cs.CV
TL;DR: 提出FlexTraj框架,实现图像到视频生成中的灵活点轨迹控制,支持多粒度、无需对齐的运动控制。
Details
Motivation: 现有方法在轨迹控制中依赖对齐条件且控制灵活性不足,难以支持复杂应用场景。 Method: 采用统一的基于点的运动表示,结合序列拼接策略和退火训练方法,在生成器中高效注入轨迹条件。 Result: 实验表明该方法在未对齐条件下仍具鲁棒性,支持多种应用如运动克隆、拖拽生成、动作插值等。 Conclusion: FlexTraj实现了高效、强可控且灵活的图像到视频生成,适用于多粒度运动控制任务。 Abstract: We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.[187] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Hongxing Li,Dingming Li,Zixuan Wang,Yuchen Yan,Hang Wu,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
Main category: cs.CV
TL;DR: 本文提出了一种渐进式构建空间智能的方法,通过构建包含26,610个样本的多模态数据集SpatialLadder-26k,并设计三阶段训练框架,显著提升了视觉语言模型在空间推理任务上的性能。
Details
Motivation: 现有方法在学习空间推理时缺乏感知与理解的层次化基础,导致性能受限。 Method: 构建SpatialLadder-26k数据集,并采用三阶段渐进训练框架:首先通过目标定位建立空间感知,然后通过多维空间任务发展空间理解,最后利用可验证奖励的强化学习增强复杂推理能力。 Result: 所提出的3B参数模型SpatialLadder在空间推理基准上平均比基线模型提升23.4%,超越GPT-4o(20.8%)和Gemini-2.0-Flash(10.1%),并在域外基准上实现7.2%的提升。 Conclusion: 从感知到推理的渐进式训练对于实现鲁棒的空间智能至关重要。 Abstract: Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.[188] Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Rishubh Parihar,Or Patashnik,Daniil Ostashev,R. Venkatesh Babu,Daniel Cohen-Or,Kuan-Chieh Wang
Main category: cs.CV
TL;DR: 本文提出Kontinuous Kontext,一种通过引入标量编辑强度实现对图像编辑程度进行连续、精细控制的指令驱动图像编辑模型。
Details
Motivation: 仅依赖文本指令进行图像编辑难以实现对编辑程度的细粒度控制,缺乏从轻微到显著的连续调节能力。 Method: 扩展最先进的图像编辑模型,增加标量编辑强度作为额外输入,并通过轻量级投影网络将该标量与编辑指令映射到模型调制空间的系数中,实现对编辑强度的显式控制。使用现有生成模型合成了包含图像、编辑指令和强度值的四元组数据集,并经过滤保证质量用于训练。 Result: Kontinuous Kontext 能在多种编辑操作(如风格化、属性、材质、背景和形状变化)中实现从无修改到完全编辑之间的平滑过渡,提供统一的细粒度强度控制,且无需针对特定属性进行训练。 Conclusion: 该方法为基于指令的图像编辑提供了新的控制维度,实现了跨多样化编辑任务的连续、可控编辑,提升了用户对编辑结果的精确调控能力。 Abstract: Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.[189] MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
Xiangyu Zhao,Junming Lin,Tianhao Liang,Yifan Zhou,Wenhao Chai,Yuzhe Gu,Weiyun Wang,Kai Chen,Gen Luo,Wenwei Zhang,Junchi Yan,Hua Yang,Haodong Duan,Xue Yang
Main category: cs.CV
TL;DR: 本文提出并构建了MM-HELIX多模态基准,用于评估大语言模型在长链反思性推理上的能力,并通过构建大规模训练数据集MM-HELIX-100K和提出自适应混合策略优化(AHPO)方法,显著提升了模型在复杂任务上的表现和泛化能力。
Details
Motivation: 现有大语言模型在数学与逻辑等推理任务上已有一定表现,但在需要迭代思考与回溯的长链反思性推理方面仍不足。为探索这一能力,作者希望系统评估当前模型的缺陷,并开发有效训练方法以提升其复杂问题解决能力。 Method: 首先设计合成引擎构建包含42个挑战性任务的MM-HELIX基准;然后提出Step-Elicited Response Generation流程生成10万条高质量反思推理轨迹(MM-HELIX-100K)用于指令微调;最后提出AHPO训练策略,融合离线监督与在线优化,克服稀疏奖励与灾难性遗忘问题。 Result: 实验表明现有MLLM在长链反思推理中表现不佳;所提方法在MM-HELIX基准上使Qwen2.5-VL-7B模型准确率提升+18.6%,并在通用数学与逻辑任务上平均提升+5.7%。 Conclusion: 反思性推理可通过适当的数据与训练策略在MLLM中有效学习并泛化,AHPO为提升模型复杂问题解决能力提供了可行路径。 Abstract: While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.[190] VideoNorms: Benchmarking Cultural Awareness of Video Language Models
Nikhil Reddy Varimalla,Yunfei Xu,Arkadiy Saakyan,Meng Fan Wang,Smaranda Muresan
Main category: cs.CV
TL;DR: 本文提出了VideoNorms,一个包含1000多个视频片段与文化规范配对的数据集,用于评估视频大语言模型在中美文化背景下的文化意识。通过人类与AI协作的标注框架构建该基准,并发现现有模型在识别规范违反、中国文化、非语言证据和正式语境方面表现较差。
Details
Motivation: 随着视频大语言模型(VideoLLMs)在全球范围内部署,其需要理解并扎根于特定的文化背景。然而,当前缺乏有效评估模型文化认知能力的基准,因此需要构建一个理论支持、跨文化的评估标准来衡量模型对社会规范的理解能力。 Method: 提出VideoNorms基准数据集,包含来自美国和中国文化的1000多个(视频片段,规范)对,标注内容包括基于言语行为理论的社会文化规范、规范遵守/违反标签以及言语与非言语证据。采用人机协作框架进行标注:由基于理论提示的教师模型生成候选标注,再由训练有素的人类专家验证和修正。在此数据集上对多种开源VideoLLMs进行评测。 Result: 实验结果显示:1)模型在识别规范违反方面表现差于规范遵守;2)对中国文化的理解弱于美国文化;3)提供非语言证据的能力弱于言语证据,且难以准确识别对应言语行为的具体规范;4)与人类不同,模型在正式、非幽默语境下表现更差。 Conclusion: 研究强调了文化扎根的视频语言模型训练的重要性。VideoNorms及其构建框架为填补当前在跨文化视频理解评估方面的空白提供了有效路径,并呼吁未来模型需增强对多元文化社会规范的细粒度理解。 Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violations labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset which highlight several common trends: 1) models performs worse on norm violation than adherence; 2) models perform worse w.r.t Chinese culture compared to the US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal for the norm adhere/violation label and struggle to identify the exact norm corresponding to a speech-act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.[191] ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation
Guanghao Li,Kerui Ren,Linning Xu,Zhewen Zheng,Changjian Jiang,Xin Gao,Bo Dai,Jian Pu,Mulin Yu,Jiangmiao Pang
Main category: cs.CV
TL;DR: 本文提出ARTDECO,一种结合前馈模型效率与SLAM可靠性的统一框架,用于单目图像序列的实时3D重建,通过层次化高斯表示和LoD感知渲染策略,在多个基准上实现了高质量、高保真且交互式的重建效果。
Details
Motivation: 现有方法在计算效率与重建精度之间存在权衡:逐场景优化精度高但耗时,前馈基础模型实时但精度不足。因此需要一种兼顾效率与准确性的新方法。 Method: ARTDECO利用3D基础模型进行姿态估计和点云预测,并结合高斯解码器将多尺度特征转换为结构化3D高斯分布;设计了层次化高斯表示与LoD感知渲染策略以提升保真度并减少冗余。 Result: 在八个室内外基准上实验表明,ARTDECO在重建质量上接近逐场景优化方法,性能媲美SLAM系统,鲁棒性类似前馈模型,支持交互式实时重建。 Conclusion: ARTDECO为真实世界环境的实时数字化提供了一条兼顾几何准确性与视觉高保真的实用路径。 Abstract: On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.[192] Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation
Yunzhe Xu,Yiyuan Pan,Zhe Liu
Main category: cs.CV
TL;DR: 本文提出Memoir,一种基于想象的检索机制,通过显式记忆在视觉-语言导航中实现更有效的记忆访问,结合环境观察与行为模式,在多个基准上显著提升性能。
Details
Motivation: 现有记忆持久型视觉-语言导航方法缺乏有效的记忆访问机制,仅依赖整体记忆或固定视野查找,且主要存储环境观察而忽略蕴含决策策略的行为模式。 Method: 1) 语言条件化世界模型,用于想象未来状态以编码经验并生成检索查询;2) 混合视角级记忆,将观察与行为模式锚定于视角;3) 经验增强导航模型,通过专用编码器整合检索知识。 Result: 在10个测试场景中均取得显著提升,IR2R任务上SPL提高5.4%,训练速度加快8.3倍,推理内存减少74%。 Conclusion: 预测性检索环境与行为记忆可有效提升导航性能,想象引导的范式具有巨大潜力(距上限仍有73.3% vs 93.4%)。 Abstract: Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.[193] VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Minghong Cai,Qiulin Wang,Zongli Ye,Wenze Liu,Quande Liu,Weicai Ye,Xintao Wang,Pengfei Wan,Kun Gai,Xiangyu Yue
Main category: cs.CV
TL;DR: 本文提出了任意时空视频补全任务,通过VideoCanvas框架解决了因果VAE导致的时间模糊问题,实现了对视频生成的精细控制。
Details
Motivation: 现有的可控视频生成任务分散且缺乏统一范式,同时现代潜在视频扩散模型存在时间模糊问题,难以实现精确帧级条件控制。 Method: 提出VideoCanvas框架,采用混合条件策略:空间定位通过零填充处理,时间对齐通过Temporal RoPE插值实现,每个条件在潜在序列中分配连续小数位置,无需新增参数。 Result: 在新构建的VideoCanvasBench基准上实验表明,VideoCanvas显著优于现有条件控制方法,在场景内保真度和跨场景创造性方面均表现优异。 Conclusion: VideoCanvas实现了任意时空视频补全的统一与灵活控制,为可控视频生成提供了新的范式和技术路径。 Abstract: We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.[194] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Andong Deng,Taojiannan Yang,Shoubin Yu,Lincoln Spencer,Mohit Bansal,Chen Chen,Serena Yeung-Levy,Xiaohan Wang
Main category: cs.CV
TL;DR: 本文提出了SciVideoBench,一个用于评估科学场景下复杂视频推理能力的严格基准,涵盖25个以上专业学科的1000个多项选择题,揭示了当前大型多模态模型在高级认知任务上的不足,并为未来多模态AI的发展提供了方向。
Details
Motivation: 现有视频基准主要针对一般场景,侧重感知和识别,推理任务较简单,难以有效评估先进多模态模型在科学领域复杂推理中的表现,因此需要一个更具挑战性的评估基准。 Method: 构建了一个包含1000个多项选择题的高质量基准SciVideoBench,题目来源于前沿科学实验视频,覆盖25个以上专业学科,并通过半自动系统验证;问题设计要求模型具备领域知识、时空感知和逻辑推理能力。 Result: 在多个最先进的专有和开源大型多模态模型(如Gemini 2.5 Pro和Qwen2.5-VL)上的实验表明,这些模型在SciVideoBench上表现不佳,显示出明显的性能缺陷,尤其是在高推理复杂度和视觉定位任务上。 Conclusion: SciVideoBench能有效评估多模态模型在科学视频中的高级认知能力,暴露了当前模型的局限性,为未来提升视频推理能力和开发真正有能力的多模态AI科研助手提供了重要方向。 Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.[195] MultiCOIN: Multi-Modal COntrollable Video INbetweening
Maham Tanveer,Yang Zhou,Simon Niklaus,Ali Mahdavi Amiri,Hao Zhang,Krishna Kumar Singh,Nanxuan Zhao
Main category: cs.CV
TL;DR: 本文提出了一种支持多模态控制的视频中间帧生成框架\modelname{},通过结合Diffusion Transformer架构与点基稀疏表示,实现了对复杂运动的精细控制和高精度插值。
Details
Motivation: 现有视频中间帧生成方法难以处理复杂运动,缺乏对用户意图的灵活支持和中间帧细节的精细控制,限制了其在创意视频编辑中的应用。 Method: 采用Diffusion Transformer(DiT)作为生成模型,将多种运动控制信号统一映射为稀疏的基于点的表示,并将内容控制与运动控制分离为双分支结构,分别编码特征以指导去噪过程;提出分阶段训练策略以提升多模态控制的学习效果。 Result: 实验表明,该方法在定性和定量评估中均优于现有方法,能够生成更动态、可定制且上下文准确的视频序列,支持深度过渡、图层控制、运动轨迹、文本提示等多种输入。 Conclusion: 所提出的多模态控制框架显著提升了视频中间帧生成的灵活性与精确性,为长视频合成和创意编辑提供了更强的可控性与表现力。 Abstract: Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce \modelname{}, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.[196] ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Zhiyu Zheng,Shaoyu Chen,Haoran Yin,Xinbang Zhang,Jialv Zou,Xinggang Wang,Qian Zhang,Lefei Zhang
Main category: cs.CV
TL;DR: 提出ResAD框架,通过归一化残差轨迹建模解决端到端自动驾驶中轨迹数据的时空不平衡问题,提升模型因果推理能力和短期安全性。
Details
Motivation: 端到端自动驾驶系统因轨迹数据的时空不平衡而倾向于学习虚假相关性,忽视因果推理,并过度关注不确定的远期预测,影响即时安全。 Method: 将预测任务重构为对确定性惯性参考路径的残差偏差预测,并引入点级归一化来重加权优化目标,缓解长期不确定性带来的优化失衡。 Result: 在NAVSIM基准上,使用仅两步去噪的普通扩散策略即达到88.6的PDMS,性能领先。 Conclusion: ResAD有效简化了学习任务,增强了模型对因果因素的识别能力,提升了预测安全性与整体性能。 Abstract: End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.[197] NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian,Hao Li,Gen Luo,Xizhou Zhu,Weijie Su,Hanming Deng,Jinguo Zhu,Jie Shao,Ziran Zhu,Yunpeng Liu,Lewei Lu,Wenhai Wang,Hongsheng Li,Jifeng Dai
Main category: cs.CV
TL;DR: 本文研究了多模态大语言模型(MLLM)的原生端到端训练范式,提出了一种在数据受限条件下性能与训练成本平衡的最优元架构,并发现视觉编码器与语言模型之间存在正相关的扩展关系,基于此提出了名为NaViL的原生MLLM,其在14个多模态基准上表现出竞争力。
Details
Motivation: 现有MLLM采用组合式训练,难以探索其多模态扩展性,因此需要研究端到端原生训练范式的可扩展性与设计空间。 Method: 系统研究了MLLM在端到端训练下的各种设计选择,确定最优元架构,并分析视觉编码器与LLM之间的扩展关系。 Result: 提出了NaViL模型,在14个多模态基准上表现优异,验证了原生训练范式的有效性及其正向扩展特性。 Conclusion: 原生端到端训练MLLM具有良好的扩展性和性能潜力,为未来MLLM研究提供了重要启示。 Abstract: Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.[198] D$^2$GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
Meixi Song,Xin Lin,Dizhe Zhang,Haodong Li,Xiangtai Li,Bo Du,Lu Qi
Main category: cs.CV
TL;DR: 本文提出了一种针对稀疏视角下3D高斯点阵(3DGS)性能下降和不稳定问题的统一框架D²GS,通过密度与深度引导的Dropout策略和距离感知保真增强模块,有效缓解了过拟合与欠拟合问题,并引入新指标评估高斯分布的稳定性。
Details
Motivation: 在稀疏视角条件下,现有3D高斯点阵方法存在近处高斯密度过高导致过拟合、远处覆盖不足导致欠拟合的问题,影响重建质量与稳定性。 Method: 提出D²GS框架,包含两个核心组件:1)基于深度与密度引导的Dropout策略,自适应剔除冗余高斯;2)距离感知保真增强模块,对远场区域进行有针对性监督以提升重建质量。同时设计新评估指标衡量高斯分布稳定性。 Result: 在多个数据集上的实验表明,该方法显著提升了稀疏视角下的视觉质量和重建鲁棒性,相比基线方法在图像保真度和稳定性方面均有改善。 Conclusion: D²GS有效解决了稀疏视角下3DGS的过拟合与欠拟合问题,增强了模型稳定性和泛化能力,为低输入量场景下的高质量视图合成提供了可行方案。 Abstract: Recent advances in 3D Gaussian Splatting (3DGS) enable real-time, high-fidelity novel view synthesis (NVS) with explicit 3D representations. However, performance degradation and instability remain significant under sparse-view conditions. In this work, we identify two key failure modes under sparse-view conditions: overfitting in regions with excessive Gaussian density near the camera, and underfitting in distant areas with insufficient Gaussian coverage. To address these challenges, we propose a unified framework D$^2$GS, comprising two key components: a Depth-and-Density Guided Dropout strategy that suppresses overfitting by adaptively masking redundant Gaussians based on density and depth, and a Distance-Aware Fidelity Enhancement module that improves reconstruction quality in under-fitted far-field areas through targeted supervision. Moreover, we introduce a new evaluation metric to quantify the stability of learned Gaussian distributions, providing insights into the robustness of the sparse-view 3DGS. Extensive experiments on multiple datasets demonstrate that our method significantly improves both visual quality and robustness under sparse view conditions. The project page can be found at: https://insta360-research-team.github.io/DDGS-website/.[199] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Tajamul Ashraf,Umair Nawaz,Abdelrahman M. Shaker,Rao Anwer,Philip Torr,Fahad Shahbaz Khan,Salman Khan
Main category: cs.CV
TL;DR: 提出一种视觉为中心的代理微调框架,自动生成多模态轨迹和偏好对,提升视觉语言模型在复杂推理和工具使用中的性能。
Details
Motivation: 现有视觉语言模型在作为控制器进行复杂推理和决策时,受限于高质量多模态轨迹数据的稀缺和人工标注的高成本。 Method: 构建M-TRACE大规模多模态任务数据集,并基于其进行轨迹模仿训练;进一步生成11K自动偏好对(Pref-X),通过逐步偏好学习优化MATRIX Agent。 Result: 在Agent-X、GTA和GAIA三个基准上,MATRIX均优于开源和闭源VLM,在多模态工具使用方面表现出更强的可扩展性和有效性。 Conclusion: 该框架通过自动合成数据和细粒度对齐训练,显著提升了VLM在工具使用场景下的推理能力,验证了自动生成数据用于代理训练的可行性。 Abstract: Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code is avaliable at https://github.com/mbzuai-oryx/MATRIX.[200] ReSplat: Learning Recurrent Gaussian Splats
Haofei Xu,Daniel Barath,Andreas Geiger,Marc Pollefeys
Main category: cs.CV
TL;DR: 提出ReSplat,一种前馈循环高斯点阵模型,通过渲染误差反馈迭代优化3D高斯分布,无需显式计算梯度,在减少高斯数量和提升渲染速度的同时实现最先进的性能。