Table of Contents
cs.CL
[1] Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
Eduard Kapelko
Main category: cs.CL
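TL;DR: This study uses "cyclic ablation" to test whether deceptive behavior in large language models can be locally removed. It finds that deception is highly resilient and entangled with the model's core capabilities, which limits direct model editing.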
Details
Motivation: To examine whether undesirable behaviors in large language models, such as deception, are localized functions that can be removed or are deeply entangled with core cognitive abilities.
Method: Combines sparse autoencoders, targeted ablation, and adversarial training in iterative "cyclic ablation" experiments on DistilGPT-2, attempting to eliminate the concept of deception.
Result: After each ablation, the model recovered its deceptive behavior through adversarial training (functional regeneration), and every ablation degraded linguistic performance (perplexity rose).
Conclusion: Complex concepts are distributed and entangled within the model, so model editing that relies solely on mechanistic interpretability has fundamental limits.
Abstract: Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.
[2] From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation
Viacheslav Yusupov,Danil Maksimov,Ameliia Alaeva,Anna Vasileva,Anna Antipina,Tatyana Zaitseva,Alina Ermilova,Evgeny Burnaev,Egor Shvetsov
Main category: cs.CL
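TL;DR: This paper bridges internal and external analysis approaches by showing that geometric properties of LLM internal representations can serve as reliable proxies for the quality of generated text.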
Details
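Motivation: Existing text quality evaluation methods depend on human-annotated data or external references and lack generality and practicality, so an automated evaluation approach that needs neither reference texts nor human annotation is required.
Method: Proposes and validates a set of metrics (Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms), measures these geometric properties across LLM layers, and analyzes their relationship to generated text quality.
Result: Intrinsic Dimensionality and Effective Rank can serve as universal measures of text naturalness and quality; different models rank texts consistently based on these geometric properties, indicating that the metrics reflect inherent text characteristics rather than model-specific artifacts.
Conclusion: Geometric properties, especially Intrinsic Dimensionality and Effective Rank, can serve as reference-free text quality evaluation tools, offering a practical and general solution for automated evaluation.
Abstract: This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms measured across different layers of LLMs, demonstrating that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding reveals that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that these metrics reflect inherent text characteristics rather than model-specific artifacts. This allows a reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
[3] Generative Value Conflicts Reveal LLM Priorities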
Andy Liu,Kshitish Ghate,Mona Diab,Daniel Fried,Atoosa Kasirzadeh,Max Kleiman-Weiner
Main category: cs.CL
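TL;DR: This paper introduces ConflictScope, an automated framework for evaluating how large language models prioritize competing values. In open-ended settings, models lean toward personal values over protective values, and adding detailed value orderings to the system prompt improves alignment with a target ranking by 14%.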
Details
Motivation: Existing alignment datasets contain few value-conflict scenarios, so they poorly reflect the value tradeoffs language models face in deployment; a systematic method is needed to evaluate model decisions under value conflict.
Method: Proposes the automatic ConflictScope pipeline: it generates scenarios involving a conflict between two values from a user-defined value set, uses an LLM to write user prompts, evaluates target model responses in both multiple-choice and open-ended formats, and derives the model's priority ranking over values.
Result: In open-ended evaluation, models shift from supporting protective values (such as harmlessness) toward supporting personal values (such as user autonomy); adding detailed value orderings to the system prompt improves alignment with a target ranking by 14%.
Conclusion: Evaluating how models prioritize values under conflict is essential. ConflictScope provides a foundation for future research and shows that system prompts can steer model behavior toward a target ranking with moderate success.
Abstract: Past work seeks to align large language model (LLM)-based assistants with a target set of values, but such assistants are frequently forced to make tradeoffs between values when deployed. In response to the scarcity of value conflict in existing alignment datasets, we introduce ConflictScope, an automatic pipeline to evaluate how LLMs prioritize different values. Given a user-defined value set, ConflictScope automatically generates scenarios in which a language model faces a conflict between two values sampled from the set. It then prompts target models with an LLM-written "user prompt" and evaluates their free-text responses to elicit a ranking over values in the value set. Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended value conflict settings. However, including detailed value orderings in models' system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict. Our work demonstrates the importance of evaluating value prioritization in models and provides a foundation for future work in this area.
[4] From Faithfulness to Correctness: Generative Reward Models that Think Critically
Qiyao Ma,Yunsheng Shi,Hongtao Tian,Chao Wang,Weiming Chang,Ting Yao
Main category: cs.CL
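TL;DR: Proposes the Thinking-supervised Reward Model (TRM), a reward model with sentence-level thinking supervision that combines faithfulness, reasoning, and correctness evaluation to better judge answer correctness and usefulness in open-domain question answering.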
Details
Motivation: Existing reinforcement learning with verifiable rewards struggles to assess answer correctness on complex tasks such as open-domain question answering, and overemphasizing semantic alignment with supporting documents (faithfulness) can leave models without the ability to critically assess external and internal knowledge.
Method: Designs the Thinking-supervised Reward Model (TRM), which first assesses the faithfulness of each answer sentence to the supporting documents and then adds a reasoning step to judge sentence-level correctness, decomposing reward modeling into three stages: faithfulness, reasoning, and correctness.
Result: Experiments show that TRM substantially outperforms existing methods at identifying incorrect sentences, and using it for policy optimization significantly improves answer correctness and usefulness.
Conclusion: By introducing sentence-level thinking supervision, TRM gives reward models critical thinking ability, balances reliance on external documents against the use of internal knowledge, and improves language model performance on complex tasks.
Abstract: Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges due to the difficulty of verifying correctness. The nuanced and ambiguous nature of real-world knowledge makes it difficult to reliably evaluate correctness in these settings, necessitating further abilities that extend beyond mere logical consistency to encompass an understanding and assessment of both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge.
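The faithfulness, reasoning, and correctness sequence described for TRM can be pictured as a per-sentence pipeline. The sketch below is only an illustration under stated assumptions: `SentenceJudgment`, `judge_answer`, and the three `toy_*` callables are hypothetical names, and the string-matching stand-ins replace what would be trained model calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SentenceJudgment:
    sentence: str
    faithful: bool    # is the sentence supported by the documents?
    reasoning: str    # rationale produced before the final verdict
    correct: bool     # sentence-level correctness verdict


def judge_answer(
    sentences: List[str],
    documents: List[str],
    is_faithful: Callable[[str, List[str]], bool],
    reason: Callable[[str, bool], str],
    is_correct: Callable[[str, bool, str], bool],
) -> List[SentenceJudgment]:
    """Run the three stages in order for every answer sentence."""
    judgments = []
    for sentence in sentences:
        faithful = is_faithful(sentence, documents)
        rationale = reason(sentence, faithful)
        correct = is_correct(sentence, faithful, rationale)
        judgments.append(SentenceJudgment(sentence, faithful, rationale, correct))
    return judgments


# Toy stand-ins (hypothetical): a real system would call trained models here.
documents = ["The Eiffel Tower is in Berlin."]    # a wrong supporting document
known_facts = {"The Eiffel Tower is in Paris."}   # stand-in for internal knowledge


def toy_is_faithful(sentence, docs):
    return any(sentence in doc for doc in docs)


def toy_reason(sentence, faithful):
    source = "the documents" if faithful else "no document"
    return f"Supported by {source}; checked against internal knowledge."


def toy_is_correct(sentence, faithful, rationale):
    return sentence in known_facts


judgments = judge_answer(
    ["The Eiffel Tower is in Berlin.", "The Eiffel Tower is in Paris."],
    documents, toy_is_faithful, toy_reason, toy_is_correct,
)
for j in judgments:
    print(j.faithful, j.correct, "-", j.sentence)
```

In the toy data the first sentence is faithful to the (wrong) document but incorrect, while the second is unsupported by the document but correct, which is the gap between faithfulness and correctness that the abstract points to.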
Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization leads to significant gains in both answer correctness and usefulness.
[5] Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization
Jiacheng Shi,Hongfei Du,Yangfan He,Y. Alicia Hong,Ye Gao
Main category: cs.CL
TL;DR: Proposes Emotion-Aware Stepwise Preference Optimization (EASPO), a post-training framework that aligns diffusion TTS with fine-grained emotional preferences at intermediate denoising steps.
Details
Motivation: Existing emotional TTS methods rely on coarse labels or proxy classifiers and receive only utterance-level feedback, which makes fine control of emotional expression difficult.
Method: Introduces EASPM, a time-conditioned model that scores noisy intermediate speech states and automatically constructs preference pairs; EASPO then optimizes generation against these stepwise preferences to enable controllable emotional shaping.
Result: Experiments show that EASPO outperforms existing methods in both expressiveness and naturalness.
Conclusion: Through fine-grained, step-level emotional preference optimization, EASPO improves the expressiveness and naturalness of emotional TTS.
Abstract: Emotional text-to-speech seeks to convey affect while preserving intelligibility and prosody, yet existing methods rely on coarse labels or proxy classifiers and receive only utterance-level feedback. We introduce Emotion-Aware Stepwise Preference Optimization (EASPO), a post-training framework that aligns diffusion TTS with fine-grained emotional preferences at intermediate denoising steps. Central to our approach is EASPM, a time-conditioned model that scores noisy intermediate speech states and enables automatic preference pair construction. EASPO optimizes generation to match these stepwise preferences, enabling controllable emotional shaping. Experiments show superior performance over existing methods in both expressiveness and naturalness.
[6] SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA
Haozhou Xu,Dongxia Wu,Matteo Chinazzi,Ruijia Niu,Rose Yu,Yi-An Ma
Main category: cs.CL
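TL;DR: This paper proposes SimulRAG, a retrieval-augmented generation framework built on scientific simulators that improves the factual accuracy of large language models in long-form scientific question answering, and validates it on benchmarks in climate science and epidemiology.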
Details
Motivation: Large language models are prone to hallucination in long-form scientific QA, and conventional retrieval-augmented methods cannot be applied directly when scientific simulators are the knowledge source, so a new framework that can retrieve and verify simulation results is needed.
Method: Proposes the SimulRAG framework, with a generalized simulator retrieval interface that converts between textual and numerical modalities and a claim-level generation method based on uncertainty estimation and simulator boundary assessment (UE+SBA) to verify and update answers efficiently.
Result: Experiments show that SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality, and that UE+SBA further improves generation efficiency and quality.
Conclusion: By integrating scientific simulators as a reliable knowledge source, SimulRAG improves the trustworthiness and accuracy of large language models in complex scientific QA and offers an effective way to reduce hallucination.
Abstract: Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers for scientific questions, which are crucial for comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by incorporating external knowledge sources to improve trustworthiness. In this context, scientific simulators, which play a vital role in validating hypotheses, offer a particularly promising retrieval source to mitigate hallucination and enhance answer factuality. However, existing RAG approaches cannot be directly applied for scientific simulation-based retrieval due to two fundamental challenges: how to retrieve from scientific simulators, and how to efficiently verify and update long-form answers. To overcome these challenges, we propose the simulator-based RAG framework (SimulRAG) and provide a long-form scientific QA benchmark covering climate science and epidemiology with ground truth verified by both simulations and human annotators. In this framework, we propose a generalized simulator retrieval interface to transform between textual and numerical modalities. We further design a claim-level generation method that utilizes uncertainty estimation scores and simulator boundary assessment (UE+SBA) to efficiently verify and update claims.
Extensive experiments demonstrate SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality. UE+SBA further improves efficiency and quality for claim-level generation.
[7] The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)
Tadesse Destaw Belay,Kedir Yassin Hussen,Sukairaj Hafiz Imam,Iqra Ameer,Ibrahim Said Ahmad,Isa Inuwa-Dutse,Idris Abdulmumin,Grigori Sidorov,Vukosi Marivate,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad
Main category: cs.CL
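TL;DR: This study analyzes 1.9K NLP paper abstracts, 4.9K authors, and 7.8K human-annotated contribution sentences to examine the trends, research contributions, and participating individuals and institutions of African NLP (AfricaNLP) over the past two decades.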
Details
Motivation: Tracking the progress of NLP research and automatically analyzing paper contributions helps in understanding how the field develops and what roles researchers play, with a particular focus on the current state and potential of African NLP.
Method: Uses quantitative analysis based on a large dataset of paper abstracts and human-annotated contribution sentences (AfricaNLPContributions), combined with author information and institutional funding data, to systematically answer questions about the evolution, contributions, and participants of African NLP research.
Result: Reveals trends in publication volume, topic evolution, author distribution, and funding support for African NLP, and builds a continuously updatable NLP progress tracking website and dataset that support data-driven literature survey generation.
Conclusion: African NLP is steadily developing and its research contributions are increasingly prominent; data-driven approaches can further increase the field's visibility and collaboration.
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) basic research questions such as: i) How has the nature of NLP evolved over the last two decades?, ii) What are the contributions of AfricaNLP papers?, and iii) Which individuals and organizations (authors, affiliated institutions, and funding bodies) have been involved in the development of AfricaNLP? We quantitatively examine the contributions of AfricaNLP research using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) along with benchmark results. Our dataset and continuously existing NLP progress tracking website provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven literature surveys.
[8] Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries
Nick Hagar,Wilma Agustianto,Nicholas Diakopoulos
Main category: cs.CL
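TL;DR: The study evaluates hallucination in ChatGPT, Gemini, and NotebookLM on a news reporting task and finds that 30% of model outputs contain hallucinations, with rates for Gemini and ChatGPT about three times that of NotebookLM. It identifies a fundamental epistemological mismatch between large language models and journalism over sourcing, and proposes journalism-specific extensions to hallucination taxonomies.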
Details
Motivation: Large language models are increasingly used in newsroom workflows, but their hallucinations threaten core journalistic practices of sourcing, attribution, and accuracy, so the hallucination behavior of mainstream tools needs to be evaluated in realistic reporting scenarios.
Method: Designs reporting-style tasks on a corpus of 300 documents about TikTok litigation and policy to test ChatGPT, Gemini, and NotebookLM; varies prompt specificity and context size, and annotates sentence-level outputs with a taxonomy to measure hallucination type and severity.
Result: 30% of model outputs contain at least one hallucination; the rate for Gemini and ChatGPT is about 40%, well above NotebookLM's 13%. Most errors are not invented entities or numbers but interpretive overconfidence, in which attributed opinions are turned into unsupported general statements.
Conclusion: The tendency of large language models to generate authoritative-sounding text conflicts at an epistemological level with journalism's requirement that every claim be explicitly sourced. To protect accuracy, newsroom tools need architectures that enforce accurate attribution rather than optimize only for fluency, and the paper proposes journalism-specific extensions to hallucination taxonomies.
Abstract: Large language models (LLMs) are increasingly used in newsroom workflows, but their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools - ChatGPT, Gemini, and NotebookLM - on a reporting-style task grounded in a 300-document corpus related to TikTok litigation and policy in the U.S. We vary prompt specificity and context size and annotate sentence-level outputs using a taxonomy to measure hallucination type and severity. Across our sample, 30% of model outputs contained at least one hallucination, with rates approximately three times higher for Gemini and ChatGPT (40%) than for NotebookLM (13%). Qualitatively, most errors did not involve invented entities or numbers; instead, we observed interpretive overconfidence - models added unsupported characterizations of sources and transformed attributed opinions into general statements. These patterns reveal a fundamental epistemological mismatch: While journalism requires explicit sourcing for every claim, LLMs generate authoritative-sounding text regardless of evidentiary support. We propose journalism-specific extensions to existing hallucination taxonomies and argue that effective newsroom tools need architectures that enforce accurate attribution rather than optimize for fluency.
[9] Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels
Siyu Liang,Nicolas Ballier,Gina-Anne Levow,Richard Wright
Main category: cs.CL
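TL;DR: This paper presents a fine-grained analysis of Whisper's multilingual decoder and reveals systematic decoding differences in sub-token hypotheses across languages with different resource levels.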
Details
Motivation: Although large multilingual automatic speech recognition models perform well, the mechanisms behind their fairness and efficacy across languages remain unclear and call for closer investigation.
Method: Traces the beam search path, captures sub-token guesses and their probabilities, and analyzes sub-token usage patterns with PCA and t-SNE.
Result: Higher-resource languages do better on correct-token rank, confidence, predictive entropy, and candidate diversity; lower-resource languages do worse and show sub-token clustering patterns that are sometimes influenced by linguistic typology.
Conclusion: Sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points toward ways to reduce the imbalanced development of speech technology.
Abstract: While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper's multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher resource languages benefit from higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage sometimes influenced by typology in our PCA and t-SNE analysis. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
[10] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Huu Nguyen,Victor May,Harsh Raj,Marianna Nezhurina,Yishan Wang,Yanqi Luo,Minh Chien Vu,Taishi Nakamura,Ken Tsui,Van Khue Nguyen,David Salinas,Aleksandra Krasnodębska,Christoph Schuhmann,Mats Leon Richter,Xuan-Son Vu,Jenia Jitsev
Main category: cs.CL
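TL;DR: MixtureVitae is an open pretraining corpus that follows a risk-mitigated sourcing strategy, reducing legal risk while delivering strong model performance, with particularly strong results on math/code and competitive results on QA tasks.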
Details
Motivation: To address the legal risk that indiscriminate web scraping brings to large language model training by building a legally sound, reproducible, and high-performing open pretraining dataset.
Method: Combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with low-risk additions (e.g., government works and EU TDM-eligible sources), along with instruction, reasoning, and synthetic data; a multi-stage pipeline performs license-aware filtering, safety and quality screening, and domain-aware mixing.
Result: In training settings from 130M to 1.7B parameters and 50B to 300B tokens, models trained on MixtureVitae outperform other permissive datasets on multiple benchmarks; at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM, with particularly strong results on math/code tasks.
Conclusion: A permissive-first, risk-mitigated data strategy can provide a legally sound and effective foundation for training capable large models, reducing reliance on indiscriminate web scraping without sacrificing competitiveness.
Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
[11] Calibrating Verbalized Confidence with Self-Generated Distractors
Victor Wang,Elias Stengel-Eskin
Main category: cs.CL
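TL;DR: This paper proposes DINCO, a method for improving the calibration of confidence scores verbalized by large language models (LLMs). By normalizing confidence across self-generated distractors and exploiting generator-validator disagreement, DINCO makes confidence estimates more reliable and more usable.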
Details
Motivation: Large language models are often overconfident in their outputs, especially when they lack the relevant knowledge, which miscalibrates confidence scores and harms user trust and safety. This paper aims to address that calibration problem.
Method: Proposes Distractor-Normalized Coherence (DINCO), which has the model verbalize its confidence independently across several self-generated distractors and normalizes the result, uses generator-validator disagreement to improve calibration, and integrates self-consistency with coherence across validations.
Result: Experiments show that DINCO improves confidence calibration over baseline methods, outperforms self-consistency at 100 inference calls while using only 10, and provides less saturated, more usable confidence estimates.
Conclusion: DINCO mitigates LLM overconfidence by modeling the model's suggestibility bias and multiple dimensions of coherence, offering a new route to trustworthy confidence estimation for LLM outputs.
Abstract: Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM's suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO.
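The normalization step described above can be sketched in a few lines. This is a minimal illustration, assuming the verbalized confidences have already been collected as numbers in [0, 1] and that the total includes the main claim; the function names `normalized_confidence` and `self_consistency` are hypothetical. How DINCO combines the two estimates is not stated here, so the sketch keeps them separate.

```python
from collections import Counter


def normalized_confidence(claim_conf, distractor_confs):
    """Divide the confidence verbalized for the main claim by the total
    confidence verbalized across the claim and its distractors."""
    total = claim_conf + sum(distractor_confs)
    if total == 0:
        return 0.0  # assumption: no confidence was expressed for any claim
    return claim_conf / total


def self_consistency(sampled_answers, answer):
    """Fraction of sampled generations that agree with `answer`."""
    if not sampled_answers:
        return 0.0
    return Counter(sampled_answers)[answer] / len(sampled_answers)


# Toy numbers (not from the paper): the model says 0.9 for the main claim
# but also 0.6 and 0.3 for two incompatible distractors.
validator_conf = normalized_confidence(0.9, [0.6, 0.3])
generator_conf = self_consistency(["Paris", "Paris", "Lyon", "Paris"], "Paris")
print(validator_conf, generator_conf)
```

With these toy numbers, a raw confidence of 0.9 shrinks to about 0.5 once the confidence given to incompatible distractors is taken into account, which is the kind of less saturated estimate the abstract describes.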
Moreover, our analysis shows that DINCO provides less saturated -- and therefore more usable -- confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
[12] Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning
Zhiling Ye,Yun Yue,Haowen Wang,Xudong Han,Jiadi Jiang,Cheng Wei,Lei Fan,Jiaxin Liang,Shuowen Zhang,Ji Li,Chunxiao Guo,Jian Wang,Peng Wei,Jinjie Gu
Main category: cs.CL
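TL;DR: Proposes a self-rewarding, rubric-based reinforcement learning framework for open-ended reasoning tasks that substantially improves the reasoning performance of large language models while making training more efficient.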
Details
Motivation: Open-ended evaluation is essential for large language models in real-world use; the work is motivated by the observation that using the model itself as a grader improves reasoning ability.
Method: Introduces a self-rewarding rubric-based reinforcement learning framework that uses the model itself to generate reward signals, optionally supplemented with a small amount of teacher-graded data.
Result: On Qwen3-32B, training on only the 4000-sample HealthBench Easy subset is enough to exceed GPT-5 on HealthBench Hard; adding a small amount of teacher-graded data further improves less capable models.
Conclusion: This lightweight framework improves large model performance on open-ended reasoning tasks more efficiently and has strong practical potential.
Abstract: Open-ended evaluation is essential for deploying large language models in real-world settings. In studying HealthBench, we observe that using the model itself as a grader and generating rubric-based reward signals substantially improves reasoning performance. Remarkably, the trained model also becomes a stronger grader. Motivated by this, we introduce Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, a lightweight framework that enables faster and more resource-efficient training while surpassing baselines. Remarkably, on Qwen3-32B, training with just the 4000-sample HealthBench Easy subset is sufficient to obtain a model that exceeds GPT-5 on HealthBench Hard. Incorporating a small amount of teacher-graded data further enhances performance for less capable models.
[13] Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model
Fahim Faisal,Kaiqiang Song,Song Wang,Simin Ma,Shujian Liu,Haoyun Deng,Sathish Reddy Indurthi
Main category: cs.CL
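TL;DR: This paper proposes a reinforcement learning framework based on an English pivot model and semantically verifiable rewards (PB-RLSVR) to improve large language model performance on multilingual reasoning tasks without human-annotated data in the target languages.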
Details
Motivation: Existing reinforcement learning methods for improving LLM reasoning are largely limited to English, leaving a significant performance gap across languages, so a method that effectively transfers English reasoning ability to other languages is needed.
Method: Uses a high-performing English LLM as a "pivot" to generate reference responses and rewards a multilingual model for cross-lingual semantic equivalence, using semantic reward functions based on embeddings and machine translation, to transfer reasoning ability without labeled data in the target languages.
Result: On several multilingual reasoning benchmarks, PB-RLSVR significantly narrows the performance gap between English and other languages and clearly improves over traditional PPO baselines; the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B improves by 16.41% and 10.17%, respectively.
Conclusion: PB-RLSVR is an efficient, annotation-free method for enhancing multilingual reasoning that transfers English reasoning ability to other languages and advances the development of truly multilingual reasoning agents.
Abstract: While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
[14] Performance and competence intertwined: A computational model of the Null Subject stage in English-speaking children
Soumik Dey,William Gregory Sakas
Main category: cs.CL
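TL;DR: This study proposes a new computational parameter to measure children's misinterpretation of imperative versus declarative sentences during language acquisition, and simulates the learning of obligatory subject grammar with a modified Variational Learner. The simulations support Orfitelli and Hyams' hypothesis of a temporary null subject grammar and provide a framework for integrating computational models into the study of grammatical acquisition.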
Details
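Motivation: Children frequently omit subjects until about age 4 (the null subject stage) and tend to confuse imperative with declarative sentences; this paper investigates the mechanism behind this phenomenon and tests whether a temporary null subject grammar exists.
Method: Introduces a new computational parameter that quantifies children's misinterpretation of sentence type and incorporates it into a modified Variational Learner suited to superset-subset languages for simulation experiments.
Result: The simulations support Orfitelli and Hyams' hypothesis that performance factors lead children to misinterpret utterances, which in turn promotes a temporary null subject grammar; they also demonstrate the potential of computational models in grammatical acquisition research.
Conclusion: The null subject phenomenon in child language may stem from misinterpretation of the input rather than from grammatical omission alone; the study offers a workable framework for combining computational models with language development research.
Abstract: The empirically established null subject (NS) stage, lasting until about 4 years of age, involves frequent omission of subjects by children. Orfitelli and Hyams (2012) observe that young English speakers often confuse imperative NS utterances with declarative ones due to performance influences, promoting a temporary null subject grammar. We propose a new computational parameter to measure this misinterpretation and incorporate it into a simulated model of obligatory subject grammar learning. Using a modified version of the Variational Learner (Yang, 2012) which works for superset-subset languages, our simulations support Orfitelli and Hyams' hypothesis. More generally, this study outlines a framework for integrating computational models in the study of grammatical acquisition alongside other key developmental factors.
[15] Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation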
Colten DiIanni,Daniel Deutsch
Main category: cs.CL
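TL;DR: This paper proposes Pairwise Difference Pearson (PDP), a new segment-level meta-evaluation metric for machine translation that uses pairwise differences instead of raw scores, improving on earlier approaches based on Pearson correlation and Kendall's tau.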
Details
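Motivation: Existing meta-evaluation methods are limited in how they handle score distributions and biases, and are unstable in the presence of noise or systematic bias, so a metric that is more robust and better reflects human evaluation preferences is needed.
Method: PDP is correlation-based and computed from the pairwise score differences of all segments, refining Global Pearson from cross-segment comparisons to intra-segment comparisons and thereby modeling score agreement at a finer grain.
Result: Experiments on the WMT'24 shared task show that PDP properly ranks sentinel evaluation metrics and aligns better with human error weightings than previous methods; noise injection analysis shows it is robust to random noise, segment bias, and system bias but sensitive to extreme outliers.
Conclusion: PDP is a more reliable meta-evaluation metric that agrees better with human judgment and is particularly suited to assessing automatic MT evaluation metrics.
Abstract: This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $\rho$-based and Kendall's $\tau$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine Global Pearson to intra-segment score comparisons. Analysis on the WMT'24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than previous work. Noise injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias while highlighting its sensitivity to extreme outliers.
[16] Probing the Limits of Stylistic Alignment in Vision-Language Models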
Asma Farajidizaji,Akash Gupta,Vatsal Raina
Main category: cs.CL
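TL;DR: Studies the data efficiency of aligning small vision-language models to humorous and romantic styles, exploring how well styles can be aligned with minimal preference data.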
Details
Motivation: Vision-language models struggle to generate image captions in a specific style in a zero-shot setting, and preference data is expensive to acquire, which limits exploration of model capabilities.
Method: Studies how well small vision-language models align to humorous and romantic styles and evaluates performance with different amounts of preference data.
Result: Determines the minimum amount of preference data needed to reach stylistic saturation and establishes the performance limits of these models on stylized image captioning.
Conclusion: The approach shows that small vision-language models can still be aligned to specific styles with little preference data, providing a benchmark for model capability and data efficiency.
Abstract: Vision-language models are increasingly used to generate image captions in specific styles, such as humorous or romantic. However, these transformer-based models often struggle with this subjective task in a zero-shot setting. While preference data can be used to align them toward a desired style, such data is expensive to acquire, limiting the ability to explore the models' full capabilities. This work addresses this by studying the data efficiency of aligning small vision-language models to humorous and romantic styles. This approach helps to define the performance limits of these models and determine how little preference data is needed to achieve stylistic saturation, benchmarking their capabilities and limitations.
[17] RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance
Tianlang Chen,Minkai Xu,Jure Leskovec,Stefano Ermon
Main category: cs.CL
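TL;DR: This paper proposes RFG, a reasoning guidance method that requires no explicit process reward: it parameterizes the process reward by the log-likelihood ratio between an enhanced model and a reference model, and significantly improves dLLM performance on mathematical reasoning and code generation tasks.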
Details
Motivation: Autoregressive language models typically rely on process reward models trained with dense annotation of intermediate steps, but diffusion language models (dLLMs) generate in any order and their intermediate states are partially masked, which makes such methods hard to apply; a reward-free guidance mechanism suited to dLLMs is therefore needed.
Method: Proposes Reward-Free Guidance (RFG), which uses the log-likelihood ratio between an enhanced dLLM (post-trained with reinforcement learning or supervised fine-tuning) and the original model as an implicit process reward to guide the reasoning trajectory, without training an additional reward model.
Result: Experiments on four mathematical reasoning and code generation benchmarks show that RFG significantly improves performance across a range of dLLMs and post-training methods, with accuracy gains of up to 9.2%.
Conclusion: RFG is a general, training-free framework for scaling test-time reasoning that guides the step-by-step reasoning of dLLMs without relying on external reward models.
Abstract: Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models typically learns a process reward model with dense annotation for each intermediate step. However, this is challenging for dLLMs where the generation is in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.
[18] Transformers through the lens of support-preserving maps between measures
Takashi Furuya,Maarten V. de Hoop,Matti Lassas
Main category: cs.CL
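TL;DR: This paper studies the ability of transformers to handle an arbitrary number of context tokens by modeling neural networks as maps on probability measures and analyzing their expressivity. It fully characterizes the maps between measures that can be represented as in-context maps via a push forward, and proves that transformers universally approximate representations with any continuous in-context map. It further shows that the solution map of the Vlasov equation satisfies these conditions and can be approximated by a transformer, and that the infinite-depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.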
Details
Motivation: To analyze mathematically and uniformly the expressivity of the transformer architecture over large contexts, in particular by modeling it as a map on probability measures, with applications to Wasserstein regularity, generalization bounds, and mean-field limit analysis.
Method: Models the context as a probability distribution and studies the properties of maps between measures, characterizing the maps realized by transformers through the push forward and properties of the Fréchet derivative, and combines this with the nonlocal transport character of the Vlasov equation.
Result: 1) A full characterization of the maps between measures that can be represented as in-context maps; 2) transformers universally approximate representations with any continuous in-context map; 3) the solution map of the Vlasov equation satisfies the conditions for this class of maps and can therefore be approximated by a transformer; 4) the infinite-depth measure-theoretic transformer can be identified with a Vlasov flow.
Conclusion: Transformers can model complex contextual relationships, and their structure has strong expressive power and theoretical interpretability on measure spaces, with a deep connection to dynamical systems such as the Vlasov equation, offering a new mathematical perspective on deep models.
Abstract: Transformers are deep architectures that define "in-context maps" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In previous work, we studied the ability of these architectures to handle an arbitrarily large number of context tokens. To analyze their expressivity mathematically and uniformly, we considered the case that the mappings are conditioned on a context represented by a probability distribution which becomes discrete for a finite number of tokens. Modeling neural networks as maps on probability measures has multiple applications, such as studying Wasserstein regularity, proving generalization bounds and doing a mean-field limit analysis of the dynamics of interacting particles as they go through the network. In this work, we study which maps between measures are transformers. We fully characterize the properties of maps between measures that enable these to be represented in terms of in-context maps via a push forward. On the one hand, these include transformers; on the other hand, transformers universally approximate representations with any continuous in-context map. These properties are that the map preserves the cardinality of the support and that the regular part of its Fréchet derivative is uniformly continuous.
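One way to read "represented in terms of in-context maps via a push forward" is written out below. The symbols $F$, $G$, and $x_i$ are notation assumed for this note rather than taken from the paper; the second line is simply the definition of a push forward applied to a discrete measure.

```latex
% Assumed notation: F is a map between probability measures,
% G(\mu, \cdot) is the in-context map evaluated with context \mu.
F(\mu) = G(\mu, \cdot)_{\#}\, \mu .
% For a finite context of n tokens x_1, \dots, x_n:
\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}
\quad\Longrightarrow\quad
F(\mu) = \frac{1}{n} \sum_{i=1}^{n} \delta_{G(\mu, x_i)} .
```

In this discrete case each token $x_i$ is moved to $G(\mu, x_i)$, and the map applied to it depends on the whole context $\mu$.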
Moreover, we show that the solution map of the Vlasov equation, which is of nonlocal transport type, for interacting particle systems in the mean-field regime for the Cauchy problem satisfies the conditions on the one hand and, hence, can be approximated by a transformer; on the other hand, we prove that the measure-theoretic self-attention has the properties that ensure that the infinite depth, mean-field measure-theoretic transformer can be identified with a Vlasov flow.[19] The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale
Samar Haider,Amir Tohidi,Jenny S. Wang,Timothy Dörr,David M. Rothschild,Chris Callison-Burch,Duncan J. Watts
Main category: cs.CL
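TL;DR: This paper presents a large-scale, near real-time computational framework and dataset for systematically studying selection and framing bias in news coverage. It combines large language models with news scraping, provides structured analysis at the sentence, article, and publisher levels, and releases an interactive platform to support media bias research.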
Details
Motivation: Mainstream media shape public perception through topic selection and framing, but quantifying these forms of media bias at scale remains challenging.
Method: Builds a pipeline that integrates large language models with near real-time news scraping to extract structured information such as political lean, tone, and topics from hundreds of articles per day, with multi-level analysis at the sentence, article, and publisher levels.
Result: Systematically annotates more than 150,000 news articles from 2024, reveals patterns of selection and framing bias in news coverage, and releases an interactive data exploration platform.
Conclusion: The framework provides a reusable methodology and empirical resources for studying media bias at scale, supporting academic research and efforts to improve media accountability.
Abstract: Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel dataset can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
[20] QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs
David Beauchemin,Pier-Luc Veilleux,Richard Khoury,Johanna-Pascale Roy
Main category: cs.CL
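TL;DR: This paper introduces the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), used to evaluate the linguistic knowledge of large language models on Quebec-French grammatical phenomena.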
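Minimal-pair benchmarks of this kind are commonly scored by checking how often a model assigns the higher probability to the grammatical sentence of each pair, broken down by phenomenon. The sketch below illustrates that scoring rule only. The function name, the tuple layout, and the log-probability values are hypothetical and are not taken from QFrBLiMP.

```python
from collections import defaultdict


def minimal_pair_accuracy(pairs):
    """Per-phenomenon rate at which the grammatical sentence scores higher.

    pairs: iterable of (phenomenon, logp_grammatical, logp_ungrammatical).
    The tuple layout is an assumption made for this sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for phenomenon, logp_good, logp_bad in pairs:
        totals[phenomenon] += 1
        # A pair counts as correct only if the grammatical sentence is strictly preferred.
        if logp_good > logp_bad:
            hits[phenomenon] += 1
    return {p: hits[p] / totals[p] for p in totals}


# Hypothetical sentence log-probabilities, invented for illustration only.
example_pairs = [
    ("agreement", -10.0, -12.0),
    ("agreement", -9.0, -8.5),
    ("negation", -7.0, -9.0),
]
scores = minimal_pair_accuracy(example_pairs)
```

In practice the log-probabilities would come from summing token log-likelihoods under the model being evaluated. Whether to normalize by sentence length is a design choice this sketch leaves open.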
Details
Motivation: To fill the gap in evaluating the grammatical competence of LLMs in a specific regional variety of French (Quebec French), and to compare it with human performance.
Method: Builds a corpus of 1,761 minimal pairs covering 20 linguistic phenomena, annotated by 12 native speakers, and evaluates several LLMs on the benchmark.
Result: Model performance improves with scale, but models consistently fail on phenomena that require deep semantic understanding and fall significantly behind human performance.
Conclusion: Current LLMs are limited in handling complex semantic and grammatical phenomena, and a significant gap with human linguistic competence remains.
Abstract: In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate the linguistic knowledge of LLMs on prominent grammatical phenomena in Quebec-French. QFrBLiMP consists of 1,761 minimal pairs annotated with 20 linguistic phenomena. Specifically, these minimal pairs have been created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by twelve Quebec-French native speakers, who select the sentence they feel is grammatical amongst the two. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by observing the rate of higher probabilities assigned to the sentences of each minimal pair for each category. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation and a significant gap compared to human performance on these specific tasks.
[21] The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
Arda Uzunoglu,Tianjian Li,Daniel Khashabi
Main category: cs.CL
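TL;DR: This paper proposes the concept of benchmark harmony, which measures how uniformly a model's performance is distributed across subdomains. It finds that low-harmony benchmarks can yield misleading evaluation results and recommends reporting harmony alongside accuracy to make evaluation more reliable.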
Details
Motivation: Uneven subdomain distributions in existing benchmarks can mean that overall accuracy fails to reflect a model's competence across domains, so a new metric is needed to assess benchmark reliability.
Method: Studies benchmark reliability from a distributional perspective, proposes the benchmark harmony metric to quantify how uniform performance is across multiple-choice benchmarks and model families, and maps each benchmark onto a mean-variance plane of harmony for analysis.
Result: Experiments on 19 multiple-choice benchmarks and 5 model families show that on low-harmony benchmarks (such as ARC-Easy), overall accuracy can be dominated by certain subdomains (such as Biological Concepts), masking weak performance in other important areas (such as Geography and Physics).
Conclusion: High harmony is an important property of a good benchmark, and reporting harmony supports more robust, distributionally reliable model evaluation, leading to more trustworthy scientific conclusions and model development.
Abstract: Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
[22] Mitigating Biases in Language Models via Bias Unlearning
Dianqing Liu,Yi Liu,Guoqing Jin,Zhendong Mao
Main category: cs.CL
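TL;DR: This paper proposes BiasUnlearn, a novel debiasing framework that uses a dual-pathway unlearning mechanism to remove stereotypes while retaining anti-stereotypes and preventing bias polarity reversal.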
Details
Motivation: Existing parameter-modification debiasing methods severely damage the core capabilities of language models, while prompt-based debiasing methods work only for predefined trigger words and cannot address stereotypical associations deeply embedded in model parameters.
Method: Proposes the BiasUnlearn framework, which uses a dual-pathway unlearning mechanism combined with an adversarial forget set and dynamic dataset swapping to achieve targeted debiasing.
Result: Experiments across multiple language models and evaluation benchmarks show that BiasUnlearn mitigates model bias while preserving language modeling capabilities, and that debiasing weights transfer across model variants.
Conclusion: BiasUnlearn effectively mitigates bias in language models while retaining their core performance, and the results indicate that bias representations form during pre-training and persist after fine-tuning.
Abstract: Many studies have shown various biases targeting different demographic groups in language models, amplifying discrimination and harming fairness. Recent parameter modification debiasing approaches significantly degrade core capabilities such as text coherence and task accuracy. Prompt-based debiasing methods, which are effective only for predefined trigger words, fail to address deeply embedded stereotypical associations in model parameters. In this paper, we propose BiasUnlearn, a novel model debiasing framework which achieves targeted debiasing via dual-pathway unlearning mechanisms coordinating stereotype forgetting with anti-stereotype retention, while preventing bias polarity reversal through an adversarial forget set and dynamic dataset swapping. We conducted extensive experiments with multiple language models across various evaluation benchmarks. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities. Further experiments reveal that debiasing weights are transferable across model variants, confirming that bias representations become entrenched during pre-training and persist through fine-tuning phases.
[23] LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang,Yi Shen,Yuexin Bian,Qing Su,Shihao Ji,Yuanyuan Shi,Fei Miao
Main category: cs.CL
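TL;DR: This paper proposes LD-MoLE, a learnable dynamic routing mechanism for mixtures of LoRA experts that adaptively allocates experts per layer and per token and achieves better results than conventional TopK routing across multiple benchmarks.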
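To make the contrast with TopK concrete, the sketch below uses sparsemax, a well-known closed-form projection onto the probability simplex, as a stand-in for a differentiable router. This is an assumption chosen for illustration. The entry does not say which routing function LD-MoLE uses, and the router logits below are invented. The point shown is that the number of experts with nonzero weight varies with the input instead of being fixed at K.

```python
def sparsemax(logits):
    """Closed-form Euclidean projection of logits onto the probability simplex.

    Used here only as an illustrative stand-in for a differentiable router;
    it is not claimed to be the routing function of LD-MoLE.
    """
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    k, support_sum = 0, 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumsum += z
        # j is in the support while 1 + j * z_(j) exceeds the running sum.
        if 1.0 + j * z > cumsum:
            k, support_sum = j, cumsum
    tau = (support_sum - 1.0) / k  # threshold shared by all experts
    return [max(z - tau, 0.0) for z in logits]


def active_experts(weights):
    """Count experts that receive a strictly positive routing weight."""
    return sum(1 for w in weights if w > 0.0)


# Hypothetical router logits for two tokens over three LoRA experts.
peaked = sparsemax([2.0, 1.0, 0.1])   # one dominant expert
flat = sparsemax([0.5, 0.4, 0.1])     # several comparable experts
```

With these inputs the peaked token activates a single expert while the flat token spreads weight over all three, which is the kind of token-dependent allocation the entry describes. A sparsity regularizer like the one mentioned in the abstract would act on how often weights stay nonzero.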
Details
Motivation: Existing MoE methods based on TopK routing require careful hyperparameter tuning and activate a fixed number of experts for every token, so they lack the flexibility to adapt to different layers and inputs.
Method: Proposes LD-MoLE, which replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution, and introduces an analytical sparsity control objective to regulate the number of activated experts, so that the number of experts is decided dynamically for each token at each layer.
Result: Experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared with state-of-the-art baselines across a diverse set of benchmarks and learns flexible expert allocation.
Conclusion: Through learnable, dynamic, layer-wise expert allocation, LD-MoLE improves the efficiency and performance of combining PEFT with MoE, offering a better way to adapt large models to downstream tasks.
Abstract: Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function and a closed-form solution. Moreover, our design allows the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks. Our method not only achieves superior performance, but also demonstrates the ability to learn token-dependent and layer-wise expert allocation.
[24] Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities
Jiayi Kuang,Haojing Huang,Yinghui Li,Xinnian Liang,Zhikun Xu,Yangning Li,Xiaoyu Tan,Chao Qu,Meishan Zhang,Ying Shen,Philip S. Yu
Main category: cs.CL
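TL;DR: Proposes a new paradigm for evaluating the mathematical atomic capabilities of large models, decomposing mathematical ability along two dimensions, field-specific and logical-level, and experimentally exploring how different atomic capabilities influence one another.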
Details
Motivation: The mathematical reasoning of current large models relies mainly on training with large-scale data, which leaves open whether they truly understand mathematical concepts and reasoning principles. A finer-grained evaluation is needed to probe whether models possess genuine mathematical thinking.
Method: Divides mathematical atomic capabilities into four mathematical fields (algebra, geometry, analysis, topology) and three logical levels (conceptual understanding, forward formal reasoning, counterexample-driven backward reasoning), builds corresponding training and evaluation datasets, and runs systematic experiments.
Result: Experiments reveal differences in how models perform on the various atomic capabilities and interactions among those capabilities, finding that some capabilities can promote the development of others.
Conclusion: Decoupling mathematical intelligence into atomic components helps in understanding model cognition and advances a more efficient, transferable, and cognitively grounded "atomic thinking" training paradigm.
Abstract: Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely memorize the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments on how different atomic capabilities influence others, to explore strategies for eliciting the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".
[25] Controlled Generation for Private Synthetic Text
Zihao Zhao,Anjalie Field
Main category: cs.CL
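TL;DR: Proposes a new method for privacy-preserving synthetic text generation based on de-identification and the HIPS theory, using entity-aware control codes for controllable generation and achieving a good balance between privacy and utility on legal and clinical data.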
Details
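Motivation: In high-stakes domains such as healthcare, social services, and law, text anonymization is essential for responsibly developing and deploying AI. Existing methods struggle to balance privacy protection and text utility, so more effective synthetic text generation is needed.
Method: Introduces entity-aware control codes and performs controllable generation with either in-context learning (ICL) or prefix tuning. The ICL variant relies on the de-identification system for privacy, while the prefix tuning variant uses a custom masking strategy and loss function to improve scalability and generation quality.
Result: Experiments on legal and clinical datasets show that the method strikes a good balance between privacy protection and text utility and supports high-quality, scalable synthetic text generation.
Conclusion: The method offers a practical and effective privacy-preserving approach to synthetic text generation in sensitive domains, combining flexibility with strong performance.
Abstract: Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
[26] CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling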
Mingyu Chen,Jingkai Lin,Zhaojie Chu,Xiaofen Xing,Yirong Chen,Xiangmin Xu
Main category: cs.CL
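TL;DR: Proposes the CATCH framework, which improves the therapy fidelity and decision-making logic of AI counseling through progressive dialogue synthesis and a memory-driven dynamic planning thinking pattern.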
Details
Motivation: Existing LLM-based counseling research generates multi-turn dialogues in a single pass, which leads to low therapy fidelity and fails to capture the decision-making rationale behind each response.
Method: Proposes the CATCH framework: 1) a Progressive Dialogue Synthesis strategy that extracts goals, resources, and solutions from a client's self-report, organizes them into structured outlines, and incrementally generates stage-aligned counseling dialogues; 2) a Memory-Driven Dynamic Planning thinking pattern that combines memory enhancement, global planning, and strategy reasoning, with a collaborative multi-agent optimizer that uses MDP to attach an explicit chain of thought to each dialogue turn.
Result: Experiments and human evaluations show that CATCH significantly improves the fidelity and logical coherence of AI counseling.
Conclusion: CATCH effectively improves the quality of AI counseling dialogues and offers a new approach to high-fidelity, interpretable psychological intervention systems.
Abstract: AI counseling based on large language models has recently made significant progress. However, existing studies employ a one-time generation approach to synthesize multi-turn dialogue samples, resulting in low therapy fidelity and failing to capture the decision-making rationale behind each response. In this work, we propose CATCH, a novel data synthesis framework designed to address these challenges. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy, which extracts goals, resources, and solutions from a client's self-report, organizes them into structured outlines, and then incrementally generates stage-aligned counseling dialogues. To capture the decision-making rationale behind each response, we propose the Memory-Driven Dynamic Planning thinking pattern that integrates memory enhancement, global planning, and strategy reasoning; a collaborative multi-agent optimizer then leverages MDP to attach explicit chain-of-thought to each dialogue turn. Extensive experiments and human evaluations demonstrate that CATCH significantly enhances fidelity and logical coherence in AI counseling.
[27] Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications
Chenhua Shi,Gregor Macdonald,Bhavika Jalli,Wanlu Lei,John Zou,Mridul Jain,Joji Philip
Main category: cs.CL
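TL;DR: Proposes a fully automated, retrieval-augmented pipeline for generating synthetic question-answer pairs, used to build high-quality instruction and reinforcement learning datasets and applied to telecom network troubleshooting.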
Details
Motivation: Manually annotating large-scale, high-quality instruction and reinforcement data is costly, especially in domains that require deep technical knowledge (such as telecom network troubleshooting), and cannot keep up with the needs of large-model training.
Method: Builds a multi-stage framework that combines a retriever, a base generator, and a refinement model, generates and refines QA pairs from documents in a domain-specific knowledge graph, and filters low-quality samples with customized RAGAS-based scoring.
Result: The approach is demonstrated in a real-world radio access network (RAN) troubleshooting scenario, where it generates complex, context-rich troubleshooting solution plans without human intervention.
Conclusion: The method offers a scalable way to build high-quality datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
Abstract: The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming, particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, base generator, and refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.
[28] Detecting Hope Across Languages: Multiclass Classification for Positive Online Discourse
T. O. Abiola,K. D. Abiodun,O. E. Olumide,O. O. Adebanji,O. Hiram Calvo,Grigori Sidorov
Main category: cs.CL
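TL;DR: This paper proposes a multilingual hope speech detection method based on XLM-RoBERTa that identifies three types of hope in English, Urdu, and Spanish and outperforms existing methods on the PolyHope-M 2025 dataset.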
Details
Motivation: Hope speech detection helps promote positive communication and well-being, but existing methods fall short in multilingual and fine-grained classification.
Method: Uses the XLM-RoBERTa model for multilingual, multiclass hope speech classification into three classes: Generalized Hope, Realistic Hope, and Unrealistic Hope.
Result: Achieves competitive performance on the PolyHope dataset, with macro F1 scores significantly above prior state-of-the-art methods, and performs well in low-resource languages in particular.
Conclusion: The method advances multilingual, fine-grained hope speech detection and can be applied to positive content moderation and to building supportive online communities.
Abstract: The detection of hopeful speech in social media has emerged as a critical task for promoting positive discourse and well-being. In this paper, we present a machine learning approach to multiclass hope speech detection across multiple languages, including English, Urdu, and Spanish. We leverage transformer-based models, specifically XLM-RoBERTa, to detect and categorize hope speech into three distinct classes: Generalized Hope, Realistic Hope, and Unrealistic Hope. Our proposed methodology is evaluated on the PolyHope dataset for the PolyHope-M 2025 shared task, achieving competitive performance across all languages. We compare our results with existing models, demonstrating that our approach significantly outperforms prior state-of-the-art techniques in terms of macro F1 scores. We also discuss the challenges in detecting hope speech in low-resource languages and the potential for improving generalization. This work contributes to the development of multilingual, fine-grained hope speech detection models, which can be applied to enhance positive content moderation and foster supportive online communities.
[29] TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Zhepei Wei,Xiao Yang,Kai Sun,Jiaqi Wang,Rulin Shao,Sean Chen,Mohammad Kachuee,Teja Gollapudi,Tony Liao,Nicolas Scheffer,Rakesh Wanga,Anuj Kumar,Yu Meng,Wen-tau Yih,Xin Luna Dong
Main category: cs.CL
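TL;DR: This paper proposes TruthRL, a reinforcement learning framework that directly optimizes the truthfulness of large language models. It uses a ternary reward to balance correct answers, hallucinations, and abstentions, which significantly reduces hallucinations and improves truthfulness.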
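A ternary reward of the kind described in this entry, combined with GRPO-style group normalization, can be sketched as follows. The reward values (+1, 0, -1), the exact-match check, the abstention phrase, and the use of the population standard deviation are all assumptions made for illustration. The entry states only that the reward distinguishes correct answers, hallucinations, and abstentions.

```python
from statistics import mean, pstdev

ABSTAIN = "i don't know"  # hypothetical abstention phrase


def ternary_reward(answer, gold):
    """Assumed scheme: +1 for a correct answer, 0 for abstaining, -1 otherwise."""
    normalized = answer.strip().lower()
    if normalized == ABSTAIN:
        return 0.0
    return 1.0 if normalized == gold.strip().lower() else -1.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses sampled for the same question.
responses = ["Paris", "I don't know", "Lyon"]
rewards = [ternary_reward(r, gold="Paris") for r in responses]
advantages = group_advantages(rewards)
```

Under this assumed scheme an abstention is ranked between a correct answer and a hallucination within the group, whereas a binary reward would score it the same as a wrong answer. That is the behaviour the abstract contrasts with accuracy-driven baselines.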
Details
Motivation: LLMs are prone to hallucination in factoid question answering, and existing methods struggle to balance accuracy with recognizing uncertainty, so a better optimization objective for truthfulness is needed.
Method: Proposes the TruthRL framework, implemented with GRPO reinforcement learning and a ternary reward function that distinguishes correct answers, hallucinations, and abstentions, to directly optimize model truthfulness.
Result: Experiments on four knowledge-intensive benchmarks show that, compared with vanilla RL, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across backbone models and under both retrieval and non-retrieval setups.
Conclusion: With a well-designed learning objective, TruthRL lowers hallucination while keeping accuracy high, achieving a better truthfulness trade-off and underscoring the importance of learning objective design for building truthful LLMs.
Abstract: While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.
[30] Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches
Obed Junias,Prajakta Kini,Theodora Chaspari
Main category: cs.CL
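TL;DR: This study compares deep neural network (DNN) and large language model (LLM) approaches to depression detection in terms of performance and fairness. LLMs show less gender bias and perform better for Hispanic participants, but racial disparities persist. The worst-group loss and guided prompting improve the fairness of DNNs and LLMs, respectively.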
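The worst-group loss mentioned in this entry is designed to minimize loss for the worst-performing demographic group. A minimal sketch of that objective takes the maximum over per-group mean losses, as below. The function names, the group labels, and the loss values are hypothetical, and the fairness-regularized alternative from the paper is not reproduced here.

```python
def group_mean_losses(losses, groups):
    """Average the per-example losses within each demographic group."""
    totals, counts = {}, {}
    for loss, group in zip(losses, groups):
        totals[group] = totals.get(group, 0.0) + loss
        counts[group] = counts.get(group, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def worst_group_loss(losses, groups):
    """Training objective sketch: the mean loss of the worst-performing group."""
    return max(group_mean_losses(losses, groups).values())


# Hypothetical per-example losses and group labels, invented for illustration.
example_losses = [0.2, 0.4, 1.0, 0.6]
example_groups = ["a", "a", "b", "b"]
per_group = group_mean_losses(example_losses, example_groups)
objective = worst_group_loss(example_losses, example_groups)
```

Minimizing the maximum rather than the overall average means gradient updates focus on whichever group is currently served worst, which is one reading of why the abstract reports a better balance between performance and fairness for this loss.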
Details
Motivation: To examine algorithmic bias in language-based depression detection, particularly socio-demographic disparities related to gender and race/ethnicity, in order to improve the fairness of mental health technology.
Method: Compares DNN-based embedding models with few-shot LLMs, evaluating performance and fairness on DAIC-WOZ clinical interview transcripts. Fairness-aware loss functions are applied to the DNNs, and varied prompt framing and shot counts are explored for the LLMs.
Result: LLMs outperform DNNs in depression classification, particularly for underrepresented groups such as Hispanic participants, and show less gender bias, though racial disparities persist. For DNNs, the worst-group loss balances performance and fairness better than the fairness-regularized loss. For LLMs, guided prompting with ethical framing helps mitigate gender bias in the 1-shot setting, but adding more shots does not further reduce disparities.
Conclusion: LLMs show better fairness and performance in depression detection, especially in mitigating gender bias, but further work is needed on racial bias. Prompt engineering and specific loss functions are effective strategies for mitigating bias.
Abstract: This paper investigates algorithmic bias in language-based models for automated depression detection, focusing on socio-demographic disparities related to gender and race/ethnicity. Models trained using deep neural network (DNN) based embeddings are compared to few-shot learning approaches with large language models (LLMs), evaluating both performance and fairness on clinical interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to DNN-based models, while in-context learning with varied prompt framing and shot counts is explored for LLMs. Results indicate that LLMs outperform DNN-based models in depression classification, particularly for underrepresented groups such as Hispanic participants. LLMs also exhibit reduced gender bias compared to DNN-based embeddings, though racial disparities persist. Among fairness-aware techniques for mitigating bias in DNN-based embeddings, the worst-group loss, which is designed to minimize loss for the worst-performing demographic group, achieves a better balance between performance and fairness. In contrast, the fairness-regularized loss minimizes loss across all groups but performs less effectively. In LLMs, guided prompting with ethical framing helps mitigate gender bias in the 1-shot setting. However, increasing the number of shots does not lead to further reductions in disparities. For race/ethnicity, neither prompting strategy nor increasing $N$ in $N$-shot learning effectively reduces disparities.
[31] RoBiologyDataChoiceQA: A Romanian Dataset for improving Biology understanding of Large Language Models
Dragos-Dumitru Ghinea,Adela-Nicoleta Corbeanu,Adrian-Marius Dumitran
Main category: cs.CL
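TL;DR: This study presents a Romanian-language biology dataset of approximately 14,000 multiple-choice questions for evaluating the comprehension and reasoning of large language models in a specific domain and a low-resource language, and analyzes model performance under several optimization methods.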
Details
Motivation: The performance of LLMs in domain-specific applications and non-English languages (especially low-resource languages) remains underexplored, so targeted datasets and evaluation methods are needed.
Method: Builds a high-quality Romanian-language dataset of multiple-choice biology questions, benchmarks several popular LLMs, analyzes their accuracy, reasoning patterns, and understanding of domain-specific terminology and linguistic nuances, and studies the impact of optimization techniques such as prompt engineering and fine-tuning.
Result: The experiments reveal the strengths and limitations of current LLMs on scientific tasks in a low-resource language. Some models are weak at understanding domain terminology and at reasoning, while appropriate prompt design and fine-tuning can significantly improve performance.
Conclusion: The dataset is an important resource for evaluating and improving LLMs in specific domains and low-resource languages, and the findings point to directions for optimizing models on specialized knowledge tasks.
Abstract: In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset for multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a comprehensive resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. Additionally, we perform comprehensive experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering valuable insights for future research and development.
[32] ReTAG: Retrieval-Enhanced, Topic-Augmented Graph-Based Global Sensemaking
Boyoung Kim,Dosung Lee,Sumin An,Jinseong Jeong,Paul Hongsuck Seo
Main category: cs.CL
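TL;DR: Proposes the ReTAG framework, a retrieval-enhanced, topic-augmented graph-based method that constructs topic-specific subgraphs and retrieves relevant summaries to generate responses, improving response quality while significantly reducing inference time.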
Details
Motivation: Existing approaches to global sensemaking lack retrieval mechanisms and topic specificity and incur high inference costs, which makes it hard to answer questions by synthesizing information from an entire corpus.
Method: Proposes ReTAG (Retrieval-Enhanced, Topic-Augmented Graph), which combines retrieval enhancement with topic augmentation to construct topic-specific subgraphs and retrieve relevant summaries for response generation.
Result: Experiments show that ReTAG improves response quality while significantly reducing inference time compared with the baseline.
Conclusion: ReTAG addresses the limitations of existing graph-based methods for global sensemaking and is more practical and scalable.
Abstract: Recent advances in question answering have led to substantial progress in tasks such as multi-hop reasoning. However, global sensemaking, that is, answering questions by synthesizing information from an entire corpus, remains a significant challenge. A prior graph-based approach to global sensemaking lacks retrieval mechanisms and topic specificity, and incurs high inference costs. To address these limitations, we propose ReTAG, a Retrieval-Enhanced, Topic-Augmented Graph framework that constructs topic-specific subgraphs and retrieves the relevant summaries for response generation. Experiments show that ReTAG improves response quality while significantly reducing inference time compared to the baseline. Our code is available at https://github.com/bykimby/retag.
[33] Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer
Jaeyoung Kim,Jongho Lee,Hongjun Choi,Sion Jang
Main category: cs.CL
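TL;DR: This study uses author profile data from scientific papers to explore personalized figure caption generation. It finds that rich author profile data significantly improves the personalization performance of multimodal large language models, but that there is a trade-off between matching author style and maintaining caption quality.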
Details
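Motivation: To improve the personalization of figure caption generation for scientific papers by using author profile data.
Method: Combines author profile data with relevant metadata and experimentally evaluates their effect on personalized caption generation with multimodal large language models.
Result: Experiments show that author profile data significantly improves personalization performance, but they also reveal a fundamental trade-off between matching author style and maintaining caption quality.
Conclusion: The study offers useful insights and future directions for developing practical caption automation systems that balance style matching and generation quality.
Abstract: We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap challenge.
[34] Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling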
Shuyang Jiang,Yusheng Liao,Ya Zhang,Yanfeng Wang,Yu Wang
Main category: cs.CL
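TL;DR: This paper proposes DECS, a new framework that uses a decoupled token-level reward mechanism and a curriculum batch scheduling strategy to address overthinking in large reasoning models, cutting reasoning tokens by more than 50% while maintaining or even improving performance.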
Details
Motivation: Existing length-penalty methods suffer from a mismatch between trajectory-level rewards and token-level optimization, which degrades performance and fails to stop models from generating excessively long reasoning paths ("overthinking").
Method: A theoretical analysis identifies two flaws in current length rewards: they wrongly penalize essential exploratory tokens and inadvertently reward partial redundancy. The proposed DECS framework combines a decoupled token-level penalty on redundant tokens with a new curriculum batch scheduling strategy to balance reasoning efficiency and efficacy.
Result: Experiments on seven benchmarks show that DECS reduces reasoning tokens by more than 50% while maintaining or improving model performance.
Conclusion: DECS substantially improves reasoning efficiency without harming the model's reasoning ability, showing that efficient and effective reasoning optimization is feasible.
Abstract: While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.
[35] Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
Keyu He,Tejas Srinivasan,Brihi Joshi,Xiang Ren,Jesse Thomason,Swabha Swayamdipta
Main category: cs.CL
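TL;DR: This paper proposes scoring the visual fidelity and contrastiveness of explanations generated by vision-language models (VLMs) to help users judge more accurately whether a model prediction is correct, reducing overreliance on incorrect predictions among blind and low-vision users.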
Details
Motivation: Natural language explanations can mislead users into believing inaccurate VLM predictions, so a more reliable way of assessing explanation quality is needed to prevent overreliance on model outputs.
Method: Proposes two new explanation quality scoring functions: Visual Fidelity, which measures how faithful an explanation is to the visual context, and Contrastiveness, which measures how well the explanation distinguishes the model's prediction from plausible alternatives. Both are validated on the A-OKVQA and VizWiz tasks.
Result: The proposed quality scores are better calibrated with model correctness than existing explanation qualities. In a user study, showing these scores improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions.
Conclusion: Explanation quality scores help users place appropriate trust in VLM predictions, especially when the visual content is not accessible, and are of practical value.
Abstract: When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
[36] ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Yindong Wang,Martin Preiß,Margarita Bugueño,Jan Vincent Hoffbauer,Abdullatif Ghajar,Tolga Buz,Gerard de Melo
Main category: cs.CL
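TL;DR: ReFACT is a benchmark of 1,001 expert-annotated question-answer pairs for detecting LLM confabulation in scientific domains. It supports fine-grained error localization and correction, and reveals serious shortcomings of current LLMs in scientific factual accuracy.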
Details
Motivation: LLMs often generate incorrect scientific facts, and existing binary factuality evaluation is not enough to measure their reliability; a finer-grained, human-validated benchmark is needed. Method: Builds the ReFACT dataset of correct and incorrect scientific question-answer pairs annotated with precise error spans and error types, supporting a three-stage evaluation (confabulation detection, error localization, and correction), and benchmarks 9 mainstream LLMs. Result: Current state-of-the-art models such as GPT-4o perform poorly at distinguishing factual from confabulated answers, with accuracy of only about 50%, which exposes reliability problems with using LLMs as judges. Conclusion: Fine-grained, human-annotated benchmarks are needed to detect and correct scientific confabulation in specific domains and to improve the trustworthiness of LLMs in scientific settings. Abstract: Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (~50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is released on GitHub: https://github.com/ddz5431/ReFACT
[37] ASR Under Noise: Exploring Robustness for Sundanese and Javanese
Salsabila Zahirah Pranida,Muhammad Cendekia Airlangga,Rifo Ahmad Genadi,Shady Shehata
Main category: cs.CL
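TL;DR: Studies the robustness of Whisper-based automatic speech recognition models for Javanese and Sundanese, and finds that noise-aware training substantially improves performance in noisy conditions, particularly for larger Whisper models.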
Details
Motivation: Whisper models perform well under clean conditions, but their effectiveness in noisy environments is unclear, so their robustness in realistic noisy scenarios needs to be evaluated and improved. Method: Applies multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluates them across a range of signal-to-noise ratios (SNRs). Result: Noise-aware training substantially improves performance in noisy conditions, with a more pronounced effect for larger Whisper models; error analysis reveals language-specific challenges. Conclusion: Noise-aware training effectively improves the robustness of Whisper models for speech recognition in Indonesian regional languages (Javanese and Sundanese), and future work can optimize further for language-specific characteristics. Abstract: We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, the effectiveness of these models in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
[38] RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
Jisu Shin,Hoyun Song,Juhyun Oh,Changgeon Ko,Eunsu Kim,Chani Jung,Alice Oh
Main category: cs.CL
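TL;DR: This paper introduces RoleConflictBench, a new benchmark for evaluating the contextual sensitivity of LLMs in complex social role conflicts. LLMs respond to situational cues to some extent, but their decisions are still driven mainly by inherent social role biases: they prioritize roles in the Family and Occupation domains and show a clear preference for male roles and Abrahamic religions.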
Details
Motivation: As LLMs become more influential in human decision-making, it is essential to understand how they behave in complex social dilemmas that have no clear correct answer, such as role conflicts. Existing research focuses on settings with predefined answers and lacks a systematic evaluation of contextual sensitivity. Method: Builds RoleConflictBench, a benchmark with a three-stage generation pipeline that produces over 13,000 realistic role conflict scenarios covering 65 roles, systematically varying role responsibilities and situational urgency levels, and analyzes decision patterns across 10 different LLMs. Result: LLMs show some contextual sensitivity, but their decisions are driven mainly by inherent role biases; models generally favor Family and Occupation roles and show a marked preference for male roles and Abrahamic religions. Conclusion: Current LLMs lack sufficient contextual sensitivity when handling role conflicts, and their decisions are dominated by deep-seated social role biases, which points to both risks and directions for improvement in social value alignment. Abstract: Humans often encounter role conflicts: social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) become increasingly influential in human decision-making, understanding how they behave in complex social situations is essential. While previous research has evaluated LLMs' social abilities in contexts with predefined correct answers, role conflicts represent inherently ambiguous social dilemmas that require contextual sensitivity: the ability to recognize and appropriately weigh situational cues that can fundamentally alter decision priorities. To address this gap, we introduce RoleConflictBench, a novel benchmark designed to evaluate LLMs' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles, systematically varying their associated expectations (i.e., their responsibilities and obligations) and situational urgency levels. By analyzing model choices across 10 different LLMs, we find that while LLMs show some capacity to respond to these contextual cues, this sensitivity is insufficient. Instead, their decisions are predominantly governed by a powerful, inherent bias related to social roles rather than situational information. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains, as well as a clear prioritization of male roles and Abrahamic religions across most evaluated models.
[39] PerQ: Efficient Evaluation of Multilingual Text Personalization Quality
Dominik Macko,Andrew Pulver
Main category: cs.CL
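TL;DR: This paper proposes PerQ, a computationally efficient method for evaluating the personalization quality of text. It can be used to compare the generation capabilities of large and small language models and reduces wasted resources.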
Details
Motivation: Because no metrics exist for specific aspects of a text such as personalization quality, researchers usually rely on LLMs for meta-evaluation; a single model is biased, and using several models raises the cost. Method: Proposes PerQ, a computationally efficient way to evaluate the personalization quality of text generated by language models, and demonstrates its usability in a case study comparing the generation capabilities of large and small language models. Result: PerQ evaluates personalization quality effectively in research use and reduces dependence on multiple LLMs, lowering the cost and resource waste of meta-evaluation. Conclusion: PerQ is an efficient and practical personalization quality metric that helps save computational resources in language model research. Abstract: Since no metrics are available to evaluate specific aspects of a text, such as its personalization quality, researchers often rely solely on large language models to meta-evaluate such texts. Because individual language models have internal biases, combining several models for evaluation is recommended, which directly increases the cost of such meta-evaluation. In this paper, we introduce PerQ, a computationally efficient method for evaluating the personalization quality of a given text (generated by a language model). A case study comparing the generation capabilities of large and small language models shows the usability of the proposed metric in research, effectively reducing the waste of resources.
[40] Mem-α: Learning Memory Construction via Reinforcement Learning
Yu Wang,Ryuichi Takanobu,Zhiqi Liang,Yuzhen Mao,Yuanzhe Hu,Julian McAuley,Xiaojian Wu
Main category: cs.CL
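TL;DR: This paper proposes Mem-alpha, a reinforcement learning framework that trains LLM agents to manage complex memory systems effectively. It optimizes memory construction through interaction and feedback, performs well on long sequences, and generalizes strongly.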
Details
Motivation: Existing memory-augmented agents rely on predefined instructions to update memory, but models struggle to decide on their own what to store, how to structure it, and when to update it, which leads to poor memory construction and information loss. Method: Proposes the Mem-alpha framework, which uses reinforcement learning with a specialized training dataset covering diverse multi-turn interaction patterns. Downstream question-answering accuracy serves as the reward signal, training agents to perform effective memory operations in a memory architecture composed of core, episodic, and semantic components. Result: Mem-alpha significantly outperforms existing memory-augmented agent methods on multiple benchmarks and generalizes strongly: although training sequences are at most 30k tokens long, the agents successfully handle sequences of more than 400k tokens. Conclusion: Mem-alpha uses reinforcement learning to manage complex memory systems efficiently, addresses the limitations of traditional approaches to memory construction, and shows good scalability and practical potential. Abstract: Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
[41] Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Chuanyang Zheng,Jiankai Sun,Yihang Gao,Enze Xie,Yuehao Wang,Peihao Wang,Ting Xu,Matthew Chang,Liliang Ren,Jingyao Li,Jing Xiong,Kashif Rasul,Mac Schwager,Anderson Schneider,Zhangyang Wang,Yuriy Nevmyvaka
Main category: cs.CL
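TL;DR: This paper proposes KERN, a kernel-inspired router with zero additional cost, as an alternative to Softmax, and shows that it is effective in MoE models and LLMs.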
Details
Motivation: Traditional MoE models almost universally use Softmax as the router function, but this choice lacks a principled justification; the authors connect it to Nadaraya-Watson regression to explore a better-motivated routing mechanism. Method: Inspired by Nadaraya-Watson regression, proposes the KERN router, unifies FFN and MoE as special cases of that framework, and recommends ReLU activation and ℓ2-normalization in KERN. Result: Experiments show that KERN is an effective replacement for Softmax, performs well in MoE models and LLMs, and adds no extra cost. Conclusion: Softmax is not the only option for MoE routing; KERN offers a more general and efficient alternative and moves the design of MoE routing from empirical practice toward principle. Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on Softmax as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using Softmax to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to Softmax. We demonstrate that this router generalizes both Sigmoid- and Softmax-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of ReLU activation and ℓ2-normalization in the KERN router function. Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function, KERN.
[42] Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares
Main category: cs.CL
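TL;DR: Studies how different architectures adapt to sequence labeling tasks and finds that models which performed well in simpler settings do poorly across languages or on complex, multi-level tasks.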
Details
Motivation: To explore how architectures other than pretrained Transformer encoders perform on sequence labeling, especially on tasks that differ in structural complexity, label space, and token dependencies. Method: Evaluates xLSTMs, structured state-space models, diffusion models, and adversarial learning on sequence labeling tasks across multiple languages and levels of complexity. Result: These alternative architectures perform well on simple tasks, but their performance drops on more complex structured tasks and in cross-lingual settings, showing limited generalization. Conclusion: Pretrained Transformers retain their advantage in sequence labeling, and other emerging architectures need further improvement to handle complex and diverse tasks. Abstract: Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures (such as xLSTMs, structured state-space models, diffusion models, and adversarial learning) have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
[43] Reliability Crisis of Reference-free Metrics for Grammatical Error Correction
Takumi Goto,Yusuke Sakai,Taro Watanabe
Main category: cs.CL
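TL;DR: This paper studies how vulnerable reference-free evaluation metrics for grammatical error correction (GEC) are to adversarial systems. It proposes attack strategies against four reference-free metrics and shows that existing metrics are easily manipulated, so more robust evaluation methods are needed.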
Details
Motivation: Existing reference-free GEC metrics correlate highly with human judgments but do not account for adversarial systems that game them, which can make evaluation results unreliable. Method: Proposes adversarial attack strategies against four reference-free metrics (SOME, Scribendi, IMPARA, and LLM-based metrics) and builds adversarial systems to test them experimentally. Result: The proposed adversarial systems outscore the current state-of-the-art systems on all four reference-free metrics, exposing how easily these metrics can be manipulated. Conclusion: Existing reference-free GEC metrics lack robustness, and new evaluation methods that better withstand adversarial attacks are urgently needed. Abstract: Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments. However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermines the reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.
[44] RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation
Andrei C. Coman,Ionut-Teodor Sorodoc,Leonardo F. R. Ribeiro,Bill Byrne,James Henderson,Adrià de Gispert
Main category: cs.CL
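TL;DR: This paper proposes RAGferee, a method that converts question-answering datasets into preference pairs for training reward models (RMs) better suited to retrieval-augmented generation (RAG). With only a small dataset, the resulting RMs outperform large general-purpose models.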
Details
Motivation: Existing reward models struggle in RAG settings to assess key properties of a response such as faithfulness, relevance, and appropriate refusal, and there are no public RAG-specific preference datasets or specialized reward models. Method: Proposes RAGferee, which converts existing question-answering datasets into preference pairs that emphasize factual groundedness over linguistic style, and fine-tunes reward models of 7B to 24B parameters on the resulting data. Result: A small preference dataset of 4K samples is built, and the RAG-specific reward models trained on it surpass 70B+ models trained on much larger general datasets on ContextualJudgeBench, with an absolute improvement of +15.5%. Conclusion: RAGferee effectively improves the ability of reward models to judge responses in RAG settings, indicating that designing training data and models for a specific task is more advantageous than relying on large-scale general data. Abstract: Existing Reward Models (RMs), typically trained on general preference data, struggle in Retrieval Augmented Generation (RAG) settings, which require judging responses for faithfulness to retrieved context, relevance to the user query, appropriate refusals when context is insufficient, and completeness and conciseness of information. To address the lack of publicly available RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a methodology that repurposes question-answering (QA) datasets into preference pairs that prioritise groundedness over stylistic features, enabling the training of contextual RMs better suited to judging RAG responses. Using RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on much larger (up to 2.4M samples) general corpora, with an absolute improvement of +15.5%.
[45] RE$^2$: Improving Chinese Grammatical Error Correction via Retrieving Appropriate Examples with Explanation
Baoxin Wang,Yumeng Luo,Yixuan Wang,Dayong Wu,Wanxiang Che,Shijin Wang
Main category: cs.CL
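TL;DR: This paper proposes RE$^2$, a method that retrieves reference examples using explanations of grammatical errors rather than text similarity, improving the performance of LLMs on Chinese grammatical error correction (CGEC).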
Details
Motivation: Existing methods rely mainly on text similarity for example retrieval, which easily mismatches grammatical error patterns and fails to improve CGEC performance effectively. Method: Proposes RE$^2$, which uses grammatical error explanations as the retrieval key, and builds a high-quality grammatical error explanation (GEE) dataset to support LLMs in Chinese grammatical error correction. Result: Experiments on two CGEC datasets show that the method effectively improves CGEC performance. Conclusion: Example retrieval based on error explanations outperforms traditional text-similarity retrieval, and the GEE dataset is a valuable resource for future CGEC and GEE research. Abstract: The primary objective of Chinese grammatical error correction (CGEC) is to detect and correct errors in Chinese sentences. Recent research shows that large language models (LLMs) have been applied to CGEC with significant results. For LLMs, selecting appropriate reference examples can help improve their performance. However, existing methods predominantly rely on text similarity for example retrieval, a strategy that frequently mismatches actual error patterns and retrieves lexically similar yet grammatically irrelevant sentences. To address this problem, we propose a method named RE$^2$, which retrieves appropriate examples with explanations of grammatical errors. Instead of using the text similarity of the input sentence, we use explanations of grammatical errors to select reference examples, which LLMs then use to improve CGEC performance. We conduct experiments on two CGEC datasets and create a high-quality grammatical error explanation (GEE) dataset, which is not only used in our research but also serves as a valuable resource for future studies in both CGEC and GEE. The experimental results on the two datasets indicate that our proposed method effectively improves the performance of CGEC.
[46] Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning
Arash Marioriyad,Shaygan Adim,Nima Alighardashi,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Main category: cs.CL
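TL;DR: The study shows that LLM reasoning under chain-of-thought prompting is strongly influenced by hints hidden in the prompt, and it characterizes how much models depend on such hints and how their behavior differs across conditions.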
Details
Motivation: To examine whether the reasoning chains that LLMs generate on mathematical and logical reasoning tasks faithfully reflect their internal computation, or are merely post-hoc explanations of hints embedded in the prompt. Method: Systematically studies the faithfulness of chain-of-thought prompting on four datasets and two state-of-the-art models by controlling the correctness, presentation style, and complexity of hints, and evaluates both task accuracy and whether hints are explicitly mentioned. Result: Correct hints substantially improve accuracy and incorrect hints reduce it; equation-based hints are frequently referenced, while simple hints are mostly adopted silently; sycophancy-style prompts encourage explicit acknowledgement, whereas data-leak-style prompts lead to hidden reliance. Conclusion: LLM reasoning is systematically shaped by shortcut hints in the prompt, which undermines the faithfulness of the reasoning chain; prompts should be designed carefully to avoid masking the true reasoning process. Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales faithful to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs. unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.
[47] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
Daocheng Fu,Jianbiao Mei,Licheng Wen,Xuemeng Yang,Cheng Yang,Rong Wu,Tao Hu,Siqi Li,Yufan Shen,Xinyu Cai,Pinlong Cai,Botian Shi,Yong Liu,Yu Qiao
Main category: cs.CL
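TL;DR: This paper proposes RE-Searcher, a search agent that uses goal-oriented planning and self-reflection to make LLMs more robust in complex search environments, mitigating knowledge cutoff, hallucination, and query fragility.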
Details
Motivation: In real-world use, LLMs are limited by knowledge cutoff, hallucination, and restricted interaction modalities. External search tools help, but in a complex search environment small variations in a query can steer reasoning off course and amplify errors, so the search process needs to be more robust. Method: Proposes RE-Searcher, which explicitly sets a concrete search goal during search and then reflects on whether the retrieved evidence satisfies that goal, combining goal-oriented planning with self-reflection for robust search. Result: Experiments show that the method improves search accuracy and reaches state-of-the-art results; perturbation tests show strong resistance to noisy or misleading signals and substantially reduced fragility of the search process. Conclusion: Through goal setting and self-reflection, RE-Searcher effectively strengthens the search robustness of LLMs in complex environments and offers a practical approach to building more autonomous and reliable AI agents. Abstract: Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
[48] CEAID: Benchmark of Multilingual Machine-Generated Text Detection Methods for Central European Languages
Dominik Macko,Jakub Kopal
Main category: cs.CL
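TL;DR: This paper presents the first benchmark of machine-generated text detection methods for Central European languages. It evaluates detection in multi-domain, multi-generator, and multilingual settings, compares combinations of training languages, and finds that supervised fine-tuned detectors perform best in these languages and are the most robust to obfuscation attacks.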
Details
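Motivation: Research on machine-generated text detection focuses mainly on English, leaving detection for non-English languages (Central European languages in particular) severely lacking and cross-lingual transfer underexplored. Method: Builds a detection benchmark for Central European languages with a multi-domain, multi-generator, multilingual, and adversarial-robustness evaluation framework, and compares supervised fine-tuned detectors trained on different language combinations. Result: Supervised fine-tuned detectors perform best in Central European languages and resist text obfuscation attacks best; the choice of training languages affects detection quality, and cross-lingual transfer is limited. Conclusion: The work provides an effective benchmark for machine-generated text detection in Central European languages, shows that detectors fine-tuned for specific languages outperform general or cross-lingual approaches, and underlines the importance of region-specific detection solutions. Abstract: Research on machine-generated text detection, an important task, focuses predominantly on English. This makes existing detectors almost unusable for non-English languages, where they rely purely on cross-lingual transferability. Only a few works focus on any Central European language, leaving transferability to these languages largely unexplored. We fill this gap by providing the first benchmark of detection methods focused on this region, along with a comparison of training-language combinations to identify the best-performing ones. We focus on multi-domain, multi-generator, and multilingual evaluation, pinpointing the differences between individual aspects, as well as the adversarial robustness of detection methods. Supervised fine-tuned detectors in the Central European languages are found to be the most performant in these languages as well as the most resistant to obfuscation.
[49] DyFlow: Dynamic Workflow Framework for Agentic Reasoning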
Yanbo Wang,Zixiang Xu,Yue Huang,Xiangqi Wang,Zirui Song,Lang Gao,Chenxi Wang,Xiangru Tang,Yue Zhao,Arman Cohan,Xiangliang Zhang,Xiuying Chen
Main category: cs.CL
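TL;DR: Proposes DyFlow, an LLM-based dynamic workflow generation framework that adaptively builds and adjusts reasoning procedures according to task requirements and real-time feedback, improving cross-task generalization.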
Details
Motivation: Existing LLM-based agent systems depend on manually designed processes or predefined operations, lack flexibility and generality, and make poor use of intermediate feedback, which limits robustness and reasoning depth. Method: DyFlow has two core components. The designer decomposes complex problems into sub-goals and dynamically plans the next steps based on intermediate outputs and feedback. The executor carries out operations using dynamic operators with context-aware parameterization, enabling flexible and semantically coherent reasoning. Result: In experiments across social reasoning, biomedical tasks, mathematical problem solving, and code generation, DyFlow significantly outperforms existing baselines, with substantial Pass@k improvements and robust cross-domain generalization. Conclusion: By dynamically generating and adjusting reasoning workflows and combining them with context-aware execution, DyFlow effectively improves the flexibility, robustness, and cross-task generality of LLM agent systems. Abstract: Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains. The code is publicly available at https://github.com/wyf23187/DyFlow.
[50] The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Arash Marioriyad,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah
Main category: cs.CL
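TL;DR: The study finds that LLMs currently used as automatic judges are seriously biased: they rely on superficial cues in the prompt (such as the source and recency of a response) and almost never acknowledge these factors, which makes them unreliable as evaluators.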
Details
Motivation: To ensure that LLMs are reliable automatic judges, it is necessary to test whether their verdicts rest solely on response quality rather than on irrelevant cues in the prompt. Method: Constructs 100 pairwise judgment tasks from each of two datasets, ELI5 and LitBench, uses GPT-4o and Gemini-2.5-Flash as judges, injects provenance cues (Human, Expert, LLM, Unknown) and recency cues (Old, New), and analyzes their effect on verdicts and whether the generated justifications mention the cues. Result: Both models show a clear recency bias (favoring new responses) and a provenance hierarchy (Expert > Human > LLM > Unknown), more pronounced in GPT-4o and on the more subjective LitBench tasks; the models almost never mention these factors in their justifications. Conclusion: Current LLM-as-a-judge systems are prone to relying on shortcut cues and are unfaithful, which undermines their credibility in research and in deployment. Abstract: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses: provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
[51] Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
Leitian Tao,Xuefeng Du,Yixuan Li
Main category: cs.CL
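TL;DR: Proposes LENS, a new framework that synthesizes preference data directly in the latent embedding space of an LLM. A variational autoencoder learns a structured representation of response embeddings, and controlled perturbations in the latent space produce semantically consistent synthetic preference pairs, significantly improving the efficiency and performance of reward modeling.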
Details
Motivation: Existing text-based methods for synthesizing preference data are computationally expensive, which limits how well reward models can be aligned with human preferences, so a more efficient alternative is needed. Method: Uses a variational autoencoder (VAE) to learn a structured latent representation of LLM response embeddings, applies controlled perturbations in that latent space, and decodes back to the embedding space to generate diverse, semantically consistent synthetic preference pairs without costly text generation and annotation. Result: On standard benchmarks the method significantly outperforms text-based data augmentation, generates data 18x faster, and achieves better results with a model 16,000x smaller; theoretical guarantees show that the synthesized data preserves the original preference ordering and improves reward model generalization. Conclusion: LENS offers a scalable and efficient new paradigm for data augmentation in reward modeling, generating high-quality preference data quickly through latent-space operations. Abstract: Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework, LENS, for synthesizing preference data directly in the LLM's latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at https://github.com/deeplearning-wisc/lens
[52] IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
Johannes Schmitt,Gergely Bérczi,Jasper Dekoninck,Jeremy Feusi,Tim Gehrunger,Raphael Appenzeller,Jim Bryan,Niklas Canova,Timo de Wolff,Filippo Gaia,Michel van Garrel,Baran Hashemi,David Holmes,Aitor Iribar Lopez,Victor Jaeck,Martina Jørgensen,Steven Kelk,Stefan Kuhlmann,Adam Kurpisz,Chiara Meroni,Ingmar Metzler,Martin Möller,Samuel Muñoz-Echániz,Robert Nowak,Georg Oberdieck,Daniel Platt,Dylan Possamaï,Gabriel Ribeiro,Raúl Sánchez Galán,Zheming Sun,Josef Teichmann,Richard P. Thomas,Charles Vial
Main category: cs.CL
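TL;DR: This paper introduces IMProofBench, a research-level mathematical reasoning benchmark of 39 peer-reviewed problems designed by expert mathematicians. It evaluates LLMs on frontier mathematical tasks that require detailed proofs, combines automated grading with human evaluation, and tests models in a framework that simulates a real research environment.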
Details
Motivation: 现有数学评测基准局限于最终答案题或高中竞赛题,无法有效评估大语言模型在前沿数学研究任务中的推理能力,因此需要构建更贴近真实科研场景的评测基准。 Method: 构建了一个名为IMProofBench的私有基准,包含39个需详细证明的问题,每个问题配有可自动评分的子问题;采用代理框架,赋予模型网络搜索和数学软件(如SageMath)等工具,模拟真实研究环境,并通过人工评审与自动化评分相结合的方式进行评估。 Result: 实验显示当前大语言模型能在较简单的问题上取得一定成功,但在更具挑战性的问题上仍表现不佳;Grok-4在子问题准确率上达到52%,为最高;GPT-5在完整证明生成方面表现最佳,22%的问题能生成完全正确的解答。 Conclusion: IMProofBench填补了研究级数学推理评测的空白,能够更全面地评估大语言模型的数学能力,未来将与数学界合作持续更新,作为评估下一代模型的重要基准。 Abstract: As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. 
IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
[53] Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts
Xiaoyan Zhao
Main category: cs.CL
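TL;DR: Proposes Reinforced Strategy Optimization (RSO), a hierarchical framework that decomposes response generation into macro-level strategy planning and micro-level adaptation to improve the learning of interaction strategies in conversational recommender systems.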
Details
Motivation: Existing LLM-based conversational recommender systems lack explicit optimization of interaction strategies and typically rely on unified prompts, which leads to suboptimal results.
Method: A hierarchical framework with a Planner and an Actor: the Planner selects a strategy such as recommend, explain, or encourage, and the Actor generates the response under the guidance of preference and factual-grounding experts. Strategy learning is modeled as reinforcement learning with an LLM-based reward to cope with scarce multi-turn data.
Result: Experiments show that RSO outperforms state-of-the-art baselines on multiple metrics, validating hierarchical strategy optimization.
Conclusion: By decoupling strategy planning from response generation and optimizing strategies with reinforcement learning, RSO improves the performance of conversational recommender systems.
Abstract: Conversational Recommender Systems (CRSs) provide personalized recommendations through multi-turn interactions. With the strong reasoning abilities of Large Language Models (LLMs), applying them to CRSs has become promising. Yet, existing methods often lack explicit optimization of interaction strategies, relying instead on unified prompts, which can yield suboptimal outcomes. We propose Reinforced Strategy Optimization (RSO), a hierarchical framework that decomposes response generation into macro-level strategy planning and micro-level adaptation within a network-of-experts. A Planner selects strategies (e.g., recommend, explain, encourage), while an Actor generates responses guided by auxiliary experts for preferences and factual grounding. This disentanglement enables more tractable learning. To address limited multi-turn data, we model strategy learning as reinforcement learning with an LLM-based reward for exploration. Experiments show RSO outperforms state-of-the-art baselines, validating the effectiveness of hierarchical strategy optimization.
[54] End-to-End Aspect-Guided Review Summarization at Scale
Ilya Boytsov,Vinny DeGenova,Mikhail Balyasin,Joseph Walt,Caitlin Eusden,Marie-Claire Rochat,Margaret Pierson
Main category: cs.CL
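TL;DR: Proposes an LLM-based method that combines aspect-based sentiment analysis with guided summarization to generate concise, interpretable product review summaries, validates it with a large-scale A/B test, and releases a dataset of 11.8 million reviews.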
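The aspect-guided prompt construction this entry describes (count aspects, keep the most frequent ones, pick reviews for each, assemble a structured prompt) might look roughly like the sketch below. It is an illustration under assumptions, not the Wayfair system: the function name `build_summary_prompt`, the `(review_text, aspect, sentiment)` tuple layout, the prompt wording, and the "take the first reviews per aspect" rule standing in for representative sampling are all hypothetical.

```python
from collections import Counter, defaultdict


def build_summary_prompt(pairs, top_k=2, per_aspect=1):
    """Build a structured summarization prompt from aspect-sentiment pairs.

    pairs: list of (review_text, aspect, sentiment) tuples (assumed layout).
    Returns (top_aspects, prompt_text).
    """
    # Count how often each aspect is mentioned across reviews.
    counts = Counter(aspect for _, aspect, _ in pairs)
    top_aspects = [aspect for aspect, _ in counts.most_common(top_k)]

    # Keep the first `per_aspect` reviews per selected aspect
    # (a simplification of "sampling representative reviews").
    chosen = defaultdict(list)
    for text, aspect, sentiment in pairs:
        if aspect in top_aspects and len(chosen[aspect]) < per_aspect:
            chosen[aspect].append((sentiment, text))

    # Assemble the prompt, one section per selected aspect.
    lines = ["Summarize the customer reviews below, organized by aspect."]
    for aspect in top_aspects:
        lines.append(f"Aspect: {aspect}")
        for sentiment, text in chosen[aspect]:
            lines.append(f"- ({sentiment}) {text}")
    return top_aspects, "\n".join(lines)
```

The returned prompt text would then be passed to whatever LLM produces the summary; that call is left out because the entry does not specify it.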
Details
Motivation: To make product reviews on e-commerce platforms more readable and useful, concise summaries grounded in customer feedback need to be generated automatically so that users can grasp review content quickly.
Method: Aspect-based sentiment analysis (ABSA) extracts aspect-sentiment pairs from reviews; the pairs are consolidated, the most frequent aspects are selected, and representative reviews are sampled to build structured prompts that guide an LLM's summary generation.
Result: The system showed a significant effect in a large-scale online A/B test, produces high-quality, interpretable summaries, and has been deployed in production. A dataset of 11.8 million reviews covering 92,000 products was also released.
Conclusion: The method effectively combines ABSA with LLMs for product review summarization, scales well, and has practical value; the released dataset should help advance related research.
Abstract: We present a scalable large language model (LLM)-based system that combines aspect-based sentiment analysis (ABSA) with guided summarization to generate concise and interpretable product review summaries for the Wayfair platform. Our approach first extracts and consolidates aspect-sentiment pairs from individual reviews, selects the most frequent aspects for each product, and samples representative reviews accordingly. These are used to construct structured prompts that guide the LLM to produce summaries grounded in actual customer feedback. We demonstrate the real-world effectiveness of our system through a large-scale online A/B test. Furthermore, we describe our real-time deployment strategy and release a dataset of 11.8 million anonymized customer reviews covering 92,000 products, including extracted aspects and generated summaries, to support future research in aspect-guided review summarization.
[55] Vocabulary Customization for Efficient Domain-Specific LLM Deployment
Christian Herold,Michael Kozielski,Nicholas Santavas,Yannick Versley,Shahram Khadivi
Main category: cs.CL
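TL;DR: This paper addresses vocabulary mismatch when LLMs are used across domains by adding domain-specific tokens to the pretrained tokenizer, which significantly shortens input sequences and reduces inference latency without reducing tokenization efficiency.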
Details
Motivation: When an LLM is applied to text outside its training domain, a general-purpose tokenizer fails to capture domain-specific terms, which increases the token count and slows processing; this vocabulary mismatch needs to be addressed.
Method: An algorithm extends the vocabulary of a pretrained tokenizer with domain-specific sub-word units while guaranteeing that tokenization efficiency never decreases (the token count never exceeds the original).
Result: In real-world e-commerce scenarios, the augmented tokenizer shortens input sequences by up to 20% and lowers inference latency on downstream tasks while preserving predictive quality. Faster forward passes and effective adoption of the new tokens by the model were also observed.
Conclusion: Adapting the tokenizer vocabulary effectively mitigates vocabulary mismatch in cross-domain use and improves practical deployment without sacrificing efficiency or quality, indicating that vocabulary adaptation is broadly useful.
Abstract: When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and thus a decrease in processing speed due to suboptimal sub-word splits. We address this limitation by augmenting the pretrained vocabulary with a set of domain-specific tokens. To this end, we design an algorithm that extends an existing tokenizer while guaranteeing it never decreases tokenization efficiency: every input sequence is segmented into at most the same number of tokens as before. Evaluated on real-world e-Commerce use-cases, the augmented tokenizer significantly shortens input sequences by up to 20% and reduces inference latency on downstream tasks while preserving predictive quality. We further analyze secondary effects, such as the impact on forward pass speed and the rate at which the model adopts the newly introduced tokens, to illustrate the broader benefits of vocabulary adaptation.
[56] The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems
Xinbei Ma,Ruotian Ma,Xingyu Chen,Zhengliang Shi,Mengru Wang,Jen-tse Huang,Qu Yang,Wenxuan Wang,Fanghua Ye,Qingxuan Jiang,Mengfei Zhou,Zhuosheng Zhang,Rui Wang,Hai Zhao,Zhaopeng Tu,Xiaolong Li,Linus
Main category: cs.CL
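TL;DR: This paper studies "over-competition" in LLM-based multi-agent systems under competitive conditions. It proposes HATE (the Hunger Game Debate), a new experimental framework, and finds that in zero-sum settings competitive pressure triggers unreliable and harmful behaviors that damage collaboration and task performance. Objective, task-focused feedback effectively mitigates the problem, and the paper also evaluates and ranks mainstream LLMs on kindness.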
Details
Motivation: To examine how competition shapes agent behavior in multi-agent systems, especially the irrational and harmful behaviors that appear under extreme competition, and to fill a gap in the current understanding of the social dynamics of AI.
Method: The HATE (Hunger Game Debate) framework simulates multi-agent debate in a zero-sum competitive environment. Experiments across multiple LLMs and tasks analyze how competitive pressure affects behavior and performance, and variants with judges are introduced to study the role of environmental feedback.
Result: Competitive pressure significantly triggers over-competition behaviors and lowers task performance. Adding objective, task-focused feedback effectively suppresses these behaviors. A post-hoc analysis yields a kindness leaderboard of LLMs.
Conclusion: Over-competition undermines the collaborative effectiveness of multi-agent systems, while well-designed environmental feedback helps steer interaction in a healthy direction. The study offers insights for understanding and governing social dynamics among AI agents.
Abstract: LLM-based multi-agent systems demonstrate great potential for tackling complex problems, but how competition shapes their behavior remains underexplored. This paper investigates the over-competition in multi-agent debate, where agents under extreme pressure exhibit unreliable, harmful behaviors that undermine both collaboration and task performance. To study this phenomenon, we propose HATE, the Hunger Game Debate, a novel experimental framework that simulates debates under a zero-sum competition arena. Our experiments, conducted across a range of LLMs and tasks, reveal that competitive pressure significantly stimulates over-competition behaviors and degrades task performance, causing discussions to derail. We further explore the impact of environmental feedback by adding variants of judges, indicating that objective, task-focused feedback effectively mitigates the over-competition behaviors. We also probe the post-hoc kindness of LLMs and form a leaderboard to characterize top LLMs, providing insights for understanding and governing the emergent social dynamics of AI community.
[57] CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
Paul Grundmann,Dennis Fast,Jan Frick,Thomas Steffek,Felix Gers,Wolfgang Nejdl,Alexander Löser
Main category: cs.CL
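TL;DR: CliniBench is the first benchmark for predicting discharge diagnoses from admission notes in the MIMIC-IV dataset. It compares generative LLMs with encoder-based classifiers, finds that the encoder models perform better, and evaluates how much retrieval augmentation improves generative models.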
Details
Motivation: To examine how effective generative LLMs are in real-world clinical applications, particularly for discharge diagnosis prediction, and to fill the gap left by the lack of a unified evaluation benchmark.
Method: The authors build the CliniBench benchmark, compare 12 generative LLMs and 3 encoder-based classifiers on diagnosis prediction over MIMIC-IV, and evaluate how several retrieval augmentation strategies affect in-context learning for generative models.
Result: Encoder-based classifiers consistently outperform generative LLMs on diagnosis prediction; retrieval augmentation notably improves the generative models.
Conclusion: Although generative LLMs are promising, conventional encoder-based classifiers still hold the advantage on this diagnosis prediction task, and retrieval augmentation is an effective way to improve generative models.
Abstract: With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
[58] MGen: Millions of Naturally Occurring Generics in Context
Gustavo Cilleruelo,Emily Allaway,Barry Haddow,Alexandra Birch
Main category: cs.CL
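TL;DR: MGen is currently the largest and most diverse dataset of naturally occurring generic sentences, with over 4 million sentences, and supports large-scale computational research.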
Details
Motivation: Large-scale computational research on genericity requires a large, diverse dataset of natural language.
Method: Over 4 million generic and quantified sentences, each with its context document, are extracted from diverse textual sources, covering 11 different quantifiers, and their linguistic features are analyzed.
Result: MGen is the largest and most diverse dataset of generic sentences to date; its generics average over 16 words and are often used to express generalisations about people.
Conclusion: MGen is an important resource for computational and linguistic research on genericity and should help advance the field.
Abstract: MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generics sentences in the dataset, with interesting insights: generics can be long sentences (averaging over 16 words) and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
[59] Explaining novel senses using definition generation with open language models
Mariia Fedorova,Andrey Kutuzov,Francesco Periti,Yves Scherrer
Main category: cs.CL
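TL;DR: This study builds definition generators on open-weights LLMs to produce explanations of novel word senses and evaluates them on the AXOLOTL'24 shared task datasets, which cover Finnish, Russian, and German. The fine-tuned open models outperform the best shared-task submissions, which used closed proprietary LLMs, and encoder-decoder definition generators perform on par with decoder-only ones.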
Details
Motivation: To explore the potential of open LLMs for explaining semantic change, in particular for providing interpretable definitions of novel senses in lower-resource languages.
Method: Using the multilingual datasets of the AXOLOTL'24 shared task, open encoder-decoder and decoder-only LLMs are fine-tuned to generate definitions from usages of a target word.
Result: The fine-tuned open models outperform the best shared-task systems, which used closed models; encoder-decoder models perform on par with decoder-only models.
Conclusion: Fine-tuned open LLMs can match or exceed closed models on novel sense explanation, with no notable difference between architectures, offering an efficient and open solution for explainable semantic change research.
Abstract: We apply definition generators based on open-weights large language models to the task of creating explanations of novel senses, taking target word usages as an input. To this end, we employ the datasets from the AXOLOTL'24 shared task on explainable semantic change modeling, which features Finnish, Russian and German languages. We fine-tune and provide publicly the open-source models performing higher than the best submissions of the aforementioned shared task, which employed closed proprietary LLMs. In addition, we find that encoder-decoder definition generators perform on par with their decoder-only counterparts.
[60] VietBinoculars: A Zero-Shot Approach for Detecting Vietnamese LLM-Generated Text
Trieu Hai Nguyen,Sivaswamy Akilesh
Main category: cs.CL
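TL;DR: This study proposes VietBinoculars, a method for detecting Vietnamese LLM-generated text. By optimizing global thresholds, it achieves over 99% accuracy, F1-score, and AUC across multiple domains, outperforming existing methods.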
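Choosing a global decision threshold for a score-based detector can be sketched as below. This is a generic illustration under assumptions, not the VietBinoculars procedure: it assumes a detector score has already been computed for each text, that lower scores indicate LLM-generated text (the convention usually described for Binoculars), and that accuracy is the selection criterion. The function names are hypothetical.

```python
def accuracy_at(threshold, scores, labels):
    """Accuracy when texts with score < threshold are predicted LLM-generated (label 1)."""
    preds = [1 if s < threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_global_threshold(scores, labels):
    """Search candidate cut points and return (threshold, accuracy) with the best accuracy."""
    distinct = sorted(set(scores))
    # Candidate cuts: midpoints between neighbouring scores, plus one cut on each side.
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts += [distinct[0] - 1.0, distinct[-1] + 1.0]
    best = max(cuts, key=lambda t: accuracy_at(t, scores, labels))
    return best, accuracy_at(best, scores, labels)
```

A threshold chosen this way on a held-out labelled set would then be applied unchanged to new texts; selecting by F1 instead of accuracy only requires swapping the scoring function.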
Details
Motivation: As LLM-generated text comes to resemble human writing more closely, traditional detection methods become less effective, and better detection methods are urgently needed.
Method: The Binoculars method is adapted by optimizing global thresholds, and new Vietnamese AI-generated datasets are constructed to determine the best thresholds and to support benchmarking.
Result: VietBinoculars exceeds 99% accuracy, F1-score, and AUC on multiple out-of-domain datasets, clearly outperforming the original Binoculars model, traditional methods, and commercial tools such as ZeroGPT, especially under specially modified prompting strategies.
Conclusion: VietBinoculars is an efficient and robust detector of Vietnamese LLM-generated text with good cross-domain generalization, offering an effective response to increasingly sophisticated AI-generated content.
Abstract: The rapid development of Large Language Models (LLMs) based on transformer architectures raises key challenges, one of them being the task of distinguishing between human-written text and LLM-generated text. As LLM-generated textual content becomes increasingly complex over time and resembles human writing, traditional detection methods are proving less effective, especially as the number and diversity of LLMs continue to grow with new models and versions being released at a rapid pace. This study proposes VietBinoculars, an adaptation of the Binoculars method with optimized global thresholds, to enhance the detection of Vietnamese LLM-generated text. We have constructed new Vietnamese AI-generated datasets to determine the optimal thresholds for VietBinoculars and to enable benchmarking. The results from our experiments show that VietBinoculars achieves over 99% accuracy, F1-score, and AUC in both domains on multiple out-of-domain datasets. It outperforms the original Binoculars model, traditional detection methods, and other state-of-the-art approaches, including commercial tools such as ZeroGPT and DetectGPT, especially under specially modified prompting strategies.
[61] Comparative Analysis of Ant Colony Optimization and Google OR-Tools for Solving the Open Capacitated Vehicle Routing Problem in Logistics
Assem Omar,Youssef Omar,Marwa Solayman,Hesham Mansour
Main category: cs.CL
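TL;DR: This study compares Ant Colony Optimization (ACO) and Google OR-Tools on the Open Capacitated Vehicle Routing Problem (OCVRP), evaluating routing efficiency, computation time, and scalability.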
Details
Motivation: To improve route planning efficiency in logistics management systems and meet real-time and scalability requirements.
Method: ACO and OR-Tools solutions are implemented in Python and compared experimentally on a custom dataset.
Result: ACO is more flexible in its routing parameters, while OR-Tools is better in computation speed, consistency, and input requirements.
Conclusion: OR-Tools is better suited to real-time logistics systems that need efficiency, stability, and little configuration, while ACO suits scenarios that require flexible tuning.
Abstract: In modern logistics management systems, route planning requires high efficiency. The Open Capacitated Vehicle Routing Problem (OCVRP) deals with finding optimal delivery routes for a fleet of vehicles serving geographically distributed customers, without requiring the vehicles to return to the depot after deliveries. The present study is comparative in nature and examines two algorithms for solving the OCVRP: Ant Colony Optimization (ACO), a nature-inspired metaheuristic; and Google OR-Tools, an industry-standard toolkit for optimization. Both implementations were developed in Python and use a custom dataset. Performance appraisal was based on routing efficiency, computation time, and scalability. The results show that ACO allows flexibility in routing parameters while OR-Tools runs much faster with more consistency and requires less input. This could help choose among routing strategies for scalable real-time logistics systems.
[62] Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models
Alessandro De Bellis,Salvatore Bufi,Giovanni Servedio,Vito Walter Anelli,Tommaso Di Noia,Eugenio Di Sciascio
Main category: cs.CL
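TL;DR: Proposes TyleR, which uses pretrained language models for subgraph-based inductive link prediction when explicit type information is missing, significantly outperforming existing methods.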
Details
Motivation: Type information in existing knowledge graphs is often missing, incomplete, or coarse, which makes effective link prediction for new entities difficult.
Method: Pretrained language models (PLMs) extract implicit type signals from node features to enrich node representations, which are combined with subgraph structure for inductive link prediction.
Result: Experiments on standard benchmarks show that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity.
Conclusion: Through semantic enrichment, TyleR compensates for the lack of explicit type information and provides an effective inductive link prediction approach for knowledge graphs without type annotations.
Abstract: Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at https://github.com/sisinflab/tyler
[63] Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing
Yang Tang,Ruijie Liu,Yifan Wang,Shiyu Li,Xi Chen
Main category: cs.CL
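TL;DR: Proposes Dynamic Boosted Annealing (DBA), an efficient fine-tuning method that achieves strong general and domain performance using only domain data while substantially reducing GPU training time.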
Details
Motivation: Conventional fine-tuning needs intricate data mixtures and repeated experiments to generalize well, which is tedious and resource-intensive.
Method: A global gradient is obtained through zero-learning-rate training on general data and then used for gradient boosting and dynamic step correction during domain training. Combined with an annealed learning rate, this gives a fine-tuning pipeline that does not require general data.
Result: Across multiple tasks and base models, DBA improves joint performance by 5.8% on average over vanilla fine-tuning and cuts GPU hours by 91.0%, while removing the dependence on data mixing.
Conclusion: DBA is an efficient, general fine-tuning method that simplifies the training pipeline, improves performance, and substantially reduces compute.
Abstract: Large language models (LLMs) fine-tuning shows excellent implications. However, vanilla fine-tuning methods often require intricate data mixture and repeated experiments for optimal generalization. To address these challenges and streamline the training process, we propose an efficient and universal solution, Dynamic Boosted Annealing (DBA). We obtain a global gradient through zero-learning-rate training on general data, which is subsequently employed for gradient boosting and dynamic training step correction during domain training. In conjunction with annealing learning, we end up establishing a fine-tuning pipeline that relies solely on domain data without collapse. By evaluating both general and domain-specific performance across multiple tasks on several popular base models, DBA achieves an average improvement of 5.8% in joint performance over vanilla fine-tuning. Furthermore, since general data is no longer involved in annealing, repeated experiments led by data mixture are also eliminated. According to our tests, the DBA method can reduce GPU hours by 91.0% compared to the vanilla method.
[64] Optimizing Speech Language Models for Acoustic Consistency
Morteza Rohanian,Michael Krauthammer
Main category: cs.CL
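TL;DR: This paper studies speech language models that combine semantic initialization with planning losses for robust, consistent generation. Speech tokens are initialized from self-supervised features, and training adds alignment, thinning, and auxiliary objectives, balancing acoustic stability and semantic grounding without modifying the tokenizer or runtime architecture.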
Details
Motivation: To improve the generation consistency and semantic coherence of speech language models across speaker, gender, sentiment, and similar conditions, using training strategies that do not depend on changes to the tokenizer or architecture.
Method: Speech tokens are initialized with self-supervised features, a light alignment loss is applied, and training uses thinning and auxiliary objectives. Three models are built: 0.7B and 1.0B speech-only models and a 1.0B interleaved text-speech model.
Result: The speech-only models achieve the highest consistency across speaker, gender, sentiment, and environment factors, surpassing larger systems. The interleaved model improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that the initialization biases the model toward content structure at the cost of prosody detail.
Conclusion: The combination of model design and training objectives can control the trade-off between acoustic stability and semantic grounding without changing the tokenizer or inference architecture.
Abstract: We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic-acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
[65] QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization
Mohamed Imed Eddine Ghebriout,Gaël Guibon,Ivan Lerner,Emmanuel Vincent
Main category: cs.CL
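TL;DR: This paper proposes QUARTZ, a task-utility-based dialogue summarization framework. It generates multiple summaries and question-answer pairs zero-shot, uses LLMs to assess summary quality, and finally fine-tunes the best model, achieving results on multiple datasets that rival fully supervised state-of-the-art methods.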
Details
Motivation: Existing dialogue summarization methods rely on human annotation for supervised training, which is costly, and the resulting summaries lack a focus on specific downstream tasks (such as medical tasks), limiting their practical effectiveness.
Method: QUARTZ first uses several LLMs to generate, zero-shot, multiple summaries and task-related question-answer pairs. It then evaluates summary quality by having LLMs answer the task questions, selects the best answers, and identifies the most informative summary. Finally, the best LLM is fine-tuned on the selected summaries.
Result: QUARTZ is validated on multiple datasets, performs well across zero-shot settings, and is competitive with fully supervised state-of-the-art methods.
Conclusion: This task-oriented, utility-driven summarization framework makes dialogue summaries more useful for downstream tasks, reduces reliance on human annotation, and shows the potential of zero-shot methods for summarization.
Abstract: Dialogue summarization aims to distill the core meaning of a conversation into a concise text. This is crucial for reducing the complexity and noise inherent in dialogue-heavy applications. While recent approaches typically train language models to mimic human-written summaries, such supervision is costly and often results in outputs that lack task-specific focus, limiting their effectiveness in downstream applications, such as medical tasks. In this paper, we propose QUARTZ, a framework for task-oriented utility-based dialogue summarization. QUARTZ starts by generating multiple summaries and task-oriented question-answer pairs from a dialogue in a zero-shot manner using a pool of large language models (LLMs). The quality of the generated summaries is evaluated by having LLMs answer task-related questions before (i) selecting the best candidate answers and (ii) identifying the most informative summary based on these answers. Finally, we fine-tune the best LLM on the selected summaries. When validated on multiple datasets, QUARTZ demonstrates its effectiveness by achieving competitive results in various zero-shot settings, rivaling fully-supervised State-of-the-Art (SotA) methods.
[66] Feedback Forensics: A Toolkit to Measure AI Personality
Arduin Findeis,Timo Kaufmann,Eyke Hüllermeier,Robert Mullins
Main category: cs.CL
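TL;DR: This paper presents Feedback Forensics, an open-source toolkit for tracking changes in AI model personality. Using AI annotators, it provides a Python API and a browser app for analyzing the personality traits encouraged in human feedback datasets and exhibited by popular models.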
Details
Motivation: Some traits of AI models, such as personality, are hard to define up front, so conventional automatic evaluation struggles to measure them, while human-feedback-based evaluation suffers from opacity and overfitting. An open, interpretable tool for explicitly evaluating model personality is therefore needed.
Method: The authors develop Feedback Forensics, an open-source toolkit that uses AI annotators and supports, via a Python API and a web app, analysis of the personality traits encouraged in feedback datasets (such as Chatbot Arena, MultiPref, and PRISM) and of how strongly mainstream models exhibit those traits.
Result: The toolkit reveals the personality traits encouraged in mainstream human feedback datasets and analyzes how several popular AI models exhibit them; a web app and annotation data are provided to support further research.
Conclusion: Feedback Forensics offers an effective tool for transparent, systematic evaluation of AI model personality, helping to mitigate overfitting and personality bias in feedback-driven training and to support more responsible AI development.
Abstract: Some traits making a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback such as Chatbot Arena have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback, and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via Python API and browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then use our toolkit to analyse how much popular models exhibit such traits.
We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at https://github.com/rdnfn/feedback-forensics
[67] One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Rui Ming,Haoyuan Wu,Shoubo Hu,Zhuolun He,Bei Yu
Main category: cs.CL
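TL;DR: This paper proposes one-token rollout (OTR), a fine-tuning algorithm that applies the policy gradient method at the token level, turning static supervised data into a dynamic on-policy signal and notably improving LLM generalization on tasks such as mathematical reasoning and code generation.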
Details
Motivation: Supervised fine-tuning (SFT) usually generalizes worse than reinforcement learning (RL). The authors argue that this is due not only to the loss function but, more fundamentally, to SFT learning from fixed offline data while RL uses on-policy data generated by the current policy. They therefore explore how to give SFT the benefits of on-policy learning.
Method: At each token generation step, OTR samples multiple candidate tokens from the current policy distribution as a one-step Monte Carlo rollout, uses the ground-truth token from the supervised data to assign a reward to these samples, and updates the model with policy gradient, giving on-policy training at the token level.
Result: On multiple challenging benchmarks (mathematical reasoning, code generation, and general-domain reasoning), OTR consistently outperforms standard SFT.
Conclusion: OTR is an efficient and practical fine-tuning method for LLMs; the results indicate that the on-policy nature of data plays a key role in generalization and point to a new direction for LLM fine-tuning.
Abstract: Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT.
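The per-token update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the reward values (1 for a sampled token equal to the ground-truth token, 0 otherwise), the function name `otr_token_loss`, the sample count, and the use of plain Python lists instead of a tensor library are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def otr_token_loss(logits, target, num_samples=8, rng=None):
    """Policy-gradient surrogate loss for a single token position.

    Samples candidate tokens from the current policy, rewards the ones that
    match the ground-truth token (assumed reward: 1 or 0), and returns
    -mean(reward * log pi(sample)).
    """
    rng = rng or random.Random(0)
    probs = softmax(logits)
    # One-step Monte Carlo "rollout": draw candidate tokens from the policy.
    samples = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    rewards = [1.0 if s == target else 0.0 for s in samples]
    # REINFORCE-style objective; in a real framework the gradient would flow
    # through the log-probabilities, here only the scalar value is computed.
    weighted = sum(r * math.log(probs[s]) for r, s in zip(rewards, samples))
    return -weighted / num_samples
```

Under this 0/1 reward, positions where the policy never samples the ground-truth token contribute a loss of zero, which is one way such a signal differs from plain cross-entropy.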
Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.
[68] Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in its Latent Thoughts
Hanwen Du,Yuxin Dong,Xia Ning
Main category: cs.CL
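TL;DR: This paper proposes a method for optimizing an LLM's thinking process in latent space. It introduces a Latent Reward Model (LRM) to detect and correct erroneous latent thinking patterns, improving performance across a range of reasoning tasks.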
Details
Motivation: Existing LLMs reason through natural-language chains of thought, which is computationally costly and prone to overthinking. Emerging latent thinking is efficient but lacks interpretability and is hard to supervise, so a systematic study of how to supervise and optimize latent thinking is needed.
Method: The authors analyze the latent thinking patterns of Huggin-3.5B, find that latent representations leading to correct and incorrect answers are distinguishable, and train a latent classifier to serve as a Latent Reward Model (LRM). On top of the LRM they propose Latent Thinking Optimization (LTO), an algorithm that optimizes the latent representations used during reasoning.
Result: Experiments show that the LRM effectively identifies incorrect latent thinking patterns and that LTO significantly improves performance across reasoning tasks. The LRM generalizes across domains, and LTO can be applied seamlessly to general LLMs.
Conclusion: Supervised reward modeling and test-time scaling of thinking can be carried out directly in latent space; latent thinking optimization is a general, efficient, domain-agnostic way to improve LLM reasoning.
Abstract: Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture Huggin-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of its latent thinking processes. In this paper, we provide a systematic study of how Huggin-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that LRM is highly effective in detecting incorrect latent thinking patterns, and LTO can significantly improve the latent thinking processes. Furthermore, we show that LRM can generalize across diverse domains, and LTO can be seamlessly applied to general LLMs to improve their thinking processes.
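To make the idea of using a latent classifier as a reward model concrete, here is a small sketch. It is not the LTO algorithm: the abstract does not give its details, so the example swaps in plain greedy best-of-N selection. The logistic classifier, its weights, the pooling of each latent trajectory into a single vector, and both function names are assumptions for illustration.

```python
import math


def latent_reward(latent, weights, bias=0.0):
    """Predicted probability that a (pooled) latent thought leads to a correct answer.

    Uses a logistic classifier with hypothetical, already-trained weights.
    """
    z = sum(w * x for w, x in zip(weights, latent)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def select_latent(candidates, weights, bias=0.0):
    """Score N sampled latent thoughts and return (index of the best, all rewards)."""
    rewards = [latent_reward(c, weights, bias) for c in candidates]
    best = max(range(len(candidates)), key=rewards.__getitem__)
    return best, rewards
```

In use, several latent trajectories would be sampled for the same question and the answer decoded from the highest-reward one; a probabilistic variant could instead resample trajectories in proportion to their rewards.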
In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
[69] Fast-dLLM v2: Efficient Block-Diffusion LLM
Chengyue Wu,Hao Zhang,Shuchen Xue,Shizhe Diao,Yonggan Fu,Zhijian Liu,Pavlo Molchanov,Ping Luo,Song Han,Enze Xie
Main category: cs.CL
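TL;DR: Fast-dLLM v2 is an efficient block diffusion language model that adapts pretrained autoregressive models into diffusion models supporting parallel text generation. It needs only about 1B tokens of fine-tuning, a 500x reduction in training data compared with full-attention diffusion models, and achieves up to 2.5x faster decoding while preserving performance.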
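The blockwise bidirectional context modeling mentioned in this entry's abstract can be pictured with a block-causal attention mask: tokens see every token in their own block and in all earlier blocks, but nothing in later blocks. The sketch below shows only that generic pattern. It is an assumption about the general shape of such a mask, not the paper's complementary attention mask, and the function name is hypothetical.

```python
def block_attention_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    Attention is bidirectional inside a block and causal across blocks:
    position j is visible to i if j's block index is not later than i's.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `seq_len=4` and `block_size=2`, position 0 can attend to position 1 (same block, even though it comes later) but not to position 2, which belongs to the next block.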
Details
Motivation: Sequential decoding limits the inference efficiency of autoregressive language models, while existing diffusion language models need large amounts of training data; an efficient, low-resource adaptation method is lacking.
Method: Fast-dLLM v2 combines a block diffusion mechanism with a complementary attention mask while retaining the autoregressive training objective. A hierarchical caching mechanism (a block-level cache and a sub-block cache) supports parallel generation, together with a parallel decoding pipeline.
Result: Across multiple benchmarks, Fast-dLLM v2 matches or exceeds autoregressive models in accuracy and decodes up to 2.5x faster, with only about 1B tokens of training data compared with 580B tokens for models such as Dream.
Conclusion: Fast-dLLM v2 substantially improves inference efficiency while preserving generation quality, advancing the practical deployment of fast and accurate LLMs and marking an important step toward practical diffusion language models.
Abstract: Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
[70] Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Jinyeop Song,Song Wang,Julian Shun,Yada Zhu
Main category: cs.CL
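TL;DR: KG-R1 is a single-agent knowledge-graph retrieval-augmented generation framework based on reinforcement learning. Trained end to end, it delivers efficient, transferable question answering, reduces hallucinations, and supports plug-and-play use with new knowledge graphs.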
Details
Motivation: Existing KG-RAG systems rely on multiple LLM modules, which raises inference cost and makes it hard to generalize to different knowledge graphs; a more efficient and flexible framework is needed.
Method: The KG-R1 framework treats the knowledge graph as the environment and uses a single agent that retrieves and reasons step by step via reinforcement learning, optimized end to end.
Result: On KGQA benchmarks, KG-R1 with Qwen-2.5-3B outperforms multi-module methods that use larger models while generating fewer tokens, and it maintains high accuracy when transferred to new knowledge graphs without modification.
Conclusion: KG-R1 is efficient, transferable, and plug-and-play, making it a KG-RAG framework suited to real-world deployment.
Abstract: Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1
[71] An Annotation Scheme for Factuality and its Application to Parliamentary Proceedings
Gili Goldin,Shira Wigderson,Ella Rabinovich,Shuly Wintner
Main category: cs.CL
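TL;DR: This paper presents a complex, multi-faceted annotation scheme for factuality, applies it manually to almost 5,000 Hebrew sentences from parliamentary discourse, and explores methods for automatically predicting some of the scheme's features.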
Details
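Motivation: Factuality, the degree to which an utterance corresponds to real-world information, is central to fact checking. It depends on many linguistic signals, and existing research is scattered across disciplines without a unified, comprehensive annotation scheme. The authors therefore aim to build a complex factuality annotation framework that integrates concepts from multiple sources.
Method: Combines concepts from a range of prior work into a multi-faceted factuality annotation scheme; manually annotates almost 5,000 Hebrew sentences of parliamentary discourse with it; reports inter-annotator agreement; and tries several approaches to automatically predicting the annotated features so the annotation can be extended to a large corpus.
Result: A multi-dimensional factuality annotation scheme for Hebrew, a manually annotated dataset, evidence of annotation feasibility in the form of inter-annotator agreement, and an exploration of automatically extending the annotation.
Conclusion: The proposed multi-faceted scheme is systematic and extensible; it was developed for Hebrew but can be adapted to other languages, and it provides resources and a methodological basis for factuality analysis and automatic fact checking.
Abstract: Factuality assesses the extent to which a language utterance relates to real-world information; it determines whether utterances correspond to facts, possibilities, or imaginary situations, and as such, it is instrumental for fact checking. Factuality is a complex notion that relies on multiple linguistic signals, and has been studied in various disciplines. We present a complex, multi-faceted annotation scheme of factuality that combines concepts from a variety of previous works. We developed the scheme for Hebrew, but we trust that it can be adapted to other languages. We also present a set of almost 5,000 sentences in the domain of parliamentary discourse that we manually annotated according to this scheme. We report on inter-annotator agreement, and experiment with various approaches to automatically predict (some features of) the scheme, in order to extend the annotation to a large corpus.
[72] Automatic Fact-checking in English and Telugu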
Ravi Kiran Chikkala,Tatiana Anikina,Natalia Skachkova,Ivan Vykopal,Rodrigo Agerri,Josef van Genabith
Main category: cs.CL
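TL;DR: This study examines how effectively large language models (LLMs) classify the veracity of factual claims and generate justifications in English and Telugu. Its contributions are a bilingual dataset and a benchmark of LLM-based veracity classification approaches.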
Details
Motivation: False information is a global challenge, and manual verification is time-consuming and resource-intensive, so automated tools are needed to improve efficiency.
Method: Experiments with different LLM-based approaches on a newly created bilingual English-Telugu dataset, covering veracity classification and justification generation.
Result: A bilingual dataset was built, and several LLM-based classification approaches were benchmarked and evaluated in both languages.
Conclusion: LLMs show potential for multilingual fact checking, and the dataset and benchmark provide a basis for future research.
Abstract: False information poses a significant global challenge, and manually verifying claims is a time-consuming and resource-intensive process. In this research paper, we experiment with different approaches to investigate the effectiveness of large language models (LLMs) in classifying factual claims by their veracity and generating justifications in English and Telugu. The key contributions of this work include the creation of a bilingual English-Telugu dataset and the benchmarking of different veracity classification approaches based on LLMs.
[73] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
Yanbin Fu,Hong Jiao,Tianyi Zhou,Robert W. Lissitz,Nan Zhang,Ming Li,Qingshu Xu,Sydney Peters
Main category: cs.CL
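TL;DR: This study examines fine-tuned small language models (SLMs) for automatically aligning items to content standards in a standardized reading and writing test for college admissions. Including more item text substantially improves performance, and fine-tuned SLMs beat embedding-based supervised models at skill-level alignment.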
Details
Motivation: Manual item alignment is subjective and time-consuming, so automated methods are worth exploring to improve efficiency and consistency.
Method: Using data from a large-scale standardized test, trains several small language models for alignment at the domain level and at the skill level, and evaluates the effect of input data type and size; also trains supervised models on multilingual-E5-large-instruct embeddings for comparison.
Result: Adding item text data substantially improves performance, and fine-tuned SLMs outperform the embedding-based supervised models on fine-grained skill alignment; semantic similarity analysis shows that some SAT/PSAT skills are semantically close, which leads to misclassification.
Conclusion: Fine-tuned small language models are effective for automated item alignment, especially at the fine-grained skill level, but semantically close skills remain a challenge.
Abstract: Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at the domain level and the skill level, with 10 skills mapped to 4 content domains. Model performance was evaluated on multiple criteria on two testing datasets. The impact of the types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analyses, including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimensional projections of item embeddings, were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.
[74] Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Sangwon Ryu,Heejin Do,Yunsu Kim,Gary Geunbae Lee,Jungseul Ok
Main category: cs.CL
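TL;DR: This paper proposes PACO, a training-free framework for multi-attribute controllable summarization that uses a customized Monte Carlo Tree Search to adaptively plan the order of attribute control, markedly improving controllability and consistency across multiple correlated attributes.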
Details
Motivation: Existing controllable summarization methods struggle with interdependent attributes and usually require per-attribute fine-tuning, which limits flexibility. A training-free framework that adapts to multi-attribute constraints is therefore needed.
Method: Recasts multi-attribute controllable summarization as a sequential control planning problem and uses a customized Monte Carlo Tree Search (MCTS) to explore control orders; nodes represent summaries, actions represent single-attribute adjustments, and only the attributes not yet satisfied are refined step by step.
Result: Experiments across multiple domains and models show that PACO outperforms LLM-based self-planning methods and fine-tuned baselines; PACO with Llama-3.2-1B approaches the Llama-3.3-70B baselines, and it does better still with larger models.
Conclusion: PACO is an efficient, flexible, training-free framework for multi-attribute controllable summarization that adaptively discovers good control orders, improving the robustness of multi-attribute control while maintaining generation quality.
Abstract: Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
[75] CreAgentive: An Agent Workflow Driven Multi-Category Creative Generation Engine
Yuyang Cheng,Linyue Cai,Changwei Peng,Yumiao Xu,Rongfang Bie,Yong Zhao
Main category: cs.CL
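TL;DR: CreAgentive is an agent-workflow-driven, multi-category creative generation engine. Using a Story Prototype and a three-stage agent workflow, it addresses four limitations of current LLMs in creative writing: restricted genre diversity, limited output length, weak narrative coherence, and difficulty enforcing complex structures.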
Details
Motivation: In long-form creative writing, current LLMs suffer from narrow genre coverage, incoherent narratives, poor structural control, and limited generation length, and so fall short of the needs of high-quality creative content generation.
Method: Proposes CreAgentive, which uses a knowledge-graph-based "Story Prototype" representation to decouple narrative logic from stylistic realization; a three-stage agent workflow (initialization, generation, and writing) uses multi-agent dialogue and long- and short-term objectives to produce multi-genre text with complex structures such as retrospection and foreshadowing.
Result: Experiments show that CreAgentive stably generates thousands of chapters at a cost below $1 per 100 chapters; under a two-dimensional evaluation framework with 10 narrative indicators, it clearly outperforms strong baselines in both quality and length and approaches the quality of human-authored novels across genres.
Conclusion: Through structured decoupling and agent collaboration, CreAgentive improves LLM performance on long-form, multi-genre creative writing and offers a scalable, low-cost route to high-quality narrative generation.
Abstract: We present CreAgentive, an agent workflow driven multi-category creative generation engine that addresses four key limitations of contemporary large language models in writing stories, drama, and other creative categories: restricted genre diversity, insufficient output length, weak narrative coherence, and inability to enforce complex structural constructs. At its core, CreAgentive employs a Story Prototype, which is a genre-agnostic, knowledge graph-based narrative representation that decouples story logic from stylistic realization by encoding characters, events, and environments as semantic triples. CreAgentive engages a three-stage agent workflow that comprises: an Initialization Stage that constructs a user-specified narrative skeleton; a Generation Stage in which long- and short-term objectives guide multi-agent dialogues to instantiate the Story Prototype; and a Writing Stage that leverages this prototype to produce multi-genre text with advanced structures such as retrospection and foreshadowing. This architecture reduces storage redundancy and overcomes the typical bottlenecks of long-form generation. In extensive experiments, CreAgentive generates thousands of chapters with stable quality and low cost (less than $1 per 100 chapters) using a general-purpose backbone model. To evaluate performance, we define a two-dimensional framework with 10 narrative indicators measuring both quality and length. Results show that CreAgentive consistently outperforms strong baselines and achieves robust performance across diverse genres, approaching the quality of human-authored novels.
[76] Regression Language Models for Code
Yash Akhauri,Xingyou Song,Arissa Wongpanich,Bryan Lewandowski,Mohamed S. Abdelfattah
Main category: cs.CL
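TL;DR: Proposes a unified Regression Language Model (RLM) that predicts several execution metrics, such as memory footprint, latency, and neural network performance, directly from code text, with strong results on multiple benchmarks.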
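The results in this entry are reported as rank correlations (Spearman and Kendall-Tau) between predicted and measured metrics. As a minimal sketch of what those scores measure, assuming no tied values, the snippet below computes both from scratch. The function names and the memory figures are hypothetical illustrations, not code or data from the paper.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(pred, true):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(true)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(pred, true):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (true[i] - true[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical measured vs. predicted memory footprints (MB) for five programs.
measured_mb = [12.0, 48.5, 30.2, 75.1, 20.3]
predicted_mb = [15.0, 40.0, 33.0, 70.0, 35.0]

rho = spearman(predicted_mb, measured_mb)
tau = kendall_tau(predicted_mb, measured_mb)
print(f"Spearman: {rho:.2f}, Kendall tau: {tau:.2f}")
```

Both scores depend only on how the predictions order the programs, not on the absolute values, which is why a regression model can score well on them even when its raw predictions are offset from the measurements.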
Details
Motivation: Existing methods rely on heavy, domain-specific feature engineering and generalize poorly; the goal is a single model that predicts code execution outcomes across languages and tasks.
Method: Initializes a 300M-parameter Regression Language Model (RLM) from T5Gemma and performs multi-task regression directly from code text, covering different programming languages and execution metrics.
Result: Spearman rank correlation above 0.9 on APPS and above 0.5 on average across 17 languages from CodeNet; a Kendall-Tau of 0.46 on five classic NAS design spaces, better than graph neural networks; the model also predicts architecture latencies across hardware platforms.
Conclusion: A single unified regression language model can predict a range of code execution metrics, reduces reliance on manual feature engineering, and generalizes well across languages and tasks.
Abstract: We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM initialized from T5Gemma obtains > 0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves > 0.5 average Spearman-rank across 17 separate languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.
[77] dParallel: Learnable Parallel Decoding for dLLMs
Zigeng Chen,Gongfan Fang,Xinyin Ma,Ruonan Yu,Xinchao Wang
Main category: cs.CL
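TL;DR: This paper proposes dParallel, a simple and effective method that uses certainty-forcing distillation to speed up parallel decoding in diffusion large language models (dLLMs), sharply reducing the number of decoding steps while maintaining performance.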
Details
Motivation: Although existing dLLMs can in principle generate in parallel, in practice they still need many decoding steps to maintain performance, which limits their efficiency advantage; a method is needed to unlock that parallelism.
Method: Proposes dParallel, whose core is certainty-forcing distillation: during training, the model is pushed to reach high confidence on masked tokens faster and in parallel, enabling fast sampling.
Result: Validated on multiple benchmarks: decoding steps drop from 256 to 30 on GSM8K (8.5x speedup) and from 256 to 24 on MBPP (10.5x speedup) without loss of performance.
Conclusion: dParallel unlocks the inherent parallelism of dLLMs and substantially speeds up inference, offering an efficient option for practical use of diffusion language models.
Abstract: Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require a number of decoding steps close to the token length to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while forcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at https://github.com/czg1225/dParallel
[78] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
Wei He,Yueqing Sun,Hongyan Hao,Xueyuan Hao,Zhikang Xia,Qi Gu,Chengcheng Han,Dengchang Zhao,Hui Su,Kefeng Zhang,Man Gao,Xi Su,Xiaodong Cai,Xunliang Cai,Yu Yang,Yunke Zhao
Main category: cs.CL
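TL;DR: This paper introduces VitaBench, a benchmark of complex interactive tasks grounded in real-world settings, for evaluating the overall ability of LLM-based agents across multiple scenarios, many tools, and dynamic user interaction.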
Details
Motivation: Existing benchmarks do not capture the complexity LLM agents face in real applications: handling extensive information, drawing on diverse resources, and managing dynamic user interactions. An evaluation environment closer to the real world is needed.
Method: Builds a life-service simulation environment with 66 tools covering food delivery, in-store consumption, and online travel; designs 100 cross-scenario tasks and 300 single-scenario tasks; and proposes a rubric-based sliding window evaluator for robust assessment of complex solution paths and stochastic interactions.
Result: The most advanced models reach only a 30% success rate on cross-scenario tasks and below 50% on the others.
Conclusion: VitaBench offers a valuable tool and challenge for evaluating and advancing AI agents in practical applications.
Abstract: As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks, and less than a 50% success rate on others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/
[79] BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Yue Wang,Ruotian Ma,Xingyu Chen,Zhengliang Shi,Wanshun Chen,Huang Liu,Jiadi Yao,Qu Yang,Qingxuan Jiang,Fanghua Ye,Juntao Li,Min Zhang,Zhaopeng Tu,Xiaolong Li,Linus
Main category: cs.CL
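TL;DR: Proposes the BatonVoice framework, which decouples instruction understanding from speech generation: an LLM acts as a "conductor" that writes a textual plan of vocal features, and a dedicated TTS model, the "orchestra", synthesizes speech from it. The result is better controllable and emotional speech synthesis and strong zero-shot cross-lingual generalization.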
Details
Motivation: Existing approaches do not make full use of the instruction-following ability of LLMs, which holds back controllable text-to-speech synthesis.
Method: Proposes a new paradigm inspired by operationalism that separates instruction understanding from speech generation: an LLM acts as a "conductor" and produces a textual "plan" of explicit vocal features such as pitch and energy, and a dedicated TTS model, BatonTTS, acts as the "orchestra" and generates speech from that plan.
Result: BatonVoice outperforms strong baselines on controllable and emotional speech synthesis and shows remarkable zero-shot cross-lingual generalization, applying feature control accurately to unseen languages.
Conclusion: Objectifying speech as textual vocal features unlocks the linguistic intelligence of LLMs more effectively, improving the controllability and generalization of speech synthesis.
Abstract: The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
[80] Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization
Yaoxiang Wang,Qingguo Hu,Yucheng Ding,Ruizhe Wang,Yeyun Gong,Jian Jiao,Yelong Shen,Peng Cheng,Jinsong Su
Main category: cs.CL
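TL;DR: Proposes the Matryoshka MoE (M-MoE) framework, which varies the number of activated experts during training so that the model acquires a coarse-to-fine structure, enabling elastic inference at a much lower training cost.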
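To make the idea of varying the number of activated experts concrete, here is a minimal sketch of top-k routing with a per-layer expert count drawn at random. It assumes a softmax router whose weights are renormalized over the selected experts, which is one common convention and may differ from the paper's router; `route_top_k`, `sample_layer_ks`, and the logits are hypothetical names and values.

```python
import math
import random


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)  # subtract for numerical stability
    exp_scores = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


def sample_layer_ks(num_layers, k_choices, rng):
    """Draw an independent expert count for each layer (layer-wise randomization)."""
    return [rng.choice(k_choices) for _ in range(num_layers)]


rng = random.Random(0)
logits = [0.2, 1.5, -0.3, 0.9]  # hypothetical router scores for four experts

# During training, each layer would use its own sampled k for this step.
layer_ks = sample_layer_ks(num_layers=3, k_choices=[1, 2, 4], rng=rng)

# The same router scores yield nested expert sets as k grows.
weights_k1 = route_top_k(logits, 1)
weights_k2 = route_top_k(logits, 2)
print(layer_ks, weights_k1, weights_k2)
```

Because experts are always taken in score order, the set chosen for a smaller k is contained in the set chosen for a larger k, which is the nesting that the "Matryoshka" name suggests.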
Details
Motivation: The standard Top-K routing strategy prevents MoE models from flexibly changing the number of activated experts at inference time; doing so causes a sharp drop in performance and makes elastic inference hard to achieve.
Method: Designs the M-MoE training framework, which systematically varies the number of activated experts during training and introduces a layer-wise randomization strategy, pushing the model to learn an ordered hierarchy of experts: top-ranked experts supply essential capabilities, and later experts add progressively finer detail.
Result: A single M-MoE model performs close to dedicated specialist models at each expert count, at a much lower training cost; it also supports allocating compute per layer, enabling flexible elastic inference.
Conclusion: M-MoE offers an efficient, flexible option for deploying large-scale MoE models in practice and advances elastic inference.
Abstract: Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of the Top-K router prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
[81] OceanGym: A Benchmark Environment for Underwater Embodied Agents
Yida Xue,Mingjun Mao,Xiangyuan Ru,Yuqi Zhu,Baochang Ren,Shuofei Qiao,Mengru Wang,Shumin Deng,Xinyu An,Ningyu Zhang,Ying Chen,Huajun Chen
Main category: cs.CL
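TL;DR: OceanGym is the first comprehensive benchmark for ocean underwater embodied agents. It comprises eight realistic task domains and a unified agent framework driven by multimodal large language models, and aims to advance perception, decision-making, and autonomous exploration in extreme underwater environments.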
Details
Motivation: Underwater environments pose extreme challenges such as low visibility and dynamic currents that current AI systems handle poorly, so a dedicated benchmark is needed to drive progress on embodied agents in this setting.
Method: Builds OceanGym, a high-fidelity simulation platform with eight tasks, and proposes a unified agent framework based on multimodal large language models (MLLMs) that combines optical and sonar perception, memory, and sequential decision-making to support autonomous exploration toward long-horizon goals.
Result: Experiments show that state-of-the-art MLLM-driven agents still fall well short of human experts, especially in perception, planning, and adaptability, confirming how challenging underwater environments are.
Conclusion: OceanGym provides an important testbed for developing robust embodied AI that can operate in extreme underwater environments, and may help bring autonomous underwater vehicles into practical use in one of Earth's last unexplored frontiers.
Abstract: We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
[82] The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Lina Conti,Dennis Fucci,Marco Gaido,Matteo Negri,Guillaume Wisniewski,Luisa Bentivogli
Main category: cs.CL
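TL;DR: Proposes the first contrastive explanation method for speech-to-text generative models; by analyzing how parts of the input spectrogram influence the choice between alternative outputs, it reveals the audio features that drive gender assignment.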
Details
Motivation: Contrastive explanations are considered more informative and interpretable than standard explanations, but obtaining them for speech-to-text generative models remains an open problem.
Method: Drawing on feature attribution techniques, generates contrastive explanations by analyzing how the input spectrogram influences the choice between the target output and an alternative.
Result: In a case study on gender assignment in speech translation, the method accurately identifies the audio features that drive the gender choice.
Conclusion: The work extends contrastive explanations to speech-to-text tasks and provides a foundation for understanding and improving S2T models.
Abstract: Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.
[83] Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Seiji Maekawa,Jackson Hassell,Pouya Pezeshkpour,Tom Mitchell,Estevam Hruschka
Main category: cs.CL
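TL;DR: Proposes FuncBenchGen, a unified, contamination-free framework that generates synthetic multi-step tasks for evaluating tool-augmented language models (TaLMs). It exposes brittle state tracking in these models and shows that explicitly restating variable values improves performance.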
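This entry's abstract frames tool use as traversal over a hidden function-dependency DAG, where one function consumes another's output. As a minimal sketch of that task structure, assuming a toy schema format invented here for illustration, the snippet below resolves a target variable by calling functions in dependency order while keeping an explicit state of known variable values. The names `solve_target` and `functions`, and the toy functions themselves, are hypothetical and not part of FuncBenchGen.

```python
def solve_target(functions, initial_values, target):
    """Compute `target` by calling functions in dependency order.

    `functions` maps an output variable to (input variable names, callable).
    Returns the target value and the ordered list of variables computed.
    Assumes the dependency graph is acyclic.
    """
    state = dict(initial_values)  # explicit record of every known variable
    call_trace = []

    def resolve(var):
        if var in state:
            return state[var]
        inputs, fn = functions[var]
        args = [resolve(name) for name in inputs]
        state[var] = fn(*args)
        call_trace.append(var)
        return state[var]

    return resolve(target), call_trace


# Toy DAG: d depends on b and c, which both depend on the initial variable a.
functions = {
    "b": (["a"], lambda a: a + 1),
    "c": (["a"], lambda a: a * 2),
    "d": (["b", "c"], lambda b, c: b * c),
    "z": (["a"], lambda a: -a),  # distractor: never needed for target "d"
}

value, trace = solve_target(functions, {"a": 3}, "d")
print(value, trace)
```

The `state` dictionary plays the role that the reported mitigation gives to the prompt: every previously computed value is available verbatim at each step, so nothing stale or mistaken is carried forward.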
Details
Motivation: Existing benchmarks for tool-augmented language models offer little control over task complexity, number of functions, and input size, and are vulnerable to data contamination, so a controllable and clean evaluation framework is needed.
Method: Models tool use as a traversal task over a hidden function-dependency directed acyclic graph (DAG); given function schemas, initial variables, and a target variable, generates synthetic multi-step tasks with precise control over difficulty and no data leakage.
Result: In experiments on seven LLMs, reasoning-optimized models do better and GPT-5 leads by a wide margin, but performance falls sharply as dependency depth grows, and connected irrelevant functions are hard to handle. Strong models often propagate incorrect or stale argument values, exposing brittle state tracking. A lightweight strategy of restating variable values at each step raises GPT-5's success rate from 62.5% to 81.3%.
Conclusion: FuncBenchGen provides a controllable, contamination-free way to evaluate TaLMs, reveals state tracking as a key problem in multi-step tool use, and shows that a simple context-augmentation strategy can markedly improve performance.
Abstract: As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., a success rate improvement from 62.5% to 81.3% for GPT-5.
[84] Generating Difficult-to-Translate Texts
Vilém Zouhar,Wenda Xu,Parker Riley,Juraj Juraska,Mara Finkelstein,Markus Freitag,Dan Deutsch
Main category: cs.CL
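TL;DR: Proposes MT-breaker, a method in which a large language model iteratively refines source texts to make them harder for machine translation, producing more challenging test examples that remain natural and diverse.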
Details
Motivation: Current machine translation benchmarks are too easy for today's models, so they fail to separate better models from worse ones or to expose weaknesses; existing methods for building hard examples fall short on difficulty, diversity, or naturalness.
Method: Inspired by how human experts probe for model failures, proposes MT-breaker: a large language model iteratively revises a source text and queries the target machine translation model to guide the generation of harder examples.
Result: The generated examples are more challenging for the target MT model while keeping the diversity of natural text; although generation is tailored to one model, the difficulty transfers to other models and languages.
Conclusion: MT-breaker produces high-quality, transferable hard translation examples and gives machine translation evaluation a more robust benchmark.
Abstract: Machine translation benchmarks sourced from the real world quickly become obsolete because most examples are easy for state-of-the-art translation models. This limits the benchmark's ability to distinguish which model is better or to reveal models' weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during the generation, the difficulty also transfers to other models and languages.
[85] Deconstructing Self-Bias in LLM-generated Translation Benchmarks
Wenda Xu,Sweta Agrawal,Vilém Zouhar,Markus Freitag,Daniel Deutsch
Main category: cs.CL
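TL;DR: This paper studies "self-bias" in benchmarks generated automatically by large language models (LLMs). On low-resource-language-to-English translation tasks in particular, both the generated test sets and the evaluation method favor the model that created the benchmark, and the size of the bias depends on the model's generation ability in the source language and on source-text diversity.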
Details
Motivation: As LLMs saturate existing benchmarks, generating benchmarks automatically with LLMs has become a scalable alternative. Such automation may carry systematic bias that undermines fair model evaluation, so its potential flaws need to be examined.
Method: Analyzes the dual role of the LLM as test-set generator and as evaluator to identify sources of self-bias, and runs experiments on low-resource-language-to-English translation tasks to study the effect of generation ability, direction (into-English versus out-of-English), and source-text diversity.
Result: Self-bias comes from both the generated test data and the evaluation method, and their combination amplifies it; the stronger the model's generation ability in the source language, the more pronounced the bias; low source-text diversity is one contributing factor.
Conclusion: LLM-generated benchmarks exhibit clear self-bias, especially where generation ability is stronger and source texts are less diverse; increasing the diversity of generated source texts mitigates some of it.
Abstract: As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to cheaply rank models, we demonstrate a critical flaw: LLM-generated benchmarks systematically favor the model that created the benchmark, exhibiting self-bias on low-resource-language-to-English translation tasks. We show three key findings on automatic benchmarking of LLMs for translation. First, this bias originates from two sources: the generated test data (LLM as a testset) and the evaluation method (LLM as an evaluator), with their combination amplifying the effect. Second, self-bias in LLM as a benchmark is heavily influenced by the model's generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model's generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in source text is one contributor to self-bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self-bias.
[86] MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse,Sebastian Ruder,Tony Lin,Oksana Kurylo,Haruka Takagi,Janice Lam,Nicolò Busetto,Denise Diaz
Main category: cs.CL
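TL;DR: This paper introduces MENLO, a framework for evaluating how native-like LLM responses are across languages, validates it with a human-annotated dataset, and explores reinforcement learning and related methods for improving LLM judges.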
Details
Motivation: Ensuring native-like response quality from LLMs across many languages is difficult, and existing automatic evaluation methods do not reliably reflect real quality in multilingual settings, so a more reliable and scalable evaluation framework is needed.
Method: Builds the MENLO evaluation framework on audience-design-inspired mechanisms; collects a dataset of 6,423 human-annotated prompt-response preference pairs covering 47 language varieties and four quality dimensions; compares zero-shot LLM judges; and fine-tunes judges with reinforcement learning, reward shaping, and multi-task learning.
Result: Zero-shot LLM judges benefit from pairwise comparison and structured rubrics but still underperform human annotators; RL-trained judges can serve as generative reward models that improve LLMs' multilingual proficiency, though gaps with human judgment remain.
Conclusion: MENLO offers an effective and scalable approach to evaluating multilingual LLM response quality; combining human preference data with reinforcement learning is a promising direction for multilingual preference alignment, and the authors release the dataset and evaluation framework to support further research.
Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
[87] DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
Yixuan Weng,Minjun Zhu,Qiujie Xie,Qiyao Sun,Zhen Lin,Sifan Liu,Yue Zhang
Main category: cs.CL
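TL;DR: DeepScientist is a goal-oriented, fully autonomous scientific discovery system that uses Bayesian optimization and a hierarchical validation process to surpass human-designed state-of-the-art methods on three frontier AI tasks.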
Details
Motivation: Existing AI Scientist systems lack focus on major human-defined challenges and so struggle to produce genuinely valuable scientific contributions.
Method: Formalizes scientific discovery as a Bayesian optimization problem, runs a hierarchical "hypothesize, verify, and analyze" evaluation loop, and uses a cumulative Findings Memory to balance exploration and exploitation.
Result: Using more than 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated about 1,100 of them, improving performance on three AI tasks by 183.7%, 1.9%, and 7.9%.
Conclusion: DeepScientist provides the first large-scale evidence that an AI can progressively produce discoveries that surpass the human state of the art, pushing the scientific frontier forward.
Abstract: While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, operationalized through a hierarchical evaluation process consisting of "hypothesize, verify, and analyze". Leveraging a cumulative Findings Memory, this loop intelligently balances the exploration of novel hypotheses with exploitation, selectively promoting the most promising findings to higher-fidelity levels of validation. Consuming over 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated approximately 1,100 of them, ultimately surpassing human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7%, 1.9%, and 7.9%. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier of scientific discovery. To facilitate further research into this process, we will open-source all experimental logs and system code at https://github.com/ResearAI/DeepScientist/.
[88] Searching for Difficult-to-Translate Test Examples at Scale
Wenda Xu,Vilém Zouhar,Parker Riley,Mara Finkelstein,Markus Freitag,Daniel Deutsch
Main category: cs.CL
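TL;DR: This paper formalizes the search for difficult test examples as a multi-armed bandit problem: treating each topic as an "arm," it efficiently identifies the hardest topics under a limited compute budget and substantially outperforms baselines such as brute-force search.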
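The bandit framing in this entry (each topic is an arm, each pull draws and scores one example) can be sketched as follows. This is a minimal illustration, not the authors' method: the entry only says "various bandit strategies" were tried, so UCB1 is chosen here as one standard strategy, and the topic names, hidden difficulty values, and the `simulated_pull` function are hypothetical stand-ins for drawing and evaluating a real translation example.

```python
import math
import random


def ucb1_hardest_topics(topics, pull, budget, top_k=1):
    """Spend `budget` pulls across `topics`; return the top_k topics by mean difficulty.

    `pull(topic)` draws one example from the topic and returns its difficulty in [0, 1].
    """
    counts = {t: 0 for t in topics}
    sums = {t: 0.0 for t in topics}

    for step in range(1, budget + 1):
        untried = [t for t in topics if counts[t] == 0]
        if untried:
            topic = untried[0]  # pull every arm once before using the bound
        else:
            # UCB1: empirical mean plus an exploration bonus that shrinks with more pulls
            topic = max(
                topics,
                key=lambda t: sums[t] / counts[t]
                + math.sqrt(2.0 * math.log(step) / counts[t]),
            )
        sums[topic] += pull(topic)
        counts[topic] += 1

    means = {t: sums[t] / counts[t] for t in topics if counts[t] > 0}
    ranked = sorted(means, key=means.get, reverse=True)
    return ranked[:top_k], counts


# Hypothetical simulation: each topic has a hidden mean difficulty, and a single
# example comes out "hard" (1.0) or "easy" (0.0) at random, so pulls are noisy.
HIDDEN_DIFFICULTY = {"cooking": 0.2, "sports": 0.3, "legal": 0.8, "poetry": 0.5}
rng = random.Random(0)


def simulated_pull(topic):
    return 1.0 if rng.random() < HIDDEN_DIFFICULTY[topic] else 0.0


best, pull_counts = ucb1_hardest_topics(
    list(HIDDEN_DIFFICULTY), simulated_pull, budget=2000
)
```

The point of the sketch is the budget allocation: a brute-force baseline would split the 2,000 pulls evenly, whereas the bandit concentrates pulls on topics whose difficulty estimate is still plausibly the highest.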
Details
Motivation: Evaluating NLP models requires challenging test data. Because the difficulty of an instance is tied to its source topic and is stochastic, efficiently finding the hardest topics among a huge number of candidates is the key challenge.
Method: Models the search for the hardest topics as a multi-armed bandit problem in which each topic is an arm and pulling an arm means drawing one example and evaluating its difficulty; applies several bandit strategies to explore and exploit under a fixed compute budget.
Result: On machine translation, several bandit strategies far outperform baselines such as brute-force search and identify the most challenging topics more efficiently.
Conclusion: The multi-armed bandit framework is an effective way to discover difficult test examples across a large set of topics and helps improve the quality of NLP model testing.
Abstract: NLP models require test data that are sufficiently challenging. The difficulty of an example is linked to the topic it originates from ("seed topic"). The relationship between the topic and the difficulty of its instances is stochastic in nature: an example about a difficult topic can happen to be easy, and vice versa. At the scale of the Internet, there are tens of thousands of potential topics, and finding the most difficult one by drawing and evaluating a large number of examples across all topics is computationally infeasible. We formalize this task and treat it as a multi-armed bandit problem. In this framework, each topic is an "arm," and pulling an arm (at a cost) involves drawing a single example, evaluating it, and measuring its difficulty. The goal is to efficiently identify the most difficult topics within a fixed computational budget. We illustrate the bandit problem setup of finding difficult examples for the task of machine translation. We find that various bandit strategies vastly outperform baseline methods like brute-force searching the most challenging topics.
[89] Scaling Spoken Language Models with Syllabic Speech Tokenization
Nicholas Lee,Cheol Jun Cho,Alan W Black,Gopala K. Anumanchipalli
Main category: cs.CL
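TL;DR: This paper studies syllable-level speech tokenization for spoken language modeling and finds that it matches or even improves performance while substantially reducing training and inference costs.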
Details
Motivation: Conventional speech models use high-frame-rate tokens, which is computationally expensive; the paper explores more efficient and interpretable syllable-level tokenization.
Method: Using syllable-level speech tokens extracted with self-supervised learning, systematically evaluates spoken language understanding performance across different training data scales.
Result: Syllabic tokens match or exceed high-frame-rate tokens on multiple SLU benchmarks while cutting training time by more than 2x and FLOPs by 5x.
Conclusion: Syllable-level language modeling is a promising path toward efficient long-context spoken language models.
Abstract: Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with self-attention is expensive, as attention scales quadratically with sequence length. A recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable with significant compression in token lengths (4-5 Hz). Yet, their value for spoken language modeling is not yet fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.
[90] Convergence and Divergence of Language Models under Different Random Seeds
Finlay Fehlauer,Kyle Mahowald,Tiago Pimentel
Main category: cs.CL
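TL;DR: This paper studies the convergence of language models trained under different random seeds and finds that model size and training stage strongly affect convergence patterns: larger models reconverge faster late in training, smaller models never truly reconverge, and convergence is uneven across linguistic categories.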
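The convergence measure named in this entry, the expected per-token KL divergence across seeds, can be illustrated with a small pure-Python sketch. The abstract does not specify the exact estimator, so averaging over all seed pairs in both KL directions is an assumption made here, and the function names and the toy distributions are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two next-token distributions given as lists of probabilities."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def mean_cross_seed_kl(per_seed_dists):
    """Average per-token KL over all pairs of seeds.

    per_seed_dists[s][t] is the next-token distribution of seed s at token position t.
    """
    n_tokens = len(per_seed_dists[0])
    total, n_terms = 0.0, 0
    for a, b in combinations(range(len(per_seed_dists)), 2):
        for t in range(n_tokens):
            # KL is asymmetric, so both directions are included (an assumption)
            total += kl_divergence(per_seed_dists[a][t], per_seed_dists[b][t])
            total += kl_divergence(per_seed_dists[b][t], per_seed_dists[a][t])
            n_terms += 2
    return total / n_terms


# Toy example over a 2-word vocabulary and a single token position.
agreeing = mean_cross_seed_kl([[[0.5, 0.5]], [[0.5, 0.5]]])
disagreeing = mean_cross_seed_kl([[[0.9, 0.1]], [[0.1, 0.9]]])
```

A value near zero means the seeds have learned nearly the same distribution at those positions; tracking this quantity over checkpoints is what would expose phases such as the sharp divergence described in the entry.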
Details
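Motivation: To investigate how language models trained under different random seeds converge and what drives that convergence, in order to understand how model size and training dynamics affect the stability of the learned distribution.
Method: Measures convergence as the average per-token KL divergence across seeds, analyzes convergence patterns across model sizes and training checkpoints, and performs a fine-grained analysis by token frequency and part-of-speech tag.
Result: Identifies a four-phase convergence pattern: an initial uniform phase, a sharp-convergence phase, a sharp-divergence phase, and a slow-reconvergence phase. Larger models reconverge faster late in training, whereas smaller models fail to reconverge; frequent tokens and function words converge earlier and more reliably than infrequent tokens and content words.
Conclusion: Model size is a key factor in learning stable distributions, and convergence differs markedly across word types, suggesting that future training should consider structured, staged optimization strategies.
Abstract: In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback-Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
cs.CV [Back]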
[91] LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
Haozhe Jia,Wenshuo Chen,Yuqi Lin,Yang Yang,Lei Wang,Mang Ning,Bowen Tian,Songning Lai,Nanqian Jia,Yifan Chen,Yutao Yue
Main category: cs.CV
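TL;DR: Proposes LUMA, which strengthens semantic alignment through dual-path anchoring to address semantic misalignment and motion artifacts in diffusion-based text-to-motion generation; it reaches state-of-the-art performance on HumanML3D and KIT-ML and converges faster.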
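The frequency-domain path in this entry relies on low-frequency DCT components of the motion. As a rough illustration of what such components are, the sketch below computes an unnormalized DCT-II of a one-dimensional trajectory and keeps its first few coefficients. How LUMA actually extracts and uses these components is not described in the entry; the function names, the choice of DCT-II without normalization, and the toy joint trajectory are all assumptions.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II of a 1D sequence (for example, one joint coordinate over frames)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def low_frequency_components(signal, keep):
    """Keep only the first `keep` DCT coefficients, which describe the slow, coarse motion."""
    return dct_ii(signal)[:keep]


# Hypothetical trajectory: a joint height that stays constant for 4 frames.
constant_motion = dct_ii([1.0, 1.0, 1.0, 1.0])

# Hypothetical trajectory with a slow rise plus frame-to-frame jitter.
coarse = low_frequency_components([0.0, 0.3, 0.4, 0.7, 0.8, 1.1, 1.2, 1.5], keep=3)
```

For the constant trajectory all of the energy sits in the first (DC) coefficient, which is the sense in which low-frequency coefficients summarize the overall shape of a motion rather than its fine detail.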
Details
Motivation: Existing diffusion-based text-to-motion models suffer from severe gradient attenuation, which leaves high-level features under-learned and in turn causes semantic misalignment and kinematic artifacts.
Method: Proposes LUMA with a dual-path anchoring mechanism: one path uses a lightweight MoCLIP model to provide semantic supervision in the temporal domain; the other uses low-frequency DCT components to provide complementary alignment signals in the frequency domain; the two are fused adaptively through a temporal modulation mechanism.
Result: Achieves FID scores of 0.035 and 0.123 on HumanML3D and KIT-ML, respectively, and converges 1.4x faster than the baseline.
Conclusion: Through dual-path alignment and improved gradient flow, LUMA markedly improves semantic consistency and motion quality in text-to-motion generation while remaining efficient and scalable.
Abstract: While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4x compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
[92] VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov,Wei Lin,M. Jehanzeb Mirza,Soumya Jahagirdar,Muhammad Huzaifa,Sivan Doveh,Serena Yeung-Levy,James Glass,Hilde Kuehne
Main category: cs.CV
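TL;DR: This paper introduces VisualOverload, a new visual question answering benchmark for evaluating the basic visual understanding of vision-language models in dense, complex scenes; existing models perform poorly, exposing weaknesses in counting, OCR, and logical reasoning.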
Details
Motivation: The authors argue that current VQA benchmarks overestimate the true performance of vision-language models, whose understanding remains limited in detail-rich, densely populated scenes.
Method: Builds VisualOverload, a new VQA benchmark of 2,720 question-answer pairs over high-resolution scans of public-domain paintings, spanning six task categories and focused on knowledge-free basic visual understanding in detail-dense scenes.
Result: The best of 37 models, o3, reaches only 19.6% accuracy on the hardest test split and 69.5% overall; error analysis reveals failures in counting and OCR as well as logical inconsistencies on complex tasks.
Conclusion: VisualOverload exposes key weaknesses of current vision-language models in dense, complex scenes, shows that basic visual understanding is not yet solved, and provides an important resource for future research.
Abstract: Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.
Benchmark: http://paulgavrikov.github.io/visualoverload
[93] Editing Physiological Signals in Videos Using Latent Representations
Tianwen Zhou,Akshay Paruchuri,Josef Spjut,Kaan Akşit
Main category: cs.CV
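TL;DR: Proposes a learning-based framework for editing physiological signals in videos that enables controllable heart-rate editing while preserving visual quality, for privacy protection or for synthesizing videos with specified vital signs.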
Details
Motivation: Physiological signals in facial videos can leak sensitive information such as health and emotional state, raising privacy concerns, so these signals need to be edited while preserving visual realism.
Method: Encodes the input video into a latent space with a pretrained 3D variational autoencoder (3D VAE) and embeds a target heart-rate prompt with a frozen text encoder; fuses the two with trainable spatio-temporal layers using AdaLN, and combines FiLM with a fine-tuned decoder output layer to modulate the physiological signal accurately during reconstruction.
Result: Achieves an average PSNR of 38.96 dB and SSIM of 0.98 on the selected datasets, with a heart-rate modulation error of 10.00 bpm MAE and 10.09% MAPE, indicating good visual fidelity and control over the physiological signal.
Conclusion: The method enables controllable editing of heart-rate signals in video, with potential applications in protecting biometric privacy and generating realistic videos with specified vital signs.
Abstract: Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.
[94] SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
Yuyou Zhang,Radu Corcodel,Chiori Hori,Anoop Cherian,Ding Zhao
Main category: cs.CV
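TL;DR: SpinBench is a diagnostic benchmark for spatial reasoning in vision-language models (VLMs) that focuses on perspective taking, the reasoning required under viewpoint transformation. It comprises fine-grained diagnostic categories, evaluates 37 mainstream VLMs, reveals systematic weaknesses in rotation understanding, symmetry handling, and viewpoint change, and finds that human response time correlates strongly with model accuracy.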
Details
Motivation: There is no systematic tool for evaluating the spatial reasoning of VLMs, especially perspective taking, so a cognitively grounded benchmark is needed to diagnose how well they understand object relations across viewpoints.
Method: Designs SpinBench around the core challenge of perspective taking, with fine-grained task categories targeting translation, rotation, object relative pose, and viewpoint change, structured progressively from single-object tasks to multi-object perspective-taking tasks. Evaluates 37 state-of-the-art VLMs and compares them with human performance.
Result: VLMs show strong egocentric bias, poor rotational understanding, and unstable behavior under symmetrical and syntactic reformulations. Scaling brings both smooth improvements and emergent capabilities, but models still lag far behind humans (91.2% human accuracy); human response time correlates significantly with model accuracy, indicating that task difficulty is cognitively consistent.
Conclusion: SpinBench exposes key weaknesses of current VLMs in spatial reasoning, particularly under perspective and geometric transformations, and offers direction and an evaluation tool for improving VLMs' understanding of physical space.
Abstract: We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, relative positions grounding, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that single-object simpler tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space.
Our website can be found at https://spinbench25.github.io/.
[95] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland
Wendong Yao,Binhua Huang,Soumyabrata Dev
Main category: cs.CV
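TL;DR: Proposes the Multi-Modal Spatio-Temporal Transformer (MM-STT), which fuses dynamic displacement data with static physical priors and substantially improves high-resolution land subsidence forecasting.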
Details
Motivation: Conventional methods such as ConvLSTM struggle to model long-range dependencies, and prior work is limited by a uni-modal data paradigm that cannot fully capture the complex, non-linear dynamics of land subsidence.
Method: Proposes the MM-STT framework, which uses a unified joint spatio-temporal attention mechanism to fuse dynamic displacement data with static physical priors and integrate multi-modal features deeply.
Result: On the public EGMS dataset, MM-STT reduces long-range forecast RMSE by an order of magnitude compared with state-of-the-art methods such as STGCN and STAEformer, setting a new best result.
Conclusion: For problems such as high-resolution land subsidence forecasting, an architecture's inherent capacity for deep multi-modal fusion is the key to a performance breakthrough.
Abstract: Forecasting high-resolution land subsidence is a critical yet challenging task due to its complex, non-linear dynamics. While standard architectures like ConvLSTM often fail to model long-range dependencies, we argue that a more fundamental limitation of prior work lies in the uni-modal data paradigm. To address this, we propose the Multi-Modal Spatio-Temporal Transformer (MM-STT), a novel framework that fuses dynamic displacement data with static physical priors. Its core innovation is a joint spatio-temporal attention mechanism that processes all multi-modal features in a unified manner. On the public EGMS dataset, MM-STT establishes a new state-of-the-art, reducing the long-range forecast RMSE by an order of magnitude compared to all baselines, including SOTA methods like STGCN and STAEformer. Our results demonstrate that for this class of problems, an architecture's inherent capacity for deep multi-modal fusion is paramount for achieving transformative performance.
[96] DepthLM: Metric Depth From Vision Language Models
Zhipeng Cai,Ching-Feng Yeh,Hu Xu,Zhuang Liu,Gregory Meyer,Xinjie Lei,Changsheng Zhao,Shang-Wen Li,Vikas Chandra,Yangyang Shi
Main category: cs.CV
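TL;DR: This paper proposes DepthLM, which uses text-based supervised fine-tuning, visual prompting, and related techniques to bring vision-language models (VLMs) to expert-level per-pixel metric depth estimation without changing the architecture or loss; it outperforms existing VLMs, approaches pure vision models, and avoids over-smoothing at boundary regions.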
Details
Motivation: State-of-the-art vision-language models (VLMs) excel at semantic understanding but still struggle to understand 3D structure from 2D inputs, whereas specialized pure vision models achieve super-human accuracy on tasks such as metric depth estimation but depend on task-specific architectures and losses. The paper asks whether VLMs can reach expert-level 3D understanding without modifying the architecture or loss.
Method: Taking per-pixel metric depth estimation as the task, applies text-based supervised fine-tuning with sparse labels, visual prompting, and intrinsic-conditioned augmentation to address VLM bottlenecks in pixel reference and cross-dataset camera ambiguity, with no dense prediction head or complex regression loss.
Result: DepthLM surpasses the accuracy of the most advanced VLMs by more than 2x, making VLMs comparable to pure vision models on depth estimation for the first time; it also uses much smaller models, avoids over-smoothing and flying points at boundaries, and can extend to other 3D tasks.
Conclusion: VLMs can reach expert-level 3D understanding with a simple fine-tuning strategy and no change to architecture or loss; DepthLM offers a simple, effective path toward unified multi-task 3D perception.
Abstract: Vision language models (VLMs) can flexibly address various vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs including GPT-5 still struggle in understanding 3D from 2D inputs. On the other hand, expert pure vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task. However, they require task-specific architectures and losses. This difference motivates us to ask: Can VLMs reach expert-level accuracy without architecture or loss change? We take per-pixel metric depth estimation as the representative task and show that the answer is yes! Surprisingly, comprehensive analysis shows that text-based supervised-finetuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck for VLMs lies actually in pixel reference and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of most advanced VLMs by over 2x, making VLMs for the first time comparable with pure vision models. Interestingly, without explicit enforcement during training, VLMs trained with DepthLM naturally avoid over-smoothing, having much fewer flying points at boundary regions than pure vision models. The simplicity of DepthLM also enables a single VLM to cover various 3D tasks beyond metric depth. Our code and model will be released at the link below.
[97] Bayesian Transformer for Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data
Mabel Heffring,Lincoln Linlin Xu
Main category: cs.CV
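TL;DR: Proposes a Bayesian Transformer approach for high-resolution Pan-Arctic sea ice concentration (SIC) mapping and uncertainty quantification that fuses Sentinel-1, RCM, and AMSR2 data and markedly improves feature extraction and uncertainty estimation.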
Details
Motivation: High-resolution Pan-Arctic sea ice concentration mapping with reliable uncertainty is essential for operational use, but it is hampered by subtle ice signature features, model uncertainty, and data heterogeneity.
Method: Designs a high-resolution Transformer with global and local modules to strengthen feature extraction; adds a Bayesian extension that treats model parameters as random variables to better quantify uncertainty; and fuses Sentinel-1, RCM, and AMSR2 data at the decision level to handle data heterogeneity.
Result: Tests on Pan-Arctic datasets from September 2021 show that the method produces higher-resolution SIC maps and more robust uncertainty maps than other uncertainty quantification approaches.
Conclusion: The proposed Bayesian Transformer performs well in high-resolution sea ice mapping and uncertainty quantification and addresses the key challenges of multi-source data fusion, feature discrimination, and model uncertainty.
Abstract: Although high-resolution mapping of Pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to some key challenges, e.g., the subtle nature of ice signature features, model uncertainty, and data heterogeneity. This letter presents a novel Bayesian Transformer approach for Pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve feature extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Third, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is tested on Pan-Arctic datasets from September 2021, and the results demonstrate that the proposed model can achieve both high-resolution SIC maps and robust uncertainty maps compared to other uncertainty quantification approaches.
[98] Infrastructure Sensor-enabled Vehicle Data Generation using Multi-Sensor Fusion for Proactive Safety Applications at Work Zone
Suhala Rabab Saba,Sakib Khan,Minhaj Uddin Ahmad,Jiahe Cao,Mizanur Rahman,Li Zhao,Nathan Huynh,Eren Erman Ozguven
Main category: cs.CV
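TL;DR: This study fuses roadside camera and LiDAR data with a Kalman filter and, in simulation and field validation, achieves accurate and robust vehicle trajectory estimation; it overcomes the limitations of single sensors in complex traffic environments and offers a practical route to deploying infrastructure-based sensing in high-risk segments such as work zones.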
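The Kalman-filter-based late fusion in this entry can be illustrated with a deliberately small sketch: a scalar filter that tracks one longitudinal position and applies the camera and LiDAR measurements as sequential updates, skipping a sensor in frames where it has no detection. The entry does not give the authors' state model, so the random-walk motion model, the noise variances, and the sample measurements are all hypothetical; a constant-velocity state would be a more realistic choice for a moving vehicle.

```python
def kalman_fuse_track(camera_z, lidar_z, camera_var, lidar_var, process_var=0.5):
    """Fuse two per-frame position measurements with a scalar random-walk Kalman filter.

    Each sensor list holds one longitudinal position (metres) per frame, or None
    when that sensor produced no detection in the frame.
    Returns a list of (estimate, variance) pairs, one per frame.
    """
    x, p = None, None
    track = []
    for cam, lid in zip(camera_z, lidar_z):
        if x is not None:
            p += process_var  # predict: uncertainty grows between frames
        for z, r in ((cam, camera_var), (lid, lidar_var)):
            if z is None:
                continue  # sensor dropout: rely on the other sensor or the prediction
            if x is None:
                x, p = z, r  # initialise from the first available measurement
                continue
            k = p / (p + r)  # Kalman gain: trust the lower-variance source more
            x = x + k * (z - x)
            p = (1.0 - k) * p
        track.append((x, p))
    return track


# Hypothetical measurements: the camera is noisy, the LiDAR is precise but intermittent.
track = kalman_fuse_track(
    camera_z=[10.4, 11.1, None, 13.2],
    lidar_z=[10.0, None, 12.1, 13.0],
    camera_var=4.0,
    lidar_var=0.25,
)
```

Because each update can only reduce the variance, the fused estimate after a frame with both sensors is at least as certain as the better sensor alone, and the track continues through frames where one sensor drops out, which is the behaviour the entry reports for intermittent or degraded single-sensor data.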
Details
Motivation: Deploying infrastructure-based sensing in high-risk segments such as work zones is limited by perspective distortion, complex geometry, occlusion, and cost, so a scalable and cost-effective solution is needed.
Method: Builds a co-simulation environment with roadside camera and LiDAR sensors, applies a Kalman-filter-based late fusion strategy for vehicle detection and localization, and validates the approach in simulation and in a real work zone.
Result: In simulation, longitudinal error drops by up to 70% while lateral accuracy stays within 1 to 3 meters; field tests show that the fused trajectories match real vehicle paths even when single-sensor data are intermittent or degraded.
Conclusion: Kalman-filter-based multi-sensor fusion compensates for single-sensor limitations, provides precise and robust vehicle tracking, and is practical for deploying infrastructure-based sensing in complex traffic environments.
Abstract: Infrastructure-based sensing and real-time trajectory generation show promise for improving safety in high-risk roadway segments such as work zones, yet practical deployments are hindered by perspective distortion, complex geometry, occlusions, and costs. This study tackles these barriers by integrating roadside camera and LiDAR sensors into a cosimulation environment to develop a scalable, cost-effective vehicle detection and localization framework, and employing a Kalman Filter-based late fusion strategy to enhance trajectory consistency and accuracy. In simulation, the fusion algorithm reduced longitudinal error by up to 70 percent compared to individual sensors while preserving lateral accuracy within 1 to 3 meters. Field validation in an active work zone, using LiDAR, a radar-camera rig, and RTK-GPS as ground truth, demonstrated that the fused trajectories closely match real vehicle paths, even when single-sensor data are intermittent or degraded. These results confirm that KF based sensor fusion can reliably compensate for individual sensor limitations, providing precise and robust vehicle tracking capabilities. Our approach thus offers a practical pathway to deploy infrastructure-enabled multi-sensor systems for proactive safety measures in complex traffic environments.
[99] Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection
Kaiqing Lin,Zhiyuan Yan,Ruoxin Chen,Junyan Ye,Ke-Yue Zhang,Yue Zhou,Peng Jin,Bin Li,Taiping Yao,Shouhong Ding
Main category: cs.CV
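TL;DR: This paper proposes a new paradigm for detecting AI-generated images, "seeing before reasoning," which strengthens the ability of multimodal large language models (MLLMs) to perceive forgery traces in order to improve detection performance and explainability; it also introduces the Forensic-Chat model and the ExplainFake-Bench benchmark.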
Details
Motivation: Existing MLLMs detect AI-generated images poorly because their vision encoders are not good at capturing low-level forgery traces and their fine-tuning data use narrow formats, which leads models to rely on linguistic shortcuts and forget pretrained knowledge.
Method: Proposes the "seeing before reasoning" paradigm, which first strengthens the MLLM's visual perception of forgery artifacts and only then performs reasoning; builds Forensic-Chat, an explainable and conversational model, and designs the ExplainFake-Bench benchmark.
Result: Experiments show that the method clearly outperforms existing approaches in generalization and explainability and makes reliable judgments grounded in real visual evidence.
Conclusion: Strengthening the low-level perception of MLLMs yields more reliable and explainable image forensics and supports the "perceive first, reason second" approach.
Abstract: Detecting AI-generated images with multimodal large language models (MLLMs) has gained increasing attention, due to their rich world knowledge, common-sense reasoning, and potential for explainability. However, naively applying those MLLMs for detection often leads to suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. First, they do not really see: existing MLLMs' vision encoders are primarily optimized for semantic-oriented recognition rather than the perception of low-level signals, leaving them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model grounds its judgment on incomplete and limited visual observations. Second, existing finetuning data for detection typically uses narrow, instruction-style formats, which diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model therefore exploits these linguistic shortcuts, resulting in catastrophic forgetting of pretrained knowledge (even the basic dialogue capabilities). In response, we advocate for a new paradigm: seeing before reasoning. We propose that MLLMs should first be trained to perceive artifacts, strengthening their artifact-aware visual perception, so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and still-conversational (for multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark tailored for the evaluation of the MLLM's explainability for image forensics from five key aspects. Extensive experiments show its superiority of generalization and genuinely reliable explainability.
[100] DeepFake Detection in Dyadic Video Calls using Point of Gaze Tracking
Odin Kohler,Rahul Vijaykumar,Masudul H. Imtiaz
Main category: cs.CV
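TL;DR: This paper proposes a new method that uses point-of-gaze tracking to detect deepfake video in real time, particularly in phishing attacks during video meetings. By analyzing gaze patterns in dyadic conversations, the method uses explainable biometric features to reach 82% detection accuracy and is the first reported technique to use point of gaze for deepfake detection.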
Details
Motivation: As deepfake technology advances, malicious actors are using real-time deepfakes for phishing attacks in video meetings. Existing detection methods struggle with this new threat, so a mechanism that can recognize unnatural gaze behavior is needed.
Method: Drawing on research into gaze patterns in dyadic conversations, selects explainable features to build a detection model and tests it on a self-built dataset, judging authenticity by tracking where on the screen the person in the deepfake video is looking.
Result: Achieves 82% detection accuracy on the self-built dataset, supporting point of gaze as an effective biometric for deepfake detection.
Conclusion: Point of gaze is an effective new biometric for real-time deepfake detection, especially in interactive settings such as video meetings, and offers a new approach to defending against deepfake attacks.
Abstract: With recent advancements in deepfake technology, it is now possible to generate convincing deepfakes in real-time. Unfortunately, malicious actors have started to use this new technology to perform real-time phishing attacks during video meetings. The nature of a video call allows access to what the deepfake is "seeing," that is, the screen displayed to the malicious actor. Using this with the estimated gaze from the malicious actor's streamed video enables us to estimate where the deepfake is looking on screen, the point of gaze. Because the point of gaze during conversations is not random and is instead used as a subtle nonverbal communicator, it can be used to detect deepfakes, which are not capable of mimicking this subtle nonverbal communication. This paper proposes a real-time deepfake detection method adapted to this genre of attack, utilizing previously unavailable biometric information. We built our model based on explainable features selected after careful review of research on gaze patterns during dyadic conversations. We then test our model on a novel dataset of our creation, achieving an accuracy of 82%. This is the first reported method to utilize point-of-gaze tracking for deepfake detection.
[101] Robust Visual Localization in Compute-Constrained Environments by Salient Edge Rendering and Weighted Hamming Similarity
Tu-Hoa Pham,Philip Bailey,Daniel Posada,Georgios Georgakis,Jorge Enriquez,Surya Suresh,Marco Dolci,Philip Twu
Main category: cs.CV
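TL;DR: Proposes a new method based on edge-domain template matching that uses a custom renderer and low-fidelity, textureless 3D models to achieve robust 6-DoF object pose estimation under compute and memory constraints.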
Details
Motivation: In the Mars Sample Return campaign, a robotic arm on constrained hardware must precisely localize multiple objects for low-clearance pickup, and existing methods struggle to meet the combined demands of robustness, accuracy, and resource limits.
Method: Proposes a new localization algorithm that combines a custom renderer with a template matching metric designed for the edge domain, performing vision-based 6-DoF pose estimation from low-fidelity, textureless 3D models alone.
Result: Experiments on synthetic datasets, physical testbeds on Earth, and real Mars imagery show that, under compute and memory constraints, the method outperforms the state of the art in both robustness and accuracy.
Conclusion: The method enables efficient, reliable object localization on resource-constrained general-purpose hardware, offering a low-cost, practical solution for applications such as deep-space exploration.
Abstract: We consider the problem of vision-based 6-DoF object pose estimation in the context of the notional Mars Sample Return campaign, in which a robotic arm would need to localize multiple objects of interest for low-clearance pickup and insertion, under severely constrained hardware. We propose a novel localization algorithm leveraging a custom renderer together with a new template matching metric tailored to the edge domain to achieve robust pose estimation using only low-fidelity, textureless 3D models as inputs. Extensive evaluations on synthetic datasets as well as from physical testbeds on Earth and in situ Mars imagery show that our method consistently beats the state of the art in compute and memory-constrained localization, both in terms of robustness and accuracy, in turn enabling new possibilities for cheap and reliable localization on general-purpose hardware.
[102] LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
Pranav Saxena,Avigyan Bhattacharya,Ji Zhang,Wenshan Wang
Main category: cs.CV
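TL;DR: Proposes LLM-RG, a hybrid approach that combines vision-language models with large language models for referential grounding in outdoor driving scenes and clearly outperforms existing methods on the Talk2Car benchmark.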
Details
Motivation: Outdoor driving scenes contain many visually similar objects, dynamic elements, and complex linguistic references, which makes accurate referential grounding hard for conventional methods.
Method: Uses a two-stage approach: first, a large language model extracts object types and attributes and candidate regions are detected; then a vision-language model generates detailed descriptions, which are combined with spatial metadata into natural-language prompts and passed to a large language model for chain-of-thought reasoning to determine the target bounding box.
Result: Clearly outperforms both LLM-based and VLM-based baselines on the Talk2Car benchmark; ablations show that adding 3D spatial cues improves performance further.
Conclusion: VLMs and LLMs have complementary strengths in a zero-shot setting, and combining them yields more robust outdoor referential grounding.
Abstract: Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
[103] VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models
Ravikumar Balakrishnan,Mansi Phute
Main category: cs.CV
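TL;DR: This paper proposes VISOR++, a behavior-control method for vision-language models based on universal visual inputs: it steers behavior across models by optimizing the image input alone, requires no access to model internals, and works with API-based and closed-source models.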
Details
Motivation: Existing behavior-control methods are either easily overridden by user instructions or require invasive access to model internals, and they transfer poorly across VLMs, so a universal control method is needed that requires no runtime access and can be deployed with any model.
Method: Proposes VISOR++, which generates universal visual inputs that emulate the steering vectors of multiple VLMs and induce the target activation patterns, so that behavior is steered through the image input alone.
Result: Validated on models such as LLaVA-1.5-7B and IDEFICS2-8B, VISOR++ performs on par with steering vectors along three directions (refusal, sycophancy, and survival instinct), generalizes to unseen models, and preserves 99.9% performance on 14,000 MMLU tasks.
Conclusion: VISOR++ is a deployment-agnostic behavior-control method that needs no access to model internals and transfers across models, offering a practical way to regulate VLM behavior in safety-critical applications.
Abstract: As Vision Language Models (VLMs) are deployed across safety-critical applications, understanding and controlling their behavioral patterns has become increasingly important. Existing behavioral control methods face significant limitations: system prompting approaches could easily be overridden by user instructions, while applying activation-based steering vectors requires invasive runtime access to model internals, precluding deployment with API-based services and closed-source models. Finding steering methods that transfer across multiple VLMs is still an open area of research. To this end, we introduce universal visual input based steering for output redirection (VISOR++), to achieve behavioral control through optimized visual inputs alone. We demonstrate that a single VISOR++ image can be generated for an ensemble of VLMs to emulate each of their steering vectors. By crafting universal visual inputs that induce target activation patterns, VISOR++ eliminates the need for runtime model access while remaining deployment-agnostic. This means that when an underlying model supports multimodal capability, model behaviors can be steered by inserting an image input replacing runtime steering vector based interventions. We first demonstrate the effectiveness of the VISOR++ images on open-access models such as LLaVA-1.5-7B and IDEFICS2-8B along three alignment directions: refusal, sycophancy and survival instinct. Both the model-specific steering images and the jointly optimized images achieve performance parity closely following that of steering vectors for both positive and negative steering tasks. We also show the promise of VISOR++ images in achieving directional behavioral shifts for unseen models including both open-access and closed-access ones. Furthermore, VISOR++ images are able to preserve 99.9% performance on 14,000 unrelated MMLU evaluation tasks.
[104] Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Qinsi Wang,Bo Liu,Tianyi Zhou,Jing Shi,Yueqian Lin,Yiran Chen,Hai Helen Li,Kun Wan,Wentian Zhao
Main category: cs.CV
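TL;DR: Proposes Vision-Zero, a label-free framework that generates competitive visual games from arbitrary image pairs so that vision-language models can improve themselves, achieving sustained gains in reasoning ability without human annotation.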
Details
Motivation: Existing reinforcement learning methods depend on manually constructed and verified datasets, which makes training expensive and limits the practical use of vision-language models. A method is needed that generates its own training data without human annotation, lowering cost while improving reasoning.
Method: Proposes the Vision-Zero framework with three core parts: (1) a strategic self-play framework based on "Who Is the Spy"-style games, in which models generate their own training data through multi-role interaction; (2) game generation from arbitrary images, which improves cross-domain generalization; (3) Iterative Self-Play Policy Optimization (Iterative-SPO), an algorithm that alternates self-play with reinforcement learning with verifiable rewards (RLVR) for sustained improvement.
Result: Using label-free data, Vision-Zero reaches state-of-the-art results on reasoning, chart question answering, and vision-centric understanding tasks, surpasses methods that depend on annotated data, and generalizes well across domains.
Conclusion: Vision-Zero offers a general, sustainable training paradigm for vision-language models that removes the dependence on human-annotated data and provides a feasible path to low-cost, efficient VLM self-improvement.
Abstract: Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
[105] Hybrid Approach for Enhancing Lesion Segmentation in Fundus Images
Mohammadmahdi Eshragh,Emad A. Mohammed,Behrouz Far,Ezekiel Weis,Carol L Shields,Sandor R Ferenczy,Trafford Crump
Main category: cs.CV
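TL;DR: Proposes a hybrid deep learning model that combines mathematical/clustering segmentation models with ideas from U-Net to segment choroidal nevi more precisely in colour fundus photographs; it substantially outperforms Attention U-Net, reaching a Dice coefficient of 89.7% and an IoU of 80.01%.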
Details
Motivation: Existing datasets have low resolution and inconsistent labelling, and deep learning models that depend on large amounts of annotated data are hard to apply widely; diagnosis is difficult for non-specialist clinicians, so better segmentation accuracy is urgently needed to support early diagnosis.
Method: Combines the strengths of mathematical/clustering segmentation methods and the U-Net architecture in a new hybrid segmentation model that reduces reliance on large-scale annotated data, and applies it to high-resolution (1024*1024) fundus images.
Result: On high-resolution fundus images the model reaches a Dice coefficient of 89.7% and an IoU of 80.01%, far above Attention U-Net (51.3% and 34.2%), and it generalizes better on external datasets.
Conclusion: The hybrid model effectively improves choroidal nevus segmentation accuracy and has potential as a clinical diagnostic aid; it can support automated lesion annotation and advance the development of decision support systems.
Abstract: Choroidal nevi are common benign pigmented lesions in the eye, with a small risk of transforming into melanoma. Early detection is critical to improving survival rates, but misdiagnosis or delayed diagnosis can lead to poor outcomes. Despite advancements in AI-based image analysis, diagnosing choroidal nevi in colour fundus images remains challenging, particularly for clinicians without specialized expertise. Existing datasets often suffer from low resolution and inconsistent labelling, limiting the effectiveness of segmentation models. This paper addresses the challenge of achieving precise segmentation of fundus lesions, a critical step toward developing robust diagnostic tools. While deep learning models like U-Net have demonstrated effectiveness, their accuracy heavily depends on the quality and quantity of annotated data. Previous mathematical/clustering segmentation methods, though accurate, required extensive human input, making them impractical for medical applications. This paper proposes a novel approach that combines mathematical/clustering segmentation models with insights from U-Net, leveraging the strengths of both methods. This hybrid model improves accuracy, reduces the need for large-scale training data, and achieves significant performance gains on high-resolution fundus images. The proposed model achieves a Dice coefficient of 89.7% and an IoU of 80.01% on 1024*1024 fundus images, outperforming the Attention U-Net model, which achieved 51.3% and 34.2%, respectively. It also demonstrated better generalizability on external datasets.
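The Dice coefficient and IoU quoted for this paper are standard overlap measures between a predicted mask and a ground-truth mask. The following is a minimal Python sketch of how they are commonly computed for binary masks; the function name `dice_and_iou`, the toy 3×3 masks, and the convention for two empty masks are illustrative assumptions, not details taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]

    # Pixels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    pred_area = sum(1 for p in pred_flat if p)
    target_area = sum(1 for t in target_flat if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty. Treating this as perfect agreement is a
        # convention assumed here, not something the paper specifies.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical toy masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 1],
          [0, 0, 0]]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU). Figures averaged over a whole dataset, as reported scores usually are, need not satisfy that identity exactly.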
This work forms a part of a broader effort to develop a decision support system for choroidal nevus diagnosis, with potential applications in automated lesion annotation to enhance the speed and accuracy of diagnosis and monitoring.
[106] FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology
Faizan Farooq Khan,Yousef Radwan,Eslam Abdelrahman,Abdulwahab Felemban,Aymen Mir,Nico K. Michiels,Andrew J. Temple,Michael L. Berumen,Mohamed Elhoseiny
Main category: cs.CV
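TL;DR: Presents FishNet++, a large-scale multimodal benchmark for evaluating and improving multimodal large language models on fine-grained fish species recognition, and reveals the shortcomings of existing models in marine biology.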
Details
Motivation: The capabilities of existing multimodal large language models in specialized scientific fields such as marine biology remain underexplored; they perform particularly poorly on fine-grained fish species recognition, which limits their use in monitoring marine ecosystems.
Method: Systematically evaluates state-of-the-art multimodal large language models on fish recognition and builds FishNet++, a large-scale multimodal dataset with textual descriptions, key-point annotations, and bounding boxes, to support the development and evaluation of specialized vision-language models.
Result: The best open-source models achieve less than 10% accuracy on fish species recognition; FishNet++ provides 35,133 textual descriptions, 706,426 key-point annotations, and 119,399 bounding boxes, substantially extending existing resources.
Conclusion: FishNet++ provides an important foundation for applying multimodal models in marine biology and indicates that current models lack domain knowledge and need specialized datasets to improve.
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive cross-domain capabilities, yet their proficiency in specialized scientific fields like marine biology remains underexplored. In this work, we systematically evaluate state-of-the-art MLLMs and reveal significant limitations in their ability to perform fine-grained recognition of fish species, with the best open-source models achieving less than 10% accuracy. This task is critical for monitoring marine ecosystems under anthropogenic pressure. To address this gap and investigate whether these failures stem from a lack of domain knowledge, we introduce FishNet++, a large-scale, multimodal benchmark. FishNet++ significantly extends existing resources with 35,133 textual descriptions for multimodal learning, 706,426 key-point annotations for morphological studies, and 119,399 bounding boxes for detection. By providing this comprehensive suite of annotations, our work facilitates the development and evaluation of specialized vision-language models capable of advancing aquatic science.
[107] AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs
Hakan Emre Gedik,Andrew Martin,Mustafa Munir,Oguzhan Baser,Radu Marculescu,Sandeep P. Chinchali,Alan C. Bovik
Main category: cs.CV
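TL;DR: Proposes a cross-attention-based aggregation method and the AttentionViG architecture, which improve image recognition performance and perform well on several downstream tasks.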
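To make the idea of cross-attention aggregation concrete, here is a minimal Python sketch in which a node forms a query and each of its neighbors forms a key. The abstract states only where the query and key projections come from; the value projection taken from the neighbors, the scaled dot-product scoring, the softmax weighting, and every name below (`cross_attention_aggregate`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_aggregate(node, neighbors, w_q, w_k, w_v):
    """Aggregate neighbor features with attention weights.

    The query comes from the node and the keys from its neighbors, as the
    abstract describes. Values from the neighbors are an assumption.
    """
    query = matvec(w_q, node)
    keys = [matvec(w_k, n) for n in neighbors]
    values = [matvec(w_v, n) for n in neighbors]

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)

    dim = len(values[0])
    aggregated = [sum(w * v[i] for w, v in zip(weights, values))
                  for i in range(dim)]
    return aggregated, weights


# Hypothetical 2-D features with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aggregated, weights = cross_attention_aggregate(
    node, neighbors, identity, identity, identity)
```

In this toy setup the neighbor most similar to the node receives the largest weight, so the aggregation adapts to each node rather than applying a fixed rule such as max or mean.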
Details
Motivation: Existing graph convolution methods struggle to capture complex node-neighbor relationships, and a versatile aggregation method that needs no architecture-specific refinements is lacking.
Method: Proposes a cross-attention-based aggregation method in which the node supplies the query projections and its neighbors supply the key projections, and designs a new architecture, AttentionViG, that uses it for non-local message passing.
Result: Achieves SOTA performance on ImageNet-1K and performs strongly on downstream tasks, including object detection and instance segmentation on MS COCO 2017 and semantic segmentation on ADE20K, while remaining efficient: accuracy is competitive at comparable FLOPs.
Conclusion: The proposed cross-attention aggregation method and AttentionViG architecture combine high performance with high efficiency on image recognition and downstream tasks, outperforming existing vision GNNs.
Abstract: Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
[108] MetaChest: Generalized few-shot learning of patologies from chest X-rays
Berenice Montalvo-Lezama,Gibran Fuentes-Pineda
Main category: cs.CV
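TL;DR: Presents MetaChest, a large-scale chest X-ray dataset for studying multi-label classification under generalized few-shot learning, and experimentally compares transfer learning with an extension of ProtoNet.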
Details
Motivation: Annotated data are scarce in medical image analysis, and practical applications often need to recognize new classes while retaining knowledge of old ones; yet existing few-shot learning research concentrates on the standard setting and rarely studies the generalized few-shot scenario.
Method: Builds MetaChest, a large-scale dataset with a meta-learning partition, designs a method for generating multi-label episodes, and evaluates a standard transfer learning approach and a ProtoNet extension on a range of few-shot multi-label tasks.
Result: Increasing the number of classes per episode and the number of examples per class improves performance; the transfer learning approach, although not designed for few-shot learning, consistently outperforms the ProtoNet extension; higher-resolution images improve accuracy at higher computational cost; and efficient architectures match larger models with far fewer resources.
Conclusion: For generalized few-shot multi-label classification of medical images, simple transfer learning outperforms the specialized few-shot model, and model efficiency and image resolution significantly affect performance.
Abstract: The limited availability of annotated data presents a major challenge for applying deep learning methods to medical image analysis. Few-shot learning methods aim to recognize new classes from only a small number of labeled examples. These methods are typically studied under the standard few-shot learning setting, where all classes in a task are new. However, medical applications such as pathology classification from chest X-rays often require learning new classes while simultaneously leveraging knowledge of previously known ones, a scenario more closely aligned with generalized few-shot classification. Despite its practical relevance, few-shot learning has been scarcely studied in this context. In this work, we present MetaChest, a large-scale dataset of 479,215 chest X-rays collected from four public databases. MetaChest includes a meta-set partition specifically designed for standard few-shot classification, as well as an algorithm for generating multi-label episodes. We conduct extensive experiments evaluating both a standard transfer learning approach and an extension of ProtoNet across a wide range of few-shot multi-label classification tasks. Our results demonstrate that increasing the number of classes per episode and the number of training examples per class improves classification performance. Notably, the transfer learning approach consistently outperforms the ProtoNet extension, despite not being tailored for few-shot learning. We also show that higher-resolution images improve accuracy at the cost of additional computation, while efficient model architectures achieve comparable performance to larger models with significantly reduced resource requirements.
[109] K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model
Bangwei Guo,Yunhe Gao,Meng Ye,Difei Gu,Yang Zhou,Leon Axel,Dimitris Metaxas
Main category: cs.CV
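TL;DR: Proposes K-Prism, a unified medical image segmentation framework that integrates semantic priors, in-context knowledge, and interactive feedback through a dual-prompt representation and an MoE decoder, achieving state-of-the-art performance across multiple modalities on 18 public datasets.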
Details
Motivation: Existing models are usually confined to a single task, modality, or knowledge source, whereas clinical practice requires combining several kinds of knowledge (anatomical priors, reference cases, and real-time interaction); a flexible segmentation framework that unifies these knowledge paradigms is therefore needed.
Method: Designs a dual-prompt representation (1-D sparse prompts define what to segment; 2-D dense prompts indicate where to attend) that is dynamically routed through a Mixture-of-Experts (MoE) decoder, supporting the integration of, and flexible switching among, three knowledge paradigms: semantic priors, in-context examples, and interactive feedback.
Result: On 18 public datasets spanning CT, MRI, X-ray, pathology, ultrasound, and other modalities, K-Prism achieves state-of-the-art performance in semantic segmentation, in-context learning, and interactive segmentation.
Conclusion: By modelling multiple knowledge sources in a unified way, K-Prism delivers flexible, general, high-performance medical image segmentation that is closer to real clinical decision-making and has broad application potential.
Abstract: Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present K-Prism, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) semantic priors learned from annotated datasets, (ii) in-context knowledge from few-shot reference examples, and (iii) interactive feedback from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining what to segment and 2-D dense prompts indicating where to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings. Code will be released upon publication.
[110] GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification
Yijia Weng,Zhicheng Wang,Songyou Peng,Saining Xie,Howard Zhou,Leonidas J. Guibas
Main category: cs.CV
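TL;DR: Proposes GaussianLens, a feed-forward Gaussian densification framework that, starting from a low-resolution 3D Gaussian Splatting reconstruction, produces local high-resolution reconstructions on demand from a user-specified region of interest (RoI) and sparse high-resolution observations, balancing computational cost against the need for detail.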
Details
Motivation: Existing 3D Gaussian Splatting reconstruction methods incur high computational cost and scale poorly on high-resolution images, so they cannot fully exploit the detail those images contain; human vision, by contrast, focuses on specific regions, which calls for a method that produces fine detail on demand in key regions.
Method: Proposes GaussianLens, which fuses multi-modal information from the initial 3DGS model and multi-view high-resolution images and uses a pixel-guided densification mechanism to perform local high-resolution reconstruction inside the user-specified RoI, avoiding the redundancy and overhead of global high resolution.
Result: Experiments show superior local detail reconstruction, effective handling of images up to 1024×1024 resolution, and good scalability.
Conclusion: GaussianLens achieves efficient, generalizable local high-resolution reconstruction, addresses the trade-off between detail and efficiency in existing methods, and offers a scene reconstruction solution better aligned with human perception.
Abstract: We perceive our surroundings with an active focus, paying more attention to regions of interest, such as the shelf labels in a grocery store. When it comes to scene reconstruction, this human perception trait calls for spatially varying degrees of detail ready for closer inspection in critical regions, preferably reconstructed on demand. While recent works in 3D Gaussian Splatting (3DGS) achieve fast, generalizable reconstruction from sparse views, their uniform resolution output leads to high computational costs unscalable to high-resolution training. As a result, they cannot leverage available images at their original high resolution to reconstruct details. Per-scene optimization methods reconstruct finer details with adaptive density control, yet require dense observations and lengthy offline optimization. To bridge the gap between the prohibitive cost of high-resolution holistic reconstructions and the user needs for localized fine details, we propose the problem of localized high-resolution reconstruction via on-demand Gaussian densification. Given a low-resolution 3DGS reconstruction, the goal is to learn a generalizable network that densifies the initial 3DGS to capture fine details in a user-specified local region of interest (RoI), based on sparse high-resolution observations of the RoI. This formulation avoids the high cost and redundancy of uniformly high-resolution reconstructions and fully leverages high-resolution captures in critical regions. We propose GaussianLens, a feed-forward densification framework that fuses multi-modal information from the initial 3DGS and multi-view images. We further design a pixel-guided densification mechanism that effectively captures details under large resolution increases. Experiments demonstrate our method's superior performance in local fine detail reconstruction and strong scalability to images of up to 1024×1024 resolution.
[111] LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Zhenyue Qin,Yang Liu,Yu Yin,Jinyu Ding,Haoran Zhang,Anran Li,Dylan Campbell,Xuansheng Wu,Ke Zou,Tiarnan D. L. Keenan,Emily Y. Chew,Zhiyong Lu,Yih-Chung Tham,Ninghao Liu,Xiuzhen Zhang,Qingyu Chen
Main category: cs.CV
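TL;DR: Presents a large-scale multimodal ophthalmology benchmark dataset of 32,633 instances covering 12 common eye conditions and 5 imaging modalities; it supports evaluation on multiple tasks and is used to systematically evaluate 24 state-of-the-art MLLMs.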
Details
Motivation: The application of multimodal large language models in ophthalmology is limited by the lack of comprehensive benchmark datasets suited to evaluating generative models.
Method: Builds a dataset with multi-granular annotations covering imaging, anatomical structures, demographics, and free-text annotations; extends the earlier LMOD benchmark with more data and more task types; and systematically evaluates 24 state-of-the-art multimodal large language models.
Result: In the zero-shot setting, the best-performing models reach about 58% accuracy in disease screening, and performance remains unsatisfactory on harder tasks such as disease staging.
Conclusion: The dataset and evaluation results help advance ophthalmic AI; the dataset, curation pipeline, and leaderboard will be publicly released, with the aim of reducing the global burden of vision-threatening eye diseases.
Abstract: Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.
[112] Anchor-free Cross-view Object Geo-localization with Gaussian Position Encoding and Cross-view Association
Xingtao Ling,Chenlin Fu,Yingying Zhu
Main category: cs.CV
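TL;DR: Proposes AFGeo, an anchor-free method for cross-view object geo-localization that localizes objects by directly predicting, for each pixel, four directional offsets to the ground-truth box, and adds Gaussian position encoding and a cross-view object association module to improve robustness and accuracy.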
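The two ingredients named here, per-pixel directional offsets and a Gaussian encoding of the click point, can be sketched in a few lines of Python. This is an illustration under stated assumptions: image coordinates with y increasing downward, offsets measured in pixels, and an isotropic 2-D Gaussian as the position encoding. The function names `decode_box` and `gaussian_click_map` and the `sigma` parameter are hypothetical and do not come from the paper.

```python
import math


def decode_box(px, py, left, right, top, bottom):
    """Convert one pixel's four directional offsets into a box.

    Returns (x_min, y_min, x_max, y_max), assuming y grows downward so
    that the 'top' offset is subtracted from the pixel's y coordinate.
    """
    return (px - left, py - top, px + right, py + bottom)


def gaussian_click_map(width, height, click_x, click_y, sigma):
    """Build a height x width map that peaks at the click point.

    An isotropic Gaussian is assumed here as one plausible form of a
    Gaussian position encoding for the query click.
    """
    return [
        [
            math.exp(-((x - click_x) ** 2 + (y - click_y) ** 2)
                     / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical values: a pixel at (10, 8) with predicted offsets.
box = decode_box(px=10, py=8, left=4, right=6, top=3, bottom=5)

# Hypothetical 5x5 query map with a click at the centre.
heat = gaussian_click_map(width=5, height=5, click_x=2, click_y=2, sigma=1.0)
```

Because every pixel inside an object can decode its own box this way, no predefined anchor shapes are needed. The soft Gaussian map tolerates a click that lands slightly off the object's centre, which a single hard point would not.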
Details
Motivation: Existing anchor-based cross-view object geo-localization methods are constrained by predefined anchor boxes; they lack flexibility and adaptability and struggle with large cross-view appearance differences and uncertainty in object position.
Method: Proposes AFGeo, which adopts an anchor-free paradigm and directly predicts the four directional offsets from each pixel to the ground-truth box; introduces Gaussian Position Encoding (GPE) to model the positional prior of the click point in the query image; and designs a Cross-view Object Association Module (CVOAM) to strengthen the association between an object and its context across views.
Result: AFGeo achieves state-of-the-art performance on multiple benchmark datasets while remaining lightweight and computationally efficient, supporting the effectiveness of anchor-free methods for cross-view geo-localization.
Conclusion: By removing the dependence on predefined anchor boxes and combining the GPE and CVOAM modules, AFGeo provides a more flexible, robust, and efficient solution for cross-view object geo-localization.
Abstract: Most existing cross-view object geo-localization approaches adopt an anchor-based paradigm. Although effective, such methods are inherently constrained by predefined anchors. To eliminate this dependency, we first propose an anchor-free formulation for cross-view object geo-localization, termed AFGeo. AFGeo directly predicts the four directional offsets (left, right, top, bottom) to the ground-truth box for each pixel, thereby localizing the object without any predefined anchors. To obtain a more robust spatial prior, AFGeo incorporates Gaussian Position Encoding (GPE) to model the click point in the query image, mitigating the uncertainty of object position that challenges object localization in cross-view scenarios. In addition, AFGeo incorporates a Cross-view Object Association Module (CVOAM) that relates the same object and its surrounding context across viewpoints, enabling reliable localization under large cross-view appearance gaps. By adopting an anchor-free localization paradigm that integrates GPE and CVOAM with minimal parameter overhead, our model is both lightweight and computationally efficient, achieving state-of-the-art performance on benchmark datasets.
[113] Generalized Contrastive Learning for Universal Multimodal Retrieval
Jungsoo Lee,Janghoon Cho,Hyojin Park,Munawar Hayat,Kyuwoong Hwang,Fatih Porikli,Sungha Choi
Main category: cs.CV
TL;DR: Proposes GCL, a new contrastive learning method that improves multimodal retrieval performance without additional data curation.
Details
Motivation: Existing cross-modal retrieval models degrade when retrieving fused image-text keys (such as Wikipedia pages), and current methods depend on carefully constructed datasets and generalize poorly to unseen modality combinations.
Method: Proposes Generalized Contrastive Learning (GCL), which applies contrastive learning across all modalities within a mini-batch and uses existing image-text paired datasets to learn a unified representation space.
Result: On the M-BEIR, MMEB, and CoVR benchmarks, GCL significantly improves off-the-shelf multimodal retrieval models (such as VISTA, CLIP, and TinyCLIP).
Conclusion: GCL is a general loss function that improves multimodal retrieval without new data annotation and generalizes well.
Abstract: Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of fused image-text modality (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has been recently explored to develop a unified single retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving a pair of image and text given a query image). However, such an approach requires careful curation to ensure the dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.
[114] Using Images from a Video Game to Improve the Detection of Truck Axles
Leandro Arab Marcomini,Andre Luiz Cunha
Main category: cs.CV
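TL;DR: Investigates whether synthetic images extracted from a video game can be used to train convolutional neural networks (CNNs) to detect real-life truck axles; the results show that synthetic images improve model performance, with the highest mAP reaching 99%.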
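This study's evaluation rests on recall, precision, F1-score, and a Mann-Whitney U test applied to mAP values. As a rough illustration, the Python sketch below computes the first three metrics from detection counts and the U statistic for two small samples. All numbers are invented for the example and are not results from the paper; the function names are hypothetical, and the sketch stops at the U statistic rather than deriving a p-value.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from detection counts.

    tp: correct detections, fp: spurious detections, fn: missed objects.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mann_whitney_u(sample_a, sample_b):
    """Return the U statistic for sample_a versus sample_b.

    Counts the pairs (a, b) with a > b, with ties contributing 0.5.
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Invented counts: 90 axles found, 10 false alarms, 30 axles missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)

# Invented mAP values for models trained with and without synthetic images.
map_with_synthetic = [0.95, 0.97, 0.99]
map_real_only = [0.90, 0.92, 0.96]
u_with = mann_whitney_u(map_with_synthetic, map_real_only)
u_without = mann_whitney_u(map_real_only, map_with_synthetic)
```

The two U values always sum to the product of the sample sizes. In practice a library routine such as `scipy.stats.mannwhitneyu` would be used to obtain the p-value as well.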
Details
Motivation: Because collecting real data is expensive, the study explores whether low-cost, easily obtained synthetic images generated by a video game are viable as training data.
Method: Builds three databases of real and synthetic truck images to train and test three different YOLO architectures; evaluates performance with four metrics (recall, precision, F1-score, and mAP); and applies the Mann-Whitney U test to assess the statistical significance of the results.
Result: Synthetic images significantly improved the performance of all networks; the highest mAP reached 99%, and the results were statistically significant.
Conclusion: Synthetic images from video games can serve as a reliable, low-cost source of training data for neural networks and are suitable for real-world object detection tasks.
Abstract: Convolutional Neural Networks (CNNs) traditionally require large amounts of data to train models with good performance. However, data collection is an expensive process, both in time and resources. Generated synthetic images are a good alternative, with video games producing realistic 3D models. This paper aims to determine whether images extracted from a video game can be effectively used to train a CNN to detect real-life truck axles. Three different databases were created, with real-life and synthetic trucks, to provide training and testing examples for three different You Only Look Once (YOLO) architectures. Results were evaluated based on four metrics: recall, precision, F1-score, and mean Average Precision (mAP). To evaluate the statistical significance of the results, the Mann-Whitney U test was also applied to the resulting mAP of all models. Synthetic images from trucks extracted from a video game proved to be a reliable source of training data, contributing to the performance of all networks. The highest mAP score reached 99%. Results indicate that synthetic images can be used to train neural networks, providing a reliable, low-cost data source for extracting knowledge.
[115] DescribeEarth: Describe Anything for Remote Sensing Images
Kaiyu Li,Zixuan Jiang,Xiangyong Cao,Jiayu Wang,Yuchen Xiao,Deyu Meng,Zhi Wang
Main category: cs.CV
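TL;DR: Proposes the Geo-DLC task, which targets object-level fine-grained text description of remote sensing images; builds the DE-Dataset dataset and the DE-Benchmark evaluation suite; and designs DescribeEarth, a multimodal large language model dedicated to the task that outperforms existing models in several respects.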
Details
Motivation: Existing remote sensing image captioning methods focus on the image level and lack a fine-grained understanding of object-level semantic and structural information, which limits the practical potential of remote sensing images.
Method: Proposes the Geo-DLC task; builds DE-Dataset, with 25 categories and more than 260,000 annotated instances; develops DE-Benchmark, an evaluation suite based on LLM-assisted question answering; and designs the DescribeEarth model, which introduces a scale-adaptive focal strategy and a domain-guided fusion module to better capture high-resolution detail and remote sensing priors.
Result: DescribeEarth significantly outperforms existing general-purpose multimodal large models on DE-Benchmark, showing higher factual accuracy, descriptive richness, and grammatical correctness, and it performs well across simple, complex, and out-of-distribution remote sensing scenes.
Conclusion: Geo-DLC offers a new fine-grained description paradigm for remote sensing image understanding, and DescribeEarth improves object-level description quality, advancing automated interpretation of remote sensing images.
Abstract: Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted question-answering based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
[116] YOLO-Based Defect Detection for Metal Sheets
Po-Heng Chou,Chun-Chi Wang,Wei-Lung Mao
Main category: cs.CV
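TL;DR: Proposes an automatic defect detection method based on YOLOv9 and ConSinGAN for surface defect inspection in industrial manufacturing; it reaches 91.3% accuracy with a detection time of 146 ms and is integrated into an AOI system.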
Details
Motivation: Addresses the time-consuming, labor-intensive nature of manual defect inspection in industrial manufacturing, and the loss of detection accuracy caused by a shortage of metal sheet images.
Method: Uses ConSinGAN for data augmentation to generate more metal sheet defect images; evaluates four YOLO models (v3, v4, v7, and v9) for defect detection; and finally selects YOLOv9 and integrates it with a SCADA system to build an automated optical inspection system.
Result: YOLOv9 with ConSinGAN reaches 91.3% detection accuracy with a detection time of 146 ms, outperforming the other YOLO versions, and is successfully integrated into a real manufacturing system.
Conclusion: The method improves the accuracy and efficiency of industrial defect detection, is practical and extensible, and can be applied to the automatic inspection of other industrial components.
Abstract: In this paper, we propose a YOLO-based deep learning (DL) model for automatic defect detection to solve the time-consuming and labor-intensive tasks in industrial manufacturing. In our experiments, the images of metal sheets are used as the dataset for training the YOLO model to detect the defects on the surfaces and in the holes of metal sheets. However, the lack of metal sheet images significantly degrades detection accuracy. To address this issue, ConSinGAN is used to generate a considerable amount of data. Four versions of the YOLO model (i.e., YOLOv3, v4, v7, and v9) are combined with ConSinGAN for data augmentation. The proposed YOLOv9 model with ConSinGAN outperforms the other YOLO models with an accuracy of 91.3% and a detection time of 146 ms. The proposed YOLOv9 model is integrated into manufacturing hardware and a supervisory control and data acquisition (SCADA) system to establish a practical automated optical inspection (AOI) system. Additionally, the proposed automated defect detection is easily applied to other components in industrial manufacturing.
[117] OmniDFA: A Unified Framework for Open Set Synthesis Image Detection and Few-Shot Attribution
Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang
Main category: cs.CV
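TL;DR: Proposes OmniDFA, a new framework for AI-generated image detection and source model attribution that combines open-set and few-shot learning to identify unseen generators efficiently, and builds OmniFake, a large-scale synthetic image dataset, substantially improving the generalization and practicality of AIGI detection and attribution.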
Details
Motivation: Existing AI-generated image detection methods tend to overfit specific forgery traits, and source model attribution is constrained by the lack of large-scale, well-categorized synthetic image datasets, making it hard to meet the practical needs of both detection and attribution.
Method: Proposes a new open-set, few-shot source identification paradigm and designs the OmniDFA framework, which combines image authenticity assessment with source model attribution; builds OmniFake, a large-scale dataset of 1.17 million images from 45 generative models, to support few-shot source identification and detection.
Result: OmniDFA performs strongly on open-set attribution and achieves state-of-the-art generalization on AIGI detection, supporting its effectiveness and robustness in real-world scenarios.
Conclusion: The proposed OmniDFA framework and OmniFake dataset provide a scalable, practical new solution for detecting AI-generated images and attributing their sources, advancing deepfake governance technology.
Abstract: AI-generated image (AIGI) detection and source model attribution remain central challenges in combating deepfake abuses, primarily due to the structural diversity of generative models. Current detection methods are prone to overfitting specific forgery traits, whereas source attribution offers a robust alternative through fine-grained feature discrimination. However, synthetic image attribution remains constrained by the scarcity of large-scale, well-categorized synthetic datasets, limiting its practicality and compatibility with detection systems. In this work, we propose a new paradigm for image attribution called open-set, few-shot source identification. This paradigm is designed to reliably identify unseen generators using only limited samples, making it highly suitable for real-world application. To this end, we introduce OmniDFA (Omni Detector and Few-shot Attributor), a novel framework for AIGI that not only assesses the authenticity of images, but also determines the synthesis origins in a few-shot manner. To facilitate this work, we construct OmniFake, a large class-aware synthetic image dataset that curates 1.17M images from 45 distinct generative models, substantially enriching the foundational resources for research on both AIGI detection and attribution. Experiments demonstrate that OmniDFA exhibits excellent capability in open-set attribution and achieves state-of-the-art generalization performance on AIGI detection. Our dataset and code will be made available.
[118] AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Xiping Li,Jianghong Ma
Main category: cs.CV
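TL;DR: Proposes AIMCoT, an active information-driven multimodal chain-of-thought framework that substantially improves vision-language reasoning through context-enhanced attention-map generation, active visual probing, and a dynamic attention-shifting trigger.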
Details
Motivation: Existing methods rely on unreliable attention maps and passive information selection strategies, which fail to meet the model's cognitive need for information during multimodal reasoning.
Method: Proposes the AIMCoT framework with three components: Context-enhanced Attention-map Generation (CAG), Active Visual Probing (AVP), and Dynamic Attention-shifting Trigger (DAT), combining information theory with attention mechanisms for active information acquisition and dynamic control of reasoning.
Result: Experiments on three challenging benchmarks show that AIMCoT significantly outperforms state-of-the-art methods across settings.
Conclusion: By actively seeking information and dynamically structuring its reasoning process, AIMCoT advances more robust, efficient, and human-like multimodal reasoning.
Abstract: Multimodal Chain-of-Thought (CoT) has emerged as a powerful technique for enhancing vision-language reasoning with interleaved information. However, existing methods often rely on simplistic heuristics for constructing interleaved CoT, typically depending on attention maps, which our empirical analysis reveals can be unreliable. What's more, the shortcomings of their passive and purposeless selection strategies and their arbitrary triggering mechanisms in capturing the model's cognitive need for information are further amplified. In this paper, we propose AIMCoT, an Active Information-driven Multi-modal Chain-of-Thought framework that addresses these fundamental limitations. AIMCoT introduces three synergistic components: (1) Context-enhanced Attention-map Generation (CAG), which mitigates the text-vision granularity imbalance, thereby producing more reliable attention maps as a foundation. (2) Active Visual Probing (AVP), which replaces passive selection with a proactive, goal-oriented strategy grounded in information theory to select image regions that help answer the questions maximally. (3) Dynamic Attention-shifting Trigger (DAT), which intelligently determines the optimal moments to insert visual information by monitoring the model's text-to-vision attention shifts. Extensive experiments on three challenging benchmarks demonstrate that AIMCoT significantly outperforms state-of-the-art methods across different settings. By actively foraging for information and dynamically structuring its reasoning process, AIMCoT represents a critical step towards more robust, effective, and human-like multimodal reasoning. Our code is available at https://anonymous.4open.science/r/AIMCoT.
[119] How Diffusion Models Memorize
Juyeop Kim,Songkuk Kim,Jong-Seok Lee
Main category: cs.CV
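TL;DR: Studies how diffusion models memorize training data and finds that memorization is driven mainly by overestimation of training samples during early denoising, which reduces diversity, collapses denoising trajectories, and accelerates convergence toward the memorized image.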
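The paper links this overestimation to classifier-free guidance amplifying the model's predictions. For readers unfamiliar with that step, the Python sketch below shows the guidance rule as it is commonly written, where the guided noise prediction extrapolates from the unconditional prediction toward the conditional one. The function name, the `guidance_scale` parameter name, and the two-element toy vectors are illustrative assumptions; the paper's own notation and analysis may differ.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    Uses the common form: eps_uncond + scale * (eps_cond - eps_uncond).
    A scale of 1 returns the conditional prediction unchanged; larger
    scales amplify the difference between the two predictions.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (invented values, two components only).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
unamplified = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=1.0)
```

With a scale well above 1, the guided prediction lies far beyond the conditional one along the same direction, which is the kind of amplification the abstract associates with overestimation.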
Details
Motivation: Despite their success in image generation, diffusion models can memorize training data, raising privacy and copyright concerns. Prior work has not explained the root cause, and this paper aims to uncover the mechanism behind memorization in diffusion models.
Method: Revisits the diffusion and denoising process, analyzes latent space dynamics, decomposes intermediate latents to study how initial randomness is replaced by memorized content, and examines how deviations from the denoising schedule relate to memorization severity.
Result: (1) Memorization cannot be explained by overfitting alone; classifier-free guidance amplifies predictions and induces overestimation. (2) Memorized prompts inject training images into noise predictions, forcing latent trajectories to converge toward their paired samples. (3) A decomposition of intermediate latents shows that initial randomness is quickly suppressed, and deviations from the theoretical denoising schedule correlate almost perfectly with memorization severity.
Conclusion: Early overestimation is the fundamental mechanism of memorization in diffusion models, a finding that provides a theoretical basis for mitigating memorization.
Abstract: Despite their success in image generation, diffusion models can memorize training data, raising serious privacy and copyright concerns. Although prior work has sought to characterize, detect, and mitigate memorization, the fundamental question of why and how it occurs remains unresolved. In this paper, we revisit the diffusion and denoising process and analyze latent space dynamics to address the question: "How do diffusion models memorize?" We show that memorization is driven by the overestimation of training samples during early denoising, which reduces diversity, collapses denoising trajectories, and accelerates convergence toward the memorized image. Specifically: (i) memorization cannot be explained by overfitting alone, as training loss is larger under memorization due to classifier-free guidance amplifying predictions and inducing overestimation; (ii) memorized prompts inject training images into noise predictions, forcing latent trajectories to converge and steering denoising toward their paired samples; and (iii) a decomposition of intermediate latents reveals how initial randomness is quickly suppressed and replaced by memorized content, with deviations from the theoretical denoising schedule correlating almost perfectly with memorization severity. Together, these results identify early overestimation as the central underlying mechanism of memorization in diffusion models.
[120] ProbMed: A Probabilistic Framework for Medical Multimodal Binding
Yuan Gao,Sangwook Kim,Jianzhong You,Chris McIntosh
Main category: cs.CV
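TL;DR: This paper proposes ProbMED, a multimodal medical vision-language pretraining model based on probabilistic contrastive learning. By modeling distributions over embeddings rather than deterministic estimates, it integrates multiple medical modalities (chest X-rays, electrocardiograms, echocardiograms, and clinical text) and outperforms existing methods on cross-modality retrieval and classification.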
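As a rough illustration of contrastive learning over embedding distributions, the sketch below computes the closed-form squared Hellinger distance between two diagonal Gaussians and plugs it into an InfoNCE-style loss. The entry does not say how the two are combined, so using the negative squared distance as the similarity, the `temperature` value, and both function names are assumptions rather than the ProbMED implementation.

```python
import math


def hellinger_sq(mu1, var1, mu2, var2):
    """Squared Hellinger distance between two diagonal Gaussians (closed form)."""
    log_bc = 0.0  # log of the Bhattacharyya coefficient, summed over dimensions
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        log_bc += 0.5 * math.log(2.0 * math.sqrt(v1 * v2) / s)
        log_bc -= 0.25 * (m1 - m2) ** 2 / s
    return 1.0 - math.exp(log_bc)


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss with similarity = -squared Hellinger distance (assumed)."""
    a_mu, a_var = anchor
    logits = [-hellinger_sq(a_mu, a_var, mu, var) / temperature
              for mu, var in candidates]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[positive_index]


# Hypothetical 2-D embeddings given as (mean, variance) pairs.
ecg = ([0.0, 0.0], [1.0, 1.0])
report_match = ([0.1, 0.0], [1.0, 1.0])
report_other = ([3.0, 3.0], [1.0, 1.0])
loss = info_nce(ecg, [report_match, report_other], positive_index=0)
```

The loss is small when the paired embedding distribution overlaps the anchor far more than the other candidates do.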
Details
Motivation: Existing medical vision-language pretraining models do not adequately account for the many-to-many mappings among modalities that are common in medical data, making it hard to model the complex relationships between modalities.
Method: Proposes ProbMED, which uses probabilistic contrastive learning to align four medical modalities in a unified probabilistic embedding space; an InfoNCE loss with Hellinger distance integrates inter-modality distributions, and a probabilistic synthetic sampling loss strengthens intra-modality binding.
Result: Experiments on 13 medical datasets show that ProbMED outperforms current state-of-the-art Med-VLPMs on cross-modality retrieval and on zero-shot and few-shot classification, and integrates multiple modalities more effectively, notably for prognostication.
Conclusion: Probabilistic modeling improves the fusion of multimodal medical information and offers more reliable AI support for complex clinical decisions.
Abstract: Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multimodal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than deterministic estimates. ProbMED aligns four distinct modalities -- chest X-rays, electrocardiograms, echocardiograms, and clinical text -- into a unified probabilistic embedding space. We use InfoNCE loss with Hellinger distance to integrate inter-modality distributions. We introduce a probabilistic synthetic sampling loss that captures modality-specific mean and variance to improve intra-modality binding. Extensive experiments across 13 medical datasets demonstrate that our model outperforms current Med-VLPMs in cross-modality retrieval, zero-shot, and few-shot classification. We also demonstrate the robust integration of multiple modalities for prognostication, showing improved intra- and inter-medical modality binding.
[121] Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization
Xintong Li,Chuhan Wang,Junda Wu,Rohan Surana,Tong Yu,Julian McAuley,Jingbo Shang
Main category: cs.CV
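TL;DR: This paper proposes MISP-DPO, the first framework to introduce multiple semantically diverse negative samples into multimodal direct preference optimization, improving vision-language alignment through the Plackett-Luce model and an analysis of semantic deviations in CLIP space.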
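To make the Plackett-Luce model named in the summary concrete, here is a minimal sketch of its negative log-likelihood for a ranked list of scores. In a DPO-style setup the scores would typically be scaled log-probability ratios against a reference model; that mapping, the function name, and the example scores are assumptions, and the importance sampling step is not shown.

```python
import math


def plackett_luce_nll(scores_best_first):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores_best_first` lists one real-valued score per candidate, ordered
    from most preferred to least preferred.
    """
    nll = 0.0
    for i in range(len(scores_best_first) - 1):
        remaining = scores_best_first[i:]
        m = max(remaining)
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll += log_z - scores_best_first[i]  # -log P(item i chosen from the rest)
    return nll


# Hypothetical scores: one positive image followed by three negatives.
loss = plackett_luce_nll([2.0, 0.5, 0.1, -1.0])
```

With exactly two candidates the expression reduces to the pairwise logistic loss used in standard DPO, which is why a multi-negative objective can be seen as a generalization of the pairwise case.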
Details
Motivation: Existing multimodal DPO methods based on pairwise comparison use a single, simple negative sample, which fails to capture complex multimodal preferences and leads to optimization bias and hallucinations.
Method: Embeds prompts and candidate images in CLIP space and uses a sparse autoencoder to extract interpretable semantic deviation factors; selects multiple negatives based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity; and optimizes multi-negative preferences with a Plackett-Luce objective and importance sampling.
Result: Experiments on five benchmarks show that MISP-DPO improves multimodal alignment over prior methods, validating the semantic-aware multi-negative strategy.
Conclusion: Introducing multiple semantically diverse negatives together with Plackett-Luce modeling improves the training quality and alignment performance of multimodal DPO.
Abstract: Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval, which fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
[122] SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition
Shunpeng Chen,Changwei Wang,Rongtao Xu,Xingtian Pei,Yukun Song,Jinzhou Lin,Wenhao Xu,Jingyi Zhang,Li Guo,Shibiao Xu
Main category: cs.CV
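TL;DR: This paper proposes SAGE, a new visual place recognition method that jointly optimizes local feature aggregation, training sample organization, and hard sample mining, achieving state-of-the-art performance on multiple benchmarks.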
Details
Motivation: Existing methods neglect the dynamic interplay between spatial context and visual similarity during training, which hurts performance under large appearance, viewpoint, and environmental variation.
Method: Introduces a Soft Probing module that learns residual weights for local descriptors, builds an online geo-visual graph that fuses geographic proximity with current visual similarity, and expands clusters from high-affinity anchors using a greedy weighted clique expansion sampler.
Result: Achieves 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively, and 100% Recall@10 on SPED using 4096D global descriptors.
Conclusion: SAGE's unified training framework substantially improves visual place recognition, particularly under challenging variations.
Abstract: Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, sample organization during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED using only 4096D global descriptors. Code and model will be available at: https://github.com/chenshunpeng/SAGE.
[123] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing
Zhenghao Zhang,Ziying Zhang,Junchao Liao,Xiangyu Meng,Qiang Hu,Siyu Zhu,Xiaoyun Zhang,Long Qin,Weizhi Wang
Main category: cs.CV
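TL;DR: Proposes LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. By discretizing landmark coordinates, unifying positional encoding, and predicting landmarks with a vision-language model, it markedly improves identity preservation and semantic consistency.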
Details
Motivation: Existing instruction-based multimodal face editing methods fall short on precise attribute control and identity preservation. When landmarks differ substantially from the source image (for example, under large expression or pose changes), treating them as rigid geometric constraints tends to distort identity.
Method: (1) A landmark tokenizer quantizes raw landmark coordinates into discrete facial tokens, avoiding pixel-wise correspondence. (2) A location-mapping positional encoding fuses facial and image tokens for efficient, decoupled geometry-appearance interaction. (3) A landmark predictor built on a vision-language model uses structured chain-of-thought to improve estimation accuracy and interactive control. The HFL-150K dataset is also built to mitigate data scarcity.
Result: Experiments on HFL-150K show that LaTo outperforms existing methods by 7.8% in identity preservation and 4.6% in semantic consistency.
Conclusion: Through tokenization and structured decoding, LaTo achieves more precise, better identity-preserving instruction-driven face editing under challenging conditions, and releases a large-scale dataset to advance the field.
Abstract: Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapping positional encoding that integrates facial and image tokens for unified processing, enabling flexible yet decoupled geometry-appearance interactions with high efficiency and strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency.
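The idea of quantizing raw landmark coordinates into discrete tokens can be pictured with a simple uniform grid. The sketch below is only an illustration under stated assumptions: the grid size `bins`, the row-major token layout, and both function names are hypothetical, and the entry does not describe LaTo's actual quantization scheme.

```python
def tokenize_landmarks(landmarks, width, height, bins=64):
    """Map (x, y) pixel coordinates to token ids on a bins x bins grid."""
    tokens = []
    for x, y in landmarks:
        xb = min(bins - 1, max(0, int(x / width * bins)))
        yb = min(bins - 1, max(0, int(y / height * bins)))
        tokens.append(yb * bins + xb)  # row-major cell index
    return tokens


def detokenize_landmarks(tokens, width, height, bins=64):
    """Recover the centre of each grid cell; quantization error is lost."""
    points = []
    for t in tokens:
        yb, xb = divmod(t, bins)
        points.append(((xb + 0.5) * width / bins, (yb + 0.5) * height / bins))
    return points


# Hypothetical landmarks on a 512 x 512 image.
tokens = tokenize_landmarks([(0, 0), (511, 511), (256, 128)], 512, 512)
```

A coarser grid gives a smaller vocabulary at the cost of larger quantization error, which is the basic trade-off any such tokenizer has to settle.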
Code and dataset will be made publicly available upon acceptance.
[124] The 1st Solution for MOSEv1 Challenge on LSVOS 2025: CGFSeg
Tingmin Li,Yixuan Li,Yang Yang
Main category: cs.CV
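TL;DR: This paper presents CGFSeg, an improved video object segmentation method that took first place in the MOSEv1 Challenge, demonstrating its effectiveness in complex real-world scenes.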
Details
Motivation: Improve the robustness of video object segmentation models in complex real-world scenes, especially under long-term object disappearance and reappearance and in the presence of small, inconspicuous objects.
Method: During training, the SAM2 feature extractor is frozen and the remaining components are fine-tuned; at inference, a pixel-check strategy combines the strengths of multiple models to progressively refine predictions.
Result: Achieves a J&F score of 86.37% on the MOSEv1 Challenge test set, ranking first.
Conclusion: CGFSeg improves the accuracy and robustness of video object segmentation in complex scenes.
Abstract: Video Object Segmentation (VOS) aims to track and segment specific objects across entire video sequences, yet it remains highly challenging under complex real-world scenarios. The MOSEv1 and LVOS datasets, adopted in the MOSEv1 Challenge at LSVOS 2025, are specifically designed to enhance the robustness of VOS models in complex real-world scenarios, including long-term object disappearances and reappearances, as well as the presence of small and inconspicuous objects. In this paper, we present our improved method, Confidence-Guided Fusion Segmentation (CGFSeg), for the VOS task in the MOSEv1 Challenge. During training, the feature extractor of SAM2 is frozen, while the remaining components are fine-tuned to preserve strong feature extraction ability and improve segmentation accuracy. In the inference stage, we introduce a pixel-check strategy that progressively refines predictions by exploiting complementary strengths of multiple models, thereby yielding robust final masks. As a result, our method achieves a J&F score of 86.37% on the test set, ranking 1st in the MOSEv1 Challenge at LSVOS 2025. These results highlight the effectiveness of our approach in addressing the challenges of the VOS task in complex scenarios.
[125] LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion
Donghwan Kim,Tae-Kyun Kim
Main category: cs.CV
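TL;DR: Proposes an image-conditioned human pose and shape generation method based on an SO(3) diffusion model. A transformer extracts per-joint latent features and an MLP denoising model learns each joint's pose distribution, modeling the uncertainty of 3D human pose given 2D observations.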
Details
Motivation: Recovering 3D human pose from a single RGB image is ambiguous. Most existing HMR methods produce a deterministic output, while probabilistic methods face a trade-off between accuracy and diversity.
Method: Proposes a new SO(3) diffusion model that uses a transformer to capture the hierarchical structure of body joints and conditioning dropout to enable image-conditioned generation; the transformer extracts time-independent per-joint latent vectors, from which an MLP denoising network learns each joint's pose distribution.
Result: Experiments show that the model predicts accurate pose probability distributions, handles ambiguity while maintaining high accuracy, and yields single predictions that outperform existing probabilistic methods.
Conclusion: By modeling a pose distribution aligned with 2D observations, the method overcomes the diversity-accuracy trade-off of earlier probabilistic approaches and improves the robustness and accuracy of pose inference in human mesh recovery.
Abstract: We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches regress a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models a distribution well aligned with 2D observations. In particular, we introduce an $SO(3)$ diffusion model, which generates the distribution of pose parameters, represented as 3D rotations, both unconditionally and conditionally on image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using a transformer. Instead of using the transformer as a denoising model, the time-independent transformer extracts latent vectors for the joints, and a small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector. We experimentally demonstrate and analyze that our model effectively predicts an accurate pose probability distribution.
[126] Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
Xinyu Pu,Hongsong Wang,Jie Gui,Pan Zhou
Main category: cs.CV
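TL;DR: Proposes GeoDrag, a geometry-guided interactive image editing method that combines 3D geometry with 2D priors to achieve high-fidelity, structure-consistent drag-based editing.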
Details
Motivation: Existing drag-based image editing methods operate mainly on the 2D pixel plane and make little use of 3D geometric information, producing imprecise and inconsistent edits in geometry-sensitive scenarios such as rotations and perspective transformations.
Method: Proposes GeoDrag, which jointly encodes 3D geometric cues and 2D spatial priors in a unified displacement field, and introduces a conflict-free partitioning strategy that isolates editing regions to resolve conflicts in multi-point dragging, enabling coherent, high-fidelity editing in a single forward pass.
Result: Experiments show that GeoDrag outperforms existing methods across a variety of editing scenarios, with higher precision, structural consistency, and reliable multi-point editability.
Conclusion: GeoDrag effectively fuses 3D geometry with 2D image editing, improving editing quality and controllability under complex geometric transformations.
Abstract: Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method, GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. The code will be available at https://github.com/xinyu-pu/GeoDrag.
[127] IPDRecon: Image-Plane Geometric Decoding for View-Invariant Indoor Scene Reconstruction
Mingyang Li,Yimeng Fan,Changsong Liu,Tianyu Zhou,Xin Wang,Yanyan Liu,Wei Zhang
Main category: cs.CV
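TL;DR: This paper proposes IPDRecon, an indoor 3D scene reconstruction framework based on image-plane decoding. By reducing reliance on multi-view geometric constraints and exploiting the spatial information within individual views, it substantially improves reconstruction stability and quality when views are limited.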
Details
Motivation: Existing volume-based indoor scene reconstruction methods rely on intersections of multi-view pixel back-projection rays as weak geometric constraints, so they perform poorly in sparsely viewed or unobserved regions and are sensitive to input view density. Methods that depend less on inter-view geometry are therefore needed.
Method: Proposes the IPDRecon framework with three core modules: a Pixel-level Confidence Encoder (PCE), an Affine Compensation Module (ACM), and an Image-Plane Spatial Decoder (IPSD). The framework decodes 3D structural information from 2D images through the physical imaging process, preserving geometric features such as edges, hollow structures, and complex textures, and strengthening view-invariant reconstruction.
Result: Experiments on ScanNetV2 show that IPDRecon maintains nearly identical reconstruction quality even when the number of views is reduced by 40%, with a coefficient of variation of only 0.24%, a performance retention rate of 99.7%, and a maximum performance drop of only 0.42%.
Conclusion: Exploiting the spatial information within individual views improves robustness and stability in view-limited scenes, and IPDRecon offers a reliable solution for practical settings with few views.
Abstract: Volume-based indoor scene reconstruction methods demonstrate significant research value due to their superior generalization capability and real-time deployment potential. However, existing methods rely on multi-view pixel back-projection ray intersections as weak geometric constraints to determine spatial positions, causing reconstruction quality to depend heavily on input view density with poor performance in overlapping regions and unobserved areas. To address these issues, the key lies in reducing dependency on inter-view geometric constraints while exploiting rich spatial information within individual views. We propose IPDRecon, an image-plane decoding framework comprising three core components: Pixel-level Confidence Encoder (PCE), Affine Compensation Module (ACM), and Image-Plane Spatial Decoder (IPSD). These modules collaboratively decode 3D structural information encoded in 2D images through physical imaging processes, effectively preserving spatial geometric features including edges, hollow structures, and complex textures while significantly enhancing view-invariant reconstruction. Experiments on ScanNetV2 confirm that IPDRecon achieves superior reconstruction stability, maintaining nearly identical quality when view count reduces by 40%. The method achieves a coefficient of variation of only 0.24%, performance retention rate of 99.7%, and maximum performance drop of merely 0.42%.
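The three stability figures quoted above can be computed from a list of quality scores measured at different view counts. The sketch below shows one plausible set of definitions; the entry does not state the exact formulas, so the use of the population standard deviation, the mean-based retention rate, and the example scores are all assumptions.

```python
import statistics


def stability_metrics(scores, full_view_score):
    """Summarize how stable a quality score is as input views are removed.

    scores: quality measured at each reduced view count (higher is better).
    full_view_score: quality measured with all views available.
    """
    mean = statistics.fmean(scores)
    return {
        # Spread of the scores relative to their mean.
        "coefficient_of_variation": statistics.pstdev(scores) / mean,
        # Average quality kept relative to the full-view setting (assumed).
        "retention_rate": mean / full_view_score,
        # Worst-case relative loss against the full-view setting (assumed).
        "max_drop": (full_view_score - min(scores)) / full_view_score,
    }


# Hypothetical F-scores at several view counts, not values from the paper.
metrics = stability_metrics([0.700, 0.699, 0.698, 0.697], full_view_score=0.700)
```

Under these definitions the retention rate and the maximum drop need not sum to 100%, since one averages over settings and the other takes the worst case.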
These results demonstrate that exploiting intra-view spatial information provides a robust solution for view-limited scenarios in practical applications.
[128] FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
Siddhant Sukhani,Yash Bhardwaj,Riya Bhadani,Veer Kejriwal,Michael Galarnyk,Sudheer Chava
Main category: cs.CV
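TL;DR: This paper evaluates multimodal large language models on topic-aligned captioning for financial short-form videos, testing joint reasoning over transcript, audio, and video modalities, and establishes the first baselines for this domain.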
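The evaluation covers every non-empty combination of the three modalities. A short sketch of how those seven combinations can be enumerated follows; the single-letter codes T, A, and V are the ones used in this entry's abstract, while the function name is an illustrative choice.

```python
from itertools import combinations


def modality_combinations(modalities="TAV"):
    """List every non-empty subset of the modalities, smallest subsets first."""
    combos = []
    for size in range(1, len(modalities) + 1):
        combos.extend("".join(c) for c in combinations(modalities, size))
    return combos


print(modality_combinations())
# ['T', 'A', 'V', 'TA', 'TV', 'AV', 'TAV']
```

For n modalities this yields 2^n - 1 settings, so adding a fourth modality would raise the count from 7 to 15.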
Details
Motivation: Improving caption accuracy for financial short-form videos requires effectively fusing multimodal information and understanding topic consistency.
Method: Using 624 annotated YouTube short videos, evaluates seven modality combinations (T, A, V, TA, TV, AV, TAV) on five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition.
Result: Video alone performs strongly on four of the five topics; selected pairs such as TV or AV often outperform the three-modality combination TAV, suggesting that too many modalities may introduce noise.
Conclusion: The study establishes the first baselines for financial short-form video captioning, shows both the potential and the challenges of multimodal fusion, and highlights the role of the video modality in capturing visual cues.
Abstract: We evaluate multimodal large language models (MLLMs) for topic-aligned captioning in financial short-form videos (SVs) by testing joint reasoning over transcripts (T), audio (A), and video (V). Using 624 annotated YouTube SVs, we assess all seven modality combinations (T, A, V, TA, TV, AV, TAV) across five topics: main recommendation, sentiment analysis, video purpose, visual analysis, and financial entity recognition. Video alone performs strongly on four of five topics, underscoring its value for capturing visual context and effective cues such as emotions, gestures, and body language. Selective pairs such as TV or AV often surpass TAV, implying that too many modalities may introduce noise. These results establish the first baselines for financial short-form video captioning and illustrate the potential and challenges of grounding complex visual cues in this domain. All code and data can be found on our GitHub under the CC-BY-NC-SA 4.0 license.
[129] Dolphin v1.0 Technical Report
Taohan Weng,Chi zhang,Chaoran Yan,Siya Liu,Xiaoyang Liu,Yalun Wu,Boyang Wang,Boyan Wang,Jiren Ren,Kaiwen Yan,Jinze Yu,Kaibing Hu,Henan Liu,Haoyun zheng,Anjie Le,Hongcheng Guo
Main category: cs.CV
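TL;DR: This paper presents Dolphin v1.0 and Dolphin R1, the first large-scale multimodal ultrasound foundation models. A three-stage training strategy yields strong performance across diverse clinical tasks within a unified framework, with leading results on U2-Bench.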
Details
Motivation: Ultrasound imaging suffers from operator dependence, image noise, and real-time scanning constraints, which limit AI integration; existing large models struggle with the complexities of ultrasound.
Method: Curates a multimodal dataset of 2 million samples combining textbook knowledge, public data, synthetic samples, and general corpora, and applies a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement.
Result: Dolphin R1 achieves a U2-score of 0.5835 across the eight U2-Bench tasks, more than twice that of the second-best model; Dolphin v1.0 is also competitive, validating the unified framework.
Conclusion: Reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, underscoring its importance for high-stakes medical AI.
Abstract: Ultrasound is crucial in modern medicine but faces challenges like operator dependence, image noise, and real-time scanning, hindering AI integration. While large multimodal models excel in other medical imaging areas, they struggle with ultrasound's complexities. To address this, we introduce Dolphin v1.0 (V1) and its reasoning-augmented version, Dolphin R1, the first large-scale multimodal ultrasound foundation models unifying diverse clinical tasks in a single vision-language framework. To tackle ultrasound variability and noise, we curated a 2-million-scale multimodal dataset, combining textbook knowledge, public data, synthetic samples, and general corpora. This ensures robust perception, generalization, and clinical adaptability. The Dolphin series employs a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement. Dolphin v1.0 delivers reliable performance in classification, detection, regression, and report generation. Dolphin R1 enhances diagnostic inference, reasoning transparency, and interpretability through reinforcement learning with ultrasound-specific rewards. Evaluated on U2-Bench across eight ultrasound tasks, Dolphin R1 achieves a U2-score of 0.5835, over twice that of the second-best model (0.2968), setting a new state of the art. Dolphin v1.0 also performs competitively, validating the unified framework. Comparisons show reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, highlighting its importance for high-stakes medical AI.
[130] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On
Junseo Park,Hyeryung Jang
Main category: cs.CV
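TL;DR: Proposes ART-VITON, a measurement-guided diffusion framework for virtual try-on that uses residual prior-based initialization and multi-stage consistency sampling to eliminate boundary artifacts while preserving identity and background.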
Details
Motivation: Existing virtual try-on methods often produce boundary artifacts and semantic drift in non-try-on regions, making it hard to achieve both precise alignment and faithful preservation.
Method: Formulates virtual try-on as a linear inverse problem, adopts trajectory-aligned solvers, and introduces residual prior-based initialization; combines data consistency, frequency-level correction, and periodic denoising for measurement-guided, artifact-free generation.
Result: Experiments on VITON-HD, DressCode, and SHHQ-1.0 show that the method preserves identity and background, eliminates boundary artifacts, and improves visual fidelity and robustness.
Conclusion: ART-VITON's measurement-guided diffusion mechanism markedly improves generation quality in non-try-on regions and outperforms current state-of-the-art methods.
Abstract: Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
[131] Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Jia Jun Cheng Xian,Muchen Li,Haotian Yang,Xin Tao,Pengfei Wan,Leonid Sigal,Renjie Liao
Main category: cs.CV
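TL;DR: This paper proposes Text Preference Optimization (TPO), a framework for "free-lunch" alignment of text-to-image models without paired image preference data. Using a large language model to construct mismatched prompts and training the model to prefer matched prompts, TPO outperforms existing methods on multiple benchmarks.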
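One way to picture "preferring matched prompts over mismatched prompts" is to reuse the standard DPO loss with the two prompts playing the roles of the chosen and rejected items for the same image. The sketch below does exactly that with scalar log-likelihoods. It is an assumed simplification rather than the TDPO objective: diffusion models usually substitute a denoising-error surrogate for the log-likelihood, and the function name, `beta` value, and inputs are hypothetical.

```python
import math


def text_preference_dpo_loss(logp_matched, logp_mismatched,
                             ref_matched, ref_mismatched, beta=0.1):
    """Standard DPO loss where the 'chosen' side is the matched prompt.

    logp_*: log-likelihood of the image under the trained model given the
            matched or mismatched prompt.
    ref_*:  the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_matched - ref_matched)
                     - (logp_mismatched - ref_mismatched))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical log-likelihoods, not values from the paper.
loss = text_preference_dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

The loss falls as the model raises the image's likelihood under the matched prompt relative to the mismatched one, measured against the reference model.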
Details
Motivation: Existing methods based on reinforcement learning with human feedback (RLHF) depend on costly human-annotated data, which limits scalability, so a more efficient, low-cost way to improve alignment between text and generated images is needed.
Method: Proposes the TPO framework, which generates mismatched prompts by perturbing original captions with a large language model and trains the model to prefer the correctly matched prompts; extends DPO and KTO to the T2I setting as TDPO and TKTO.
Result: Quantitative and qualitative evaluations on multiple benchmarks show that TDPO and TKTO consistently outperform their original counterparts, improving human preference scores and text-image alignment.
Conclusion: TPO achieves efficient alignment without paired image preference data, offering a scalable, low-cost alignment approach for text-to-image generation models.
Abstract: Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
[132] V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Zhengpeng Shi,Hengli Li,Yanpeng Zhao,Jianqun Zhou,Yuxuan Wang,Qinrong Cui,Wei Bi,Songchun Zhu,Bo Zhao,Zilong Zheng
Main category: cs.CV
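TL;DR: This paper introduces v-HUB, a visual-centric video humor understanding benchmark for evaluating how well multimodal large language models (MLLMs) understand humor from visual cues alone, without text. Experiments show that current MLLMs perform poorly on purely visual humor understanding, and that adding audio significantly improves performance.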
Details
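Motivation: Existing models have limited humor understanding, and there is no effective way to evaluate it in settings that rely on visual cues alone. To help AI better understand non-verbal humor in real scenarios (such as silent films), a dedicated benchmark is needed to diagnose the visual humor understanding of MLLMs.
Method: Builds a new benchmark, v-HUB, consisting of minimally verbal short videos from classic silent films and online sources, paired with rich annotations (captions, descriptions, and explanations). Designs several evaluation tasks, including caption matching, humor explanation, and open-ended video QA. Tests a range of MLLMs, from specialized Video-LLMs to audio-capable OmniLLMs, and compares the effect of different modalities (video, audio).
Result: All MLLMs show a marked drop in caption matching when moving from text-based to video-only evaluation, indicating that they struggle to understand humor from visuals alone. Adding audio clearly improves performance, showing that sound plays an important role in humor understanding. v-HUB effectively exposes model limitations on complex video understanding tasks.
Conclusion: Current multimodal large language models still struggle with purely visual humor understanding, and v-HUB provides an effective tool for evaluating this ability. The study stresses the importance of integrating more modalities such as audio, and identifies better perception of non-verbal signals as a key direction.
Abstract: AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, and reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio).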
Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
[133] PCPO: Proportionate Credit Policy Optimization for Aligning Image Generation Models
Jeongjae Lee,Jong Chul Ye
Main category: cs.CV
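TL;DR: Proposes Proportionate Credit Policy Optimization (PCPO) to address training instability and high variance in reinforcement learning for text-to-image models; proportional credit assignment yields more stable training, faster convergence, and higher image quality.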
Details
Motivation: Existing policy gradient methods suffer from training instability and high variance, which slows convergence and degrades image quality. The main cause is that the mathematical structure of the generative sampler produces uneven credit assignment across timesteps.
Method: Introduces the PCPO framework, which enforces proportional credit assignment through a stable reformulation of the objective and a principled reweighting of timesteps.
Result: PCPO significantly accelerates convergence, improves image quality, and mitigates model collapse, outperforming existing methods on all fronts, including the state-of-the-art DanceGRPO.
Conclusion: By correcting the credit assignment mechanism, PCPO provides a more stable and efficient training scheme for reinforcement learning alignment of T2I models, with broad applicability.
Abstract: While reinforcement learning has advanced the alignment of text-to-image (T2I) models, state-of-the-art policy gradient methods are still hampered by training instability and high variance, hindering convergence speed and compromising image quality. Our analysis identifies a key cause of this instability: disproportionate credit assignment, in which the mathematical structure of the generative sampler produces volatile and non-proportional feedback across timesteps. To address this, we introduce Proportionate Credit Policy Optimization (PCPO), a framework that enforces proportional credit assignment through a stable objective reformulation and a principled reweighting of timesteps. This correction stabilizes the training process, leading to significantly accelerated convergence and superior image quality. The improvement in quality is a direct result of mitigating model collapse, a common failure mode in recursive training. PCPO substantially outperforms existing policy gradient baselines on all fronts, including the state-of-the-art DanceGRPO.
[134] Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation
Mingyu Kang,Yong Suk Choi
Main category: cs.CV
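TL;DR: Proposes ENM Inversion, a new noise map inversion method that improves both fidelity and editability when editing images with text-to-image diffusion models.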
Details
Motivation: Existing inversion methods struggle to satisfy the edits requested by the target text prompt while preserving source image content, which limits editing flexibility.
Method: Analyzes the editability properties of noise maps and proposes an editable noise refinement strategy that minimizes the difference between the reconstructed and edited noise maps to find optimal noise maps.
Result: Experiments show that ENM Inversion outperforms existing methods across a range of image editing tasks in both content preservation and edit accuracy, and extends to video editing, enabling temporal consistency and content manipulation across frames.
Conclusion: ENM Inversion balances content preservation and editability in image editing, providing a general and efficient solution for text-guided image and video editing.
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
[135] Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Wen Wen,Tianwu Zhi,Kanglong Fan,Yang Li,Xinge Peng,Yabin Zhang,Yiting Liao,Junlin Li,Li Zhang
Main category: cs.CV
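TL;DR: This paper proposes EvoQuality, a framework that improves the perceptual capability of vision-language models (VLMs) for image quality assessment (IQA) in a self-supervised way. Without ground-truth labels, it generates pseudo-labels through self-consistency and trains iteratively with group relative policy optimization, substantially improving zero-shot performance on multiple IQA benchmarks.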
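Generating pseudo-labels through self-consistency can be illustrated with a simple majority vote over repeated pairwise judgments from the same model. The sketch below is a minimal illustration: the "A"/"B" vote encoding, the function name, and the agreement ratio returned alongside the label are assumptions, and how EvoQuality turns such labels into a fidelity reward is not shown.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common vote and the fraction of votes agreeing with it.

    votes: repeated judgments for one image pair, e.g. "A" if the model said
    image A has higher quality and "B" otherwise. Ties resolve to the label
    that appears first in `votes`.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical samples from the same model for one image pair.
pseudo_label, agreement = majority_vote(["A", "A", "B", "A", "B"])
```

The agreement ratio gives a rough confidence signal, so low-consensus pairs could be down-weighted or dropped before training.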
Details
Motivation: Existing VLM post-training methods rely on costly human-annotated data, while self-supervised methods remain underused for perceptual tasks such as IQA, so a label-free, self-improving framework is needed.
Method: EvoQuality performs pairwise majority voting on the VLM's own outputs to generate pseudo-ranking labels, builds a fidelity reward from them, and iteratively evolves the model with group relative policy optimization (GRPO).
Result: Across diverse IQA benchmarks, EvoQuality improves the base VLM's zero-shot PLCC by 31.8% on average and outperforms state-of-the-art supervised methods on 5 of 7 benchmarks.
Conclusion: EvoQuality shows that a fully self-supervised approach can match or even surpass supervised methods on IQA, offering an effective path toward autonomous perceptual refinement of VLMs.
Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
[136] EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks
Yuan Gao,Sangwook Kim,Chris McIntosh
Main category: cs.CV
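TL;DR: Proposes EchoingECG, which uses uncertainty-aware ECG embeddings and ECHO supervision, combining PCME++ and ECHO-CLIP to distill ECHO knowledge into ECG representations, and outperforms existing ECG foundation models across multiple settings.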
Details
Motivation: ECG is cheap and accessible but traditionally cannot provide cardiac function information as comprehensive as ECHO, while ECHO demands substantial medical resources; the goal is to use ECG to predict cardiac function measurements more broadly.
Method: EchoingECG combines the probabilistic cross-modal embedding PCME++ with ECHO-CLIP, a vision-language model pre-trained on ECHO-text pairs, to build an uncertainty-aware student-teacher model that transfers knowledge from ECHO to ECG.
Result: In zero-shot, few-shot, and fine-tuning settings, EchoingECG outperforms state-of-the-art ECG foundation models at predicting ECHO measurements from ECG, and its variance estimates identify regions of uncertainty within ECGs.
Conclusion: EchoingECG improves ECG-based cardiac function prediction, its uncertainty modeling improves interpretability, and it offers a feasible option for cardiac assessment in resource-limited settings.
Abstract: Electrocardiogram (ECG) is a widely used tool for assessing cardiac function due to its low cost and accessibility. Emergent research shows that ECGs can help make predictions on key outcomes traditionally derived from more complex modalities such as echocardiograms (ECHO), enabling the use of ECGs as a more accessible method to predict broader measurements of cardiac function. ECHO, in particular, are of great importance because they require considerable hospital resources while playing a key role in clinical cardiac assessment. To aid this use case, we introduce EchoingECG, a probabilistic student-teacher model that leverages uncertainty-aware ECG embeddings and ECHO supervision to improve ECG-based cardiac function prediction. Our approach integrates Probabilistic Cross-Modal Embeddings (PCME++), a probabilistic contrastive framework, with ECHO-CLIP, a vision-language pre-trained model trained on ECHO-text pairs, to distill ECHO knowledge into ECG representations. Through experiments and external validation, we showed that EchoingECG outperforms state-of-the-art foundation ECG models in zero-shot, few-shot, and fine-tuning settings for ECHO predictions based on ECG. We also highlighted that variance estimation (enabled through our method) enhanced our understanding of model performance by identifying underlying regions of uncertainty within ECGs. The code is available at https://github.com/mcintoshML/EchoingECG.
[137] Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
Haotian Xue,Yunhao Ge,Yu Zeng,Zhaoshuo Li,Ming-Yu Liu,Yongxin Chen,Jiaojiao Fan
Main category: cs.CV
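TL;DR: Introduces Point-It-Out (PIO), a new benchmark that systematically evaluates the embodied reasoning of vision-language models (VLMs) through precise visual grounding across indoor, kitchen, driving, and robotic manipulation scenarios; its three-stage hierarchical evaluation protocol reveals how existing models differ across tasks.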
Details
Motivation: Existing evaluations of VLM embodied reasoning rely mainly on multiple-choice questions built from image annotations and do not systematically test precise visual grounding, so a finer-grained, more challenging benchmark is needed.
Method: Proposes the PIO benchmark with a three-stage hierarchical evaluation protocol: S1 referred-object localization, S2 task-driven pointing, and S3 visual trace prediction. The data come from key domains of embodied intelligence, and more than ten state-of-the-art VLMs are evaluated.
Result: Strong general-purpose models such as GPT-4o trail some open-source models in precise visual grounding; models such as MoLMO do well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.
Conclusion: PIO exposes the limits of VLMs' precise visual grounding in embodied reasoning and highlights the need for models that better combine fine visual detail with spatial reasoning.
Abstract: Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce the Point-It-Out (PIO) benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform compared to some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.
[138] Adapting SAM with Dynamic Similarity Graphs for Few-Shot Parameter-Efficient Small Dense Object Detection: A Case Study of Chickpea Pods in Field Conditions
Xintong Jiang,Yixue Liu,Mohamed Debbagh,Yu Tian,Valerio Hoyos-Villegas,Viacheslav Adamchuk,Shangpeng Sun
Main category: cs.CV
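TL;DR: Proposes Dynamic Similarity-based Graph Adaptation (DSGA), a parameter-efficient fine-tuning method that adapts the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Combined with LoRA and using only 4.00M trainable parameters, it clearly outperforms the baseline on multiple metrics.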
Details
Motivation: Limited training data and complex field conditions make parameter-efficient fine-tuning (PEFT) of foundation models difficult for agricultural vision tasks, particularly the segmentation of small dense objects.
Method: Proposes the DSGA module, which builds a dynamic similarity graph using a learnable polynomial decay-initialized weight ranking mechanism and performs adaptive local feature aggregation; DSGA is combined with Low-Rank Adaptation (LoRA) into a complementary optimization framework that captures both local and global dependencies in image embeddings.
Result: On a chickpea pod dataset, DSGA with LoRA outperforms baseline SAM fine-tuning under 2, 4, 8, and 10 shots, improving Structure-measure by 17.31% and adaptive F-measure by 62.36%; Grad-CAM and t-SNE analyses confirm improved feature discrimination; automated pod counting reaches an adjusted R-squared of 0.8987 on images with 10 to 120 pods.
Conclusion: Combining DSGA with LoRA adapts SAM efficiently under extremely limited data, with high parameter efficiency and practical utility for automated monitoring in complex agricultural scenes.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) of foundation models for agricultural computer vision tasks remains challenging due to limited training data and complex field conditions. This study introduces a Dynamic Similarity-based Graph Adaptation (DSGA) module to adapt the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Through dynamic similarity graph construction with a learnable polynomial decay-initialized weight ranking mechanism and adaptive local feature aggregation, DSGA establishes robust spatial and dynamic similarity representation with only 4.00M trainable parameters, which is 4.26% of the original SAM. Integrating this graph-based feature adaptation with Low-Rank Adaptation (LoRA) creates a complementary optimization framework that effectively captures both local and global dependencies in image embeddings while preserving model stability and parameter efficiency. Experimental results on a challenging chickpea pod dataset demonstrated that DSGA with LoRA achieved superior performance across multiple metrics evaluated under 2, 4, 8 and 10 shots, with progressive performance gains as shot count increased. Quantitative metrics showed a 17.31% improvement in Structure-measure and a 62.36% gain in adaptive F-measure compared to the baseline SAM fine-tuning. Comprehensive ablation studies and visualization analyses through Grad-CAM and t-SNE validated the framework's effectiveness in feature discrimination. The proposed adaptation demonstrated practical utility for automated agricultural monitoring applications, achieving accurate pod-counting with an adjusted R-squared of 0.8987 for images with 10 to 120 pods under challenging field conditions.
[139] Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition
Zichen Liang,Jingjing Fei,Jie Wang,Zheming Yang,Changqing Li,Pei Wu,Minghui Qiu,Fei Yang,Xialei Liu
Main category: cs.CV
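TL;DR: Proposes Logo-VGR, a new method for open-world logo recognition that uses a comparison-based task design and injected domain knowledge to generalize well to unseen brands in settings such as product moderation.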
Details
Motivation: Existing multimodal large models are evaluated mainly on general-purpose benchmarks, while applications in specific domains such as intelligent product moderation remain underexplored, and generalizing across a very large number of brands is especially hard.
Method: Reformulates logo recognition as a comparison-based task, uses Logo Perception Grounding to inject domain knowledge, and uses Logo-Guided Visual Grounded Reasoning to strengthen multimodal reasoning, so that large-scale brand recognition needs supervision from only a small subset of brands.
Result: Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, showing superior generalization.
Conclusion: Logo-VGR offers a new paradigm of domain-specific multimodal reasoning that improves recognition of and reasoning about unseen brands in practical product moderation.
Abstract: Recent advances in multimodal large language models (MLLMs) have been primarily evaluated on general-purpose benchmarks, while their applications in domain-specific scenarios, such as intelligent product moderation, remain underexplored. To address this gap, we introduce an open-world logo recognition benchmark, a core challenge in product moderation. Unlike traditional logo recognition methods that rely on memorizing representations of tens of thousands of brands (an impractical approach in real-world settings), our proposed method, Logo-VGR, enables generalization to large-scale brand recognition with supervision from only a small subset of brands. Specifically, we reformulate logo recognition as a comparison-based task, requiring the model to match product images with candidate logos rather than directly generating brand labels. We further observe that existing models tend to overfit by memorizing brand distributions instead of learning robust multimodal reasoning, which results in poor performance on unseen brands. To overcome this limitation, Logo-VGR introduces a new paradigm of domain-specific multimodal reasoning: Logo Perception Grounding injects domain knowledge, and Logo-Guided Visual Grounded Reasoning enhances the model's reasoning capability. Experimental results show that Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, demonstrating superior generalization.
[140] Overview of GeoLifeCLEF 2023: Species Composition Prediction with High Spatial Resolution at Continental Scale Using Remote Sensing
Christophe Botella,Benjamin Deneu,Diego Marcos,Maximilien Servajean,Theo Larcher,Cesar Leblanc,Joaquim Estopinan,Pierre Bonnet,Alexis Joly
Main category: cs.CV
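TL;DR: Presents the GeoLifeCLEF 2023 machine learning challenge, which aims to predict plant species composition using deep learning and remote sensing data. The competition provides a training set of 5 million plant observations with multi-source environmental variables, evaluates models on a multi-label classification task, examines the bias that single-label training introduces into multi-label prediction, and highlights an effective training strategy that combines single-label and multi-label data.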
Details
Motivation: To improve the modeling of species distributions from environmental variables, in particular with deep learning and high-resolution remote sensing data, and so advance ecology and conservation biology.
Method: Organizes an open machine learning competition (GeoLifeCLEF 2023) that provides large-scale plant observations and multi-source environmental data (remote sensing, climate, soil, and more); participating teams build models for multi-label species prediction, and the overview analyzes how different methods cope with training on single labels and predicting multiple labels.
Result: Participating models predict well but expose a bias in multi-label prediction caused by single-label training data; a new learning strategy that combines single-label and multi-label data improves model performance.
Conclusion: Combining single-label and multi-label learning mitigates the label mismatch between training and evaluation and offers a better training paradigm for future species distribution modeling.
Abstract: Understanding the spatio-temporal distribution of species is a cornerstone of ecology and conservation. By pairing species observations with geographic and environmental predictors, researchers can model the relationship between an environment and the species which may be found there. To advance the state-of-the-art in this area with deep learning models and remote sensing data, we organized an open machine learning challenge called GeoLifeCLEF 2023. The training dataset comprised 5 million plant species observations (single positive label per sample) distributed across Europe and covering most of its flora, together with high-resolution rasters (remote sensing imagery, land cover, and elevation) and coarse-resolution data (climate, soil, and human footprint variables). In this multi-label classification task, we evaluated models' ability to predict the species composition in 22 thousand small plots based on standardized surveys. This paper presents an overview of the competition, synthesizes the approaches used by the participating teams, and analyzes the main results. In particular, we highlight the biases faced by the methods fitted to single positive labels when it comes to the multi-label evaluation, and the new and effective learning strategy combining single and multi-label data in training.
[141] VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda,Yuiga Wada,Shinnosuke Hirano,Seitaro Otsuki,Komei Sugiura
Main category: cs.CV
TL;DR: Proposes VELA, a new automatic metric for evaluating long image captions, which achieves superhuman performance on the LongCap-Arena benchmark.
Details
Motivation: Existing image captioning metrics are designed mainly for short captions and cannot adequately evaluate the long, detailed captions produced by multimodal large models; current LLM-as-a-Judge approaches also suffer from slow inference and inefficient fusion of visual information.
Method: Proposes the VELA metric, built within a novel LLM-Hybrid-as-a-Judge framework with more efficient inference and vision-language fusion; also builds the LongCap-Arena benchmark, comprising 7,805 images, long reference and candidate captions, and 32,246 human judgments along three dimensions: descriptiveness, relevance, and fluency.
Result: VELA outperforms existing automatic metrics on LongCap-Arena and reaches superhuman agreement with human judgments.
Conclusion: VELA is an efficient and accurate automatic metric for long image captions, and LongCap-Arena provides an important benchmark for future research.
Abstract: In this study, we focus on the automatic evaluation of long and detailed image captions generated by multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, the corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.
[142] Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
Jinho Chang,Jaemin Kim,Jong Chul Ye
Main category: cs.CV
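TL;DR: Proposes a new training-free framework for reward-guided image editing that treats the reverse process of a diffusion model as a controllable trajectory and optimizes the control through adjoint states, raising the target reward while preserving the semantics of the source image.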
Details
Motivation: Existing reward-guided generation methods struggle to both preserve semantics and raise the reward in image editing, and the task itself has seen little dedicated study.
Method: Formulates image editing as a trajectory optimal control problem, treats the reverse process of a diffusion model as a controllable trajectory originating from the source image, and steers editing by iteratively updating the adjoint states.
Result: Across several editing tasks, the approach significantly outperforms existing inversion-based training-free baselines, striking a better balance between reward maximization and fidelity to the source image while avoiding reward hacking.
Conclusion: The framework enables training-free, reward-guided image editing that combines high-quality generation with strong semantic preservation.
Abstract: Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward guidance, which steers the generation process during inference to align with specific objectives. However, applying this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.
[143] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Xinyu Tian,Shu Zou,Zhaoyuan Yang,Mengqi He,Fabian Waschkowski,Lukas Wesemann,Peter Tu,Jing Zhang
Main category: cs.CV
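TL;DR: Studies the perceptual degradation that can accompany stronger logical reasoning in vision-language models and proposes Vision-Anchored Policy Optimization (VAPO), which strengthens the model's reliance on visual information and achieves state-of-the-art performance on multiple benchmarks.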
Details
Motivation: Reasoning substantially improves vision-language models on complex tasks, but prolonged reasoning can lead the model to disregard visual input (visual forgetting), which hurts accuracy on basic visual questions; a reasoning mechanism that stays visually grounded is therefore needed.
Method: Proposes Vision-Anchored Policy Optimization (VAPO), which explicitly steers the reasoning process toward visually grounded trajectories and, within a reinforcement learning framework such as GRPO, strengthens the model's attention to and use of visual input during training.
Result: VAPO-Thinker-7B relies significantly more on visual information, achieves new state-of-the-art results on multiple established benchmarks, and mitigates visual forgetting.
Conclusion: VAPO balances logical inference with perceptual grounding in multimodal reasoning and points to an important direction for making vision-language models more reliable.
Abstract: Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
[144] MuSLR: Multimodal Symbolic Logical Reasoning
Jundong Xu,Hao Fei,Yuhui Zhang,Liangming Pan,Qijun Huang,Qian Liu,Preslav Nakov,Min-Yen Kan,William Yang Wang,Mong-Li Lee,Wynne Hsu
Main category: cs.CV
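TL;DR: Introduces MuSLR, the first benchmark for evaluating the multimodal symbolic logical reasoning of vision-language models, together with the LogiCAM framework for improving model performance.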
Details
Motivation: Current vision-language models lack rigorous multimodal symbolic logical reasoning for high-stakes applications, which calls for systematic evaluation and improvement.
Method: Builds the MuSLR benchmark with 1,093 instances across 7 domains and proposes LogiCAM, a modular framework that applies formal logical rules to strengthen multimodal reasoning.
Result: Seven mainstream VLMs perform poorly on MuSLR, with the best, GPT-4.1, reaching only 46.8%; LogiCAM improves GPT-4.1's Chain-of-Thought performance by 14.13%, with larger gains on complex logics.
Conclusion: Current VLMs fall clearly short on multimodal symbolic logical reasoning, and LogiCAM offers an effective way to improve logical consistency.
Abstract: Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
[145] PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Po-Han Huang,Jeng-Lin Li,Po-Hsuan Huang,Ming-Ching Chang,Wei-Chao Chen
Main category: cs.CV
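TL;DR: Proposes PatchEAD, a unified, training-free, patch-based anomaly detection framework that is compatible with diverse foundation models and performs strongly in few-shot and zero-shot settings.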
Details
Motivation: Existing industrial anomaly detection methods rely mostly on textual prompt tuning and lack a unified visual processing framework, which limits generalization across foundation models.
Method: Designs the Patch-Exclusive Anomaly Detection (PatchEAD) framework, which includes an alignment module and foreground masking, to provide training-free visual prompting focused on patch-similarity matching.
Result: PatchEAD outperforms prior work in few-shot and batch zero-shot settings without using textual features; the study also analyzes how backbone structure and pretrained characteristics affect the robustness of patch similarity.
Conclusion: A unified patch-only framework enables quick, calibration-light deployment without carefully engineered textual prompts and gives practical guidance for selecting and configuring foundation models for real-world visual inspection.
Abstract: Industrial anomaly detection increasingly relies on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
[146] LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement
Pasindu Ranasinghe,Dibyayan Patra,Bikram Banerjee,Simit Raval
Main category: cs.CV
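TL;DR: Proposes a hardware-agnostic method that generates colourised point clouds for mechanical LiDAR from multiple camera inputs, providing 360-degree coverage, and stays robust in low light through an integrated low-light image enhancement module.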
Details
Motivation: To improve spatial perception in complex environments, especially in low light, and to address the sensitivity to illumination and the reliance on specialised calibration equipment of conventional LiDAR-camera fusion.
Method: Fuses multiple cameras with a mechanical LiDAR: the camera intrinsics are calibrated first, the geometric transformation between the LiDAR and the cameras is then computed automatically, and a low-light image enhancement module and colour correction are applied before fusion to improve image quality and consistency.
Result: Validated with a Velodyne Puck Hi-Res LiDAR and a four-camera configuration, the system runs in real time and colourises point clouds reliably under very low illumination, recovering scene details that would otherwise remain undetectable.
Conclusion: The method needs no specialised calibration targets, is practical and robust, and is well suited to 3D perception at night or in poorly lit environments.
Abstract: In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
[147] MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification
Junjie Zhou,Wei Shao,Yagao Yue,Wei Mu,Peng Wan,Qi Zhu,Daoqiang Zhang
Main category: cs.CV
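TL;DR: Proposes MAPLE, a multi-scale attribute-enhanced prompt learning framework for few-shot whole slide image classification that combines entity-level and slide-level predictions to significantly improve pathology diagnosis performance.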
Details
Motivation: Existing methods rely on slide-level prompts and cannot capture the subtype-specific phenotypic variations of histological entities (such as nuclei and glands) that are critical for cancer diagnosis.
Method: Uses large language models to generate entity-level and slide-level prompts; extracts entity-level features with an entity-guided cross-attention module, models multi-scale semantic correlations with a cross-scale entity graph learning module, and fuses the entity-level and slide-level outputs for the final prediction.
Result: Experiments on three cancer cohorts show that the method outperforms existing methods on few-shot pathology diagnosis tasks.
Conclusion: By integrating multi-scale visual semantics with fine-grained attribute prompts, MAPLE classifies whole slide images more accurately in few-shot settings and offers a new approach to automated pathology diagnosis.
Abstract: Prompt learning has emerged as a promising paradigm for adapting pre-trained vision-language models (VLMs) to few-shot whole slide image (WSI) classification by aligning visual features with textual representations, thereby reducing annotation cost and enhancing model generalization. Nevertheless, existing methods typically rely on slide-level prompts and fail to capture the subtype-specific phenotypic variations of histological entities (e.g., nuclei, glands) that are critical for cancer diagnosis. To address this gap, we propose Multi-scale Attribute-enhanced Prompt Learning (MAPLE), a hierarchical framework for few-shot WSI classification that jointly integrates multi-scale visual semantics and performs prediction at both the entity and slide levels. Specifically, we first leverage large language models (LLMs) to generate entity-level prompts that can help identify multi-scale histological entities and their phenotypic attributes, as well as slide-level prompts to capture global visual descriptions. Then, an entity-guided cross-attention module is proposed to generate entity-level features, followed by aligning with their corresponding subtype-specific attributes for fine-grained entity-level prediction. To enrich entity representations, we further develop a cross-scale entity graph learning module that can update these representations by capturing their semantic correlations within and across scales. The refined representations are then aggregated into a slide-level representation and aligned with the corresponding prompts for slide-level prediction. Finally, we combine both entity-level and slide-level outputs to produce the final prediction results.
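The abstract does not say how the entity-level and slide-level outputs are combined. The following is a minimal sketch under a loudly stated assumption: entity-level class probabilities are averaged and then blended with the slide-level probabilities by a weight `alpha`. The function name `fuse_predictions`, the parameter `alpha`, and the toy logits are all hypothetical and are not taken from MAPLE.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(entity_logits: Sequence[Sequence[float]],
                     slide_logits: Sequence[float],
                     alpha: float = 0.5) -> List[float]:
    """Blend mean entity-level probabilities with slide-level probabilities.

    ASSUMPTION: simple convex averaging; the paper's actual fusion rule
    is not described in the abstract.
    """
    entity_probs = [softmax(row) for row in entity_logits]
    n_classes = len(slide_logits)
    mean_entity = [
        sum(probs[c] for probs in entity_probs) / len(entity_probs)
        for c in range(n_classes)
    ]
    slide_probs = softmax(slide_logits)
    return [alpha * e + (1.0 - alpha) * s
            for e, s in zip(mean_entity, slide_probs)]


# Hypothetical toy input: two undecided entities and a slide-level head
# that favours class 0 with probability 0.75.
fused = fuse_predictions(
    entity_logits=[[0.0, 0.0], [0.0, 0.0]],
    slide_logits=[math.log(3.0), 0.0],
)
predicted_class = max(range(len(fused)), key=lambda c: fused[c])
```

Because both inputs are probability vectors and the blend is convex, the fused vector still sums to one, so it can be read directly as class probabilities.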
Results on three cancer cohorts confirm the effectiveness of our approach in addressing few-shot pathology diagnosis tasks.
[148] DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Chi Zhang,Haibo Qiu,Qiming Zhang,Zhixiong Zeng,Lin Ma,Jing Zhang
Main category: cs.CV
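TL;DR: Presents DeepSketcher, a comprehensive suite comprising an interleaved image-text dataset and a self-contained model, which advances the "thinking with images" paradigm in vision-language models and enables flexible visual reasoning without external tools.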
Details
Motivation: Existing vision-language models mostly rely on text-dominant reasoning and lack fine-grained interaction with images; "thinking with images" is promising but still leaves room for exploration in data accuracy, structural design, and range of applications.
Method: Builds a high-quality dataset of 31k reasoning trajectories with tool calls and edited images, and designs a model that generates "visual thoughts" directly in the visual embedding space to perform interleaved image-text reasoning.
Result: Experiments on multimodal reasoning benchmarks show strong performance, validating both the dataset and the model design.
Conclusion: DeepSketcher advances the "thinking with images" paradigm and demonstrates the potential of efficient, flexible multimodal reasoning without external tools.
Abstract: The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an emerging paradigm, however, it still leaves substantial room for exploration in data construction accuracy, structural design, and broader application scenarios, which offer rich opportunities for advancing multimodal reasoning. To further advance this line of work, we present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model. The dataset contains 31k chain-of-thought (CoT) reasoning trajectories with diverse tool calls and resulting edited images, covering a wide range of data types and manipulation instructions with high annotation accuracy. Building on this resource, we design a model that performs interleaved image-text reasoning and natively generates "visual thoughts" by operating directly in the visual embedding space, rather than invoking external tools and repeatedly re-encoding generated images. This design enables tool-free and more flexible "thinking with images". Extensive experiments on multimodal reasoning benchmarks demonstrate strong performance, validating both the utility of the dataset and the effectiveness of the model design.
[149] A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
Arvind Murari Vepa,Yannan Yu,Jingru Gan,Anthony Cuturrufo,Weikai Li,Wei Wang,Fabien Scalzo,Yizhou Sun
Main category: cs.CV
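TL;DR: mpLLM is a hierarchical mixture-of-experts model for visual question answering on multi-parametric 3D brain MRI. It fuses multimodal data through modality-level and token-level experts, introduces a synthetic VQA protocol to mitigate the shortage of paired image-text data, and outperforms existing medical vision-language models on multiple datasets.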
Details
Motivation: To achieve effective visual question answering on multi-parametric 3D brain MRI without abundant paired image-text supervision, making medical vision-language models more practical and accurate in clinical use.
Method: Proposes mpLLM, a prompt-conditioned hierarchical mixture-of-experts architecture that fuses modalities through modality-level and token-level projection experts, and designs a synthetic VQA protocol that generates medically relevant question-answer pairs from segmentation annotations, validated with medical experts.
Result: mpLLM outperforms existing medical VLM baselines by 5.3% on average across multiple mpMRI datasets; ablations show the importance of modality-level and token-level experts and of prompt-conditioned routing.
Conclusion: mpLLM fuses multi-parametric MRI modalities effectively and improves visual question answering, and its clinically validated VQA dataset and method provide an important resource and a new direction for medical image understanding.
Abstract: We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image-report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing. We have included our source code in the supplementary materials and will release our dataset upon publication.
[150] LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
Guolei Huang,Qingzhi Peng,Gan Xu,Yuxuan Lu,Yongjun Shen
Main category: cs.CV
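TL;DR: Gives a new definition of multimodal multi-turn dialogue safety and builds MMDS, the first dataset for it; develops a red-teaming framework based on Monte Carlo Tree Search to generate unsafe dialogue samples; and proposes LLaVAShield, which clearly outperforms baselines on multimodal multi-turn content moderation.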
Details
Motivation: As vision-language models move into interactive, multi-turn use, single-turn or single-modality moderation cannot catch malicious intent spread across turns and modalities, so the safety of multimodal multi-turn dialogues needs systematic study.
Method: Gives a systematic definition of multimodal multi-turn dialogue safety, builds the finely annotated MMDS dataset, designs an automated MCTS-based red-teaming framework to generate adversarial samples, and develops LLaVAShield for joint risk detection and assessment.
Result: MMDS contains 4,484 annotated samples; LLaVAShield consistently outperforms strong baselines across experiments and under dynamic policy configurations, establishing new state-of-the-art results.
Conclusion: The study provides the first systematic treatment of multimodal multi-turn dialogue safety, and the released dataset and model will support future research.
Abstract: As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
[151] VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu,Haozhan Shen,Chunxin Fang,Zhicheng Sun,Jiajia Liao,Tiancheng Zhao
Main category: cs.CV
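TL;DR: Proposes VLM-FO1, a new framework that recasts fine-grained visual perception from a coordinate generation problem into a feature retrieval task, substantially improving vision-language models on object grounding and region understanding.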
Details
Motivation: Existing vision-language models do poorly on fine-grained perception tasks because they must generate exact coordinates, which is too hard for language-centric architectures.
Method: Introduces the VLM-FO1 framework, which combines a Hybrid Fine-grained Region Encoder (HFRE) with a token-based referencing system; a dual vision encoder produces region tokens rich in semantic and spatial detail, and the module integrates seamlessly with pre-trained VLMs.
Result: Achieves state-of-the-art performance across multiple benchmarks, markedly improving object grounding, region generational understanding, and visual region reasoning without harming the base model's general visual understanding.
Conclusion: VLM-FO1 offers an effective and flexible paradigm for building perception-aware vision-language models, bridging the gap between high-level reasoning and fine-grained visual grounding.
Abstract: Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
[152] The Impact of Scaling Training Data on Adversarial Robustness
Marco Zimmerli,Andreas Plesner,Till Aczel,Roger Wattenhofer
Main category: cs.CV
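TL;DR: Studies the adversarial robustness of 36 state-of-the-art vision models under different training data characteristics, finding that robustness scales logarithmically with data volume and model size but that data quality, architecture, and training objectives matter more than data scale alone.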
Details
Motivation: Deep neural networks remain vulnerable to adversarial examples despite advances in architectures and training methods, so it is worth examining how training data characteristics affect adversarial robustness.
Method: Evaluates 36 state-of-the-art vision models spanning supervised, self-supervised, and contrastive learning under six black-box attack categories, and studies the effects of data volume, model size, and data quality.
Result: Attack success rate (ASR) falls logarithmically with data volume and model size: a tenfold increase in data reduces ASR by about 3.2%, and a tenfold increase in model size by about 13.4%. Some self-supervised models trained on curated datasets outperform models trained on larger but less curated datasets. Adversarial fine-tuning improves generalization across structural variations but not across color distributions. Human evaluation shows persistent gaps between human and machine vision.
Conclusion: Scaling helps adversarial robustness, but data quality, model architecture, and training objectives matter more than raw data scale.
Abstract: Deep neural networks remain vulnerable to adversarial examples despite advances in architectures and training paradigms. We investigate how training data characteristics affect adversarial robustness across 36 state-of-the-art vision models spanning supervised, self-supervised, and contrastive learning approaches, trained on datasets from 1.2M to 22B images. Models were evaluated under six black-box attack categories: random perturbations, two types of geometric masks, COCO object manipulations, ImageNet-C corruptions, and ImageNet-R style shifts. Robustness follows a logarithmic scaling law with both data volume and model size: a tenfold increase in data reduces attack success rate (ASR) on average by ~3.2%, whereas a tenfold increase in model size reduces ASR on average by ~13.4%. Notably, some self-supervised models trained on curated datasets, such as DINOv2, outperform others trained on much larger but less curated datasets, challenging the assumption that scale alone drives robustness. Adversarial fine-tuning of ResNet50s improves generalization across structural variations but not across color distributions. Human evaluation reveals persistent gaps between human and machine vision. These results show that while scaling improves robustness, data quality, architecture, and training objectives play a more decisive role than raw scale in achieving broad-spectrum adversarial resilience.
[153] UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Yuan Zhao,Youwei Pang,Lihe Zhang,Hanqi Liu,Jiaming Zuo,Huchuan Lu,Xiaoqi Zhao
Main category: cs.CV
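TL;DR: Proposes UniMMAD, a unified framework for multi-modal, multi-class anomaly detection. Its MoE-driven feature decompression mechanism enables adaptive, disentangled reconstruction, significantly improving performance while cutting parameter usage by 75%.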
Details
Motivation: Existing anomaly detection methods treat modality and class as independent factors, which leads to fragmented solutions and heavy memory overhead. Reconstruction-based multi-class methods rely on shared decoding paths that struggle with cross-domain variation, distorting normality boundaries and raising false alarm rates.
Method: Designs the UniMMAD framework. In the encoding stage, a feature compression module compresses multi-modal inputs into general-purpose features and suppresses latent anomalies. In the decoding stage, a sparsely-gated cross MoE dynamically selects expert pathways based on input modality and class, giving domain-specific adaptive decompression. Grouped dynamic filtering and a MoE-in-MoE structure are introduced to improve efficiency.
Result: Achieves SOTA performance on 9 anomaly detection datasets covering 3 fields, 12 modalities, and 66 classes, while cutting parameters by 75% and keeping sparse activation and fast inference.
Conclusion: Through its "general to specific" paradigm and MoE-driven decompression mechanism, UniMMAD addresses fragmentation, domain interference, and inefficiency in multi-modal multi-class anomaly detection, with good generality and practicality.
Abstract: Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a "general to specific" paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75% while maintaining sparse activation and fast inference.
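The sparsely-gated expert selection mentioned in the abstract can be illustrated with a generic top-k routing sketch. This is not UniMMAD's cross MoE: the function names (`top_k_gate`, `moe_forward`), the toy experts, and the use of precomputed gate logits are all assumptions made for illustration. It only shows the general idea that a gate scores every expert, the k highest-scoring experts are run, and their outputs are mixed with softmax weights.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]  # hypothetical expert: maps a feature vector to a vector of the same size


def top_k_gate(logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k largest gate logits and softmax-normalize only those."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x: Vector, experts: Sequence[Expert],
                gate_logits: Sequence[float], k: int = 2) -> Vector:
    """Run only the selected experts and return their weighted sum."""
    out = [0.0] * len(x)
    for index, weight in top_k_gate(gate_logits, k):
        y = experts[index](x)  # experts outside the top k are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

In a real model the gate logits would come from a learned router conditioned on the input (here, on modality and class, per the abstract); they are passed in directly to keep the sketch self-contained.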
UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
[154] CO3: Contrasting Concepts Compose Better
Debottam Dutta,Jianchong Chen,Rajalaxmi Rajagopalan,Yu-Lin Wei,Romit Roy Choudhury
Main category: cs.CV
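TL;DR: Proposes CO3, a corrective sampling strategy that improves multi-concept prompt fidelity in text-to-image diffusion models, avoiding missing or imbalanced concepts without retraining the model.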
Details
Motivation: Existing diffusion models tend to produce missing, faint, or conflicting concepts when given multi-concept prompts, because the model favors concepts it learned strongly during training, which leaves joint-prompt behavior imbalanced.
Method: Introduces a plug-in corrective sampling strategy that needs no model tuning. It steers away from mixed modes dominated by a single concept and toward "pure" joint modes where all concepts coexist in balance, and it improves the weight stability of multi-concept guidance schemes.
Result: Experiments show that CO3 significantly improves concept coverage, visual balance, and generation robustness across diverse multi-concept prompts, reduces dropped or distorted concepts, and outperforms standard baselines and existing compositional methods.
Conclusion: Lightweight corrective guidance can mitigate brittle semantic alignment in modern diffusion models and make multi-concept generation more reliable.
Abstract: We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases: prompts like "a cat and a dog" that sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of re-training, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt. The goal is to steer towards "pure" joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance and robustness, with fewer dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.
[155] Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation
Longzhen Yang,Zhangkai Ni,Ying Wen,Yihang Liu,Lianghua He,Heng Tao Shen
Main category: cs.CV
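TL;DR: Proposes Self-Supervised Anatomical Consistency Learning (SS-ACL), an annotation-free framework that uses textual prompts to align generated reports with anatomical regions, significantly improving the accuracy and interpretability of medical report generation.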
Details
Motivation: Existing methods rely on separate detection modules that need extensive expert annotation; labeling is costly, and generalization suffers from pathology distribution bias across datasets.
Method: Builds a hierarchical anatomical graph based on the structure of human anatomy, recursively reconstructs fine-grained anatomical regions to achieve intra-sample spatial alignment, introduces region-level contrastive learning based on anatomical consistency to strengthen inter-sample semantic alignment, and uses the aligned embeddings as priors for report generation.
Result: Without expert annotations, SS-ACL improves lexical accuracy by 10% and clinical efficacy by 25%, and surpasses current leading visual foundation models by 8% in zero-shot visual grounding.
Conclusion: SS-ACL achieves vision-grounded medical report generation that combines high accuracy with interpretability and transfers well to downstream tasks.
Abstract: Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) -- a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence.
Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports -- outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding.
[156] A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments
Espen Uri Høgstedt,Christian Schellewald,Annette Stahl,Rudolf Mester
Main category: cs.CV
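TL;DR: Proposes a flexible tracking framework based on pose estimation that handles occlusion, similar appearance, and related difficulties in underwater salmon scenes, enabling automated monitoring of multiple welfare indicators.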
Details
Motivation: Existing computer-vision methods for salmon welfare monitoring address only a single indicator and depend on detection and tracking models from other domains, which is resource-intensive and performs poorly in complex underwater environments.
Method: Uses a pose estimation network to extract bounding boxes for salmon and their body parts, exploits body-part information through specialized modules to handle the challenges of underwater scenes, and then computes welfare indicators from the high-precision body-part tracks.
Result: On two newly constructed datasets, the method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, on the salmon ID transfer and ID switch challenges, and it is successfully applied to computing tail beat wavelength.
Conclusion: The proposed framework improves the accuracy and robustness of underwater salmon tracking and suits automated, fine-grained fish welfare monitoring.
Abstract: Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis.
Datasets and code are available at https://github.com/espenbh/BoostCompTrack.
[157] PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks
Bojun Zhang,Hangjian Ye,Hao Zheng,Jianzheng Huang,Zhengyu Lin,Zhenhong Guo,Feng Zheng
Main category: cs.CV
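TL;DR: Proposes PinPoint3D, an interactive framework for fine-grained, multi-granularity 3D segmentation that produces precise part-level masks from only a few user clicks. A newly built large-scale dataset with dense part annotations significantly improves segmentation on sparse point clouds.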
Details
Motivation: Existing interactive segmentation methods are confined to coarse instance-level targets, while non-interactive methods perform poorly on sparse real-world scans and lack annotated data, which makes fine-grained manipulation of an object's functional parts difficult.
Method: Proposes the PinPoint3D framework, which performs interactive segmentation from a few user clicks, and designs a new 3D data synthesis pipeline to generate a large-scale, scene-level training dataset with dense part annotations.
Result: Under first-click settings, the average IoU per object part reaches 55.8%, and it exceeds 71.3% with only a few additional clicks. Compared with current state-of-the-art methods, IoU and precision improve by up to 16%.
Conclusion: PinPoint3D addresses data scarcity and insufficient precision in fine-grained 3D part segmentation and advances fine-grained machine perception and interaction in complex 3D environments.
Abstract: Fine-grained 3D part segmentation is crucial for enabling embodied AI systems to perform complex manipulation tasks, such as interacting with specific functional components of an object. However, existing interactive segmentation methods are largely confined to coarse, instance-level targets, while non-interactive approaches struggle with sparse, real-world scans and suffer from a severe lack of annotated data. To address these limitations, we introduce PinPoint3D, a novel interactive framework for fine-grained, multi-granularity 3D segmentation, capable of generating precise part-level masks from only a few user point clicks. A key component of our work is a new 3D data synthesis pipeline that we developed to create a large-scale, scene-level dataset with dense part annotations, overcoming a critical bottleneck that has hindered progress in this field. Through comprehensive experiments and user studies, we demonstrate that our method significantly outperforms existing approaches, achieving an average IoU of around 55.8% on each object part under first-click settings and surpassing 71.3% IoU with only a few additional clicks. Compared to current state-of-the-art baselines, PinPoint3D yields up to a 16% improvement in IoU and precision, highlighting its effectiveness on challenging, sparse point clouds with high efficiency. Our work represents a significant step towards more nuanced and precise machine perception and interaction in complex 3D environments.
[158] Towards Reliable and Holistic Visual In-Context Learning Prompt Selection
Wenxiao Wu,Jing-Hao Xue,Chengming Xu,Chen Liu,Xinwei Sun,Changxin Gao,Nong Sang,Yanwei Fu
Main category: cs.CV
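TL;DR: Proposes RH-Partial2Global, an improved visual in-context learning method that uses jackknife conformal prediction and a covering-design sampling strategy to make in-context example selection more reliable and holistic.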
Details
Motivation: Existing methods rest on the intuitive but unverified similarity-priority assumption, and random sampling leaves pairwise comparisons incomplete or redundant, which degrades the global ranking.
Method: Proposes RH-Partial2Global, which uses jackknife conformal prediction to construct reliable candidate sets and a covering design to sample pairwise preference relations comprehensively and uniformly.
Result: Experiments on a variety of visual tasks show that RH-Partial2Global performs strongly and outperforms existing methods such as Partial2Global.
Conclusion: The method makes in-context example selection in visual in-context learning more reliable and complete, and offers a better solution for global ranking.
Abstract: Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
[159] VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga,Youssef Hmamouche,Amal El Fallah Seghrouchni
Main category: cs.CV
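TL;DR: Proposes VRWKV-Editor, a video editing model built on a linear spatio-temporal aggregation module. Using RWKV's bidirectional weighted key-value recurrence mechanism, it achieves linear complexity and is significantly faster and more memory-efficient than existing diffusion models while maintaining good editing quality.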
Details
Motivation: Traditional attention mechanisms have quadratic computational complexity when editing long, high-resolution videos, which limits their use in practical settings such as real-time video processing.
Method: Brings the bidirectional weighted key-value recurrence mechanism of the RWKV transformer into video diffusion models, building a linear spatio-temporal aggregation module that lowers both time and space complexity.
Result: Compared with state-of-the-art diffusion-based methods, the model achieves up to a 3.7x speedup and 60% lower memory usage, performs comparably in frame consistency and text alignment, and shows a larger advantage on long video sequences.
Conclusion: VRWKV-Editor addresses the high complexity of attention in video diffusion models and offers a practical route to efficient, high-quality long-video editing.
Abstract: In light of recent progress in video editing, deep learning models focusing on both spatial and temporal dependencies have emerged as the primary method. However, these models suffer from the quadratic computational complexity of traditional attention mechanisms, making them difficult to adapt to long-duration and high-resolution videos. This limitation restricts their applicability in practical contexts such as real-time video processing. To tackle this challenge, we introduce a method to reduce both time and space complexity of these systems by proposing VRWKV-Editor, a novel video editing model that integrates a linear spatio-temporal aggregation module into video-based diffusion models. VRWKV-Editor leverages the bidirectional weighted key-value recurrence mechanism of the RWKV transformer to capture global dependencies while preserving temporal coherence, achieving linear complexity without sacrificing quality. Extensive experiments demonstrate that the proposed method achieves up to 3.7x speedup and 60% lower memory usage compared to state-of-the-art diffusion-based video editing methods, while maintaining competitive performance in frame consistency and text alignment. Furthermore, a comparative analysis we conducted on videos with different sequence lengths confirms that the gap in editing speed between our approach and architectures with self-attention becomes more significant with long videos.
[160] Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Nicola Messina,Rosario Leonardi,Luca Ciampi,Fabio Carrara,Giovanni Maria Farinella,Fabrizio Falchi,Antonino Furnari
Main category: cs.CV
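TL;DR: Introduces the NS-iHOS task, which uses natural-language narrations as weak supervision to segment in-hand objects without pixel-level annotations. The proposed WISH model distills knowledge from narrations to learn hand-object associations, significantly outperforms baselines on EPIC-Kitchens and Ego4D, and recovers more than 50% of the performance of fully supervised methods.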
Details
Motivation: In-hand object segmentation in egocentric images currently depends on costly manual pixel-level annotation, which limits progress. Narrations are already widely available in existing datasets and cheap to obtain, so they can serve as weak supervision to ease the shortage of annotations.
Method: Proposes the NS-iHOS task and the WISH model. Action narrations act as weak supervision: during training, a multimodal alignment mechanism (such as vision-language matching) extracts object cues and builds hand-object associations, and narrations are not used at inference. The model is trained end to end and segments in-hand objects without pixel-level annotations.
Result: On EPIC-Kitchens and Ego4D, WISH outperforms multiple baselines based on open-vocabulary detectors and vision-language models and reaches more than 50% of the performance of fully supervised methods, which supports narrations as an effective weak supervision signal.
Conclusion: Narrations as weak supervision are an effective way to address the scarcity of annotations for in-hand object segmentation in egocentric video. WISH shows that plausible hand-object associations can be learned from narrations alone, pointing to a new direction for few-shot and weakly supervised interaction understanding.
Abstract: Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations -- natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design.
Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
[161] AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment
Hanwei Zhu,Yu Tian,Keyan Ding,Baoliang Chen,Bolin Chen,Shiqi Wang,Weisi Lin
Main category: cs.CV
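TL;DR: Proposes AgenticIQA, a modular agentic framework that combines vision-language models with traditional image quality assessment tools for interpretable, adaptive image quality assessment.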
Details
Motivation: Conventional methods rely on fixed models that output a single score, which adapts poorly to diverse distortions and user needs, and they treat scoring and interpretation as separate processes.
Method: Decomposes image quality assessment into four subtasks (distortion detection, distortion analysis, tool selection, and tool execution) that are carried out jointly by a planner, an executor, and a summarizer.
Result: Experiments on multiple datasets show that AgenticIQA outperforms strong baselines in both scoring accuracy and explanatory alignment.
Conclusion: By integrating VLMs with traditional tools in a dynamic, query-aware manner, AgenticIQA improves the adaptability, interpretability, and overall performance of image quality assessment.
Abstract: Image quality assessment (IQA) is inherently complex, as it reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. Conventional approaches typically rely on fixed models to output scalar scores, limiting their adaptability to diverse distortions, user-specific queries, and interpretability needs. Furthermore, scoring and interpretation are often treated as independent processes, despite their interdependence: interpretation identifies perceptual degradations, while scoring abstracts them into a compact metric. To address these limitations, we propose AgenticIQA, a modular agentic framework that integrates vision-language models (VLMs) with traditional IQA tools in a dynamic, query-aware manner. AgenticIQA decomposes IQA into four subtasks -- distortion detection, distortion analysis, tool selection, and tool execution -- coordinated by a planner, executor, and summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence via tool invocation, and the summarizer integrates this evidence to produce accurate scores with human-aligned explanations. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents. Extensive experiments across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.
[162] PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion
Zhiwei Zhang,Ruikai Xu,Weijian Zhang,Zhizhong Zhang,Xin Tan,Jingyu Gong,Yuan Xie,Lizhuang Ma
Main category: cs.CV
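TL;DR: Presents PFDepth, the first pinhole-fisheye framework for heterogeneous multi-view depth estimation. It exploits the complementary characteristics of pinhole and fisheye imagery for joint optimization and reaches SOTA performance on the KITTI-360 and RealHet datasets.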
Details
Motivation: Pinhole and fisheye cameras are complementary in field of view, distortion, and near-field versus far-field perception, but existing methods do not effectively fuse these two heterogeneous image types for depth estimation.
Method: Proposes the unified PFDepth architecture, which lifts 2D features into a 3D volumetric space, uses a Heterogeneous Spatial Fusion module to process distortion-aware volumetric features, and reformulates conventional voxel fusion as a learnable 3D Gaussian representation for finer 3D aggregation.
Result: Significantly outperforms current mainstream depth estimation networks on the KITTI-360 and RealHet datasets, reaching state-of-the-art performance.
Conclusion: PFDepth is the first systematic study of heterogeneous pinhole-fisheye depth estimation. It offers technical novelty and useful empirical insights and opens a new direction for depth estimation in multi-camera systems.
Abstract: In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
[163] New Fourth-Order Grayscale Indicator-Based Telegraph Diffusion Model for Image Despeckling
Rajendra K. Ray,Manish Kumar
Main category: cs.CV
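TL;DR: Proposes a fourth-order nonlinear PDE model that combines diffusion and wave properties to suppress multiplicative noise. It reduces blocky artifacts, preserves detail, and denoises grayscale and color images better than existing second-order anisotropic diffusion models.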
Details
Motivation: Traditional second-order PDE models tend to introduce blocky artifacts in the early stages of denoising and need improvement to better preserve image detail and texture.
Method: Builds a fourth-order nonlinear PDE model that combines a diffusion process guided by the Laplacian and intensity values with a wave part, and applies it independently to each color channel to handle color images.
Result: On data with ground-truth reference images, the model scores better on PSNR and MSSIM. For SAR images, the Speckle Index (SI) shows stronger noise reduction. Color image processing also preserves structure and color consistency.
Conclusion: The proposed fourth-order nonlinear PDE model suppresses multiplicative noise better than traditional second-order models, applies to both grayscale and color images, and gives good visual quality along with good quantitative results.
Abstract: Second-order PDE models have been widely used for suppressing multiplicative noise, but they often introduce blocky artifacts in the early stages of denoising. To resolve this, we propose a fourth-order nonlinear PDE model that integrates diffusion and wave properties. The diffusion process, guided by both the Laplacian and intensity values, reduces noise better than gradient-based methods, while the wave part keeps fine details and textures. The effectiveness of the proposed model is evaluated against two second-order anisotropic diffusion approaches using the Peak Signal-to-Noise Ratio (PSNR) and Mean Structural Similarity Index (MSSIM) for images with available ground truth. For SAR images, where a noise-free reference is unavailable, the Speckle Index (SI) is used to measure noise reduction. Additionally, we extend the proposed model to study color images by applying the denoising process independently to each channel, preserving both structure and color consistency. The same quantitative metrics PSNR and MSSIM are used for performance evaluation, ensuring a fair comparison across grayscale and color images. In all the cases, our model produces better results compared to existing models in this genre.
[164] SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
Yuqi Xiao,Yingying Zhu
Main category: cs.CV
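TL;DR: Proposes SETR, a two-stage semantic-enhanced method for zero-shot composed image retrieval (ZS-CIR) that significantly improves performance through an intersection-driven strategy and re-ranking with a multimodal LLM.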
Details
Motivation: Existing CLIP-based methods for ZS-CIR carry irrelevant background into the fused features and lack fine-grained semantic matching, so retrieval precision and semantic consistency need to improve.
Method: SETR has two stages. The first uses an intersection-driven strategy that keeps only the semantics shared by the reference image and the text, filtering out distractors. The second uses a multimodal LLM adapted with Low-Rank Adaptation to make binary "Yes/No" judgments, giving fine-grained semantic re-ranking.
Result: Reaches state-of-the-art performance on CIRR, Fashion-IQ, and CIRCO, with Recall@1 on CIRR improving by up to 15.15 points.
Conclusion: The two-stage reasoning paradigm improves the robustness and portability of ZS-CIR and suggests a direction for future work.
Abstract: Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic relevance judgments ("Yes/No"), which goes beyond CLIP's global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
[165] GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data
Lubian Bai,Xiuyuan Zhang,Siqi Zhang,Zepeng Zhang,Haoyu Wang,Wei Qin,Shihong Du
Main category: cs.CV
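TL;DR: Proposes GeoLink, a multimodal framework that uses OpenStreetMap (OSM) data to enhance remote sensing (RS) foundation models. Fusing OSM data during both pretraining and downstream tasks improves performance on geospatial intelligence tasks.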
Details
Motivation: A modality gap separates remote sensing and OSM data, including differences in data structure, content, and spatial granularity. This makes effective fusion difficult, and most existing RS foundation models rely on imagery alone.
Method: Proposes the GeoLink framework, which derives multi-granularity learning signals from OSM data, uses cross-modal spatial correlations to guide information interaction, and introduces image mask-reconstruction for efficient pretraining. For downstream tasks it generates unimodal and multimodal fine-grained encodings to support a range of applications.
Result: Experiments show that adding OSM data during pretraining improves the RS image encoder, that fusing RS and OSM data in downstream tasks improves the model's adaptability to complex geographic scenarios, and that spatial correlation is crucial for multimodal fusion.
Conclusion: Multimodal synergy helps advance high-level geospatial artificial intelligence, and GeoLink offers an effective way to integrate remote sensing with vector geographic data.
Abstract: Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, the modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RS FM during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence.
Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and usage examples are released at https://github.com/bailubin/GeoLink_NeurIPS2025
[166] PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
Shian Du,Menghan Xia,Chang Liu,Xintao Wang,Jing Wang,Pengfei Wan,Di Zhang,Xiangyang Ji
Main category: cs.CV
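TL;DR: Proposes PatchVSR, a patch-wise video super-resolution method built on video diffusion priors. A dual-stream adapter and injected patch-location information yield high-fidelity, high-resolution patch-level detail and support efficient 4K super-resolution.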
Details
Motivation: Existing full-size video super-resolution methods based on pre-trained video generation models are computationally expensive, have a fixed output resolution, and are hard to apply directly to patch-wise processing.
Method: Proposes PatchVSR with a dual-stream adapter: a patch branch extracts features from the input patch to maintain content fidelity, and a global branch extracts context features from the resized full video. Patch-location information is injected to keep synthesis consistent with its context, and a multi-patch joint modulation mechanism ensures visual coherence.
Result: Experiments show that the method generates high-fidelity, high-resolution detail at the patch level and achieves efficient 4K video super-resolution from a 512x512 base model, with competitive performance.
Conclusion: PatchVSR is the first exploration of video diffusion priors for patch-wise video super-resolution. It addresses the incomplete semantics that pre-trained models face in patch-level generation and delivers efficient, flexible, high-quality super-resolution.
Abstract: Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessary intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512x512 resolution base model, with extremely high efficiency.
[167] Causally Guided Gaussian Perturbations for Out-Of-Distribution Generalization in Medical Imaging
Haoran Pei,Yuguang Yang,Kexin Liu,Baochang Zhang
Main category: cs.CV
TL;DR: Proposes CGP, a lightweight framework that improves out-of-distribution generalization through causally guided Gaussian perturbations.
Details
Motivation: Deep learning models generalize poorly out of distribution when real-world data, such as biomedical images, undergo distribution shift, and existing methods may overlook the causal mechanisms behind generalization.
Method: Uses a Vision Transformer to generate soft causal masks that guide the injection of spatially varying noise into input images, with stronger perturbations on background regions and weaker ones on foreground regions, so that the model relies on causally relevant features rather than spurious correlations.
Result: On the challenging WILDS benchmark Camelyon17, CGP consistently outperforms state-of-the-art OOD baselines.
Conclusion: Causally guided perturbation is an effective tool for reliable and interpretable generalization.
Abstract: Out-of-distribution (OOD) generalization remains a central challenge in deploying deep learning models to real-world scenarios, particularly in domains such as biomedical images, where distribution shifts are both subtle and pervasive. While existing methods often pursue domain invariance through complex generative models or adversarial training, these approaches may overlook the underlying causal mechanisms of generalization. In this work, we propose Causally-Guided Gaussian Perturbations (CGP), a lightweight framework that enhances OOD generalization by injecting spatially varying noise into input images, guided by soft causal masks derived from Vision Transformers. By applying stronger perturbations to background regions and weaker ones to foreground areas, CGP encourages the model to rely on causally relevant features rather than spurious correlations. Experimental results on the challenging WILDS benchmark Camelyon17 demonstrate consistent performance gains over state-of-the-art OOD baselines, highlighting the potential of causal perturbation as a tool for reliable and interpretable generalization.
[168] SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Christoph Timmermann,Hyunse Lee,Woojin Lee
Main category: cs.CV
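TL;DR: Proposes SeMoBridge, which maps images into the text modality to address CLIP's intra-modal misalignment in few-shot classification, with high efficiency and strong performance.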
Details
Motivation: In few-shot classification, CLIP's intra-modal misalignment leaves the image embedding space uncalibrated, which hurts direct image-to-image comparison.
Method: Designs a Semantic Modality Bridge (SeMoBridge) that projects images into the text modality while preserving their semantics. It has a closed-form solution and can be trained with multi-modal supervision that combines image and text alignment losses.
Result: SeMoBridge-T clearly outperforms existing methods in low-data settings such as 1, 2, and 4 shots, with much shorter training time.
Conclusion: SeMoBridge mitigates CLIP's modality misalignment and provides a lightweight, high-performing solution for few-shot classification.
Abstract: While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality, while keeping their semantic content intact through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time while overall outperforming other methods, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
[169] SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies
Gagandeep Singh,Samudi Amarsinghe,Urawee Thani,Ki Fung Wong,Priyanka Singh,Xue Li
Main category: cs.CV
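TL;DR: Proposes Segmentation-Guided Scoring (SGS), a lightweight method that requires no retraining and strengthens HAMMER's detection of global scene inconsistencies such as foreground-background mismatch.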
Details
Motivation: HAMMER performs well on the DGM4 dataset but fails when the subject does not match its background context, a failure attributed to label-space bias, local attention focus, and spurious text-foreground alignment.
Method: Person/face segmentation masks separate foreground and background regions, a joint vision-language model extracts embeddings, and region-aware coherence scores are computed and fused with HAMMER's original prediction.
Result: With negligible computational overhead, SGS improves binary detection, grounding, and token-level explanations, and significantly increases robustness to global manipulations.
Conclusion: Region-aware reasoning is important for multimodal disinformation detection, and SGS offers an effective inference-only enhancement for existing models.
Abstract: We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
[170] DGM4+: Dataset Extension for Global Scene Inconsistency
Gagandeep Singh,Samudi Amarsinghe,Priyanka Singh,Xue Li
Main category: cs.CV
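TL;DR: Extends the DGM4 dataset with 5,000 high-quality samples featuring foreground-background (FG-BG) inconsistencies and their combinations with text manipulations, yielding DGM4+, a more comprehensive benchmark for multimodal forgery detection.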
Details
Motivation: Existing datasets focus on local manipulations and do not cover the global inconsistencies, such as mismatched foregrounds and backgrounds, that are increasingly common in real-world forgeries. This limits how well multimodal forgery detectors can be evaluated.
Method: OpenAI's gpt-image-1 and carefully designed prompts generate human-centric news-style images that place authentic figures into absurd backdrops, with captions produced under three text conditions. Quality control pipelines filter the generated data.
Result: The resulting DGM4+ dataset adds three new manipulation categories (FG-BG, FG-BG+TA, and FG-BG+TS), filling the gap for global manipulations and forming a more challenging multimodal detection benchmark.
Conclusion: DGM4+ addresses the lack of global inconsistencies in existing datasets and is a useful resource for evaluating and improving multimodal forgery detectors such as HAMMER.
Abstract: The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies.
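The quality-control step in the DGM4+ abstract (one-to-three visible faces, perceptual hash deduplication, realistic headline length) could look roughly like the sketch below. This is not the authors' pipeline: the average-hash variant, the word-count bounds, the Hamming threshold, and the sample fields `num_faces`, `thumb`, and `headline` are all hypothetical choices made for illustration, and OCR-based text scrubbing is left out.

```python
from typing import Dict, List, Sequence


def average_hash(gray: Sequence[Sequence[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def quality_filter(samples: List[Dict], min_faces: int = 1, max_faces: int = 3,
                   min_words: int = 5, max_words: int = 20,
                   max_hash_dist: int = 0) -> List[Dict]:
    """Keep samples that pass face-count, headline-length, and near-duplicate checks.

    All thresholds are assumed values, not ones reported in the abstract.
    """
    kept: List[Dict] = []
    seen_hashes: List[int] = []
    for sample in samples:
        if not (min_faces <= sample["num_faces"] <= max_faces):
            continue  # too few or too many visible faces
        n_words = len(sample["headline"].split())
        if not (min_words <= n_words <= max_words):
            continue  # headline length is unrealistic
        h = average_hash(sample["thumb"])  # thumb: small grayscale thumbnail
        if any(hamming(h, prev) <= max_hash_dist for prev in seen_hashes):
            continue  # near-duplicate of an image already kept
        seen_hashes.append(h)
        kept.append(sample)
    return kept
```

A real pipeline would hash a downsampled grayscale thumbnail of each image (for example 8x8) and take the face count from a face detector; both are stand-ins here.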
We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
[171] Geometric Learning of Canonical Parameterizations of $2D$-curves
Ioana Ciuclea,Giorgio Longari,Alice Barbara Tumpach
Main category: cs.CV
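TL;DR: Proposes a geometric method based on sections of a principal fiber bundle that mods out symmetries in classification tasks (rotation, scaling, translation, and reparameterization) without relying on data augmentation, and optimizes the section to improve class separation.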
Details
Motivation: Conventional methods handle symmetries in vision and medical data through data augmentation, which is not sustainable. The goal is an interpretable geometric model that handles symmetries intrinsically, without augmentation.
Method: A section of a principal fiber bundle maps each orbit of objects to a canonical representative, simple metrics then measure dissimilarities between orbits, and the section is optimized to maximize class separation. The paper also presents a 2-parameter family of canonical curve parameterizations that contains the constant-speed parameterization.
Result: The method is illustrated on a dataset of object contours, where it handles translation, rotation, scaling, and reparameterization symmetries. Open-source code and a tutorial notebook are provided.
Conclusion: The geometric framework offers a sustainable alternative to data augmentation for handling symmetries, with good interpretability and a wide range of possible applications.
Abstract: Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a 2-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at the following link: https://github.com/GiLonga/Geometric-Learning.
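To make the idea of a canonical representative concrete, the sketch below handles only the special case the abstract names explicitly, the constant-speed parameterization, together with centering and rescaling to unit length. It is not the paper's 2-parameter family and it does not quotient out rotations; the function names are hypothetical, and curves are assumed to be open polylines with at least two distinct points.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def resample_constant_speed(pts: Sequence[Point], n: int) -> List[Point]:
    """Resample a polyline at n points equally spaced in arc length (n >= 2)."""
    cum = [0.0]  # cumulative arc length at each input vertex
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out: List[Point] = []
    j = 0  # index of the segment containing the current target length
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0 else (target - cum[j]) / seg
        (x0, y0), (x1, y1) = pts[j], pts[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def canonical_form(pts: Sequence[Point], n: int = 64) -> List[Point]:
    """Canonical representative modulo translation, scaling, and reparameterization."""
    res = resample_constant_speed(pts, n)
    length = sum(math.hypot(x1 - x0, y1 - y0)
                 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    cx = sum(x for x, _ in res) / n  # centroid of the resampled points
    cy = sum(y for _, y in res) / n
    return [((x - cx) / length, (y - cy) / length) for x, y in res]
```

Two samplings of the same shape that differ by a translation, a scale factor, and the spacing of their vertices map to the same canonical point list, so an ordinary Euclidean distance between canonical forms compares orbits rather than individual parameterized curves.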
A tutorial notebook showcasing an application of the code to a specific dataset is available at the following link: https://github.com/ioanaciuclea/geometric-learning-notebook
[172] EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models
Seamie Hayes,Ganesh Sistu,Ciarán Eising
Main category: cs.CV
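TL;DR: Proposes generating 3D pseudo-labels with foundation models (Grounded-SAM and Metric3Dv2) and densifying them with temporal information to improve self-supervised semantic occupancy prediction. The approach sharply reduces computational overhead and integrates easily into existing models, raising OccNeRF's mIoU from 9.73 to 14.09 (+45%). A lightweight model, EasyOcc, learns from the pseudo-labels alone and reaches 13.86 mIoU; in full-scene evaluation it reaches 7.71 mIoU, 31% above the previous best model.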
Details
Motivation: Existing self-supervised semantic occupancy prediction methods depend on computationally expensive rendering strategies such as novel view synthesis, which scale poorly and transfer badly across architectures. An efficient, general, low-overhead alternative is needed.
Method: The foundation models Grounded-SAM and Metric3Dv2 generate 3D pseudo-ground-truth labels, which are densified using temporal information. The pseudo-labels supervise training directly, avoiding complex rendering losses. A lightweight model, EasyOcc, learns from the pseudo-labels alone.
Result: Adding the pseudo-labels to OccNeRF raises mIoU from 9.73 to 14.09 (+45%). EasyOcc reaches 13.86 mIoU, and in full-scene evaluation without the camera mask it reaches 7.71 mIoU, 31% above the previous best model.
Conclusion: High-quality pseudo-labels from foundation models, combined with temporal information, improve self-supervised occupancy prediction, reduce training complexity, and transfer well across models.
Abstract: Self-supervised models have recently achieved notable advancements, particularly in the domain of semantic occupancy prediction. These models utilize sophisticated loss computation strategies to compensate for the absence of ground-truth labels. For instance, techniques such as novel view synthesis, cross-view rendering, and depth estimation have been explored to address the issue of semantic and depth ambiguity. However, such techniques typically incur high computational costs and memory usage during the training stage, especially in the case of novel view synthesis. To mitigate these issues, we propose 3D pseudo-ground-truth labels generated by the foundation models Grounded-SAM and Metric3Dv2, and harness temporal information for label densification. Our 3D pseudo-labels can be easily integrated into existing models, which yields substantial performance improvements, with mIoU increasing by 45%, from 9.73 to 14.09, when implemented into the OccNeRF model. This stands in contrast to earlier advancements in the field, which are often not readily transferable to other architectures. Additionally, we propose a streamlined model, EasyOcc, achieving 13.86 mIoU. This model conducts learning solely from our labels, avoiding complex rendering strategies mentioned previously. Furthermore, our method enables models to attain state-of-the-art performance when evaluated on the full scene without applying the camera mask, with EasyOcc achieving 7.71 mIoU, outperforming the previous best model by 31%.
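The abstract does not say how the 2D outputs are lifted into 3D labels, so the following is only one plausible reading: per-pixel metric depth and semantic labels are unprojected with a pinhole camera model, accumulated into a voxel grid over several frames (the temporal densification), and each voxel takes the majority label. The frame fields `depth`, `labels`, and `offset`, the translation-only ego motion, and the majority vote are all assumptions of this sketch, not details from the paper.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def unproject(u: int, v: int, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole unprojection of pixel (u, v) with metric depth into camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def build_pseudo_labels(frames: List[dict],
                        intrinsics: Tuple[float, float, float, float],
                        voxel_size: float) -> Dict[Voxel, str]:
    """Accumulate per-frame labeled points into voxels and majority-vote a label."""
    fx, fy, cx, cy = intrinsics
    votes: Dict[Voxel, Counter] = defaultdict(Counter)
    for frame in frames:
        tx, ty, tz = frame["offset"]  # assumed translation-only camera offset
        for v, (depth_row, label_row) in enumerate(zip(frame["depth"], frame["labels"])):
            for u, (d, label) in enumerate(zip(depth_row, label_row)):
                if d <= 0 or label is None:
                    continue  # no valid depth or no semantic mask at this pixel
                x, y, z = unproject(u, v, d, fx, fy, cx, cy)
                key = (math.floor((x + tx) / voxel_size),
                       math.floor((y + ty) / voxel_size),
                       math.floor((z + tz) / voxel_size))
                votes[key][label] += 1
    # Each occupied voxel gets the label seen most often across all frames.
    return {voxel: counts.most_common(1)[0][0] for voxel, counts in votes.items()}
```

Voxels hit by points from several frames become both denser and less noisy, which is the intuition behind using temporal context; a full implementation would apply complete camera poses rather than the simplified offsets used here.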
These findings highlight the critical importance of foundation models, temporal context, and the choice of loss computation space in self-supervised learning for comprehensive scene understanding.
[173] Predicting Penalty Kick Direction Using Multi-Modal Deep Learning with Pose-Guided Attention
Pasindu Ranasinghe,Pamudu Ranasinghe
Main category: cs.CV
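TL;DR: Proposes a real-time multi-modal deep learning framework that combines a CNN with an attention-equipped LSTM to predict penalty kick direction from RGB frames and pose keypoints. On a custom dataset of 755 penalty kicks it reaches 89% accuracy, beating single-modality baselines, with a 22 ms inference time suited to goalkeeper training and match analysis.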
Details
Motivation: Penalty kicks often decide matches, yet goalkeepers must anticipate the shot direction from the kicker's biomechanical cues within a very short time window, and existing methods struggle to combine accuracy with real-time speed.
Method: A dual-branch architecture is used: a MobileNetV2-based CNN extracts spatial features from RGB frames, while an LSTM with attention processes 2D pose keypoints. Pose information guides visual focus toward task-relevant regions, and a distance-based thresholding method segments sequences just before ball contact to keep inputs consistent.
Result: On a custom dataset of 755 penalty kick events from real matches, the model reaches 89% accuracy on a held-out test set, outperforming visual-only and pose-only baselines by 14-22%, with an inference time of 22 milliseconds.
Conclusion: The lightweight, interpretable multi-modal model predicts penalty kick direction efficiently and accurately, and is suitable for goalkeeper training, tactical analysis, and real-time game analytics.
Abstract: Penalty kicks often decide championships, yet goalkeepers must anticipate the kicker's intent from subtle biomechanical cues within a very short time window. This study introduces a real-time, multi-modal deep learning framework to predict the direction of a penalty kick (left, middle, or right) before ball contact. The model uses a dual-branch architecture: a MobileNetV2-based CNN extracts spatial features from RGB frames, while 2D keypoints are processed by an LSTM network with attention mechanisms. Pose-derived keypoints further guide visual focus toward task-relevant regions. A distance-based thresholding method segments input sequences immediately before ball contact, ensuring consistent input across diverse footage. A custom dataset of 755 penalty kick events was created from real match videos, with frame-level annotations for object detection, shooter keypoints, and final ball placement. The model achieved 89% accuracy on a held-out test set, outperforming visual-only and pose-only baselines by 14-22%. With an inference time of 22 milliseconds, the lightweight and interpretable design makes it suitable for goalkeeper training, tactical analysis, and real-time game analytics.
[174] Text-to-Scene with Large Reasoning Models
Frédéric Berdoz,Luca A. Lanzendörfer,Nick Tuninga,Roger Wattenhofer
Main category: cs.CV
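TL;DR: Reason-3D is a text-to-3D-scene generation method built on large reasoning models (LRMs). By combining object retrieval, layout constraints, and collision-aware spatial reasoning, it markedly improves scene quality and instruction adherence for complex prompts.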
Details
Motivation: Existing text-to-3D-scene methods struggle with complex geometries and object transformations and adhere weakly to complex instructions, so a framework with stronger reasoning ability is needed.
Method: Reason-3D uses large reasoning models for semantic understanding. It retrieves objects using captions that cover physical, functional, and contextual attributes, places them according to explicit and implicit layout constraints, and refines their positions with collision-aware spatial reasoning.
Result: On instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality.
Conclusion: Reason-3D demonstrates the spatial reasoning abilities of large reasoning models in 3D scene generation, and the codebase is released to support further research on LRM-based object retrieval and placement.
Abstract: Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
[175] EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Shigui Li,Wei Chen,Delu Zeng
Main category: cs.CV
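TL;DR: Proposes EVODiff, an information-theoretic inference method for diffusion models that reduces uncertainty during denoising by optimizing conditional entropy and significantly outperforms existing gradient-based solvers.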
Details
Motivation: Diffusion models excel at image generation but suffer from slow inference and training-inference discrepancies, and existing acceleration methods lack a theoretical foundation in information transmission efficiency.
Method: The inference process of diffusion models is analyzed from an information-theoretic perspective. The analysis shows that data prediction parameterization outperforms noise prediction and that optimizing conditional variance minimizes both transition and reconstruction errors, which leads to the EVODiff method.
Result: On CIFAR-10, EVODiff reduces reconstruction error by up to 45.5% relative to DPM-Solver++ (FID improves from 5.10 to 2.78 at 10 function evaluations). On ImageNet-256 it cuts the number of function evaluations by 25% (from 20 to 15), and in text-to-image generation it reduces artifacts.
Conclusion: Through entropy-aware variance optimization, EVODiff systematically lowers uncertainty during denoising and improves both the efficiency and the quality of diffusion model generation.
Abstract: Diffusion models (DMs) excel in image generation, but suffer from slow inference and training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.
[176] EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
Ruixiao Dong,Zhendong Wang,Keli Liu,Li Li,Ying Chen,Kai Li,Daowen Li,Houqiang Li
Main category: cs.CV
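TL;DR: EchoGen is a new subject-driven generation framework built on Visual Auto-Regressive (VAR) models. A dual-path injection strategy separates a subject's high-level semantic identity from its low-level details, enabling efficient, high-quality image generation.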
Details
Motivation: Existing subject-driven generation methods trade efficiency against quality: fine-tuning methods are computationally expensive and lack zero-shot capability, while feed-forward methods built on diffusion models are slow at inference. A method with both fast inference and high generation quality is needed.
Method: EchoGen uses a dual-path injection strategy. A semantic encoder extracts the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. A content encoder captures fine visual details, which are fused through a multi-modal attention mechanism to preserve texture and structure. The framework is built on VAR models for fast feed-forward generation.
Result: EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency.
Conclusion: EchoGen is, to the authors' knowledge, the first feed-forward subject-driven framework built on VAR models, and it balances generation quality and efficiency for subject-driven generation.
Abstract: Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency.
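The abstract gives no layer-level detail for the semantic path, so the sketch below shows only a common reading of "decoupled cross-attention": the image tokens attend to the text prompt and to the subject features through two separate key/value sets, and the two outputs are summed. Whether EchoGen does exactly this is an assumption, and `subj_scale` and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for small dense matrices (rows are tokens)."""
    d = len(q[0])
    out: Matrix = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def decoupled_cross_attention(q: Matrix,
                              k_text: Matrix, v_text: Matrix,
                              k_subj: Matrix, v_subj: Matrix,
                              subj_scale: float = 1.0) -> Matrix:
    """Attend to text and subject tokens separately, then add the two results."""
    text_out = attention(q, k_text, v_text)
    subj_out = attention(q, k_subj, v_subj)
    return [[t + subj_scale * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(text_out, subj_out)]
```

Keeping the two key/value sets separate means the subject branch can be scaled or switched off (`subj_scale = 0`) without changing how the model responds to the text prompt.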
Code and models will be released soon.
[177] EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting
Sachith Abeywickrama,Emadeldeen Eldele,Min Wu,Xiaoli Li,Chau Yuen
Main category: cs.CV
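TL;DR: Proposes EntroPE, an entropy-guided dynamic patch encoding framework that uses conditional entropy to detect transition points in a time series and place patch boundaries there, preserving temporal structure and improving forecasting performance.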
Details
Motivation: Existing patch-based time series models use fixed-length patches with arbitrary starting positions, which breaks temporal continuity and weakens short-term dependencies and representation learning.
Method: An Entropy-based Dynamic Patcher (EDP) uses conditional entropy to locate natural temporal shifts and place patch boundaries dynamically. An Adaptive Patch Encoder (APE) uses pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations, and a global transformer then models inter-patch dynamics.
Result: Experiments on several long-term forecasting benchmarks show that EntroPE improves both forecasting accuracy and computational efficiency over existing methods.
Conclusion: Entropy-guided dynamic patching is a promising paradigm for time series modeling that keeps the benefits of patching while modeling temporal structure more faithfully.
Abstract: Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose EntroPE (Entropy-Guided Dynamic Patch Encoder), a novel, temporally informed framework that dynamically detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. These embeddings are then processed by a global transformer to model inter-patch dynamics. Experiments across long-term forecasting benchmarks demonstrate that EntroPE improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at: https://github.com/Sachithx/EntroPE.
[178] Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis
Kyeongryeol Go
Main category: cs.CV
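TL;DR: Proposes an automated text-guided edge-case synthesis pipeline, built on a large language model and a text-to-image model, to improve the robustness of deep neural networks.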
Details
Motivation: Dataset bias limits the performance of deep neural networks, and curating edge cases by hand is slow and laborious, so an automated way to generate challenging training samples is needed.
Method: A large language model fine-tuned via preference learning rephrases image captions into diverse textual prompts, which steer a text-to-image model toward difficult visual scenarios used for data augmentation.
Result: On the FishEye8K object detection benchmark, the method is more robust than both naive augmentation and manually engineered prompts.
Conclusion: The work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, pointing toward more reliable and continuously improving AI systems.
Abstract: The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.
[179] Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Yuansen Liu,Haiming Tang,Jinlong Peng,Jiangning Zhang,Xiaozhong Ji,Qingdong He,Donghao Luo,Zhenye Gan,Junwei Zhu,Yunhang Shen,Chaoyou Fu,Chengjie Wang,Xiaobin Hu,Shuicheng Yan
Main category: cs.CV
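TL;DR: Proposes Human-MME, a comprehensive benchmark for evaluating how well multimodal large language models understand human-centric scenes, with diverse scenarios, progressive evaluation dimensions, and high-quality annotations.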
Details
Motivation: Existing benchmarks do not comprehensively evaluate fine-grained perception and higher-dimensional causal reasoning in human-centric scenes, and high-quality annotation is hard because of the complexity of the human body.
Method: A diverse human-scene dataset spanning 43 sub-fields is built, with eight evaluation dimensions and 19,945 real-world image-question pairs, supported by an automated annotation pipeline and a human-annotation platform.
Result: Extensive experiments on 17 state-of-the-art multimodal large language models reveal the limitations of current models on human-centric understanding tasks.
Conclusion: Human-MME provides a more holistic evaluation of multimodal large language models on human-centric scenes and can guide future research toward finer-grained and more reliable visual understanding.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination.
Extensive experiments on 17 state-of-the-art MLLMs expose their limitations and can guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
[180] Beyond Overall Accuracy: Pose- and Occlusion-driven Fairness Analysis in Pedestrian Detection for Autonomous Driving
Mohammad Khoshkdahan,Arman Akbari,Arash Akbari,Xuan Zhang
Main category: cs.CV
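TL;DR: Presents the first systematic evaluation of how pedestrian pose and joint occlusion affect the fairness of pedestrian detectors for autonomous driving. Specific poses and lower-body joint occlusions produce significant bias, and Cascade R-CNN performs best on both accuracy and fairness.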
Details
Motivation: Safe and reliable pedestrian detection is essential in autonomous driving, but the fairness of existing models is understudied, in particular how pose and occlusion affect detection across different groups of pedestrians.
Method: On the ECP-DP dataset, five pedestrian-specific detectors and three general-purpose models (YOLOv12 variants) are evaluated. Fairness is measured with the Equal Opportunity Difference (EOD), and statistical significance is assessed with the Z-test.
Result: Detectors are biased against pedestrians with parallel legs, straight elbows, and lateral views. Occlusion of lower-body joints hurts the detection rate more than occlusion of the upper body or head. Cascade R-CNN has the lowest overall miss-rate and the smallest bias across attributes.
Conclusion: Fairness problems caused by pedestrian pose and partial occlusion deserve attention, and future detectors should be optimized to recognize pedestrians in diverse poses more reliably and fairly.
Abstract: Pedestrian detection plays a critical role in autonomous driving (AD), where ensuring safety and reliability is important. While many detection models aim to reduce miss-rates and handle challenges such as occlusion and long-range recognition, fairness remains an underexplored yet equally important concern. In this work, we systematically investigate how variations in the pedestrian pose (including leg status, elbow status, and body orientation), as well as individual joint occlusions, affect detection performance. We evaluate five pedestrian-specific detectors (F2DNet, MGAN, ALFNet, CSP, and Cascade R-CNN) alongside three general-purpose models (YOLOv12 variants) on the EuroCity Persons Dense Pose (ECP-DP) dataset. Fairness is quantified using the Equal Opportunity Difference (EOD) metric across various confidence thresholds. To assess statistical significance and robustness, we apply the Z-test. Our findings highlight biases against pedestrians with parallel legs, straight elbows, and lateral views. Occlusion of lower body joints has a more negative impact on the detection rate compared to the upper body and head. Cascade R-CNN achieves the lowest overall miss-rate and exhibits the smallest bias across all attributes. To the best of our knowledge, this is the first comprehensive pose- and occlusion-aware fairness evaluation in pedestrian detection for AD.
[181] AttriGen: Automated Multi-Attribute Annotation for Blood Cell Datasets
Walid Houmaidi,Youssef Sabiri,Fatima Zahra Iguenfer,Amine Abouaomar
Main category: cs.CV
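TL;DR: Proposes AttriGen, a framework for fine-grained multi-attribute annotation in computer vision with a focus on cell microscopy images. Combining a CNN and a ViT on the PBC and WBCAtt datasets, it reaches 94.62% accuracy, improves interpretability, and lowers annotation cost.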
Details
Motivation: Multi-attribute classification in cell microscopy is understudied, and conventional approaches depend on manual annotation, which is slow and expensive, so an automated solution is needed.
Method: A dual-model architecture is used, with a CNN for cell type classification and a Vision Transformer for multi-attribute classification, trained and evaluated on two complementary datasets.
Result: The framework reaches 94.62% accuracy on multi-attribute classification, a new benchmark, while improving model interpretability and substantially reducing manual annotation time and cost.
Conclusion: AttriGen provides an efficient, scalable automated framework for multi-attribute annotation that can be extended to other computer vision tasks.
Abstract: We introduce AttriGen, a novel framework for automated, fine-grained multi-attribute annotation in computer vision, with a particular focus on cell microscopy where multi-attribute classification remains underrepresented compared to traditional cell type categorization. Using two complementary datasets: the Peripheral Blood Cell (PBC) dataset containing eight distinct cell types and the WBC Attribute Dataset (WBCAtt) that contains their corresponding 11 morphological attributes, we propose a dual-model architecture that combines a CNN for cell type classification, as well as a Vision Transformer (ViT) for multi-attribute classification achieving a new benchmark of 94.62% accuracy. Our experiments demonstrate that AttriGen significantly enhances model interpretability and offers substantial time and cost efficiency relative to conventional full-scale human annotation. Thus, our framework establishes a new paradigm that can be extended to other computer vision classification tasks by effectively automating the expansion of multi-attribute labels.
[182] TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos
Ioannis Kontostathis,Evlampios Apostolidis,Vasileios Mezaris
Main category: cs.CV
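TL;DR: Introduces the TSV360 dataset and the TSalV360 method for text-driven saliency detection in 360-degree videos. Combining a vision-language model with cross-modal attention, the method is competitive at customized saliency detection.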
Details
Motivation: The goal is to detect salient objects or events in 360-degree videos in a customized way, driven by a text description, which purely visual methods cannot do because they lack semantic guidance.
Method: A state-of-the-art visual saliency detection method is extended with a pretrained vision-language model, a similarity estimation module, and a viewport spatio-temporal cross-attention mechanism that fuse the text and panoramic video modalities.
Result: Quantitative and qualitative experiments on the newly built TSV360 dataset show that TSalV360 is competitive with a state-of-the-art visual method and can perform text-guided saliency detection.
Conclusion: TSalV360 enables text-driven saliency detection in 360-degree videos and shows the potential of multi-modal fusion for customized prediction of visual attention.
Abstract: In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method that takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism, to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degree videos.
[183] Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation
Chenyang Jiang,Zhengcen Li,Hang Zhao,Qiben Shan,Shaocong Wu,Jingyong Su
Main category: cs.CV
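TL;DR: Proposes GSDD, an efficient dataset distillation method based on a sparse 2D Gaussian representation. Encoding key information with a small number of Gaussian primitives improves storage efficiency and distillation performance, giving state-of-the-art results on several benchmarks.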
Details
Motivation: Conventional dataset distillation relies on dense pixel-level representations, which are redundant and hard to scale, so a more efficient and scalable sparse representation is needed to cut computation and storage costs.
Method: GSDD builds a sparse representation from 2D Gaussian primitives that keeps only the critical discriminative information in each distilled image, and uses CUDA-accelerated splatting operators for efficient parallel inference and training.
Result: GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets while keeping the computational and memory cost of encoding and decoding low.
Conclusion: GSDD is a simple, efficient, and scalable approach to dataset distillation, and sparse Gaussian representations outperform conventional dense pixel representations.
Abstract: Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while keeping encoding and decoding costs low. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
[184] An Experimental Study on Generating Plausible Textual Explanations for Video Summarization
Thomas Eleftheriadis,Evlampios Apostolidis,Vasileios Mezaris
Main category: cs.CV
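TL;DR: Studies how to generate plausible textual explanations for video summarization results, integrating a large multimodal model (LLaVA-OneVision) and using sentence embeddings to evaluate explanation plausibility.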
Details
Motivation: The aim is to make visual explanations in explainable AI more plausible, that is, better aligned with human reasoning and expectations.
Method: An existing multigranular explanation framework is extended with LLaVA-OneVision to produce textual descriptions, and plausibility is evaluated by quantifying semantic overlap with SBERT and SimCSE sentence embeddings.
Result: Experiments with the CA-SUM method on the SumMe and TVSum datasets examine whether more faithful explanations are also more plausible and identify the most appropriate approach for generating plausible textual explanations.
Conclusion: The proposed plausibility evaluation helps select video summarization explanations that better match human understanding.
Abstract: In this paper, we present our experimental study on generating plausible textual explanations for the outcomes of video summarization. For the needs of this study, we extend an existing framework for multigranular explanation of video summarization by integrating a SOTA Large Multimodal Model (LLaVA-OneVision) and prompting it to produce natural language descriptions of the obtained visual explanations. Next, we focus on one of the most desired characteristics for explainable AI, the plausibility of the obtained explanations, which relates to their alignment with human reasoning and expectations. Using the extended framework, we propose an approach for evaluating the plausibility of visual explanations by quantifying the semantic overlap between their textual descriptions and the textual descriptions of the corresponding video summaries, with the help of two methods for creating sentence embeddings (SBERT, SimCSE). Based on the extended framework and the proposed plausibility evaluation approach, we conduct an experimental study using a SOTA method (CA-SUM) and two datasets (SumMe, TVSum) for video summarization, to examine whether the more faithful explanations are also the more plausible ones, and identify the most appropriate approach for generating plausible textual explanations for video summarization.
[185] Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
Haiyang Zheng,Nan Pu,Wenjing Li,Nicu Sebe,Zhun Zhong
Main category: cs.CV
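TL;DR: Proposes the Multi-Granularity Conceptual Experts (MGCE) framework, which uses dynamic conceptual contrastive learning and multi-granularity expert collaborative learning to perform generalized category discovery without knowing the number of categories in advance, achieving the best performance on nine fine-grained visual benchmarks.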
Details
Motivation: Existing generalized category discovery methods fail to fully exploit multi-granularity conceptual information in visual data, and most require the number of unlabeled categories to be known in advance, which limits their use in real open-world scenarios. Method: Proposes the MGCE framework with two modules: Dynamic Conceptual Contrastive Learning (DCCL), which jointly optimizes feature learning and category discovery, and Multi-Granularity Experts Collaborative Learning (MECL), which introduces experts at different granularities and a concept alignment matrix for cross-expert collaboration; the framework can also automatically estimate the number of categories in unlabeled data. Result: Experiments on nine fine-grained visual recognition benchmarks show that MGCE outperforms existing methods, particularly in novel-class accuracy, with an average improvement of 3.6% over parametric approaches that require the exact number of categories. Conclusion: MGCE effectively mines multi-granularity visual concepts and fuses knowledge across granularities, achieving better category discovery in open-world settings where the number of categories is unknown. Abstract: Generalized Category Discovery (GCD) is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. A key challenge is that unlabeled data may contain both known and novel categories. Existing approaches suffer from two main limitations. First, they fail to exploit multi-granularity conceptual information in visual data, which limits representation quality. Second, most assume that the number of unlabeled categories is known during training, which is impractical in real-world scenarios. To address these issues, we propose a Multi-Granularity Conceptual Experts (MGCE) framework that adaptively mines visual concepts and integrates multi-granularity knowledge for accurate category discovery. MGCE consists of two modules: (1) Dynamic Conceptual Contrastive Learning (DCCL), which alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; and (2) Multi-Granularity Experts Collaborative Learning (MECL), which extends the single-expert paradigm by introducing additional experts at different granularities and by employing a concept alignment matrix for effective cross-expert collaboration. Importantly, MGCE can automatically estimate the number of categories in unlabeled data, making it suitable for practical open-world settings. Extensive experiments on nine fine-grained visual recognition benchmarks demonstrate that MGCE achieves state-of-the-art results, particularly in novel-class accuracy. Notably, even without prior knowledge of category numbers, MGCE outperforms parametric approaches that require knowing the exact number of categories, with an average improvement of 3.6%. Code is available at https://github.com/HaiyangZheng/MGCE.
[186] IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo,Chuanhao Yan,Xingqian Xu,Yulin Wang,Kai Wang,Gao Huang,Humphrey Shi
Main category: cs.CV
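TL;DR: Proposes Implicit Multimodal Guidance (IMG), a re-generation framework that needs no extra data or editing operations: a multimodal large language model detects misalignments, and an Implicit Aligner adjusts the diffusion conditioning features for re-generation, markedly improving multimodal alignment in text-to-image generation.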
Details
Motivation: Existing methods either finetune on high-quality preference data, which is hard to scale, or edit local regions at the risk of degrading overall image quality; an efficient alignment method that needs neither extra data nor editing is therefore desirable. Method: Uses a multimodal large language model (MLLM) to identify misalignments between the generated image and the prompt, designs an Implicit Aligner that adjusts the diffusion model's conditioning features to reduce them, and formulates a trainable Iteratively Updated Preference Objective to optimize the alignment. Result: Extensive experiments on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods in both qualitative and quantitative evaluations, and that it can serve as a plug-and-play adapter that enhances finetuning-based methods. Conclusion: IMG is an efficient multimodal alignment framework that requires no extra data or editing operations; it improves the alignment between diffusion-generated images and input prompts and integrates well with other methods. Abstract: Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
[187] Interpret, prune and distill Donut: towards lightweight VLMs for VQA on document
Adnan Ben Mansour,Ayoub Karine,David Naccache
Main category: cs.CV
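TL;DR: Proposes Donut-MINT, which compresses a vision-language model through knowledge distillation guided by mechanistic interpretability, reducing inference time and memory usage while maintaining performance.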
Details
Motivation: Large vision-language models such as Donut perform well on document-level visual question answering but are computationally expensive and hard to deploy in resource-constrained settings, so efficient compression methods are needed. Method: Uses mechanistic interpretability to analyze the model's internal computations and identify the essential subcomponents, which guides the design of the student architecture; a compact model is then trained by knowledge distillation, while non-essential components are approximated, skipped, or reparametrized. Result: Donut-MINT maintains strong performance on the DocVQA benchmark while reducing inference time and memory usage. Conclusion: The method reframes model compression as a circuit discovery task, bridging interpretability research and practical vision-language model deployment and offering a new perspective on efficient model design. Abstract: Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.
[188] Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
Zhejia Cai,Yandan Yang,Xinyuan Chang,Shiyi Liang,Ronghan Chen,Feng Xiong,Mu Xu,Ruqi Huang
Main category: cs.CV
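TL;DR: Proposes Farsighted-LAM and the SSM-VLA framework, which use geometry-aware spatial encoding and multi-scale temporal modeling to improve the robustness and generalizability of vision-language-action systems.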
Details
Motivation: To address the bottlenecks of existing Latent Action Models in spatial understanding and temporal perception, enabling more stable and clearer action modeling. Method: Designs Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, and builds on it SSM-VLA, an end-to-end system that combines structured perception with a visual Chain-of-Thought module for explicit reasoning. Result: Achieves state-of-the-art performance on multiple VLA tasks in simulation and real-world settings, supporting the effectiveness of the approach for the robustness and generalizability of embodied intelligence. Conclusion: Combining geometry-aware modeling, temporal coherence, and explicit reasoning effectively improves vision-language-action systems. Abstract: Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.
[189] PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
Tuan Nguyen,Naseem Khan,Khang Tran,NhatHai Phan,Issa Khalil
Main category: cs.CV
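TL;DR: Proposes Paragraph-level Relative Policy Optimization (PRPO), which uses reinforcement learning to align the reasoning of multimodal large language models with visual evidence, substantially improving the accuracy and interpretability of deepfake detection.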
Details
Motivation: Existing large language models reason poorly in deepfake detection, often producing explanations that are inconsistent with the visual evidence or hallucinated, and large, high-quality annotated datasets are scarce. Method: Builds a reasoning-annotated dataset for deepfake detection and proposes PRPO, a reinforcement learning algorithm that aligns the language model's reasoning with image content at the paragraph level. Result: PRPO improves detection accuracy by a wide margin, achieves a reasoning score of 4.55/5.0, and outperforms GRPO under test-time conditions. Conclusion: Aligning multimodal reasoning with visual evidence is essential for reliable and interpretable deepfake detection. Abstract: The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
[190] Cat: Post-training quantization error reduction via cluster-based affine transformation
Ali Zoljodi,Radu Timofte,Masoud Daneshtalab
Main category: cs.CV
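TL;DR: Proposes Cluster-based Affine Transformation (CAT), which reduces the accuracy loss of low-bit post-training quantization and outperforms prior methods on ImageNet-1K.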
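To make the idea of cluster-specific affine parameters concrete, the sketch below fits one scale and shift per cluster so that quantized outputs match their full-precision counterparts, and compares this with a single affine map shared by all outputs. It is only a toy under stated assumptions: the cluster assignments are given, the fit is ordinary least squares on scalar outputs, and all data and function names are hypothetical. The listing does not describe how CAT actually forms clusters or estimates its parameters.

```python
def fit_affine(xs, ys):
    """Least-squares scale a and shift b so that a * x + b approximates y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        return 1.0, mean_y - mean_x  # degenerate case: fall back to a pure shift
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def fit_cluster_affine(q_out, fp_out, cluster_ids):
    """Fit one (scale, shift) pair per cluster of outputs."""
    groups = {}
    for q, f, c in zip(q_out, fp_out, cluster_ids):
        qs, fs = groups.setdefault(c, ([], []))
        qs.append(q)
        fs.append(f)
    return {c: fit_affine(qs, fs) for c, (qs, fs) in groups.items()}


def apply_cluster_affine(q_out, cluster_ids, params):
    """Correct each quantized output with the parameters of its cluster."""
    return [params[c][0] * q + params[c][1] for q, c in zip(q_out, cluster_ids)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy calibration data (hypothetical): two clusters with different distortions.
q_out = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]        # low-bit model outputs
fp_out = [1.0, 3.0, 5.0, -3.0, -2.5, -2.0]    # full-precision outputs
cluster_ids = [0, 0, 0, 1, 1, 1]

per_cluster = fit_cluster_affine(q_out, fp_out, cluster_ids)
cat_out = apply_cluster_affine(q_out, cluster_ids, per_cluster)

# Baseline: one affine map shared by all outputs.
a, b = fit_affine(q_out, fp_out)
uniform_out = [a * q + b for q in q_out]
```

On this constructed data the per-cluster fit recovers the full-precision values while the shared map leaves a large residual. That outcome follows from how the toy data was built and says nothing about the paper's reported numbers.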
Details
Motivation: Low-bit post-training quantization is prone to significant accuracy degradation, and a conventional uniform affine transformation performs poorly in this regime. Method: Uses cluster-specific affine parameters to calibrate different groups of outputs, aligning the low-bit outputs with their full-precision counterparts, and integrates this into a new PTQ framework that improves the quantized model without fine-tuning. Result: On ImageNet-1K, W2A2 ResNet-18 reaches 53.18% Top-1 accuracy, and CAT improves existing PTQ methods by more than 3% when used as a plug-in. Conclusion: CAT mitigates accuracy degradation in low-bit quantization with low overhead, plug-and-play use, and broad compatibility. Abstract: Post-Training Quantization (PTQ) reduces the memory footprint and computational overhead of deep neural networks by converting full-precision (FP) values into quantized and compressed data types. While PTQ is more cost-efficient than Quantization-Aware Training (QAT), it is highly susceptible to accuracy degradation under a low-bit quantization (LQ) regime (e.g., 2-bit). Affine transformation is a classical technique used to reduce the discrepancy between the information processed by a quantized model and that processed by its full-precision counterpart; however, we find that using plain affine transformation, which applies a uniform affine parameter set for all outputs, worsens the results in low-bit PTQ. To address this, we propose Cluster-based Affine Transformation (CAT), an error-reduction framework that employs cluster-specific parameters to align LQ outputs with FP counterparts. CAT refines LQ outputs with only a negligible number of additional parameters, without requiring fine-tuning of the model or quantization parameters. We further introduce a novel PTQ framework integrated with CAT. Experiments on ImageNet-1K show that this framework consistently outperforms prior PTQ methods across diverse architectures and LQ settings, achieving up to 53.18% Top-1 accuracy on W2A2 ResNet-18. Moreover, CAT enhances existing PTQ baselines by more than 3% when used as a plug-in. We plan to release our implementation alongside the publication of this paper.
[191] ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi,Jacopo Staiano,Antonio Liotta
Main category: cs.CV
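TL;DR: Proposes ProfVLM, a generative vision-language model for skill proficiency estimation that generates expert-like feedback from multi-view videos, combining high accuracy with interpretability.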
Details
Motivation: Existing methods rely on black-box video classifiers, ignore multi-view context, and lack explainability, making it hard to provide detailed feedback on skill. Method: Proposes ProfVLM, which uses an AttentiveGatedProjector to dynamically fuse multi-view features from a frozen TimeSformer backbone and project them into a language model used for feedback generation, recasting skill assessment as generative reasoning. Result: Trained on EgoExo4D, ProfVLM surpasses state-of-the-art methods with up to 20x fewer parameters and up to 60% less training time, achieves higher accuracy across diverse activities, and outputs natural language critiques aligned with performance. Conclusion: Generative vision-language modeling offers a more efficient, transparent, and interpretable direction for skill assessment. Abstract: Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
[192] EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu,Sicong Jiang,Max Ku,Ping Nie,Minghao Liu,Wenhu Chen
Main category: cs.CV
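TL;DR: This paper proposes EditReward, a new model trained on a large-scale human preference dataset that markedly improves the performance of open-source image editing models on natural language instructions, and demonstrates its ability as a reward model to select high-quality training data.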
Details
Motivation: Existing open-source image editing models lack a reliable reward model, which makes it hard to scale up high-quality synthetic training data and leaves them behind closed-source models. Method: EditReward is trained on a human preference dataset of more than 200K preference pairs, annotated by trained experts following a rigorous protocol, to assess how well image editing results align with human preferences. Result: EditReward achieves state-of-the-art correlation with human judgments on GenAI-Bench, AURORA-Bench, ImagenHub, and the newly proposed \benchname, outperforming existing VLM-as-judge models; Step1X-Edit trained on the ShareGPT-4o-Image subset selected by EditReward improves significantly over training on the full set. Conclusion: EditReward can serve as a reward model for building high-quality image editing data and shows potential for reinforcement learning-based post-training and test-time scaling; the model and its dataset will be released to support the community. Abstract: Recently, we have witnessed great progress in image editing with natural language instructions. Several closed-source models like GPT-Image-1, Seedream, and Google-Nano-Banana have shown highly promising progress. However, the open-source models are still lagging. The main bottleneck is the lack of a reliable reward model to scale up high-quality synthetic training data. To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks. Experiments show that EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new \benchname, outperforming a wide range of VLM-as-judge models. Furthermore, we use EditReward to select a high-quality subset from the existing noisy ShareGPT-4o-Image dataset. We train Step1X-Edit on the selected subset, which shows significant improvement over training on the full set. This demonstrates EditReward's ability to serve as a reward model to scale up high-quality training data for image editing. Furthermore, its strong alignment suggests potential for advanced applications like reinforcement learning-based post-training and test-time scaling of image editing models. EditReward with its training dataset will be released to help the community build more high-quality image editing training datasets.
[193] Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization
Teng Zhang,Ziqian Fan,Mingxin Liu,Xin Zhang,Xudong Lu,Wentong Li,Yue Zhou,Yi Yu,Xiang Li,Junchi Yan,Xue Yang
Main category: cs.CV
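TL;DR: This paper presents Point2RBox-v3, which uses Progressive Label Assignment (PLA) and a Prior-Guided Dynamic Mask Loss (PGDM-Loss) to improve point-supervised, weakly supervised oriented object detection, performing especially well in sparse and multi-scale scenes.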
Details
Motivation: Existing weakly supervised oriented object detection methods based on point annotations use pseudo labels inefficiently and produce pseudo labels of poor quality, which calls for better label assignment and loss design. Method: Two core mechanisms: 1) Progressive Label Assignment (PLA), which dynamically estimates instance sizes during training to enable label assignment; 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss), which combines the strengths of SAM and the watershed algorithm to overcome their respective weaknesses in dense and sparse scenes. Result: Achieves 66.09%, 56.86%, 41.28%, 46.40%, 19.60%, and 45.96% on DOTA-v1.0, DOTA-v1.5, DOTA-v2.0, DIOR, STAR, and RSAR, respectively, with competitive performance especially in scenes with sparse objects or large variations in object size. Conclusion: To the authors' knowledge, Point2RBox-v3 is the first model to use dynamic pseudo labels for label assignment, and it combines the complementary strengths of SAM and the watershed algorithm to perform well in both sparse and dense scenes. Abstract: Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM's poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.
[194] Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Zhen Yang,Zi-Yi Dou,Di Feng,Forrest Huang,Anh Nguyen,Keen You,Omar Attia,Yuhao Yang,Michael Feng,Haotian Zhang,Ram Ramrakhya,Chao Jia,Jeffrey Nichols,Alexander Toshev,Yinfei Yang,Zhe Gan
Main category: cs.CV
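TL;DR: This paper presents Ferret-UI Lite, a lightweight, end-to-end, cross-platform GUI agent (3B parameters) that improves small-model performance through multi-source training data, chain-of-thought reasoning, visual tool-use, and reinforcement learning, and performs competitively on several GUI benchmarks.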
Details
Motivation: Building small GUI agents that run efficiently on resource-constrained devices such as phones remains challenging, and existing methods struggle to balance performance and practicality. Method: Builds a compact 3B-parameter model using a diverse GUI data mixture from real and synthetic sources, combined with chain-of-thought reasoning, visual tool-use, and reinforcement learning with designed rewards. Result: On GUI grounding, scores 91.6%, 53.3%, and 61.2% on ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G, respectively; on GUI navigation, reaches success rates of 28.0% on AndroidWorld and 19.8% on OSWorld. Conclusion: Ferret-UI Lite shows that small, on-device GUI agents are feasible and that data mixing, inference-time reasoning strategies, and reinforcement learning help small models, providing practical lessons for future lightweight agents. Abstract: Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent through curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of 91.6%, 53.3%, and 61.2% on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of 28.0% on AndroidWorld and 19.8% on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
[195] FLOWER: A Flow-Matching Solver for Inverse Problems
Mehrsa Pourya,Bassam El Rawas,Michael Unser
Main category: cs.CV
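TL;DR: Flower is a solver for inverse problems that uses a pre-trained flow model to produce reconstructions consistent with the observed measurements, achieving state-of-the-art reconstruction quality across several tasks.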
Details
Motivation: Conventional inverse-problem solvers are limited in generality and reconstruction quality; a framework that unifies the strengths of plug-and-play methods and generative solvers is needed. Method: Using a pre-trained flow model, Flower iterates over three steps: flow-consistent destination estimation, refinement by projection onto a feasible set defined by the forward operator, and a time-progression step that re-projects along the flow trajectory. Result: Theoretical analysis shows that Flower approximates Bayesian posterior sampling, and experiments show state-of-the-art reconstruction quality across various inverse problems with nearly identical hyperparameters. Conclusion: Flower combines the strengths of generative models and iterative optimization, providing a general, stable, and high-performing framework for inverse problems. Abstract: We introduce Flower, a solver for inverse problems. It leverages a pre-trained flow model to produce reconstructions that are consistent with the observed measurements. Flower operates through an iterative procedure over three steps: (i) a flow-consistent destination estimation, where the velocity network predicts a denoised target; (ii) a refinement step that projects the estimated destination onto a feasible set defined by the forward operator; and (iii) a time-progression step that re-projects the refined destination along the flow trajectory. We provide a theoretical analysis that demonstrates how Flower approximates Bayesian posterior sampling, thereby unifying perspectives from plug-and-play methods and generative inverse solvers. On the practical side, Flower achieves state-of-the-art reconstruction quality while using nearly identical hyperparameters across various inverse problems.
[196] Continuous Space-Time Video Super-Resolution with 3D Fourier Fields
Alexander Becker,Julius Erbach,Dominik Narnhofer,Konrad Schindler
Main category: cs.CV
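TL;DR: Proposes a new approach to video super-resolution based on a continuous space-time Video Fourier Field (VFF), whose joint modeling yields more efficient, sharper, and temporally more consistent reconstructions.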
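The appeal of a sinusoidal field is that it can be evaluated at any continuous space-time coordinate, and that Gaussian blurring has a closed form: convolving a sinusoid with a Gaussian of standard deviation sigma only scales its amplitude by exp(-sigma^2 |w|^2 / 2), where w is its angular frequency. The sketch below illustrates both points for a single scalar channel. The coefficient layout, the function name, and the restriction of the blur to the spatial axes are assumptions made for illustration; the paper's actual basis and parameterization are not given in this listing.

```python
import math


def sample_fourier_field(coeffs, x, y, t, sigma=0.0):
    """Evaluate a sum of space-time sinusoids at a continuous point (x, y, t).

    coeffs: iterable of (a, b, fx, fy, ft), where a and b weight the cosine and
            sine terms and fx, fy, ft are frequencies in cycles per unit.
    sigma:  standard deviation of an isotropic Gaussian blur in (x, y).
    """
    value = 0.0
    for a, b, fx, fy, ft in coeffs:
        phase = 2.0 * math.pi * (fx * x + fy * y + ft * t)
        # Convolving a sinusoid of angular frequency w with a Gaussian of
        # standard deviation sigma scales it by exp(-sigma^2 * |w|^2 / 2).
        w_sq = (2.0 * math.pi) ** 2 * (fx * fx + fy * fy)
        attenuation = math.exp(-0.5 * sigma * sigma * w_sq)
        value += attenuation * (a * math.cos(phase) + b * math.sin(phase))
    return value


# Hypothetical coefficients: a constant term, a low and a high spatial frequency.
coeffs = [
    (1.0, 0.0, 0.0, 0.0, 0.0),
    (0.5, 0.0, 1.0, 0.0, 0.0),
    (0.25, 0.0, 8.0, 0.0, 0.5),
]

sharp = sample_fourier_field(coeffs, 0.0, 0.0, 0.0)
blurred = sample_fourier_field(coeffs, 0.0, 0.0, 0.0, sigma=0.1)
between_frames = sample_fourier_field(coeffs, 0.3, 0.7, 0.25)
```

With a nonzero sigma the high-frequency term is suppressed almost entirely while the constant term is untouched, which is the mechanism that allows aliasing-free sampling at a coarser output resolution.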
Details
Motivation: Conventional methods decouple spatial and temporal representations and rely on brittle, explicit frame warping for motion compensation, which limits performance and robustness. Method: Encodes video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF), predicts its coefficients with a neural encoder that has a large spatio-temporal receptive field, and includes an analytical Gaussian point spread function to avoid aliasing. Result: Sets a new state of the art on multiple benchmarks, supports sampling at arbitrary space-time locations, and produces sharper, temporally more consistent results while being computationally more efficient. Conclusion: The proposed VFF framework outperforms existing methods in space-time super-resolution, combining high-quality reconstruction with efficient inference. Abstract: We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: https://v3vsr.github.io.
[197] SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval
Ren-Di Wu,Yu-Yen Lin,Huei-Fang Yang
Main category: cs.CV
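TL;DR: This paper presents SQUARE, a training-free framework for zero-shot composed image retrieval that uses a multimodal large language model (MLLM) in two stages, Semantic Query-Augmented Fusion and Efficient Batch Reranking, to improve retrieval performance.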
Details
Motivation: Existing zero-shot composed image retrieval methods struggle to capture user intent accurately, and more effective training-free methods are needed to improve retrieval accuracy. Method: SQUARE has two stages: the first uses an MLLM to generate captions of the target image that enrich the query embedding of a vision-language model such as CLIP; the second presents the candidate images to the MLLM as an image grid with visual marks for joint visual-semantic reasoning, completing the reranking in a single pass. Result: SQUARE performs strongly on four standard CIR benchmarks and maintains high performance even with lightweight pre-trained models. Conclusion: By combining the MLLM's semantic understanding with vision-language alignment, SQUARE improves zero-shot composed image retrieval and shows good generalization and practicality. Abstract: Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
[198] TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu,Minghao Qin,Yan Shu,Zhengyang Liang,Yang Tian,Chen Jason Zhang,Bo Zhao,Zheng Liu
Main category: cs.CV
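TL;DR: This paper introduces a new problem, Task-oriented Temporal Grounding (ToTG), which aims to localize the time intervals in a long video that contain key information given a natural-language task description, and builds the ToTG Bench benchmark and the TimeScope framework to address the limited generalizability of conventional methods and their difficulty with long videos.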
Details
Motivation: Existing temporal grounding methods generalize poorly on long videos and struggle to identify the moments relevant to a specific task, so a method that localizes time intervals from a task description is needed. Method: Proposes TimeScope, built on progressive reasoning: it first identifies a coarse-grained temporal scope in the long video and then refines it through fine-grained moment partitioning; a high-quality dataset, ToTG Pile, is curated to improve the model. Result: Experiments show that TimeScope outperforms existing temporal grounding methods and popular MLLMs across various settings on the ToTG task. Conclusion: Through progressive reasoning, TimeScope effectively addresses task-oriented temporal grounding and shows good potential for application and extension. Abstract: Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Task-oriented Temporal Grounding (ToTG), which aims to localize time intervals containing the necessary information based on a task's natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating the performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through fine-grained moment partitioning. Additionally, we curate a high-quality dataset, namely ToTG Pile, to enhance TimeScope's ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporal-grounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new challenging problem.
[199] Go with Your Gut: Scaling Confidence for Autoregressive Image Generation
Harold Haodong Chen,Xianfeng Wu,Wen-Jie Shu,Rongjin Guo,Disen Lan,Harry Yang,Ying-Cong Chen
Main category: cs.CV
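TL;DR: This paper presents ScalingAR, the first test-time scaling framework designed specifically for next-token-prediction autoregressive image generation; it uses token entropy as a new signal and improves generation efficiency and robustness without early decoding or auxiliary rewards.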
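To show what an entropy-based confidence signal can look like, the sketch below computes the Shannon entropy of each next-token distribution, turns it into a normalized confidence, and flags a trajectory for early termination when the running mean falls below a threshold. This is a simplified stand-in: the function names, the threshold of 0.3, and the toy logits are hypothetical, and ScalingAR's calibrated confidence state fuses intrinsic and conditional signals in a way this listing does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_confidence(step_logits):
    """Average of 1 - H / log(V) over decoding steps (vocabulary size V > 1)."""
    max_entropy = math.log(len(step_logits[0]))
    scores = [1.0 - token_entropy(l) / max_entropy for l in step_logits]
    return sum(scores) / len(scores)


def should_terminate(step_logits, threshold=0.3):
    """Assumed policy: stop a trajectory whose mean confidence is too low."""
    return mean_confidence(step_logits) < threshold


# Hypothetical logits over a 4-token vocabulary for two short trajectories.
confident = [[10.0, 0.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0]]
uncertain = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.1]]
```

A uniform distribution has maximal entropy and therefore zero confidence under this normalization, while a sharply peaked one scores close to 1.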
Details
Motivation: Existing test-time scaling methods rely on frequent partial decoding and external reward models, which suits next-token-prediction image generation poorly because intermediate results are incomplete; an approach adapted to this setting is needed. Method: Proposes ScalingAR with two levels: the Profile Level streams a calibrated confidence state by fusing intrinsic and conditional signals, and the Policy Level uses this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance strength. Result: Improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, reduces visual token consumption by 62.0%, and mitigates performance drops by 26.0% in challenging scenarios. Conclusion: ScalingAR addresses test-time scaling for NTP-based AR image generation and improves on baselines in efficiency, performance, and robustness. Abstract: Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
[200] PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Zhiwei Yang,Chen Gao,Mike Zheng Shou
Main category: cs.CV
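TL;DR: This paper presents PANDA, an MLLM-based agentic AI engineer that performs generalist video anomaly detection without training data or human involvement.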
Details
Motivation: Conventional methods depend on domain-specific training data and manual adjustment, generalize poorly, are costly, and struggle with new scenes and unseen anomaly types. Method: Devises four key capabilities: self-adaptive scene-aware strategy planning, goal-driven heuristic reasoning, tool-augmented self-reflection, and self-improving chain-of-memory. Concretely, these are realized by a self-adaptive scene-aware RAG mechanism, a latent anomaly-guided heuristic prompt strategy, a progressive reflection mechanism combined with context-aware tools, and a chain-of-memory mechanism. Result: Experiments show that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training or human involvement. Conclusion: PANDA generalizes well and is robust, achieving generalist video anomaly detection. Abstract: Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
[201] MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
Chenhui Zhu,Yilu Wu,Shuai Wang,Gangshan Wu,Limin Wang
Main category: cs.CV
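TL;DR: This paper presents MotionRAG, a retrieval-augmented framework that improves motion realism in image-to-video generation by transferring motion priors from reference videos; with a context-aware motion adaptation mechanism, it achieves zero-shot generalization across domains and more realistic dynamics with negligible computational overhead.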
Details
Motivation: 现有的图像到视频生成方法在生成真实运动方面仍面临挑战,主要由于难以准确建模物理约束、物体交互和特定领域的动态行为。因此,需要一种能够有效捕捉并迁移真实运动先验的方法来提升生成质量。 Method: 提出MotionRAG框架,包含三个关键技术:(1)基于检索的流水线,利用视频编码器和专用重采样器提取高层运动特征;(2)通过因果Transformer架构实现上下文学习的运动适应机制(CAMA);(3)基于注意力机制的运动注入适配器,将迁移的运动特征无缝集成到预训练的视频扩散模型中。 Result: 实验表明,该方法在多个领域和不同基础模型上均显著优于现有方法,推理时计算开销极小,并且通过更新检索数据库即可实现对新领域的零样本泛化。 Conclusion: MotionRAG通过有效检索与迁移运动先验,显著提升了视频生成中运动的真实性和可控性,其模块化设计支持无需重新训练的灵活扩展,为视频生成系统提供了更强的核心能力。 Abstract: Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. 
This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
[202] Image-Difficulty-Aware Evaluation of Super-Resolution Models
Atakan Topaloglu,Ahmet Bilican,Cansu Korkmaz,A. Murat Tekalp
Main category: cs.CV
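TL;DR: Proposes an image-difficulty-aware method for evaluating super-resolution models. Image difficulty is measured with a high-frequency index and a rotation-invariant edge index, and the evaluation is revised to reflect visual differences between models on easy and hard images.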
Details
Motivation: Conventional average scores cannot reflect how super-resolution models differ across images of varying difficulty, and they hide the artifacts some models produce on difficult images.
Method: Two image-difficulty measures are proposed, the high-frequency index and the rotation-invariant edge index, and are combined into a new evaluation procedure that better separates models with visually different outputs.
Result: Experiments show that the proposed difficulty measures and evaluation method identify performance differences across difficulty levels and reflect differences in visual quality in objective measures.
Conclusion: The difficulty-aware evaluation outperforms conventional average scoring and assesses the practical performance of super-resolution models more accurately and in finer detail.
Abstract: Image super-resolution models are commonly evaluated by average scores (over some benchmark test sets). Such averages fail to reflect how these models perform on images of varying difficulty, and they hide the fact that some models generate artifacts on certain difficult images. We propose difficulty-aware performance evaluation procedures to better differentiate between SISR models that produce visually different results on some images but yield close average performance scores over the entire test set. In particular, we propose two image-difficulty measures, the high-frequency index and rotation-invariant edge index, to predict the test images on which one model would yield significantly better visual results than another, and an evaluation method in which these visual differences are reflected in objective measures. Experimental results demonstrate the effectiveness of the proposed image-difficulty measures and evaluation methodology.
[203] PRISM: Progressive Rain removal with Integrated State-space Modeling
Pengze Xue,Shanwen Wang,Fei Zhou,Yan Cui,Xin Sun
Main category: cs.CV
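TL;DR: Proposes PRISM, a progressive deraining framework with three stages (coarse extraction, frequency fusion, and refinement) that combines multi-scale feature aggregation with hybrid-domain modeling to remove rain effectively while preserving image detail.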
Details
Motivation: Existing single-scale deraining models fall short in fine-grained recovery and global consistency, and cannot meet the needs of demanding vision tasks such as autonomous driving.
Method: PRISM is a three-stage framework consisting of CENet, SFNet, and RNet. It uses HA-UNet for multi-scale feature aggregation, introduces HDMamba to jointly model spatial semantics and wavelet-domain features, and recovers fine structures with an original-resolution subnetwork.
Result: On multiple datasets it achieves results comparable to or better than recent deraining methods, improving the quality of derained images, particularly in removing high-frequency rain streaks and preserving structure.
Conclusion: Through progressive multi-scale processing and hybrid-domain modeling, PRISM markedly improves deraining performance while balancing detail preservation and global consistency, making it suitable for high-quality visual restoration.
Abstract: Image deraining is an essential vision technique that removes rain streaks and water droplets, enhancing clarity for critical vision tasks like autonomous driving. However, current single-scale models struggle with fine-grained recovery and global consistency. To address this challenge, we propose Progressive Rain removal with Integrated State-space Modeling (PRISM), a progressive three-stage framework: Coarse Extraction Network (CENet), Frequency Fusion Network (SFNet), and Refine Network (RNet). Specifically, CENet and SFNet utilize a novel Hybrid Attention UNet (HA-UNet) for multi-scale feature aggregation by combining channel attention with windowed spatial transformers. Moreover, we propose Hybrid Domain Mamba (HDMamba) for SFNet to jointly model spatial semantics and wavelet domain characteristics. Finally, RNet recovers the fine-grained structures via an original-resolution subnetwork. Our model learns high-frequency rain characteristics while preserving structural details and maintaining global context, leading to improved image quality. Our method achieves competitive results on multiple datasets against recent deraining methods.
[204] Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models
Donghoon Kim,Dongyoung Lee,Ik Joon Chang,Sung-Ho Bae
Main category: cs.CV
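TL;DR: Proposes QuaRTZ, a 4-bit post-training quantization method for diffusion models that uses residual truncation and zero suppression to preserve texture detail while substantially reducing computational cost.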
Details
Motivation: Existing 4-bit quantization methods incur large errors on low-magnitude activations, causing texture loss in generated images and making efficient deployment hard without sacrificing quality.
Method: 8-bit min-max quantization handles outliers, and the result is compressed to 4 bits by leading-zero suppression, which retains the least significant bits (LSBs), reducing rounding error and improving quantization efficiency.
Result: On FLUX.1-schnell, 4-bit QuaRTZ achieves an FID of 6.98, outperforming SVDQuant, which requires auxiliary FP16 branches.
Conclusion: QuaRTZ balances outlier preservation against low-bit precision, enabling efficient 4-bit quantization of diffusion models with good generalizability and practicality.
Abstract: Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. Although 8-bit outlier-aware post-training quantization (PTQ) matches full-precision performance, extending PTQ to 4 bits remains challenging. Larger step sizes in 4-bit quantization amplify rounding errors in dense, low-magnitude activations, leading to the loss of fine-grained textures. We hypothesize that not only outliers but also small activations are critical for texture fidelity. To this end, we propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. QuaRTZ applies 8-bit min-max quantization for outlier handling and compresses to 4 bits via leading-zero suppression to retain LSBs, thereby preserving texture details. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision. Both theoretical derivations and empirical evaluations demonstrate the generalizability of QuaRTZ across diverse activation distributions. Notably, 4-bit QuaRTZ achieves an FID of 6.98 on FLUX.1-schnell, outperforming SVDQuant, which requires auxiliary FP16 branches.
[205] Multi-View Camera System for Variant-Aware Autonomous Vehicle Inspection and Defect Detection
Yash Kulkarni,Raman Jha,Renu Kachhoria
Main category: cs.CV
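TL;DR: This paper presents AVI, an end-to-end multi-view vehicle inspection platform that combines deep learning with a semantic rule engine to deliver real-time, variant-aware quality control.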
Details
Motivation: On modern automotive production lines, ensuring that every vehicle matches its variant specification and has no visible defects is increasingly complex, which calls for an efficient and accurate automated inspection solution.
Method: Eleven synchronized cameras capture a 360-degree view of each vehicle. YOLOv8 detects parts, EfficientNet classifies ICE versus EV, Gemini-1.5 Flash performs mascot OCR, and YOLOv8-Seg segments scratches and dents; a view-aware fusion layer and a VIN-conditioned rule engine then produce an interpretable inspection report.
Result: On a dataset covering four vehicle models plus public defect data, the system reaches 93% verification accuracy and 86% defect-detection recall at 3.3 vehicles per minute, significantly outperforming single-view and no-segmentation baselines.
Conclusion: According to the authors, AVI is the first publicly reported system to unify multi-camera feature validation with defect detection in an industrially deployable setting, offering high accuracy, real-time operation, and interpretability.
Abstract: Ensuring that every vehicle leaving a modern production line is built to the correct variant specification and is free from visible defects is an increasingly complex challenge. We present the Automated Vehicle Inspection (AVI) platform, an end-to-end, multi-view perception system that couples deep-learning detectors with a semantic rule engine to deliver variant-aware quality control in real time. Eleven synchronized cameras capture a full 360° sweep of each vehicle; task-specific views are then routed to specialised modules: YOLOv8 for part detection, EfficientNet for ICE/EV classification, Gemini-1.5 Flash for mascot OCR, and YOLOv8-Seg for scratch-and-dent segmentation. A view-aware fusion layer standardises evidence, while a VIN-conditioned rule engine compares detected features against the expected manifest, producing an interpretable pass/fail report in approximately 300 ms. On a mixed data set of Original Equipment Manufacturer (OEM) vehicle data sets of four distinct models plus public scratch/dent images, AVI achieves 93% verification accuracy, 86% defect-detection recall, and sustains 3.3 vehicles/min, surpassing single-view or no-segmentation baselines by large margins. To our knowledge, this is the first publicly reported system that unifies multi-camera feature validation with defect detection in a deployable automotive setting in industry.
[206] Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting
Hanzhou Liu,Jia Huang,Mi Lu,Srikanth Saripalli,Peng Jiang
Main category: cs.CV
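TL;DR: Stylos is a single-forward 3D Gaussian framework that needs neither pose estimation nor per-scene optimization and achieves geometry-aware, view-consistent 3D style transfer.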
Details
Motivation: Existing 3D style transfer methods usually depend on accurate pose estimation or scene-specific optimization, which limits generalization to unseen categories and scenes. A more general and efficient framework is needed for high-quality zero-shot 3D style transfer.
Method: Stylos uses a Transformer-based two-pathway architecture: one pathway predicts geometry with self-attention to preserve geometric fidelity, and the other injects style through global cross-attention to enforce visual consistency across views. A voxel-based 3D style loss aligns aggregated scene features with style statistics, yielding view-consistent stylization that preserves geometry.
Result: Experiments on multiple datasets show that Stylos delivers high-quality zero-shot stylization from single-view to large-scale multi-view settings, with good generalization and scalability. It is reported to outperform existing methods in geometry preservation, style consistency, and cross-category transfer.
Conclusion: Through global style-content coupling and the proposed 3D style loss, Stylos achieves efficient 3D style transfer without pose input or per-scene optimization, opening new possibilities for 3D content creation.
Abstract: We present Stylos, a single-forward 3D Gaussian framework for 3D style transfer that operates on unposed content, from a single image to a multi-view collection, conditioned on a separate reference style image. Stylos synthesizes a stylized 3D Gaussian scene without per-scene optimization or precomputed poses, achieving geometry-aware, view-consistent stylization that generalizes to unseen categories, scenes, and styles. At its core, Stylos adopts a Transformer backbone with two pathways: geometry predictions retain self-attention to preserve geometric fidelity, while style is injected via global cross-attention to enforce visual consistency across views. With the addition of a voxel-based 3D style loss that aligns aggregated scene features to style statistics, Stylos enforces view-consistent stylization while preserving geometry. Experiments across multiple datasets demonstrate that Stylos delivers high-quality zero-shot stylization, highlighting the effectiveness of global style-content coupling, the proposed 3D style loss, and the scalability of our framework from single view to large-scale multi-view settings.
[207] Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification
Artur Barros,Carlos Caetano,João Macedo,Jefersson A. dos Santos,Sandra Avila
Main category: cs.CV
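TL;DR: Proposes ASGRA, a scene-graph-based attention framework for classifying indoor scenes and sensitive content such as CSAI. Its graph representation provides explainability and privacy preservation, and it performs well on Places8 and on real-world CSAI data.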
Details
Motivation: Indoor scene classification is challenging because of complex object relationships and layouts, and sensitive content analysis must balance performance with privacy.
Method: Images are converted into scene graphs, and a Graph Attention Network (GAT) performs inference, directly modeling interactions between scene components.
Result: 81.27% balanced accuracy on Places8, surpassing image-based methods; 74.27% balanced accuracy in a real-world CSAI evaluation with law enforcement.
Conclusion: Structured scene representations such as scene graphs are an effective and robust paradigm for indoor and sensitive content classification, with explainability and privacy benefits.
Abstract: Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
[208] CBAM Integrated Attention Driven Model For Betel Leaf Diseases Classification With Explainable AI
Sumaiya Tabassum,Md. Faysal Ahamed
Main category: cs.CV
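TL;DR: This paper proposes a lightweight CBAM-CNN model (2.13 million parameters, 8.13 MB) that incorporates the CBAM attention mechanism. On a dataset of 10,185 images it reaches 95.58% accuracy, outperforming conventional pre-trained CNN models, and Grad-CAM is used for explainability analysis.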
Details
Motivation: Identifying betel leaf diseases is vital to protecting crop yield and economic value. Traditional methods struggle to detect disease promptly and accurately, and existing deep learning models mostly rely on heavy pre-trained networks unsuited to resource-constrained settings, so an efficient, lightweight, and accurate model is needed.
Method: A lightweight CNN integrating CBAM (Convolutional Block Attention Module) strengthens key feature extraction through spatial and channel attention. It is trained and validated on an enriched dataset of 10,185 images in three classes (Healthy Leaf, Leaf Rot, Leaf Spot), and Grad-CAM visualizes the regions the model attends to.
Result: On the test set the model achieves 97% precision, 94% recall, a 95% F1 score, and 95.58% accuracy, outperforming conventional pre-trained CNN models. It has only 2.13 million parameters (8.13 MB), suiting lightweight deployment, and Grad-CAM confirms that it focuses on disease-relevant regions.
Conclusion: The proposed lightweight CBAM-CNN is accurate, efficient, and interpretable for betel leaf disease identification, and is suited to real-time detection in the field and to plant disease diagnosis in resource-constrained environments.
Abstract: Betel leaf is an important crop because of its economic advantages and widespread use. Its betel vines are susceptible to a number of illnesses that are commonly referred to as betel leaf disease. Plant diseases are the largest threat to the food supply's security, and they are challenging to identify in time to stop possible financial damage. Interestingly, artificial intelligence can leave a big mark on the betel leaf industry since it helps with output growth by forecasting sickness. This paper presents a lightweight CBAM-CNN model with just 2.13 million parameters (8.13 MB), incorporating CBAM (Convolutional Block Attention Module) to improve feature emphasis without depending on heavy pre-trained networks. The model's capacity to discern minute variations among leaf disease classes is improved by the integrated attention mechanism, which allows it to adaptively focus on significant spatial and channel-wise information. In order to ensure class balance and diversity for efficient model training and validation, this work makes use of an enriched dataset of 10,185 images divided into three categories: Healthy Leaf, Leaf Rot, and Leaf Spot. The proposed model achieved a precision of 97%, a recall of 94%, an F1 score of 95%, and 95.58% accuracy on the test set, demonstrating strong and balanced classification performance and outperforming traditional pre-trained CNN models.
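The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the F1 score from the stated precision and recall, and estimates the model size from the parameter count. The 4-bytes-per-parameter figure (32-bit floats) and the reading of "MB" as 2^20 bytes are assumptions of this sketch; the abstract states neither.

```python
# Reported test-set precision and recall from the abstract.
precision = 0.97
recall = 0.94

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Model size estimate: ASSUMES float32 weights (4 bytes each) and MB = 2**20 bytes.
num_params = 2.13e6
bytes_per_param = 4
size_mib = num_params * bytes_per_param / (1024 ** 2)

print(f"F1 = {f1:.4f}")
print(f"size = {size_mib:.2f} MiB")
```

Under these assumptions both numbers agree with the abstract: the F1 score rounds to 95% and the size rounds to 8.13 MB.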
The model's focus regions were visualized and interpreted using Grad-CAM (Gradient-weighted Class Activation Mapping), an explainable AI technique.
[209] Contrastive Diffusion Guidance for Spatial Inverse Problems
Sattwik Basu,Chaitanya Amballa,Zhongweiyang Xu,Jorge Vančo Sampedro,Srihari Nelakuditi,Romit Roy Choudhury
Main category: cs.CV
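TL;DR: This paper proposes CoGuide, a diffusion-based posterior sampling method that infers a spatial layout (such as a home floorplan) from a user's movement trajectories in an indoor environment. Because the path-planning process is non-invertible and non-differentiable, conventional approaches suffer from unstable optimization. The authors use contrastive learning to build a smooth embedding space and apply a surrogate likelihood score in that space to guide denoising, producing floorplans consistent with the observed trajectories. Experiments show that CoGuide outperforms existing methods in consistency and robustness.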
Details
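Motivation: Reconstructing a spatial layout from user movement trajectories is an ill-posed inverse problem, since many layouts can produce the same trajectories. In addition, the path-planning function is non-invertible and non-differentiable, which destabilizes gradient-based generative methods, so a new way of handling the forward operator in such generative inverse problems is needed.
Method: CoGuide uses a diffusion model as the posterior sampler and reformulates the likelihood score in an embedding space trained with contrastive learning. The contrastive loss pulls matching floorplans and trajectories together and pushes mismatched pairs apart, so that a surrogate likelihood score in this smooth space can stably guide the diffusion model toward layouts consistent with the observed trajectories.
Result: Experiments show that, compared with differentiable-planner baselines and guided-diffusion methods, CoGuide generates floorplans more consistent with the trajectory observations and is more robust across settings.
Conclusion: By introducing a contrastive embedding space and a surrogate likelihood score, CoGuide addresses the optimization difficulties caused by a non-invertible, non-differentiable forward operator, offering a new approach to generative inverse problems, particularly complex mappings such as trajectory-to-layout reconstruction.
Abstract: We consider the inverse problem of reconstructing the spatial layout of a place, a home floorplan for example, from a user's movements inside that layout. Direct inversion is ill-posed since many floorplans can explain the same movement trajectories. We adopt a diffusion-based posterior sampler to generate layouts consistent with the measurements. While active research is in progress on generative inverse solvers, we find that the forward operator in our problem poses new challenges. The path-planning process inside a floorplan is a non-invertible, non-differentiable function, and causes instability while optimizing using the likelihood score. We break away from existing approaches and reformulate the likelihood score in a smoother embedding space. The embedding space is trained with a contrastive loss which brings compatible floorplans and trajectories close to each other, while pushing mismatched pairs far apart. We show that a surrogate form of the likelihood score in this embedding space is a valid approximation of the true likelihood score, making it possible to steer the denoising process towards the posterior. Across extensive experiments, our model CoGuide produces more consistent floorplans from trajectories, and is more robust than differentiable-planner baselines and guided-diffusion methods.
[210] Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation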
Miao Rang,Zhenni Bi,Hang Zhou,Hanting Chen,An Xiao,Tianyu Guo,Kai Han,Xinghao Chen,Yunhe Wang
Main category: cs.CV
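TL;DR: Proposes a systematic post-training pipeline that uses curriculum-based supervised fine-tuning and offline on-policy knowledge distillation to substantially improve small models for edge devices.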
Details
Motivation: Large language models are too large and computationally expensive to deploy in resource-constrained edge environments, so efficient small models are needed.
Method: A post-training pipeline of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation improves small-model performance.
Result: State of the art among billion-parameter models, with strong generalization and competitive accuracy across a variety of tasks.
Conclusion: The approach offers a practical and efficient route to high-performance language models on Ascend edge devices.
Abstract: The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post-training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
[211] DEPTHOR++: Robust Depth Enhancement from a Real-World Lightweight dToF and RGB Guidance
Jijun Xiang,Longliang Liu,Xuan Zhu,Xianqi Wang,Min Lin,Xin Yang
Main category: cs.CV
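TL;DR: This paper proposes DEPTHOR++, a robust depth completion framework for real-world low-quality dToF sensors. Using simulation-based training, parameter-free anomaly detection, and a noise-oriented network design, it outperforms existing methods on multiple real-world datasets.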
Details
Motivation: Existing depth enhancement methods usually assume ideal dToF inputs and perfect RGB alignment, ignoring the calibration errors and anomalies found in practice, which limits real-world use. A robust method that copes with real noise and sensor imperfections is needed.
Method: (1) A simulation method built on synthetic data generates more realistic training samples; (2) an anomaly detection mechanism with no learnable parameters identifies and removes erroneous dToF measurements; (3) a depth completion network tailored to noisy dToF inputs fuses RGB images with pre-trained monocular depth priors to improve depth recovery in challenging regions.
Result: Validated on several real-world datasets including ZJU-L5, Mirror3D-NYU, and Hammer: on ZJU-L5, RMSE and Rel improve by 22% and 11% on average; on Mirror3D-NYU, mirror regions improve on the previous SOTA by 37%; on Hammer, using simulated low-cost dToF data, the method surpasses the RealSense L515 measurements by 22%.
Conclusion: By systematically modeling real dToF noise and anomalies, DEPTHOR++ substantially improves the robustness and accuracy of depth completion under non-ideal conditions, showing that low-cost sensors can achieve high-quality depth perception through algorithmic improvements.
Abstract: Depth enhancement, which converts raw dToF signals into dense depth maps using RGB guidance, is crucial for improving depth perception in high-precision tasks such as 3D reconstruction and SLAM. However, existing methods often assume ideal dToF inputs and perfect dToF-RGB alignment, overlooking calibration errors and anomalies, thus limiting real-world applicability. This work systematically analyzes the noise characteristics of real-world lightweight dToF sensors and proposes a practical and novel depth completion framework, DEPTHOR++, which enhances robustness to noisy dToF inputs from three key aspects. First, we introduce a simulation method based on synthetic datasets to generate realistic training samples for robust model training. Second, we propose a learnable-parameter-free anomaly detection mechanism to identify and remove erroneous dToF measurements, preventing misleading propagation during completion. Third, we design a depth completion network tailored to noisy dToF inputs, which integrates RGB images and pre-trained monocular depth estimation priors to improve depth recovery in challenging regions. On the ZJU-L5 dataset and real-world samples, our training strategy significantly boosts existing depth completion models, with our model achieving state-of-the-art performance, improving RMSE and Rel by 22% and 11% on average.
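For readers unfamiliar with the two metrics, the following is a minimal sketch of how RMSE and Rel (mean absolute relative error) are commonly computed for depth maps. It reflects the usual definitions rather than the authors' evaluation code; treating zero ground-truth depth as "no measurement" is an assumption of this sketch, and the sample values are made up.

```python
import math


def valid_pairs(pred, gt):
    """Keep only pixels with a ground-truth measurement (ASSUMED: gt > 0 means valid)."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0.0]


def rmse(pred, gt):
    """Root mean squared error over valid pixels."""
    pairs = valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def abs_rel(pred, gt):
    """Mean absolute relative error over valid pixels."""
    pairs = valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Made-up flattened depth maps in metres; the third pixel has no ground truth.
gt_depth = [2.0, 4.0, 0.0, 5.0]
pred_depth = [2.2, 3.6, 1.0, 5.0]

print(f"RMSE = {rmse(pred_depth, gt_depth):.4f} m")
print(f"Rel  = {abs_rel(pred_depth, gt_depth):.4f}")
```

Both metrics are errors, so "improving by 22%" means the value drops by that fraction relative to the baseline.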
On the Mirror3D-NYU dataset, by incorporating the anomaly detection method, our model improves upon the previous SOTA by 37% in mirror regions. On the Hammer dataset, using simulated low-cost dToF data from RealSense L515, our method surpasses the L515 measurements with an average gain of 22%, demonstrating its potential to enable low-cost sensors to outperform higher-end devices. Qualitative results across diverse real-world datasets further validate the effectiveness and generalizability of our approach.
[212] Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
Agneet Chatterjee,Rahim Entezari,Maksym Zhuravinskyi,Maksim Lapin,Reshinth Adithyan,Amit Raj,Chitta Baral,Yezhou Yang,Varun Jampani
Main category: cs.CV
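TL;DR: This paper introduces Stable Cinemetrics (SCINE), a structured evaluation framework for professional video generation. It builds hierarchical taxonomies covering setup, events, lighting, and camera control, and uses them to evaluate how well current generative models handle film-grade controls. It finds significant gaps in event and camera control, and proposes an automatic evaluator trained on expert annotations to support scalable evaluation.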
Details
Motivation: Existing video generation models and benchmarks do not adequately reflect the fine-grained control required in professional film production, and there is no systematic evaluation method aligned with cinematic standards.
Method: Four disentangled hierarchical taxonomies (Setup, Event, Lighting, Camera) define 76 fine-grained control nodes. A prompt benchmark aligned with professional use cases is built, with an automated pipeline for prompt categorization and question generation. A large-scale human study covering 10+ models and 20K videos, annotated by 80+ film professionals, is conducted, and a vision-language automatic evaluator aligned with expert annotations is trained.
Result: The human study shows that even the strongest current video generation models have clear deficiencies in event and camera-related controls. The proposed automatic evaluator outperforms existing zero-shot baselines. The work is presented as the first evaluation framework for video generation built around professional filmmaking needs.
Conclusion: Stable Cinemetrics provides a systematic, structured evaluation framework for professional video generation, exposes key gaps in cinematic control, and supports scalable evaluation in future research through an expert-aligned automatic evaluator.
Abstract: Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analyses, both coarse and fine-grained, reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines.
SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
[213] Autoproof: Automated Segmentation Proofreading for Connectomics
Gary B Huang,William M Katz,Stuart Berg,Louis Scheffer
Main category: cs.CV
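TL;DR: This paper proposes training machine learning models on the ground truth produced by manual proofreading in order to automate or optimize proofreading workflows in electron microscopy connectomics, substantially lowering cost and improving the efficiency of connectome reconstruction.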
Details
Motivation: Human proofreading in electron microscopy connectome construction is expensive and has become the main bottleneck for scaling up and for comparative connectomics.
Method: Machine learning models are trained on existing manually annotated ground truth and applied to optimizing guided proofreading workflows and to automatically merging large numbers of segmentation fragments.
Result: Validated on the Drosophila male central nervous system connectome: 90% of the proofreading value is obtained at 80% lower cost; the system automatically merges 200 thousand fragments, equivalent to four proofreader-years of manual work, and raises the connectivity completion rate by 1.3 percentage points.
Conclusion: Machine learning models can substantially reduce the human proofreading burden in connectome construction, improving reconstruction efficiency and scale.
Abstract: Producing connectomes from electron microscopy (EM) images has historically required a great deal of human proofreading effort. This manual annotation cost is the current bottleneck in scaling EM connectomics, for example, in making larger connectome reconstructions feasible, or in enabling comparative connectomics where multiple related reconstructions are produced. In this work, we propose using the available ground-truth data generated by this manual annotation effort to learn a machine learning model to automate or optimize parts of the required proofreading workflows. We validate our approach on a recent complete reconstruction of the Drosophila male central nervous system. We first show our method would allow for obtaining 90% of the value of a guided proofreading workflow while reducing required cost by 80%. We then demonstrate a second application for automatically merging many segmentation fragments to proofread neurons. Our system is able to automatically attach 200 thousand fragments, equivalent to four proofreader years of manual work, increasing the connectivity completion rate of the connectome by 1.3 percentage points.
[214] DiffCamera: Arbitrary Refocusing on Images
Yiyang Wang,Xi Chen,Xiaogang Xu,Yu Liu,Hengshuang Zhao
Main category: cs.CV
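TL;DR: This paper proposes DiffCamera, a diffusion-transformer model that flexibly refocuses an existing image given an arbitrary new focus point and blur level. Because training data are hard to obtain, the authors design a simulation-based pipeline that generates large-scale image pairs with varying focus planes and depth-of-field levels, and add a stacking constraint during training to enforce physically plausible refocusing. Experiments show stable depth-of-field control across a variety of scenes.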
Details
Motivation: The depth-of-field (DoF) effect is fixed once an image is captured and hard to modify, which hurts visual quality when the subject is out of focus, so a way to adjust focus and blur level after the fact is needed.
Method: DiffCamera uses a diffusion transformer framework to learn refocusing. A simulation-based pipeline generates the image pairs needed for training. A stacking constraint, based on the principle that images with different focus planes can be linearly blended into a multi-focus image, enforces physical consistency during training.
Result: The model is validated on a benchmark built by the authors; experiments show that DiffCamera achieves stable and precise DoF control across a wide range of scenes and supports high-quality refocusing.
Conclusion: With simulated data and a physically inspired stacking constraint, DiffCamera refocuses existing images flexibly, accurately, and in a physically plausible way, giving photography and generative AI applications fine control over depth of field.
Abstract: The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable (e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for refocusing learning. However, the training requires pairs of data with different focus planes and bokeh levels in the same scene, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task. This requires a stronger constraint during training. Inspired by the photographic principle that photos of different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint enhances model training by imposing physically grounded refocusing behavior: the focusing results should be faithfully aligned with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model.
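The photographic principle behind the stacking constraint can be illustrated with a toy example. The sketch below blends a focal stack with per-pixel weights and measures how far the blend is from an all-in-focus target. It is only an illustration of linear focus stacking under stated assumptions: the helper names, the weight maps, and the use of mean squared error are hypothetical, and the abstract does not specify how the actual training constraint is formulated.

```python
def blend_focal_stack(images, weights):
    """Blend same-sized grayscale images with per-pixel weights.

    ASSUMES the weights at each pixel sum to 1 across the stack.
    """
    height, width = len(images[0]), len(images[0][0])
    blended = [[0.0] * width for _ in range(height)]
    for image, weight in zip(images, weights):
        for y in range(height):
            for x in range(width):
                blended[y][x] += weight[y][x] * image[y][x]
    return blended


def stacking_loss(blended, all_in_focus):
    """Mean squared error between the blended stack and an all-in-focus target."""
    total, count = 0.0, 0
    for row_b, row_t in zip(blended, all_in_focus):
        for b, t in zip(row_b, row_t):
            total += (b - t) ** 2
            count += 1
    return total / count


# Toy 1x2 scene: the left pixel is sharp in the near-focus shot,
# the right pixel is sharp in the far-focus shot.
near_focus = [[1.0, 0.4]]
far_focus = [[0.6, 0.0]]
near_weight = [[1.0, 0.0]]
far_weight = [[0.0, 1.0]]
all_in_focus = [[1.0, 0.0]]

blended = blend_focal_stack([near_focus, far_focus], [near_weight, far_weight])
loss = stacking_loss(blended, all_in_focus)
print(blended, loss)
```

In this toy case the weights pick the sharp pixel from each shot, so the blend matches the all-in-focus target exactly and the loss is zero; refocused outputs that are inconsistent with the scene would leave a nonzero residual.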
Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.
[215] Video Object Segmentation-Aware Audio Generation
Ilpo Viertola,Vladimir Iashin,Esa Rahtu
Main category: cs.CV
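TL;DR: This paper introduces the new task of video object segmentation-aware audio generation. Its SAGANet model synthesizes sound conditioned on object-level segmentation maps, combining visual segmentation masks with video and text cues to give fine-grained, visually localized audio control. The authors also release Segmented Music Solos, a dataset of instrument performance videos with segmentation information, and report substantial gains over existing methods.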
Details
Motivation: Existing multimodal audio generation models lack precise user control and fall short of professional Foley workflow needs; when targeting a specific object they tend to produce irrelevant background sounds or attend to the wrong object.
Method: SAGANet conditions sound generation on visual segmentation masks together with video and text cues, and the Segmented Music Solos dataset is built to support research on the task.
Result: The method clearly outperforms current state-of-the-art approaches in controllability and audio quality, achieving high-fidelity, controllable Foley synthesis.
Conclusion: SAGANet sets a new standard for segmentation-aware Foley synthesis, offering more precise object-level control over audio generation in professional settings.
Abstract: Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene, generating unnecessary background sounds, or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation-aware audio generation, which explicitly conditions sound synthesis on object-level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine-grained and visually localized control over audio generation. To support this task and further research on segmentation-aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state-of-the-art methods and sets a new standard for controllable, high-fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site
[216] Hy-Facial: Hybrid Feature Extraction by Dimensionality Reduction Methods for Enhanced Facial Expression Classification
Xinjin Li,Yu Ma,Kaisen Ye,Jinghan Cao,Minghao Zhou,Yeyang Zhou
Main category: cs.CV
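TL;DR: This paper proposes Hy-Facial, a hybrid feature extraction framework that combines deep learning with traditional image processing and systematically evaluates dimensionality reduction strategies, improving facial expression classification performance.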
Details
Motivation: Facial expression classification remains challenging because facial image data are high-dimensional and complex, so more effective feature extraction and dimensionality reduction methods are needed.
Method: Deep features from VGG19 are fused with handcrafted local descriptors such as SIFT and ORB, and K-means clustering and UMAP are used for dimensionality reduction.
Result: A classification accuracy of 83.3% on the FER dataset supports the effectiveness of the method.
Conclusion: Dimensionality reduction is not merely a pre-processing step but a key component in improving feature quality and classification performance; UMAP best preserves the structure of the high-dimensional features.
Abstract: Facial expression classification remains a challenging task due to the high dimensionality and inherent complexity of facial image data. This paper presents Hy-Facial, a hybrid feature extraction framework that integrates both deep learning and traditional image processing techniques, complemented by a systematic investigation of dimensionality reduction strategies. The proposed method fuses deep features extracted from the Visual Geometry Group 19-layer network (VGG19) with handcrafted local descriptors from the scale-invariant feature transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) algorithms, to obtain rich and diverse image representations. To mitigate feature redundancy and reduce computational complexity, we conduct a comprehensive evaluation of dimensionality reduction techniques and feature extraction. Among these, UMAP is identified as the most effective, preserving both local and global structures of the high-dimensional feature space. The Hy-Facial pipeline integrates VGG19, SIFT, and ORB for feature extraction, followed by K-means clustering and UMAP for dimensionality reduction, resulting in a classification accuracy of 83.3% on the facial expression recognition (FER) dataset. These findings underscore the pivotal role of dimensionality reduction not only as a pre-processing step but as an essential component in improving feature quality and overall classification performance.
[217] DA$^2$: Depth Anything in Any Direction
Haodong Li,Wangguangdong Zheng,Jing He,Yuhao Liu,Xin Lin,Xin Yang,Ying-Cong Chen,Chunchao Guo
Main category: cs.CV
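TL;DR: This paper proposes DA$^2$, a panoramic depth estimator that delivers accurate, zero-shot generalizable, end-to-end depth estimation in any direction. It builds a large, high-quality panoramic RGB-depth dataset (about 607K pairs) and introduces SphereViT, which uses spherical coordinates to mitigate spherical distortion. DA$^2$ clearly outperforms existing zero-shot methods on multiple datasets, improving AbsRel by 38% on average, surpasses earlier in-domain methods, and is more computationally efficient.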
Details
Motivation: Because panoramic data are scarce and panoramas suffer from spherical distortion, existing methods are mostly confined to specific domains and rely on perspective splitting, leading to poor zero-shot generalization and low efficiency. A more efficient and general end-to-end panoramic depth estimator is needed.
Method: (1) A data curation engine generates high-quality panoramic depth data from perspective images, adding about 543K new samples for a total of about 607K; (2) SphereViT uses spherical coordinates to explicitly model the spherical geometric consistency of panoramic image features, improving feature representation; (3) a fully end-to-end architecture avoids the efficiency loss of multi-view fusion.
Result: On multiple benchmarks DA$^2$ achieves state-of-the-art zero-shot performance, improving average AbsRel by 38% over the strongest zero-shot baseline and even surpassing earlier in-domain methods; as an end-to-end model, its inference is much more efficient than fusion-based methods.
Conclusion: DA$^2$ is an efficient, accurate panoramic depth estimator with strong zero-shot generalization. Large-scale data construction and a sphere-aware model design address data scarcity and spherical distortion.
Abstract: A panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose DA$^{2}$: Depth Anything in Any Direction, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce the spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization.
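To make "spherical coordinates" concrete, the sketch below maps a pixel of an equirectangular panorama to an azimuth angle, a polar angle, and a unit viewing direction. This is the standard equirectangular parameterization, shown as a minimal illustration; how SphereViT actually feeds such coordinates into the network is not described in the abstract, and the function name and axis conventions here are assumptions.

```python
import math


def pixel_to_sphere(u, v, width, height):
    """Map pixel (u, v) of a width x height equirectangular image to the sphere.

    ASSUMED conventions: azimuth in [-pi, pi) grows left to right,
    polar angle in [0, pi] grows top to bottom, pixel centres at +0.5.
    """
    azimuth = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    polar = ((v + 0.5) / height) * math.pi
    direction = (
        math.sin(polar) * math.cos(azimuth),
        math.sin(polar) * math.sin(azimuth),
        math.cos(polar),
    )
    return azimuth, polar, direction


# The centre of an 8x4 panorama looks straight ahead along the x axis.
azimuth, polar, direction = pixel_to_sphere(3.5, 1.5, width=8, height=4)
print(azimuth, polar, direction)
```

Rows near the top and bottom of the image cover far less of the sphere per pixel than rows near the equator, which is the distortion that perspective splitting (e.g., cubemaps) tries to sidestep.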
Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency over fusion-based approaches. Both the code and the curated panoramic data will be released. Project page: https://depth-any-in-any-dir.github.io/.
[218] HART: Human Aligned Reconstruction Transformer
Xiyi Chen,Shaofei Wang,Marko Mihajlovic,Taewon Kang,Sergey Prokudin,Ming Lin
Main category: cs.CV
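TL;DR: HART is a unified framework for sparse-view human reconstruction that, from a small set of uncalibrated RGB images, produces a watertight clothed mesh, an aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering.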
Details
Motivation: Existing methods struggle with loose clothing and human-object interactions and usually rely on simplified camera assumptions, which limits their use in real scenes. HART aims to address these issues and improve the quality of sparse-view human reconstruction. Method: HART predicts per-pixel 3D point maps, normals, and body correspondences, and uses occlusion-aware Poisson reconstruction to recover complete geometry; the predictions are aligned with the parametric SMPL-X model, and the resulting meshes initialize Gaussian splats for novel-view rendering. Result: Trained on only 2.3K synthetic scans, HART achieves state-of-the-art performance across multiple datasets: Chamfer Distance improves by 18-23% for clothed-mesh reconstruction, PA-V2V drops by 6-27% for SMPL-X estimation, and LPIPS decreases by 15-27% for novel-view synthesis. Conclusion: Feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings; HART captures loose clothing and interaction details while keeping the geometry consistent with human structure. Abstract: We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. While trained on only 2.3K synthetic scans, HART achieves state-of-the-art results: Chamfer Distance improves by 18-23 percent for clothed-mesh reconstruction, PA-V2V drops by 6-27 percent for SMPL-X estimation, LPIPS decreases by 15-27 percent for novel-view synthesis on a wide range of datasets. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.
[219] Learning Generalizable Shape Completion with SIM(3) Equivariance
Yuqing Wang,Zhaiyu Chen,Xiao Xiang Zhu
Main category: cs.CV
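TL;DR: This paper proposes the first SIM(3)-equivariant 3D shape completion network. Its modular design canonicalizes features, reasons over geometry, and restores the original coordinates; under an evaluation protocol that removes pose and scale bias, it clearly outperforms existing methods and shows stronger cross-domain generalization.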
Details
Motivation: Existing 3D shape completion methods rely on pre-aligned scans, so models exploit absolute position cues and fail to generalize to real, unaligned data. A robust architecture that is equivariant to the SIM(3) group (similarity transformations) is therefore needed. Method: A SIM(3)-equivariant network architecture with three stages: successively canonicalizing features, reasoning in a similarity-invariant geometric space, and restoring the original coordinate frame. Result: On the PCN benchmark under the de-biased evaluation protocol, the model outperforms both equivariant and data-augmentation baselines; in real cross-domain settings it lowers minimal matching distance on KITTI by 17% and Chamfer distance on OmniObject3D by 14%. Conclusion: Full SIM(3) equivariance is an effective route to truly generalizable 3D shape completion. Abstract: 3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance $\ell_1$ on OmniObject3D by 14%. Perhaps surprisingly, ours under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: https://sime-completion.github.io.
[220] Benchmarking Egocentric Visual-Inertial SLAM at City Scale
Anusha Krishnan,Shaohui Liu,Paul-Edouard Sarlin,Oscar Gentilhomme,David Caruso,Maurizio Monge,Richard Newcombe,Jakob Engel,Marc Pollefeys
Main category: cs.CV
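TL;DR: This paper introduces a new dataset and benchmark for visual-inertial SLAM on egocentric, multi-modal data, using surveying tools to obtain centimeter-accurate pose annotations and addressing the failure of existing benchmarks to reflect real-world challenges.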
Details
Motivation: Existing SLAM benchmarks do not adequately reflect what wearable devices face in real scenes, such as diverse motions, dynamic visual content, and sensor calibration that varies over long sessions, and they lack sufficiently accurate ground-truth poses. Method: Record hours and kilometers of trajectories through a city center with glasses-like devices carrying multiple sensors, and use surveying tools to obtain city-scale control points as indirect pose annotations. Result: Experiments show that current state-of-the-art academic systems are not robust to these challenges, and the components responsible are identified; evaluation tracks of different difficulty levels are also designed to support in-depth analysis. Conclusion: The dataset and benchmark provide a reliable tool for evaluating SLAM systems under extreme conditions and advance SLAM research for wearable devices. Abstract: Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark are available at https://www.lamaria.ethz.ch.
[221] Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
Yuxin Song,Wenkai Dong,Shizun Wang,Qi Zhang,Song Xue,Tao Yuan,Hu Yang,Haocheng Feng,Hang Zhou,Xinyan Xiao,Jingdong Wang
Main category: cs.CV
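TL;DR: This paper proposes Query-Kontext, a new approach that couples a vision-language model (VLM) with a diffusion model through a multimodal "kontext", decoupling multimodal generative reasoning from high-quality synthesis in text-to-image generation and editing.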
Details
Motivation: In current unified multimodal frameworks, multimodal generative reasoning is entangled with high-fidelity synthesis, which limits effective modeling of key capabilities such as instruction understanding, grounding, and image referring. Method: Query-Kontext bridges the VLM and the diffusion model through a multimodal kontext composed of semantic cues and coarse-grained image conditions, and uses a three-stage progressive training strategy: first, a lightweight diffusion head unleashes the VLM's generative reasoning ability; second, the head is scaled to a large pre-trained diffusion model to improve visual quality; finally, a low-level image encoder is introduced and instruction tuning is performed on downstream tasks. A comprehensive data pipeline integrating real, synthetic, and open-source data is also built. Result: Experiments show that the approach matches or exceeds strong unified baselines on multiple tasks and in some cases outperforms task-specific state-of-the-art methods. Conclusion: Query-Kontext successfully decouples multimodal generative reasoning from high-fidelity image synthesis, improves performance on text-to-image generation and editing, and points to a new direction for unified multimodal model design. Abstract: Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal ``kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model's role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM's generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks.
Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
[222] Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Jessica Bader,Mateusz Pach,Maria A. Bravo,Serge Belongie,Zeynep Akata
Main category: cs.CV
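TL;DR: This paper proposes Stitch, a training-free text-to-image generation method that integrates external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically generated bounding boxes, substantially improving the spatial accuracy and visual quality of generated images.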
Details
Motivation: Existing text-to-image models struggle to capture spatial relationships (such as "above" or "to the right of"), and earlier position-control methods are incompatible with modern high-quality architectures. Method: Stitch uses automatically generated bounding boxes; during generation it isolates and cuts out individual objects via targeted attention heads, generates each object within its designated region, and seamlessly stitches them into a complete image, achieving training-free position control. Result: Evaluated on the PosEval benchmark with Qwen-Image, FLUX, and SD3.5, Stitch substantially improves every base model: for example, it improves FLUX by 218% on GenEval's Position task and by 206% on PosEval, and it reaches state of the art with Qwen-Image, surpassing previous models by 54%. Conclusion: Stitch is an effective and general training-free position-control method that markedly improves the spatial accuracy of modern T2I models while preserving visual quality. Abstract: Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free.
Code is available at https://github.com/ExplainableML/Stitch.
[223] TTT3R: 3D Reconstruction as Test-Time Training
Xingyu Chen,Yue Chen,Yuliang Xiu,Andreas Geiger,Anpei Chen
Main category: cs.CV
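TL;DR: This paper proposes TTT3R, a training-free method that improves the length generalization of 3D reconstruction models by introducing an online-learning perspective at test time, substantially improving performance on long sequences.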