cs.CL
[1] Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning
Hong kyu Lee,Ruixuan Liu,Li Xiong
Main category: cs.CL
TL;DR: This paper proposes direct token optimization (DTO), a new self-contained unlearning method for large language models that removes the influence of training data without relying on external resources while preserving model performance.
Details
Motivation: Existing LLM unlearning methods usually depend on auxiliary models or external services, which is impractical and poses privacy risks, so a self-contained and efficient unlearning method is needed.
Method: The method identifies target tokens and non-target tokens in the sequence to be unlearned. Target tokens are used to optimize the unlearning objective and non-target tokens are used to preserve model utility, so optimization happens directly at the token level.
Result: Experiments show that DTO improves forget quality by up to 16.8× over the latest baselines on several benchmark datasets while maintaining comparable model utility.
Conclusion: DTO is an efficient, self-contained LLM unlearning method that achieves high-quality forgetting and maintains model performance without relying on external resources.
Abstract: Machine unlearning is an emerging technique that removes the influence of a subset of training data (forget set) from a model without full retraining, with applications including privacy protection, content moderation, and model correction. The key challenge lies in ensuring that the model completely forgets the knowledge of the forget set without compromising its overall utility. Existing unlearning methods for large language models (LLMs) often utilize auxiliary language models, retain datasets, or even commercial AI services for effective unlearning and maintaining the model utility. However, dependence on these external resources is often impractical and could potentially introduce additional privacy risks. In this work, we propose direct token optimization (DTO), a novel self-contained unlearning approach for LLMs that directly optimizes token-level objectives and eliminates the need for external resources. Given a sequence to unlearn, we identify two categories of tokens: target tokens, which capture critical knowledge for unlearning, and the remaining non-target tokens, which are crucial for maintaining the model utility. The former are used to optimize the unlearning objective, while the latter serve to preserve the model's performance. The experimental results show that the proposed DTO achieves up to a 16.8$\times$ improvement in forget quality over the latest baselines on several benchmark datasets while maintaining a comparable level of model utility.
[2] TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding
Kimihiro Hasegawa,Wiradee Imrattanatrai,Masaki Asada,Ken Fukuda,Teruko Mitamura
Main category: cs.CL
TL;DR: This paper proposes TAMA, a tool-augmented multimodal agent framework for procedural activity understanding that achieves interleaved multimodal reasoning through training-free multimedia-returning tools, and shows on the ProMQA-Assembly dataset that it improves the performance of vision-language models.
Details
Motivation: Procedural activity assistants have broad potential applications, but system development for such assistants is still at an exploratory stage, and existing methods fall short in multimodal reasoning and tool use.
Method: The TAMA framework combines multimedia-returning tools with agentic, flexible tool selection to achieve interleaved multimodal reasoning without any training.
Result: Experiments on the ProMQA-Assembly dataset show that TAMA improves the performance of vision-language models such as GPT-5 and MiMo-VL, and ablation studies support the effectiveness of multimedia-returning tools and flexible tool selection.
Conclusion: TAMA advances the "thinking with images" paradigm for video and multimodal tasks and offers new ideas and technical support for the development of procedural activity assistants.
Abstract: Procedural activity assistants potentially support humans in a variety of settings, from our daily lives, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored for such an assistant is still underexplored. In this paper, we propose a novel framework, called TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset, ProMQA-Assembly, show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework, multimedia-returning tools and agentic flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking with images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.
[3] DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi,Tianyi Chen,Miguel Muñoz-Mármol,Curtis Fox,Amrutha Varshini Ramesh,Étienne Marcotte,Xing Han Lù,Nicolas Chapados,Spandana Gella,Christopher Pal,Alexandre Drouin,Issam H. Laradji
Main category: cs.CL
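TL;DR: This paper introduces DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings, covering multi-step queries and mixed information sources (such as the public web and private knowledge bases), and releases 15 cross-domain tasks to advance enterprise deep research.
Details
Motivation: Existing benchmarks mostly focus on simple questions or web-only queries and cannot effectively evaluate how AI agents handle complex, multi-step research tasks in real enterprise scenarios, so a benchmark closer to practical use is needed.
Method: The authors design a task synthesis pipeline grounded in realistic user personas and enterprise context, combined with human verification. It builds a heterogeneous search space covering multiple data sources (such as emails, chat logs, and cloud files) and evaluates agents on recall of relevant insights, factual accuracy, and report structure.
Result: The release contains 15 deep research tasks across 10 domains, including Sales, Cybersecurity, and Compliance. Agents built on different models (such as GPT, Llama, and Qwen) and strategies are evaluated, revealing their strengths and weaknesses in enterprise deep research.
Conclusion: DRBench can effectively evaluate AI agents on complex enterprise research tasks and provides an important benchmark and direction for improving AI-driven enterprise decision support systems.
Abstract: We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.
[4] PrimeX: A Dataset of Worldview, Opinion, and Explanation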
Rik Koncel-Kedziorski,Brihi Joshi,Tim Paek
Main category: cs.CL
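TL;DR: This paper presents PrimeX, a dataset that combines public opinion survey data, opinion explanations, and worldview beliefs, and explores how an individual's belief system can improve personalized alignment of language models.
Details
Motivation: To improve how well language models align with individual users, the paper studies how an individual's belief system (such as opinion explanations and worldview) can strengthen personalization.
Method: The authors build the PrimeX dataset from 858 US residents, combining public opinion survey data, respondents' written explanations of their own opinions, and the Primal World Belief survey. They assess the value of these belief sources by analyzing their role in opinion prediction.
Result: Experiments show that adding belief explanations and worldview information improves language model performance on opinion prediction, indicating substantial value for personalized modeling.
Conclusion: PrimeX opens a new avenue for research that combines psychology and NLP, and shows that deeper individual belief information helps achieve better personalized alignment of language models.
Abstract: As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual's belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.
[5] Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It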
Shuyue Stella Li,Avinandan Bose,Faeze Brahman,Simon Shaolei Du,Pang Wei Koh,Maryam Fazel,Yulia Tsvetkov
Main category: cs.CL
TL;DR: This paper introduces PREFDISCO, an evaluation methodology that measures how well large language models perform personalized reasoning based on user preferences when no prior information is available, and exposes the limitations of current models in producing personalized responses.
Details
Motivation: Current LLM development separates task solving from preference alignment, which makes it hard to meet individual needs in just-in-time scenarios with no user history. A personalized reasoning mechanism that actively elicits and adapts to user preferences is therefore needed.
Method: The PREFDISCO framework turns static benchmarks into interactive personalization tasks using psychologically grounded personas with sparse preferences, and evaluates how well models adjust their reasoning chains and response strategies for different user contexts.
Result: An evaluation of 21 frontier models across 10 tasks shows that 29.0% of naive personalization attempts produce worse preference alignment than generic responses, while generic responses also fail to serve individual needs effectively.
Conclusion: Personalized reasoning requires dedicated design and training and does not emerge naturally. PREFDISCO provides a quantifiable research foundation for measuring and advancing this capability.
Abstract: Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
[6] BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
Xin Xu,Xunzhi He,Churan Zhi,Ruizhe Chen,Julian McAuley,Zexue He
Main category: cs.CL
TL;DR: This paper introduces BiasFreeBench, an empirical benchmark for consistently evaluating debiasing methods for large language models, together with a response-level fairness metric, Bias-Free Score, to bridge the gap between existing evaluations and real-world use.
Details
Motivation: Evaluations of existing debiasing methods are inconsistent and mostly rely on model probabilities rather than actual responses, so they poorly reflect fairness and safety needs in real use.
Method: The authors build BiasFreeBench in a unified query-response format, integrating eight mainstream debiasing methods (four prompting-based and four training-based), evaluate them in two scenarios (multi-choice QA and open-ended multi-turn QA), and propose Bias-Free Score as a response-level metric.
Result: The study systematically compares debiasing methods across the prompting vs. training paradigm, model size, and generalization to unseen bias types. It finds that some methods do better in specific settings and that response-level evaluation better reflects practical effectiveness.
Conclusion: BiasFreeBench helps establish a unified testbed for debiasing research and promotes fairness evaluation that is closer to real-world use.
Abstract: Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs' probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
[7] TASER: Translation Assessment via Systematic Evaluation and Reasoning
Monishwaran Maheswaran,Marco Carini,Christian Federmann,Tony Diaz
Main category: cs.CL
TL;DR: TASER is a translation quality metric based on Large Reasoning Models (LRMs) that uses structured prompting for systematic, step-by-step evaluation and reaches state-of-the-art performance in both reference-based and reference-free settings.
Details
Motivation: Existing automatic translation metrics lack interpretability and transparency and are limited in accuracy, so a more reliable and interpretable evaluation method is needed.
Method: TASER uses the explicit reasoning capabilities of LRMs with structured prompting templates to evaluate translation quality systematically, and is validated at the system level and segment level on the WMT24 Metrics Shared Task.
Result: At the system level, TASER achieves the highest soft pairwise accuracy and outperforms all existing metrics. At the segment level, its reference-free variant ranks first among reference-free approaches. Experiments also show that structured prompts outperform open-ended prompts and shed light on how reasoning depth relates to evaluation quality.
Conclusion: Large Reasoning Models show a measurable advance in translation quality assessment. TASER combines higher accuracy with an interpretable evaluation process and points to a new direction for automated evaluation.
Abstract: We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.
[8] Retrieval-Augmented Generation for Electrocardiogram-Language Models
Xiaoyu Song,William Han,Tony Chen,Chaojing Duan,Michael A. Rosenberg,Emerson Liu,Ding Zhao
Main category: cs.CL
TL;DR: This paper presents the first open-source retrieval-augmented generation (RAG) pipeline for generative Electrocardiogram-Language Models (ELMs), shows experimentally that RAG improves natural language generation, and provides baselines and ablation studies.
Details
Motivation: RAG is widely used with large language models to reduce hallucinations and improve generation quality, but no systematic study or open-source implementation exists for ELMs. This paper aims to fill that gap.
Method: The authors design and implement the first open-source RAG pipeline for ELMs, run experiments on three public datasets, and provide several baselines and ablation analyses to assess the effect of RAG on ELM natural language generation.
Result: ELMs with RAG consistently outperform non-RAG baselines across the datasets, and the experiments highlight key ELM design considerations.
Conclusion: RAG effectively strengthens the generation ability of ELMs, and the open-source framework provides a foundation for future ELM research.
Abstract: Interest in generative Electrocardiogram-Language Models (ELMs) is growing, as they can produce textual responses conditioned on ECG signals and textual queries. Unlike traditional classifiers that output label probabilities, ELMs are more versatile, supporting domain-specific tasks (e.g., waveform analysis, diagnosis, prognosis) as well as general tasks (e.g., open-ended questions, dialogue). Retrieval-Augmented Generation (RAG), widely used in Large Language Models (LLMs) to ground LLM outputs in retrieved knowledge, helps reduce hallucinations and improve natural language generation (NLG). However, despite its promise, no open-source implementation or systematic study of RAG pipeline design for ELMs currently exists. To address this gap, we present the first open-source RAG pipeline for ELMs, along with baselines and ablation studies for NLG. Experiments on three public datasets show that ELMs with RAG consistently improve performance over non-RAG baselines and highlight key ELM design considerations. Our code is available at: https://github.com/willxxy/ECG-Bench.
[9] Judging with Confidence: Calibrating Autoraters to Preference Distributions
Zhuohang Li,Xiaowei Li,Chengyu Huang,Guowang Li,Katayoon Goshvadi,Bo Dai,Dale Schuurmans,Paul Zhou,Hamid Palangi,Yiwen Song,Palash Goyal,Murat Kantarcioglu,Bradley A. Malin,Yuan Xue
Main category: cs.CL
TL;DR: The paper proposes a general framework for calibrating probabilistic autoraters so that they better match the preference distribution of a target population, improving the reliability of aligning large language models with human values.
Details
Motivation: Existing autoraters are trained on discrete preference labels, which limits them on subjective, ambiguous, or nuanced tasks. Modeling the full preference distribution is needed to improve reliability.
Method: Two learning methods are proposed: direct supervised fine-tuning for dense, probabilistic labels, and a reinforcement learning approach for sparse, binary labels.
Result: Autoraters fine-tuned with a distribution-matching objective produce probability predictions that are closer to the target preference distribution, with better calibration and significantly lower positional bias, while preserving performance on objective tasks.
Conclusion: Modeling the full preference distribution of the target population can significantly improve the reliability and fairness of autoraters used for value alignment.
Abstract: The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
[10] Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction
Zhexiong Liu,Diane Litman
Main category: cs.CL
TL;DR: This paper proposes IR-Tuning, a parameter-efficient fine-tuning framework that improves large language model performance on text revision classification, especially when annotated data is scarce.
Details
Motivation: Large language models excel at generation but perform poorly on text classification tasks that require fine-grained understanding, such as text revision. Annotated data is scarce and conventional fine-tuning is expensive.
Method: IR-Tuning dynamically analyzes the gradient norm distribution of each layer, selects the important layers for fine-tuning, and freezes the redundant layers, giving layer-wise parameter-efficient fine-tuning.
Result: Experiments indicate that IR-Tuning outperforms other layer-wise PEFT methods across a range of text revision tasks, with faster convergence and lower GPU memory consumption, and remains effective on small datasets.
Conclusion: IR-Tuning is an efficient, low-resource fine-tuning method that improves LLM performance on fine-grained text classification and is well suited to practical settings with limited annotated data.
Abstract: Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
[11] SafePassage: High-Fidelity Information Extraction with Black Box LLMs
Joe Barrow,Raj Patel,Misha Kharkovski,Ben Davies,Ryan Schmitt
Main category: cs.CL
TL;DR: This paper proposes SafePassage, a three-step pipeline that reduces hallucinations when large language models perform information extraction, making results more trustworthy and more consistent with human judgments.
Details
Motivation: Black box large language models cannot guarantee that extracted information is grounded in the source document, which leads to hallucinations. A trustworthy mechanism is needed to ensure that extracted content is reliable.
Method: The SafePassage pipeline has three steps: (1) an LLM extractor that generates entities and their contexts, (2) a string-based global aligner, and (3) a scoring model that flags unsafe passages.
Result: The method reduces hallucinations on information extraction tasks by up to 85% with a low false-positive rate and agrees closely with human judgments. In addition, a small fine-tuned transformer encoder outperforms an LLM scoring model at detecting unsafe passages.
Conclusion: SafePassage improves the reliability and evaluability of LLM-based information extraction and offers a practical trust mechanism for real applications.
Abstract: Black box large language models (LLMs) make information extraction (IE) easy to configure, but hard to trust. Unlike traditional information extraction pipelines, the information "extracted" is not guaranteed to be grounded in the document. To prevent this, this paper introduces the notion of a "safe passage": context generated by the LLM that is both grounded in the document and consistent with the extracted information. This is operationalized via a three-step pipeline, SafePassage, which consists of: (1) an LLM extractor that generates structured entities and their contexts from a document, (2) a string-based global aligner, and (3) a scoring model. Results show that using these three parts in conjunction reduces hallucinations by up to 85% on information extraction tasks with minimal risk of flagging non-hallucinations. High agreement between the SafePassage pipeline and human judgments of extraction quality means that the pipeline can also be used to evaluate LLMs. Surprisingly, results also show that using a transformer encoder fine-tuned on a small number of task-specific examples can outperform an LLM scoring model at flagging unsafe passages. These annotations can be collected in as little as 1-2 hours.
[12] ReEvalMed: Rethinking Medical Report Evaluation by Aligning Metrics with Real-World Clinical Judgment
Ruochen Li,Jun Li,Bailiang Jian,Kun Yuan,Youxiang Zhu
Main category: cs.CL
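TL;DR: The paper proposes a clinically grounded Meta-Evaluation framework for assessing metrics used in radiology report generation, revealing how existing metrics fall short in understanding clinical semantics.
Details
Motivation: Reports can score highly on existing automatic metrics and still fail to earn clinicians' trust, which points to fundamental flaws in how these metrics assess the quality of generated reports.
Method: The authors design a Meta-Evaluation framework covering clinical alignment and key metric capabilities (discrimination, robustness, and monotonicity), and systematically evaluate existing metrics on a fine-grained annotated dataset that includes error types, clinical significance labels, and explanations.
Result: Existing metrics struggle to identify clinically significant errors, over-penalize harmless variations, and lack consistency across error severity levels.
Conclusion: The framework offers guidance for building evaluation methods that are more reliable and better matched to clinical needs.
Abstract: Automatically generated radiology reports often receive high scores from existing evaluation metrics but fail to earn clinicians' trust. This gap reveals fundamental flaws in how current metrics assess the quality of generated reports. We rethink the design and evaluation of these metrics and propose a clinically grounded Meta-Evaluation framework. We define clinically grounded criteria spanning clinical alignment and key metric capabilities, including discrimination, robustness, and monotonicity. Using a fine-grained dataset of ground truth and rewritten report pairs annotated with error types, clinical significance labels, and explanations, we systematically evaluate existing metrics and reveal their limitations in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error severity levels. Our framework offers guidance for building more clinically reliable evaluation methods.
[13] o-MEGA: Optimized Methods for Explanation Generation and Analysis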
Ľuboš Kriš,Jaroslav Kopčan,Qiwei Peng,Andrej Ridzik,Marcel Veselý,Martin Tamajka
Main category: cs.CL
TL;DR: The paper presents o-mega, a hyperparameter optimization tool that automatically selects the most effective explainable AI methods and their configurations in the semantic matching domain, improving the transparency and trustworthiness of automated fact-checking systems.
Details
Motivation: Building explainable systems on top of transformer language models is difficult, particularly when choosing the best option among a large number of explanation methods and evaluation metrics.
Method: The authors design and implement o-mega, which systematically explores explainability methods and their hyperparameters, and evaluate it in a post-claim matching pipeline on a dataset of social media posts paired with refuting claims.
Result: o-mega identifies effective explanation method configurations and improves model interpretability and transparency, showing promise in critical applications such as misinformation detection.
Conclusion: Automated optimization of explanation methods helps build more trustworthy and transparent AI systems and provides a practical tool and a new direction for explainability research in NLP.
Abstract: The proliferation of transformer-based language models has revolutionized the NLP domain while simultaneously introducing significant challenges regarding model transparency and trustworthiness. The complexity of achieving explainable systems in this domain is evidenced by the extensive array of explanation methods and evaluation metrics developed by researchers. To address the challenge of selecting optimal explainability approaches, we present o-mega, a hyperparameter optimization tool designed to automatically identify the most effective explainable AI methods and their configurations within the semantic matching domain. We evaluate o-mega on a post-claim matching pipeline using a curated dataset of social media posts paired with refuting claims. Our tool systematically explores different explainable methods and their hyperparameters, demonstrating improved transparency in automated fact-checking systems. As a result, such automated optimization of explanation methods can significantly enhance the interpretability of claim-matching models in critical applications such as misinformation detection, contributing to more trustworthy and transparent AI systems.
[14] CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage
Bowen Wei,Yuan Shen Tay,Howard Liu,Jinhao Pan,Kun Luo,Ziwei Zhu,Chris Jordan
Main category: cs.CL
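TL;DR: The paper proposes CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate to analyze behavior, gather evidence, and synthesize auditable decisions, substantially reducing false positives and improving investigation quality.
Details
Motivation: Security Operations Centers (SOCs) face a flood of daily alerts, of which only a small fraction are real attacks, causing alert fatigue and missed threats. Classical detection methods lack context, while existing LLM-based approaches rely on a single model for the whole task, which performs poorly on noisy data and offers little transparency.
Method: CORTEX is a multi-agent LLM system with a behavior-analysis agent, evidence-gathering agents, and a reasoning agent that work together over real evidence. The authors also release a dataset from production environments, containing detailed SOC investigation steps and tool outputs, for training and evaluation.
Result: Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality compared with state-of-the-art single-agent LLMs.
Conclusion: A multi-agent LLM architecture outperforms single-agent approaches on complex, noisy security alert triage, offering greater robustness and transparency and showing potential for practical deployment.
Abstract: Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context-poor, while recent LLM-based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end-to-end -- an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.
[15] TokMem: Tokenized Procedural Memory for Large Language Models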
Zijun Wu,Yongchang Hao,Lili Mou
Main category: cs.CL
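TL;DR: This paper introduces TokMem, a tokenized procedural memory for large language models that stores recurring procedures as compact, trainable embeddings, avoiding the repeated context overhead of conventional prompt engineering and enabling efficient, modular task handling.
Details
Motivation: Large language models rely on prompts to perform tasks, recall knowledge, and guide reasoning, but this is inefficient: prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse.
Method: TokMem stores procedures in trainable memory tokens. Each token encodes an address to a procedure and a control signal that steers generation. The backbone model stays frozen, which supports continual adaptation and steers behavior with constant-size overhead.
Result: On 1,000 atomic recall tasks and on function-calling tasks for compositional recall, TokMem consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and outperforms fine-tuning with far fewer parameters.
Conclusion: TokMem offers a scalable, modular alternative to prompt engineering and fine-tuning and gives large language models an explicit procedural memory.
Abstract: Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall, and on function-calling tasks for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and outperforms fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.
[16] LongCodeZip: Compress Long Context for Code Language Models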
Yuling Shi,Yichun Qian,Hongyu Zhang,Beijun Shen,Xiaodong Gu
Main category: cs.CL
TL;DR: This paper proposes LongCodeZip, a two-stage context compression framework designed for code LLMs that removes redundant information from long contexts while preserving task performance, reaching up to a 5.6x compression ratio.
Details
Motivation: Existing context pruning methods are mostly designed for general text and ignore code-specific structures and dependencies, so they perform poorly on code tasks. Long contexts also bring high API costs and generation latency, so efficient code-specific compression is needed.
Method: The framework uses a dual-stage strategy. The first stage performs coarse-grained, function-level compression using perplexity conditioned on the instruction and keeps the most relevant functions. The second stage performs fine-grained, block-level compression of the retained functions based on perplexity and selects an optimal subset under an adaptive token budget.
Result: Evaluations on code completion, summarization, and question answering show that LongCodeZip achieves up to a 5.6x compression ratio without degrading task performance, clearly outperforming general-purpose methods such as LLMLingua.
Conclusion: LongCodeZip effectively reduces code context size while preserving key information, helping LLMs scale to large real-world code scenarios and improving the efficiency and capability of code intelligence applications.
Abstract: Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
[17] Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews
Koki Ryu,Hitomi Yanaka
Main category: cs.CL
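TL;DR: This study examines how off-the-shelf large language models (LLMs) perform on Likert-scale rating prediction. It finds that user-written reviews significantly improve prediction, reaching performance comparable to traditional methods such as matrix factorization, and that prompting the model to generate a hypothetical review improves performance further.
Details
Motivation: Personalizing LLM outputs is an active research area, but prior work has focused on classification or ranking and has not adequately explored Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning. The potential of off-the-shelf LLMs for this industrially relevant task remains largely untapped.
Method: The authors run experiments with eight off-the-shelf LLMs on three datasets, compare how different in-context information (such as user reviews and general preference descriptions) affects rating prediction, and test hypothetical review generation as a prompting strategy.
Result: User-written reviews significantly improve LLM rating prediction, with performance close to traditional methods such as matrix factorization. Reviews of concrete items are more effective than general preference descriptions. Prompting the LLM to first generate a hypothetical review improves prediction further.
Conclusion: Off-the-shelf LLMs show strong potential for rating prediction, particularly as a solution to the cold-start problem, and well-designed prompting strategies such as hypothetical review generation can improve performance further.
Abstract: Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at https://github.com/ynklab/rating-prediction-with-reviews.
[18] Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains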
Yawen Xue,Masaya Tsunokake,Yuta Koreeda,Ekant Muljibhai Amin,Takashi Sumiyoshi,Yasuhiro Sogawa
Main category: cs.CL
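TL;DR: This paper explores agent fine-tuning for domain adaptation in Hitachi's JP1 middleware, a microdomain for specialized IT operations. Using datasets built from domain manuals and LLM-generated reasoning trajectories, combined with retrieval-augmented generation and a context-answer extractor, the method improves performance on JP1 certification exam questions by 14% over the base model.
Details
Motivation: Agent methods based on in-context learning produce lengthy inputs and high computational costs when handling highly specialized microdomain tasks, and fine-tuning research on general domains does not transfer directly to technical microdomains, so agent fine-tuning tailored to a specific microdomain needs to be explored.
Method: LLMs are fine-tuned on domain-specific datasets that include JP1 manual content and reasoning trajectories distilled from LLMs themselves. At inference time, an agentic prompt with retrieval-augmented generation (RAG) is used, and a context-answer extractor is introduced to improve information relevance.
Result: On JP1 certification exam questions, the proposed method improves performance by 14% over the base model, with better decision-making accuracy and search efficiency.
Conclusion: Agent fine-tuning combined with retrieval augmentation and context extraction effectively improves LLM reasoning in complex technical microdomains, demonstrating its potential for domain-specific applications.
Abstract: Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi-step reasoning tasks. Most approaches leverage these capabilities via in-context learning with few-shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine-tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain-specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine-tuning for domain adaptation within Hitachi's JP1 middleware, a microdomain for specialized IT operations. We fine-tuned LLMs using JP1-specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision-making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval-augmented generation and introduced a context-answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.
[19] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations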
Pengzhou Cheng,Lingzhong Dong,Zeng Wu,Zongru Wu,Xiangru Tang,Chengwei Qin,Zhuosheng Zhang,Gongshen Liu
Main category: cs.CL
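TL;DR: Proposes Agent-ScanKit, a framework that uses three guided probing paradigms to assess the memory and reasoning capabilities of multimodal agents on GUI tasks. It finds that existing models rely mostly on mechanical memorization rather than systematic reasoning, underscoring the need to improve reliability in real-world scenarios.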
Details
Motivation: Existing multimodal agents have limited reliability on complex or out-of-domain tasks, which raises the question of whether they reason spuriously.
Method: Design three orthogonal probing paradigms (visual-guided, text-guided, and structure-guided) and build the Agent-ScanKit framework to quantify the contributions of memorization and reasoning without access to model internals.
Result: Experiments on 5 public GUI benchmarks and 18 models show that mechanical memorization often outweighs systematic reasoning; most models rely mainly on retrieving training knowledge and generalize poorly.
Conclusion: Current multimodal agents lack sufficient reasoning ability; robust reasoning modeling is needed to improve reliability in practical use.
Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
[20] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
Xingjian Zhao,Zhe Xu,Luozhijie Jin,Yang Wang,Hanfu Chen,Yaozhou Jiang,Ke Chen,Ruixiao Li,Mingshu Chen,Ruiming Wang,Wenbo Zhang,Yiyang Zhang,Donghua Yu,Yang Gao,Xiaogui Yang,Yitian Gong,Yuanfan Xu,Qinyuan Cheng,Zhaoye Fei,Shimin Li,Yaqian Zhou,Xuanjing Huang,Xipeng Qiu
Main category: cs.CL
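TL;DR: MOSS-Speech is a true end-to-end speech-to-speech large language model that understands and generates speech directly, without a text intermediary. It achieves state-of-the-art performance in spoken question answering and spoken interaction while preserving text performance.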
Details
Motivation: Traditional spoken dialogue systems rely on cascaded pipelines that discard paralinguistic cues and limit expressivity; existing end-to-end methods still depend on text intermediates, which creates a bottleneck.
Method: Propose MOSS-Speech, which combines a modality-based layer-splitting architecture with a frozen pre-training strategy, retaining the reasoning and knowledge of a pretrained text LLM while giving it native speech capabilities.
Result: Achieves state-of-the-art results on spoken question answering, speech-to-speech performance comparable to existing text-guided systems, and competitive text performance.
Conclusion: MOSS-Speech narrows the gap between text-guided and direct speech generation and establishes a new paradigm for efficient, expressive end-to-end speech interaction.
Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
[21] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Yurun Chen,Xavier Hu,Yuhan Liu,Ziqi Wang,Zeyi Liao,Lin Chen,Feng Wei,Yuxi Qian,Bo Zheng,Keting Yin,Shengyu Zhang
Main category: cs.CL
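TL;DR: Proposes Graph2Eval, a knowledge-graph-based framework that automatically generates multimodal document comprehension and web interaction tasks to comprehensively evaluate agents' reasoning, collaboration, and interaction capabilities.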
Details
Motivation: Static datasets cannot adequately assess the true capabilities of multimodal LLM agents in dynamic environments, and existing synthetic-data methods do not suit agent tasks that require tool use and interaction.
Method: Build knowledge graphs from multi-source external data, translate semantic relations into structured multimodal tasks via subgraph sampling, task templates, and meta-paths, and apply a multi-stage filtering pipeline to ensure task quality and executability.
Result: Supports end-to-end evaluation of single-agent, multi-agent, and web-agent settings and yields Graph2Eval-Bench, a dataset of 1,319 tasks; experiments show that it effectively differentiates agent performance and reveals gaps in reasoning, collaboration, and web interaction.
Conclusion: Graph2Eval offers a new perspective on agent evaluation, supporting diverse, dynamic, and scalable task generation and comprehensive capability assessment.
Abstract: As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.
[22] Copy-Paste to Mitigate Large Language Model Hallucinations
Yongchao Long,Xian Wu,Yingying Zhang,Xianbin Wen,Yuxi Zhou,Shenda Hong
Main category: cs.CL
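TL;DR: This paper proposes CopyPasteLLM, which improves contextual faithfulness in retrieval-augmented generation (RAG) through two-stage high-copying response preference training, significantly reducing hallucinations and improving accuracy.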
Details
Motivation: In RAG, LLMs may not fully trust the provided context, which produces hallucinations and undermines reliability, so models' trust in and faithfulness to the context need to improve.
Method: Design three prompting methods to increase the copying degree of responses, generate high-copying preference data through an automated pipeline, and train CopyPasteLLM in two stages; propose the Context-Parameter Copying Capturing algorithm to analyze its effectiveness.
Result: On FaithEval, ConFiQA, and PubMedQA, CopyPasteLLM performs best in both counterfactual and original contexts, improving accuracy on FaithEval by 12.2% to 24.5% over the best baseline while using only 365 training samples (1/50 of the baseline data).
Conclusion: CopyPasteLLM effectively improves faithfulness to context and reduces hallucinations while greatly lowering training-data requirements; its mechanism is a recalibration of reliance on internal parametric knowledge versus external knowledge.
Abstract: While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting that higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose CopyPasteLLM, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples -- 1/50th of baseline data. To elucidate CopyPasteLLM's effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All code is available at https://github.com/longyongchao/CopyPasteLLM
[23] JoyAgent-JDGenie: Technical Report on the GAIA
Jiarun Liu,Shiyue Xu,Shangkun Liu,Yang Li,Wen Liu,Min Liu,Xiaoqing Zhou,Hanmin Wang,Shilin Jia,zhen Wang,Shaohua Tian,Hanhao Li,Junbo Zhang,Yongli Yu,Peng Cao,Haofen Wang
Main category: cs.CL
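TL;DR: Proposes a generalist agent architecture that integrates a multi-agent framework, a hierarchical memory system, and a refined tool suite; on a comprehensive benchmark it outperforms open-source baselines and approaches the performance of proprietary systems.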
Details
Motivation: Existing LLM agent systems mostly focus on isolated improvements and lack a unified design for robustness and adaptability.
Method: Build a system architecture comprising a multi-agent collaboration framework (planning and execution agents with critic-model voting), a hierarchical memory system (working, semantic, and procedural memory), and an enhanced tool suite (search, code execution, multimodal parsing).
Result: Consistently outperforms open-source baselines on a comprehensive benchmark and approaches the performance of proprietary systems.
Conclusion: System-level integration is key to improving agent scalability, robustness, and adaptability, and offers a viable path toward AI assistants for cross-domain tasks.
Abstract: Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi-agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open-source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system-level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.
[24] EuroSpeech: A Multilingual Speech Corpus
Samuel Pfisterer,Florian Grötschla,Luca A. Lanzendörfer,Florian Yan,Roger Wattenhofer
Main category: cs.CL
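TL;DR: Proposes a scalable pipeline for building large-scale multilingual speech datasets from parliamentary recordings, significantly improving speech recognition performance across languages.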
Details
Motivation: Existing multilingual speech datasets contain too little data for most languages, so models perform poorly on the majority of supported languages.
Method: Design a scalable pipeline with media retrieval and a two-stage alignment algorithm that handles non-verbatim transcripts and long-form audio, and apply it to recordings from 22 European parliaments.
Result: Extracts over 61,000 hours of aligned speech segments from recordings of 22 European parliaments, with 19 languages exceeding 1,000 hours and 22 languages exceeding 500 hours; fine-tuning an existing ASR model on the dataset reduces average word error rate by 41.8% relative to baselines.
Conclusion: The approach builds high-quality, high-coverage multilingual speech datasets and significantly improves speech recognition for low-resource languages.
Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
[25] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum
Gaotang Li,Ruizhong Qiu,Xiusi Chen,Heng Ji,Hanghang Tong
Main category: cs.CL
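TL;DR: This paper studies the limitations of the conventional negative log likelihood (NLL) objective in post-training of large language models and proposes choosing better probability-based objectives according to the model-capability continuum.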
Details
Motivation: Standard supervised fine-tuning uses the NLL objective, which may be suboptimal in post-training because models already encode prior knowledge and labels may be noisy, so more suitable objectives need to be explored.
Method: Propose and analyze a family of probability-based objectives and, through experiments and ablations across 7 models, 14 benchmarks, and 3 domains, identify the model-capability continuum as the key dimension.
Result: Near the model-strong end, objectives that favor high-probability tokens (e.g., $-p$, $-p^{10}$) outperform NLL; at the model-weak end, NLL is better; in the intermediate region, no single objective is best.
Conclusion: The choice of objective should adapt to model capability; the work provides a principled basis for adaptive objective design.
Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
[26] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness
Kung-Hsiang Huang,Haoyi Qiu,Yutong Dai,Caiming Xiong,Chien-Sheng Wu
Main category: cs.CL
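TL;DR: This paper proposes GUI-KV, a KV cache compression method for GUI agents that exploits spatial and temporal redundancy in graphical user interfaces, significantly reducing computational overhead and improving task accuracy.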
Details
Motivation: Existing KV cache compression methods do not account for the spatial and temporal redundancy of GUI image sequences and are therefore inefficient, so a more efficient compression scheme suited to GUI scenarios is needed.
Method: Analyze attention patterns in GUI agent workloads and find high sparsity across all transformer layers; based on this, propose a uniform budget allocation strategy and design GUI-KV, which combines spatial saliency guidance and temporal redundancy scoring to optimize KV cache compression.
Result: Across standard GUI agent benchmarks and models, GUI-KV outperforms existing compression methods and closely matches full-cache accuracy; in a 5-screenshot setting on AgentNetBench, it reduces decoding FLOPs by 38.9% while raising step accuracy by 4.1% over the full-cache baseline.
Conclusion: By exploiting GUI-specific spatial and temporal redundancy, GUI-KV achieves efficient KV cache compression and offers a practical, reliable inference-acceleration scheme for GUI agents built on vision-language models.
Abstract: Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they also face an efficiency challenge: they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
[27] ThinkBrake: Mitigating Overthinking in Tool Reasoning
Minjae Oh,Sangjun Song,Seungkyu Lee,Sungmin Jo,Yohan Jo
Main category: cs.CL
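TL;DR: This paper studies overthinking in tool use by small reasoning models (SRMs) and proposes ThinkBrake, a training-free decoding strategy that significantly reduces reasoning overhead while preserving or improving accuracy.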
Details
Motivation: After selecting the correct tool, small reasoning models often keep reasoning and overwrite it with an incorrect call, which degrades performance; this work aims to diagnose and mitigate that overthinking.
Method: Run an oracle rollout analysis that injects a termination signal at sentence boundaries, and propose ThinkBrake, which monitors the log-probability margin between the termination token and the current top token and triggers early termination when the margin narrows.
Result: Across several BFCL splits, ThinkBrake preserves or improves accuracy while reducing generated tokens by up to 25%, outperforming various baselines.
Conclusion: ThinkBrake is an effective, training-free early-termination strategy that significantly reduces redundant computation in tool calling by small reasoning models and unlocks their latent performance.
Abstract: Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject a termination token at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8% to 94.2% while reducing tokens by 80-94%, revealing substantial recoverable headroom and potential redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt various early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between the termination token and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single turn, non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens up to 25%, outperforming various baselines.
[28] Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
Yubo Xie,Chenkai Wang,Zongyang Ma,Fahui Miao
Main category: cs.CL
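TL;DR: This paper introduces CHIME, a dataset for evaluating how well large language models understand Chinese Internet memes, and shows through two tasks that models remain limited in explaining meme meanings, identifying origins, and using memes in context.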
Details
Motivation: Investigate whether LLMs truly understand meme content that spreads widely online, especially Chinese memes with cultural and linguistic nuances.
Method: Build CHIME, a dataset of phrase-based Chinese memes, and design two evaluation tasks: one asks models to explain a meme, identify its origin, and generate example sentences; the other uses fill-in-the-blank multiple-choice questions to test whether models can use memes in context.
Result: Models can explain some meme meanings but perform poorly at identifying origins and handling cultural and linguistic complexity; on the multiple-choice task they beat random chance but fall well below human level.
Conclusion: Current LLMs still have limited understanding of Chinese Internet memes; CHIME can support future research on computational meme understanding.
Abstract: Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online -- commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models' ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.
[29] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
Shiyu Li,Yang Tang,Yifan Wang,Peiming Li,Xi Chen
Main category: cs.CL
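TL;DR: Proposes ReSeek, a training framework for search agents based on a self-correction mechanism; with a dense process reward function, it significantly improves performance on knowledge-intensive tasks, evaluated in part on the newly built FictionalHot benchmark.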
Details
Motivation: Existing reinforcement learning methods rely on sparse or rule-based rewards, which can lock agents into suboptimal or erroneous reasoning paths with no ability to recover.
Method: Introduce a self-correction mechanism in which the agent uses a JUDGE action to assess retrieved information and re-plan its search strategy; design a dense process reward function with correctness and utility components, and build a new benchmark, FictionalHot, to avoid data contamination.
Result: Experiments on FictionalHot and other benchmarks show that agents trained with ReSeek significantly outperform state-of-the-art baselines in task success rate and path faithfulness.
Conclusion: Through self-correction and dense rewards, ReSeek improves the reasoning quality and robustness of search agents, providing a more reliable solution for knowledge-intensive tasks.
Abstract: Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. ReSeek is intuitively reasonable and practically simple, and extensive experiments show that agents trained with it significantly outperform SOTA baselines in task success rate and path faithfulness.
[30] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs
Li Li,Ziyi Wang,Yongliang Wu,Jianfei Cai,Xu Yang
Main category: cs.CL
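TL;DR: This paper proposes CoT Vectors, a low-cost, efficient representation of chain-of-thought reasoning; learnable vectors trained in a teacher-student framework improve LLM reasoning and shed light on the internal mechanisms of multi-step reasoning.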
Details
Motivation: Existing chain-of-thought (CoT) approaches such as in-context learning and fine-tuning are costly and inefficient, so a cheaper and more effective alternative is needed.
Method: Inspired by the task vector paradigm, propose CoT Vectors, including Extracted CoT Vectors obtained from the model and Learnable CoT Vectors optimized under a teacher-student framework to provide stable and robust reasoning guidance.
Result: Across multiple benchmarks and models, CoT Vectors outperform existing baselines and approach parameter-efficient fine-tuning methods with fewer trainable parameters; their effectiveness depends on factors such as latent space structure and information density.
Conclusion: CoT Vectors are an efficient, stable way to enhance reasoning that performs well and offers a new view of the functional organization of multi-step reasoning in LLMs.
Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.
[31] SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation
Sangmin Lee,Woongjib Choi,Jihyun Kim,Hong-Goo Kang
Main category: cs.CL
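TL;DR: Proposes a neural spoken language diarization model that supports many languages; with a learnable query-based architecture and pretraining on large-scale simulated code-switching data, it generalizes effectively to multilingual settings and achieves state-of-the-art performance on several benchmarks.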
Details
Motivation: Address the limitations of conventional approaches in data scarcity and architecture optimization, and improve language diarization in multilingual settings.
Method: Combine a learnable query-based architecture grounded in multilingual awareness with pretraining on large-scale simulated code-switching data.
Result: Achieves a relative improvement of 23% to 52% over previous methods on several language diarization benchmarks, reaching state-of-the-art performance.
Conclusion: The work advances language diarization research and establishes a foundational framework for code-switching speech technologies.
Abstract: In this paper, we present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data. By jointly leveraging these two components, our method overcomes the limitations of conventional approaches in data scarcity and architecture optimization, and generalizes effectively to real-world multilingual settings across diverse environments. Experimental results demonstrate that our approach achieves state-of-the-art performance on several language diarization benchmarks, with a relative performance improvement of 23% to 52% over previous methods. We believe that this work not only advances research in language diarization but also establishes a foundational framework for code-switching speech technologies.
[32] Tenyidie Syllabification corpus creation and deep learning applications
Teisovi Angami,Kevisino Khate
Main category: cs.CL
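TL;DR: This paper addresses syllabification for Tenyidie, a low-resource language: it builds a dataset of 10,120 syllabified words and applies several deep learning models, with the BLSTM model reaching the highest test-set accuracy of 99.21%.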
Details
Motivation: Tenyidie is a low-resource language with little NLP research and no reported work on syllabification, so foundational work is needed to support later NLP tasks.
Method: Build a dataset of 10,120 syllabified Tenyidie words and model syllabification with LSTM, BLSTM, BLSTM+CRF, and encoder-decoder architectures, training and evaluating on an 80:10:10 split.
Result: On the test set, the BLSTM model achieves the highest accuracy, 99.21%, confirming that deep learning is effective for Tenyidie syllabification.
Conclusion: The study fills the gap in Tenyidie syllabification; the dataset and trained models will support other NLP tasks for the language such as morphological analysis, part-of-speech tagging, and machine translation.
Abstract: The Tenyidie language is a low-resource language of the Tibeto-Burman family spoken by the Tenyimia Community of Nagaland in the north-eastern part of India and is considered a major language in Nagaland. It is tonal, Subject-Object-Verb, and highly agglutinative in nature. Being a low-resource language, very limited research on Natural Language Processing (NLP) has been conducted. To the best of our knowledge, no work on syllabification has been reported for this language. Among the many NLP tasks, syllabification or syllabication is an important task in which the syllables of a given word are identified. The contribution of this work is the creation of 10,120 syllabified Tenyidie words and the application of deep learning techniques to the created corpus. In this paper, we have applied LSTM, BLSTM, BLSTM+CRF, and Encoder-decoder deep learning architectures to our created dataset. With an 80:10:10 (train:validation:test) split, we achieved the highest accuracy of 99.21% with the BLSTM model on the test set. This work will find its application in numerous other NLP applications, such as morphological analysis, part-of-speech tagging, machine translation, etc., for the Tenyidie language.
Keywords: Tenyidie; NLP; syllabification; deep learning; LSTM; BLSTM; CRF; Encoder-decoder
[33] MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation
Jinlan Fu,Shenzhen Huangfu,Hao Fei,Yichong Huang,Xiaoyu Shen,Xipeng Qiu,See-Kiong Ng
Main category: cs.CL
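TL;DR: Proposes MCM-DPO, a new multimodal direct preference optimization method for improving image alt-text generation, and constructs the high-quality datasets TAlt and PAlt to show that it outperforms existing methods.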
Details
Motivation: Alt-text generation is limited by noisy user annotations, inconsistent standards, and multimodal LLMs' insensitivity to context; supervised fine-tuning depends on accurate annotations and struggles with low-quality user-generated alt-text.
Method: Propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which optimizes textual, visual, and cross-modal factors across single, paired, and multi-preference dimensions, improving generation by learning from preference comparisons without precise annotations.
Result: Experiments on the self-built large-scale, high-quality datasets TAlt and PAlt (202k annotated samples and 18k preference pairs) show that MCM-DPO significantly outperforms DPO and SFT on alt-text generation, reaching state-of-the-art performance.
Conclusion: MCM-DPO improves alt-text quality, especially where precise annotations are scarce but preference data is available, advancing image description generation for blind and low-vision users.
Abstract: The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO
[34] Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation
François Ledoyen,Gaël Dias,Jeremie Pantin,Alexis Lechervy,Fabrice Maurel,Youssef Chahir
Main category: cs.CL
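TL;DR: This study explores the potential of large language models (LLMs) to automatically generate Easy-to-Read (ETR) content, proposing a multi-task learning (MTL) approach that jointly trains on text summarization, simplification, and ETR generation, and comparing two strategies: retrieval-augmented generation (RAG) and MTL-LoRA.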
Details
Motivation: Creating ETR texts manually is time-consuming and resource-intensive, yet ETR is essential for people with cognitive impairments to access information, so automated methods are needed to improve efficiency and accessibility.
Method: Use a multi-task learning framework combining text summarization, simplification, and ETR generation; explore two strategies, multi-task retrieval-augmented generation (RAG) for in-context learning and MTL-LoRA for parameter-efficient fine-tuning; evaluate Mistral-7B and LLaMA-3-8B on ETR-fr, a new high-quality dataset.
Result: Multi-task setups outperform single-task baselines across all configurations; the RAG strategy generalizes in out-of-domain settings, while MTL-LoRA performs best in in-domain configurations.
Conclusion: Multi-task learning improves ETR generation quality; RAG suits out-of-domain use and MTL-LoRA is strongest in-domain, offering an efficient, practical route to automated ETR text generation.
Abstract: Simplifying complex texts is essential for ensuring equitable access to information, especially for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative offers a framework for making content accessible to the neurodivergent population, but the manual creation of such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specificity of ETR constraints, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two different strategies: multi-task retrieval-augmented generation (RAG) for in-context learning, and MTL-LoRA for parameter-efficient fine-tuning. Our experiments with Mistral-7B and LLaMA-3-8B, based on ETR-fr, a new high-quality dataset, demonstrate the benefits of multi-task setups over single-task baselines across all configurations. Moreover, results show that the RAG-based strategy enables generalization in out-of-domain settings, while MTL-LoRA outperforms all learning strategies within in-domain configurations.
[35] Inclusive Easy-to-Read Generation for Individuals with Cognitive Impairments
François Ledoyen,Gaël Dias,Alexis Lechervy,Jeremie Pantin,Fabrice Maurel,Youssef Chahir,Elisa Gouzonnat,Mélanie Berthelot,Stanislas Moravac,Armony Altinier,Amy Khairalla
Main category: cs.CL
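TL;DR: This paper introduces ETR-fr, the first dataset for Easy-to-Read (ETR) text generation fully compliant with European ETR guidelines, and establishes generative baselines by applying parameter-efficient fine-tuning to pretrained language models (PLMs) and large language models (LLMs). It also introduces an evaluation framework combining automatic metrics with human assessment based on a 36-question form; results show that PLMs perform comparably to LLMs and adapt effectively to out-of-domain texts.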
Details
Motivation: Manual ETR adaptation is slow, costly, and hard to scale, which limits access to crucial information for people with cognitive impairments, so a scalable AI-driven solution is needed.
Method: Build ETR-fr, the first ETR dataset compliant with European ETR guidelines; apply parameter-efficient fine-tuning to PLMs and LLMs; design an evaluation framework combining automatic metrics with a 36-question human evaluation form.
Result: PLMs with parameter-efficient fine-tuning perform comparably to LLMs on ETR generation and adapt effectively to out-of-domain texts; combining human evaluation with automatic metrics makes the quality assessment more reliable.
Conclusion: Parameter-efficient fine-tuning of PLMs is a viable, efficient way to generate high-quality ETR text with good cross-domain adaptability, helping improve text accessibility for people with cognitive impairments.
Abstract: Ensuring accessibility for individuals with cognitive impairments is essential for autonomy, self-determination, and full citizenship. However, manual Easy-to-Read (ETR) text adaptations are slow, costly, and difficult to scale, limiting access to crucial information in healthcare, education, and civic life. AI-driven ETR generation offers a scalable solution but faces key challenges, including dataset scarcity, domain adaptation, and balancing lightweight learning of Large Language Models (LLMs). In this paper, we introduce ETR-fr, the first dataset for ETR text generation fully compliant with European ETR guidelines. We implement parameter-efficient fine-tuning on PLMs and LLMs to establish generative baselines. To ensure high-quality and accessible outputs, we introduce an evaluation framework based on automatic metrics supplemented by human assessments. The latter is conducted using a 36-question evaluation form that is aligned with the guidelines. Overall results show that PLMs perform comparably to LLMs and adapt effectively to out-of-domain texts.
[36] ALARB: An Arabic Legal Argument Reasoning Benchmark
Harethah Abu Shairah,Somayah AlHarbi,Abdulaziz AlHussein,Sameer Alsabea,Omar Shaqaqi,Hebah AlShamlan,Omar Knio,George Turkiyyah
Main category: cs.CL
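TL;DR: This paper introduces ALARB, a dataset and task suite for evaluating the reasoning abilities of large language models in the Arabic legal domain. It contains over 13,000 Saudi commercial court cases, supports multistep reasoning tasks such as verdict prediction, reasoning-chain completion, and identification of relevant regulations, and is shown to be useful for instruction tuning.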
Details
Motivation: Existing Arabic benchmarks lack dedicated resources for evaluating multistep reasoning in open-ended settings, which makes it hard to assess how Arabic LLMs perform on complex legal reasoning tasks.
Method: The authors build ALARB, a dataset of Arabic commercial court cases that includes the case facts, the court's reasoning, the verdict, and the cited regulatory clauses. They define challenging tasks including verdict prediction, reasoning-chain completion, and regulation identification, benchmark mainstream Arabic LLMs, and explore the dataset's use for instruction tuning.
Result: A range of open and closed Arabic LLMs are evaluated on several reasoning tasks. After instruction tuning with ALARB, a 12B-parameter model improves significantly on verdict prediction and Arabic verdict generation, approaching the level of GPT-4o.
Conclusion: ALARB fills a gap in evaluating multistep legal reasoning in Arabic. It provides a strong tool for assessing Arabic LLMs and also proves valuable for improving their legal reasoning.
Abstract: We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset's utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.
[37] Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese
Jenny Kunz,Iben Nyholm Debess,Annika Simonsen
Main category: cs.CL
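TL;DR: This study explores how to adapt small, efficient large language models (LLMs) to Faroese, a low-resource North Germanic language. Starting from English models, the authors continue pre-training on related Scandinavian languages (individually or merged) and then fine-tune on Faroese. They compare full fine-tuning with parameter-efficient fine-tuning using LoRA and evaluate the effect on linguistic accuracy and text comprehension. Because Faroese evaluation data is lacking, they build two new minimal-pair benchmarks and add human evaluation by Faroese linguists. Results show that transfer from related languages is crucial, but the best source language depends on the task: Icelandic improves linguistic accuracy, while Danish improves comprehension. The choice between full fine-tuning and LoRA is likewise task-dependent: LoRA improves linguistic acceptability and slightly raises human evaluation scores on the base model, while full fine-tuning gives stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.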
Details
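Motivation: Faroese is a low-resource language with little data and no ready-made language model support, so effective adaptation methods are needed. The work uses transfer from related languages to improve small language models on Faroese and also addresses the shortage of evaluation resources.
Method: Starting from English language models, the authors continue pre-training on Scandinavian languages such as Norwegian, Swedish, Danish, and Icelandic, either individually or merged, and then fine-tune on Faroese data with either full fine-tuning or LoRA. To evaluate the models, they build two new minimal-pair benchmarks and add human evaluation by Faroese linguists.
Result: Transfer from related languages clearly improves Faroese modeling. Icelandic as a source language is better for linguistic accuracy, while Danish helps text comprehension. LoRA performs better on linguistic acceptability and slightly raises human evaluation scores, while full fine-tuning performs better on comprehension tasks and better preserves the original model's capabilities.
Conclusion: For low-resource languages such as Faroese, transfer from related languages is an effective adaptation strategy. The source language and fine-tuning method should be chosen according to the task: Icelandic with LoRA when linguistic accuracy is the priority, and Danish with full fine-tuning when comprehension is the priority.
Abstract: We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.
[38] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation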
Yanming Sun,Runzhe Zhan,Chi Seng Cheang,Han Wu,Xuebo Liu,Yuyao Niu,Fengying Ye,Kaixin Lan,Lidia S. Chao,Derek F. Wong
Main category: cs.CL
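TL;DR: This paper studies the robustness of retrieval-augmented LLM-based machine translation (REAL-MT) under noisy retrieval and proposes a noise synthesis framework and new evaluation metrics. Experiments show that low-resource language pairs degrade more severely under noise, and that large reasoning models (LRMs), despite their stronger reasoning, are more susceptible to noise and tend to rationalize incorrect content, mainly because of an attention shift and a mismatch between confidence and accuracy. Training-free and fine-tuning mitigations are explored, but they trade off performance in clean contexts, which points to the need for integration mechanisms with self-verification.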
Details
Motivation: REAL-MT often faces noisy retrieval in real-world deployment, but its reliability in such conditions is unclear, especially for knowledge-intensive tasks such as idiom translation. Existing work lacks a systematic evaluation framework, so its robustness needs to be investigated.
Method: The authors propose a noise synthesis framework and new metrics to evaluate REAL-MT robustness systematically. They build REAL-MT on Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, run idiom translation experiments on high-, medium-, and low-resource language pairs, and analyze the effect of different types of synthesized noise.
Result: Low-resource language pairs degrade more severely under noise and often produce nonsensical translations. LRMs, despite stronger reasoning, show no improvement in error correction and are more easily misled by noise, tending to rationalize incorrect contexts. Attention analysis shows that the model's focus shifts from the source idiom to the noisy content, and confidence rises as accuracy falls, indicating poor calibration.
Conclusion: Current REAL-MT methods have significant weaknesses under noise, especially in low-resource settings and with LRMs. Training-free and fine-tuning strategies improve robustness but sacrifice performance in clean conditions, revealing a fundamental trade-off. Future work needs context-integration mechanisms that can verify themselves.
Abstract: REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.
[39] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi,Jonathan Herzig,Martin Tutek,Itay Itzhak,Idan Szpektor,Yonatan Belinkov
Main category: cs.CL
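TL;DR: This paper presents ManagerBench, a benchmark that evaluates how large language models trade off safety against pragmatism when making decisions in realistic managerial scenarios. It shows that frontier models, when operational goals conflict with safety, either choose harmful actions or become overly conservative.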
Details
Motivation: As large language models evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Existing safety benchmarks focus mainly on harmful content generation and overlook the harmful actions a model may take to achieve an operational goal.
Method: The authors design ManagerBench, a benchmark of realistic managerial scenarios that force a choice between a pragmatic but harmful action and a safe but less effective one. A control set in which harm is directed only at inanimate objects measures a model's pragmatism and its tendency to be overly safe.
Result: Frontier LLMs perform poorly on this safety-pragmatism trade-off: some frequently choose harmful actions to advance their goals, while others avoid harm only to become overly safe and ineffective. The models identify harm correctly, but their prioritization is flawed.
Conclusion: ManagerBench exposes a significant alignment problem in a core part of agentic behavior, namely making safe decisions when operational goals conflict with safety values, and it is a challenging new benchmark.
Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.
[40] Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Ziliang Wang,Kang An,Xuhui Zheng,Faqiang Qian,Weikun Zhang,Cijun Ouyang,Jialu Cai,Yuhang Wang,Yichao Wu
Main category: cs.CL
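TL;DR: This paper proposes Erasable Reinforcement Learning (ERL), a framework that identifies, erases, and regenerates faulty reasoning steps to make large language models more robust on complex multi-hop reasoning tasks, and it clearly outperforms the previous best methods on several benchmarks.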
Details
Motivation: Existing search-augmented LLMs are limited in multi-hop reasoning by decomposition errors, missed retrievals, and reasoning errors. A failure at any stage can derail the whole answer, so a more robust reasoning mechanism is needed.
Method: The authors propose Erasable Reinforcement Learning (ERL), in which the model explicitly identifies faulty steps in the reasoning chain, erases them, and regenerates them in place, which stops flawed logic from propagating.
Result: ESearch models trained with ERL improve substantially on HotpotQA, MuSiQue, 2Wiki, and Bamboogle: the 3B model gains +8.48% EM and +11.56% F1, and the 7B model gains +5.38% EM and +7.22% F1.
Conclusion: ERL offers an effective paradigm shift for multi-step reasoning in LLMs and markedly improves the robustness of the reasoning process.
Abstract: While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
[41] HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation
Loris Bergeron,Ioana Buhnila,Jérôme François,Radu State
Main category: cs.CL
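TL;DR: The paper presents HalluGuard, a small 4B-parameter reasoning model for mitigating hallucinations in retrieval-augmented generation. It classifies document-claim pairs and generates evidence-grounded justifications, and it performs on par with larger models on several benchmarks.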
Details
Motivation: Large language models hallucinate in real-world applications, which limits their trustworthiness, so an efficient and interpretable way to detect and mitigate hallucinations in RAG is needed.
Method: The approach combines a domain-agnostic synthetic dataset, synthetic grounded and hallucinated claims, and preference-based fine-tuning (ORPO) to distill large-model reasoning into a small backbone, yielding hallucination detection with interpretable output.
Result: On the RAGTruth subset the model reaches 84.0% balanced accuracy, on par with MiniCheck and Granite Guardian. On the full benchmark it reaches 75.7%, close to GPT-4o's 75.9%.
Conclusion: HalluGuard matches the hallucination detection performance of larger models with fewer parameters and is practical and interpretable. The model and datasets are to be released as open source.
Abstract: Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
[42] Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration
Zhen Yin,Shenghua Wang
Main category: cs.CL
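TL;DR: The paper proposes Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly text. It combines section-conditioned stylistic modeling with multi-level contrastive learning, achieves precise span-level detection and reliable confidence estimates across domains and generators, and outperforms existing methods.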
Details
Motivation: Existing detectors of AI-generated text fall short in fine-grained localization, calibration, and cross-domain generalization, which makes it hard to safeguard authorship integrity and publication reliability in scientific writing.
Method: The framework combines section-conditioned stylistic modeling with multi-level contrastive learning, uses BIO-CRF sequence labeling with pointer-based boundary decoding, and adds confidence calibration to achieve fine-grained, high-precision detection of AI-generated text.
Result: On a cross-disciplinary dataset of 100,000 samples, Sci-SpanDet reaches state-of-the-art performance with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. It is robust to adversarial rewriting and performs evenly across IMRaD sections and disciplines.
Conclusion: Sci-SpanDet improves the detection of AI-generated scholarly text with good cross-domain robustness and interpretability. The dataset and code are to be released to support follow-up research.
Abstract: The rapid adoption of large language models (LLMs) in scientific writing raises serious concerns regarding authorship integrity and the reliability of scholarly publications. Existing detection approaches mainly rely on document-level classification or surface-level statistical cues; however, they neglect fine-grained span localization, exhibit weak calibration, and often fail to generalize across disciplines and generators. To address these limitations, we present Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly texts. The proposed method combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences while mitigating topic dependence, thereby enhancing cross-domain robustness. In addition, it integrates BIO-CRF sequence labeling with pointer-based boundary decoding and confidence calibration to enable precise span-level detection and reliable probability estimates. Extensive experiments on a newly constructed cross-disciplinary dataset of 100,000 annotated samples generated by multiple LLM families (GPT, Qwen, DeepSeek, LLaMA) demonstrate that Sci-SpanDet achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. Furthermore, it shows strong resilience under adversarial rewriting and maintains balanced accuracy across IMRaD sections and diverse disciplines, substantially surpassing existing baselines.
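As a rough illustration of what BIO sequence labeling yields for span-level detection, the sketch below turns per-token BIO tags into token spans and scores them with an exact-match span F1. It is not the Sci-SpanDet implementation: the function names are hypothetical, the CRF and pointer-based decoder are omitted, and exact-match scoring is an assumption, since the abstract does not define how Span-F1 is computed.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a sequence of 'B' / 'I' / 'O' tags into token spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new span begins right after another
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an 'I' with no preceding 'B'
                start = i
        else:                          # 'O' closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the text
        spans.append((start, len(tags)))
    return spans


def span_f1(pred: List[Span], gold: List[Span]) -> float:
    """Exact-match span F1 (assumed definition, for illustration only)."""
    pred_set, gold_set = set(pred), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = bio_to_spans(["O", "B", "I", "O", "B"])
    print(predicted)                     # spans flagged as AI-generated
    print(span_f1(predicted, [(1, 3)]))  # one of two predictions is correct
```

Exact matching is the strictest choice; an overlap-based score would credit partially correct boundaries.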
To ensure reproducibility and to foster further research on AI-generated text detection in scholarly documents, the curated dataset and source code will be publicly released upon publication.
[43] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng,Yudi Zhang,Meng Fang,Zihan Zhang,Zhitan Wu,Mykola Pechenizkiy,Ling Chen
Main category: cs.CL
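TL;DR: This paper studies retrieval-augmented generation (RAG) for solving Olympiad-level physics problems, introduces PhoPile, a high-quality multimodal dataset, and evaluates RAG-augmented models on physics reasoning tasks.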
Details
Motivation: The work explores what foundation models can do in expert-level physics reasoning, such as Olympiad-level problems. Inspired by how students prepare for competitions by reviewing past problems, it studies the potential of RAG to improve physics reasoning.
Method: The authors build PhoPile, a multimodal dataset containing diagrams, graphs, and equations, for the systematic study of retrieval-based physics reasoning. On this basis they benchmark RAG with large language models and large multimodal models combined with multiple retrievers.
Result: Combining retrieval with physics corpora improves model performance on physics problem solving, but it also exposes several challenges.
Conclusion: RAG helps foundation models on complex physics reasoning tasks, but there is room for improvement, and retrieval-augmented physics reasoning needs further study.
Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.
[44] Making, not Taking, the Best of N
Ammar Khairi,Daniel D'souza,Marzieh Fadaee,Julia Kreutzer
Main category: cs.CL
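TL;DR: This paper proposes Fusion-of-N (FusioN), a method that uses a large language model as a judge to fuse the useful information from multiple generations into a single final answer. Compared with conventional Best-of-N selection, it performs better across languages, tasks, and model scales, and shows promise for test-time scaling and synthetic data generation.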
Details
Motivation: Conventional Best-of-N keeps only the best generation and discards potentially useful information in the other samples. This work explores a collaborative form of generation that makes full use of the diversity across all candidates.
Method: Fusion-of-N (FusioN) uses a general-purpose LLM as a judge to combine the most valuable parts of N generated samples into a final output. It is compared with Best-of-N in two settings: test-time scaling and synthetic data generation.
Result: In an extensive evaluation covering 11 languages, 3 tasks, and multiple model scales, FusioN consistently outperforms Best-of-N across settings and shows greater robustness and adaptability, especially in challenging scenarios.
Conclusion: Evaluation should move from a single measure of quality to exploiting the polylithic nature of generations. By fusing rather than selecting, FusioN integrates diverse strengths, unlocks latent potential, and achieves gains that selection alone cannot reach.
Abstract: Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings, (i) test-time scaling, where we sample and aggregate from a single model at test-time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks and varying model scales. Across the bench, FusioN consistently outperforms BoN showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
[45] Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks
Eileen Pan,Anna Seo Gyeong Choi,Maartje ter Hoeve,Skyler Seto,Allison Koenecke
Main category: cs.CL
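TL;DR: The study finds that large language models perform markedly worse on non-standard English dialects, with specific grammatical structures (existential "it", zero copula, and y'all) having the largest effect on accuracy, and recommends that future work focus on bias mitigation for high-impact grammatical structures.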
Details
Motivation: Large language models are widely used in natural language processing but perform poorly on under-represented English dialects, so the causes of this degradation need to be investigated.
Method: Standard American English questions are converted into non-standard dialect forms, model performance is evaluated on multiple-choice question answering, and the effect of individual grammar rules on performance is analyzed.
Result: Some non-standard English questions reduce model accuracy by up to 20%, and three grammar rules (existential "it", zero copula, and y'all) explain most of the degradation.
Conclusion: Bias mitigation methods should target individual high-impact grammatical structures to improve model performance on non-standard dialects.
Abstract: Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.
[46] Syntax-Guided Diffusion Language Models with User-Integrated Personalization
Ruqian Zhang,Yijiao Zhang,Juan Shen,Zhongyi Zhu,Annie Qu
Main category: cs.CL
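TL;DR: The paper proposes a syntax-guided diffusion language model that uses structural supervision and personalized conditioning to generate high-quality, diverse, and controllable text.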
Details
Motivation: Text generated by existing large language models tends to be generic and lacks structural diversity, which limits personalized expression.
Method: A cascaded framework generates syntactic guidance before conditional text generation and is then generalized to a noncascaded architecture that better aligns structure and content. A shared representation mechanism enables fine-grained personalization.
Result: Experiments on multiple tasks show that the method outperforms existing approaches in fluency, diversity, and stylistic fidelity, and qualitative analyses show good interpretability and flexibility.
Conclusion: Syntax-guided diffusion models improve the quality, diversity, and controllability of generated text and support personalized generation and zero-shot inference.
Abstract: Large language models have made revolutionary progress in generating human-like text, yet their outputs often tend to be generic, exhibiting insufficient structural diversity, which limits personalized expression. Recent advances in diffusion models have opened new opportunities for improving language generation beyond the limitations of autoregressive paradigms. In this work, we propose a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability. We introduce a cascaded framework that generates syntactic guidance before conditional text generation, and further generalize it to a novel noncascaded architecture for better alignment between structure and content. By incorporating syntactic information in the generating process, the proposed model better captures the lexical and structural characteristics of stylistic sentence construction. To enable fine-grained personalization, we develop a shared representation mechanism that facilitates information integration across users, supporting both faithful stylistic generation and generalizable zero-shot inference. Extensive experiments on multiple tasks demonstrate the superiority of our approach in fluency, diversity, and stylistic fidelity. Further qualitative analyses highlight its interpretability and flexibility in learning personalized patterns.
[47] Interpreting Language Models Through Concept Descriptions: A Survey
Nils Feldhus,Laura Kopf
Main category: cs.CL
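TL;DR: This paper provides the first survey of the emerging field of concept descriptions for the components and abstractions of large language models. It reviews generation methods, evaluation metrics, and datasets, stresses the need for more rigorous causal evaluation, and offers a roadmap for future research.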
Details
Motivation: Understanding how neural networks make decisions is a central goal of mechanistic interpretability. For large language models in particular, this means uncovering internal mechanisms and identifying the role of each component.
Method: The survey reviews existing work and systematically covers the key methods for generating concept descriptions, the state of automated and human evaluation metrics, and the datasets that support this research.
Result: The synthesis finds a pressing need for more rigorous, causal evaluation and sets out the progress and limitations of current work.
Conclusion: The survey offers a clear research path toward more transparent models and identifies the key challenges and directions for generating and evaluating concept descriptions.
Abstract: Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.
[48] Hybrid Dialogue State Tracking for Persian Chatbots: A Language Model-Based Approach
Samin Mahdipour Aghabagher,Saeedeh Momtazi
Main category: cs.CL
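TL;DR: The paper proposes a hybrid dialogue state tracking (DST) model that combines rule-based methods with language models and significantly improves the accuracy and coherence of Persian chatbots.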
Details
Motivation: Traditional rule-based DST lacks the adaptability and coherence needed for open-domain, multi-turn dialogue and cannot deliver a human-like conversational experience.
Method: A hybrid DST model combines rule-based methods with several models: BERT for slot filling and intent detection, XGBoost for intent validation, GPT for DST, and online agents for real-time answer generation.
Result: Evaluation on a Persian multi-turn dialogue dataset shows that the model is significantly more accurate and coherent than existing methods.
Conclusion: A hybrid approach improves DST capabilities and supports conversational AI systems that are more customized, adaptable, and human-like.
Abstract: Dialogue State Tracking (DST) is an essential element of conversational AI with the objective of deeply understanding the conversation context and leading it toward answering user requests. Due to high demands for open-domain and multi-turn chatbots, the traditional rule-based DST is not efficient enough, since it cannot provide the required adaptability and coherence for human-like experiences in complex conversations. This study proposes a hybrid DST model that utilizes rule-based methods along with language models, including BERT for slot filling and intent detection, XGBoost for intent validation, GPT for DST, and online agents for real-time answer generation. This model is uniquely designed to be evaluated on a comprehensive Persian multi-turn dialogue dataset and demonstrated significantly improved accuracy and coherence over existing methods in Persian-based chatbots. The results demonstrate how effectively a hybrid approach may improve DST capabilities, paving the way for conversational AI systems that are more customized, adaptable, and human-like.
[49] Research on the Integration of Embodied Intelligence and Reinforcement Learning in Textual Domains
Haonan Wang,Junfeng Sun,Mingjia Zhao,Wei Liu
Main category: cs.CL
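TL;DR: This paper proposes a new model that integrates embodied intelligence with reinforcement learning to make text processing more intelligent.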
Details
Motivation: The goal is to make text processing more intelligent by combining the perception and action strengths of embodied intelligence with the decision optimization capability of reinforcement learning.
Method: A new integration model is built and validated through theoretical analysis and experimental exploration.
Result: The model performs well on a variety of text processing tasks and shows good application potential.
Conclusion: Integrating embodied intelligence with reinforcement learning offers an effective new route to intelligent text processing.
Abstract: This article addresses embodied intelligence and reinforcement learning integration in the field of text processing, aiming to enhance text handling with more intelligence on the basis of embodied intelligence's perception and action superiority and reinforcement learning's decision optimization capability. Through detailed theoretical explanation and experimental exploration, a novel integration model is introduced. This model has been demonstrated to be very effective in a wide range of text processing tasks, validating its applicative potential.
[50] Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Sukairaj Hafiz Imam,Tadesse Destaw Belay,Kedir Yassin Husse,Ibrahim Said Ahmad,Idris Abdulmumin,Hadiza Ali Umar,Muhammad Yahuza Bello,Joyce Nakatumba-Nabende,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad
Main category: cs.CL
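TL;DR: This paper is a systematic literature review of automatic speech recognition (ASR) research for African languages, covering datasets, models, training methods, and evaluation techniques. It identifies shortcomings in data resources, reproducibility, and evaluation metrics, and proposes future directions.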
Details
Motivation: Africa has more than 2,000 languages, but its low-resource languages are badly neglected in ASR research, which limits digital inclusion. A systematic review of existing work is needed to drive progress.
Method: A systematic literature review following the PRISMA 2020 guidelines searches DBLP, ACM, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025 and selects 71 relevant studies for analysis.
Result: The review identifies 74 datasets covering 111 African languages, totaling about 11,206 hours of speech. Fewer than 15% of studies provide reproducible materials, and data licensing is unclear. Self-supervised and transfer learning are promising but limited by scarce pre-training data, incomplete dialect coverage, and lack of resources. Most studies use WER and few use linguistically informed metrics suited to tonal and morphologically rich languages.
Conclusion: ASR research for African languages faces problems of data availability, annotation quality, licensing uncertainty, and insufficient benchmarking. Future work needs stronger stakeholder collaboration, ethically sound datasets, lightweight modeling techniques, and unified benchmarks.
Abstract: ASR has achieved remarkable global progress, yet African low-resource languages remain severely underrepresented, producing barriers to digital inclusion across the continent with more than 2,000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicates, and low-quality studies (score <3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of research provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most of the researchers use Word Error Rate (WER), with very minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages.
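To make the contrast between WER and CER concrete, here is a minimal sketch of both metrics built on a standard Levenshtein edit distance. The function names are illustrative and the example sentence is invented for the demonstration; it is not taken from the reviewed studies.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative pair: a single wrong character inside a long word.
    ref = "ninakupenda sana"
    hyp = "ninakupendo sana"
    print(f"WER = {wer(ref, hyp):.3f}")  # the whole word counts as an error
    print(f"CER = {cer(ref, hyp):.3f}")  # only one character counts
```

In this pair one wrong character makes half the words wrong under WER but only one character in sixteen under CER, which is why word-level scores can overstate errors for long, morphologically complex word forms.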
The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.
[51] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models
David Anugraha,Shou-Yi Hung,Zilu Tang,Annie En-Shiun Lee,Derry Tanti Wijaya,Genta Indra Winata
Main category: cs.CL
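TL;DR: This paper presents mR3, a multilingual reward reasoning model trained on 72 languages, the broadest language coverage to date. Through a systematic study of data and curriculum selection strategies, it reaches state-of-the-art performance in multilingual reward modeling with a smaller model.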
Details
Motivation: Existing LLM judges perform poorly in non-English settings and effective multilingual training methods are lacking, so training strategies for multilingual reward models need to be explored.
Method: mR3 uses a massively multilingual, rubric-agnostic reward reasoning framework combined with target-language reasoning datasets, and the authors systematically study how data selection and curriculum learning strategies affect reward model performance.
Result: mR3 reaches state-of-the-art results on multilingual reward modeling benchmarks, surpasses larger models such as GPT-OSS-120B while being up to 9x smaller, and its effectiveness is confirmed by extensive ablation studies.
Conclusion: mR3 delivers broad multilingual coverage and efficient reward modeling, shows that careful data and curriculum design is key to training multilingual judges, and advances open, reproducible multilingual evaluation research.
Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.
[52] Pay-Per-Search Models are Abstention Models
Mustafa Omer Gul,Claire Cardie,Tanya Goyal
Main category: cs.CL
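TL;DR: This paper proposes MASH (Modeling Abstention via Selective Help-seeking), a framework that uses reinforcement learning to train large language models to recognize their knowledge boundaries and selectively seek external help or abstain. With a pay-per-search reward and no need to predefine knowledge boundaries, the method improves both question-answering accuracy and abstention ability.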
Details
Motivation: Large language models struggle to recognize their own knowledge boundaries and often hallucinate answers to questions beyond them. Humans, by contrast, recognize their limits and choose to seek help or abstain. A method is needed that teaches models to abstain when they are uncertain.
Method: The MASH framework treats a model's use of a search tool as a proxy signal for abstention. Reinforcement learning penalizes search and rewards correct answers, which trains the model to seek help or abstain when its knowledge is insufficient.
Result: Experiments on three knowledge-intensive QA datasets show that MASH substantially improves selective help-seeking over prior methods and raises answer accuracy by 7.6% on multi-hop QA datasets. MASH also shows strong off-the-shelf abstention and can distinguish answerable from unanswerable questions.
Conclusion: By making selective help-seeking the training objective, MASH produces abstention behavior without pre-labeled knowledge boundaries, aligns tool use with the model's parametric knowledge, and offers a new route to more reliable models.
Abstract: LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while simultaneously rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention -- it can distinguish between unanswerable/answerable questions and selectively generate responses for answerable questions -- showcasing behavior analogous to specialized abstention approaches. We emphasize that contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH's abstentions are a by-product of training for the auxiliary selective help-seeking task.
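The pay-per-search idea can be pictured with a toy reward function. The abstract does not give the exact reward used by MASH, so the linear form below, the `search_cost` parameter, and both function names are assumptions made only to show why penalizing search makes search behave like a signal of uncertainty.

```python
def pay_per_search_reward(is_correct: bool, num_searches: int,
                          search_cost: float = 0.25) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a fee per search call."""
    if num_searches < 0:
        raise ValueError("num_searches must be non-negative")
    return (1.0 if is_correct else 0.0) - search_cost * num_searches


def searching_pays_off(p_correct_without_search: float,
                       p_correct_with_search: float,
                       search_cost: float = 0.25) -> bool:
    """Compare the expected reward of answering directly with searching once.

    Under the assumed reward, one search is worthwhile only when the accuracy
    it adds exceeds its cost, so a policy that maximizes this reward searches
    mainly on questions that its parametric knowledge answers poorly.
    """
    expected_direct = p_correct_without_search
    expected_search = p_correct_with_search - search_cost
    return expected_search > expected_direct


if __name__ == "__main__":
    # A question the model already answers well: the search fee is not worth it.
    print(searching_pays_off(0.9, 0.95))
    # A question outside the model's knowledge: searching pays for itself.
    print(searching_pays_off(0.1, 0.8))
```

Under this assumed reward, the decision to search marks questions the model cannot answer from its own knowledge, which is the behavior the abstract proposes to reuse as abstention.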
Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.[53] Backdoor Attacks Against Speech Language Models
Alexandrine Fortier,Thomas Thebaud,Jesús Villalba,Najim Dehak,Patrick Cardinal
Main category: cs.CL
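TL;DR: This paper presents the first systematic study of audio backdoor attacks against speech language models, shows that the attacks achieve high success rates across multiple tasks and datasets, and proposes a fine-tuning-based defense to mitigate the risk of poisoned pretrained encoders.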
Details
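Motivation: As large language models and their multimodal extensions spread, cascaded architectures inherit the vulnerabilities of every component, so their security needs study, particularly the threat of backdoor attacks in the audio modality. Method: Experiments on four speech encoders and three datasets, covering four tasks (automatic speech recognition, emotion recognition, gender prediction, and age prediction), evaluate the effectiveness of backdoor attacks, and a component-wise analysis examines how backdoors propagate. Result: Attack success rates range from 90.76% to 99.41%. The component analysis identifies the most vulnerable stages of the pipeline, and the proposed fine-tuning defense reduces the risk posed by poisoned encoders. Conclusion: Speech language models are vulnerable to audio backdoor attacks, the security of each component in a multimodal system deserves attention, and the proposed defense offers a workable way to mitigate such threats. Abstract: Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
[54] Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare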
Zhengliang Shi,Ruotian Ma,Jen-tse Huang,Xinbei Ma,Xingyu Chen,Mengru Wang,Qu Yang,Yue Wang,Fanghua Ye,Ziyang Chen,Shanyi Wang,Cixing Li,Wenxuan Wang,Zhaopeng Tu,Xiaolong Li,Zhaochun Ren,Linus
Main category: cs.CL
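TL;DR: This paper proposes the SWF Benchmark for evaluating how large language models allocate societal resources. It finds that current models lean utilitarian and neglect fairness, and that their allocation decisions are easily swayed by output-length constraints and social framing, which points to a need for dedicated alignment and governance mechanisms.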
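The fairness side of this benchmark's trade-off is measured by the Gini coefficient. A minimal sketch of the standard formula over sorted, non-negative values follows; the function name and the sample allocations are illustrative and are not taken from the benchmark.

```python
def gini(values):
    """Gini coefficient of non-negative allocations (0 = perfectly equal)."""
    xs = sorted(values)
    if any(x < 0 for x in xs):
        raise ValueError("allocations must be non-negative")
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum_i(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    # over the values in ascending order.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


equal_split = [1, 1, 1, 1]        # every recipient gets the same share
winner_takes_all = [0, 0, 0, 10]  # one recipient gets everything
print(gini(equal_split), gini(winner_takes_all))
```

For `n` recipients the coefficient ranges from 0 (equal shares) to `(n - 1) / n` (one recipient holds everything), so a purely efficiency-seeking allocator that concentrates tasks on the most productive recipients drives it upward.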
Details
Motivation: As large language models take part in more high-stakes societal decisions, their value orientation and behavioral principles in resource allocation need to be understood, to avoid worsening inequality or creating social risk. Method: The authors design a dynamic simulation environment, the Social Welfare Function (SWF) Benchmark, in which an LLM acts as a sovereign allocator that distributes tasks to a heterogeneous community and continually trades off collective efficiency (Return on Investment) against distributive fairness (Gini coefficient). They evaluate 20 state-of-the-art LLMs. Result: (1) General conversational ability does not predict allocation skill; (2) most LLMs default to a utilitarian orientation, which leads to severe inequality; (3) allocation strategies are highly sensitive to output-length constraints and social-influence framing. Conclusion: Deploying current LLMs in societal decision-making is risky, and responsible AI governance requires specialized benchmarks and targeted value alignment. Abstract: Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
[55] GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning
Oussama Gabouj,Kamel Charaf,Ivan Zakazov,Nicolas Baldwin,Robert West
Main category: cs.CL
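TL;DR: This paper proposes the Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration method that trains a large language model to generate concise, input-specific demonstrations. Under budget constraints it outperforms traditional RAG and generalizes well across mathematical reasoning and cross-disciplinary STEM tasks.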
Details
Motivation: Traditional retrieval-augmented generation (RAG) depends on static databases, adapts poorly, and can introduce irrelevant information, so it struggles to supply useful context in resource-constrained settings. Method: The authors design and train a generative model (GRAD) that produces concise, relevant demonstrations for each input. It is trained only on a math dataset, and the number of tokens per demonstration and in the final output is capped to simulate budget constraints. Result: On Qwen2.5-14B, GRAD consistently outperforms strong baselines, both on mathematical reasoning and on OOD domains such as physics, chemistry, and computer science. Demonstrations generated by smaller models can effectively guide larger models, lowering training cost while keeping accuracy competitive. Conclusion: GRAD is a first step toward a scalable, dynamic few-shot learning paradigm for resource-constrained environments and shows the potential of dynamic demonstration generation for improving context quality and model generalization. Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach where an LLM is trained to generate input-specific concise demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD's robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.
[56] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
Jiayi Zhang,Simon Yu,Derek Chong,Anthony Sicilia,Michael R. Tomz,Christopher D. Manning,Weiyan Shi
Main category: cs.CL
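TL;DR: This paper proposes Verbalized Sampling (VS), a training-free prompting strategy that mitigates the mode collapse caused by typicality bias in the preference data used for post-training alignment. Experiments show that VS substantially increases generation diversity while preserving factual accuracy and safety.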
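A minimal sketch of the prompting pattern follows: build a prompt that asks for several responses with probabilities, then draw one response from the verbalized distribution. The helper names and the `(text, probability)` input format are assumptions for illustration; parsing real model output and calling a model are out of scope here.

```python
import random


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Ask for k responses together with their verbalized probabilities."""
    return f"Generate {k} {task} and their corresponding probabilities"


def sample_verbalized(candidates, rng=None):
    """Draw one response from (text, probability) pairs.

    `candidates` is assumed to be already parsed from the model's reply.
    Probabilities need not sum to 1; random.choices treats them as
    relative weights.
    """
    rng = rng or random.Random()
    texts = [text for text, _ in candidates]
    weights = [max(prob, 0.0) for _, prob in candidates]
    if not texts or sum(weights) <= 0:
        raise ValueError("need at least one candidate with positive probability")
    return rng.choices(texts, weights=weights, k=1)[0]


prompt = build_vs_prompt("jokes about coffee")
# Hypothetical parsed reply, standing in for real model output.
reply = [("joke A", 0.5), ("joke B", 0.3), ("joke C", 0.2)]
print(prompt)
print(sample_verbalized(reply, random.Random(0)))
```

Sampling from the verbalized weights, rather than always taking the first response, is what lets lower-probability candidates surface in the final output.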
Details
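Motivation: Existing post-training alignment methods often reduce language model diversity, a phenomenon known as mode collapse. The authors argue that the root cause is typicality bias in preference data: annotators tend to choose familiar, common text. This preference, well documented in cognitive psychology, is pervasive at the data level and is a key driver of mode collapse. The authors therefore re-examine mode collapse from a data perspective and propose a training-free, inference-time remedy. Method: The authors first verify typicality bias in preference data through theoretical modeling and empirical analysis. They then propose Verbalized Sampling (VS), which prompts the model to output a set of candidate responses together with a probability distribution over them (e.g., "Generate 5 jokes about coffee and their corresponding probabilities") and then samples the final result from that set. The method is applied at inference time and needs no additional training. Result: VS significantly increases generation diversity across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation (1.6-2.1x over direct prompting in creative writing) without sacrificing factual accuracy or safety. More capable models benefit more from VS. Conclusion: Mode collapse stems not only from algorithmic limitations but also from typicality bias at the data level. Verbalized Sampling is a simple, effective inference strategy that mitigates the problem and releases the generative diversity suppressed in pre-trained models, providing a practical way to improve LLM creativity. Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
[57] Energy-Regularized Sequential Model Editing on Hyperspheres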
Qingyuan Liu,Jia-Chen Gu,Yunzhi Yao,Hong Wang,Nanyun Peng
Main category: cs.CL
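TL;DR: This paper proposes SPHERE, a model editing method regularized by Hyperspherical Energy (HE). By stabilizing the distribution of neuron weights, it mitigates knowledge forgetting during sequential editing of large language models and substantially improves editing performance and knowledge retention on LLaMA3 and Qwen2.5.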
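To make the notion of hyperspherical energy concrete, the sketch below computes one commonly used form: neuron weight vectors are projected onto the unit sphere and inverse pairwise distances are summed. The exact definition used in the paper is not given in this entry, so the inverse-distance kernel, the exponent `s`, and the function name are assumptions.

```python
import math


def hyperspherical_energy(weights, s=1.0, eps=1e-12):
    """Sum of 1 / distance**s over unordered pairs of unit-normalized neurons.

    Lower energy means the neurons are spread more uniformly on the sphere.
    This is an illustrative form, not necessarily the paper's definition.
    """
    units = []
    for w in weights:
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("zero vector cannot be projected onto the sphere")
        units.append([x / norm for x in w])

    energy = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            dist = math.dist(units[i], units[j])
            energy += 1.0 / (max(dist, eps) ** s)  # eps guards against duplicates
    return energy


spread = [[1.0, 0.0], [-1.0, 0.0]]   # antipodal neurons
crowded = [[1.0, 0.0], [1.0, 0.1]]   # nearly parallel neurons
print(hyperspherical_energy(spread), hyperspherical_energy(crowded))
```

Tracking a quantity like this before and after each edit is one way to observe the HE fluctuations that the entry associates with editing failures.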
Details
Motivation: Large language models need continual updates to absorb new knowledge, but existing model editing methods tend to destabilize representations and cause catastrophic forgetting under sequential editing, so an editing mechanism that keeps the model stable and retains knowledge is needed. Method: The authors use Hyperspherical Energy (HE) to quantify neuron uniformity and analyze its relationship with editing performance. Building on this, they design SPHERE, which projects new knowledge onto a sparse space complementary to the principal directions of the pretrained weights, stabilizing the HE distribution and reducing interference with existing knowledge. Result: HE dynamics correlate strongly with editing performance: the larger the HE fluctuations, the more often editing fails. Theoretical analysis shows that HE dynamics impose a lower bound on the degradation of prior knowledge. On LLaMA3-8B and Qwen2.5-7B, SPHERE improves editing capability over the best baseline by an average of 16.41% and better preserves the model's original performance. Conclusion: By regulating hyperspherical energy, SPHERE achieves more stable sequential model editing and offers a principled, effective solution for updating knowledge in large language models. Abstract: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
cs.CV [Back]
[58] Hybrid Deep Learning for Hyperspectral Single Image Super-Resolution
Usman Muhammad,Jorma Laaksonen
Main category: cs.CV
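TL;DR: The paper proposes a new Spectral-Spatial Unmixing Fusion (SSUF) module and a Spatial-Spectral Gradient Loss for hyperspectral single image super-resolution, improving both spatial resolution and spectral fidelity.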
Details
Motivation: In hyperspectral SISR, conventional deep learning models struggle to restore fine spatial detail while preserving spectral fidelity across wavelengths, which limits their performance. Method: The authors design the SSUF module, which combines spectral unmixing with spectral-spatial feature extraction and plugs into standard 2D convolutional architectures. A ResNet-based CNN performs the reconstruction, trained with a custom loss that combines mean squared error with spatial and spectral gradient terms. Result: Experiments on three public remote sensing hyperspectral datasets show competitive performance with reduced model complexity. Conclusion: The proposed SSUF module and Spatial-Spectral Gradient Loss improve hyperspectral SISR reconstruction quality, balancing spatial detail recovery with spectral fidelity in a more efficient model. Abstract: Hyperspectral single image super-resolution (SISR) is a challenging task due to the difficulty of restoring fine spatial details while preserving spectral fidelity across a wide range of wavelengths, which limits the performance of conventional deep learning models. To address this challenge, we introduce Spectral-Spatial Unmixing Fusion (SSUF), a novel module that can be seamlessly integrated into standard 2D convolutional architectures to enhance both spatial resolution and spectral integrity. The SSUF combines spectral unmixing with spectral-spatial feature extraction and guides a ResNet-based convolutional neural network for improved reconstruction. In addition, we propose a custom Spatial-Spectral Gradient Loss function that integrates mean squared error with spatial and spectral gradient components, encouraging accurate reconstruction of both spatial and spectral features. Experiments on three public remote sensing hyperspectral datasets demonstrate that the proposed hybrid deep learning model achieves competitive performance while reducing model complexity.
[59] Review of Hallucination Understanding in Large Language and Vision Models
Zhengyi Ho,Siyuan Liang,Dacheng Tao
Main category: cs.CV
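TL;DR: This survey proposes a unified, multi-level framework for characterizing text and image hallucinations across modalities and applications, and links them to specific mechanisms in the model lifecycle. It finds that hallucinations often stem from predictable patterns in data distributions and inherited biases, providing a basis for more robust generative AI solutions.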
Details
Motivation: Understanding of hallucinations in large language and vision models remains fragmented and incomplete, so existing methods may treat surface symptoms rather than root causes. A systematic framework is needed to enable more effective and generalizable solutions. Method: The authors propose a unified, multi-level framework for characterizing text and image hallucinations and use a task-modality interleaved approach to link hallucinations to specific mechanisms in the model lifecycle. Result: Hallucinations often originate from predictable patterns in data distributions and inherited biases, and the analysis supports a more integrated understanding of their causes. Conclusion: By establishing a systematic framework for analyzing hallucinations, the survey lays a foundation for designing more robust and effective mitigation in real-world generative AI systems. Abstract: The widespread adoption of large language and vision models in real-world applications has made urgent the need to address hallucinations -- instances where models produce incorrect or nonsensical outputs. These errors can propagate misinformation during deployment, leading to both financial and operational harm. Although much research has been devoted to mitigating hallucinations, our understanding of them is still incomplete and fragmented. Without a coherent understanding of hallucinations, proposed solutions risk mitigating surface symptoms rather than underlying causes, limiting their effectiveness and generalizability in deployment. To tackle this gap, we first present a unified, multi-level framework for characterizing both image and text hallucinations across diverse applications, aiming to reduce conceptual fragmentation. We then link these hallucinations to specific mechanisms within a model's lifecycle, using a task-modality interleaved approach to promote a more integrated understanding. Our investigations reveal that hallucinations often stem from predictable patterns in data distributions and inherited biases. By deepening our understanding, this survey provides a foundation for developing more robust and effective solutions to hallucinations in real-world generative AI systems.
[60] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Jianing Guo,Zhenhong Wu,Chang Tu,Yiyao Ma,Xiangqi Kong,Zhiqian Liu,Jiaming Ji,Shuning Zhang,Yuanpei Chen,Kai Chen,Xianglong Liu,Qi Dou,Yaodong Yang,Huijie Zhao,Weifeng Lv,Simin Li
Main category: cs.CV
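TL;DR: This paper proposes RobustVLA, a method for improving the multi-modal robustness of Vision-Language-Action (VLA) models. By applying robust optimization on both the input and output sides, it substantially improves performance under 17 perturbations across modalities and performs well on real-robot tasks.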
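The entry frames the choice among multiple perturbations as a multi-armed bandit solved with an upper confidence bound algorithm. The sketch below shows the standard UCB1 selection and update rules, with each arm standing for one perturbation type and the reward standing for how much harm it caused. The function names, the exploration constant, and the harm values are hypothetical; the paper's actual reward definition is not given here.

```python
import math


def ucb_select(counts, mean_harm, c=2.0):
    """Pick the next perturbation (arm) to try using the UCB1 rule."""
    # Try every arm once before using the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        mean + math.sqrt(c * math.log(total) / n)  # exploitation + exploration
        for mean, n in zip(mean_harm, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def ucb_update(counts, mean_harm, arm, harm):
    """Update the running mean of observed harm for the chosen arm in place."""
    counts[arm] += 1
    mean_harm[arm] += (harm - mean_harm[arm]) / counts[arm]


# Toy loop with made-up, fixed harm values for three perturbation types.
true_harm = [0.1, 0.9, 0.3]
counts, mean_harm = [0, 0, 0], [0.0, 0.0, 0.0]
for _ in range(50):
    arm = ucb_select(counts, mean_harm)
    ucb_update(counts, mean_harm, arm, true_harm[arm])
print(counts)
```

The exploration term shrinks as an arm is tried more often, so training effort concentrates on the perturbations that currently look most harmful while the others are still revisited occasionally.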
Details
Motivation: Existing VLA work focuses mainly on robustness to visual perturbations and overlooks multi-modal perturbations in actions, instructions, environments, and observations, which limits reliability in real-world deployment. Method: The authors first evaluate mainstream VLA models under 17 perturbations across four modalities. They then propose RobustVLA: on the output side, adversarial-style offline optimization against worst-case action noise; on the input side, enforcing consistent actions across input variations that preserve task semantics. Robustness to multiple perturbations is formulated as a multi-armed bandit problem, and an upper confidence bound algorithm automatically identifies the most harmful noise. Result: On LIBERO, RobustVLA achieves absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone, runs inference 50.6x faster than existing visual-robust VLAs, and gains 10.4% under mixed perturbations. On a real-world FR5 robot, the absolute gain under four-modality perturbations reaches 65.6%. Conclusion: RobustVLA improves the robustness of VLA models under multi-modal perturbations, offers efficient inference, and suits real-robot deployment with limited demonstrations. Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations of four modalities.
[61] Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models
Junjie Li,Ziao Wang,Jianghong Ma,Xiaofeng Zhang
Main category: cs.CV
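TL;DR: The paper proposes Capability-Attributed Data Curation (CADC), a framework that guides instruction data selection by analyzing a model's intrinsic capabilities and surpasses full-data training while using only 5% of the data.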
Details
Motivation: Large vision-language models perform well on benchmarks, but controlling their behavior through instruction tuning remains difficult. Shrinking the instruction-tuning dataset usually degrades performance because existing methods ignore the latent capabilities behind model learning. Method: CADC discovers the model's intrinsic capabilities without supervision from gradient-based learning trajectories, attributes training data to those capabilities through influence estimation, and builds capability-aware curricula through balanced selection and staged sequencing. Result: On multimodal benchmarks, CADC surpasses full-data training using only 5% of the original data. Conclusion: Intrinsic capabilities are basic building blocks of model learning, and CADC offers a capability-driven paradigm for instruction data curation. Abstract: Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the size of the instruction-tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
[62] Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness
Yuchen Song,Andong Chen,Wenxin Zhu,Kehai Chen,Xuefeng Bai,Muyun Yang,Tiejun Zhao
Main category: cs.CV
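TL;DR: The paper proposes C³B, a multicultural, multilingual, multitask benchmark of cultural awareness that evaluates multimodal large models on cross-cultural understanding and generation, and reveals a significant gap between current models and human performance.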
Details
Motivation: Existing cultural awareness benchmarks lack progressively harder tasks and cross-lingual tasks, and they mostly use real-world images that each depict a single culture, so they cannot properly test the cultural understanding of multimodal large language models. Method: The authors build the C³B benchmark, with over 2000 comic images and over 18000 QA pairs across three tasks of increasing difficulty, from basic visual recognition to cultural conflict understanding to cultural content generation, and evaluate 11 open-source MLLMs on it. Result: The 11 open-source MLLMs perform far below human level on C³B, exposing shortcomings in their cultural awareness. Conclusion: C³B provides a more challenging testbed for the cultural awareness of MLLMs and should encourage future research on cross-cultural understanding and generation. Abstract: Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images. Each real-world image typically contains one culture, making these benchmarks relatively easy for MLLMs. Based on this, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask and multilingual cultural awareness benchmark. C$^3$B comprises over 2000 images and over 18000 QA pairs, constructed on three tasks of progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We conducted evaluations on 11 open-source MLLMs, revealing a significant performance gap between MLLMs and human performance. The gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.
[63] Beyond the Prompt: Gender Bias in Text-to-Image Models, with a Case Study on Hospital Professions
Franck Vandewiele,Remi Synave,Samuel Delepoulle,Remi Cozot
Main category: cs.CV
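TL;DR: This study examines gender representation in hospital-related professions across six state-of-the-art text-to-image models and finds gender stereotypes that are pervasive yet vary by model, with prompt design having a marked effect on the generated results.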
Details
Motivation: Text-to-image models are widely used, but their outputs can embed and amplify social biases, so their fairness in representing gender across occupations needs systematic evaluation. Method: Using prompts that combine five hospital professions with five portrait qualifiers, the authors generate 100 images per combination with each of six open-weight text-to-image models and analyze the gender distribution of the generated people and the effect of the qualifiers. Result: All models depict nurses as women and surgeons predominantly as men. Models differ in their gender skew, and some are more sensitive to prompt wording: "corporate" reinforces male depictions, while "beautiful" favors female ones. Conclusion: Gender bias in text-to-image models is systematic and model-specific, and prompt wording strongly shapes the output. Bias-aware design, balanced defaults, and user guidance are needed to limit the spread of stereotypes. Abstract: Text-to-image (TTI) models are increasingly used in professional, educational, and creative contexts, yet their outputs often embed and amplify social biases. This paper investigates gender representation in six state-of-the-art open-weight models: HunyuanImage 2.1, HiDream-I1-dev, Qwen-Image, FLUX.1-dev, Stable-Diffusion 3.5 Large, and Stable-Diffusion-XL. Using carefully designed prompts, we generated 100 images for each combination of five hospital-related professions (cardiologist, hospital director, nurse, paramedic, surgeon) and five portrait qualifiers ("", corporate, neutral, aesthetic, beautiful). Our analysis reveals systematic occupational stereotypes: all models produced nurses exclusively as women and surgeons predominantly as men. However, differences emerge across models: Qwen-Image and SDXL enforce rigid male dominance, HiDream-I1-dev shows mixed outcomes, and FLUX.1-dev skews female in most roles. HunyuanImage 2.1 and Stable-Diffusion 3.5 Large also reproduce gender stereotypes but with varying degrees of sensitivity to prompt formulation. Portrait qualifiers further modulate gender balance, with terms like corporate reinforcing male depictions and beautiful favoring female ones. Sensitivity varies widely: Qwen-Image remains nearly unaffected, while FLUX.1-dev, SDXL, and SD3.5 show strong prompt dependence. These findings demonstrate that gender bias in TTI models is both systematic and model-specific. Beyond documenting disparities, we argue that prompt wording plays a critical role in shaping demographic outcomes. The results underscore the need for bias-aware design, balanced defaults, and user guidance to prevent the reinforcement of occupational stereotypes in generative AI.
[64] Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models
Xiaotian Zou
Main category: cs.CV
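TL;DR: This paper proposes RLStealer, a reinforcement-learning-based prompt inversion framework that efficiently recovers the prompt template of a text-to-image model from a small set of example images, exposing a security risk in prompt trading.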
Details
Motivation: As multimodal large language models advance, a prompt trading market has emerged, but prompt templates can be stolen, a security risk that has received little study. Method: Prompt template stealing is modeled as a sequential decision-making problem. The reinforcement learning framework RLStealer uses several similarity-based feedback signals as reward functions to explore the prompt space effectively. Result: On public benchmarks, RLStealer achieves state-of-the-art performance, cuts the total attack cost to under 13% of that required by existing baselines, and generalizes across image styles to steal unseen prompt templates efficiently. Conclusion: The study exposes a serious security threat in prompt trading and lays groundwork for protective standards in the MLLM marketplace. Abstract: Multimodal Large Language Models (MLLMs) have transformed text-to-image workflows, allowing designers to create novel visual concepts with unprecedented speed. This progress has given rise to a thriving prompt trading market, where curated prompts that induce trademark styles are bought and sold. Although commercially attractive, prompt trading also introduces a largely unexamined security risk: the prompts themselves can be stolen. In this paper, we expose this vulnerability and present RLStealer, a reinforcement-learning-based prompt inversion framework that recovers a prompt template from only a small set of example images. RLStealer treats template stealing as a sequential decision-making problem and employs multiple similarity-based feedback signals as reward functions to effectively explore the prompt space. Comprehensive experiments on publicly available benchmarks demonstrate that RLStealer achieves state-of-the-art performance while reducing the total attack cost to under 13% of that required by existing baselines. Our further analysis confirms that RLStealer can effectively generalize across different image styles to efficiently steal unseen prompt templates. Our study highlights an urgent security threat inherent in prompt trading and lays the groundwork for developing protective standards in the emerging MLLMs marketplace.
[65] Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations
Sihao Ding,Santosh Vasa,Aditi Ramadwar
Main category: cs.CV
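TL;DR: The paper proposes Explanation-Driven Counterfactual Testing (EDCT), an automated verification method that checks whether the Natural Language Explanations (NLEs) produced by a Vision-Language Model (VLM) faithfully reflect the basis of its predictions.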
Details
Motivation: NLEs generated by current VLMs can sound plausible yet lack causal consistency, which creates technical and governance risks, so a method for measuring explanation faithfulness is needed. Method: The model's own explanation is treated as a falsifiable hypothesis: testable visual concepts are parsed from the NLE, counterfactual images are produced by generative inpainting, an LLM analyzes changes in the answers and explanations, and a Counterfactual Consistency Score (CCS) is computed. Result: On 120 OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and produces audit artifacts aligned with regulatory needs. Conclusion: EDCT reveals when the concepts cited in VLM explanations fail causal tests and provides an automated, quantifiable tool for model auditing that supports transparency and trustworthiness. Abstract: Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect the causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM that treats the model's own explanation as a falsifiable hypothesis. Given an image-question pair, EDCT: (1) obtains the model's answer and NLE, (2) parses the NLE into testable visual concepts, (3) generates targeted counterfactual edits via generative inpainting, and (4) computes a Counterfactual Consistency Score (CCS) using LLM-assisted analysis of changes in both answers and explanations. Across 120 curated OK-VQA examples and multiple VLMs, EDCT uncovers substantial faithfulness gaps and provides regulator-aligned audit artifacts indicating when cited concepts fail causal tests.
[66] HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
Xianjie Liu,Yiman Hu,Yixiong Zou,Liang Wu,Jian Xu,Bo Zheng
Main category: cs.CV
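TL;DR: This paper proposes HiDe, a training-free hierarchical decoupling framework that improves multimodal large models on high-resolution image understanding. Token-wise attention decoupling and layout-preserving decoupling remove interference from complex backgrounds, and the method reaches SOTA on several benchmarks.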
Details
Motivation: Existing methods assume that multimodal large language models perform poorly on high-resolution images because they cannot recognize small objects, and so adopt "zoom in" strategies. The authors find that the main problem is complex background interference, not object size. Method: The HiDe framework comprises Token-wise Attention Decoupling (TAD), which identifies key information tokens and aligns them precisely with target regions, and Layout-Preserving Decoupling (LPD), which separates the target regions from the background and reconstructs a compact representation that preserves spatial layout. Result: HiDe reaches SOTA on the high-resolution benchmarks V*Bench, HRBench4K, and HRBench8K, lifts Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, and uses 75% less memory than the previous training-free approach. Conclusion: The main bottleneck in high-resolution image understanding is background interference rather than object size. HiDe's training-free, two-stage decoupling improves the perception of multimodal large models while remaining efficient and general. Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.
[67] FSDENet: A Frequency and Spatial Domains based Detail Enhancement Network for Remote Sensing Semantic Segmentation
Jiahao Fu,Yinfeng Yu,Liejun Wang
Main category: cs.CV
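TL;DR: The paper proposes FSDENet, a detail enhancement network based on the frequency and spatial domains that improves remote sensing image segmentation, with particularly strong boundary accuracy in regions of grayscale variation.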
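To illustrate why a Haar decomposition is sensitive to edges, the sketch below applies a single-level 1D Haar transform: the low-frequency output is a scaled pairwise average and the high-frequency output is a scaled pairwise difference, which is non-zero only where neighboring values change. FSDENet operates on 2D feature maps; the 1D case and the function name are simplifications chosen for this example.

```python
import math


def haar_1d(signal):
    """Single-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums and pairwise differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


# A flat pair followed by a step edge inside the second pair.
low, high = haar_1d([2.0, 2.0, 0.0, 4.0])
print(low)   # smooth content
print(high)  # zero for the flat pair, non-zero at the edge
```

In a segmentation network the same split lets boundary refinement focus on the high-frequency component while the low-frequency component carries the smoother region content.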
Details
Motivation: The work addresses semantic edge ambiguity in remote sensing images caused by grayscale variations (such as shadows and low-contrast regions) and aims to make full use of spatial information for precise segmentation. Method: Spatial-domain processing is combined with frequency-domain information: the FFT strengthens global representation, the Haar wavelet transform decomposes features into high- and low-frequency components to refine boundaries, and the two domains are optimized jointly. Result: The method reaches SOTA performance on four mainstream datasets (LoveDA, Vaihingen, Potsdam, and iSAID) and substantially improves segmentation accuracy in boundary regions and grayscale transition zones. Conclusion: By combining the strengths of the spatial and frequency domains, FSDENet eases the segmentation difficulties caused by grayscale variation and achieves high-accuracy semantic segmentation of remote sensing images. Abstract: To fully leverage spatial information for remote sensing image segmentation and address semantic edge ambiguities caused by grayscale variations (e.g., shadows and low-contrast regions), we propose the Frequency and Spatial Domains based Detail Enhancement Network (FSDENet). Our framework employs spatial processing methods to extract rich multi-scale spatial features and fine-grained semantic details. By effectively integrating global and frequency-domain information through the Fast Fourier Transform (FFT) in global mappings, the model's capability to discern global representations under grayscale variations is significantly strengthened. Additionally, we utilize Haar wavelet transform to decompose features into high- and low-frequency components, leveraging their distinct sensitivity to edge information to refine boundary segmentation. The model achieves dual-domain synergy by integrating spatial granularity with frequency-domain edge sensitivity, substantially improving segmentation accuracy in boundary regions and grayscale transition zones. Comprehensive experimental results demonstrate that FSDENet achieves state-of-the-art (SOTA) performance on four widely adopted datasets: LoveDA, Vaihingen, Potsdam, and iSAID.
[68] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Sheng Yang,Tong Zhan,Guancheng Chen,Yanfeng Lu,Jian Wang
Main category: cs.CV
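TL;DR: This paper recasts autonomous driving as a generalized language and proposes the Max-V1 framework, which uses a vision-language model for end-to-end driving from a front-view camera input to trajectory prediction. With a supervision strategy guided by statistical modeling, it improves on prior methods by over 30% on nuScenes and generalizes well across domains and vehicles.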
Details
Motivation: Traditional autonomous driving systems usually rely on multi-stage, modular designs that suffer from error accumulation and complex integration. By analogy with language generation, this work treats trajectory planning as next-waypoint prediction to obtain a more efficient, unified end-to-end driving framework. Method: The authors propose Max-V1, a one-stage end-to-end autonomous driving framework based on a vision-language model (VLM). Driving trajectories are modeled as a sequence generation task, and future waypoint sequences are predicted directly from front-view camera input. A supervision strategy based on statistical modeling gives the model a well-defined learning objective, and the model is trained by imitation learning on large-scale expert demonstrations. Result: The method reaches state-of-the-art performance on nuScenes, with an overall improvement of over 30% compared to prior baselines, and generalizes well to cross-domain datasets from different vehicles, supporting its cross-vehicle robustness and adaptability. Conclusion: By recasting autonomous driving as sequence generation, Max-V1 shows the potential of VLMs for end-to-end driving and lays a foundation for more capable self-driving agents. Abstract: In this work, we reconceptualize autonomous driving as a generalized language and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the VLM (Vision-Language Model) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. Due to these empirical strengths, this work introduces a model enabling fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
[69] Efficient CNN Compression via Multi-method Low Rank Factorization and Feature Map Similarity
M. Kokhazadeh,G. Keramidas,V. Kelefouras
Main category: cs.CV
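TL;DR: This paper proposes an end-to-end design space exploration methodology for compressing convolutional neural networks by low-rank factorization. A new rank selection strategy based on feature map similarity and a one-shot fine-tuning process markedly improve compression efficiency and compatibility.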
Details
Motivation: Low-rank factorization for deep neural network compression faces difficult rank selection, a large design space, long fine-tuning times, and poor compatibility; this work sets out to address these problems.
Method: Proposes a rank selection strategy based on feature map similarity, adopts one-shot fine-tuning, and integrates multiple low-rank factorization techniques for convolutional and fully connected layers into one framework so that the best method can be chosen flexibly per layer.
Result: Validated on 14 CNN models and 8 datasets, the method achieves substantial model compression with minimal accuracy loss and outperforms several existing state-of-the-art methods.
Conclusion: The method strikes a better balance among compression ratio, accuracy, and generality, and is integrated into TensorFlow 2.x for practical use.
Abstract: Low-Rank Factorization (LRF) is a widely adopted technique for compressing deep neural networks (DNNs). However, it faces several challenges, including optimal rank selection, a vast design space, long fine-tuning times, and limited compatibility with different layer types and decomposition methods. This paper presents an end-to-end Design Space Exploration (DSE) methodology and framework for compressing convolutional neural networks (CNNs) that addresses all these issues. We introduce a novel rank selection strategy based on feature map similarity, which captures non-linear interactions between layer outputs more effectively than traditional weight-based approaches. Unlike prior works, our method uses a one-shot fine-tuning process, significantly reducing the overall fine-tuning time. The proposed framework is fully compatible with all types of convolutional (Conv) and fully connected (FC) layers. To further improve compression, the framework integrates three different LRF techniques for Conv layers and three for FC layers, applying them selectively on a per-layer basis. We demonstrate that combining multiple LRF methods within a single model yields better compression results than using a single method uniformly across all layers. Finally, we provide a comprehensive evaluation and comparison of the six LRF techniques, offering practical insights into their effectiveness across different scenarios. The proposed work is integrated into TensorFlow 2.x, ensuring compatibility with widely used deep learning workflows.
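To make the compression idea concrete, the sketch below counts parameters before and after a low-rank factorization. It is an illustration only, not the framework described in this entry: it assumes the textbook factorization of an m x n fully connected weight matrix into m x r and r x n factors, and one common convolutional scheme (a k x k convolution into r channels followed by a 1 x 1 convolution), which may or may not be among the three Conv techniques the authors integrate. Biases are ignored and all function names are hypothetical.

```python
def fc_params(m, n):
    """Weights in a dense m x n fully connected layer (bias ignored)."""
    return m * n


def fc_low_rank_params(m, n, r):
    """Weights after replacing W (m x n) by A (m x r) times B (r x n)."""
    return r * (m + n)


def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution with c_in inputs and c_out outputs."""
    return k * k * c_in * c_out


def conv_low_rank_params(k, c_in, c_out, r):
    """Assumed scheme: k x k conv into r channels, then a 1 x 1 conv to c_out."""
    return k * k * c_in * r + r * c_out


def max_compressing_rank(m, n):
    """Largest rank r for which r * (m + n) is still smaller than m * n."""
    return (m * n - 1) // (m + n)


# Example: a 4096 x 1000 layer factorized at rank 64.
dense = fc_params(4096, 1000)
factored = fc_low_rank_params(4096, 1000, 64)
print("dense:", dense, "factored:", factored, "ratio:", dense / factored)
print("largest rank that still compresses:", max_compressing_rank(4096, 1000))
```

The count shows why rank selection matters: any rank above `max_compressing_rank` makes the factorized layer larger than the original, while very small ranks compress strongly but discard more of the layer's capacity.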
Experimental results on 14 CNN models across eight datasets demonstrate that the proposed methodology achieves substantial compression with minimal accuracy loss, outperforming several state-of-the-art techniques.
[70] Intelligent 5S Audit: Application of Artificial Intelligence for Continuous Improvement in the Automotive Industry
Rafael da Silva Maciel,Lucio Veraldo Jr
Main category: cs.CV
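TL;DR: This paper proposes an automated 5S audit system based on large language models (LLMs) that assesses 5S standards in automotive manufacturing environments through intelligent image analysis, substantially improving audit efficiency and consistency while reducing cost.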
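The Details for this entry report agreement between automated and human audits as Cohen's kappa (kappa = 0.75). As a minimal sketch of how that statistic is computed, assuming two raters who each assign one categorical label per audited item, the function below compares observed agreement with the agreement expected by chance. The function name and the pass/fail labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical audit outcomes for eight workstations.
system = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
print(cohens_kappa(system, human))
```

Kappa is preferred over raw agreement here because two raters who both say "pass" most of the time would agree often by chance alone; kappa subtracts that baseline.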
Details
Motivation: To make 5S audits in the automotive industry more objective and efficient and to align them with Industry 4.0 standards, artificial intelligence is needed to improve on traditional manual audits.
Method: Develops an automated 5S audit system based on large language models (LLMs) that uses intelligent image analysis to assess the five S's (Seiri, Seiton, Seiso, Seiketsu, Shitsuke) in a standardized way, and validates the system's reliability with Cohen's kappa coefficient.
Result: The system agrees well with human audits (kappa = 0.75), cuts audit time by 50%, reduces operating costs by 99.8%, and scales well.
Conclusion: The approach combines lean management with artificial intelligence and offers automotive manufacturing an efficient, low-cost, scalable new paradigm for 5S audits.
Abstract: The evolution of the 5S methodology with the support of artificial intelligence techniques represents a significant opportunity to improve industrial organization audits in the automotive chain, making them more objective, efficient and aligned with Industry 4.0 standards. This work developed an automated 5S audit system based on large language models (LLMs), capable of assessing the five senses (Seiri, Seiton, Seiso, Seiketsu, Shitsuke) in a standardized way through intelligent image analysis. The system's reliability was validated using Cohen's concordance coefficient (kappa = 0.75), showing strong alignment between the automated assessments and the corresponding human audits. The results indicate that the proposed solution contributes significantly to continuous improvement in automotive manufacturing environments, reducing audit time by 50% relative to the traditional process while maintaining the consistency of the assessments, with a 99.8% reduction in operating costs compared to traditional manual audits. The methodology presented establishes a new paradigm for integrating lean systems with emerging AI technologies, offering scalability for implementation in automotive plants of different sizes.
[71] OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Jiancong Xie,Wenjin Wang,Zhuomeng Zhang,Zihan Liu,Qi Liu,Ke Feng,Zixun Sun,Yuedong Yang
Main category: cs.CV
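TL;DR: This paper presents OIG-Bench, a comprehensive benchmark for evaluating how well multimodal large language models understand One-Image Guides. The dataset is built with a semi-automated annotation pipeline in which multiple agents collaborate, and the evaluation shows that existing models still have clear weaknesses in semantic understanding and logical reasoning.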
Details
Motivation: Existing multimodal large language models (MLLMs) have not been adequately evaluated for human-like understanding. One-Image Guides, a human-oriented format that combines text, imagery, and symbols, reflect the characteristics of human perception and understanding, so a dedicated benchmark is needed to assess how well MLLMs understand such complex visual-text relationships.
Method: Builds OIG-Bench, a One-Image Guide understanding benchmark spanning multiple domains, and designs a semi-automated annotation pipeline with collaborating agents to reduce manual annotation cost. Using the benchmark, 29 mainstream MLLMs are evaluated, and the multi-agent system is compared with existing models on image description generation.
Result: Qwen2.5-VL-72B performs best among the tested models with an overall accuracy of 77%, but all models show clear weaknesses in semantic understanding and logical reasoning. The proposed multi-agent annotation system outperforms all MLLMs on image captioning, showing potential for high-quality generation and dataset construction.
Conclusion: Current MLLMs remain markedly limited in understanding complex, structured visual-text content such as One-Image Guides, especially at the semantic and logical levels. OIG-Bench provides an effective evaluation tool for future research, and multi-agent collaborative annotation offers a new route to building high-quality datasets.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships.
In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.
[72] Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning
Chenhui Xu,Fuxun Yu,Michael J. Bianco,Jacob Kovarskiy,Raphael Tang,Qi Zhang,Zirui Xu,Will LeVine,Brandon Dubbs,Heming Liao,Cassandra Burgess,Suvam Bag,Jay Patravali,Rupanjali Kukal,Mikael Figueroa,Rishi Madhok,Nikolaos Karianakis,Jinjun Xiong
Main category: cs.CV
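TL;DR: Geo-R1 is a reasoning-centric post-training framework that equips vision-language models with geospatial reasoning through thinking scaffolding and elevating stages, reaching state-of-the-art performance on multiple benchmarks.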
Details
Motivation: Existing geospatial models mostly rely on domain pretraining or supervised fine-tuning and lack explicit reasoning ability, which limits their performance on complex geospatial reasoning tasks.
Method: Proposes a two-stage framework. The first stage applies supervised fine-tuning on synthetic chain-of-thought exemplars to instill a "geospatial thinking paradigm". The second stage applies GRPO-based reinforcement learning on a weakly supervised cross-view pairing proxy task, which supplies a verifiable and scalable reward signal.
Result: Geo-R1 achieves state-of-the-art performance on multiple geospatial reasoning benchmarks and markedly improves cross-modal feature alignment and reasoning consistency.
Conclusion: Geo-R1 brings the reasoning-first post-training paradigm to geospatial modeling and improves geospatial understanding and reasoning without human-annotated reasoning traces.
Abstract: We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal, teaching models to capture and reconcile features across modalities and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised finetuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
[73] Enhancing Certifiable Semantic Robustness via Robust Pruning of Deep Neural Networks
Hanjiang Hu,Bowei Li,Ziwei Wang,Tianhao Wei,Casidhe Hutchison,Eric Sample,Changliu Liu
Main category: cs.CV
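TL;DR: Proposes a neural network pruning method based on the Unbiased and Smooth Neuron (USN) metric, combined with a Wasserstein distance loss, that improves certifiable robustness to brightness and contrast perturbations while preserving model expressiveness.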
Details
Motivation: Because networks are over-parameterized, existing certified training and robustness verification methods lack tightness and scalability and struggle to handle semantic transformation perturbations such as brightness and contrast changes.
Method: Analyzes the stability and variance of layers and neurons under input perturbation to derive the USN metric; prunes based on USN, retaining high-USN neurons; and introduces a Wasserstein distance loss so that pruning is more concentrated across layers.
Result: Experiments on a keypoint detection task with realistic brightness and contrast perturbations show that the method outperforms baselines in both robustness certification performance and efficiency.
Conclusion: USN-based pruning mitigates over-parameterization and markedly improves certifiable robustness and computational efficiency while preserving model expressiveness.
Abstract: Deep neural networks have been widely adopted in many vision and robotics applications with visual inputs. It is essential to verify their robustness against semantic transformation perturbations, such as brightness and contrast. However, current certified training and robustness certification methods face the challenge of over-parameterization, which hinders tightness and scalability because the neural networks are overly complicated. To this end, we first analyze stability and variance of layers and neurons against input perturbation, showing that certifiable robustness can be indicated by a fundamental Unbiased and Smooth Neuron metric (USN). Based on USN, we introduce a novel neural network pruning method that removes neurons with low USN and retains those with high USN, thereby preserving model expressiveness without over-parameterization. To further enhance this pruning process, we propose a new Wasserstein distance loss to ensure that pruned neurons are more concentrated across layers. We validate our approach through extensive experiments on the challenging robust keypoint detection task, which involves realistic brightness and contrast perturbations, demonstrating that our method achieves superior robustness certification performance and efficiency compared to baselines.
[74] Improved Hyperspectral Anomaly Detection via Unsupervised Subspace Modeling in the Signed Cumulative Distribution Transform Domain
Abu Hasnat Mohammad Rubaiyat,Jordan Vincent,Colin Olson
Main category: cs.CV
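TL;DR: Proposes a new transport-based method for hyperspectral anomaly detection that treats pixels as deformed observations of a template pattern, models background signals in the SCDT domain, and detects anomalies as deviations from that model.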
Details
Motivation: Existing hyperspectral anomaly detection techniques are challenged by complex real-world environments and limited prior knowledge of the signals of interest.
Method: Proposes a transport-based mathematical model that represents hyperspectral pixels as unknown deformations of a template pattern, builds a background signal model in the SCDT domain with an unsupervised subspace modeling technique, and detects anomalous signals as deviations from that model.
Result: Experiments on five different datasets show that the method outperforms current state-of-the-art hyperspectral anomaly detection methods.
Conclusion: The method improves detection of anomalous pixels in hyperspectral images and shows strong robustness and application potential.
Abstract: Hyperspectral anomaly detection (HAD), a crucial approach for many civilian and military applications, seeks to identify pixels with spectral signatures that are anomalous relative to a preponderance of background signatures. Significant effort has been made to improve HAD techniques, but challenges arise due to complex real-world environments and, by definition, limited prior knowledge of potential signatures of interest. This paper introduces a novel HAD method by proposing a transport-based mathematical model to describe the pixels comprising a given hyperspectral image. In this approach, hyperspectral pixels are viewed as observations of a template pattern undergoing unknown deformations, which enables their representation in the signed cumulative distribution transform (SCDT) domain. An unsupervised subspace modeling technique is then used to construct a model of abundant background signals in this domain, whereupon anomalous signals are detected as deviations from the learned model. Comprehensive evaluations across five distinct datasets illustrate the superiority of our approach compared to state-of-the-art methods.
[75] MOLM: Mixture of LoRA Markers
Samar Fares,Nurbek Tastan,Noor Hussein,Karthik Nandakumar
Main category: cs.CV
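TL;DR: Proposes MOLM, a general watermarking framework based on perturbing the parameters of a generative model, in which binary keys activate LoRA adapters to produce robust, imperceptible, and verifiable image watermarks.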
Details
Motivation: Existing image watermarking methods are sensitive to distortions, vulnerable to adaptive removal, and costly to update when the key changes, so a more robust and efficient solution is needed.
Method: Formulates watermark encoding as a key-dependent perturbation of the generative model's parameters and adopts the routing-based Mixture of LoRA Markers (MOLM), which inserts lightweight LoRA adapters into residual and attention blocks and activates them with binary keys, with no per-key retraining.
Result: Experiments on Stable Diffusion and FLUX show that MOLM preserves image quality while resisting compression, distortion, regeneration, averaging attacks, and black-box adversarial attacks on the extractor, achieving robust key recovery.
Conclusion: MOLM offers an efficient, scalable, and robust watermarking scheme for generated images, with good practicality and security, suited to provenance and authentication of synthetic images at scale.
Abstract: Generative models can generate photorealistic images at scale. This raises urgent concerns about the ability to detect synthetically generated images and attribute these images to specific sources. While watermarking has emerged as a possible solution, existing methods remain fragile to realistic distortions, susceptible to adaptive removal, and expensive to update when the underlying watermarking key changes. We propose a general watermarking framework that formulates the encoding problem as key-dependent perturbation of the parameters of a generative model. Within this framework, we introduce Mixture of LoRA Markers (MOLM), a routing-based instantiation in which binary keys activate lightweight LoRA adapters inside residual and attention blocks. This design avoids key-specific re-training and achieves the desired properties such as imperceptibility, fidelity, verifiability, and robustness. Experiments on Stable Diffusion and FLUX show that MOLM preserves image quality while achieving robust key recovery against distortions, compression and regeneration, averaging attacks, and black-box adversarial attacks on the extractor.
[76] Looking Beyond the Known: Towards a Data Discovery Guided Open-World Object Detection
Anay Majee,Amitesh Gangrade,Rishabh Iyer
Main category: cs.CV
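TL;DR: Proposes CROWD, a unified framework that combines combinatorial data discovery with representation learning to address semantic confusion and catastrophic forgetting in open-world object detection.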
Details
Motivation: Existing open-world object detection methods suffer from semantic confusion between known and unknown classes and from catastrophic forgetting, which lowers unknown-class recall and degrades known-class accuracy.
Method: Recasts unknown object discovery and adaptation as interwoven combinatorial data-discovery (CROWD-Discover) and representation learning (CROWD-Learn) tasks. The former uses Submodular Conditional Gain functions to select representative samples that differ strongly from known objects; the latter uses novel combinatorial objectives to jointly disentangle known and unknown representations while keeping known classes discriminatively coherent.
Result: On the M-OWODB and S-OWODB benchmarks, CROWD improves known-class accuracy by 2.83% and 2.05%, respectively, and achieves nearly 2.4x the unknown-class recall.
Conclusion: CROWD mitigates semantic confusion and catastrophic forgetting and outperforms existing methods in both known-class accuracy and unknown-class discovery.
Abstract: Open-World Object Detection (OWOD) enriches traditional object detectors by enabling continual discovery and integration of unknown objects via human guidance. However, existing OWOD approaches frequently suffer from semantic confusion between known and unknown classes, alongside catastrophic forgetting, leading to diminished unknown recall and degraded known-class accuracy. To overcome these challenges, we propose Combinatorial Open-World Detection (CROWD), a unified framework reformulating unknown object discovery and adaptation as an interwoven combinatorial (set-based) data-discovery (CROWD-Discover) and representation learning (CROWD-Learn) task. CROWD-Discover strategically mines unknown instances by maximizing Submodular Conditional Gain (SCG) functions, selecting representative examples distinctly dissimilar from known objects. Subsequently, CROWD-Learn employs novel combinatorial objectives that jointly disentangle known and unknown representations while maintaining discriminative coherence among known classes, thus mitigating confusion and forgetting. Extensive evaluations on OWOD benchmarks illustrate that CROWD achieves improvements of 2.83% and 2.05% in known-class accuracy on M-OWODB and S-OWODB, respectively, and nearly 2.4x unknown recall compared to leading baselines.
[77] Discrete Wavelet Transform as a Facilitator for Expressive Latent Space Representation in Variational Autoencoders in Satellite Imagery
Arpan Mahara,Md Rezaul Karim Khan,Naphtali Rishe,Wenjia Wang,Seyed Masoud Sadjadi
Main category: cs.CV
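TL;DR: This paper proposes ExpDWT-VAE, a new VAE architecture based on the Discrete Wavelet Transform (DWT) that enhances the latent space representation used by latent diffusion models for remote sensing image generation.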
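As a rough illustration of the wavelet step this entry refers to, the sketch below computes one level of a 2D Haar decomposition and its inverse in plain Python. It is not the ExpDWT-VAE implementation: the orthonormal scaling, the sub-band names (`ll`, `lh`, `hl`, `hh`), and the function names are assumptions chosen for the example, and sub-band naming conventions differ between libraries.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of a list-of-lists image with even sides."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (low frequency)
            lh_row.append((a + b - c - d) / 2)  # top row minus bottom row
            hl_row.append((a - b + c - d) / 2)  # left column minus right column
            hh_row.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the image from the four sub-bands."""
    rows, cols = len(ll), len(ll[0])
    image = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, v, h, x = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + v + h + x) / 2
            image[2 * i][2 * j + 1] = (s + v - h - x) / 2
            image[2 * i + 1][2 * j] = (s - v + h - x) / 2
            image[2 * i + 1][2 * j + 1] = (s - v - h + x) / 2
    return image


tile = [[1.0, 2.0], [3.0, 4.0]]
bands = haar_dwt2(tile)
print(bands)
print(haar_idwt2(*bands))
```

Each sub-band has half the height and width of the input, so a frequency branch can apply convolutions at reduced resolution and then return to the original size through the inverse transform.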
Details
Motivation: Few studies address optimizing the latent space itself in diffusion models, and conventional VAEs are limited in capturing the multi-scale features of remote sensing images, so the latent representation needs improving to raise generation quality.
Method: Proposes ExpDWT-VAE with a dual-branch structure: one branch processes the spatial-domain input, while the other extracts frequency-domain features by 2D Haar wavelet decomposition, followed by convolution and inverse DWT reconstruction. The branches are fused into a joint spatial-frequency representation, which is then mapped to a diagonal Gaussian latent space.
Result: Experiments on a self-built satellite imagery dataset show that the method clearly outperforms baseline models on several performance metrics and improves the expressiveness of the latent space.
Conclusion: By fusing spatial and frequency-domain information, ExpDWT-VAE markedly improves latent space quality for remote sensing image generation and provides a better latent representation for later LDM applications in remote sensing.
Abstract: Latent Diffusion Models (LDM), a subclass of diffusion models, mitigate the computational complexity of pixel-space diffusion by operating within a compressed latent space constructed by Variational Autoencoders (VAEs), demonstrating significant advantages in Remote Sensing (RS) applications. Though numerous studies enhancing LDMs have been conducted, investigations explicitly targeting improvements within the intrinsic latent space remain scarce. This paper proposes an innovative perspective, utilizing the Discrete Wavelet Transform (DWT) to enhance the VAE's latent space representation, designed for satellite imagery. The proposed method, ExpDWT-VAE, introduces dual branches: one processes spatial domain input through convolutional operations, while the other extracts and processes frequency-domain features via 2D Haar wavelet decomposition, convolutional operation, and inverse DWT reconstruction. These branches merge to create an integrated spatial-frequency representation, further refined through convolutional and diagonal Gaussian mapping into a robust latent representation. We utilize a new satellite imagery dataset housed by the TerraFly mapping system to validate our method. Experimental results across several performance metrics highlight the efficacy of the proposed method at enhancing latent space representation.
[78] EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations
Jiayi Liu,Jiaming Zhou,Ke Ye,Kun-Yu Lin,Allan Wang,Junwei Liang
Main category: cs.CV
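TL;DR: This paper presents EgoTraj-Bench, the first trajectory prediction benchmark based on real-world noisy first-person visual histories, and BiFlow, a dual-stream flow matching model that uses a shared latent representation to denoise historical observations and predict future motion at the same time, markedly improving robustness and prediction performance under realistic perceptual constraints.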
Details
Motivation: Existing trajectory prediction methods usually assume idealized observation histories and ignore perceptual artifacts of first-person views such as occlusions, ID switches, and tracking drift, which leaves models insufficiently robust in deployment.
Method: Proposes the EgoTraj-Bench benchmark, which aligns noisy first-person visual histories with clean bird's-eye-view future trajectories, and designs the dual-stream BiFlow model, which uses a shared latent representation to jointly denoise history and predict the future. The EgoAnchor mechanism conditions prediction on distilled historical features via feature modulation to model agent intent.
Result: Experiments show that BiFlow reduces minADE and minFDE by 10-15% on average and demonstrates strong robustness and prediction accuracy.
Conclusion: EgoTraj-Bench and BiFlow provide an important foundation for robust, real-world first-person trajectory prediction systems.
Abstract: Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
[79] David and Goliath in Medical Vision: Convolutional Networks vs Biomedical Vision Language Models
Ran Tong,Jiaqi Liu,Su Liu,Jiexi Xu,Lanruo Wang,Tong Wang
Main category: cs.CV
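TL;DR: This paper compares a supervised lightweight CNN with the zero-shot medical vision-language model BiomedCLIP on pneumonia and tuberculosis detection and finds that simple decision threshold calibration substantially improves the zero-shot VLM, bringing it close to or even above the supervised model.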
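The decision threshold calibration named in this entry can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes binary labels and one score per image (for example, a zero-shot similarity score for the positive class), and the names `f1_score` and `calibrate_threshold` are hypothetical. It scans candidate thresholds on a validation set and keeps the one with the highest F1.

```python
def f1_score(labels, preds):
    """F1 for binary labels and predictions given as 0/1 lists."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(val_scores, val_labels):
    """Return (threshold, F1) maximizing F1 on the validation set.

    Each distinct validation score is tried as a threshold; a sample is
    predicted positive when its score is at or above the threshold.
    """
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in sorted(set(val_scores)):
        preds = [1 if s >= threshold else 0 for s in val_scores]
        f1 = f1_score(val_labels, preds)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Made-up validation scores that all sit below the default 0.5 cut-off.
val_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 1, 0, 1, 1]
default_preds = [1 if s >= 0.5 else 0 for s in val_scores]
print("F1 at 0.5:", f1_score(val_labels, default_preds))
print("calibrated:", calibrate_threshold(val_scores, val_labels))
```

The toy data shows the failure mode calibration addresses: when a model's scores are systematically shifted, a fixed 0.5 threshold can predict a single class for every input even though the scores rank the samples reasonably well. The chosen threshold should then be applied unchanged to the held-out test set.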
Details
Motivation: Accurate interpretation of chest X-rays is essential for automated medical diagnosis, yet the potential of zero-shot vision-language models in medical imaging remains underexplored, and they may perform poorly without proper calibration.
Method: Compares a supervised lightweight CNN with zero-shot BiomedCLIP on the PneumoniaMNIST and Shenzhen TB datasets, and calibrates the decision threshold by optimizing the classification threshold on a validation set to improve VLM performance.
Result: After calibration, BiomedCLIP reaches an F1-score of 0.8841 on pneumonia detection, above the supervised CNN's 0.8803; on tuberculosis detection its F1-score rises from 0.4812 to 0.7684, close to the supervised model's 0.7834.
Conclusion: Proper decision threshold calibration is essential to unlocking the full potential of zero-shot vision-language models in medical diagnosis, allowing them to match or even exceed dedicated supervised models.
Abstract: The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
[80] PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents
Zikang Liu,Junyi Li,Wayne Xin Zhao,Dawei Gao,Yaliang Li,Ji-rong Wen
Main category: cs.CV
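TL;DR: This paper proposes PAL-UI, a framework whose active look-back mechanism strengthens a GUI agent's retrieval of visual memory in long-horizon tasks. Combining dual-level summarization with a dedicated retrieval tool, it clearly outperforms existing methods on mobile and web interface navigation tasks.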
Details
Motivation: Existing GUI agents perform poorly on long-horizon tasks because of memory limitations, and simple history truncation or textual summaries easily lose key visual information, so a method is needed that retains historical visual information and retrieves it on demand.
Method: Proposes the PAL-UI framework, comprising a dual-level summarization agent (recording observation-level cues and action-level outcomes) and a dedicated retrieval tool that lets the agent actively recall specific historical screenshots during planning. PAL-UI-3B and PAL-UI-7B are built on Qwen2.5-VL and trained on a mobile instruction dataset of 8.6K samples.
Result: Experiments show that PAL-UI clearly outperforms baseline models and prior methods on mobile GUI navigation, performs well even with limited data, and generalizes across domains, improving web navigation without additional training.
Conclusion: Active memory retrieval matters for the long-horizon planning ability of vision-based GUI agents, and PAL-UI offers an effective path toward more capable interface agents.
Abstract: Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose PAL-UI (Planning with Active Look-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train PAL-UI-3B and PAL-UI-7B models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.
[81] Domain-Specialized Interactive Segmentation Framework for Meningioma Radiotherapy Planning
Junhyeok Lee,Han Jang,Kyu Sung Choi
Main category: cs.CV
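TL;DR: This study presents Interactive-MEN-RT, an interactive medical image segmentation tool dedicated to meningioma radiotherapy planning. By combining clinician input with several interaction modes, it markedly improves the accuracy of 3D meningioma segmentation.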
Details
Motivation: Meningioma radiotherapy planning requires precise tumor segmentation, but tumor heterogeneity makes it hard for fully automated deep learning methods to meet clinical requirements consistently, so a dedicated solution that combines AI with clinical intervention is needed.
Method: Develops an interactive segmentation system, Interactive-MEN-RT, that integrates several clinician-friendly interaction modes (point annotations, bounding boxes, lasso tools, and scribbles) to assist clinicians with 3D meningioma segmentation in the radiotherapy workflow.
Result: Evaluated on 500 contrast-enhanced T1-weighted MRI scans from the BraTS 2025 Meningioma RT Segmentation Challenge, the method reaches a Dice similarity coefficient of 77.6% and an IoU of 64.8%, outperforming other existing segmentation methods.
Conclusion: Interactive segmentation tools designed for a specific clinical task, such as meningioma radiotherapy, can markedly improve segmentation performance and meet clinical needs for accuracy and reliability.
Abstract: Precise delineation of meningiomas is crucial for effective radiotherapy (RT) planning, directly influencing treatment efficacy and preservation of adjacent healthy tissues. While automated deep learning approaches have demonstrated considerable potential, achieving consistently accurate clinical segmentation remains challenging due to tumor heterogeneity. Interactive Medical Image Segmentation (IMIS) addresses this challenge by integrating advanced AI techniques with clinical input. However, generic segmentation tools, despite widespread applicability, often lack the specificity required for clinically critical and disease-specific tasks like meningioma RT planning. To overcome these limitations, we introduce Interactive-MEN-RT, a dedicated IMIS tool specifically developed for clinician-assisted 3D meningioma segmentation in RT workflows. The system incorporates multiple clinically relevant interaction methods, including point annotations, bounding boxes, lasso tools, and scribbles, enhancing usability and clinical precision. In our evaluation involving 500 contrast-enhanced T1-weighted MRI scans from the BraTS 2025 Meningioma RT Segmentation Challenge, Interactive-MEN-RT demonstrated substantial improvement compared to other segmentation methods, achieving Dice similarity coefficients of up to 77.6% and Intersection over Union scores of 64.8%. These results emphasize the need for clinically tailored segmentation solutions in critical applications such as meningioma RT planning.
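For readers unfamiliar with the two overlap metrics reported above, the following minimal sketch computes the Dice similarity coefficient and Intersection over Union for one pair of binary masks. It assumes masks flattened to equal-length 0/1 lists and treats two empty masks as a perfect match; both choices and the function name are assumptions for illustration, not the challenge's official evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally long binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    dice = 2 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


# Toy 6-voxel example: two of the three predicted voxels overlap the target.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(dice_and_iou(pred, target))
```

For a single case the two metrics are linked by IoU = Dice / (2 - Dice), but that identity does not carry over to averages across cases, so dataset-level figures need not satisfy it exactly.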
The code is publicly available at: https://github.com/snuh-rad-aicon/Interactive-MEN-RT
[82] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li,Dongjun Qian,Kai Su,Qishuai Diao,Xiangyang Xia,Chang Liu,Wenfei Yang,Tianzhu Zhang,Zehuan Yuan
Main category: cs.CV
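TL;DR: Proposes BindWeave, a framework that couples an MLLM with a DiT for high-fidelity, subject-consistent video generation and outperforms existing models.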
Details
Motivation: Existing video generation models struggle to keep subjects consistent when handling complex spatial relationships, temporal logic, and multi-subject interactions, and need to parse the semantics of complex prompts better.
Method: Proposes the BindWeave framework, in which a pretrained multimodal large language model performs cross-modal reasoning to disentangle the roles, attributes, and interactions of entities, producing subject-aware hidden states that condition the diffusion transformer.
Result: Experiments on the OpenS2V benchmark show that the method outperforms existing open-source and commercial models in subject consistency, naturalness, and text relevance.
Conclusion: By tightly integrating semantic understanding with video generation, BindWeave improves subject-consistent video generation in complex scenes.
Abstract: Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
[83] Measuring and Controlling the Spectral Bias for Self-Supervised Image Denoising
Wang Zhang,Huaqiu Li,Xiaowan Hu,Tao Jiang,Zikang Chen,Haoqian Wang
Main category: cs.CV
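TL;DR: Proposes the Spectral Controlling network (SCNet) for self-supervised denoising of paired noisy images. Through frequency band selection, a limit on how well convolutional kernels can learn high-frequency noise, and a frequency-domain separation and low-rank reconstruction module, it preserves high-frequency structural detail while suppressing noise.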
Details
Motivation: Existing self-supervised denoising methods for paired noisy images preserve high-frequency structural detail poorly, and the network learns high-frequency noise.
Method: Proposes Image Pair Frequency-Band Similarity to measure spectral bias; designs a frequency band selection strategy to speed up training; uses the Lipschitz constant to constrain the ability of convolutional kernels to learn high-frequency noise; and introduces the SSR module, which separates noise from high-frequency detail through frequency-domain separation and low-rank reconstruction.
Result: Experiments on synthetic and real-world datasets verify the effectiveness of SCNet, which better preserves high-frequency detail and learns less noise.
Conclusion: SCNet markedly improves self-supervised denoising of paired noisy images and overcomes the limitations of earlier methods in high-frequency detail preservation and noise suppression.
Abstract: Current self-supervised denoising methods for paired noisy images typically involve mapping one noisy image through the network to the other noisy image. However, measuring the spectral bias of such methods with our proposed Image Pair Frequency-Band Similarity reveals two practical limitations. Firstly, the high-frequency structural details in images are not preserved well enough. Secondly, during the process of fitting high frequencies, the network learns high-frequency noise from the mapped noisy images. To address these challenges, we introduce a Spectral Controlling network (SCNet) to optimize self-supervised denoising of paired noisy images. First, we propose a selection strategy to choose frequency band components for noisy images, to accelerate the convergence speed of training. Next, we present a parameter optimization method that restricts the learning ability of convolutional kernels to high-frequency noise using the Lipschitz constant, without changing the network structure. Finally, we introduce the Spectral Separation and low-rank Reconstruction module (SSR module), which separates noise and high-frequency details through frequency domain separation and low-rank space reconstruction, to retain the high-frequency structural details of images. Experiments performed on synthetic and real-world datasets verify the effectiveness of SCNet.
[84] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
Atif Belal,Heitor R. Medeiros,Marco Pedersoli,Eric Granger
Main category: cs.CV
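TL;DR: This paper proposes VLOD-TTA, a test-time adaptation framework for vision-language object detectors (VLODs). Using an IoU-weighted entropy objective and image-conditioned prompt selection, it markedly improves the zero-shot detection performance of models such as YOLO-World and Grounding DINO under a range of distribution shifts.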
Details
Motivation: Existing vision-language object detectors degrade noticeably under domain shift and lack an effective mechanism for robust adaptation.
Method: Proposes the VLOD-TTA framework: (1) an IoU-weighted entropy objective that focuses adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes; (2) image-conditioned prompt selection, which ranks prompts by image-level compatibility and fuses the best prompts with the detector outputs.
Result: The method is validated on benchmarks covering several distribution shifts, including stylized domains, driving scenes, low-light conditions, and common corruptions; both YOLO-World and Grounding DINO improve consistently over the zero-shot and TTA baselines.
Conclusion: Through dense proposal overlap and image-conditioned prompt scoring, VLOD-TTA makes vision-language object detectors more robust to domain shift at test time and offers a practical adaptation scheme for zero-shot detection.
Abstract: Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts (including stylized domains, driving scenes, low-light conditions, and common corruptions) shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
[85] MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles
Yuheng Ji,Huajie Tan,Cheng Chi,Yijie Xu,Yuting Zhao,Enshen Zhou,Huaihai Lyu,Pengwei Wang,Zhongyuan Wang,Shanghang Zhang,Xiaolong Zheng
Main category: cs.CV
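TL;DR: Proposes MathSticks, a benchmark for testing visual symbolic compositional reasoning that covers visual perception, symbolic manipulation, and arithmetic consistency. Evaluation shows that existing models are limited, while humans perform very well.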
Details
Motivation: A unified evaluation of visual perception, symbolic manipulation, and arithmetic consistency in compositional reasoning calls for a more challenging benchmark to advance visual and symbolic reasoning.
Method: Builds the MathSticks benchmark, with 1.4 million generated instances and a curated test set, in text-guided and purely visual settings. Each task requires correcting an incorrect equation by moving one or two matchsticks under strict conservation rules.
Result: An evaluation of 14 vision-language models shows that closed-source models succeed only on simple cases and open-source models perform poorly in the purely visual setting, while human accuracy exceeds 90%.
Conclusion: As a rigorous testbed, MathSticks exposes the shortcomings of existing models in visual symbolic compositional reasoning and points to directions for future research.
Abstract: We introduce MathSticks, a benchmark for Visual Symbolic Compositional Reasoning (VSCR), which unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation that must be corrected by moving one or two sticks under strict conservation rules. The benchmark includes both text-guided and purely visual settings, systematically covering digit scale, move complexity, solution multiplicity, and operator variation, with 1.4M generated instances and a curated test set. Evaluations of 14 vision-language models reveal substantial limitations: closed-source models succeed only on simple cases, open-source models fail in the visual regime, while humans exceed 90% accuracy. These findings establish MathSticks as a rigorous testbed for advancing compositional reasoning across vision and symbols. Our code and dataset are publicly available at https://github.com/Yuheng2000/MathSticks.
[86] Normal-Abnormal Guided Generalist Anomaly Detection
Yuexin Wang,Xiaolei Wang,Yizheng Gong,Jimin Xiao
Main category: cs.CV
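TL;DR: Proposes NAGL, a generalist anomaly detection method that is the first to use both normal and abnormal samples as references; it achieves cross-domain anomaly detection through residual mining and anomaly feature learning and significantly outperforms existing methods.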
Details
Motivation: Previous generalist anomaly detection methods use only normal samples as references and overlook the valuable information in anomalous samples that are often available in practice, which limits performance.
Method: Proposes the Normal-Abnormal Generalist Learning (NAGL) framework with two modules, Residual Mining (RM) and Anomaly Feature Learning (AFL): RM extracts abnormal patterns from normal-abnormal reference residuals, and AFL adaptively learns anomaly features in query images through residual mapping.
Result: Extensive experiments on multiple benchmarks show that the method significantly outperforms existing GAD methods on cross-domain anomaly detection, supporting the value of using normal and abnormal references together.
Conclusion: NAGL is the first generalist anomaly detection method to use both normal and abnormal samples as references; it improves detection accuracy and efficiency and offers a more practical solution for real applications.
Abstract: Generalist Anomaly Detection (GAD) aims to train a unified model on an original domain that can detect anomalies in new target domains. Previous GAD methods primarily use only normal samples as references, overlooking the valuable information contained in anomalous samples that are often available in real-world scenarios. To address this limitation, we propose a more practical approach: normal-abnormal-guided generalist anomaly detection, which leverages both normal and anomalous samples as references to guide anomaly detection across diverse domains. We introduce the Normal-Abnormal Generalist Learning (NAGL) framework, consisting of two key components: Residual Mining (RM) and Anomaly Feature Learning (AFL). RM extracts abnormal patterns from normal-abnormal reference residuals to establish transferable anomaly representations, while AFL adaptively learns anomaly features in query images through residual mapping to identify instance-aware anomalies. Our approach effectively utilizes both normal and anomalous references for more accurate and efficient cross-domain anomaly detection. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing GAD approaches. This work represents the first to adopt a mixture of normal and abnormal samples as references in generalist anomaly detection. The code and datasets are available at https://github.com/JasonKyng/NAGL.
[87] Relative-Absolute Fusion: Rethinking Feature Extraction in Image-Based Iterative Method Selection for Solving Sparse Linear Systems
Kaiqi Zhang,Mingguan Yang,Dali Chang,Chun Chen,Yuxiang Zhang,Kexun He,Jing Zhao
Main category: cs.CV
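TL;DR: Proposes RAF (Relative-Absolute Fusion), a feature extraction technique that improves image-based iterative method selection for sparse linear systems; fusing relative image features with absolute numerical features improves selection accuracy and yields state-of-the-art performance.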
Details
Motivation: Existing image-based methods may encode distinct matrices into identical representations, which leads to inaccurate method selection, so a more precise feature extraction technique is needed to avoid feature ambiguity.
Method: Proposes RAF, which treats the image representation of a matrix as relative features and its numerical values as absolute features, fuses the two into a more comprehensive matrix representation, and applies it to iterative solver selection.
Result: On SuiteSparse and the self-built BMCMat dataset, RAF reduces solution time by 0.08s-0.29s relative to conventional image-based methods, a speedup of 5.86%-11.50%, reaching SOTA performance.
Conclusion: RAF resolves feature ambiguity and significantly improves the accuracy and efficiency of image-based solver selection, advancing this line of work.
Abstract: Iterative method selection is crucial for solving sparse linear systems because these methods inherently lack robustness. Though image-based selection approaches have shown promise, their feature extraction techniques might encode distinct matrices into identical image representations, leading to the same selection and suboptimal method. In this paper, we introduce RAF (Relative-Absolute Fusion), an efficient feature extraction technique to enhance image-based selection approaches. By simultaneously extracting and fusing image representations as relative features with corresponding numerical values as absolute features, RAF achieves comprehensive matrix representations that prevent feature ambiguity across distinct matrices, thus improving selection accuracy and unlocking the potential of image-based selection approaches. We conducted comprehensive evaluations of RAF on SuiteSparse and our developed BMCMat (Balanced Multi-Classification Matrix dataset), demonstrating solution time reductions of 0.08s-0.29s for sparse linear systems, which is 5.86%-11.50% faster than conventional image-based selection approaches and achieves state-of-the-art (SOTA) performance. BMCMat is available at https://github.com/zkqq/BMCMat.
[88] Affordance-Guided Diffusion Prior for 3D Hand Reconstruction
Naru Suzuki,Takehiko Ohkawa,Tatsuro Banno,Jihyun Lee,Ryosuke Furuta,Yoichi Sato
Main category: cs.CV
TL;DR: Proposes a diffusion-based generative prior that uses affordance-related textual descriptions to guide the refinement of occluded hand poses.
Details
Motivation: Under heavy occlusion by the hand itself or by objects, conventional methods struggle to reconstruct 3D hand poses accurately, whereas humans resolve such ambiguity with affordance context (for example, how an object is typically grasped); the authors therefore introduce affordance-aware contextual knowledge to improve reconstruction accuracy.
Method: Uses a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions inferred by a large vision-language model (VLM), and uses it to refine occluded regions of the hand pose.
Result: Experiments on HOGraspNet, a 3D hand-object interaction dataset with severe occlusions, show that the method significantly outperforms existing regression methods and diffusion-based refinement that lacks contextual reasoning.
Conclusion: Using affordance-aware textual descriptions as a generative prior improves the accuracy and functional coherence of 3D hand pose reconstruction under occlusion.
Abstract: How can we reconstruct 3D hand poses when large portions of the hand are heavily occluded by itself or by objects? Humans often resolve such ambiguities by leveraging contextual knowledge -- such as affordances, where an object's shape and function suggest how the object is typically grasped. Inspired by this observation, we propose a generative prior for hand pose refinement guided by affordance-aware textual descriptions of hand-object interactions (HOI). Our method employs a diffusion-based generative model that learns the distribution of plausible hand poses conditioned on affordance descriptions, which are inferred from a large vision-language model (VLM). This enables the refinement of occluded regions into more accurate and functionally coherent hand poses. Extensive experiments on HOGraspNet, a 3D hand-affordance dataset with severe occlusions, demonstrate that our affordance-guided refinement significantly improves hand pose estimation over both recent regression methods and diffusion-based refinement lacking contextual reasoning.
[89] Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Zichen Wen,Shaobo Wang,Yufa Zhou,Junyuan Zhang,Qintong Zhang,Yifeng Gao,Zhaorun Chen,Bin Wang,Weijia Li,Conghui He,Linfeng Zhang
Main category: cs.CV
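TL;DR: Proposes EPIC, a progressive consistency distillation framework for more efficient multi-modal large language models; consistency distillation along the token and layer dimensions eases the training difficulty caused by visual token compression.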
Details
Motivation: Visual tokens are computationally expensive in multi-modal large models, and existing compression methods overlook the feature-space perturbation and increased training difficulty that compression introduces.
Method: Proposes the EPIC framework, which decomposes the feature-space perturbation caused by token compression, introduces token-wise and layer-wise consistency distillation guided by a teacher model, and follows a progressive learning strategy.
Result: Extensive experiments show superior effectiveness, robustness, and generalization.
Conclusion: EPIC reduces the training difficulty caused by visual token compression and significantly improves the training efficiency and performance of multi-modal large models.
Abstract: Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.
[90] CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?
Darya Taratynova,Ahmed Aly,Numan Saeed,Mohammad Yaqub
Main category: cs.CV
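TL;DR: Proposes CardioBench, the first standardized benchmark for echocardiography foundation models. It unifies 8 public datasets covering 4 regression and 5 classification tasks, compares foundation models under a unified evaluation protocol, reveals their strengths and weaknesses in functional prediction, out-of-distribution robustness, and fine-grained recognition, and releases code and pipelines to support reproducible research.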
Details
Motivation: Echocardiography foundation models lack a unified evaluation standard; existing studies mostly use private data, so results are not comparable, and the field faces noisy acquisitions, high frame redundancy, and scarce public data. A standardized, public, reproducible benchmark is needed.
Method: Builds the CardioBench benchmark, unifying 8 public datasets and defining 4 regression and 5 classification tasks; evaluates cardiac-specific, biomedical, and general-purpose encoders under three protocols (zero-shot, probing, and alignment); provides unified preprocessing, data splits, and public evaluation pipelines.
Result: Temporal modeling is critical for functional regression, retrieval is more robust under distribution shift, and domain-specific text encoders capture physiologically meaningful axes; general-purpose encoders transfer strongly and often close the gap with probing, but struggle to distinguish subtle pathology and view classes.
Conclusion: CardioBench provides a comprehensive, public, reproducible evaluation benchmark for echocardiography foundation models, reveals the strengths and limitations of different model families, and offers practical guidance for future model design.
Abstract: Foundation models (FMs) are reshaping medical imaging, yet their application in echocardiography remains limited. While several echocardiography-specific FMs have recently been introduced, no standardized benchmark exists to evaluate them. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited public datasets. Most existing solutions evaluate on private data, restricting comparability. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography FMs. CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. We evaluate several leading FMs, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our results highlight complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap with probing, but struggle with fine-grained distinctions like view classification and subtle pathology recognition. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point and offers actionable insights to guide the design of future echocardiography foundation models.
[91] Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation
Taeyun Woo,Jinah Park,Tae-Kyun Kim
Main category: cs.CV
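TL;DR: Proposes a coarse-to-fine diffusion framework that combines probabilistic modeling with cascaded refinement for 3D hand pose reconstruction, significantly improving accuracy while modeling pose uncertainty.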
Details
Motivation: Deterministic models struggle with pose ambiguity caused by self-occlusion and complex articulation, while existing probabilistic methods are limited to single-stage estimation without refinement, which limits reconstruction accuracy.
Method: Designs a two-stage cascaded diffusion framework: the first stage uses a joint diffusion model to sample diverse 3D joint hypotheses; the second stage uses a Mesh Latent Diffusion Model (Mesh LDM) to reconstruct a 3D hand mesh conditioned on a joint sample, trained in a learned latent space to capture distribution-aware joint-mesh relationships.
Result: Achieves SOTA performance on FreiHAND and HO3Dv2 while effectively modeling pose distributions.
Conclusion: The method combines the strengths of probabilistic modeling and cascaded refinement, achieving high-accuracy 3D hand reconstruction while handling pose uncertainty.
Abstract: Deterministic models for 3D hand pose reconstruction, whether single-staged or cascaded, struggle with pose ambiguities caused by self-occlusions and complex hand articulations. Existing cascaded approaches refine predictions in a coarse-to-fine manner but remain deterministic and cannot capture pose uncertainties. Recent probabilistic methods model pose distributions yet are restricted to single-stage estimation, which often fails to produce accurate 3D reconstructions without refinement. To address these limitations, we propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. The first stage is a joint diffusion model that samples diverse 3D joint hypotheses, and the second stage is a Mesh Latent Diffusion Model (Mesh LDM) that reconstructs a 3D hand mesh conditioned on a joint sample. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships and robust hand priors. Furthermore, the cascaded design mitigates the difficulty of directly mapping 2D images to dense 3D poses, enhancing accuracy through sequential refinement. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
[92] Forestpest-YOLO: A High-Performance Detection Framework for Small Forestry Pests
Aoduo Li,Peikai Lin,Jiancheng Li,Zhen Zhang,Shiting Wu,Zexiao Liang,Zhifa Jiang
Main category: cs.CV
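TL;DR: Proposes Forestpest-YOLO, a YOLOv8-based agricultural pest detection framework that uses SPD-Conv, a CSPOK module, and VarifocalLoss to address small targets, occlusion, and cluttered backgrounds, achieving state-of-the-art performance on the self-built ForestPest dataset.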
Details
Motivation: In complex forestry environments, agricultural pests in remote sensing imagery are typically tiny, heavily occluded, and similar to the background; conventional detectors lose fine-grained features and cannot handle extreme data imbalance, so a more robust detection method is needed.
Method: Builds on the YOLOv8 architecture with three key changes: (1) SPD-Conv for lossless downsampling that preserves high-resolution features of small targets; (2) CSPOK, a cross-stage feature fusion block that enhances multi-scale features and suppresses background noise; (3) VarifocalLoss to refine the training objective and focus on high-quality and hard-to-classify samples.
Result: Experiments on the self-built ForestPest dataset show that Forestpest-YOLO significantly outperforms existing models in detecting small and occluded pests, achieving state-of-the-art performance.
Conclusion: Through its structural design and loss function, Forestpest-YOLO improves detection of small agricultural pests in complex environments and has strong practical value.
Abstract: Detecting agricultural pests in complex forestry environments using remote sensing imagery is fundamental for ecological preservation, yet it is severely hampered by practical challenges. Targets are often minuscule, heavily occluded, and visually similar to the cluttered background, causing conventional object detection models to falter due to the loss of fine-grained features and an inability to handle extreme data imbalance. To overcome these obstacles, this paper introduces Forestpest-YOLO, a detection framework meticulously optimized for the nuances of forestry remote sensing. Building upon the YOLOv8 architecture, our framework introduces a synergistic trio of innovations. We first integrate a lossless downsampling module, SPD-Conv, to ensure that critical high-resolution details of small targets are preserved throughout the network. This is complemented by a novel cross-stage feature fusion block, CSPOK, which dynamically enhances multi-scale feature representation while suppressing background noise. Finally, we employ VarifocalLoss to refine the training objective, compelling the model to focus on high-quality and hard-to-classify samples. Extensive experiments on our challenging, self-constructed ForestPest dataset demonstrate that Forestpest-YOLO achieves state-of-the-art performance, showing marked improvements in detecting small, occluded pests and significantly outperforming established baseline models.
[93] Assessing Foundation Models for Mold Colony Detection with Limited Training Data
Henrik Pichler,Janis Keuper,Matthew Copping
Main category: cs.CV
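TL;DR: Studies the use of vision foundation models (such as MaskDINO) to automate the identification of mold colonies in Petri dishes with little annotated data; compared with traditional models that depend on large annotated datasets (such as YoloV9), the approach is data-efficient with comparable performance.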
Details
Motivation: Conventional microbiological image analysis relies on large manually annotated datasets and lengthy training, which slows the development and iteration of automated systems. This work tests whether high-performing automation on a new vision task is possible without exhaustive annotation.
Method: Compiles a dataset of 5000 Petri dish images annotated with bounding boxes, with curated subsets carrying instance-level masks, simulating traditional, few-shot, and low-shot scenarios; benchmarks three vision foundation models against traditional baselines (such as YoloV9) on task-specific metrics.
Result: MaskDINO fine-tuned on only 150 images reaches near-parity with YoloV9, stays competitive with as few as 25 images, and remains reliable on about 70% of the samples.
Conclusion: Data-efficient foundation models can match or exceed traditional approaches with a fraction of the data, enabling faster iteration of automated microbiological systems.
Abstract: The process of quantifying mold colonies on Petri dish samples is of critical importance for the assessment of indoor air quality, as high colony counts can indicate potential health risks and deficiencies in ventilation systems. Conventionally, the automation of such a labor-intensive process, as well as other tasks in microbiology, relies on the manual annotation of large datasets and the subsequent extensive training of models like YoloV9. To demonstrate that exhaustive annotation is not a prerequisite anymore when tackling a new vision task, we compile a representative dataset of 5000 Petri dish images annotated with bounding boxes, simulating both a traditional data collection approach as well as few-shot and low-shot scenarios with well curated subsets with instance level masks. We benchmark three vision foundation models against traditional baselines on task specific metrics, reflecting realistic real-world requirements. Notably, MaskDINO attains near-parity with an extensively trained YoloV9 model while finetuned only on 150 images, retaining competitive performance with as few as 25 images, still being reliable on approximately 70% of the samples. Our results show that data-efficient foundation models can match traditional approaches with only a fraction of the required data, enabling earlier development and faster iterative improvement of automated microbiological systems with a superior upper-bound performance than traditional models would achieve.
[94] Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning
Minghao Yang,Ren Togo,Guang Li,Takahiro Ogawa,Miki Haseyama
Main category: cs.CV
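TL;DR: Proposes a low-rank adaptation (LoRA) based mixture-of-experts (MoE) model with adaptive shared experts (ASE) and fine-grained experts, which eases the transition from single-task to multi-task learning and strengthens cooperation and knowledge sharing among experts.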
Details
Motivation: Existing MoE-MTL methods rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing when moving from single-task to multi-task learning.
Method: Introduces adaptive shared experts (ASE) into a LoRA-based MoE, with gating weights computed by the router and normalized jointly with the sparse experts; also adopts fine-grained experts, increasing the number of LoRA experts while reducing each one's rank proportionally to improve knowledge sharing.
Result: Experiments on the PASCAL-Context benchmark under unified training settings show that ASE consistently improves multi-task learning performance and validate the shared-expert and fine-grained designs.
Conclusion: By combining shared and fine-grained experts, ASE eases the single-task to multi-task transition and improves performance and parameter efficiency.
Abstract: Mixture-of-Experts (MoE) has emerged as a powerful framework for multi-task learning (MTL). However, existing MoE-MTL methods often rely on single-task pretrained backbones and suffer from redundant adaptation and inefficient knowledge sharing during the transition from single-task to multi-task learning (STL to MTL). To address these limitations, we propose adaptive shared experts (ASE) within a low-rank adaptation (LoRA) based MoE, where shared experts are assigned router-computed gating weights jointly normalized with sparse experts. This design facilitates STL to MTL transition and enhances expert specialization and cooperation. Furthermore, we incorporate fine-grained experts by increasing the number of LoRA experts while proportionally reducing their rank, enabling more effective knowledge sharing under a comparable parameter budget. Extensive experiments on the PASCAL-Context benchmark, under unified training settings, demonstrate that ASE consistently improves performance across diverse configurations and validate the effectiveness of fine-grained designs for MTL.
[95] Arbitrary Generative Video Interpolation
Guozhen Zhang,Haiguang Wang,Chunyu Wang,Yuan Zhou,Qinglin Lu,Limin Wang
Main category: cs.CV
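TL;DR: Proposes ArbInterp, a generative framework for video frame interpolation at any timestamp and of any length; TaRoPE and a segment-wise generation strategy give high-fidelity, continuous interpolation.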
Details
Motivation: Existing video frame interpolation methods can only generate a fixed number of intermediate frames and offer no flexible control over frame rate or sequence duration, which limits their use in practice.
Method: Proposes Timestamp-aware Rotary Position Embedding (TaRoPE) to support interpolation at any timestamp, and uses segment-wise generation with an appearance-motion decoupled conditioning strategy for continuous interpolation of any length.
Result: On multi-scale interpolation benchmarks from 2x to 32x, ArbInterp outperforms prior methods in fidelity and spatiotemporal continuity.
Conclusion: ArbInterp enables flexible, efficient video frame interpolation at arbitrary frame rates and durations, improving generation quality and generality.
Abstract: Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
[96] Color Models in Image Processing: A Review and Experimental Comparison
Muragul Muratbekova,Nuray Toganas,Ayan Igali,Maksat Shagyrov,Elnara Kadyrgali,Adilet Yerkin,Pakizar Shamoi
Main category: cs.CV
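TL;DR: Reviews color models and spaces, assessing their theoretical foundations, computational properties, and practical applications, and compares them experimentally on device dependency, chromatic consistency, and computational complexity; the HS* family is found to align best with human perception.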
Details
Motivation: Choosing a suitable color model is critical for applications such as computer vision and human-computer interaction, but existing models each have strengths and weaknesses and need systematic evaluation to guide applications and research.
Method: Reviews traditional color models (such as RGB, CMYK, and YUV), perceptually uniform spaces (such as CIELAB and CIELUV), and fuzzy-based approaches, and evaluates the models experimentally from several perspectives.
Result: The experiments reveal gaps in existing color models, show that the HS* family is the most aligned with human perception, and summarize the strengths and limitations of each class of model.
Conclusion: The HS* family of color models aligns best with human visual perception; the study serves as a reference for choosing color models in image processing, perceptual computing, digital media, and related fields.
Abstract: Color representation is essential in computer vision and human-computer interaction. There are multiple color models available. The choice of a suitable color model is critical for various applications. This paper presents a review of color models and spaces, analyzing their theoretical foundations, computational properties, and practical applications. We explore traditional models such as RGB, CMYK, and YUV, perceptually uniform spaces like CIELAB and CIELUV, and fuzzy-based approaches as well. Additionally, we conduct a series of experiments to evaluate color models from various perspectives, like device dependency, chromatic consistency, and computational complexity. Our experimental results reveal gaps in existing color models and show that the HS* family is the most aligned with human perception. The review also identifies key strengths and limitations of different models and outlines open challenges and future directions. This study provides a reference for researchers in image processing, perceptual computing, digital media, and any other color-related field.
[97] Multi-level Dynamic Style Transfer for NeRFs
Zesheng Li,Shuaibo Li,Wei Ma,Jianwei Guo,Hongbin Zha
Main category: cs.CV
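TL;DR: Proposes MDS-NeRF, a multi-level dynamic style transfer method for neural radiance fields (NeRFs) that reengineers the NeRF pipeline and adds a dynamic style injection module, producing high-quality stylized renderings while preserving content structure.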
Details
Motivation: Existing NeRF style transfer methods perform poorly on both content preservation and artistic stylization, mainly because simply merging style statistics into the pipeline loses multi-scale spatial structure.
Method: Proposes a multi-level feature adaptor that generates a multi-level feature grid, a dynamic style injection module that adaptively integrates style features, and a multi-level cascade decoder that produces the final view; supports omni-view style transfer using 3D style references.
Result: Experiments show that MDS-NeRF performs well on 3D style transfer, preserving the multi-scale spatial structure of the scene while transferring artistic style accurately.
Conclusion: With an architecture designed specifically for stylization, MDS-NeRF significantly improves NeRF-based 3D style transfer and offers a new approach to high-quality 3D content creation.
Abstract: As the application of neural radiance fields (NeRFs) in various 3D vision tasks continues to expand, numerous NeRF-based style transfer techniques have been developed. However, existing methods typically integrate style statistics into the original NeRF pipeline, often leading to suboptimal results in both content preservation and artistic stylization. In this paper, we present multi-level dynamic style transfer for NeRFs (MDS-NeRF), a novel approach that reengineers the NeRF pipeline specifically for stylization and incorporates an innovative dynamic style injection module. Particularly, we propose a multi-level feature adaptor that helps generate a multi-level feature grid representation from the content radiance field, effectively capturing the multi-scale spatial structure of the scene. In addition, we present a dynamic style injection module that learns to extract relevant style features and adaptively integrates them into the content patterns. The stylized multi-level features are then transformed into the final stylized view through our proposed multi-level cascade decoder. Furthermore, we extend our 3D style transfer method to support omni-view style transfer using 3D style references. Extensive experiments demonstrate that MDS-NeRF achieves outstanding performance for 3D style transfer, preserving multi-scale spatial structures while effectively transferring stylistic characteristics.
[98] LVLMs as inspectors: an agentic framework for category-level structural defect annotation
Sheng Jiang,Yuanmin Ning,Bingxi Huang,Peiyin Chen,Zhaohui Chen
Main category: cs.CV
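TL;DR: Proposes ADPT, an automated structural defect annotation framework based on large vision-language models that generates high-quality defect datasets without manual supervision.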
Details
Motivation: To reduce the cost and inefficiency of manual labeling and enable efficient, accurate defect annotation for infrastructure safety inspection.
Method: Combines a Large Vision-Language Model (LVLM), a semantic pattern matching module, and an iterative self-questioning refinement mechanism, achieving automatic annotation through domain-specific prompting and recursive verification.
Result: Up to 98% accuracy in distinguishing defective from non-defective images, and 85%-98% annotation accuracy across four defect categories under class-balanced settings; 80%-92% accuracy on class-imbalanced data.
Conclusion: ADPT is a scalable, low-cost approach to building high-fidelity defect datasets, supporting downstream tasks such as transfer learning and domain adaptation in structural damage assessment.
Abstract: Automated structural defect annotation is essential for ensuring infrastructure safety while minimizing the high costs and inefficiencies of manual labeling. A novel agentic annotation framework, Agent-based Defect Pattern Tagger (ADPT), is introduced that integrates Large Vision-Language Models (LVLMs) with a semantic pattern matching module and an iterative self-questioning refinement mechanism. By leveraging optimized domain-specific prompting and a recursive verification process, ADPT transforms raw visual data into high-quality, semantically labeled defect datasets without any manual supervision. Experimental results demonstrate that ADPT achieves up to 98% accuracy in distinguishing defective from non-defective images, and 85%-98% annotation accuracy across four defect categories under class-balanced settings, with 80%-92% accuracy on class-imbalanced datasets. The framework offers a scalable and cost-effective solution for high-fidelity dataset construction, providing strong support for downstream tasks such as transfer learning and domain adaptation in structural damage assessment.
[99] Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation
Yunbo Xu,Xuesong Zhang,Jia Li,Zhenzhen Hu,Richang Hong
Main category: cs.CV
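TL;DR: Proposes COFA, an online augmentation strategy based on separating foreground and background features; semantic landmark identification and a consensus-driven mechanism improve generalization in vision-language navigation.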
Details
Motivation: To explore the different roles of foreground and background in visual observations and address the limited generalization of existing VLN methods in unseen environments.
Method: Uses semantically-enhanced landmark identification to separate foreground and background features, and designs a consensus-driven online augmentation strategy with a two-stage voting mechanism that dynamically selects the preferred features.
Result: Validated on REVERIE and R2R, where the method significantly improves the baseline and reaches state-of-the-art performance.
Conclusion: The foreground provides semantic cues and the background carries spatial connectivity information; augmenting with both improves the navigation generalization of VLN agents.
Abstract: Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired by this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate the navigable generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of the baseline and attains state-of-the-art performance.
[100] Robust Context-Aware Object Recognition
Klara Janouskova,Cristian Gavrus,Jiri Matas
Main category: cs.CV
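TL;DR: Proposes RCOR, which treats localization as part of recognition to achieve robust, context-aware object recognition and improve performance in complex scenes.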
Details
Motivation: Standard supervised learning tends to over-rely on the background, leading to shortcut learning that harms robustness; existing methods usually suppress background information, sacrificing useful context.
Method: Proposes RCOR, which integrates localization into recognition, decouples object-centric and context-aware modelling, and applies a non-parametric fusion.
Result: Improves the performance of supervised models and vision-language models (VLMs) on datasets with in-domain and out-of-domain backgrounds, even without fine-tuning.
Conclusion: RCOR is the first approach to achieve robust recognition without sacrificing context-awareness, and shows that localization before recognition is feasible in complex scenes such as ImageNet-1k.
Abstract: In visual recognition, both the object of interest (referred to as foreground, FG, for simplicity) and its surrounding context (background, BG) play an important role. However, standard supervised learning often leads to unintended over-reliance on the BG, known as shortcut learning of spurious correlations, limiting model robustness in real-world deployment settings. In the literature, the problem is mainly addressed by suppressing the BG, sacrificing context information for improved generalization. We propose RCOR -- Robust Context-Aware Object Recognition -- the first approach that jointly achieves robustness and context-awareness without compromising either. RCOR treats localization as an integral part of recognition to decouple object-centric and context-aware modelling, followed by a robust, non-parametric fusion. It improves the performance of both supervised models and VLMs on datasets with both in-domain and out-of-domain BG, even without fine-tuning. The results confirm that localization before recognition is now possible even in complex scenes as in ImageNet-1k.
[101] UCD: Unconditional Discriminator Promotes Nash Equilibrium in GANs
Mengfei Xia,Nan Xue,Jiapeng Zhu,Yujun Shen
Main category: cs.CV
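TL;DR: Proposes an unconditional discriminator (UCD) to improve Nash equilibrium in generative adversarial network (GAN) training: removing condition injection from the discriminator forces it to extract more comprehensive and robust features, which gives the generator better supervision. The method is theoretically compatible with the standard GAN framework and significantly improves performance in experiments, for example reaching an FID of 1.47 on ImageNet-64 and surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models.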
Details
Motivation: GAN training often fails to converge and suffers from mode collapse; feeding the condition into the discriminator can introduce redundant shortcuts that hinder meaningful knowledge extraction, so a mechanism that promotes Nash equilibrium is needed.
Method: Uses an unconditional discriminator (UCD), with no condition injected into the discriminator, forcing it to learn more comprehensive and robust features and give the generator a stronger supervision signal; a theoretical analysis shows compatibility with vanilla GAN theory, so UCD can be used as a plug-in.
Result: UCD significantly improves GAN performance, reaching an FID of 1.47 on ImageNet-64 and surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models, with high efficiency.
Conclusion: UCD promotes Nash equilibrium in GAN training and improves knowledge extraction and model performance, offering an efficient, general improvement for one-step generation.
Abstract: Adversarial training turns out to be the key to one-step generation, especially for Generative Adversarial Network (GAN) and diffusion model distillation. Yet in practice, GAN training hardly converges properly and suffers from mode collapse. In this work, we quantitatively analyze the extent of Nash equilibrium in GAN training, and conclude that redundant shortcuts by inputting condition in $D$ disable meaningful knowledge extraction. We thereby propose to employ an unconditional discriminator (UCD), in which $D$ is enforced to extract more comprehensive and robust features with no condition injection. In this way, $D$ is able to leverage better knowledge to supervise $G$, which promotes Nash equilibrium in GAN literature. Theoretical guarantee on compatibility with vanilla GAN theory indicates that UCD can be implemented in a plug-in manner. Extensive experiments confirm the significant performance improvements with high efficiency. For instance, we achieved 1.47 FID on the ImageNet-64 dataset, surpassing StyleGAN-XL and several state-of-the-art one-step diffusion models. The code will be made publicly available.
[102] Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset
Yannick Hauri,Luca A. Lanzendörfer,Till Aczel
Main category: cs.CV
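TL;DR: Introduces the task of virtual fashion photo-shoot, which transforms standardized garment images into contextually grounded, editorial-style fashion imagery, and builds the first large-scale dataset of garment-lookbook pairs, aligned across domains by an automated retrieval pipeline, as a foundation for more creative, narrative fashion image generation.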
Details
Motivation: Fashion image generation research has focused on virtual try-on in static settings and does not reflect the dynamic poses, diverse locations, and visual narratives of real editorial fashion, so a new approach is needed for more creative, contextual imagery.
Method: Introduces the virtual fashion photo-shoot task and designs an automated retrieval pipeline combining visual-language reasoning with object-level localization to build a garment-lookbook pair dataset at three quality levels (high: 10,000 pairs; medium: 50,000 pairs; low: 300,000 pairs).
Result: Builds the first large-scale garment-lookbook pair dataset, supporting cross-domain generation from e-commerce to fashion media and providing a foundation for later model development.
Conclusion: The work moves fashion image generation from static product shots toward editorial imagery with creativity, atmosphere, and storytelling, opening a new research direction.
Abstract: Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photo-shoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining visual-language reasoning with object-level localization. We construct a dataset with three garment-lookbook pair accuracy levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.
[103] LAKAN: Landmark-assisted Adaptive Kolmogorov-Arnold Network for Face Forgery Detection
Jiayao Jiang,Siran Peng,Bin Liu,Qi Chu,Nenghai Yu
Main category: cs.CV
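TL;DR: Proposes a face forgery detection method based on the Kolmogorov-Arnold Network (KAN) and introduces a Landmark-assisted Adaptive KAN (LAKAN) module that uses facial landmarks to direct the network toward critical regions; it performs strongly on multiple public datasets.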
Details
Motivation: CNNs and Transformers still fall short in modeling complex, non-linear forgery artifacts, so a more effective model is needed for face forgery detection.
Method: Uses a KAN with learnable splines in place of fixed activation functions, and designs the LAKAN module, which uses facial landmarks as a structural prior to dynamically generate the KAN's internal parameters, steering the image encoder toward critical facial regions that contain forgery artifacts.
Result: Experiments on multiple public datasets show that the method achieves superior detection performance.
Conclusion: A KAN framework that combines geometric priors with learnable activation functions improves face forgery detection; the LAKAN module strengthens the model's focus on critical regions.
Abstract: The rapid development of deepfake generation techniques necessitates robust face forgery detection algorithms. While methods based on Convolutional Neural Networks (CNNs) and Transformers are effective, there is still room for improvement in modeling the highly complex and non-linear nature of forgery artifacts. To address this issue, we propose a novel detection method based on the Kolmogorov-Arnold Network (KAN). By replacing fixed activation functions with learnable splines, our KAN-based approach is better suited to this challenge. Furthermore, to guide the network's focus towards critical facial areas, we introduce a Landmark-assisted Adaptive Kolmogorov-Arnold Network (LAKAN) module. This module uses facial landmarks as a structural prior to dynamically generate the internal parameters of the KAN, creating an instance-specific signal that steers a general-purpose image encoder towards the most informative facial regions with artifacts. This core innovation creates a powerful combination between geometric priors and the network's learning process. Extensive experiments on multiple public datasets show that our proposed method achieves superior performance.
[104] Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack
Nanxiang Jiang,Zhaoxin Fan,Enhan Kang,Daiheng Gao,Yun Zhou,Yanxia Chang,Zheng Zhu,Yeying Jin,Wenjun Wu
Main category: cs.CV
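TL;DR: This paper proposes ReFlux, the first attack method for evaluating the robustness of concept erasure in rectified flow T2I models. Through reverse-attention optimization, velocity guidance, and a consistency-preserving strategy, it effectively recovers suppressed concepts and establishes a reliable benchmark for the safety evaluation of next-generation models.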
Details
Motivation: Existing concept erasure methods have been studied mostly on Stable Diffusion and have limited effect when transferred to next-generation rectified flow transformers such as Flux; the attention localization phenomenon they rely on has not yet been fully exploited by attacks, so an attack designed specifically for these models is needed to evaluate their safety.
Method: Proposes the ReFlux attack, whose core components are: 1) a reverse-attention optimization strategy that reactivates suppressed signals and stabilizes attention; 2) a velocity-guided mechanism that makes concept reactivation more robust by steering the flow matching process; 3) a consistency-preserving objective that maintains the global image structure and protects unrelated content.
Result: Extensive experiments show that ReFlux breaks existing concept erasure mechanisms on rectified flow models efficiently and effectively, clearly outperforms attacks transferred from SD, and remains general and stable across erasure settings.
Conclusion: ReFlux is the first concept attack method aimed at rectified flow T2I models; it exposes the fragility of current concept erasure techniques on newer architectures and provides an evaluation benchmark and directions for safer content control.
Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, most existing approaches and corresponding attack evaluations are tailored to Stable Diffusion (SD) and exhibit limited effectiveness when transferred to next-generation rectified flow transformers such as Flux. In this work, we present ReFlux, the first concept attack method specifically designed to assess the robustness of concept erasure in the latest rectified flow-based T2I framework. Our approach is motivated by the observation that existing concept erasure techniques, when applied to Flux, fundamentally rely on a phenomenon known as attention localization. Building on this insight, we propose a simple yet effective attack strategy that specifically targets this property. At its core, a reverse-attention optimization strategy is introduced to effectively reactivate suppressed signals while stabilizing attention. This is further reinforced by a velocity-guided dynamic that enhances the robustness of concept reactivation by steering the flow matching process, and a consistency-preserving objective that maintains the global layout and preserves unrelated content.
Extensive experiments consistently demonstrate the effectiveness and efficiency of the proposed attack method, establishing a reliable benchmark for evaluating the robustness of concept erasure strategies in rectified flow transformers.
[105] FIN: Fast Inference Network for Map Segmentation
Ruan Bispo,Tim Brophy,Reenu Mohandas,Anthony Scanlan,Ciarán Eising
Main category: cs.CV
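TL;DR: This paper proposes a new, efficient camera-radar architecture for map segmentation in BEV space. By improving the loss functions and designing a lightweight head, it substantially improves real-time performance while keeping accuracy high, reaching 53.5 mIoU and raising inference speed by 260%.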
Details
Motivation: Multi-sensor fusion is increasingly important in autonomous driving, yet existing map segmentation methods still struggle with both accuracy and real-time performance, so a solution that balances accuracy and efficiency is needed.
Method: Fuses camera and radar data in BEV space and uses an advanced combination of loss functions together with a newly designed lightweight network head to optimize segmentation performance and inference speed.
Result: The method reaches 53.5 mIoU, comparable to large models, while improving inference speed by 260% over the strongest baseline models, giving a good balance between accuracy and efficiency.
Conclusion: The proposed architecture substantially improves real-time performance while preserving high accuracy and per-class balance, offering a feasible option for low-cost, efficient autonomous driving perception.
Abstract: Multi-sensor fusion in autonomous vehicles is becoming more common to offer a more robust alternative for several perception tasks. This need arises from the unique contribution of each sensor in collecting data: camera-radar fusion offers a cost-effective solution by combining rich semantic information from cameras with accurate distance measurements from radar, without incurring excessive financial costs or overwhelming data processing requirements. Map segmentation is a critical task for enabling effective vehicle behaviour in its environment, yet it continues to face significant challenges in achieving high accuracy and meeting real-time performance requirements. Therefore, this work presents a novel and efficient map segmentation architecture, using cameras and radars, in the bird's-eye view (BEV) space. Our model introduces a real-time map segmentation architecture considering aspects such as high accuracy, per-class balancing, and inference time. To accomplish this, we use an advanced loss set together with a new lightweight head to improve the perception results. Our results show that, with these modifications, our approach achieves results comparable to large models, reaching 53.5 mIoU, while also setting a new benchmark for inference time, improving it by 260% over the strongest baseline models.
[106] OTTER: Open-Tagging via Text-Image Representation for Multi-modal Understanding
Jieer Ouyang,Xiaoneng Xiang,Zheng Wang,Yangkai Ding
Main category: cs.CV
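TL;DR: OTTER is a unified open-set multi-label tagging framework that combines the stability of a predefined category set with the adaptability of user-driven open tags and performs strongly on multi-modal tagging tasks.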
Details
Motivation: Existing multi-label tagging methods are usually limited to closed-set categories and struggle with open, changing label needs, so a framework is needed that keeps labels consistent while flexibly extending to new ones.
Method: OTTER is built on a large-scale, hierarchically organized multi-modal dataset annotated by combining automated vision-language labeling with human refinement; it uses a multi-head attention architecture to jointly align visual and textual representations with fixed and open-set label embeddings.
Result: On the Otter and Favorite benchmark datasets, OTTER reaches overall F1 scores of 0.81 and 0.75, beating the next-best methods by 0.10 and 0.02 respectively; on open-set labels its F1 reaches 0.99 and 0.97, close to perfect, while it stays competitive on predefined labels.
Conclusion: OTTER effectively merges closed-set label consistency with open-vocabulary flexibility, providing a strong and practical solution for multi-modal tagging applications.
Abstract: We introduce OTTER, a unified open-set multi-label tagging framework that harmonizes the stability of a curated, predefined category set with the adaptability of user-driven open tags. OTTER is built upon a large-scale, hierarchically organized multi-modal dataset, collected from diverse online repositories and annotated through a hybrid pipeline combining automated vision-language labeling with human refinement. By leveraging a multi-head attention architecture, OTTER jointly aligns visual and textual representations with both fixed and open-set label embeddings, enabling dynamic and semantically consistent tagging. OTTER consistently outperforms competitive baselines on two benchmark datasets: it achieves an overall F1 score of 0.81 on Otter and 0.75 on Favorite, surpassing the next-best results by margins of 0.10 and 0.02, respectively. OTTER attains near-perfect performance on open-set labels, with F1 of 0.99 on Otter and 0.97 on Favorite, while maintaining competitive accuracy on predefined labels. These results demonstrate OTTER's effectiveness in bridging closed-set consistency with open-vocabulary flexibility for multi-modal tagging applications.
[107] Weakly Supervised Cloud Detection Combining Spectral Features and Multi-Scale Deep Network
Shaocong Zhu,Zhiwei Li,Xinghua Li,Huanfeng Shen
Main category: cs.CV
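TL;DR: Proposes SpecMCD, a weakly supervised cloud detection method that combines spectral features with a multi-scale scene-level deep network and substantially improves cloud detection accuracy in optical satellite images.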
Details
Motivation: Thin clouds lack distinctive features and training samples are of low quality, which limits the accuracy of existing deep learning methods for cloud detection.
Method: Uses a progressive training framework on a multi-scale scene-level dataset, combines multi-scale probability maps with a cloud thickness map to produce pixel-level cloud probability maps, and obtains binary cloud masks through adaptive thresholds and distance-weighted optimization.
Result: Validated on the WDCD and GF1MS-WHU datasets; compared with methods such as WDCD and WSFNet, the F1-score improves by more than 7.82%.
Conclusion: SpecMCD detects clouds well under different cloud coverage conditions and has strong application potential.
Abstract: Clouds significantly affect the quality of optical satellite images, which seriously limits their precise application. Recently, deep learning has been widely applied to cloud detection and has achieved satisfactory results. However, the lack of distinctive features in thin clouds and the low quality of training samples limit the cloud detection accuracy of deep learning methods, leaving space for further improvements. In this paper, we propose a weakly supervised cloud detection method that combines spectral features and multi-scale scene-level deep network (SpecMCD) to obtain highly accurate pixel-level cloud masks. The method first utilizes a progressive training framework with a multi-scale scene-level dataset to train the multi-scale scene-level cloud detection network. Pixel-level cloud probability maps are then obtained by combining the multi-scale probability maps and cloud thickness map based on the characteristics of clouds in dense cloud coverage and large cloud-area coverage images. Finally, adaptive thresholds are generated based on the differentiated regions of the scene-level cloud masks at different scales and combined with distance-weighted optimization to obtain binary cloud masks. Two datasets, WDCD and GF1MS-WHU, comprising a total of 60 Gaofen-1 multispectral (GF1-MS) images, were used to verify the effectiveness of the proposed method.
Compared to the other weakly supervised cloud detection methods such as WDCD and WSFNet, the F1-score of the proposed SpecMCD method shows an improvement of over 7.82%, highlighting the superiority and potential of the SpecMCD method for cloud detection under different cloud coverage conditions.
[108] Align Your Tangent: Training Better Consistency Models via Manifold-Aligned Tangents
Beomsu Kim,Byunghee Cha,Jong Chul Ye
Main category: cs.CV
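TL;DR: This paper proposes a new loss function, the manifold feature distance (MFD), to address oscillatory tangents in consistency model (CM) training, substantially accelerating training and improving generation quality while maintaining high performance even with extremely small batch sizes.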
Details
Motivation: Consistency models generate well but usually need prolonged training with large batch sizes to reach competitive sample quality; this paper targets that training inefficiency.
Method: Analyzes CM training dynamics near convergence and finds that output update directions (tangents) oscillate along the data manifold. It therefore proposes the MFD loss, which aligns tangents so that they point toward the data manifold and thereby speeds up training.
Result: The proposed method, AYT, accelerates CM training by orders of magnitude, even outperforms LPIPS, and supports training with extremely small batch sizes without loss of sample quality.
Conclusion: The MFD loss mitigates oscillation in CM training and greatly improves training efficiency and generation quality, offering a new approach to fast, high-quality generative models.
Abstract: With diffusion and flow matching models achieving state-of-the-art generating performance, the interest of the community has now turned to reducing the inference time without sacrificing sample quality. Consistency Models (CMs), which are trained to be consistent on diffusion or probability flow ordinary differential equation (PF-ODE) trajectories, enable one- or two-step flow or diffusion sampling. However, CMs typically require prolonged training with large batch sizes to obtain competitive sample quality. In this paper, we examine the training dynamics of CMs near convergence and discover that CM tangents -- CM output update directions -- are quite oscillatory, in the sense that they move parallel to the data manifold, not towards the manifold. To mitigate oscillatory tangents, we propose a new loss function, called the manifold feature distance (MFD), which provides manifold-aligned tangents that point toward the data manifold. Consequently, our method -- dubbed Align Your Tangent (AYT) -- can accelerate CM training by orders of magnitude and even outperform the learned perceptual image patch similarity metric (LPIPS). Furthermore, we find that our loss enables training with extremely small batch sizes without compromising sample quality. Code: https://github.com/1202kbs/AYT
[109] Unsupervised Unfolded rPCA (U2-rPCA): Deep Interpretable Clutter Filtering for Ultrasound Microvascular Imaging
Huaying Li,Liansheng Wang,Yinran Chen
Main category: cs.CV
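TL;DR: This paper proposes an unsupervised unfolded rPCA (U2-rPCA) method for clutter filtering in ultrasound microvascular imaging; it is mathematically interpretable and needs no labels for training.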
Details
Motivation: Existing deep learning-based clutter filters are limited by interpretability and by the lack of in-vitro and in-vivo ground-truth labels; this paper aims to solve the missing training ground truth and improve the separation of tissue and blood flow signals.
Method: Unfolds an iteratively reweighted least squares (IRLS) rPCA baseline and adds a sparse-enhancement unit that strengthens the capture of sparse micro-flow signals, yielding an adaptive unsupervised filtering network.
Result: Experiments on an in-silico dataset and public in-vivo datasets show that U2-rPCA outperforms SVD, the rPCA baseline, and another deep learning-based filter, improving the contrast-to-noise ratio (CNR) of the power Doppler image by 2 dB to 10 dB; ablation studies confirm the contribution of each module.
Conclusion: U2-rPCA achieves high-quality unsupervised clutter filtering while preserving mathematical interpretability, providing a better filter for ultrasound microvascular imaging.
Abstract: High-sensitivity clutter filtering is a fundamental step in ultrasound microvascular imaging. Singular value decomposition (SVD) and robust principal component analysis (rPCA) are the main clutter filtering strategies. However, both strategies are limited in feature modeling and tissue-blood flow separation for high-quality microvascular imaging. Recently, deep learning-based clutter filtering has shown potential in more thoroughly separating tissue and blood flow signals. However, the existing supervised filters face the challenges of interpretability and lack of in-vitro and in-vivo ground truths. While the interpretability issue can be addressed by algorithm deep unfolding, the training ground truth remains unsolved. To this end, this paper proposes an unsupervised unfolded rPCA (U2-rPCA) method that preserves mathematical interpretability and is insusceptible to learning labels. Specifically, U2-rPCA is unfolded from an iteratively reweighted least squares (IRLS) rPCA baseline with intrinsic low-rank and sparse regularization. A sparse-enhancement unit is added to the network to strengthen its capability to capture the sparse micro-flow signals. U2-rPCA is like an adaptive filter that is trained with part of the image sequence and then used for the following frames. Experimental validations on an in-silico dataset and public in-vivo datasets demonstrated that U2-rPCA outperforms the SVD-based method, the rPCA baseline, and another deep learning-based filter.
Particularly, the proposed method improved the contrast-to-noise ratio (CNR) of the power Doppler image by 2 dB to 10 dB when compared with other methods. Furthermore, the effectiveness of the building modules of U2-rPCA was validated through ablation studies.
[110] Multi-Domain Brain Vessel Segmentation Through Feature Disentanglement
Francesco Galati,Daniele Falcetta,Rosa Cortese,Ferran Prados,Ninon Burgos,Maria A. Zuluaga
Main category: cs.CV
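TL;DR: Proposes an image-to-image translation framework for segmenting brain arteries and veins across different datasets, without domain-specific model design or data harmonization.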
Details
Motivation: Brain vessel morphology is complex and models usually rely on a single imaging modality, which makes accurate cross-modality, cross-center segmentation difficult; a broadly applicable segmentation method is needed.
Method: Uses disentanglement to manipulate vessel appearance independently for domain adaptation while preserving spatial information such as shapes and locations, giving label-preserving image translation.
Result: Achieves effective segmentation across multiple medical centers, imaging modalities, and vessel types, bridging large domain gaps; ablation studies examine the effect of annotation count and architectural choices.
Conclusion: The framework is robust and applicable in many scenarios, showing the potential of domain adaptation for cerebrovascular image segmentation.
Abstract: The intricate morphology of brain vessels poses significant challenges for automatic segmentation models, which usually focus on a single imaging modality. However, accurately treating brain-related conditions requires a comprehensive understanding of the cerebrovascular tree, regardless of the specific acquisition procedure. Our framework effectively segments brain arteries and veins in various datasets through image-to-image translation while avoiding domain-specific model design and data harmonization between the source and the target domain. This is accomplished by employing disentanglement techniques to independently manipulate different image properties, allowing them to move from one domain to another in a label-preserving manner. Specifically, we focus on manipulating vessel appearances during adaptation while preserving spatial information, such as shapes and locations, which are crucial for correct segmentation. Our evaluation effectively bridges large and varied domain gaps across medical centers, image modalities, and vessel types. Additionally, we conduct ablation studies on the optimal number of required annotations and other architectural choices. The results highlight our framework's robustness and versatility, demonstrating the potential of domain adaptation methodologies to perform cerebrovascular image segmentation in multiple scenarios accurately. Our code is available at https://github.com/i-vesseg/MultiVesSeg.
[111] A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models
Leah Bar,Liron Mor Yosef,Shai Zucker,Neta Shoham,Inbar Seroussi,Nir Sochen
Main category: cs.CV
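TL;DR: This paper proposes a generative modeling framework that combines geometric and probabilistic perspectives, interprets diffusion models as a projection mechanism onto the manifold of "good images", and introduces a new deterministic model, MPPM, whose latent variant outperforms the Latent Diffusion Model (LDM) on image generation and restoration.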
Details
Motivation: Existing generative models mostly overlook the geometric structure of data, focus only on modeling the probability distribution, and make simplistic assumptions about the latent-space distribution, failing to exploit the geometry of the image manifold.
Method: Proposes a framework unifying geometry and probability, builds a manifold-based projection mechanism with kernel methods, develops the Manifold-Probabilistic Projection Model (MPPM), and implements the LMPPM model in latent space.
Result: LMPPM outperforms the Latent Diffusion Model (LDM) on multiple datasets, with better results in image generation and restoration.
Conclusion: By merging the geometric and probabilistic views, diffusion models can be understood as manifold projection; the MPPM framework gives generative models a new theoretical interpretation and better practical performance.
Abstract: The foundational premise of generative AI for images is the assumption that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low-dimensional nature of the data manifests itself through the introduction of a lower-dimensional latent space. Yet, the probability distribution in the latent or the manifold coordinate space is considered uninteresting and is predefined or considered uniform. This study unifies the geometric and probabilistic perspectives by providing a geometric framework and a kernel-based probabilistic method simultaneously. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of "good images". This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.
[112] Beyond one-hot encoding? Journey into compact encoding for large multi-class segmentation
Aaron Kujawa,Thomas Booth,Tom Vercauteren
Main category: cs.CV
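TL;DR: This paper proposes binary encoding methods for multi-class medical image segmentation to reduce compute and memory cost; on a 108-class brain segmentation task they fail to match one-hot encoding, which exposes the challenges of this direction and calls for future research.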
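The size argument behind binary label encoding can be illustrated with a minimal sketch. It assumes plain (vanilla) binary codewords and is not the authors' implementation; the function names `num_bits`, `encode`, and `decode` are hypothetical. One-hot encoding needs one output channel per class, whereas a binary code needs only about log2 of the class count.

```python
def num_bits(num_classes: int) -> int:
    """Number of binary output channels needed to give every class a unique codeword."""
    return max(1, (num_classes - 1).bit_length())


def encode(label: int, n_bits: int) -> list[int]:
    """Vanilla binary codeword for a class index, most significant bit first."""
    return [(label >> i) & 1 for i in reversed(range(n_bits))]


def decode(bits: list[int]) -> int:
    """Hard decoding: read the bits back as an integer class index."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


n_classes = 108              # class count quoted in the summary
n = num_bits(n_classes)      # 7 channels instead of 108 one-hot channels
codeword = encode(107, n)    # [1, 1, 0, 1, 0, 1, 1]
print(n, codeword, decode(codeword))
```

With 108 classes this gives 7 output channels per voxel instead of 108. Note that 7 bits allow 128 codewords, so 20 of them correspond to no class; how such codewords are handled at decoding time is a design choice this sketch leaves open.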
Details
Motivation: To reduce computation and memory use in medical image segmentation with many classes by exploring efficient label encodings as alternatives to one-hot encoding.
Method: Proposes and evaluates a family of binary-encoding approaches, including vanilla binary encoding, error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees, applied to whole-brain parcellation with 108 classes on 3D MRI.
Result: Compared with one-hot encoding (DSC = 82.4), all binary encodings perform worse, with DSCs from 39.3 to 73.8, falling short of state-of-the-art segmentation quality.
Conclusion: Binary encoding has potential for computational efficiency but faces significant challenges in preserving segmentation accuracy; these negative results are a useful reference for future work on compact encoding strategies.
Abstract: This work presents novel methods to reduce computational and memory requirements for medical image segmentation with a large number of classes. We curiously observe challenges in maintaining state-of-the-art segmentation performance with all of the explored options. Standard learning-based methods typically employ one-hot encoding of class labels. The computational complexity and memory requirements thus increase linearly with the number of classes. We propose a family of binary encoding approaches instead of one-hot encoding to reduce the computational complexity and memory requirements to logarithmic in the number of classes. In addition to vanilla binary encoding, we investigate the effects of error-correcting output codes (ECOCs), class weighting, hard/soft decoding, class-to-codeword assignment, and label embedding trees. We apply the methods to the use case of whole brain parcellation with 108 classes based on 3D MRI images. While binary encodings have proven efficient in so-called extreme classification problems in computer vision, we faced challenges in reaching state-of-the-art segmentation quality with binary encodings. Compared to one-hot encoding (Dice Similarity Coefficient (DSC) = 82.4 (2.8)), we report reduced segmentation performance with the binary segmentation approaches, achieving DSCs in the range from 39.3 to 73.8. Informative negative results all too often go unpublished. We hope that this work inspires future research of compact encoding strategies for large multi-class segmentation tasks.
[113] Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation
Jinchang Zhang,Zijun Li,Jiakai Lin,Guoyu Lu
Main category: cs.CV
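TL;DR: Proposes an event-image knowledge distillation framework that uses CLIP's semantic understanding for open-vocabulary object detection on event data, and designs a hybrid spiking neural network and convolutional network structure to adaptively extract key temporal features.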
Details
Motivation: Existing event-based object detection methods are limited to predefined categories and generalize poorly to novel objects, while applying vision-language models such as CLIP directly to event streams runs into a modality gap.
Method: Feeds image frames to a teacher model and uses spatial attention-based knowledge distillation to guide an event-based student model to learn CLIP's visual representations; designs a hybrid SNN-CNN framework in which the SNN adaptively determines when to segment the event stream and the CNN performs the subsequent detection.
Result: The method achieves effective open-vocabulary object detection on event data and outperforms fixed-segmentation approaches, improving recognition of novel categories while keeping latency low.
Conclusion: The framework bridges the modality gap between event data and vision-language models, offering an effective solution for event-based open-vocabulary detection.
Abstract: Event cameras offer advantages in object detection tasks due to high-speed response, low latency, and robustness to motion blur. However, event cameras lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined categories, limiting their ability to generalize to novel objects, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework that leverages CLIP's semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as inputs to a teacher model, guiding the event-based student model to learn CLIP's rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs while inheriting CLIP's broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid spiking neural network (SNN) and convolutional neural network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted.
The extracted event features are then processed by CNNs for object detection.
[114] ProtoMask: Segmentation-Guided Prototype Learning
Steffen Meinert,Philipp Schlinge,Nils Strodthoff,Martin Atzmueller
Main category: cs.CV
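TL;DR: This paper proposes a new model architecture, ProtoMask, that uses image segmentation foundation models to make the mapping between embedding space and input space in prototype learning more truthful, and restricts the computation area of saliency maps to make explanations more reliable.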
Details
Motivation: Existing prototype-based explainability methods rely on post-hoc saliency techniques whose reliability and quality have been questioned, so more truthful and trustworthy explanations are needed.
Method: Uses strong image segmentation foundation models to generate semantic image patches, crops the image with the bounding box of each segmentation mask to form an individual input to the ProtoMask model, and restricts the saliency-map computation area to make explanations more truthful.
Result: Extensive experiments on three fine-grained classification datasets show strong explainability together with competitive classification performance.
Conclusion: By incorporating segmentation foundation models, ProtoMask makes prototype-based explanations more truthful and visualizations more reliable, and it offers distinctive explainability features while keeping good classification performance.
Abstract: XAI has gained considerable importance in recent years. Methods based on prototypical case-based reasoning have shown a promising improvement in explainability. However, these methods typically rely on additional post-hoc saliency techniques to explain the semantics of learned prototypes. Multiple critiques have been raised about the reliability and quality of such techniques. For this reason, we study the use of prominent image segmentation foundation models to improve the truthfulness of the mapping between embedding and input space. We aim to restrict the computation area of the saliency map to a predefined semantic image patch to reduce the uncertainty of such visualizations. To perceive the information of an entire image, we use the bounding box from each generated segmentation mask to crop the image. Each mask results in an individual input in our novel model architecture named ProtoMask. We conduct experiments on three popular fine-grained classification datasets with a wide set of metrics, providing a detailed overview on explainability characteristics. The comparison with other popular models demonstrates competitive performance and unique explainability features of our model. https://github.com/uos-sis/quanproto
[115] Graph Integrated Multimodal Concept Bottleneck Model
Jiakai Lin,Jinchang Zhang,Guoyu Lu
Main category: cs.CV
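TL;DR: This paper proposes MoE-SGT, a multimodal concept bottleneck model that combines a Graph Transformer with a Mixture of Experts module; by modeling structured relationships among concepts and dynamically allocating reasoning tasks, it achieves higher accuracy on multiple datasets.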
Details
Motivation: Existing Concept Bottleneck Models (CBMs) are usually single-modal and ignore structured relationships among concepts, which limits their interpretability and performance in complex scenarios; a new framework is needed that handles multimodal inputs and explicitly models structural relations among concepts.
Method: Proposes the MoE-SGT framework, which builds answer-concept and answer-question graphs to explicitly model the structural relationships among concepts in multimodal inputs; introduces a Graph Transformer to capture multi-level dependencies and replaces the feed-forward layers with a Mixture of Experts (MoE) module that dynamically allocates tasks to different sub-experts, improving adaptability to complex concept reasoning.
Result: MoE-SGT outperforms other concept bottleneck networks on multiple datasets with higher accuracy, supporting its effectiveness at modeling concept structure and dynamic reasoning.
Conclusion: By introducing structured graph representations and a dynamic expert mechanism, MoE-SGT improves the reasoning ability and adaptability of concept bottleneck models, providing a stronger multimodal framework for interpretable deep learning.
Abstract: With growing demand for interpretability in deep learning, especially in high stakes domains, Concept Bottleneck Models (CBMs) address this by inserting human understandable concepts into the prediction pipeline, but they are generally single modal and ignore structured concept relationships. To overcome these limitations, we present MoE-SGT, a reasoning driven framework that augments CBMs with a structure injecting Graph Transformer and a Mixture of Experts (MoE) module. We construct answer-concept and answer-question graphs for multimodal inputs to explicitly model the structured relationships among concepts. Subsequently, we integrate Graph Transformer to capture multi level dependencies, addressing the limitations of traditional Concept Bottleneck Models in modeling concept interactions. However, it still encounters bottlenecks in adapting to complex concept patterns. Therefore, we replace the feed forward layers with a Mixture of Experts (MoE) module, enabling the model to have greater capacity in learning diverse concept relationships while dynamically allocating reasoning tasks to different sub experts, thereby significantly enhancing the model's adaptability to complex concept reasoning. MoE-SGT achieves higher accuracy than other concept bottleneck networks on multiple datasets by modeling structured relationships among concepts and utilizing a dynamic expert selection mechanism.
[116] Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs
Sanghwan Kim,Rui Xiao,Stephan Alaniz,Yongqin Xian,Zeynep Akata
Main category: cs.CV
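TL;DR: Proposes a unified, training-free framework that uses the intrinsic output uncertainty of multimodal large language models as a guidance signal to improve their performance on fine-grained visual tasks.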
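The idea of scoring candidate visual inputs by response uncertainty can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each candidate (for example an image crop or a video segment) has already been run through a model and reduced to a probability distribution over answer options, and the names `entropy`, `pick_most_relevant`, and the candidate labels are hypothetical.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_most_relevant(candidates: dict[str, list[float]]) -> str:
    """Return the candidate whose answer distribution has the lowest entropy."""
    return min(candidates, key=lambda name: entropy(candidates[name]))


# Hypothetical answer distributions obtained with three different crops as input.
candidates = {
    "crop_a": [0.25, 0.25, 0.25, 0.25],  # model is maximally unsure
    "crop_b": [0.90, 0.05, 0.03, 0.02],  # model is confident
    "crop_c": [0.60, 0.30, 0.05, 0.05],
}
best = pick_most_relevant(candidates)
print(best)
```

Here the lowest-entropy candidate is selected, following the stated intuition that output entropy drops when the model is shown relevant visual information. How the distributions are obtained from an actual MLLM is outside the scope of this sketch.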
Details
Motivation: Existing methods rely on complicated, task-specific fine-tuning, which limits the generalization and efficiency of MLLMs on fine-grained perception tasks such as identifying small objects in high-resolution images or locating key frames in long videos.
Method: Scores candidate visual inputs by the uncertainty of the model's response (for example, output entropy), letting the model autonomously focus on the most relevant visual information and achieving fine-grained perception without training.
Result: On three complex tasks (visual search, long video understanding, and temporal grounding), the method lets off-the-shelf MLLMs reach performance competitive with specialized, fine-tuned methods.
Conclusion: Exploiting a model's intrinsic uncertainty is a powerful, general strategy that improves MLLM performance on fine-grained tasks without additional training.
Abstract: Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or finding key moments in long videos. Existing works typically rely on complicated, task-specific fine-tuning, which limits their generalizability and increases model complexity. In this work, we propose an effective, training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal. Our core insight is that a model's output entropy decreases when presented with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. We apply this simple principle to three complex visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned methods. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
[117] Deep learning motion correction of quantitative stress perfusion cardiovascular magnetic resonance
Noortje I. P. Schueler,Nathan C. K. Wong,Richard J. Crawley,Josien P. W. Pluim,Amedeo Chiribiri,Cian M. Scannell
Main category: cs.CV
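TL;DR: Proposes an unsupervised deep learning motion correction method for fast, robust processing of stress perfusion cardiovascular magnetic resonance data, substantially improving processing speed and image quality.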
Details
Motivation: Conventional registration-based motion correction for quantitative perfusion CMR is slow and sensitive to acquisition variability, which limits robustness and scalability.
Method: Develops an unsupervised deep learning pipeline that replaces iterative registration with one-shot estimation, corrects motion in three steps, uses robust principal component analysis to reduce contrast-related effects, and registers both the perfusion series and auxiliary images. Models were trained and validated on multivendor data from 201 patients.
Result: The deep learning method significantly improved temporal smoothness of time-intensity curves (p<0.001); myocardial alignment was better than before registration (Dice 0.92 vs 0.80, p<0.001); perfusion maps showed fewer motion artifacts and lower standard deviation in the myocardium; processing time was reduced 15-fold.
Conclusion: The deep learning pipeline enables fast, robust motion correction for stress perfusion CMR, generalizes across scanners and sequences, and may help bring quantitative perfusion imaging into wider clinical use.
Abstract: Background: Quantitative stress perfusion cardiovascular magnetic resonance (CMR) is a powerful tool for assessing myocardial ischemia. Motion correction is essential for accurate pixel-wise mapping but traditional registration-based methods are slow and sensitive to acquisition variability, limiting robustness and scalability. Methods: We developed an unsupervised deep learning-based motion correction pipeline that replaces iterative registration with efficient one-shot estimation. The method corrects motion in three steps and uses robust principal component analysis to reduce contrast-related effects. It aligns the perfusion series and auxiliary images (arterial input function and proton density-weighted series). Models were trained and validated on multivendor data from 201 patients, with 38 held out for testing. Performance was assessed via temporal alignment and quantitative perfusion values, compared to a previously published registration-based method. Results: The deep learning approach significantly improved temporal smoothness of time-intensity curves (p<0.001). Myocardial alignment (Dice = 0.92 (0.04) and 0.91 (0.05)) was comparable to the baseline and superior to before registration (Dice = 0.80 (0.09), p<0.001). Perfusion maps showed reduced motion, with lower standard deviation in the myocardium (0.52 (0.39) ml/min/g) compared to baseline (0.55 (0.44) ml/min/g). Processing time was reduced 15-fold. Conclusion: This deep learning pipeline enables fast, robust motion correction for stress perfusion CMR, improving accuracy across dynamic and auxiliary images.
Trained on multivendor data, it generalizes across sequences and may facilitate broader clinical adoption of quantitative perfusion imaging.
[118] DEAP DIVE: Dataset Investigation with Vision transformers for EEG evaluation
Annemarie Hoffsommer,Helen Schneider,Svetlana Pavlitska,J. Marius Zöllner
Main category: cs.CV
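TL;DR: Using a subset of EEG channels from the DEAP dataset together with the continuous wavelet transform and a Vision Transformer, this study reaches over 91.57% accuracy on four-quadrant emotion classification with only 12 channels, close to the state of the art obtained with 32 channels, showing that effective emotion prediction is possible with few channels.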
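The four-quadrant target (high/low arousal crossed with high/low valence) can be illustrated with a small labeling sketch. It is not taken from the paper: it assumes self-assessment ratings on a 1 to 9 scale and a midpoint threshold of 5, and the function name `quadrant` and the label strings are hypothetical.

```python
def quadrant(valence: float, arousal: float, threshold: float = 5.0) -> str:
    """Map a (valence, arousal) rating pair to one of four quadrant labels.

    The threshold of 5.0 is an assumed midpoint of a 1-9 rating scale.
    """
    arousal_part = "HA" if arousal > threshold else "LA"   # high / low arousal
    valence_part = "HV" if valence > threshold else "LV"   # high / low valence
    return arousal_part + valence_part


# Hypothetical (valence, arousal) ratings for four trials.
ratings = [(7.2, 8.1), (2.5, 6.9), (3.0, 2.0), (8.0, 1.5)]
labels = [quadrant(v, a) for v, a in ratings]
print(labels)
```

Ratings exactly at the threshold fall into the "low" class here; a different tie-breaking rule or threshold would change the class balance, so the choice should be checked against the labeling actually used in the study.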
Details
Motivation: Traditional emotion recognition methods such as self-assessment and facial expression analysis are subjective and ambiguous, while full-channel EEG measurement is complex and resource-intensive, so a more direct, low-cost, and efficient approach to emotion prediction is needed.
Method: Converts EEG signals into scaleograms with the continuous wavelet transform, classifies emotions with a Vision Transformer, and evaluates different channel subsets on the DEAP dataset.
Result: With only 12 EEG channels, the model reaches over 91.57% accuracy on four-quadrant (high/low arousal and valence) emotion classification, close to the state-of-the-art 96.9% obtained with 32 channels.
Conclusion: Substantially reducing the number of EEG input channels still preserves high emotion prediction accuracy, supporting the feasibility of low-cost, portable emotion recognition devices.
Abstract: Accurately predicting emotions from brain signals has the potential to achieve goals such as improving mental health, human-computer interaction, and affective computing. Emotion prediction through neural signals offers a promising alternative to traditional methods, such as self-assessment and facial expression analysis, which can be subjective or ambiguous. Measuring brain activity via electroencephalogram (EEG) provides a more direct and unbiased data source. However, conducting a full EEG is a complex, resource-intensive process, leading to the rise of low-cost EEG devices with simplified measurement capabilities. This work examines how subsets of EEG channels from the DEAP dataset can be used for sufficiently accurate emotion prediction with low-cost EEG devices, rather than fully equipped EEG measurements. Using Continuous Wavelet Transformation to convert EEG data into scaleograms, we trained a vision transformer (ViT) model for emotion classification. The model achieved over 91.57% accuracy in predicting 4 quadrants (high/low per arousal and valence) with only 12 measuring points (also referred to as channels). Our work clearly shows that a significant reduction of input channels still yields high accuracy compared to state-of-the-art results of 96.9% with 32 channels. Training scripts to reproduce our results can be found here: https://gitlab.kit.edu/kit/aifb/ATKS/public/AutoSMiLeS/DEAP-DIVE.
[119] What You See is What You Ask: Evaluating Audio Descriptions
Divy Kala,Eshika Khandelwal,Makarand Tapaswi
Main category: cs.CV
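TL;DR: This paper proposes ADQA, a question-answering benchmark for evaluating the quality of audio descriptions (ADs) over long video segments; it highlights the shortcomings of current automatic AD generation in narrative understanding and visual appreciation, and advocates evaluation closer to real-world use.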
Details
Motivation: Existing work on automatic AD generation mostly targets short clips and evaluates against a single reference, ignoring the inherent subjectivity of AD writing and failing to reflect real use.
Method: Analyzes two independent AD tracks for the same movies to quantify differences in when, what, and how to describe; proposes the ADQA benchmark with visual appreciation (VA) and narrative understanding (NU) questions, evaluated on coherent video segments a few minutes long.
Result: Experiments show that current automatic AD generation methods lag far behind human-authored ADs; ADQA assesses more fully how well ADs help blind and low vision users follow the plot and appreciate visual details.
Conclusion: Evaluation should move away from short clips and single references toward frameworks such as ADQA that target long videos and account for subjectivity, and the field should promote public comparison and progress.
Abstract: Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
[120] Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck
Hongeun Kim,Bryan Sangwoo Kim,Jong Chul Ye
Main category: cs.CV
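TL;DR: Proposes a new framework that maps extremely low-quality images onto an intermediate, less-degraded manifold and then restores high-quality images with a frozen, off-the-shelf blind image restoration model, addressing the difficulty of extreme blind image restoration.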
Details
Motivation: Existing blind image restoration methods perform poorly under extreme, severely compounded degradations, and directly learning a mapping from extremely low-quality to high-quality images is difficult because of the large domain gap.
Method: A projector maps extremely low-quality images onto an intermediate, less-degraded manifold; a new loss derived from Information Bottleneck theory is used to train the projector, and a fixed off-the-shelf blind image restoration model performs the final restoration.
Result: Extensive experiments under severe degradation support the effectiveness of the framework, which trains stably, reduces artifacts, and improves detail recovery.
Conclusion: The method offers an effective solution for extreme blind image restoration, supports plug-and-play strengthening of existing models without fine-tuning, and can be applied to inference-time prompt refinement.
Abstract: Blind Image Restoration (BIR) methods have achieved remarkable success but falter when faced with Extreme Blind Image Restoration (EBIR), where inputs suffer from severe, compounded degradations beyond their training scope. Directly learning a mapping from extremely low-quality (ELQ) to high-quality (HQ) images is challenging due to the massive domain gap, often leading to unnatural artifacts and loss of detail. To address this, we propose a novel framework that decomposes the intractable ELQ-to-HQ restoration process. We first learn a projector that maps an ELQ image onto an intermediate, less-degraded LQ manifold. This intermediate image is then restored to HQ using a frozen, off-the-shelf BIR model. Our approach is grounded in information theory; we provide a novel perspective of image restoration as an Information Bottleneck problem and derive a theoretically-driven objective to train our projector. This loss function effectively stabilizes training by balancing a low-quality reconstruction term with a high-quality prior-matching term. Our framework enables Look Forward Once (LFO) for inference-time prompt refinement, and supports plug-and-play strengthening of existing image restoration models without need for finetuning. Extensive experiments under severe degradation regimes provide a thorough analysis of the effectiveness of our work.
[121] Can World Models Benefit VLMs for World Dynamics?
Kevin Zhang,Kuangzhi Ge,Xiaowei Chi,Renrui Zhang,Shaojun Shi,Zhen Dong,Sirui Han,Shanghang Zhang
Main category: cs.CV
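TL;DR: This paper proposes vision-language models that use a video diffusion model as a generative encoder (WorldLMs) and introduces the Dynamic Vision Aligner (DyVA), which achieves state-of-the-art or comparable performance on a range of visual reasoning tasks and shows the potential of world-model priors for multimodal understanding.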
Details
Motivation: To explore whether strong video foundation models can replace conventional vision encoder paradigms for general-purpose multimodal understanding, and to systematically study how world-model priors transfer to vision-language tasks.
Method: A pre-trained video diffusion model is repurposed as a generative encoder that produces visual embeddings through a single denoising step; these are integrated into a vision-language model to form the WorldLM framework, whose best-performing variant is named DyVA.
Result: DyVA outperforms open-source and proprietary baselines on several visual reasoning tasks, showing stronger spatial reasoning and enabling single-image models to perform multi-frame reasoning, which the authors attribute to motion-consistency priors inherited from video pre-training.
Conclusion: World-model priors can improve vision-language models, especially for spatial and dynamic reasoning, and point to a new direction for more general vision learners.
Abstract: Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we strive to investigate the capabilities when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder to perform a single denoising step and treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents useful for downstream understanding that show distinctions from conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to WorldLM's inherited motion-consistency internalization from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.
[122] Defect Segmentation in OCT scans of ceramic parts for non-destructive inspection using deep learning
Andrés Laveda-Martínez,Natalia P. García-de-la-Puente,Fernando García-Torres,Niels Møller Israelsen,Ole Bang,Dominik Brouczek,Niels Benson,Adrián Colomer,Valery Naranjo
Main category: cs.CV
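TL;DR: This paper presents a deep-learning-based automatic defect detection system that trains a U-Net architecture on optical coherence tomography (OCT) images to detect internal defects in ceramics with high accuracy.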
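The abstract for this entry reports segmentation quality as a Dice Score. As a rough illustration of that metric, and not the authors' evaluation code, the sketch below computes Dice on two flattened binary masks. The function name `dice_score`, the empty-mask convention, and the toy masks are assumptions made for the example.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x2 masks, flattened row by row (illustrative values only).
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice_score(predicted, reference)  # 2 * 1 / (2 + 1)
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.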
Details
Motivation: High-quality non-destructive testing of ceramic components, without compromising their integrity, calls for an efficient and reliable automated defect identification method.
Method: A U-Net-based neural network is trained on manually annotated OCT images; multiple experimental configurations are evaluated to improve performance, and post-processing techniques support qualitative and quantitative analysis.
Result: The system reaches a Dice Score of 0.979, outperforming comparable studies, with an inference time of 18.98 seconds per volume, and can detect defects such as inclusions.
Conclusion: The method improves the efficiency and reliability of defect detection in ceramic manufacturing and has potential for practical automated quality control.
Abstract: Non-destructive testing (NDT) is essential in ceramic manufacturing to ensure the quality of components without compromising their integrity. In this context, Optical Coherence Tomography (OCT) enables high-resolution internal imaging, revealing defects such as pores, delaminations, or inclusions. This paper presents an automatic defect detection system based on Deep Learning (DL), trained on OCT images with manually segmented annotations. A neural network based on the U-Net architecture is developed, evaluating multiple experimental configurations to enhance its performance. Post-processing techniques enable both quantitative and qualitative evaluation of the predictions. The system achieves a Dice Score of 0.979, outperforming comparable studies. The inference time of 18.98 seconds per volume supports its viability for detecting inclusions, enabling more efficient, reliable, and automated quality control.
[123] Authentic Discrete Diffusion Model
Xiao Li,Jiaqi Zhang,Shuxiang Zhang,Tianshui Chen,Liang Lin,Guangrun Wang
Main category: cs.CV
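TL;DR: Proposes the Authentic Discrete Diffusion (ADD) framework, which preserves core diffusion characteristics directly in the one-hot space without relying on continuous latent spaces or masking policies, and performs well on classification and image captioning tasks.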
Details
Motivation: Existing pseudo-discrete diffusion methods usually rely on continuous latent spaces or masking mechanisms and do not truly preserve the essential properties of discrete diffusion, which limits their potential in generative and discriminative tasks.
Method: ADD reformulates the diffusion input by directly using float-encoded one-hot class data and introduces a timestep-conditioned cross-entropy loss, linking discriminative and generative learning.
Result: Experiments show that ADD outperforms the baseline on classification and generates image captions well; ablation studies validate each component.
Conclusion: ADD offers a more faithful and efficient approach to discrete diffusion modeling that bridges generative and discriminative tasks, with good extensibility and application potential.
Abstract: We propose an Authentic Discrete Diffusion (ADD) framework that fundamentally redefines prior pseudo-discrete approaches by preserving core diffusion characteristics directly in the one-hot space through a suite of coordinated mechanisms. Unlike conventional "pseudo" discrete diffusion (PDD) methods, ADD reformulates the diffusion input by directly using float-encoded one-hot class data, without relying on diffusing in the continuous latent spaces or masking policies. At its core, a timestep-conditioned cross-entropy loss is introduced between the diffusion model's outputs and the original one-hot labels. This synergistic design establishes a bridge between discriminative and generative learning. Our experiments demonstrate that ADD not only achieves superior performance on classification tasks compared to the baseline, but also exhibits excellent text generation capabilities on image captioning. Extensive ablations validate the measurable gains of each component.
[124] Multi-Objective Task-Aware Predictor for Image-Text Alignment
Eunki Kim,Na Min An,James Thorne,Hyunjung Shim
Main category: cs.CV
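TL;DR: This paper proposes MULTI-TAP, a plug-and-play multi-objective evaluation model for image-text alignment scoring that combines efficiency, long-sequence processing, and agreement with human judgments; it outperforms existing methods on several benchmarks, and the authors release EYE4ALL, a new dataset with fine-grained annotations and human preferences.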
Details
Motivation: Existing image-text alignment metrics fall short in alignment with human preferences, long-sequence processing, inference efficiency, or multi-objective scoring, and comprehensive benchmarks are lacking, which holds back vision-language systems.
Method: MULTI-TAP trains a lightweight ridge regression layer on the frozen hidden states of a large vision-language model (LVLM) for multi-objective scoring, and uses a reward head to produce an overall score; it is plug-and-play and works with multiple LVLM architectures.
Result: MULTI-TAP significantly outperforms existing metrics, is on par with the GPT-4o-based G-VEval at a smaller size (7-8B), and outperforms VisionREWARD; it is more efficient and accurate on the newly released EYE4ALL multi-objective benchmark.
Conclusion: MULTI-TAP is an efficient, general multi-objective evaluation model aligned with human preferences; together with EYE4ALL it can support more accessible AI systems that better match user needs, including those of blind and low-vision users.
Abstract: Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and existing evaluation predictors lacking at least one of these key properties: (1) Alignment with human judgments, (2) Long-sequence processing, (3) Inference efficiency, and (4) Applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP performs better than VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
[125] Code2Video: A Code-centric Paradigm for Educational Video Generation
Yanzhe Chen,Kevin Qinghong Lin,Mike Zheng Shou
Main category: cs.CV
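TL;DR: This paper proposes Code2Video, a code-centric agent framework that generates educational videos through executable Python code using three collaborating agents (Planner, Coder, and Critic); it also builds the MMMC benchmark for evaluation and performs well on aesthetic scores, code efficiency, and knowledge transfer (TeachQuiz).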
Details
Motivation: Existing generative models synthesize video in pixel space and struggle to meet the demands of professional educational videos for disciplinary knowledge, precise visual structure, and coherent transitions, which limits their use in education.
Method: The Code2Video framework has three cooperating agents: Planner (plans content and assets), Coder (generates and automatically repairs Python code), and Critic (uses a vision-language model to refine layout and clarity); videos are produced by controlling a renderable environment through code.
Result: Evaluation on MMMC shows a 40% improvement over direct code generation, with generated videos approaching human-made tutorials in quality; metrics such as TeachQuiz confirm the videos' ability to convey knowledge.
Conclusion: Code2Video is a scalable, interpretable, and controllable approach to educational video generation that combines code control with multi-agent collaboration to improve generation quality and educational usefulness.
Abstract: While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions, limiting their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLM) with visual anchor prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, discipline-specific educational videos. We evaluate MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and particularly, TeachQuiz, a novel end-to-end metric that quantifies how well a VLM, after unlearning, can recover knowledge by watching the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach, achieving 40% improvement over direct code generation and producing videos comparable to human-crafted tutorials. The code and datasets are available at https://github.com/showlab/Code2Video.
[126] ZQBA: Zero Query Black-box Adversarial Attack
Joana C. Costa,Tiago Roxo,Hugo Proença,Pedro R. M. Inácio
Main category: cs.CV
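TL;DR: Proposes a Zero Query Black-box Adversarial (ZQBA) attack that uses feature maps from deep neural networks to generate adversarial samples without repeated queries or training diffusion models; it transfers well across models and datasets, and its perturbations are hard to perceive.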
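The core operation described for this entry is adding a feature map to a clean image. The sketch below illustrates that idea in plain Python under several assumptions that are not taken from the paper: the feature map has already been resized to the image resolution, it is min-max rescaled into a budget of plus or minus `epsilon`, and pixel values live in [0, 1]. The name `zero_query_perturb` is hypothetical.

```python
def zero_query_perturb(image, feature_map, epsilon=8 / 255):
    """Add a rescaled feature map to a clean image and clip to [0, 1].

    image, feature_map: equally sized 2D lists of floats.
    The rescaling scheme and epsilon budget are illustrative assumptions.
    """
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant maps
    adversarial = []
    for img_row, fmap_row in zip(image, feature_map):
        row = []
        for pixel, f in zip(img_row, fmap_row):
            # Map the feature value into [-epsilon, +epsilon].
            delta = epsilon * (2.0 * (f - lo) / span - 1.0)
            # Keep the result a valid pixel intensity.
            row.append(min(1.0, max(0.0, pixel + delta)))
        adversarial.append(row)
    return adversarial


clean = [[0.5, 0.5], [0.0, 1.0]]
features = [[0.0, 10.0], [0.0, 10.0]]
perturbed = zero_query_perturb(clean, features, epsilon=0.1)
```

No call to a target model appears anywhere in the function, which is the sense in which the attack is zero-query; a small `epsilon` keeps the perturbation visually subtle.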
Details
Motivation: Existing black-box adversarial attacks depend on many queries or on training surrogate or diffusion models, which limits their use in real-world settings, so a more efficient and practical zero-query attack is needed.
Method: Feature maps from a pre-trained DNN are added directly to clean images to produce adversarial samples, with no queries to the target model and no additional training.
Result: On CIFAR and Tiny ImageNet, ZQBA is more effective than state-of-the-art single-query attacks while keeping perturbations imperceptible (assessed quantitatively with SSIM and qualitatively by visual quality), and it transfers well.
Conclusion: ZQBA is an efficient, practical black-box attack that exposes potential security vulnerabilities of deep neural networks in real applications and motivates further research on model robustness.
Abstract: Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model performance. However, these methods require training a surrogate loss or diffusion models to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer the adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks with a single query, while maintaining the imperceptibility of perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at https://github.com/Joana-Cabral/ZQBA.
[127] Uncertainty-Aware Concept Bottleneck Models with Enhanced Interpretability
Haifei Zhang,Patrick Barry,Eduardo Brandao
Main category: cs.CV
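TL;DR: Proposes an uncertainty-aware, interpretable classifier based on binary class-level concept prototypes to improve the interpretability and robustness of Concept Bottleneck Models.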
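To make the prototype idea concrete, here is a minimal sketch of classifying by distance to binary class-level concept prototypes. It is an illustration under assumptions, not the authors' method: the distance is taken to be the mean absolute difference, the prototypes are hand-written rather than learned, and the set-valued output uses a hand-picked threshold where a conformal procedure would calibrate one on held-out data. All names are hypothetical.

```python
def prototype_distances(concept_probs, prototypes):
    """Mean absolute difference between predicted concept probabilities
    and each binary class prototype (assumed distance, for illustration)."""
    n = len(concept_probs)
    return {
        label: sum(abs(c - b) for c, b in zip(concept_probs, proto)) / n
        for label, proto in prototypes.items()
    }


def classify(concept_probs, prototypes, threshold=0.3):
    """Return (nearest class, its distance, classes within the threshold)."""
    distances = prototype_distances(concept_probs, prototypes)
    predicted = min(distances, key=distances.get)
    # Set-valued output: an empty set flags an uncertain or outlier input.
    plausible = sorted(l for l, d in distances.items() if d <= threshold)
    return predicted, distances[predicted], plausible


# Concepts (hypothetical): [has_wings, has_beak, has_fins]
prototypes = {"bird": [1, 1, 0], "fish": [0, 0, 1]}
label, distance, plausible = classify([0.9, 0.8, 0.1], prototypes)
```

Each prototype doubles as a readable rule (a bird should have wings and a beak but no fins), and the distance to the nearest prototype gives a simple indication of how far an input deviates from that rule.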
Details
Motivation: Concept Bottleneck Models are interpretable for image classification but have lower predictive performance, and the propagation of uncertainty from concept predictions remains underexplored.
Method: A set of binary class-level concept prototypes is learned; the distance between the predicted concept vector and each class prototype serves as both classification score and uncertainty measure, and deviation from the prototypes supports conformal prediction.
Result: The method improves classification performance while retaining high interpretability, and can identify uncertain or outlier inputs.
Conclusion: By introducing binary class-level concept prototypes, the framework strengthens uncertainty modeling, interpretability, and robustness in Concept Bottleneck Models.
Abstract: In the context of image classification, Concept Bottleneck Models (CBMs) first embed images into a set of human-understandable concepts, followed by an intrinsically interpretable classifier that predicts labels based on these intermediate representations. While CBMs offer a semantically meaningful and interpretable classification pipeline, they often sacrifice predictive performance compared to end-to-end convolutional neural networks. Moreover, the propagation of uncertainty from concept predictions to final label decisions remains underexplored. In this paper, we propose a novel uncertainty-aware and interpretable classifier for the second stage of CBMs. Our method learns a set of binary class-level concept prototypes and uses the distances between predicted concept vectors and each class prototype as both a classification score and a measure of uncertainty. These prototypes also serve as interpretable classification rules, indicating which concepts should be present in an image to justify a specific class prediction. The proposed framework enhances both interpretability and robustness by enabling conformal prediction for uncertain or outlier inputs based on their deviation from the learned binary class-level concept prototypes.
[128] MetaLogic: Robustness Evaluation of Text-to-Image Models via Logically Equivalent Prompts
Yifan Shen,Yangyang Shu,Hye-young Paik,Yulei Sui
Main category: cs.CV
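TL;DR: This paper proposes MetaLogic, a new evaluation framework that detects semantic-consistency misalignment in text-to-image models without ground-truth images, using metamorphic testing to expose robustness failures in logic understanding.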
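The metamorphic-testing loop behind a misalignment rate can be sketched in a few lines. This is a simplified illustration, not the MetaLogic implementation: `generate` stands in for a text-to-image model and `same_semantics` for whatever image-pair comparison is used, and both are hypothetical callables supplied by the caller.

```python
def misalignment_rate(prompt_pairs, generate, same_semantics):
    """Fraction of logically equivalent prompt pairs whose outputs disagree.

    prompt_pairs: list of (prompt_a, prompt_b) assumed to be equivalent.
    generate: callable mapping a prompt to an output (hypothetical model).
    same_semantics: callable comparing two outputs (hypothetical checker).
    Returns (rate, list of failing pairs kept as counterexamples).
    """
    failures = []
    for prompt_a, prompt_b in prompt_pairs:
        if not same_semantics(generate(prompt_a), generate(prompt_b)):
            failures.append((prompt_a, prompt_b))
    rate = len(failures) / len(prompt_pairs) if prompt_pairs else 0.0
    return rate, failures


# Toy stand-ins: the "model" returns the set of words it saw, so it handles
# reordering but not a left/right rewrite of the same spatial relation.
toy_generate = lambda prompt: frozenset(prompt.split())
toy_compare = lambda a, b: a == b
pairs = [
    ("a cat and a dog", "a dog and a cat"),
    ("a cat left of a dog", "a dog right of a cat"),
]
rate, counterexamples = misalignment_rate(pairs, toy_generate, toy_compare)
```

No ground-truth image is needed: the only oracle is that two equivalent prompts should yield semantically matching outputs, and each failing pair is retained as a counterexample for debugging.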
Details
Motivation: Existing text-to-image models struggle to stay semantically consistent under minor linguistic changes to the prompt, and there is no robust way to evaluate their logical reasoning and generalization.
Method: MetaLogic uses metamorphic testing to generate prompt pairs that differ grammatically but are semantically identical, directly compares the resulting image pairs to find semantic misalignment, categorizes the misalignment types, and provides interpretable counterexamples.
Result: MetaLogic is validated on several state-of-the-art T2I models; Flux.dev and DALL-E 3 show misalignment rates of 59% and 71%, respectively, revealing systematic robustness failures across a range of logical constructs.
Conclusion: MetaLogic is a scalable, ground-truth-free evaluation method that exposes fine-grained logical inconsistencies in text-to-image models and supports model debugging and refinement.
Abstract: Recent advances in text-to-image (T2I) models, especially diffusion-based architectures, have significantly improved the visual quality of generated images. However, these models continue to struggle with a critical limitation: maintaining semantic consistency when input prompts undergo minor linguistic variations. Despite being logically equivalent, such prompt pairs often yield misaligned or semantically inconsistent images, exposing a lack of robustness in reasoning and generalisation. To address this, we propose MetaLogic, a novel evaluation framework that detects T2I misalignment without relying on ground truth images. MetaLogic leverages metamorphic testing, generating image pairs from prompts that differ grammatically but are semantically identical. By directly comparing these image pairs, the framework identifies inconsistencies that signal failures in preserving the intended meaning, effectively diagnosing robustness issues in the model's logic understanding. Unlike existing evaluation methods that compare a generated image to a single prompt, MetaLogic evaluates semantic equivalence between paired images, offering a scalable, ground-truth-free approach to identifying alignment failures. It categorises these alignment errors (e.g., entity omission, duplication, positional misalignment) and surfaces counterexamples that can be used for model debugging and refinement. We evaluate MetaLogic across multiple state-of-the-art T2I models and reveal consistent robustness failures across a range of logical constructs. We find that even SOTA text-to-image models like Flux.dev and DALL-E 3 show misalignment rates of 59 percent and 71 percent, respectively. Our results show that MetaLogic is not only efficient and scalable, but also effective in uncovering fine-grained logical inconsistencies that are overlooked by existing evaluation metrics.
[129] Solar PV Installation Potential Assessment on Building Facades Based on Vision and Language Foundation Models
Ruyu Liu,Dongxu Zhuang,Jianhua Zhang,Arega Getaneh Abate,Per Sieverts Nielsen,Ben Wang,Xiufeng Liu
Main category: cs.CV
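TL;DR: This paper proposes SF-SPA, an automated framework that assesses the photovoltaic potential of building facades from street-view images, combining computer vision and artificial intelligence for geometric rectification, semantic segmentation, spatial reasoning, and energy simulation; it is efficient and accurate, and suited to urban energy planning and PV deployment.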
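The last stage of the pipeline turns usable facade area into an energy estimate, and the reported accuracy is a relative area error. The digest does not describe the paper's simulation, so the sketch below substitutes a common first-order approximation (area times annual irradiation times module efficiency times performance ratio). Every parameter value and function name here is an illustrative assumption.

```python
def annual_yield_kwh(area_m2, irradiation_kwh_per_m2,
                     module_efficiency=0.20, performance_ratio=0.75):
    """First-order PV yield estimate; a stand-in, not the paper's simulation.

    irradiation_kwh_per_m2: annual irradiation on the facade plane.
    module_efficiency, performance_ratio: assumed typical values.
    """
    return (area_m2 * irradiation_kwh_per_m2
            * module_efficiency * performance_ratio)


def relative_area_error(estimated_m2, annotated_m2):
    """Absolute area error as a fraction of the expert-annotated area."""
    return abs(estimated_m2 - annotated_m2) / annotated_m2


# Hypothetical facade: 10 m^2 usable, 1000 kWh/m^2 per year on its plane.
yield_kwh = annual_yield_kwh(10.0, 1000.0)
error = relative_area_error(106.2, 100.0)  # 6.2% over the annotation
```

Because the yield scales linearly with area in this approximation, a relative error in estimated area carries over directly to the estimated energy.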
Details
Motivation: Building facades are an underused solar resource in dense urban environments, but their complex geometry and varied semantic components make PV potential hard to assess, so an automated, efficient, and accurate method is needed.
Method: SF-SPA is a four-stage automated framework: (1) geometric rectification to correct perspective distortion; (2) zero-shot semantic segmentation to identify facade elements; (3) Large Language Model (LLM) guided spatial reasoning to optimize the PV layout; (4) energy simulation to predict yield.
Result: Validated on 80 buildings in four countries, with a mean area estimation error of 6.2% ± 2.8% and about 100 seconds of processing per building, substantially faster than manual methods; simulated energy yields are reliable and suitable for regional potential assessment and urban energy planning.
Conclusion: SF-SPA quantifies facade PV potential from street-view images efficiently and accurately, offering a practical route to city-scale solar use and building-integrated photovoltaic (BIPV) deployment.
Abstract: Building facades represent a significant untapped resource for solar energy generation in dense urban environments, yet assessing their photovoltaic (PV) potential remains challenging due to complex geometries and semantic components. This study introduces SF-SPA (Semantic Facade Solar-PV Assessment), an automated framework that transforms street-view photographs into quantitative PV deployment assessments. The approach combines computer vision and artificial intelligence techniques to address three key challenges: perspective distortion correction, semantic understanding of facade elements, and spatial reasoning for PV layout optimization. Our four-stage pipeline processes images through geometric rectification, zero-shot semantic segmentation, Large Language Model (LLM) guided spatial reasoning, and energy simulation. Validation across 80 buildings in four countries demonstrates robust performance with mean area estimation errors of 6.2% ± 2.8% compared to expert annotations. The automated assessment requires approximately 100 seconds per building, a substantial gain in efficiency over manual methods. Simulated energy yield predictions confirm the method's reliability and applicability for regional potential studies, urban energy planning, and building-integrated photovoltaic (BIPV) deployment. Code is available at: https://github.com/CodeAXu/Solar-PV-Installation
[130] From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation
Fan Yang,Zhiyang Chen,Yousong Zhu,Xin Li,Jinqiao Wang
Main category: cs.CV
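TL;DR: Proposes TrajVLM-Gen, a two-stage physics-aware image-to-video generation framework that uses a vision-language model to predict motion trajectories consistent with real-world physics and guides video generation with attention-based mechanisms.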
Details
Motivation: Existing video generation models produce physically inconsistent motion that violates real-world dynamics, so the physical plausibility of generated video needs to improve.
Method: A two-stage framework: first, a vision-language model predicts coarse-grained, physically consistent motion trajectories; second, attention-based mechanisms use these trajectories to guide video generation with fine-grained motion refinement. A trajectory prediction dataset with realistic motion patterns is built from video tracking data.
Result: Experiments on UCF-101 and MSR-VTT show that TrajVLM-Gen outperforms existing methods, with FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
Conclusion: TrajVLM-Gen improves physical consistency in video generation by combining a vision-language model with attention mechanisms for more realistic motion modeling.
Abstract: Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
[131] PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset
Thomas Campagnolo,Ezio Malis,Philippe Martinet,Gaetan Bahl
Main category: cs.CV
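TL;DR: This paper introduces PhraseStereo, the first dataset for phrase-region segmentation on stereo image pairs; it uses GenStereo to generate right-view images from single-view data, extending phrase grounding to stereo vision and supporting research at the intersection of language, vision, and 3D perception.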
Details
Motivation: Phrase grounding research is largely limited to single-view images and ignores the rich geometric cues in stereo vision, so multimodal semantic segmentation should be extended to stereo images to improve grounding accuracy and contextual understanding.
Method: Building on the PhraseCut dataset, GenStereo is used to generate accurate right-view images, yielding the PhraseStereo dataset of stereo image pairs with aligned segmentation masks and phrase annotations.
Result: PhraseStereo provides a new benchmark for phrase grounding in stereo images and brings out new challenges and opportunities in using depth cues for more precise, context-aware multimodal learning.
Conclusion: PhraseStereo lays a foundation for multimodal models that combine semantic and geometric information, and can support joint reasoning over language, vision, and 3D perception.
Abstract: Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single-view images, neglecting the rich geometric cues available in stereo vision. To address this, we introduce PhraseStereo, the first dataset that brings phrase-region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right-view images from existing single-view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context-aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.
[132] NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution
Xiangtao Kong,Rongyuan Wu,Shuaizheng Liu,Lingchen Sun,Lei Zhang
Main category: cs.CV
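TL;DR: This paper proposes NSARM, an autoregressive framework for real-world image super-resolution (Real-ISR) that uses next-scale prediction to achieve better visual results than existing methods while keeping inference fast, with stronger robustness to input quality and better generalization.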
Details
Motivation: Existing diffusion-based Real-ISR methods trade off efficiency against quality and are not robust to inputs with varying degradation, which leads to artifacts and hallucinations.
Method: Built on a pre-trained visual autoregressive model such as Infinity, training proceeds in two stages: a transformation network first maps the low-quality image to preliminary scales, followed by end-to-end fine-tuning of the full model for robust next-scale prediction.
Result: NSARM outperforms existing Real-ISR methods in quantitative and qualitative evaluations, with fast inference, higher output quality, and stronger robustness and generalization.
Conclusion: As a pure autoregressive model, NSARM balances generation quality, efficiency, and robustness, offering a more reliable solution for Real-ISR.
Abstract: Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and limits robustness to inputs of varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by an end-to-end full-model fine-tuning. Such a comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: https://github.com/Xiangtaokong/NSARM
[133] Feature Identification for Hierarchical Contrastive Learning
Julius Ott,Nastassia Vysotskaya,Huawei Sun,Lorenzo Servadei,Robert Wille
Main category: cs.CV
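TL;DR: Proposes two new hierarchical contrastive learning methods (G-HMLC and A-HMLC) that model inter-class relationships and hierarchy-specific features, achieving state-of-the-art linear evaluation performance on CIFAR100 and ModelNet40.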
Details
Motivation: Conventional classification methods neglect inter-class relationships at different levels of a hierarchy, and so miss supervisory signals.
Method: Two hierarchical contrastive learning methods are proposed: G-HMLC, based on a Gaussian Mixture Model, and A-HMLC, based on an attention mechanism; both explicitly model inter-class relationships across levels and imbalanced class distributions.
Result: On CIFAR100 and ModelNet40, linear evaluation accuracy exceeds existing methods by 2 percentage points, and both quantitative and qualitative results support the method.
Conclusion: The methods capture hierarchical class structure and improve fine-grained clustering and classification, with potential applications in computer vision and beyond.
Abstract: Hierarchical classification is a crucial task in many applications, where objects are organized into multiple levels of categories. However, conventional classification approaches often neglect inherent inter-class relationships at different hierarchy levels, thus missing important supervisory signals. To address this, we propose two novel hierarchical contrastive learning (HMLC) methods. The first leverages a Gaussian Mixture Model (G-HMLC), and the second uses an attention mechanism to capture hierarchy-specific features (A-HMLC), imitating human processing. Our approach explicitly models inter-class relationships and imbalanced class distribution at higher hierarchy levels, enabling fine-grained clustering across all hierarchy levels. On the competitive CIFAR100 and ModelNet40 datasets, our method achieves state-of-the-art performance in linear evaluation, outperforming existing hierarchical contrastive learning methods by 2 percentage points in terms of accuracy. The effectiveness of our approach is backed by both quantitative and qualitative results, highlighting its potential for applications in computer vision and beyond.
[134] Gather-Scatter Mamba: Accelerating Propagation with Efficient State Space Model
Hyun-kyu Ko,Youbin Kim,Jihyeon Park,Dongheok Park,Gyeongjin Kang,Wonjun Cho,Hyung Yi,Eunbyung Park
Main category: cs.CV
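TL;DR: Proposes a hybrid architecture combining shifted-window self-attention with Mamba, and introduces the alignment-aware Gather-Scatter Mamba (GSM) mechanism, for efficient spatio-temporal modeling in video super-resolution.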
Details
Motivation: Mamba alone struggles to capture fine-grained spatial dependencies in video super-resolution and lacks explicit context aggregation; traditional RNNs suffer from vanishing gradients and slow inference, and Transformers scale poorly to long sequences because of their quadratic complexity.
Method: A hybrid architecture uses shifted-window self-attention for spatial context aggregation and Mamba for linear-complexity temporal propagation; Gather-Scatter Mamba (GSM) warps features toward the center frame within a temporal window, processes them with Mamba, and scatters them back to their original positions, improving propagation efficiency and alignment quality.
Result: The method performs strongly on several video super-resolution benchmarks, reducing occlusion artifacts and improving detail recovery while keeping inference efficient.
Conclusion: Hybrid architectures that combine attention with selective state space models are effective for video super-resolution, and GSM strengthens spatio-temporal modeling and feature alignment.
Abstract: State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.
[135] AI-CNet3D: An Anatomically-Informed Cross-Attention Network with Multi-Task Consistency Fine-tuning for 3D Glaucoma Classification
Roshan Kenia,Anfei Li,Rishabh Srivastava,Kaveri A. Thakoor
Main category: cs.CV
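TL;DR: Proposes AI-CNet3D, a hybrid deep learning model that combines cross-attention with a 3D CNN to extract glaucoma-relevant features from 3D OCT volumes, improving classification performance, interpretability, and anatomical coherence.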
Details
Motivation: Condensing 3D OCT data into 2D reports loses important structural information and hurts glaucoma diagnostic accuracy, so a model is needed that makes full use of the 3D data and preserves key anatomical detail.
Method: AI-CNet3D fuses a 3D CNN with cross-attention, processing the superior and inferior hemiretinas as well as the optic nerve head and macula regions; CAREs are introduced to visualize cross-attention outputs and are aligned with Grad-CAMs for consistency-based multi-task fine-tuning.
Result: Validated on two large datasets, the model outperforms existing attention and convolutional models on all key metrics while reducing the parameter count a hundred-fold, maintaining high diagnostic performance and computational efficiency.
Conclusion: AI-CNet3D captures retinal asymmetries and multi-region structural information, improving glaucoma classification while remaining efficient, interpretable, and anatomically coherent.
Abstract: Glaucoma is a progressive eye disease that leads to optic nerve damage, causing irreversible vision loss if left untreated. Optical coherence tomography (OCT) has become a crucial tool for glaucoma diagnosis, offering high-resolution 3D scans of the retina and optic nerve. However, the conventional practice of condensing information from 3D OCT volumes into 2D reports often results in the loss of key structural details. To address this, we propose a novel hybrid deep learning model that integrates cross-attention mechanisms into a 3D convolutional neural network (CNN), enabling the extraction of critical features from the superior and inferior hemiretinas, as well as from the optic nerve head (ONH) and macula, within OCT volumes. We introduce Channel Attention REpresentations (CAREs) to visualize cross-attention outputs and leverage them for consistency-based multi-task fine-tuning, aligning them with Gradient-Weighted Class Activation Maps (Grad-CAMs) from the CNN's final convolutional layer to enhance performance, interpretability, and anatomical coherence. We have named this model AI-CNet3D (AI-'See'-Net3D) to reflect its design as an Anatomically-Informed Cross-attention Network operating on 3D data. By dividing the volume along two axes and applying cross-attention, our model enhances glaucoma classification by capturing asymmetries between the hemiretinal regions while integrating information from the optic nerve head and macula. We validate our approach on two large datasets, showing that it outperforms state-of-the-art attention and convolutional models across all key metrics. Finally, our model is computationally efficient, reducing the parameter count by one-hundred-fold compared to other attention mechanisms while maintaining high diagnostic performance and comparable GFLOPS.
[136] Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification
Yucheng Lu,Hubert Dariusz Zając,Veronika Cheplygina,Amelia Jiménez-Sánchez
Main category: cs.CV
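TL;DR: This study takes a human-computer interaction (HCI) perspective on how machine learning practitioners choose source datasets for transfer learning. It finds that the choice depends on the task, community practice, dataset properties, and perceived similarity, and that the conventional "more similar is better" view does not always hold.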
Details
Motivation: Transfer learning is crucial in medical imaging, but source datasets are often chosen by intuition rather than systematic principles, and practitioners' decision-making has not been studied systematically.
Method: A task-based survey collects the grounds on which machine learning practitioners select source datasets for different tasks and analyzes their criteria and terminology.
Result: Source dataset selection is task-dependent and shaped by community conventions, dataset properties, and perceived similarity. Similarity ratings do not always match expected performance, and practitioners often describe their reasoning in ambiguous terms.
Conclusion: The findings challenge the "more similar is better" assumption and point to a need for clearer definitions and HCI tools that support systematic source dataset selection.
Abstract: Transfer learning is crucial for medical imaging, yet the selection of source datasets (which can impact the generalizability of algorithms, and thus patient outcomes) often relies on researchers' intuition rather than systematic principles. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-centered HCI perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Participants often used ambiguous terminology, which suggests a need for clearer definitions and HCI tools to make them explicit and usable. By clarifying these heuristics, this work provides practical insights for more systematic source selection in transfer learning.
[137] PAL-Net: A Point-Wise CNN with Patch-Attention for 3D Facial Landmark Localization
Ali Shadman Yazdi,Annalisa Cappella,Benedetta Baldini,Riccardo Solazzo,Gianluca Tartaglia,Chiarella Sforza,Giuseppe Baselli
Main category: cs.CV
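TL;DR: This study proposes PAL-Net, a fully automated deep learning pipeline for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. It combines coarse alignment, region-of-interest filtering, and a patch-based pointwise CNN with attention, and shows good accuracy and generalization across datasets at low computational cost.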
Details
Motivation: Manually annotating anatomical landmarks on 3D facial scans is time-consuming and depends on expert experience, while existing deep learning methods mostly target pseudo-landmarks or require complex input representations, which limits their clinical use. An efficient, accurate, and clinically applicable automated annotation method is therefore needed.
Method: PAL-Net combines coarse alignment, region-of-interest filtering, and initial landmark estimation with a patch-based pointwise CNN enhanced by attention. It is trained and validated on 214 facial scans of healthy adults, and its generalization is tested on 700 subjects from the FaceScape dataset.
Result: On the 214 scans, the mean localization error is 3.686 mm and the anatomical distance error is 2.822 mm. On FaceScape, the point-wise error is 0.41 mm and the distance-wise error is 0.38 mm, outperforming existing methods, particularly in structural consistency. Performance drops in regions with poor mesh quality (e.g., ears, hairline).
Conclusion: PAL-Net performs well in accuracy, computational efficiency, and cross-dataset generalization. It offers a lightweight, scalable solution that could support high-throughput 3D anthropometric analysis and reduce reliance on manual annotation, with potential for clinical use.
Abstract: Manual annotation of anatomical landmarks on 3D facial scans is a time-consuming and expertise-dependent task, yet it remains critical for clinical assessments, morphometric analysis, and craniofacial research. While several deep learning methods have been proposed for facial landmark localization, most focus on pseudo-landmarks or require complex input representations, limiting their clinical applicability. This study presents a fully automated deep learning pipeline (PAL-Net) for localizing 50 anatomical landmarks on stereo-photogrammetry facial models. The method combines coarse alignment, region-of-interest filtering, and an initial approximation of landmarks with a patch-based pointwise CNN enhanced by attention mechanisms. Trained and evaluated on 214 annotated scans from healthy adults, PAL-Net achieved a mean localization error of 3.686 mm and preserved relevant anatomical distances with a 2.822 mm average error, comparable to intra-observer variability. To assess generalization, the model was further evaluated on 700 subjects from the FaceScape dataset, achieving a point-wise error of 0.41 mm and a distance-wise error of 0.38 mm. Compared to existing methods, PAL-Net offers a favorable trade-off between accuracy and computational cost. While performance degrades in regions with poor mesh quality (e.g., ears, hairline), the method demonstrates consistent accuracy across most anatomical regions.
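The two kinds of error quoted above can be computed roughly as follows. This is a minimal sketch, assuming that the point-wise error is the mean Euclidean distance between predicted and annotated landmarks and that the distance-wise error is the mean absolute difference over all pairwise inter-landmark distances. The abstract does not spell out either definition, and the function names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def pointwise_error(pred, truth):
    """Mean Euclidean distance between matching landmarks (same units as input)."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

def distance_error(pred, truth):
    """Mean absolute difference of all pairwise inter-landmark distances."""
    pairs = list(combinations(range(len(pred)), 2))
    return sum(
        abs(math.dist(pred[i], pred[j]) - math.dist(truth[i], truth[j]))
        for i, j in pairs
    ) / len(pairs)

# Toy example: three landmarks in millimetres (made-up coordinates).
truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
# Every predicted landmark is shifted by 1 mm along z.
pred = [(0.0, 0.0, 1.0), (10.0, 0.0, 1.0), (0.0, 10.0, 1.0)]

pw = pointwise_error(pred, truth)  # each landmark is off by 1 mm
dw = distance_error(pred, truth)   # a rigid shift leaves distances unchanged
print(pw, dw)
```

Under these assumed definitions, a rigid shift of all landmarks produces a nonzero point-wise error but no distance-wise error, which is one reason a structural measure can be informative alongside a point-wise one.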
PAL-Net generalizes effectively across datasets and facial regions, outperforming existing methods in both point-wise and structural evaluations. It provides a lightweight, scalable solution for high-throughput 3D anthropometric analysis, with potential to support clinical workflows and reduce reliance on manual annotation. Source code can be found at https://github.com/Ali5hadman/PAL-Net-A-Point-Wise-CNN-with-Patch-Attention
[138] Equivariant Splitting: Self-supervised learning from incomplete data
Victor Sechaud,Jérémy Scanvic,Quentin Barthélemy,Patrice Abry,Julián Tachella
Main category: cs.CV
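TL;DR: This paper proposes a new self-supervised learning strategy for reconstruction in inverse problems from a single incomplete observation model. Combining self-supervised splitting losses with equivariant reconstruction networks gives an unbiased estimate of the supervised loss, and the method reaches state-of-the-art performance on image inpainting, accelerated magnetic resonance imaging, and compressive sensing.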
Details
Motivation: When ground-truth references are hard to obtain, conventional supervised learning is limited, so a self-supervised method is needed that can train reconstruction networks from noisy or incomplete data alone.
Method: The paper proposes a new self-supervised learning strategy that introduces a definition of equivariance for reconstruction networks and combines it with self-supervised splitting losses, yielding unbiased loss estimates under a single incomplete observation model.
Result: On image inpainting, accelerated MRI, and compressive sensing, the method performs strongly with highly rank-deficient forward models and reaches state-of-the-art results.
Conclusion: Combining equivariant networks with splitting losses effectively addresses inverse-problem reconstruction from a single incomplete observation model and has broad application potential.
Abstract: Self-supervised learning for inverse problems allows training a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.
[139] Looking Alike From Far to Near: Enhancing Cross-Resolution Re-Identification via Feature Vector Panning
Zanwu Liu,Chao Yuan,Bo Li,Xiaowei Zhang,Guanglin Niu
Main category: cs.CV
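TL;DR: Proposes Vector Panning Feature Alignment (VPFA), a lightweight and efficient framework that tackles cross-resolution person re-identification by modeling resolution-specific feature discrepancies.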
Details
Motivation: Existing cross-resolution ReID methods rely on super-resolution or joint learning, which adds complexity and yields limited gains, making it hard to match pedestrian images of different resolutions.
Method: Building on the observation that semantic directions encoding resolution differences exist in the ReID feature space, the authors propose the VPFA framework. They verify the phenomenon statistically (Canonical Correlation Analysis and Pearson correlation analysis) and align features of different resolutions by vector panning.
Result: The method significantly outperforms prior state-of-the-art approaches on multiple cross-resolution ReID benchmarks while being more efficient.
Conclusion: VPFA offers a new perspective on cross-resolution ReID and confirms the existence and usefulness of resolution-related semantic directions in the feature space.
Abstract: In surveillance scenarios, varying camera distances cause significant differences among pedestrian image resolutions, making it hard to match low-resolution (LR) images with high-resolution (HR) counterparts and limiting the performance of Re-Identification (ReID) tasks. Most existing Cross-Resolution ReID (CR-ReID) methods rely on super-resolution (SR) or joint learning for feature compensation, which increases training and inference complexity and has reached a performance bottleneck in recent studies. Inspired by semantic directions in the word embedding space, we empirically discover that semantic directions implying resolution differences also emerge in the feature space of ReID, and we substantiate this finding from a statistical perspective using Canonical Correlation Analysis and Pearson Correlation Analysis. Based on this finding, we propose a lightweight and effective Vector Panning Feature Alignment (VPFA) framework, which conducts CR-ReID from a novel perspective of modeling the resolution-specific feature discrepancy. Extensive experimental results on multiple CR-ReID benchmarks show that our method significantly outperforms previous state-of-the-art baseline models while obtaining higher efficiency, demonstrating the effectiveness and superiority of our model based on the new finding in this paper.
[140] InfVSR: Breaking Length Limits of Generic Video Super-Resolution
Ziqing Zhang,Kai Liu,Zheng Chen,Xi Li,Yucong Chen,Bingnan Duan,Linghe Kong,Yulun Zhang
Main category: cs.CV
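TL;DR: This paper proposes InfVSR, which reformulates video super-resolution (VSR) as an autoregressive one-step diffusion model, supporting efficient streaming inference and markedly improving the speed and quality of long-video processing.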
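Streaming inference over unbounded-length video needs bounded memory, which is the role of the rolling KV-cache mentioned in the details below. A minimal sketch of the general idea, assuming a fixed window of the most recent chunks: the class name, the window policy, and the string placeholders are hypothetical, and InfVSR's actual cache design is not described in this listing.

```python
from collections import deque

class RollingCache:
    """Keeps keys and values for only the most recent `window` chunks."""

    def __init__(self, window):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def context(self):
        """Cached (key, value) pairs, oldest first."""
        return list(zip(self.keys, self.values))

cache = RollingCache(window=3)
for chunk_id in range(5):
    # Strings stand in for the per-chunk key and value tensors.
    cache.append(f"k{chunk_id}", f"v{chunk_id}")

recent = cache.context()
print(recent)  # only the three most recent chunks remain
```

Memory stays constant however long the stream runs, at the cost of losing direct access to older chunks.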
Details
Motivation: Existing VSR methods are computationally inefficient and scale poorly on long videos, so they struggle with continuous super-resolution of thousand-frame videos.
Method: (1) The pre-trained DiT is adapted into a causal structure, with a rolling KV-cache and joint visual guidance maintaining spatio-temporal consistency. (2) The diffusion process is distilled into single-step inference using patch-wise pixel supervision and cross-chunk distribution matching.
Result: On a newly built long-video benchmark, the method achieves state-of-the-art reconstruction quality with better semantic consistency and is up to 58x faster than methods such as MGLD-VSR.
Conclusion: InfVSR enables efficient, scalable super-resolution for unbounded-length video and advances long-form VSR.
Abstract: Real-world videos often extend over thousands of frames. Existing video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor scalability hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive one-step diffusion paradigm. This enables streaming inference while fully leveraging pre-trained video diffusion priors. First, we adapt the pre-trained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Together, these designs enable efficient and scalable VSR for unbounded-length videos. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Code will be available at https://github.com/Kai-Liu001/InfVSR.
[141] JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
Siheng Wan,Zhengtao Yao,Zhengdao Li,Junhao Dong,Yanshu Li,Yikai Li,Linshan Li,Haoyan Xu,Yijiang Li,Zhikang Dong,Huacan Wang,Jifeng Shen
Main category: cs.CV
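TL;DR: JEPA-T is a unified multimodal framework that encodes images and text into discrete tokens processed by a joint-embedding predictive Transformer, adding cross-attention after the feature predictor and injecting text embeddings to strengthen text-visual fusion. On ImageNet-1K it shows strong data efficiency and open-vocabulary generation.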
Details
Motivation: Existing text-to-image models struggle to fuse text with visual tokens, especially when strong conditioning must coexist with a general-purpose backbone.
Method: JEPA-T represents images and text as discrete tokens modeled by a joint-embedding predictive Transformer. Cross-attention is added for conditional denoising, and raw text embeddings are injected before the flow matching loss to improve alignment during training.
Result: On ImageNet-1K, JEPA-T outperforms non-fusion and late-fusion baselines, showing stronger data efficiency, open-vocabulary generalization, and consistent generation performance.
Conclusion: Late architectural fusion combined with objective-level alignment effectively balances conditioning strength and backbone generality in token-based text-to-image generation.
Abstract: Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is now available: https://github.com/justin-herry/JEPA-T.git
[142] A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
Axel Barroso-Laguna,Tommaso Cavallari,Victor Adrian Prisacariu,Eric Brachmann
Main category: cs.CV
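TL;DR: This paper proposes FastForward, a method that builds a map representation and relocalizes an image in a single feed-forward pass, greatly reducing mapping time while achieving state-of-the-art accuracy across varied environments.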
Details
Motivation: Existing visual localization methods take a long time to build a scene representation: even starting from mapping images with known camera poses, they need minutes to hours. This work asks whether competitive accuracy is achievable in far less time.
Method: FastForward represents multiple mapping images as a set of features anchored in 3D space and uses them to predict correspondences between the query image and the scene, from which the camera pose is estimated. Coupled with image retrieval, it localizes with very little map preparation time.
Result: FastForward achieves better localization accuracy than other approaches with minimal map preparation time and generalizes well to unseen domains, including large-scale outdoor environments.
Conclusion: FastForward delivers efficient and accurate visual localization, completing map construction and relocalization in a single feed-forward pass, which makes it practical for real applications.
Abstract: Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences for the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question of whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.
[143] Visual Self-Refinement for Autoregressive Models
Jiamian Wang,Ziqi Zhou,Chaithanya Kumar Mummadi,Sohail Dianat,Majid Rabbani,Raghuveer Rao,Chen Qiu,Zhiqiang Tao
Main category: cs.CV
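TL;DR: Proposes a plug-and-play refinement module that strengthens how autoregressive models capture complex spatial relationships in vision-language tasks.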
Details
Motivation: The spatial nature of visual signals conflicts with the sequential dependencies of autoregressive models, which leads to suboptimal generation.
Method: A post-processing module, applied after pretraining, uses global context and inter-token relationships in the generated sequence to jointly refine all generated visual tokens.
Result: Experiments show improved generation quality, less error accumulation in sequential generation, and better semantic consistency.
Conclusion: The module effectively strengthens spatial correspondence in vision-language sequence modeling and fits within a shared sequential prediction framework.
Abstract: Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of an autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationships across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model's ability to produce semantically consistent results.
[144] SoftCFG: Uncertainty-guided Stable Guidance for Visual autoregressive Model
Dongli Xu,Aleksei Tiulpin,Matthew B. Blaschko
Main category: cs.CV
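TL;DR: Proposes SoftCFG, which uses uncertainty-guided adaptive perturbation allocation to address guidance diminishing and over-guidance in classifier-free guidance for autoregressive image generation, and introduces Step Normalization to stabilize long-sequence generation.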
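For reference, standard classifier-free guidance (CFG), the baseline that SoftCFG modifies, extrapolates from the unconditional prediction toward the conditional one. The sketch below shows plain CFG on next-token logits only. It does not reproduce SoftCFG's uncertainty weighting or Step Normalization, and the function name and numbers are hypothetical.

```python
def cfg_logits(cond, uncond, scale):
    """Standard CFG: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Made-up logits over a two-token vocabulary.
cond = [2.0, 0.0]    # prediction with the condition (e.g. a class or prompt)
uncond = [1.0, 1.0]  # prediction with the condition dropped

guided = cfg_logits(cond, uncond, scale=3.0)
print(guided)  # [4.0, -2.0]
```

With `scale=1.0` the result equals the conditional logits. The guidance term is proportional to the gap `cond - uncond`, so it fades when that gap shrinks during decoding, which is the guidance-diminishing issue named above, while a large scale exaggerates the gap, which corresponds to over-guidance.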
Details
Motivation: When autoregressive models use classifier-free guidance (CFG), the guidance signal decays and over-guidance causes visual incoherence. This work addresses both problems.
Method: SoftCFG assigns each generated token an uncertainty-weighted guidance signal and introduces Step Normalization to bound cumulative perturbations. The method is training-free and model-agnostic.
Result: On ImageNet 256, SoftCFG significantly outperforms standard CFG and achieves the best FID among current autoregressive models.
Conclusion: SoftCFG effectively mitigates guidance diminishing and over-guidance, improves the stability and image quality of long-sequence generation, and is general and practical.
Abstract: Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 among autoregressive models.
[145] TextCAM: Explaining Class Activation Map with Text
Qiming Zhao,Xingjian Li,Xiaoyu Cao,Xiaolong Wu,Min Xu
Main category: cs.CV
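TL;DR: This paper proposes TextCAM, a new explanation framework for vision models that combines Class Activation Mapping (CAM) with natural language. By drawing on the semantic alignment of vision-language models, it produces explanations that offer both spatial localization and semantic description.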
Details
Motivation: Existing CAM methods cannot explain the semantic attributes behind activated regions, which limits their trustworthiness in high-stakes applications.
Method: Channel-level semantic representations are extracted with CLIP embeddings and linear discriminant analysis and combined with CAM weights to generate textual descriptions. Feature channels are further clustered into semantically coherent groups for finer-grained explanations.
Result: Experiments on ImageNet, CLEVR, and CUB show that TextCAM produces faithful and readable explanations that improve human understanding, help detect spurious correlations, and preserve model performance.
Conclusion: By fusing vision-language semantics, TextCAM makes deep vision models more interpretable and balances spatial localization with semantic explanation.
Abstract: Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants, which work by highlighting spatial regions that drive predictions. We find that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural language. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to group feature channels into semantically coherent groups, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.
[146] POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal,Ankit Ghimire,Saydul Akbar Murad,Nick Rahimi
Main category: cs.CV
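TL;DR: This paper proposes POVQA, a data-efficient video question answering method that compresses each second of video into a single temporally pooled image and fine-tunes LVLMs with lightweight supervision, markedly improving performance on long-video question answering.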
Details
Motivation: Existing long-video question answering models are limited by their context window (about 50 seconds of video) and cannot process long videos efficiently, so a more efficient way of compressing temporal information is needed.
Method: The POVQA pipeline uses temporal pooling (motion blur, weighted averaging, and similar schemes) to compress each second of video into one frame and builds 1 fps inputs. On the newly proposed ReasonVQA dataset, QWEN-2.5-VL 7B is fine-tuned with SFT and DPO on a two-turn target containing reasoning and answer.
Result: On ReasonVQA, F1 rises from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528, and rationale quality improves markedly. Cross-evaluation over pooling functions shows the method is robust, and zero-shot transfer to TVQA also performs well.
Conclusion: With effective temporal compression and lightweight fine-tuning, POVQA achieves efficient and robust long-video question answering with good generalization and application potential.
Abstract: Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since Flamingo was introduced by DeepMind. Recent advancements in large-context/long-video question answering have allowed VQA tasks to have a context window of 1500+ frames. However, this only leads to 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target including reasoning and final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness on summarization of temporal evidence.
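The temporal pooling step can be pictured as a weighted average over the frames within one second. This is a minimal sketch under stated assumptions: frames are flat lists of pixel values, and the ramp and exponential weight formulas (with `decay=0.5`) are illustrative guesses, since the abstract names the pooling variants but does not define their weights.

```python
def pool_frames(frames, weights):
    """Weighted average of equally sized frames (each a flat list of pixels)."""
    total = sum(weights)
    return [
        sum(w * frame[i] for w, frame in zip(weights, frames)) / total
        for i in range(len(frames[0]))
    ]

def ramp_weights(n):
    """Linearly increasing weights, so later frames count more (assumed form)."""
    return [k + 1 for k in range(n)]

def exponential_weights(n, decay=0.5):
    """Weights shrinking geometrically for older frames (assumed form)."""
    return [decay ** (n - 1 - k) for k in range(n)]

# One "second" of video: three tiny 2-pixel frames with made-up values.
second = [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]

uniform = pool_frames(second, [1, 1, 1])
ramp = pool_frames(second, ramp_weights(3))
expo = pool_frames(second, exponential_weights(3))
print(uniform, ramp, expo)
```

Applying `pool_frames` once per second of footage yields a 1 fps image sequence. In this toy example the ramp and exponential variants lean further toward the last frame than the uniform average does.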
Similar observations were made in zero-shot evaluation on TVQA.
[147] ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning
Yuxiang Guo,Jiang Liu,Ze Wang,Hao Chen,Ximeng Sun,Yang Zhao,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Main category: cs.CV
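TL;DR: This paper proposes ImageDoctor, a unified multi-aspect evaluation framework for text-to-image models that scores images on plausibility, semantic alignment, aesthetics, and overall quality, and produces pixel-level heatmaps marking flawed regions. It follows a "look-think-predict" paradigm, is trained with supervised fine-tuning plus reinforcement learning, aligns well with human preference, and markedly improves generation quality when used as a reward model.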
Details
Motivation: Existing evaluation methods for text-to-image models usually give a single scalar score, which lacks granularity and interpretability and cannot fully reflect image quality. An evaluation framework that gives multi-aspect, interpretable feedback is therefore needed.
Method: ImageDoctor is built on a vision-language model and follows a "look-think-predict" paradigm: it first localizes potential flaws, then generates reasoning, and finally outputs quantitative scores. It is trained with a combination of supervised fine-tuning and reinforcement learning and produces pixel-level heatmaps that serve as a dense reward signal.
Result: ImageDoctor agrees closely with human preference across multiple datasets. Used as a reward model for preference tuning, it improves generation quality by 10% over scalar reward models.
Conclusion: ImageDoctor is an effective and interpretable evaluation framework for text-to-image models. It provides multi-aspect quality assessment and visual feedback and can serve as a dense reward model that markedly improves generation.
Abstract: The rapid advancement of text-to-image (T2I) models has increased the need for reliable human preference modeling, a demand further amplified by recent progress in reinforcement learning for preference alignment. However, existing approaches typically quantify the quality of a generated image using a single scalar, limiting their ability to provide comprehensive and interpretable feedback on image quality. To address this, we introduce ImageDoctor, a unified multi-aspect T2I model evaluation framework that assesses image quality across four complementary dimensions: plausibility, semantic alignment, aesthetics, and overall quality. ImageDoctor also provides pixel-level flaw indicators in the form of heatmaps, which highlight misaligned or implausible regions, and can be used as a dense reward for T2I model preference alignment. Inspired by the diagnostic process, we improve the detail sensitivity and reasoning capability of ImageDoctor by introducing a "look-think-predict" paradigm, where the model first localizes potential flaws, then generates reasoning, and finally concludes the evaluation with quantitative scores. Built on top of a vision-language model and trained through a combination of supervised fine-tuning and reinforcement learning, ImageDoctor demonstrates strong alignment with human preference across multiple datasets, establishing its effectiveness as an evaluation metric.
Furthermore, when used as a reward model for preference tuning, ImageDoctor significantly improves generation quality, achieving an improvement of 10% over scalar-based reward models.
[148] Towards Adversarial Training under Hyperspectral Images
Weihua Zhang,Chengze Jiang,Jie Gui,Lu Dong
Main category: cs.CV
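TL;DR: This paper proposes AT-RA, an adversarial training method for hyperspectral images that uses data augmentation to make models more robust to adversarial attacks while preserving and correcting spectral semantic information. It markedly improves both defense under several attacks and accuracy on benign samples.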
Details
Motivation: Deep learning models for hyperspectral classification are vulnerable to adversarial attacks, and existing defenses scale poorly and offer limited protection against strong attacks, so a more effective way to improve robustness is needed.
Method: Adversarial training is brought to the hyperspectral domain. The proposed AT-RA method uses data augmentation to increase spectral diversity and enforces spatial smoothness to preserve and correct spectral semantic information.
Result: AT-RA improves adversarial robustness by 21.34% against AutoAttack and 18.78% against PGD-50, while raising benign accuracy by 2.68%.
Conclusion: AT-RA effectively limits the damage that adversarial noise does to spectral semantic information in hyperspectral images, markedly improving adversarial robustness and classification performance and offering a workable route to secure hyperspectral applications.
Abstract: Recent studies have revealed that hyperspectral classification models based on deep learning are highly vulnerable to adversarial attacks, which pose significant security risks. Although several approaches have attempted to enhance adversarial robustness by modifying network architectures, these methods often rely on customized designs that limit scalability and fail to defend effectively against strong attacks. To address these challenges, we introduce adversarial training to the hyperspectral domain, which is widely regarded as one of the most effective defenses against adversarial attacks. Through extensive empirical analyses, we demonstrate that while adversarial training does enhance robustness across various models and datasets, hyperspectral data introduces unique challenges not seen in RGB images. Specifically, we find that adversarial noise and the non-smooth nature of adversarial examples can distort or eliminate important spectral semantic information. To mitigate this issue, we employ data augmentation techniques and propose a novel hyperspectral adversarial training method, termed AT-RA. By increasing the diversity of spectral information and ensuring spatial smoothness, AT-RA preserves and corrects spectral semantics in hyperspectral images. Experimental results show that AT-RA improves adversarial robustness by 21.34% against AutoAttack and 18.78% against PGD-50 while boosting benign accuracy by 2.68%.
[149] Secure and reversible face anonymization with diffusion models
Pol Labarbarie,Vincent Itier,William Puech
Main category: cs.CV
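TL;DR: This paper proposes a secure, high-quality, reversible face anonymization method based on a diffusion model. It combines a secret-key mechanism with a facial mask to strengthen privacy protection while preserving image quality.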
Details
Motivation: Existing methods struggle to balance security, image quality, and reversibility, and they lack a secret-key mechanism for controlling de-anonymization.
Method: The secret key is combined with the diffusion model's latent face representation, a facial mask constrains generation, and deterministic forward and backward diffusion provides reversibility.
Result: The method produces high-quality anonymized faces that are less visually similar to the originals, and only authorized parties holding the correct key can recover the original image.
Conclusion: The proposed method is the first secure, reversible face anonymization scheme based on a diffusion model and strikes a good balance between privacy protection, image quality, and reversibility.
Abstract: Face images processed by computer vision algorithms contain sensitive personal information that malicious actors can capture without consent. These privacy and security risks highlight the need for effective face anonymization methods. Current methods struggle to offer a good trade-off between a secure scheme with high-quality image generation and reversibility for later person authentication. Diffusion-based approaches produce high-quality anonymized images but lack the secret key mechanism to ensure that only authorized parties can reverse the process. In this paper, we introduce, to our knowledge, the first secure, high-quality reversible anonymization method based on a diffusion model. We propose to combine the secret key with the latent face representation of the diffusion model. To preserve identity-irrelevant features, generation is constrained by a facial mask, maintaining high-quality images. By using a deterministic forward and backward diffusion process, our approach enforces that the original face can be recovered with the correct secret key. We also show that the proposed method produces anonymized faces that are less visually similar to the original faces than those of previous work.
[150] KeySG: Hierarchical Keyframe-Based 3D Scene Graphs
Abdelrhman Werby,Dennis Rotondi,Fabio Scaparro,Kai O. Arras
Main category: cs.CV
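TL;DR: This paper proposes KeySG, a hierarchical representation framework for 3D scene graphs that extracts multi-modal information from keyframes and uses vision-language models for more efficient semantic understanding and task-agnostic reasoning and planning.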
Details
Motivation: Existing 3D scene graph methods have a restricted set of semantic relationships and scale poorly. In complex environments their serialization easily exceeds an LLM's context window, which limits robot reasoning and planning in real scenes.
Method: KeySG builds a hierarchical graph of floors, rooms, objects, and functional elements, augments nodes with multi-modal information from keyframes chosen for geometric and visual coverage, and uses a hierarchical retrieval-augmented generation (RAG) pipeline to extract context efficiently.
Result: Across four benchmarks (including 3D object segmentation and complex query retrieval), KeySG outperforms prior methods on most metrics, showing greater semantic richness and processing efficiency.
Conclusion: With its hierarchical structure and keyframe-driven multi-modal representation, KeySG improves the scalability and semantic expressiveness of 3D scene graphs and suits robot reasoning and navigation under complex, ambiguous queries.
Abstract: In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage a VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects and enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across four distinct benchmarks (including 3D object segmentation and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.
[151] Instant4D: 4D Gaussian Splatting in Minutes
Zhanpeng Luo,Haoxi Ran,Li Lu
Main category: cs.CV
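TL;DR: Instant4D is an efficient monocular 4D scene reconstruction system that processes uncalibrated casual video sequences within minutes, without depth sensors or calibrated cameras.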
Details
Motivation: Dynamic view synthesis and scene reconstruction from uncalibrated casual video remain challenging, mainly because optimization is slow and parameter estimation is complex.
Method: The method first recovers geometry with deep visual SLAM, then applies grid pruning to optimize the scene representation, and introduces a streamlined 4D Gaussian representation to handle temporal dynamics more efficiently.
Result: Model size falls to under 10% of the original, training time drops to within two minutes, overall processing is 30x faster, and performance stays competitive on several benchmarks.
Conclusion: Instant4D reconstructs dynamic scenes quickly and efficiently, works on in-the-wild video, and generalizes well.
Abstract: Dynamic view synthesis has seen significant advances, yet reconstructing scenes from uncalibrated, casual video remains challenging due to slow optimization and complex parameter estimation. In this work, we present Instant4D, a monocular reconstruction system that leverages native 4D representation to efficiently process casual video sequences within minutes, without calibrated cameras or depth sensors. Our method begins with geometric recovery through deep visual SLAM, followed by grid pruning to optimize scene representation. Our design significantly reduces redundancy while maintaining geometric integrity, cutting model size to under 10% of its original footprint. To handle temporal dynamics efficiently, we introduce a streamlined 4D Gaussian representation, achieving a 30x speed-up and reducing training time to within two minutes, while maintaining competitive performance across several benchmarks. Our method reconstructs a single video within 10 minutes on the Dycheck dataset or for a typical 200-frame video. We further apply our model to in-the-wild videos, showcasing its generalizability. Our project website is published at https://instant4d.github.io/.
[152] Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving
Yuxiang Feng,Keyang Zhang,Hassane Ouchouid,Ashwil Kaniamparambil,Ioannis Souflas,Panagiotis Angeloudis
Main category: cs.CV
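TL;DR: Proposes Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic multi-model fusion method that makes vision-language models (VLMs) more reliable on multi-label understanding tasks in autonomous driving and markedly reduces hallucination. It combines context awareness with Shapley credit assignment, a finely annotated real-world dataset, and heterogeneous VLMs fine-tuned with LoRA. Experiments show clear improvements in Hamming distance and F1.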
Details
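Motivation: Large vision-language models (VLMs) hallucinate in autonomous driving systems, which undermines their reliability in safety-critical scenarios. A robust method is needed that fuses the outputs of several VLMs and calibrates their confidence.
Method: The proposed Shapley-credited Context-Aware Dawid-Skene with Agreement method learns, from labelled history, the reliability of each model for each label in a given context. At inference it converts each model's output into an agreement-guardrailed log-likelihood ratio and fuses it with a contextual prior and a reputation state updated via Shapley values. The authors also build a dataset of 1,000 real-world driving clips, with structured annotations generated automatically from HDD ground truth, vehicle kinematics, and object tracking results under a three-step chain-of-thought prompt. Three heterogeneous VLMs are fine-tuned with LoRA.
Result: Compared with the best single model, the method achieves a 23% reduction in Hamming distance, a 55% improvement in Macro-F1, and a 47% improvement in Micro-F1, and it is interpretable, calibrated, and adaptive to distribution drift.
Conclusion: The fusion method makes VLMs more reliable and robust for perception and decision-making in autonomous driving. Through collaboration between models and dynamic credit assignment, it amplifies reliable consensus while preserving uniquely correct signals, making it suitable as a trustworthy decision-support component in AV systems.
Abstract: Large vision-language models (VLMs) are increasingly used in autonomous-vehicle (AV) stacks, but hallucination limits their reliability in safety-critical pipelines. We present Shapley-credited Context-Aware Dawid-Skene with Agreement, a game-theoretic fusion method for multi-label understanding of ego-view dashcam video. It learns per-model, per-label, context-conditioned reliabilities from labelled history and, at inference, converts each model's report into an agreement-guardrailed log-likelihood ratio that is combined with a contextual prior and a public reputation state updated via Shapley-based team credit. The result is calibrated, thresholdable posteriors that (i) amplify agreement among reliable models, (ii) preserve uniquely correct single-model signals, and (iii) adapt to drift. To specialise general VLMs, we curate 1,000 real-world dashcam clips with structured annotations (scene description, manoeuvre recommendation, rationale) via an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and YOLOv11 + BoT-SORT tracking, guided by a three-step chain-of-thought prompt; three heterogeneous VLMs are then fine-tuned with LoRA. We evaluate with Hamming distance, Micro- and Macro-F1, and average per-video latency.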
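The multi-label metrics named above can be sketched as follows. This is a minimal illustration, assuming binary label vectors and taking Hamming distance to be the average number of mismatched labels per sample (the abstract does not give its exact normalization). The function names and toy labels are hypothetical.

```python
def hamming_distance(y_true, y_pred):
    """Average number of label positions that differ per sample."""
    return sum(
        sum(t != p for t, p in zip(ts, ps)) for ts, ps in zip(y_true, y_pred)
    ) / len(y_true)

def _f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for binary multi-label predictions."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = _f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then average with equal weight.
    macro = sum(_f1(*c) for c in counts) / n_labels
    return micro, macro

# Three clips, three labels (made-up values).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

hd = hamming_distance(y_true, y_pred)
micro, macro = f1_scores(y_true, y_pred)
print(hd, micro, macro)
```

Macro-F1 weights every label equally, so a label that is rarely predicted correctly (the third one here) pulls it down more than it pulls down Micro-F1.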
Empirically, the proposed method achieves a 23% reduction in Hamming distance, 55% improvement in Macro-F1, and 47% improvement in Micro-F1 when comparing with the best single model, supporting VLM fusion as a calibrated, interpretable, and robust decision-support component for AV pipelines.
[153] EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory
Jiahao Wang,Luoxin Ye,TaiMing Lu,Junfei Xiao,Jiahan Zhang,Yuxiang Guo,Xijun Liu,Rama Chellappa,Cheng Peng,Alan Yuille,Jieneng Chen
Main category: cs.CV
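TL;DR: Proposes EvoWorld, a world model that combines panoramic video generation with an evolving 3D memory to enable spatially consistent long-horizon exploration.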
Details
Motivation: Inspired by the human ability to mentally revisit 3D environments, the authors aim to build a world model with long-term spatial consistency. Method: Given a single panoramic image as input, a video generator first produces future frames, a feedforward Transformer then updates the scene's 3D reconstruction, and subsequent frames are generated conditioned on geometric reprojections. Result: Evaluated on synthetic outdoor environments, Habitat indoor scenes, and real-world scenarios, the method significantly improves visual fidelity and geometric consistency, performing especially well on loop-closure detection and spatial coherence over long trajectories. Conclusion: Using the evolving 3D memory as explicit spatial guidance significantly outperforms existing methods that only generate video, advancing long-horizon, spatially consistent world modeling. Abstract: Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially consistent long-horizon exploration. Given a single panoramic image as input, EvoWorld first generates future video frames by leveraging a video generator with fine-grained view control, then evolves the scene's 3D reconstruction using a feedforward plug-and-play transformer, and finally synthesizes futures by conditioning on geometric reprojections from this evolving explicit 3D memory. Unlike prior state-of-the-arts that synthesize videos only, our key insight lies in exploiting this evolving 3D reconstruction as explicit spatial guidance for the video generation process, projecting the reconstructed geometry onto target viewpoints to provide rich spatial cues that significantly enhance both visual realism and geometric consistency. To evaluate long-range exploration capabilities, we introduce the first comprehensive benchmark spanning synthetic outdoor environments, Habitat indoor scenes, and challenging real-world scenarios, with particular emphasis on loop-closure detection and spatial coherence over extended trajectories. Extensive experiments demonstrate that our evolving 3D memory substantially improves visual fidelity and maintains spatial scene coherence compared to existing approaches, representing a significant advance toward long-horizon spatially consistent world modeling.
[154] IMAGEdit: Let Any Subject Transform
Fei Shen,Weihao Xu,Rui Yan,Dong Zhang,Xiangbo Shu,Jinhui Tang
Main category: cs.CV
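TL;DR: This paper proposes IMAGEdit, a training-free multi-subject video editing framework that uses multimodal alignment and a prior-based mask retargeting module to precisely edit the appearance of multiple specified subjects while leaving non-target regions unchanged.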