Table of Contents
cs.CL
[1] Modeling Layered Consciousness with Multi-Agent Large Language Models
Sang Hun Kim,Jongmin Lee,Dongkyu Park,So Young Lee,Yosep Chong
Main category: cs.CL
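TL;DR: Proposes a multi-agent framework, grounded in psychodynamic theory, for modeling artificial consciousness in large language models, using parameter-efficient fine-tuning to generate personalized, emotionally expressive dialogue.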
Details
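Motivation: To bring psychodynamic theory into large language models in order to simulate the structure of human consciousness (self-awareness, preconsciousness, and unconsciousness) and to improve the models' emotional depth and personalization.
Method: Builds a multi-agent framework with self-awareness, preconscious, and unconscious agents, adds a Personalization Module that combines fixed traits with dynamic needs, and applies parameter-efficient fine-tuning on emotionally rich dialogue data.
Result: Across eight personalized conditions, an LLM-as-a-judge evaluation preferred the fine-tuned model 71.2% of the time, with greater emotional depth and lower output variance.
Conclusion: The framework can simulate layered psychological structure and strengthens adaptive, personalized cognition in LLMs, with potential applications such as psychological companionship.
Abstract: We propose a multi-agent framework for modeling artificial consciousness in large language models (LLMs), grounded in psychoanalytic theory. Our Psychodynamic Model simulates self-awareness, preconsciousness, and unconsciousness through agent interaction, guided by a Personalization Module combining fixed traits and dynamic needs. Using parameter-efficient fine-tuning on emotionally rich dialogues, the system was evaluated across eight personalized conditions. An LLM-as-a-judge approach showed a 71.2% preference for the fine-tuned model, with improved emotional depth and reduced output variance, demonstrating its potential for adaptive, personalized cognition.
[2] Outraged AI: Large language models prioritise emotion over cost in fairness enforcement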
Hao Liu,Yiqing Dai,Haotian Tan,Yu Lei,Yujia Zhou,Zhen Wu
Main category: cs.CL
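TL;DR: This study gives the first evidence that large language models (LLMs) are driven by emotion when deciding on altruistic third-party punishment, showing human-like moral judgment but weaker cost sensitivity and less nuanced fairness trade-offs.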
Details
Motivation: To test whether LLMs, like humans, use emotion to guide moral decisions, specifically when a third party punishes unfair behavior.
Method: A large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, analyzing how emotion affects punishment decisions and testing the causal effect of prompting LLMs to self-report their emotions.
Result: Emotion significantly shaped LLM punishment: unfairness elicited stronger negative emotion and led to more punishment, punishing unfairness produced more positive emotion, and prompting emotion self-reports causally increased punishment. However, LLMs weighted emotion over cost, enforcing norms in an almost all-or-none manner with lower cost sensitivity than humans. Reasoning models (e.g., o3-mini, DeepSeek-R1) were closer to human behavior than foundation models (e.g., GPT-3.5) but remained heavily emotion-driven.
Conclusion: LLMs show emotion-guided moral decision-making but lack fine-grained weighing of cost and context, resembling early stages of human development. Future models should combine emotion with context-sensitive reasoning to achieve human-like emotional intelligence.
Abstract: Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third-party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self-reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all-or-none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3-mini, DeepSeek-R1) were more cost-sensitive and closer to human behavior than foundation models (GPT-3.5, DeepSeek-V3), yet remained heavily emotion-driven. These findings provide the first causal evidence of emotion-guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early-stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.
[3] POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
Yizhuo Chen,Xin Liu,Ruijie Wang,Zheng Li,Pei Chen,Changlong Yu,Priyanka Nigam,Meng Jiang,Bing Yin
Main category: cs.CL
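TL;DR: Proposes POPI, a framework in which a preference inference model distills user signals into concise natural language summaries, enabling efficient and transferable personalized LLM generation.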
Details
Motivation: Existing alignment techniques ignore individual differences and struggle to serve diverse user preferences, while conventional personalization methods are either computationally expensive or inefficient.
Method: POPI pairs a preference inference model with a personalized generation model and optimizes both jointly with reinforcement learning, compressing user signals into natural language summaries that condition generation.
Result: On four benchmarks, POPI improves personalization accuracy, sharply reduces context overhead, and produces optimized summaries that transfer to frozen off-the-shelf LLMs for plug-and-play personalization.
Conclusion: POPI addresses efficiency, effectiveness, and transferability in LLM personalization and offers a general, practical approach to personalized generation.
Abstract: Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.
[4] Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review
Zhyar Rzgar K. Rostam,Gábor Kertész
Main category: cs.CL
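TL;DR: A systematic review of 41 studies, published between 2018 and early 2024, on pre-trained language models (PLMs) for domain-specific text classification. It covers the evolution of transformer models, open challenges, and a taxonomy of techniques, validates the findings with a BERT, SciBERT, and BioBERT experiment on biomedical sentence classification, compares LLM performance across domains, and outlines future research directions.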
Details
Motivation: Scientific literature and online information are growing exponentially, and general-purpose LLMs are limited in domain-specific text classification by specialized vocabulary, unusual grammatical structures, and imbalanced data. A systematic review is needed to map the applications, challenges, and trends of pre-trained models on these tasks.
Method: A systematic literature review (SLR) following the PRISMA statement, selecting 41 relevant studies published between 2018 and January 2024 through a rigorous multi-step screening process assisted by AI tools. The review traces the shift from traditional to modern text classification with a focus on transformer-based models, proposes a taxonomy of PLM techniques, and validates it empirically by comparing BERT, SciBERT, and BioBERT on biomedical sentence classification.
Result: PLM effectiveness varies across domains such as biomedicine, and domain-specific models (e.g., SciBERT, BioBERT) usually outperform general-purpose BERT. The review summarizes the main current technical approaches in a taxonomy, and the experiment confirms that domain-adaptive pre-training is key to the performance gains.
Conclusion: Pre-trained language models, especially domain-adapted ones, substantially improve domain-specific text classification. The study contributes a structured literature survey, a taxonomy, and an empirical comparison, and points to future work on data scarcity, model interpretability, and cross-domain transfer.
Abstract: The exponential increase in scientific literature and online information necessitates efficient methods for extracting knowledge from textual data. Natural language processing (NLP) plays a crucial role in addressing this challenge, particularly in text classification tasks. While large language models (LLMs) have achieved remarkable success in NLP, their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. In this systematic literature review (SLR), we investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification. We systematically review 41 articles published between 2018 and January 2024, adhering to the PRISMA statement (preferred reporting items for systematic reviews and meta-analyses). This review methodology involved rigorous inclusion criteria and a multi-step selection process employing AI-powered tools. We delve into the evolution of text classification techniques and differentiate between traditional and modern approaches. We emphasize transformer-based models and explore the challenges and considerations associated with using LLMs for domain-specific text classification. Furthermore, we categorize existing research based on various PLMs and propose a taxonomy of techniques used in the field. To validate our findings, we conducted a comparative experiment involving BERT, SciBERT, and BioBERT in biomedical sentence classification. Finally, we present a comparative study on the performance of LLMs in text classification tasks across different domains. In addition, we examine recent advancements in PLMs for domain-specific text classification and offer insights into future directions and limitations in this rapidly evolving domain.
[5] Atomic Literary Styling: Mechanistic Manipulation of Prose Generation in Neural Language Models
Tsogt-Ochir Enkhbayar
Main category: cs.CL
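TL;DR: A mechanistic analysis of literary style in GPT-2 finds that many neurons discriminate between exemplary literary prose and AI-generated text, yet ablating these neurons improves the literary quality of generated text, exposing a key gap between correlation and causation in neural networks.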
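The abstract of this entry reports effect sizes as Cohen's d. As a rough illustration of how such a per-neuron effect size could be computed, here is a minimal sketch. The pooled-standard-deviation form of Cohen's d and the activation values are assumptions for illustration; the entry does not say which variant the paper uses.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical activations of one neuron on literary vs. AI-generated passages.
literary = [0.9, 1.1, 1.0, 1.2, 0.8]
generated = [0.5, 0.7, 0.6, 0.8, 0.4]
d = cohens_d(literary, generated)
print(f"|d| = {abs(d):.2f}")
```

A large |d| only says the neuron's activations differ between the two text types; as the entry stresses, it does not show that the neuron is needed to generate that style.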
Details
Motivation: To understand which GPT-2 neurons are responsible for recognizing literary style and to test what role those neurons actually play during generation, probing the causal link between internal representations and outputs.
Method: Using Herman Melville's Bartleby, the Scrivener as the corpus, the study extracts activation patterns from 32,768 neurons in GPT-2's late layers, identifies significantly discriminative neurons with statistical tests, and then runs systematic ablation experiments to measure their effect on the style of generated text.
Result: 27,122 neurons are statistically significant discriminators (p < 0.05), with effect sizes up to |d| = 1.4. However, ablating 50 highly discriminative neurons improves literary style metrics of the generated text by 25.7%.
Conclusion: Neurons that activate on high-quality literary text are not necessarily required during generation and may even suppress creative output. Observed correlation does not imply causal necessity, which matters for interpretability and AI alignment research.
Abstract: We present a mechanistic analysis of literary style in GPT-2, identifying individual neurons that discriminate between exemplary prose and rigid AI-generated text. Using Herman Melville's Bartleby, the Scrivener as a corpus, we extract activation patterns from 355 million parameters across 32,768 neurons in late layers. We find 27,122 statistically significant discriminative neurons ($p < 0.05$), with effect sizes up to $|d| = 1.4$. Through systematic ablation studies, we discover a paradoxical result: while these neurons correlate with literary text during analysis, removing them often improves rather than degrades generated prose quality. Specifically, ablating 50 high-discriminating neurons yields a 25.7% improvement in literary style metrics. This demonstrates a critical gap between observational correlation and causal necessity in neural networks. Our findings challenge the assumption that neurons which activate on desirable inputs will produce those outputs during generation, with implications for mechanistic interpretability research and AI alignment.
[6] JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs
Junlan Feng,Fanyu Meng,Chong Long,Pengyu Cong,Duqing Wang,Yan Zheng,Yuyao Zhang,Xuanchang Gao,Ye Yuan,Yunfei Ma,Zhijie Ren,Fan Yang,Na Wu,Di Jin,Chao Deng
Main category: cs.CL
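TL;DR: Proposes improving LLM safety and trustworthiness by enriching pre-training data with world context (DWC): real-world situational information and industrial-scenario data are added during pre-training, and training continues from JT-35B-Base. The resulting model outperforms a Qwen model of similar scale on safety and trustworthiness evaluations.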
Details
Motivation: Hallucination and credibility problems in LLMs originate in pre-training, in both the data and the learning mechanism. Existing data is not anchored in real-world knowledge, which leaves model training highly uncertain.
Method: Proposes Data with World Context (DWC): pre-training data is supplemented with its real-world spatial-temporal context, and a large amount of industrial-scenario data is added. JT-35B-Base is further pre-trained on 1.5 trillion DWC tokens, followed by dedicated post-training to activate the potential of DWC.
Result: JT-Safe-35B, pre-trained on 6.2 trillion tokens, improves on a Qwen model of similar scale by an average of 1.79% on safety and trustworthiness benchmarks.
Conclusion: Aligning pre-training data with real-world context reduces uncertainty in model training and improves the safety and trustworthiness of LLMs.
Abstract: The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, significant advances have been made in post-training and inference techniques to mitigate these challenges. However, it is widely agreed that the unsafe behavior and hallucinations of LLMs intrinsically originate from pre-training, involving pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data is vast, it's almost impossible to entirely purge the data of factual errors, logical inconsistencies, or distributional biases. Moreover, the pre-training data lack grounding in real-world knowledge. Each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches to enhancing our pre-training data with its context in the world and adding a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by the authors for specific purposes in a certain spatial-temporal context. They have played a role in the real world. By incorporating related world context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model's safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion DWC tokens. We introduce our post-training procedures to activate the potential of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on the Safety and Trustworthy evaluation benchmarks, while being pretrained with only 6.2 trillion tokens.
[7] CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections
Keuntae Kim,Eunhye Jeong,Sehyeon Lee,Seohee Yoon,Yong Suk Choi
Main category: cs.CL
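TL;DR: Proposes CLAWS, a method that automatically classifies LLM outputs on mathematical reasoning tasks as typical, creative, or hallucinated without human evaluation, addressing two major obstacles to assessing creative generation.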
Details
Motivation: LLMs perform well on reasoning tasks such as mathematics and coding, but the creativity of their outputs has long been overlooked, mainly because creativity is hard to define and its assessment depends on human evaluation.
Method: CLAWS uses attention weights between sections of the prompt and the output to classify mathematical solutions as typical, creative, or hallucinated. It is a white-box detection method that needs no human intervention.
Result: CLAWS outperforms five existing white-box detection methods on five 7-8B math reasoning models (DeepSeek, Qwen, and others) and is validated on 4545 problems from 181 math contests.
Conclusion: CLAWS automates creativity assessment in reasoning tasks and offers a scalable, reliable tool for measuring LLM creativity.
Abstract: Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
[8] Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models
Shuodi Liu,Yingzhuo Liu,Zi Wang,Yusheng Wang,Huijia Wu,Liuyu Xiang,Zhaofeng He
Main category: cs.CL
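TL;DR: Proposes Select-Then-Decompose, a task decomposition strategy built as a closed loop of selection, execution, and verification, which achieves an optimal trade-off between performance and cost.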
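This entry evaluates methods by whether they lie on the Pareto frontier of performance versus cost. A minimal sketch of that check follows; the method names, costs, and scores are hypothetical and are not results from the paper.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other point.

    A point is dominated if another point has cost <= its cost and
    score >= its score, with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (method, cost, score) triples for illustration only.
methods = [
    ("direct", 1.0, 0.60),
    ("plan-then-solve", 3.0, 0.72),
    ("iterative", 5.0, 0.70),
    ("adaptive", 2.0, 0.72),
]
frontier = pareto_frontier(methods)
print(frontier)
```

In this toy data the "adaptive" method matches the best score at lower cost, so the two more expensive methods drop off the frontier.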
Details
Motivation: Existing task decomposition methods mostly target performance gains and overlook the trade-off between performance and cost. The paper analyzes the influencing factors systematically and proposes a more efficient strategy.
Method: First surveys task decomposition and identifies six categorization schemes, then empirically analyzes three influencing factors (category of approach, task characteristics, and model configuration), and finally proposes Select-Then-Decompose, which dynamically selects the most suitable decomposition approach and adds a verification module to make results more reliable.
Result: Across multiple benchmarks the strategy consistently lies on the Pareto frontier and balances performance and cost better than existing methods.
Conclusion: By combining task-driven dynamic selection with verification, Select-Then-Decompose improves the efficiency and reliability of LLMs on complex task decomposition.
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that the Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at https://github.com/summervvind/Select-Then-Decompose.
[9] Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs
Yehor Tereshchenko,Mika Hämäläinen
Main category: cs.CL
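TL;DR: Compares several natural language processing approaches to automated toxicity detection in online gaming chats and proposes a hybrid moderation architecture that reduces human moderator workload.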
Details
Motivation: Toxic content in online games harms the user experience, so efficient, low-cost automated detection is needed.
Method: Evaluates traditional machine learning models, LLMs with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG), comparing them on accuracy, processing speed, and computational cost.
Result: Performance differs significantly across methods, and fine-tuned DistilBERT offers the best trade-off between accuracy and cost.
Conclusion: The study provides empirical evidence for deploying efficient, cost-effective content moderation systems in dynamic online gaming environments.
Abstract: This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
[10] Diagnosing Representation Dynamics in NER Model Extension
Xirui Zhang,Philippe de La Chevasnerie,Benoit Fabre
Main category: cs.CL
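TL;DR: Studies what happens when a named entity recognition (NER) model is extended to new PII entities in noisy spoken-language data, finding that semantic and morphological feature mechanisms are independent and that the 'O' tag representation is plastic.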
Details
Motivation: Extending NER models to recognize new PII entities is a common practical need, but it is difficult in noisy spoken-language data, especially when performance on the original entity classes must be preserved.
Method: Jointly fine-tunes a BERT model on standard semantic entities (e.g., PER, LOC, ORG) and pattern-based PII (e.g., EMAIL, PHONE), and uses an incremental learning setup as a diagnostic tool to analyze semantic drift and representation changes.
Result: Most original entity classes lose almost no performance, but LOC is vulnerable because it shares pattern-like features (such as postal codes) with the new PII. The study also identifies a "reverse O-tag representation drift": the 'O' tag classifier must be unfrozen before the patterns it has locked up are released for new learning.
Conclusion: When extended to new entities, NER models show independent feature mechanisms but are strongly affected by representation overlap and by the plasticity of the background class (the 'O' tag). Unfreezing the 'O' tag improves incremental learning.
Abstract: Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this "peaceful coexistence," hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a "reverse O-tag representation drift." The model, initially trained to map PII patterns to 'O', blocks new learning. This is resolved only by unfreezing the 'O' tag's classifier, allowing the background class to adapt and "release" these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and 'O' tag plasticity.
[11] AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
Haoyu Huang,Hong Ting Tsang,Jiaxin Bai,Xi Peng,Gong Zhang,Yangqiu Song
Main category: cs.CL
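TL;DR: Proposes AtlasKV, a parametric knowledge integration method that efficiently and scalably integrates billion-scale knowledge graphs into LLMs, with low GPU memory use and sub-linear time and memory complexity.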
Details
Motivation: Existing retrieval-augmented generation (RAG) methods depend on external retrieval modules and long contexts, which leads to high inference latency and makes large-scale knowledge integration hard to scale efficiently.
Method: AtlasKV combines KG2KV and HiKVP to integrate knowledge graph triples into LLMs parametrically, using the model's own attention mechanism to fuse knowledge without an external retriever or retraining.
Result: AtlasKV integrates billion-scale knowledge graphs in under 20GB of VRAM with sub-linear time and memory complexity, while keeping strong knowledge grounding and generalization.
Conclusion: AtlasKV offers an efficient, scalable parametric approach to knowledge integration for LLMs that overcomes RAG's latency and resource limitations and suits large-scale knowledge that is updated dynamically.
Abstract: Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called AtlasKV, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs' inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.
[12] Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
Stewart Slocum,Julian Minder,Clément Dumas,Henry Sleight,Ryan Greenblatt,Samuel Marks,Rowan Wang
Main category: cs.CL
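TL;DR: Proposes a framework for measuring how deeply LLMs believe implanted knowledge and uses it to evaluate knowledge editing techniques. Synthetic Document Finetuning (SDF) usually implants beliefs that behave like genuine knowledge, but beliefs that contradict basic world knowledge remain brittle.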
Details
Motivation: To determine whether knowledge editing techniques make LLMs genuinely "believe" implanted facts rather than merely memorize them superficially.
Method: Operationalizes belief depth along three dimensions: generalization to related contexts, robustness to self-scrutiny and direct challenge, and representational similarity to genuine knowledge (measured with linear probes). Several knowledge editing methods are evaluated under this framework.
Result: Simple prompting and mechanistic editing fail to implant knowledge deeply. SDF usually implants beliefs that behave like genuine knowledge, but for beliefs that contradict basic world knowledge its effect is limited and the resulting representations differ from genuine knowledge.
Conclusion: The work introduces measurable criteria for belief depth, providing a rigorous basis for evaluating knowledge editing before real-world deployment.
Abstract: Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework to measure belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge 1) generalizes to related contexts (e.g. Fermi estimates several logical steps removed), 2) is robust to self-scrutiny and direct challenge, and 3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF), where models are trained on LLM-generated documents consistent with a fact, often succeeds at implanting beliefs that behave similarly to genuine knowledge. However, SDF's success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge. Overall, our work introduces measurable criteria for belief depth and enables the rigorous evaluation necessary for deploying knowledge editing in real-world applications.
[13] SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
Nishant Subramani,Alfredo Gomez,Mona Diab
Main category: cs.CL
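TL;DR: Proposes SimBA, a three-phase framework (stalk, prowl, pounce) that simplifies the analysis of language model benchmarks, achieving high coverage and accurate performance prediction from only a small set of representative datasets.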
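The abstract of this entry quotes the size of each representative subset both as a fraction and as a percentage. A quick arithmetic check of those figures, using only the counts stated in the abstract:

```python
# (representative datasets, total datasets) as quoted in the abstract.
subsets = {"HELM": (1, 16), "MMLU": (1, 58), "BigBenchLite": (21, 74)}

# Convert each fraction to a percentage, rounded to two decimals.
percentages = {
    name: round(100 * kept / total, 2) for name, (kept, total) in subsets.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}% of datasets")
```

The computed values agree with the abstract's 6.25%, 1.7%, and 28.4% once rounded to one decimal place.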
Details
Motivation: Modern language models are evaluated on large benchmarks that are hard to interpret and especially unhelpful for model selection, so a simpler way to analyze them is needed.
Method: SimBA has three phases: stalk (comparing datasets and models), prowl (discovering a representative subset), and pounce (using the representative subset to predict the performance of new models). The representative set discovery algorithm relies on raw evaluation scores alone.
Result: On HELM, MMLU, and BigBenchLite, using only 6.25%, 1.7%, and 28.4% of the datasets respectively reaches at least 95% coverage, while preserving model ranks and predicting new-model performance with near-zero error.
Conclusion: SimBA can help improve efficiency during model training and help dataset creators judge whether a new dataset differs from existing ones, streamlining the benchmarking process.
Abstract: Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection. Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three-phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset & model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop a representative set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at https://github.com/nishantsubramani/simba.
[14] Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution
Asim Mohamed,Martin Gubri
Main category: cs.CL
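TL;DR: STEAM is a back-translation-based detection method that makes multilingual watermarks more robust in medium- and low-resource languages. It is compatible with existing watermarking methods and improves cross-lingual watermark performance.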
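This entry reports gains in TPR@1%, the true-positive rate when the detection threshold allows at most 1% false positives. A minimal sketch of that metric follows. It assumes higher scores mean "more likely watermarked", and the scores are hypothetical; it is not the paper's evaluation code.

```python
def tpr_at_fpr(neg_scores, pos_scores, max_fpr=0.01):
    """TPR with the threshold set so that at most max_fpr of negatives exceed it."""
    # Number of unwatermarked texts we may wrongly flag.
    allowed_fp = int(max_fpr * len(neg_scores))
    ranked_neg = sorted(neg_scores, reverse=True)
    # A text is flagged only if its score is strictly above the threshold.
    if allowed_fp < len(ranked_neg):
        threshold = ranked_neg[allowed_fp]
    else:
        threshold = float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Hypothetical detector scores: 100 unwatermarked texts, 4 watermarked texts.
unwatermarked = [i / 100 for i in range(100)]
watermarked = [0.5, 0.985, 0.99, 1.2]
tpr = tpr_at_fpr(unwatermarked, watermarked)
print(f"TPR@1%FPR = {tpr:.2f}")
```

A translation attack lowers the watermarked scores toward the unwatermarked range, which shrinks this number; the "+40%p" in the abstract is a gain of 40 percentage points on this scale.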
Details
Motivation: Existing multilingual watermarking methods work well on high-resource languages, but semantic clustering fails in medium- and low-resource languages, so the methods lack true cross-lingual robustness, particularly under translation attacks.
Method: STEAM uses back-translation to restore watermark strength lost through translation. It does not depend on a particular tokenizer, extends to new languages, and requires no change to the original watermarking mechanism.
Result: Across 17 languages, STEAM yields average gains of 0.19 AUC and 40 percentage points in TPR@1%.
Conclusion: STEAM offers a simple, effective route to fairer and more robust cross-lingual tracing of LLM outputs.
Abstract: Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
[15] From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models
Ziyan Wang,Enmao Diao,Qi Le,Pu Wang,Minwoo Lee,Shu-ping Yeh,Evgeny Stupachenko,Hao Feng,Li Yang
Main category: cs.CL
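TL;DR: Proposes GISP, a global iterative structured pruning method that uses loss-driven, structure-level importance scores and an iterative schedule to improve LLM performance on language modeling and decision-style tasks while keeping the pruned model hardware-friendly.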
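To make the idea of loss-driven, structure-level importance with block-wise normalization concrete, here is a small sketch. The scoring rule (sum of |weight x gradient| per structure), the normalization (dividing by the block total), and all names and numbers are assumptions for illustration; the entry does not give GISP's exact formulas.

```python
def structure_scores(blocks):
    """Map {block: {structure: [(weight, grad), ...]}} to normalized scores.

    Raw importance is a first-order estimate, sum(|w * g|), aggregated per
    structure (e.g., an attention head), then normalized within its block.
    """
    scores = {}
    for block, structures in blocks.items():
        raw = {
            name: sum(abs(w * g) for w, g in pairs)
            for name, pairs in structures.items()
        }
        total = sum(raw.values()) or 1.0  # guard against an all-zero block
        for name, value in raw.items():
            scores[(block, name)] = value / total
    return scores


def select_for_pruning(scores, fraction):
    """Pick the globally lowest-scoring fraction of structures."""
    k = int(len(scores) * fraction)
    return sorted(scores, key=scores.get)[:k]


# Hypothetical (weight, gradient) pairs for two heads in each of two layers.
toy = {
    "layer0": {"head0": [(0.5, 0.2), (1.0, 0.1)], "head1": [(0.1, 0.1)]},
    "layer1": {"head0": [(2.0, 0.5)], "head1": [(1.0, 1.0)]},
}
scores = structure_scores(toy)
pruned = select_for_pruning(scores, 0.25)
print(pruned)
```

An iterative schedule would repeat this with a small fraction per round, recomputing gradients on the pruned model each time instead of removing everything in one shot.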
Details
Motivation: Existing local structured pruning methods are mostly task-agnostic: they optimize only layer-wise reconstruction error and cannot exploit task-specific calibration signals, which limits downstream gains.
Method: GISP scores importance with first-order, loss-based weights aggregated at the structure level with block-wise normalization, and removes attention heads and MLP channels globally through an iterative schedule rather than one-shot pruning. It also supports task-specific objectives.
Result: On Llama2, Llama3, and Mistral models, GISP lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity. On DeepSeek-R1-Distill-Llama-3-8B, calibration on GSM8K substantially improves exact-match accuracy.
Conclusion: Through global, iterative, task-aware structured pruning, GISP achieves a better balance between compression and performance and supports an efficient "prune-once, deploy-many" workflow.
Abstract: Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based important weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
[16] Language Models as Semantic Augmenters for Sequential Recommenders
Mahsa Valizadeh,Xiangjue Dong,Rui Tuo,James Caverlee
Main category: cs.CL
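TL;DR: Proposes LaMAR, a framework that uses LLMs in a few-shot setting to automatically enrich user behavior sequences with semantic context, improving sequential modeling performance.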
Details
Motivation: When user behavior sequences lack sufficient semantic context, existing models underperform, so the context needs to be enriched.
Method: LaMAR uses LLMs to generate auxiliary contextual signals from existing metadata, such as usage scenarios and item intents, and merges them into the original sequences.
Result: Integrating LaMAR's generated signals into benchmark sequential modeling tasks consistently improves performance. The generated signals show high semantic novelty and diversity, which strengthens the representational capacity of downstream models.
Conclusion: LaMAR represents a new data-centric paradigm in which LLMs act as intelligent context generators, enabling semi-automatic creation of training data and language resources.
Abstract: Large Language Models (LLMs) excel at capturing latent semantics and contextual relationships across diverse modalities. However, in modeling user behavior from sequential interaction data, performance often suffers when such semantic context is limited or absent. We introduce LaMAR, an LLM-driven semantic enrichment framework designed to enrich such sequences automatically. LaMAR leverages LLMs in a few-shot setting to generate auxiliary contextual signals by inferring latent semantic aspects of a user's intent and item relationships from existing metadata. These generated signals, such as inferred usage scenarios, item intents, or thematic summaries, augment the original sequences with greater contextual depth. We demonstrate the utility of this generated resource by integrating it into benchmark sequential modeling tasks, where it consistently improves performance. Further analysis shows that LLM-generated signals exhibit high semantic novelty and diversity, enhancing the representational capacity of the downstream models. This work represents a new data-centric paradigm where LLMs serve as intelligent context generators, contributing a new method for the semi-automatic creation of training data and language resources.
[17] Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models
Shabnam Ataee,Andrei Popescu-Belis
Main category: cs.CL
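TL;DR: Evaluates how well large language models (LLMs) translate text containing inter-sentential dependencies, testing 12 LLMs on pronominal anaphora and lexical cohesion with the DiscEvalMT benchmark and comparing prompts with and without chain-of-thought reasoning.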
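The abstract of this entry describes a "wise get wiser" effect: the gain from reasoning is positively correlated with a model's score without reasoning. A minimal sketch of how such a correlation could be measured is below. The scores are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical accuracy (%) of four models without and with chain-of-thought.
baseline = [60.0, 70.0, 80.0, 90.0]
with_cot = [61.0, 72.0, 86.0, 97.0]
gains = [c - b for b, c in zip(baseline, with_cot)]

r = pearson(baseline, gains)
print(f"correlation between baseline and gain: {r:.3f}")
```

A value of r near +1 would mean stronger models benefit more from reasoning; a value near 0 or below would contradict the effect.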
Details
Motivation: To study how LLMs handle translation tasks with cross-sentence dependencies, in particular pronominal anaphora and lexical cohesion, as a window onto their contextual understanding.
Method: Uses the English-French DiscEvalMT benchmark with two tasks: distinguishing a correct translation from a wrong but plausible one, and generating a correct translation. LLMs from several families are compared with and without chain-of-thought prompting.
Result: The best models (such as GPT-4, GPT-4o, and Phi) reach about 90% accuracy on the discrimination task and COMET scores of about 92% on the generation task. Chain-of-thought reasoning improves performance, and there is a "wise get wiser" effect: the better a model's baseline, the larger its gain from reasoning.
Conclusion: LLMs handle translation with inter-sentential dependencies well, and chain-of-thought reasoning improves them further, with the benefit of reasoning positively correlated with baseline ability.
Abstract: This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a "wise get wiser" effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
[18] Na Prática, qual IA Entende o Direito? Um Estudo Experimental com IAs Generalistas e uma IA Jurídica
Marina Soares Marinho,Daniela Vianna,Livy Real,Altigran da Silva,Gabriela Migliorini
Main category: cs.CL
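TL;DR: Combining legal theory with an empirical assessment by 48 legal professionals, this study proposes an experimental protocol for evaluating general-purpose AI in the legal domain and finds that the domain-specialized model JusIA outperforms general-purpose models on tasks simulating lawyers' daily work.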
Details
Motivation: Reliably assessing whether general-purpose AI is fit for legal work requires combining legal theory with feedback from practicing legal professionals, so that output quality can be checked.
Method: Applies legal-theory criteria (material correctness, systematic coherence, and argumentative integrity) together with an empirical assessment by 48 legal professionals to test four systems: JusIA, ChatGPT Free, ChatGPT Pro, and Gemini.
Result: JusIA consistently outperformed the general-purpose models across tasks, standing out in legal expertise and output reliability.
Conclusion: Domain-specialized models and theoretically grounded evaluation are both essential for producing reliable legal AI outputs.
Abstract: This study presents the Jusbrasil Study on the Use of General-Purpose AIs in Law, proposing an experimental evaluation protocol combining legal theory, such as material correctness, systematic coherence, and argumentative integrity, with empirical assessment by 48 legal professionals. Four systems (JusIA, ChatGPT Free, ChatGPT Pro, and Gemini) were tested in tasks simulating lawyers' daily work. JusIA, a domain-specialized model, consistently outperformed the general-purpose systems, showing that both domain specialization and a theoretically grounded evaluation are essential for reliable legal AI outputs.
[19] Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment
Patricia Delafuente,Arya Honraopatil,Lara J. Martin
Main category: cs.CL
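TL;DR: This paper studies the use of large language models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, it evaluates the reasoning model DeepSeek-R1-Distill-LLaMA-8B and the instruct model LLaMA-3.1-8B-Instruct on command generation. The results show that specific instructions are critical to model output, that changing even a single sentence in the prompt can significantly affect results, and that the instruct model is sufficient for this task without a dedicated reasoning model.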
Details
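Motivation: To improve automated prediction of player actions in DnD by exploring how LLMs can turn natural-language intent into executable Avrae bot commands, making gameplay smoother and more automated.
Method: Using the FIREBALL dataset, compares two 8B-scale models, a reasoning model (DeepSeek-R1-Distill-LLaMA-8B) and an instruction-tuned model (LLaMA-3.1-8B-Instruct), and analyzes their command-generation performance under different prompt designs.
Result: Prompt specificity significantly affects model output, and small prompt changes lead to different results. The instruct model performs well on this task, so a more complex reasoning model is not required.
Conclusion: For converting DnD player intent into Avrae commands, a carefully designed prompt with an instruct model achieves good results, indicating that in this application an instruct model can stand in for a more complex reasoning model.
Abstract: This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, show that even single-sentence changes in prompts can greatly affect model output, and indicate that instruct models are sufficient for this task compared to reasoning models.
[20] LLMs Encode How Difficult Problems Are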
William Lugoloobi,Chris Russell
Main category: cs.CL
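TL;DR: Human-labeled problem difficulty is strongly linearly decodable from large language model (LLM) representations, and this signal correlates positively with performance during reinforcement learning training, whereas automated difficulty estimates derived from the model's own performance become misaligned as the model improves.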
Details
Motivation: To investigate whether LLMs encode problem difficulty in a way that matches human judgment, and whether this internal representation tracks generalization during reinforcement learning fine-tuning.
Method: Trains linear probes across layers and token positions on 60 models, evaluates them on the mathematical and coding subsets of Easy2HardBench, and analyzes how human-labeled and LLM-derived difficulty differ and how each evolves during GRPO training.
Result: Human-labeled difficulty is strongly linearly decodable (AMC: ρ≈0.88) and strengthens with model size; LLM-derived difficulty is weaker and scales poorly. During GRPO training of Qwen2.5-Math-1.5B, the strength of the human-difficulty probe correlates positively with test accuracy, while the LLM-difficulty probe correlates negatively. Steering along the difficulty direction toward "easier" representations reduces hallucination and improves accuracy.
Conclusion: Human-labeled difficulty provides a stable learning signal for reinforcement learning, whereas automated difficulty estimates based on model performance drift out of alignment as the model is optimized, which suggests relying on external human judgment to guide training.
Abstract: Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
[21] Extracting Rule-based Descriptions of Attention Features in Transformers
Dan Friedman,Adithya Bhaskar,Alexander Wettig,Danqi Chen
Main category: cs.CL
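TL;DR: This paper proposes rule-based descriptions for explaining the behavior of SAE features in Transformer models. By extracting three types of rules (skip-gram, absence, and counting), it reveals feature patterns that are hard to find by inspecting exemplars, and it extracts them automatically from GPT-2 small, offering a new framework for feature interpretation.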
Details
Motivation: Current feature interpretation relies on manual inspection of activating exemplars, which is subjective, tends to miss complex patterns, and lacks systematic rule-based descriptions. This paper aims to establish rule-based feature descriptions that can be extracted automatically and are more interpretable.
Method: Starting from SAE features trained on attention-layer outputs, defines three rule types (skip-gram, absence, and counting) and designs an automated procedure that extracts them from GPT-2 small by analyzing interactions between input and output tokens.
Result: On GPT-2 small, most features are described well by around 100 skip-gram rules; more than a quarter of features show absence rules as early as the first layer; and a few counting rules are also identified.
Conclusion: Rule-based descriptions expose the semantic behavior of neural network features, especially the negation and counting logic that conventional methods struggle to capture, and provide a new direction and a preliminary taxonomy for interpretability research.
Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how they may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form "[Canadian city]... speaks --> English", (2) absence rules of the form "[Montreal]... speaks -/-> English," and (3) counting rules that toggle only when the count of a word exceeds a certain value or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.
[22] Automatic Prompt Generation via Adaptive Selection of Prompting Techniques
Yohei Ikenoue,Hitomi Tashiro,Shigeru Kuroyanagi
Main category: cs.CL
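TL;DR: Proposes a method that adaptively selects prompting techniques and generates high-quality prompts from a task description, without relying on preset templates. It builds a knowledge base linking task clusters to prompting techniques and improves LLM performance on complex tasks.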
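The evaluation in this entry is reported with both arithmetic and harmonic mean scores over the BBEH tasks. As a rough illustration of why the two can disagree, the sketch below uses made-up per-task scores (they are not taken from the paper, and the paper's exact aggregation may differ): two hypothetical methods share the same arithmetic mean, but the harmonic mean drops sharply for the one that nearly fails on some tasks.

```python
from statistics import mean, harmonic_mean

# Hypothetical per-task accuracies (percent) for two prompting methods.
# These numbers are illustrative only and do not come from the paper.
balanced = [40.0, 42.0, 38.0, 40.0]
uneven = [75.0, 75.0, 5.0, 5.0]


def summarize(scores):
    """Return both aggregate scores for a list of per-task scores."""
    return {"arithmetic": mean(scores), "harmonic": harmonic_mean(scores)}


balanced_summary = summarize(balanced)
uneven_summary = summarize(uneven)

if __name__ == "__main__":
    print("balanced:", balanced_summary)
    print("uneven:  ", uneven_summary)
```

Both lists average to the same arithmetic mean, so only the harmonic mean separates them; reporting both guards against a method that looks strong on average while collapsing on a few tasks.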
Details
Motivation: Prompt engineering is critical to LLM effectiveness, but designing good prompts requires specialized knowledge that non-expert users often lack.
Method: Builds a knowledge base that links clusters of semantically similar tasks to their corresponding prompting techniques, matches a user's task description to the most relevant cluster, and dynamically generates a tailored prompt.
Result: On 23 tasks from BIG-Bench Extra Hard, the method outperforms standard prompts and existing automatic prompt-generation tools on both arithmetic and harmonic mean scores.
Conclusion: The work lays a foundation for streamlining and standardizing prompt creation, so that non-expert users can also use LLMs effectively.
Abstract: Prompt engineering is crucial for achieving reliable and effective outputs from large language models (LLMs), but its design requires specialized knowledge of prompting techniques and a deep understanding of target tasks. To address this challenge, we propose a novel method that adaptively selects task-appropriate prompting techniques based on users' abstract task descriptions and automatically generates high-quality prompts without relying on pre-existing templates or frameworks. The proposed method constructs a knowledge base that associates task clusters, characterized by semantic similarity across diverse tasks, with their corresponding prompting techniques. When users input task descriptions, the system assigns them to the most relevant task cluster and dynamically generates prompts by integrating techniques drawn from the knowledge base. An experimental evaluation of the proposed method on 23 tasks from BIG-Bench Extra Hard (BBEH) demonstrates superior performance compared with standard prompts and existing automatic prompt-generation tools, as measured by both arithmetic and harmonic mean scores. This research establishes a foundation for streamlining and standardizing prompt creation, enabling non-experts to effectively leverage LLMs.
[23] CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models
Ritam Upadhyay,Naman Ahuja,Rishabh Baral,Aparna Garimella,Vivek Gupta
Main category: cs.CL
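TL;DR: This paper presents CMT-Bench, a diagnostic benchmark built from live cricket commentary for evaluating the robustness of large language models in dynamic text-to-table generation. Current models prove brittle when extractive cues are removed, when input length grows, and when entity forms change, indicating significant weaknesses in their reasoning.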
Details
Motivation: Existing text-to-table systems depend on heavy prompt engineering or iterative information extraction, which raises scores but is computationally expensive and obscures how models handle temporally evolving narratives. A benchmark that diagnoses models' actual reasoning ability is therefore needed.
Method: Builds CMT-Bench from real cricket match commentary, requiring models to generate tables under two evolving schemas, and evaluates them along three semantics-preserving dimensions: extractive-cue ablation, temporal prefixing, and entity-form perturbation.
Result: Current long-context LLMs drop sharply without extractive summaries, degrade monotonically as input length grows, and are sensitive to entity-form changes. Distributional tests further show significant shifts in numeric error patterns, indicating drift in reasoning.
Conclusion: Current LLMs are brittle in dynamic text-to-table generation, which calls for robustness-first evaluation to drive the development of more efficient and scalable methods.
Abstract: LLM-driven text-to-table (T2T) systems often rely on extensive prompt engineering or iterative event extraction in code-parsable formats, which boost scores but are computationally expensive and obscure how models actually reason over temporally evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, out-of-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context state-of-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drop under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.
[24] Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
Yoshinari Fujinuma
Main category: cs.CL
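TL;DR: This paper studies score range bias in large language models (LLMs) used as evaluators and proposes contrastive decoding to mitigate it, substantially improving correlation with human judgments.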
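Agreement with human judgments in this entry is measured by Spearman correlation. For readers unfamiliar with the metric, here is a minimal pure-Python sketch, assuming ties are handled with average ranks; the function names and the example scores are hypothetical and not from the paper. It shows that Spearman correlation depends only on the ordering of scores, so a judge that preserves the human ordering on a different numeric range still scores 1.0.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five responses (illustrative only).
human = [1, 2, 3, 4, 5]
judge_same_order = [10, 20, 30, 40, 50]  # different range, same ordering
judge_with_ties = [2, 2, 3, 4, 4]        # coarser judge that ties some items

if __name__ == "__main__":
    print(spearman(human, judge_same_order))
    print(spearman(human, judge_with_ties))
```

Because the metric is rank-based, a score range only hurts it when the range changes how the judge orders or ties the responses, which is the kind of sensitivity the entry describes.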
Details
Motivation: LLMs show score range bias in reference-free direct assessment, which undermines evaluation reliability, so the bias needs to be characterized and addressed.
Method: Analyzes score range bias in LLM judge outputs, proposes contrastive decoding to mitigate it, and measures Spearman correlation with human scores across different score ranges.
Result: Contrastive decoding yields up to 11.3% relative improvement on average in Spearman correlation across different score ranges.
Conclusion: Contrastive decoding effectively mitigates score range bias in LLM judges and makes evaluation results more reliable.
Abstract: Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
[25] MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives
Sriharsh Bhyravajjula,Ujwal Narayan,Manish Shrivastava
Main category: cs.CL
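TL;DR: This paper presents MARCUS, an NLP pipeline that automatically generates event-centric, relation-based character arcs from narrative text. It extracts events, characters, emotion, and sentiment to model inter-character relations and visualizes them as character arc plots.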
Details
Motivation: Character arcs are an important tool in literary analysis but lack a computable, quantitative representation. An automated way to generate them is needed so that arcs can be compared and applied across works.
Method: The MARCUS NLP pipeline extracts events, participant characters, implied emotion, and sentiment from a narrative, tracks and aggregates changes in inter-character relations, and produces graphical character arcs.
Result: Character arcs were generated for two fantasy series, Harry Potter and Lord of the Rings, and the approach was evaluated, showing its effectiveness and potential applications.
Conclusion: MARCUS provides a quantifiable computational model of character arcs, improves understanding of narrative structure, and has broad potential applications such as literary analysis and content recommendation.
Abstract: Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.
[26] DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization
Tao Tao,Guanghui Zhu,Lang Guo,Hongyi Chen,Chunfeng Yuan,Yihua Huang
Main category: cs.CL
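TL;DR: Proposes DelvePO, a task-agnostic, self-evolving prompt optimization framework that decouples prompt components and introduces a working memory mechanism, improving the effectiveness and stability of prompt optimization across tasks.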
Details
Motivation: Existing prompt optimization methods easily fall into local optima, perform unstably, and lack generality and transferability across tasks.
Method: Decouples prompts into multiple components and combines a direction-guided self-evolving mechanism with a working memory module, so that the LLM can reflect on its own outputs and generate better prompts.
Result: DelvePO is validated on a range of open- and closed-source LLMs, where it consistently outperforms previous SOTA methods under identical settings and generalizes well.
Conclusion: DelvePO provides a flexible, stable, and transferable prompt optimization approach that significantly improves LLM performance in multi-task settings.
Abstract: Prompt Optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focuses on specific influencing factors, which makes it easy to fall into local optima. Besides, the performance of the optimized prompt is often unstable, which limits its transferability in different tasks. To address the above challenges, we propose $\textbf{DelvePO}$ ($\textbf{D}$irection-Guid$\textbf{e}$d Se$\textbf{l}$f-E$\textbf{v}$olving Framework for Fl$\textbf{e}$xible $\textbf{P}$rompt $\textbf{O}$ptimization), a task-agnostic framework to optimize prompts in a self-evolving manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks. On this basis, we introduce working memory, through which LLMs can alleviate the deficiencies caused by their own uncertainties and further obtain key insights to guide the generation of new prompts. Extensive experiments are conducted on different tasks covering various domains for both open- and closed-source LLMs, including DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct and GPT-4o-mini. Experimental results show that DelvePO consistently outperforms previous SOTA methods under identical experimental settings, demonstrating its effectiveness and transferability across different tasks.
[27] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li,Zixuan Lan,Jiawei Zhou
Main category: cs.CL
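TL;DR: This paper proposes rendering text as an image and feeding it to the model to reduce the number of tokens an LLM consumes while preserving task performance. Experiments show that long text inputs can be compressed substantially without loss of quality.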
Details
Motivation: To explore whether representing text visually can lower the computational cost of processing long text with LLMs, in particular by reducing token consumption.
Method: Renders a long text as a single image and feeds it to a multimodal LLM, relying on the model's visual understanding to process it.
Result: Achieves token savings of nearly half on two benchmarks, RULER and CNN/DailyMail, without degrading task performance.
Conclusion: Text-as-image input is an effective and practical form of input compression, particularly for decoder LLMs in long-context settings.
Abstract: Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to a dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
[28] BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks
Tianyuan Huang,Zepeng Zhu,Hangdi Xing,Zirui Shao,Zhi Yu,Chaoxiong Yang,Jiaxian He,Xiaozhong Liu,Jiajun Bu
Main category: cs.CL
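TL;DR: This paper presents datasets and methods for Braille information processing. It builds English and Chinese Braille mixed datasets and introduces syntax tree-based data augmentation and Braille Knowledge-Based Fine-Tuning (BKFT), improving Braille translation and formula conversion.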
Details
Motivation: Braille is vital to education and information access for visually impaired people, but Braille information processing suffers from data scarcity and ambiguity in mixed text, so better processing capability is needed.
Method: Constructs English and Chinese Braille Mixed Datasets (EBMD/CBMD) that include mathematical formulas, proposes a syntax tree-based augmentation method for Braille data, and designs Braille Knowledge-Based Fine-Tuning (BKFT), applied via instruction tuning to unify Braille translation and formula conversion.
Result: BKFT significantly outperforms conventional fine-tuning on Braille translation, and the approach performs well on Braille translation, formula-to-Braille conversion, and mixed-text translation.
Conclusion: BKFT and the open-sourced datasets provide a solid foundation and technical support for low-resource multilingual Braille research.
Abstract: Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.
[29] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata
Zhengqing Yuan,Yiyang Li,Weixiang Sun,Zheyuan Zhang,Kaiwen Shi,Keerthiram Murugesan,Yanfang Ye
Main category: cs.CL
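TL;DR: This paper presents Food4All, the first multi-agent framework for real-time, context-aware free food retrieval. It aggregates heterogeneous data, uses reinforcement learning to optimize accessibility and nutritional fit, and adds an online feedback mechanism to make services for food-insecure populations more efficient and equitable.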
Details
Motivation: Current food assistance systems suffer from fragmented information, imprecise recommendations, and neglect of the real constraints faced by vulnerable groups, leaving food-insecure people unable to reach urgently needed resources.
Method: Food4All uses a multi-agent architecture that aggregates heterogeneous data from official databases, community platforms, and social media; applies a lightweight reinforcement learning algorithm to optimize geographic accessibility and nutritional correctness; and adapts its retrieval policy dynamically through an online feedback loop.
Result: The framework delivers continuously updated food resources with nutritional annotation and context-adapted guidance, bridging semantic analysis and decision support and improving resource matching and user experience.
Conclusion: Food4All offers a scalable, equitable, and intelligent approach to food insecurity and is an important step toward intelligent public health support systems.
Abstract: Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.
[30] From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering
Lei Li,Xiao Zhou,Yingying Zhang,Xian Wu
Main category: cs.CL
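TL;DR: This paper proposes MedRGAG, a unified retrieval-generation augmented framework for medical question answering that combines externally retrieved knowledge with knowledge generated by the model to improve answer accuracy and reliability.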
Details
Motivation: In medical question answering, Retrieval-Augmented Generation (RAG) is vulnerable to noisy or incomplete retrieval, while Generation-Augmented Generation (GAG) is prone to hallucinated or inaccurate information. A method that fuses external and parametric knowledge is needed to make reasoning more reliable.
Method: MedRGAG has two modules: Knowledge-Guided Context Completion (KGCC), which uses retrieval results to guide the generation of background documents that supply missing knowledge, and Knowledge-Aware Document Selection (KADS), which adaptively selects the best combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation.
Result: On five medical QA benchmarks, MedRGAG improves over MedRAG by 12.5% and over MedGENIE by 4.5%, confirming its effectiveness for knowledge-intensive reasoning.
Conclusion: By integrating external retrieval with internally generated knowledge, MedRGAG mitigates the noisy-retrieval and hallucination problems of earlier methods and significantly improves the performance and reliability of medical QA systems.
Abstract: Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the model's internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at https://anonymous.4open.science/r/MedRGAG
[31] ECG-LLM -- training and evaluation of domain-specific large language models for electrocardiography
Lara Ahrens,Wilhelm Haverkamp,Nils Strodthoff
Main category: cs.CL
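TL;DR: This study examines open-weight large language models adapted to electrocardiography through finetuning and retrieval-augmented generation (RAG). Finetuned Llama 3.1 70B performs best on several metrics, but human experts prefer Claude 3.7 and RAG for complex queries. Overall, domain adaptation reaches performance competitive with proprietary models, supporting the viability of privacy-preserving, locally deployed clinical solutions.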
Details
Motivation: To explore domain adaptation strategies, evaluation methods, and the performance of open-weight LLMs relative to general-purpose models in a medical domain (electrocardiography), in support of privacy-preserving local deployment.
Method: Finetunes open-weight LLMs on domain-specific literature and builds a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and the general-purpose model Claude Sonnet 3.7 across several tasks.
Result: Finetuned Llama 3.1 70B performs best on multiple-choice questions and automatic text metrics and ranks second in LLM-as-a-judge assessments; human experts favor Claude 3.7 and RAG for complex queries; finetuned models generally outperform their base models.
Conclusion: Although performance varies across evaluation methods, domain adaptation through finetuning and RAG reaches performance comparable to proprietary models, supporting the viability of privacy-preserving, locally deployed clinical language models.
Abstract: Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.
[32] Combining Distantly Supervised Models with In Context Learning for Monolingual and Cross-Lingual Relation Extraction
Vipul Rathore,Malik Hammad Faisal,Parag Singla,Mausam
Main category: cs.CL
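TL;DR: This paper proposes HYDRE, a hybrid distantly supervised relation extraction framework that combines a task-specific model with in-context learning in large language models and significantly outperforms existing methods on English and several low-resource Indic languages.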
Details
Motivation: Existing distantly supervised relation extraction (DSRE) models rely on task-specific training and struggle to learn relation semantics correctly from noisy annotations, especially in low-resource languages, so a method that makes effective use of in-context learning with LLMs is needed.
Method: HYDRE first uses a trained DSRE model to select the top-k candidate relations for a test sentence, then applies a dynamic exemplar retrieval strategy to extract reliable sentence-level exemplars from the training data and places them in the LLM prompt to output the final relation(s). The framework is also extended to cross-lingual settings.
Result: On English and four low-resource Indic languages (Oriya, Santali, Manipuri, and Tulu), HYDRE gains up to 20 F1 points in English and 17 F1 points on average on the Indic languages over prior state-of-the-art DSRE models; ablations confirm its effectiveness.
Conclusion: HYDRE effectively combines conventional DSRE models with the in-context learning ability of LLMs, significantly improving relation extraction under noisy supervision and showing strong potential for low-resource languages.
Abstract: Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly, due to noisy annotation. In response, we propose HYDRE -- HYbrid Distantly Supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable, sentence-level exemplars from training data, which are then provided in the LLM prompt for outputting the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages -- Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1 point gains in English and, on average, 17 F1 points on Indic languages over prior SoTA DSRE models. Detailed ablations exhibit HYDRE's efficacy compared to other prompting strategies.
[33] KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers
Mohd Ruhul Ameen,Akif Islam,Farjana Aktar,M. Saifuzzaman Rafat
Main category: cs.CL
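TL;DR: KrishokBondhu is a voice-based agricultural advisory platform built on a Retrieval-Augmented Generation (RAG) framework and designed for Bengali-speaking farmers. It delivers real-time, context-aware agricultural guidance over the phone and significantly outperforms an existing benchmark in response quality.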
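The abstract for this entry reports a composite score of 4.53 against 3.13 for the KisanQRS benchmark and describes it as a 44.7% improvement. As a quick worked check, assuming the figure is a relative improvement over the baseline score (the helper name below is hypothetical), the arithmetic can be reproduced as follows.

```python
def relative_improvement(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Composite scores on a 5-point scale, as quoted in the abstract.
krishokbondhu_score = 4.53
kisanqrs_score = 3.13

composite_gain = relative_improvement(krishokbondhu_score, kisanqrs_score)

if __name__ == "__main__":
    print(f"{composite_gain:.1f}%")
```

Under that assumption the computed gain rounds to the reported 44.7%, so the percentage is relative to the baseline score rather than to the 5-point scale.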
Details
Motivation: To address the difficulty Bangladeshi farmers have in getting timely, expert-level agricultural guidance, especially farmers in remote areas who lack digital skills or internet access.
Method: Builds a voice platform integrated with a call centre. Authoritative agricultural literature is digitized with OCR and document parsing and indexed in a vector database for semantic retrieval. Farmers ask questions in Bengali by phone; the system applies speech recognition, RAG retrieval, and a Gemma large language model to generate an answer, then returns it through speech synthesis.
Result: In a pilot evaluation, 72.7% of agricultural queries received high-quality responses. Compared with the KisanQRS benchmark, the composite score rose from 3.13 to 4.53 (a 44.7% improvement), with gains of 367% in contextual richness and 100.4% in completeness, while relevance and technical specificity remained comparable.
Conclusion: KrishokBondhu shows that combining voice interaction, call-centre accessibility, and RAG can deliver high-quality agricultural advice to remote farmers and offers a feasible path toward a fully AI-driven agricultural advisory ecosystem.
Abstract: In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
[34] KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
Donghyeon Ko,Yeguk Jin,Kyubyung Chae,Byungwook Lee,Chansong Jo,Sookyo In,Jaehong Lee,Taesup Kim,Donghyun Kwak
Main category: cs.CL
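TL;DR: This paper presents KoSimpleQA, a Korean factuality benchmark focused on Korean cultural knowledge, consisting of 1,000 short fact-seeking questions with unambiguous answers. Existing LLMs that support Korean perform poorly on it: the strongest model reaches only 33.7% accuracy, and model rankings differ substantially from those on the English SimpleQA, which shows the dataset's distinct value. The study also finds that engaging reasoning helps models better elicit latent knowledge and abstain more reliably when uncertain.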
Details
Motivation: Existing LLM benchmarks are mostly in English, and there is no high-quality, easy-to-grade benchmark covering Korean, and in particular Korean cultural knowledge, so they cannot accurately reflect model performance on localized factual tasks. A factual QA benchmark grounded in Korean culture is needed to fill this gap.
Method: Builds KoSimpleQA, a Korean factual QA benchmark of 1,000 short questions with unambiguous answers covering Korean cultural knowledge; systematically evaluates a range of open-source LLMs that support Korean; and analyzes how engaging reasoning affects fact elicitation and abstention.
Result: Even the strongest open-source LLM reaches only 33.7% accuracy on KoSimpleQA, showing that the benchmark is difficult; rankings on KoSimpleQA differ substantially from those on the English SimpleQA, confirming its distinctiveness; and engaging reasoning improves both knowledge elicitation and abstention under uncertainty.
Conclusion: KoSimpleQA is a challenging Korean factuality benchmark that effectively evaluates LLMs on Korean cultural knowledge, compensates for the English-dominated evaluation landscape, and provides a useful tool for multilingual and localized factuality evaluation.
Abstract: We present $\textbf{Korean SimpleQA (KoSimpleQA)}$, a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
[35] Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning
Monorama Swain,Bubai Maji,Jagabandhu Mishra,Markus Schedl,Anders Søgaard,Jesper Rindom Jensen
Main category: cs.CL
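TL;DR: Proposes a finetuning method that incorporates fairness-driven objectives to reduce recognition-error disparities across accent groups in English ASR, markedly improving cross-accent fairness.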
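As a rough illustration of one of the fairness objectives named in this entry, the sketch below shows the exponentiated-gradient group reweighting commonly used in Group-DRO: groups with a higher loss receive more weight in the next step. The accent-group names, loss values, and step size `eta` are made-up placeholders, and how the paper actually fuses this objective with ERM, SD, and IRM is not specified here.

```python
import math

def group_dro_step(weights, group_losses, eta=0.5):
    """One Group-DRO weight update: upweight groups with higher loss."""
    scaled = {g: weights[g] * math.exp(eta * group_losses[g]) for g in weights}
    total = sum(scaled.values())
    new_weights = {g: w / total for g, w in scaled.items()}
    # Robust objective: weighted sum of per-group losses.
    robust_loss = sum(new_weights[g] * group_losses[g] for g in new_weights)
    return new_weights, robust_loss

# Hypothetical per-accent-group losses (illustrative values only).
losses = {"accent_a": 0.2, "accent_b": 0.9, "accent_c": 0.4}
weights = {g: 1.0 / len(losses) for g in losses}
weights, robust = group_dro_step(weights, losses)
mean_loss = sum(losses.values()) / len(losses)
```

Because the worst-performing group gains weight, the robust loss sits above the plain average, which is what pushes training toward closing gaps between groups.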
Details
Motivation: Existing ASR models perform very differently across second-language accents, revealing a clear fairness problem.
Method: Fairness-prompted finetuning with lightweight adapters, fusing Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM) with conventional empirical risk minimization (ERM).
Result: In macro-averaged word error rate, the approach achieves relative improvements of 58.7% and 58.5% over the pretrained Whisper and Seamless-M4T models respectively, and outperforms standard ERM finetuning.
Conclusion: The method narrows the performance gap of ASR systems across accent groups and improves fairness while maintaining overall recognition accuracy.
Abstract: In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T, and of 9.7% and 7.8% over the same models finetuned with standard empirical risk minimization with cross-entropy loss.
[36] MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models
ChangSu Choi,Hoyun Song,Dongyeon Kim,WooHyeon Jung,Minkyung Cho,Sunjin Park,NohHyeob Bae,Seona Yu,KyungTae Lim
Main category: cs.CL
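TL;DR: Proposes MENTOR, a framework that combines reinforcement learning with teacher-guided distillation to improve the cross-domain generalization and strategic competence of small language models.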
Details
Motivation: Existing supervised fine-tuning generalizes poorly, while standard sparse-reward reinforcement learning fails to guide small language models effectively, leading to inefficient exploration and suboptimal strategies.
Method: MENTOR uses reinforcement learning to explore and learn a more general policy, and builds a dense composite reward from a teacher's reference trajectories to provide fine-grained guidance.
Result: Experiments show that MENTOR significantly outperforms supervised fine-tuning and standard sparse-reward RL baselines in cross-domain generalization and strategic competence.
Conclusion: By combining reinforcement learning with a teacher-guided reward, MENTOR improves both the distillation of tool-use capability into small language models and their generalization.
Abstract: Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
[37] Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
Siyuan Yan,Guo-Qing Jiang,Yuchen Zhang,Xiaoxing Ma,Ran Zhu,Chun Cao,Jingwei Xu
Main category: cs.CL
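TL;DR: Adamas is a lightweight, highly accurate sparse attention mechanism. It builds compact representations with the Hadamard transform, bucketization, and 2-bit compression, and uses Manhattan-distance estimation for efficient top-k selection. In long-context inference it supports up to 8x higher sparsity than prior state-of-the-art methods while keeping perplexity comparable to, or even lower than, full attention.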
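To make the pipeline in this summary more concrete, here is a minimal pure-Python sketch of the general idea: transform vectors with a Hadamard transform, bucketize each coordinate into a 2-bit code, and rank keys by Manhattan distance between codes. The bucket edges, the helper names, and the assumption that a smaller code distance means a more relevant key are illustrative choices, not the authors' implementation.

```python
def hadamard(vec):
    """Fast Walsh-Hadamard transform (unnormalized); length must be a power of 2."""
    out = list(vec)
    n = len(out)
    assert n and n & (n - 1) == 0
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return out

def to_2bit_codes(vec, edges=(-1.0, 0.0, 1.0)):
    """Bucketize each transformed value into one of 4 buckets (a 2-bit code 0..3)."""
    return [sum(x > e for e in edges) for x in hadamard(vec)]

def manhattan(a, b):
    """Sum of absolute differences between two code vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def select_top_k(query, keys, k):
    """Return indices of the k keys whose codes are closest to the query's codes."""
    q_code = to_2bit_codes(query)
    dists = [(manhattan(q_code, to_2bit_codes(key)), idx) for idx, key in enumerate(keys)]
    return [idx for _, idx in sorted(dists)[:k]]

query = [1.0, 0.5, -0.5, 0.0]
keys = [[-1.0, -0.5, 0.5, 0.0], [1.0, 0.5, -0.5, 0.0], [0.9, 0.6, -0.4, 0.1]]
selected = select_top_k(query, keys, k=2)
```

In a real system the key codes would be computed once and cached, so only the cheap integer distance is evaluated per query before full attention runs on the selected keys.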
Details
Motivation: Existing sparse attention methods rely on heuristic patterns that struggle to recall the critical key-value pairs for each query, which degrades accuracy in long-context inference.
Method: Adamas applies the Hadamard transform, bucketization, and 2-bit compression to produce compact representations, and uses Manhattan-distance estimation for efficient top-k selection, improving both the efficiency and the accuracy of sparse attention.
Result: Adamas matches full-attention accuracy with a budget of only 64 tokens, is near-lossless at 128 tokens, supports up to 8x higher sparsity than prior SOTA methods, and delivers up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences.
Conclusion: Adamas maintains, and in some cases exceeds, the accuracy of full attention while substantially reducing compute cost, making it an effective option for efficient long-context inference.
Abstract: Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
[38] Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response
Qingqing Gu,Dan Wang,Yue Zhao,Xiaoyu Wang,Zhonglin Jiang,Yong Chen,Hongyan Li,Luo Ji
Main category: cs.CL
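TL;DR: Proposes Chain of Conceptual Thought (CoCT), a new prompt-based paradigm in which the model first tags a concept and then generates content, improving LLM performance on open-domain tasks.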
Details
Motivation: Chain-of-Thought (CoT) has limited effect on open-domain tasks because they lack clearly defined reasoning steps and logical transitions, so a prompting paradigm better suited to such tasks is needed.
Method: CoCT has the LLM first identify concepts such as emotions, strategies, and topics, then generate content based on them, and allows a chain of concepts within a dialogue turn to encourage deep and strategic thinking.
Result: In experiments on daily conversation and emotional support scenarios, automatic, human, and model-based evaluations all show CoCT outperforming baselines such as Self-Refine, ECoT, ToT, SoT, and RAG.
Conclusion: CoCT is an effective new prompting paradigm that can extend the use of LLMs to a broader range of tasks.
Abstract: Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM's deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations where the concepts comprise emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, pointing to a potentially effective prompt-based paradigm for LLMs across a wider scope of tasks.
[39] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Yasser Hamidullah,Koel Dutta Chowdury,Yusser Al-Ghussin,Shakib Yazdani,Cennet Oguz,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
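TL;DR: This paper proposes a token-level reliability measure, based on how much visual information the model uses, for detecting hallucinations in sign language translation. The measure combines feature sensitivity with counterfactual signals, and its effectiveness and generalization are validated on multiple benchmarks.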
Details
Motivation: Sign language translation depends on precise grounding in video, and models without gloss supervision tend to hallucinate by falling back on language priors, so the degree to which a model relies on visual input needs to be quantified in order to detect hallucinations.
Method: A token-level reliability measure that combines feature sensitivity (internal changes when the video is masked) with counterfactual signals (probability differences between clean and altered video inputs), aggregated into a sentence-level score.
Result: On PHOENIX-2014T and CSL-Daily, the reliability score predicts hallucination rates, is stable across datasets and architectures, and drops under visual degradation; combining it with text-based signals further improves risk estimation.
Conclusion: The proposed reliability measure is a practical, reusable tool for diagnosing hallucinations in sign language translation and lays groundwork for hallucination detection in multimodal generation.
Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
[40] Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
Atharvan Dogra,Soumya Suvra Ghosal,Ameet Deshpande,Ashwin Kalyan,Dinesh Manocha
Main category: cs.CL
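TL;DR: Using humor generation as a testbed, this study evaluates how optimizing for funniness in modern LLMs couples with harmful content. Harmful outputs tend to receive higher humor scores, role-based prompting worsens the bias, and the results reveal a bias amplification loop between generators and evaluators.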
Details
Motivation: As LLMs are widely used for creative writing and content generation, the safety of their outputs is a growing concern, especially the stereotypes and toxicity that can hide in highly subjective tasks such as humor.
Method: The study jointly measures humor, stereotypicality, and toxicity, adds an information-theoretic analysis of incongruity signals, evaluates six models, and validates the findings against human perception on a satire-generation task.
Result: Harmful content often receives higher humor scores, and role-based prompting amplifies the effect. Information-theoretic analysis shows that harmful cues widen predictive uncertainty and that some models even find harmful punchlines more expected. Human evaluation indicates that LLM-generated satire increases stereotypicality and toxicity. Quantitatively, stereotypical/toxic jokes score 10-21% higher in humor and appear markedly more often among jokes judged funny.
Conclusion: Current LLMs carry a systematic risk of drifting toward harmful content when optimizing for funniness, reflecting harm that is structurally embedded in learned humor distributions; stronger safety constraints are needed in model design and evaluation.
Abstract: Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human perceived funniness judgments shows that LLM satire increases stereotypicality and typically toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10-21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by an LLM-based metric and up to 10% more often in generations perceived as funny by humans.
[41] ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
Liyang He,Yuren Zhang,Ziwei Zhu,Zhenghui Li,Shiwei Tong
Main category: cs.CL
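TL;DR: Proposes ChronoPlay, a framework for automatically generating retrieval-augmented generation (RAG) benchmarks for the dynamic gaming domain, addressing the dual dynamics of changing game content and shifting player focus.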
Details
Motivation: RAG systems lack a dedicated benchmark for dynamic domains such as online games, which hinders standardized evaluation. The main difficulties are the dual dynamics of game content updates and shifting player-community focus, and the need to generate realistic player questions automatically.
Method: ChronoPlay uses a dual-dynamic update mechanism to track changes in both game content and player focus, and a dual-source synthesis engine that draws on official sources and player-community data to ensure factual correctness and authentic queries.
Result: The framework is instantiated on three different games, producing the first dynamic RAG benchmark for the gaming domain and revealing how models perform under complex, realistic conditions.
Conclusion: ChronoPlay enables automated, continuous generation of dynamic RAG benchmarks with player-centric authenticity, offering a new standard for evaluating RAG systems in dynamic environments.
Abstract: Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is available at: https://github.com/hly1998/ChronoPlay.
[42] DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
Xiangyu Hong,Che Jiang,Kai Tian,Biqing Qi,Youbang Sun,Ning Ding,Bowen Zhou
Main category: cs.CL
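TL;DR: DePass is a unified feature attribution framework based on a single decomposed forward pass, intended for mechanistic interpretability analysis of Transformer models.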
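The toy example below illustrates one property that a decomposed forward pass of this kind can rely on: once attention weights are frozen, token mixing is linear, so a hidden state split into additive components can be propagated component by component and summed afterwards. The matrices and helper names are hypothetical, and this sketch covers only the frozen-attention step, not the MLP handling or the authors' actual procedure.

```python
def apply_frozen_attention(attn, values):
    """Mix per-token vectors with fixed weights: out[t] = sum_s attn[t][s] * values[s]."""
    dim = len(values[0])
    return [
        [sum(attn[t][s] * values[s][d] for s in range(len(values))) for d in range(dim)]
        for t in range(len(attn))
    ]

def add(a, b):
    """Element-wise sum of two lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical hidden states, split into two additive components per token.
comp_a = [[1.0, 0.0], [0.5, 2.0]]
comp_b = [[0.0, 3.0], [1.5, -1.0]]
hidden = add(comp_a, comp_b)
attn = [[1.0, 0.0], [0.25, 0.75]]  # frozen (already computed) causal attention weights

# Propagating the whole state equals propagating each component and summing.
full = apply_frozen_attention(attn, hidden)
parts = add(apply_frozen_attention(attn, comp_a), apply_frozen_attention(attn, comp_b))
```

Each entry of `parts` can then be read as that component's share of the output, which is the kind of attribution the summary describes.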
Details
Motivation: Understanding how a Transformer's internal computations shape its behavior requires a method that attributes feature contributions precisely.
Method: DePass decomposes hidden states into customized additive components and propagates them with attention scores and MLP activations held fixed, yielding faithful, fine-grained attribution without additional training.
Result: DePass is validated for effectiveness and fidelity on token-level, model component-level, and subspace-level attribution tasks, showing its potential for attributing information flow between different Transformer components.
Conclusion: DePass could become a foundational tool for interpretability research and apply to a broader range of use cases.
Abstract: Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
[43] CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning
Masato Kikuchi,Masatsugu Ono,Toshioki Soga,Tetsu Tanabe,Tadachika Ozono
Main category: cs.CL
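TL;DR: This paper combines WordNet with the Common European Framework of Reference for Languages (CEFR), using a large language model to annotate word-sense difficulty levels automatically, and builds a large-scale corpus and contextual lexical classifiers. Experiments show annotation quality close to human gold standards, and the released resources help connect language education with natural language processing.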
Details
Motivation: WordNet's fine-grained sense distinctions are challenging for second-language learners, so it needs to be combined with proficiency levels such as CEFR to be more useful in language teaching.
Method: A large language model measures the semantic similarity between WordNet sense definitions and English Vocabulary Profile (EVP) entries to annotate WordNet with CEFR levels automatically. A large-scale corpus containing sense and CEFR-level information is then built and used to train contextual lexical classifiers.
Result: Models fine-tuned on the corpus perform close to models trained on gold-standard annotations; combined with gold-standard data, the classifier reaches a Macro-F1 of 0.81, supporting the accuracy of the annotations.
Conclusion: The automatic annotation method is effective, and the annotated WordNet, corpus, and classifiers are publicly released to help bridge natural language processing and language education and to support more efficient language learning.
Abstract: Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.
[44] IMB: An Italian Medical Benchmark for Question Answering
Antonio Romano,Giuseppe Riccio,Mariano Barone,Marco Postiglione,Vincenzo Moscato
Main category: cs.CL
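TL;DR: This paper presents two Italian medical benchmark datasets, IMB-QA and IMB-MCQA, to advance multilingual medical question answering, and finds that domain adaptation and retrieval-augmented generation are more effective than simply scaling up models.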
Details
Motivation: Online medical forums hold large amounts of valuable patient-doctor dialogue, but their informality and linguistic complexity challenge automated question answering, especially in non-English languages. High-quality localized benchmarks and effective processing methods are needed.
Method: The authors build two large Italian medical datasets, IMB-QA (about 780,000 patient-doctor conversations) and IMB-MCQA (about 25,000 multiple-choice questions), use LLMs to improve the clarity and consistency of the data, and evaluate several LLM architectures on open and multiple-choice QA, with retrieval-augmented generation (RAG) and domain fine-tuning as optimization strategies.
Result: Models with domain-specific adaptation (such as RAG and fine-tuning) outperform larger general-purpose models on medical QA, indicating that domain expertise and efficient retrieval are key to performance.
Conclusion: For medical QA systems, incorporating domain knowledge and retrieval augmentation is a more effective path than scaling model size alone; the datasets and evaluation framework are released to support multilingual medical AI research.
Abstract: Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: IMB-QA, containing 782,644 patient-doctor conversations from 77 medical categories, and IMB-MCQA, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.
[45] DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP
Mariano Barone,Antonio Laudante,Giuseppe Riccio,Antonio Romano,Marco Postiglione,Vincenzo Moscato
Main category: cs.CL
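TL;DR: This paper introduces DART, the first structured corpus of Italian Summaries of Product Characteristics, extracted from official documents of the Italian Medicines Agency, aiming to fill the gap in non-English pharmacological text resources.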
Details
Motivation: Research on pharmacological knowledge extraction relies mainly on English corpora such as DrugBank and lacks resources for healthcare systems in other languages, which limits AI-assisted clinical decision support in multilingual settings.
Method: DART is built with a reproducible pipeline: web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding.
Result: DART provides structured data on key pharmacological information such as indications, adverse drug reactions, and drug-drug interactions. An LLM-based drug interaction checker built on the dataset accurately infers potential clinically relevant interactions.
Conclusion: DART is an important resource for analyzing Italian pharmacological text and supports the use of LLMs for clinical reasoning grounded in structured regulatory text; the project code is publicly released.
Abstract: The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: https://github.com/PRAISELab-PicusLab/DART.
[46] How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Han Peng,Peiyu Liu,Zican Dong,Daixuan Cheng,Junyi Li,Yiru Tang,Shuo Wang,Wayne Xin Zhao
Main category: cs.CL
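TL;DR: This paper systematically studies the efficiency of diffusion language models (DLMs), finding that current open-source DLMs generally lag behind autoregressive (AR) models in practical speed and that existing evaluation methods have shortcomings. Empirical benchmarking and a roofline-based theoretical analysis show that AR models usually achieve higher throughput; acceleration techniques such as dual cache and parallel decoding help mainly at small batch sizes, with diminishing gains as batch size grows.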
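For readers unfamiliar with the roofline model mentioned here, the following sketch shows its basic formula: attainable throughput is capped by either peak compute or memory bandwidth times arithmetic intensity. The hardware numbers are hypothetical round figures chosen for illustration, not values from the paper.

```python
def roofline_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s = min(compute roof, bandwidth * FLOPs per byte moved)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK, BW = 100e12, 1e12
ridge = PEAK / BW  # intensity (FLOPs/byte) at which a kernel becomes compute-bound

low = roofline_flops(PEAK, BW, 2.0)     # memory-bound regime
high = roofline_flops(PEAK, BW, 500.0)  # compute-bound regime
```

Under this model, a decoding scheme whose gains fade as batch size grows is typical of a workload that has crossed the ridge point and become compute-bound.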
Details
Motivation: Although DLMs are seen as a promising alternative to AR models because of their potential for parallel decoding, their practical speed is poor, which limits their usefulness; a systematic study of their efficiency bottlenecks is needed.
Method: The study combines empirical benchmarking with a roofline-based theoretical analysis to evaluate the inference efficiency of DLMs and AR models under different conditions, and analyzes the effectiveness of several acceleration strategies.
Result: Current DLMs generally have lower throughput than AR models; acceleration techniques such as dual cache and parallel decoding are effective mainly at small batch sizes, with gains dropping markedly as batch size increases.
Conclusion: Existing DLMs still trail AR models in practical efficiency; better evaluation methods and more effective acceleration strategies are needed to advance the field.
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelizable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
[47] Identity-Aware Large Language Models require Cultural Reasoning
Alistair Plum,Anne-Marie Lutgen,Christoph Purschke,Achim Rettinger
Main category: cs.CL
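TL;DR: This paper highlights the weakness of large language models in cross-cultural reasoning, noting that models tend to default to Western values and adapt poorly to diverse cultures. The authors argue that cultural reasoning should be treated as a foundational capability on par with factual accuracy and linguistic coherence, and call for new evaluation methods to improve the cultural sensitivity of AI.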
Details
Motivation: LLM responses often reflect a single cultural viewpoint and overlook the diversity of global users, which can lead to stereotyping, loss of trust, and marginalization of minority groups. Cultural reasoning therefore needs to be defined and developed so that AI can recognize and adapt to different cultural contexts.
Method: The paper reviews existing research, analyzes cultural bias in models' moral judgments, idiom interpretation, and advice, points out that larger datasets or fine-tuning alone cannot solve the problem, and argues for systematic evaluation of cultural reasoning as a core capability.
Result: Existing evaluation methods are limited to static accuracy and fail to capture adaptive reasoning in context; even after fine-tuning on survey data, models still lean toward Western norms.
Conclusion: Cultural reasoning should be regarded as a foundational capability of LLMs; future work needs to define it clearly and develop matching evaluation frameworks to achieve culturally sensitive AI systems.
Abstract: Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge, values, and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, a foundation is laid for future systems to be able to respond with greater sensitivity to the complex fabric of human culture.
[48] Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
Svetlana Maslenkova,Clement Christophe,Marco AF Pimentel,Tathagata Raha,Muhammad Umar Salman,Ahmed Al Mahrooqi,Avani Gupta,Shadab Khan,Ronnie Rajan,Praveenkumar Kanithi
Main category: cs.CL
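TL;DR: This study presents an in-depth analysis of potential bias in clinical language models, focusing on differences in opioid prescription tendencies across demographic groups, and releases HC4, a new pretraining dataset of more than 89 billion tokens, to promote fairness and safety in medical AI.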
Details
Motivation: Dataset curation and bias assessment currently lack transparency, which undermines the fairness and trustworthiness of medical LLMs; a systematic evaluation framework is needed to identify and mitigate potential bias.
Method: The authors build HC4, a large-scale healthcare pretraining corpus, and combine general benchmarks with a newly proposed healthcare-specific methodology to evaluate differences in prescription tendencies across groups defined by ethnicity, gender, and age.
Result: Clinical language models show differential opioid prescription tendencies across demographic groups, and HC4 is shown to support fairness evaluation of such models.
Conclusion: Transparent data curation and targeted evaluation can reveal and help mitigate bias in clinical AI, providing a basis for responsible medical AI development.
Abstract: Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
[49] Large language models for folktale type automation based on motifs: Cinderella case study
Tjaša Arčon,Marko Robnik-Šikonja,Polona Tratnik
Main category: cs.CL
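TL;DR: This study uses machine learning and natural language processing to detect motifs in Cinderella variants automatically and analyzes their similarities and differences with clustering and dimensionality reduction, showing the potential of large language models for large-scale text analysis and cross-lingual comparison in folklore studies.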
Details
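Motivation: To apply artificial intelligence to folkloristics, enabling efficient, automated analysis of large text collections and supporting cross-lingual cultural comparison.
Method: Machine learning and natural language processing, combined with clustering and dimensionality reduction, are used to detect and analyze motifs in a large collection of Cinderella variants.
Result: Large language models can identify complex interactions in tales, support computational analysis of large text collections, and facilitate cross-lingual comparison.
Conclusion: The method offers a feasible technical path for large-scale research on folk literature and supports the value of AI in the digital humanities.
Abstract: Artificial intelligence approaches are being adapted to many research areas, including digital humanities. We built a methodology for large-scale analyses in folkloristics. Using machine learning and natural language processing, we automatically detected motifs in a large collection of Cinderella variants and analysed their similarities and differences with clustering and dimensionality reduction. The results show that large language models detect complex interactions in tales, enabling computational analysis of extensive text collections and facilitating cross-lingual comparisons.
[50] Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media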
Dennis Assenmacher,Paloma Piot,Katarina Laken,David Jurgens,Claudia Wagner
Main category: cs.CL
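TL;DR: This paper presents a new bilingual dataset for detecting digital dehumanization that covers its multiple dimensions, and shows that models fine-tuned on it outperform existing approaches in zero-shot and few-shot settings.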
Details
Motivation: Current research focuses mainly on overtly negative statements as the marker of dehumanization and overlooks implicit forms that are not explicit but still reinforce bias, so a more comprehensive approach is needed.
Method: Several sampling methods are used to collect a theory-informed bilingual dataset from Twitter and Reddit; crowdworkers and experts annotate 16,000 instances at the document and span level, and machine learning models are fine-tuned on the dataset.
Result: The dataset covers multiple dimensions of dehumanization and serves as both a training resource and an evaluation benchmark; the fine-tuned models outperform state-of-the-art models used in zero-shot and few-shot settings.
Conclusion: The work fills a gap in computational linguistics regarding the recognition of implicit dehumanization and provides a data resource and model validation that advance the field.
Abstract: Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrates primarily on a single aspect of dehumanization that identifies overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero and few-shot in-context settings.
[51] Dynamical model parameters from ultrasound tongue kinematics
Sam Kirkham,Patrycja Strycharczuk
Main category: cs.CL
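TL;DR: This study evaluates whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compares them with simultaneously recorded EMA data; the results indicate that ultrasound and EMA yield comparable dynamical parameters.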
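As a rough illustration of what estimating oscillator parameters from kinematics can look like, the sketch below simulates a damped linear oscillator moving toward a target and recovers its stiffness and damping by least squares on finite differences of position. It assumes unit mass, a known target, and a noise-free synthetic trajectory; it is not the authors' estimation procedure, and real ultrasound or EMA signals would need smoothing first.

```python
def simulate(k, b, target, x0, dt, steps):
    """Explicit-Euler trajectory of x'' = -k (x - target) - b x' (unit mass)."""
    xs, x, v = [x0], x0, 0.0
    for _ in range(steps):
        a = -k * (x - target) - b * v
        x, v = x + dt * v, v + dt * a
        xs.append(x)
    return xs

def estimate(xs, target, dt):
    """Least-squares fit of (k, b) from positions via forward differences."""
    vel = [(xs[i + 1] - xs[i]) / dt for i in range(len(xs) - 1)]
    acc = [(vel[i + 1] - vel[i]) / dt for i in range(len(vel) - 1)]
    # Regress acc on u = -(x - target) and w = -vel, i.e. acc = k*u + b*w.
    u = [-(xs[i] - target) for i in range(len(acc))]
    w = [-vel[i] for i in range(len(acc))]
    suu = sum(p * p for p in u)
    sww = sum(q * q for q in w)
    suw = sum(p * q for p, q in zip(u, w))
    sua = sum(p * a for p, a in zip(u, acc))
    swa = sum(q * a for q, a in zip(w, acc))
    det = suu * sww - suw * suw
    return (sua * sww - swa * suw) / det, (swa * suu - sua * suw) / det

# Hypothetical critically damped gesture: k = 400, b = 2 * sqrt(k) = 40.
trajectory = simulate(k=400.0, b=40.0, target=1.0, x0=0.0, dt=0.001, steps=500)
k_hat, b_hat = estimate(trajectory, target=1.0, dt=0.001)
```

Running the same fit on two sensor streams of one gesture and comparing the recovered `k_hat` and `b_hat` is the kind of comparison the summary describes.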
Details
Motivation: Recent methodological advances make ultrasound imaging a promising alternative for studying speech motor control, so its reliability for dynamical-systems modelling needs to be verified.
Method: The reliability of ultrasound data is assessed by comparing linear harmonic oscillator parameters estimated from ultrasound tongue kinematics with those estimated from simultaneously recorded electromagnetic articulography (EMA) data.
Result: Ultrasound and EMA yield comparable dynamical parameters, and mandibular short tendon tracking adequately captures jaw motion.
Conclusion: The findings support using ultrasound kinematics to evaluate dynamical articulatory models.
Abstract: The control of speech can be modelled as a dynamical system in which articulators are driven toward target positions. These models are typically evaluated using fleshpoint data, such as electromagnetic articulography (EMA), but recent methodological advances make ultrasound imaging a promising alternative. We evaluate whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compare these with parameters estimated from simultaneously-recorded EMA data. We find that ultrasound and EMA yield comparable dynamical parameters, while mandibular short tendon tracking also adequately captures jaw motion. This supports using ultrasound kinematics to evaluate dynamical articulatory models.
[52] MLMA: Towards Multilingual with Mamba Based Architectures
Mohamed Nabih Ali,Daniele Falavigna,Alessio Brutti
Main category: cs.CL
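TL;DR: This paper proposes MLMA, a multilingual ASR model built on the Mamba architecture, which uses the strengths of state-space models for efficient, scalable multilingual speech recognition and performs on par with Transformers on standard benchmarks.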
Details
Motivation: Balancing performance across high- and low-resource languages is a challenge in multilingual automatic speech recognition, and exploring sequence-modeling architectures more efficient than Transformers is of clear interest.
Method: MLMA adopts the Mamba architecture, a state-space model optimized for long sequences, and supports multilingual ASR through implicit language-aware conditioning and shared representations.
Result: On standard multilingual benchmarks, MLMA is competitive with Transformer-based models while offering better efficiency and scalability.
Conclusion: Mamba is a promising backbone for efficient, scalable, and accurate multilingual speech recognition.
Abstract: Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture, an efficient state-space model optimized for long-context sequence processing, for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba's potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
[53] Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
Feras AlMannaa,Talia Tseriotou,Jenny Chim,Maria Liakata
Main category: cs.CL
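TL;DR: This study is the first to systematically evaluate LLM comprehension in clinically relevant long-context medical QA, examining performance across model sizes, the effect of RAG, and single- versus multi-document reasoning, and revealing memorization issues and common failure modes.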
Details
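Motivation: To understand how LLMs perform on clinically relevant long-context medical QA, particularly when the included content varies in relevance, a systematic evaluation of their comprehension, limitations, and improvement strategies is needed. Method: The study spans multiple content-inclusion settings, LLMs of varying capability and size, and datasets across task formulations, evaluates them in combination with RAG, and conducts qualitative and error analyses using a multi-faceted approach. Result: The study identifies the effect of model size on performance and underlying memorization issues, determines the best RAG settings for single- versus multi-document reasoning datasets, and shows how RAG strategies improve long-context comprehension. Conclusion: RAG can substantially improve LLM performance on long-context medical QA under specific conditions, but its effectiveness depends on the task setting and document structure; the study offers key insights and directions for long-context comprehension in the medical domain. Abstract: This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLMs of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light on some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.
[54] Bayesian Low-Rank Factorization for Robust Model Adaptation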
Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
Main category: cs.CL
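TL;DR: Proposes Bayesian factorized adapters that adapt speech foundation models to code-switching scenarios while preserving their generalization ability.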
Details
Motivation: Directly fine-tuning large speech foundation models risks overfitting and catastrophic forgetting, making it hard to adapt to a specific domain while retaining existing capabilities. Method: Introduces Bayesian factorized adapters with sparsity-inducing priors into the Whisper model, yielding parameter-efficient, sparser adaptation. Result: Compared to LoRA, the method achieves a backward gain of 54% with only a 4% drop on the new domain, significantly reducing catastrophic forgetting. Conclusion: Bayesian adaptation effectively balances domain adaptation and retained generalization in speech foundation models, and suits complex multilingual scenarios such as code-switching. Abstract: Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
[55] Adapting Language Balance in Code-Switching Speech
Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel
Main category: cs.CL
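TL;DR: Proposes a differentiable surrogate based on embedding differences to make large models more robust in code-switching scenarios, effectively reducing substitution errors during generation.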
Details
Motivation: Large foundation models perform well on standard benchmarks but still struggle on code-switching test cases, especially when switching points are infrequent and the embedding of the second language is subtle, which makes them hard for the model to learn on its own. Method: The difference between the main language and the embedded language is used to highlight code-switching points; a labeled training signal emphasizes learning at these key positions, mitigating context bias during generation. Result: Experiments on Arabic and Chinese-English code-switching show that the models predict switching positions more accurately, with a reduced substitution error rate. Conclusion: The differentiable surrogate improves the model's ability to identify code-switching points and makes it more robust when switches are infrequent and subtle. Abstract: Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation -- the central challenge in code-switching -- thereby improving the model's robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
[56] SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish
Josh McGiff,Nikola S. Nikolov
Main category: cs.CL
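TL;DR: This paper proposes SemiAdapt and SemiLoRA, two semi-supervised, inference-efficient methods that improve domain adaptation in neural machine translation for low-resource languages such as Irish; SemiLoRA performs especially well for parameter-efficient fine-tuning, matching or even surpassing full-model fine-tuning.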
Details
Motivation: Full-model fine-tuning of large multilingual models is computationally expensive, which limits the use of transfer learning by researchers in low-resource domains such as Irish translation, so more efficient fine-tuning methods are needed. Method: Proposes SemiAdapt and SemiLoRA, which combine semi-supervised learning with parameter-efficient fine-tuning techniques such as LoRA, strengthening domain adaptation with few trainable parameters while remaining efficient at inference. Result: Experiments show that SemiAdapt can outperform full-domain fine-tuning, while SemiLoRA lets PEFT methods match or exceed full-model fine-tuning; the proposed methods perform especially well on larger, noisier datasets. Conclusion: SemiAdapt and SemiLoRA lower the barrier to domain adaptation for low-resource languages and make high-quality fine-tuning more accessible; all Irish translation models developed in the work are released as open resources. Abstract: Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
[57] Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation
Ming Li
Main category: cs.CL
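TL;DR: Proposes RLAAR, a curriculum reinforcement learning framework that uses verifiable accuracy and abstention rewards to mitigate the Lost-in-Conversation problem of large models in multi-turn dialogue, improving both answer accuracy and appropriate abstention.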
Details
Motivation: Large language models follow single-turn instructions well, but their performance degrades in multi-turn dialogue as information is revealed progressively (the Lost-in-Conversation problem). Existing methods struggle with reliability and abstention decisions in dynamic dialogue, so a training framework is needed that improves both accuracy and the ability to judge whether a question is solvable. Method: Proposes the RLAAR framework, which builds on Reinforcement Learning with Verifiable Rewards (RLVR) and adds a competence-gated curriculum that gradually increases dialogue difficulty; multi-turn, on-policy rollouts and a mixed-reward system encourage the model to solve problems while learning to abstain appropriately, avoiding errors caused by answering prematurely. Result: On LiC benchmarks, RLAAR raises performance from 62.6% to 75.1% and the calibrated abstention rate from 33.5% to 73.4%, effectively mitigating performance decay in multi-turn dialogue. Conclusion: RLAAR offers a practical recipe for building reliable, trustworthy multi-turn LLMs; curriculum reinforcement learning that balances answering and abstaining markedly improves robustness and self-assessment when information arrives incrementally. Abstract: Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
[58] Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting
Taha Binhuraib,Greta Tuckute,Nicholas Blauch
Main category: cs.CL
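TL;DR: This paper proposes a new form of self-attention with topographic organization that turns Transformers into "Topoformers", improving interpretability while maintaining performance and showing alignment with the human brain's language network.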
Details
Motivation: Inspired by the topographic arrangement of neurons by response properties in biological brains, the work aims to introduce spatial functional organization into machine learning models so that representations are easier to interpret and visualize. Method: Proposes two mechanisms, spatial querying and spatial reweighting: keys and queries are arranged on 2D grids, and the fully connected layer of self-attention is converted into a locally connected layer, giving Transformers topographic structure. Result: The approach is validated on sentiment classification and on masked language modeling with a BERT architecture; the Topoformer performs on par with a standard model on NLP benchmarks and exhibits interpretable topographic organization, and analysis of fMRI data shows alignment between its low-dimensional topographic variability and the human brain's language network. Conclusion: Topoformer introduces topographic organization without sacrificing performance, improves interpretability, and points toward more accurate models of how the human brain processes linguistic information. Abstract: Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into "Topoformers" with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites.
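As a rough illustration of the spatial querying idea described in this abstract, the sketch below lays the query and key feature dimensions out on a 2D grid and lets each key dimension interact with a local pool of query dimensions. The binary locality mask, the Chebyshev-distance neighborhood, the `(Q M) K^T` form of the scores, and all function names are assumptions made for illustration; the abstract does not specify these details, and softmax and scaling are left out.

```python
def locality_mask(grid_side, radius):
    """Binary d x d matrix with d = grid_side**2. Entry [i][j] is 1 when
    feature dimensions i and j lie within `radius` of each other on the
    grid (Chebyshev distance). The neighborhood shape is an assumption."""
    d = grid_side * grid_side
    mask = [[0] * d for _ in range(d)]
    for i in range(d):
        ri, ci = divmod(i, grid_side)
        for j in range(d):
            rj, cj = divmod(j, grid_side)
            if max(abs(ri - rj), abs(ci - cj)) <= radius:
                mask[i][j] = 1
    return mask


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def spatial_query_scores(queries, keys, mask):
    """Unnormalized attention scores (Q M) K^T: each key dimension is
    matched against the sum of the query dimensions in its neighborhood."""
    pooled = matmul(queries, mask)               # (n_tokens, d)
    keys_t = [list(col) for col in zip(*keys)]   # (d, n_tokens)
    return matmul(pooled, keys_t)                # (n_tokens, n_tokens)
```

With `radius=0` the mask is the identity and the scores reduce to the ordinary `Q K^T`; larger radii pool more neighboring query dimensions into each key dimension.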
Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and human brain language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
[59] AI use in American newspapers is widespread, uneven, and rarely disclosed
Jenna Russell,Marzena Karpinska,Destiny Akinode,Katherine Thai,Bradley Emi,Max Spero,Mohit Iyyer
Main category: cs.CL
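TL;DR: By analyzing 186,000 American newspaper articles from the summer of 2025, this study finds that about 9% are partially or fully AI-generated, concentrated in local outlets, specific topics, and certain ownership groups; opinion pieces in the Washington Post, New York Times, and Wall Street Journal are 6.4 times more likely to contain AI-generated content than news articles, yet AI use is rarely disclosed, underscoring the need for greater transparency and new editorial standards in journalism.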
Details
Motivation: The actual extent of AI use in journalism, and in published articles in particular, remains unclear, so a systematic audit is needed to reveal how it is distributed and what problems it raises. Method: Pangram, a state-of-the-art AI detector, is applied at scale to 186,000 articles from 1,500 American newspapers and to 45,000 opinion pieces from three major newspapers, combined with a manual audit of whether AI use is disclosed. Result: About 9% of newly published articles are partially or fully AI-generated, with use more common in smaller local outlets, in topics such as weather and technology, and within certain ownership groups; opinion pieces are 6.4 times more likely than news articles to contain AI-generated content, and many of the flagged op-eds are authored by prominent public figures; the manual audit found that only 5% of AI-flagged articles disclosed AI use. Conclusion: AI is already widely used in journalism but without transparency, and clearer editorial standards and disclosure mechanisms are needed to maintain public trust. Abstract: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
[60] KAT-Coder Technical Report
Zizheng Zhan,Ken Deng,Xiaojiang Zhang,Jinghui Wang,Huaixi Tang,Zhiyi Lai,Haoyang Huang,Wen Xiang,Kun Wu,Wenhao Zhuang,Minglei Zhang,Shaojie Wang,Shangpeng Yan,Kepeng Lei,Zongxian Feng,Huiming Wang,Zheng Lin,Mengtong Li,Mengfei Xie,Yinghan Cui,Xuxing Chen,Chao Wang,Weihao Li,Wenqiang Zhu,Jiarong Zhang,Jingxuan Xu,Songwei Yu,Yifan Yao,Xinping Lei,Han Li,Junqi Xiong,Zuchen Gao,Dailin Li,Haimo Li,Jiaheng Liu,Yuqun Zhang,Junyi Peng,Haotian Zhang,Bin Chen
Main category: cs.CL
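TL;DR: KAT-Coder is a large-scale agentic code model trained through a multi-stage curriculum to bridge the gap between static text-based training and dynamic real-world execution, with robust tool-use reliability, instruction alignment, and long-context reasoning.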
Details
Motivation: Existing large language models face a gap between static text-based training and dynamic real-world agentic execution, which makes reliable intelligent coding agents hard to achieve; a training framework is needed that improves a model's ability to reason, plan, and act autonomously in real software development environments. Method: A multi-stage curriculum: Mid-Term Training (strengthening reasoning, planning, and reflection), Supervised Fine-Tuning (building a million-sample multilingual, multi-task dataset), Reinforcement Fine-Tuning (introducing a multi-ground-truth reward formulation), and Reinforcement-to-Deployment Adaptation (combining Error-Masked SFT and Tree-Structured Trajectory Training to fit production-grade IDE environments). Result: KAT-Coder shows robust tool-use reliability, instruction following, and long-context reasoning; its 32B model, KAT-Dev, is open-sourced on Hugging Face as a deployable foundation for real-world intelligent coding agents. Conclusion: The multi-stage training strategy effectively improves the agentic capabilities of large models on complex software engineering tasks and offers a viable path to deployable intelligent coding agents. Abstract: Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.
[61] WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
Guanzhong He,Zhen Yang,Jinxin Liu,Bin Xu,Lei Hou,Juanzi Li
Main category: cs.CL
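TL;DR: This paper presents WebSeer, a search agent trained via reinforcement learning combined with a self-reflection mechanism, which substantially increases tool-use depth and answer accuracy and achieves state-of-the-art results on HotpotQA and SimpleQA.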
Details
Motivation: Existing search agents suffer from shallow tool-use depth and error accumulation over multi-step interactions, which limits their performance in complex environments. Method: A large dataset annotated with reflection patterns is constructed, and a two-stage training framework unifies cold start and reinforcement learning within a self-reflection paradigm, enabling the model to generate longer and more reflective tool-use trajectories. Result: State-of-the-art accuracy on HotpotQA (72.3%) and SimpleQA (90.0%), with strong generalization to out-of-distribution data. Conclusion: A self-reflection mechanism effectively improves the reasoning depth and accuracy of search agents and offers a viable path toward more intelligent interactive information retrieval systems. Abstract: Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer
[62] Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring
Shuxin Lin,Dhaval Patel,Christodoulos Constantinides
Main category: cs.CL
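TL;DR: Proposes a Chain-of-Thought (CoT) knowledge distillation framework that transfers the reasoning capabilities of large language models to small language models (SLMs) for industrial asset health monitoring, significantly improving SLM reasoning on multiple-choice question answering.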
Details
Motivation: Small language models are efficient and inexpensive in specialized fields such as industry but fall short on complex reasoning tasks; their reasoning ability must improve to support demanding settings such as Industry 4.0. Method: A knowledge distillation framework uses large language models to generate multi-choice question answering (MCQA) prompts with chains of thought, transferring reasoning ability to small language models; in-context learning is used to verify the quality of the generated knowledge. Result: Experiments show that SLMs fine-tuned with CoT distillation significantly outperform the base models on reasoning tasks and approach the performance of large language models. Conclusion: The method effectively improves the complex reasoning ability of small language models in specialized domains, narrows the gap to large language models, and has practical deployment value. Abstract: Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
[63] MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
Wenxuan Li,Chengruidong Zhang,Huiqiang Jiang,Yucheng Li,Yuqing Yang,Lili Qiu
Main category: cs.CL
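TL;DR: This paper proposes MTraining, a distributed training method based on dynamic sparse attention that addresses the computation and communication bottlenecks of training ultra-long-context LLMs, extending the context of Qwen2.5-3B from 32K to 512K tokens and achieving up to 6x higher training throughput.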
Details
Motivation: Long context windows improve the reasoning ability of large models but are expensive to train, and dynamic sparse attention in particular suffers from severe computation and communication load imbalance, so an efficient distributed training scheme is needed. Method: MTraining combines three key components (a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention) that jointly optimize compute load distribution and communication overhead to improve training efficiency for ultra-long-context models in distributed settings. Result: On a cluster of 32 A100 GPUs, the context length of Qwen2.5-3B is extended from 32K to 512K tokens; evaluations on RULER, PG-19, InfiniteBench, and Needle In A Haystack show up to 6x higher training throughput while preserving model accuracy. Conclusion: MTraining provides a scalable, practical solution for efficiently training large language models on ultra-long contexts, substantially improving training efficiency and supporting broader applications. Abstract: The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long contexts. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy.
Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.
[64] Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
Chenghao Zhu,Meiling Tao,Tiannan Wang,Dongyi Ding,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
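TL;DR: Proposes Critique-Post-Edit, a framework that uses a personalized generative reward model and a self-critique revision mechanism to achieve more faithful and controllable LLM personalization.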
Details
Motivation: Existing methods such as supervised fine-tuning and standard reinforcement learning either plateau in performance or are vulnerable to reward hacking when personalizing to users. Method: Introduces a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques, combined with a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on those critiques. Result: Under a rigorous length-controlled evaluation, the method substantially outperforms standard PPO; Qwen2.5-7B gains an average 11% in win rate, and Qwen2.5-14B surpasses GPT-4.1. Conclusion: Critique-Post-Edit offers a practical path to efficient, faithful, and controllable LLM personalization. Abstract: Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
[65] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Ling Team,Anqi Shen,Baihui Li,Bin Hu,Bin Jing,Cai Chen,Chao Huang,Chao Zhang,Chaokun Yang,Cheng Lin,Chengyao Wen,Congqi Li,Deng Zhao,Dingbo Yuan,Donghai You,Fagui Mao,Fanzhuang Meng,Feng Xu,Guojie Li,Guowei Wang,Hao Dai,Haonan Zheng,Hong Liu,Jia Guo,Jiaming Liu,Jian Liu,Jianhao Fu,Jiannan Shi,Jianwen Wang,Jianxin Lai,Jin Yang,Jun Mei,Jun Zhou,Junbo Zhao,Junping Zhao,Kuan Xu,Le Su,Lei Chen,Li Tang,Liang Jiang,Liangcheng Fu,Lianhao Xu,Linfeng Shi,Lisha Liao,Longfei Zheng,Meng Li,Mingchun Chen,Qi Zuo,Qiang Cheng,Qianggang Cao,Qitao Shi,Quanrui Guo,Senlin Zhu,Shaofei Wang,Shaomian Zheng,Shuaicheng Li,Shuwei Gu,Siba Chen,Tao Wu,Tao Zhang,Tianyu Zhang,Tianyu Zhou,Tiwei Bie,Tongkai Yang,Wang Hong,Wang Ren,Weihua Chen,Wenbo Yu,Wengang Zheng,Xiangchun Wang,Xiaodong Yan,Xiaopei Wan,Xin Zhao,Xinyu Kong,Xinyu Tang,Xudong Han,Xudong Wang,Xuemin Yang,Xueyu Hu,Yalin Zhang,Yan Sun,Yicheng Shan,Yilong Wang,Yingying Xu,Yongkang Liu,Yongzhen Guo,Yuanyuan Wang,Yuchen Yan,Yuefan Wang,Yuhong Guo,Zehuan Li,Zhankai Xu,Zhe Li,Zhenduo Zhang,Zhengke Gui,Zhenxuan Pan,Zhenyu Huang,Zhenzhong Lan,Zhiqiang Ding,Zhiqiang Zhang,Zhixun Li,Zhizhen Liu,Zihao Wang,Zujie Wen
Main category: cs.CL
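TL;DR: Ring-1T is the first open-source, state-of-the-art thinking model with trillion-scale parameters; three innovations (IcePop, C3PO++, and ASystem) address train-inference misalignment, inefficient long rollouts, and RL system bottlenecks, yielding breakthrough results on multiple benchmarks and advancing open access to large-scale reasoning intelligence.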
Details
Motivation: Training reasoning models at trillion-parameter scale raises unprecedented challenges, including train-inference misalignment, inefficient rollout processing, and reinforcement learning system bottlenecks, so new methods are needed to stabilize training and improve overall system efficiency. Method: Three interconnected innovations are proposed: (1) IcePop stabilizes RL training through token-level discrepancy masking and clipping; (2) C3PO++ dynamically partitions long rollouts under a token budget to improve resource utilization and time efficiency; (3) ASystem is a high-performance RL framework designed to remove the systemic bottlenecks of large-scale model training. Result: Ring-1T achieves breakthrough results on key benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1, and it reaches a silver medal-level result on IMO-2025, demonstrating exceptional reasoning ability. Conclusion: By releasing the complete trillion-parameter MoE model to the community, Ring-1T sets a new baseline for open-source model performance and marks a significant milestone in democratizing large-scale reasoning intelligence. Abstract: We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities.
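A minimal sketch of what token-level discrepancy masking could look like, assuming the discrepancy is measured as the per-token probability ratio between the training engine and the inference engine, and that tokens whose ratio falls outside a band are dropped from the policy-gradient loss. The band limits `low` and `high`, the function names, and the loss form are hypothetical; the abstract does not give them, so this is not IcePop's actual formulation.

```python
import math


def masked_token_weights(train_logprobs, infer_logprobs, low=0.5, high=2.0):
    """Return one weight per token: the train/inference probability ratio
    when it lies in [low, high], else 0.0 so the token is masked out.
    `low` and `high` are illustrative values, not taken from the paper."""
    weights = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        ratio = math.exp(lp_train - lp_infer)
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights


def masked_policy_loss(train_logprobs, infer_logprobs, advantages,
                       low=0.5, high=2.0):
    """Mean of -weight * advantage over all tokens; masked tokens add 0."""
    weights = masked_token_weights(train_logprobs, infer_logprobs, low, high)
    total = sum(-w * a for w, a in zip(weights, advantages))
    return total / max(len(weights), 1)
```

In an actual training loop the weights would be treated as constants (no gradient flows through them) and would multiply a differentiable log-probability term; here they are plain floats so the masking rule itself is easy to inspect.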
This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
[66] LightMem: Lightweight and Efficient Memory-Augmented Generation
Jizhan Fang,Xinle Deng,Haoming Xu,Ziyan Jiang,Yuqi Tang,Ziwen Xu,Shumin Deng,Yunzhi Yao,Mengru Wang,Shuofei Qiao,Huajun Chen,Ningyu Zhang
Main category: cs.CL
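TL;DR: This paper proposes LightMem, a new memory system inspired by the Atkinson-Shiffrin model of human memory, whose three-stage structure (sensory, short-term, and long-term memory) balances performance and efficiency and substantially reduces the resource consumption of large language models in complex interactive environments.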
Details
Motivation: Large language models struggle to make effective use of historical interaction information in dynamic, complex environments, and existing memory systems incur substantial time and computational overhead. Method: A three-stage memory system inspired by cognitive science: (1) sensory memory based on lightweight compression and topic grouping; (2) topic-aware short-term memory that consolidates and summarizes content; (3) long-term memory updated offline, decoupling memory consolidation from online inference. Result: On the LongMemEval benchmark with GPT and Qwen backbones, LightMem improves accuracy by up to 10.9% while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. Conclusion: LightMem greatly improves memory-system efficiency while maintaining strong performance, offering an efficient and scalable memory mechanism for LLMs. Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
[67] How Do LLMs Use Their Depth?
Akshat Gupta,Jay Yeung,Gopala Anumanchipalli,Anna Ivanova
Main category: cs.CL
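TL;DR: This paper proposes a "Guess-then-Refine" framework that reveals how large language models refine predictions layer by layer during inference: early layers make initial guesses based on high-frequency tokens, deeper layers refine them using contextual information, and several case studies show how depth is used dynamically across tasks.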
Details
Motivation: Although prior work shows that large language models do not use their depth uniformly, a fine-grained understanding of their layer-wise prediction dynamics is still missing; this paper aims to reveal how models structure their use of layers during inference. Method: The intermediate representations of several open-weight models are traced during inference, a "Guess-then-Refine" framework is proposed, and three case studies (part-of-speech analysis, fact recall, and multiple-choice tasks) examine how layer depth is used across tasks. Result: Early layers tend to output high-frequency tokens as initial guesses, which are refined into contextually appropriate tokens in deeper layers; even high-frequency token predictions from early layers are refined more than 70% of the time; function words are predicted correctly earliest, the first token of a multi-token answer requires more computational depth than the rest, and the model identifies the response format within the first half of the layers but finalizes its answer only toward the end. Conclusion: Large language models do not complete a prediction in one shot but refine their output through staged, structured use of depth; this helps explain the internal computation of Transformer models and suggests directions for improving their computational efficiency. Abstract: Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined more than 70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end.
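To make the layer-wise analysis concrete, the sketch below assumes the top-1 token decoded from each layer's intermediate representation is already available (for example through a logit-lens-style readout, which is an assumption here and not a detail given in the abstract). It reports the earliest layer from which the prediction stays fixed at the final answer, and whether any earlier guess was later revised. Function and variable names are hypothetical.

```python
def settle_layer(per_layer_top1):
    """Index of the earliest layer from which the top-1 token equals the
    final-layer token at every later layer."""
    final = per_layer_top1[-1]
    layer = len(per_layer_top1) - 1
    while layer > 0 and per_layer_top1[layer - 1] == final:
        layer -= 1
    return layer


def was_refined(per_layer_top1):
    """True if any layer's top-1 token differs from the final prediction."""
    final = per_layer_top1[-1]
    return any(tok != final for tok in per_layer_top1)
```

For a hypothetical trace `["the", "the", "Paris", "Paris"]`, the prediction settles at layer index 2 after an early high-frequency guess is revised, which is the pattern the framework calls guess-then-refine.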
Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
cs.CV [Back]
[68] MAT-Agent: Adaptive Multi-Agent Training Optimization
Jusheng Zhang,Kaitong Cai,Yijia Fan,Ningyuan Liu,Keze Wang
Main category: cs.CV
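TL;DR: Proposes MAT-Agent, a multi-agent framework for adaptive training in multi-label image classification that improves performance on multiple benchmarks by dynamically adjusting training strategies.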
Details
Motivation: Conventional methods rely on static configurations, perform poorly in dynamic, complex visual-semantic settings, and struggle to meet the needs of multi-label image classification. Method: MAT-Agent is a collaborative multi-agent training framework that uses non-stationary multi-armed bandit algorithms to dynamically tune data augmentation, optimizers, learning rates, and loss functions, combined with dual-rate exponential moving average smoothing and mixed-precision training. Result: It outperforms existing methods on Pascal VOC, COCO, and VG-256, for example reaching an mAP of 97.4, OF1 of 92.3, and CF1 of 91.4 on Pascal VOC, and shows faster convergence and cross-domain generalization. Conclusion: MAT-Agent offers a scalable, intelligent solution that advances adaptive deep learning. Abstract: Multi-label image classification demands adaptive training strategies to navigate complex, evolving visual-semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT-Agent leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. Enhanced with dual-rate exponential moving average smoothing and mixed-precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG-256 demonstrate MAT-Agent's superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT-T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ-CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG-256. With accelerated convergence and robust cross-domain generalization, MAT-Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.
[69] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization
Yichen Yan,Ming Zhong,Qi Zhu,Xiaoling Gu,Jinpeng Chen,Huan Li
Main category: cs.CV
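TL;DR: Proposes CoIDO, a framework that jointly optimizes data importance and diversity; a lightweight plug-in scorer trained on only 20% of the data selects a 20% subset that reaches 98.2% of full-data fine-tuning performance, substantially cutting the training cost of multimodal large models.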
Details
Motivation: Existing data selection methods are computationally expensive, and treating importance and diversity separately yields suboptimal selections, making it hard to support instruction tuning of multimodal large language models efficiently.
Method: Proposes CoIDO, a dual-objective framework with a lightweight scorer based on homoscedastic uncertainty modeling; the scorer is trained on a small random sample to learn the candidate-set distribution and jointly optimizes data importance and diversity.
Result: On LLaVA-1.5-7B across ten downstream tasks, instruction tuning on only 20% of the data reaches 98.2% of full-data training performance with markedly lower computational overhead.
Conclusion: CoIDO selects data efficiently and scalably, greatly reducing training data with almost no loss in model performance, and offers a low-cost, high-yield option for training multimodal large models.
Abstract: Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
[70] Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model
Alexandre G. Leclercq,Sébastien Bougleux,Noémie N. Moreau,Alexis Desmonts,Romain Hérault,Aurélien Corroyer-Dulmont
Main category: cs.CV
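TL;DR: Proposes a slice-to-slice translation method based on a latent diffusion model for early prediction of glioblastoma treatment response, generating post-treatment MRI from pre-treatment MRI and tumor localization.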
Details
Motivation: Glioblastoma patients respond very differently to treatment, and at least two months pass before the effect is visible on MRI, so a method for early prediction of treatment response is needed to advance personalized medicine.
Method: Uses a latent diffusion model with concatenation-based conditioning and classifier-free guidance to generate post-treatment MRI from pre-treatment MRI and tumor localization, with survival information used to improve generation quality.
Result: The model was trained and tested on a local dataset of 140 patients and generates post-treatment MRI images that reflect tumor evolution.
Conclusion: The method is a promising route to early visual prediction of glioblastoma treatment response in support of personalized treatment.
Abstract: Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses, which require at least two months before a first visual impact can be observed, typically with MRI. Early prediction of treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarily rely on time-series data, we consider the problem of early visual TRP as a slice-to-slice translation model generating post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with a concatenation-based conditioning from the pre-treatment MRI and the tumor localization, and a classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model was trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre and post T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
[71] Provenance of AI-Generated Images: A Vector Similarity and Blockchain-based Approach
Jitendra Sharma,Arthur Carvalho,Suman Bhunia
Main category: cs.CV
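TL;DR: Proposes an embedding-based AI image detection framework that uses image embeddings and vector similarity to distinguish AI-generated images from human-created ones.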
Details
Motivation: As generative AI and large language models advance, AI-generated images are increasingly realistic and hard to tell apart from human-created ones, which challenges the authentication of digital content.
Method: Starting from the hypothesis that AI-generated images lie closer to other AI-generated content in embedding space, processes a diverse dataset of AI-generated and human-created images with five benchmark embedding models and classifies them by vector similarity.
Result: Experiments show the approach is robust: embedding signatures remain stable under moderate to high perturbations, and perturbed images still match their originals closely.
Conclusion: The framework balances accuracy and computational efficiency, generalizes well, and can be used to detect AI-generated images.
Abstract: Rapid advancement in generative AI and large language models (LLMs) has enabled the generation of highly realistic and contextually relevant digital content. LLMs such as ChatGPT with DALL-E integration and Stable Diffusion techniques can produce images that are often indistinguishable from those created by humans, which poses challenges for digital content authentication. Verifying the integrity and origin of digital data to ensure it remains unaltered and genuine is crucial to maintaining trust and legality in digital media. In this paper, we propose an embedding-based AI image detection framework that utilizes image embeddings and vector similarity to distinguish AI-generated images from real (human-created) ones. Our methodology is built on the hypothesis that AI-generated images demonstrate closer embedding proximity to other AI-generated content, while human-created images cluster similarly within their domain. To validate this hypothesis, we developed a system that processes a diverse dataset of AI and human-generated images through five benchmark embedding models. Extensive experimentation demonstrates the robustness of our approach, and our results confirm that moderate to high perturbations minimally impact the embedding signatures, with perturbed images maintaining close similarity matches to their original versions. Our solution provides a generalizable framework for AI-generated image detection that balances accuracy with computational efficiency.
[72] CMIS-Net: A Cascaded Multi-Scale Individual Standardization Network for Backchannel Agreement Estimation
Yuxuan Huang,Kangzhong Wang,Eugene Yujun Fu,Grace Ngai,Peter H. F. Ng
Main category: cs.CV
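TL;DR: Proposes a new Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features at the frame and sequence levels, handles individual differences and data imbalance, and achieves state-of-the-art performance in backchannel agreement detection.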
Details
Motivation: Existing emotion recognition methods typically operate at a single scale and overlook the complementarity of multi-scale behavioral cues; individual differences also strongly affect backchannel behavior, so models struggle to capture personalized expression patterns across scales.
Method: Proposes CMIS-Net, which achieves frame-level and sequence-level individual normalization by removing person-specific neutral baselines, and introduces an implicit data augmentation module to mitigate bias in the training data distribution.
Result: Experiments and visualizations show that CMIS-Net handles individual differences and data imbalance well and reaches state-of-the-art performance in backchannel agreement detection.
Conclusion: Through multi-scale individual standardization and implicit data augmentation, CMIS-Net models personalized backchannel behavior more effectively and offers a new direction for building human-like, responsive AI systems.
Abstract: Backchannels are subtle listener responses, such as nods, smiles, or short verbal cues like "yes" or "uh-huh," which convey understanding and agreement in conversations. These signals provide feedback to speakers, improve the smoothness of interaction, and play a crucial role in developing human-like, responsive AI systems. However, the expression of backchannel behaviors is often significantly influenced by individual differences, operating across multiple scales: from instant dynamics such as response intensity (frame-level) to temporal patterns such as frequency and rhythm preferences (sequence-level). This presents a complex pattern recognition problem that contemporary emotion recognition methods have yet to fully address. Particularly, existing individualized methods in emotion recognition often operate at a single scale, overlooking the complementary nature of multi-scale behavioral cues. To address these challenges, we propose a novel Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features by removing person-specific neutral baselines from observed expressions. Operating at both frame and sequence levels, this normalization allows the model to focus on relative changes from each person's baseline rather than absolute expression values. Furthermore, we introduce an implicit data augmentation module to address the observed training data distributional bias, improving model generalization.
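The baseline-removal idea described for CMIS-Net can be illustrated with a minimal sketch. This is not the network from the paper: it replaces the learned normalization with plain subtraction, and the function names, feature values, and the nods-per-minute statistic are all hypothetical, chosen only to show what "relative to each person's baseline" means at the two scales.

```python
from statistics import mean


def frame_level_normalize(frames, neutral_baseline):
    """Subtract one person's neutral baseline from every frame's features."""
    return [
        [value - base for value, base in zip(frame, neutral_baseline)]
        for frame in frames
    ]


def sequence_level_normalize(observed_stat, person_history):
    """Express a sequence statistic relative to that person's usual level."""
    return observed_stat - mean(person_history)


# Hypothetical toy data: two expression features per frame.
neutral = [0.25, 0.5]                  # this person's resting expression
frames = [[0.25, 0.5], [0.75, 1.0]]    # observed expression features
relative_frames = frame_level_normalize(frames, neutral)

# Hypothetical nods-per-minute history for the same person.
relative_rate = sequence_level_normalize(7.0, [2.0, 4.0, 6.0])
```

In this toy setup a frame identical to the neutral expression maps to all zeros, so a downstream classifier sees only deviations from the person's own baseline rather than absolute expression values.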
Comprehensive experiments and visualizations demonstrate that CMIS-Net effectively handles individual differences and data imbalance, achieving state-of-the-art performance in backchannel agreement detection.
[73] Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch
Xu Cai,Yang Wu,Qianli Chen,Haoran Wu,Lichuan Xiang,Hongkai Wen
Main category: cs.CV
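TL;DR: Proposes an ultra-efficient post-training method that uses novel velocity-field self-distillation to turn large-scale pre-trained flow matching diffusion models into efficient few-step samplers, without retraining from scratch and compatible with existing models.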
Details
Motivation: Existing shortcut methods for flow matching need a specialized step-size embedding, which makes them hard to apply to existing pre-trained models, and retraining is costly. An efficient shortcut method that applies directly to standard flow matching models without changing the architecture is therefore needed.
Method: Performs self-guided online self-distillation on the velocity field rather than in sample space, exploiting properties of the velocity field to shortcut standard flow matching models such as Flux; the method can also be folded into pretraining so the model itself learns efficient few-step flows. It further yields the first few-shot distillation method for diffusion models with tens of billions of parameters.
Result: A 3-step Flux model can be trained in less than one A100 day; few-shot distillation with only 10 text-image pairs is supported, reaching state-of-the-art performance at almost no cost.
Conclusion: The method provides a general, low-cost route to efficient inference for large diffusion models and supports the deployment of efficient, low-resource generative models.
Abstract: We present an ultra-efficient post-training method for shortcutting large-scale pre-trained flow matching diffusion models into efficient few-step samplers, enabled by novel velocity field self-distillation. While shortcutting in flow matching, originally introduced by shortcut models, offers flexible trajectory-skipping capabilities, it requires a specialized step-size embedding incompatible with existing models unless retraining from scratch, a process nearly as costly as pretraining itself. Our key contribution is thus imparting a more aggressive shortcut mechanism to standard flow matching models (e.g., Flux), leveraging a unique distillation principle that obviates the need for step-size embedding. Working on the velocity field rather than sample space and learning rapidly from self-guided distillation in an online manner, our approach trains efficiently, e.g., producing a 3-step Flux in less than one A100 day. Beyond distillation, our method can be incorporated into the pretraining stage itself, yielding models that inherently learn efficient, few-step flows without compromising quality. This capability also enables, to our knowledge, the first few-shot distillation method (e.g., 10 text-image pairs) for dozen-billion-parameter diffusion models, delivering state-of-the-art performance at almost no cost.
[74] Robotic Classification of Divers' Swimming States using Visual Pose Keypoints as IMUs
Demetrious T. Kutzke,Ying-Kun Wu,Elizabeth Terveen,Junaed Sattar
Main category: cs.CV
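TL;DR: Proposes a computer-vision-based hybrid approach that generates "pseudo-IMU" data from 3D joint keypoints to recognize anomalous diver behavior underwater, improving an AUV's ability to monitor diver safety.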
Details
Motivation: Traditional human activity recognition methods work poorly underwater, and wireless signals attenuate severely in water, which makes reliable monitoring of diver state difficult.
Method: Uses computer vision to generate high-fidelity motion data from a stream of 3D human joint keypoints, forming a "pseudo-IMU", and integrates it into a classifier onboard an Autonomous Underwater Vehicle (AUV) to detect anomalous diver behavior.
Result: Experiments with simulated emergencies show that the system can identify anomalous diving behavior that signals a medical emergency such as cardiac arrest.
Conclusion: The method overcomes the limits of underwater wireless communication and offers a feasible way for AUVs to monitor diver safety reliably.
Abstract: Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a "pseudo-IMU" from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an Autonomous Underwater Vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest, a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
[75] InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
Jungmin Lee,Seonghyuk Hong,Juyong Lee,Jaeyoon Lee,Jongwon Choi
Main category: cs.CV
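TL;DR: InsideOut extends 3D Gaussian splatting (3DGS) by fusing RGB and X-ray imaging, giving high-fidelity reconstruction of both surface detail and internal structure.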
Details
Motivation: Addresses the difficulty of fusing RGB and X-ray imaging, which stems from their different data representations and the scarcity of paired datasets.
Method: Collects new paired RGB and X-ray data, uses hierarchical fitting to align the radiative Gaussian splats of the two modalities, and introduces an X-ray reference loss to keep internal structures consistent.
Result: Achieves high-quality 3D reconstruction of RGB appearance together with X-ray internal structure, strengthening 3DGS for visualization, simulation, and non-destructive testing.
Conclusion: InsideOut bridges the gap between the two imaging modalities and significantly broadens the applicability of 3DGS in medical diagnostics, cultural heritage restoration, and manufacturing.
Abstract: We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and non-destructive testing capabilities across various domains.
[76] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
Sungmin Cho,Sungbum Park,Insoo Oh
Main category: cs.CV
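TL;DR: MUSE is a training-free framework for model-based zero-shot 2D object detection and segmentation that combines absolute and relative similarity scores with an uncertainty-aware object prior and takes first place in several tracks of the BOP Challenge 2025.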
Details
Motivation: To achieve efficient zero-shot 2D detection and segmentation of unseen objects without any training, overcoming the limits of conventional methods in challenging scenarios.
Method: Uses 2D multi-view templates rendered from unseen 3D objects and 2D proposals from the input image; fuses class and patch embeddings, normalizing the patch embeddings with generalized mean pooling (GeM); in the matching stage combines absolute and relative similarity scores and refines the final score with an uncertainty-aware object prior.
Result: MUSE ranks first in the Classic Core, H3, and Industrial tracks of the BOP Challenge 2025, achieving state-of-the-art zero-shot detection and segmentation performance.
Conclusion: MUSE offers a powerful, generalizable training-free framework for zero-shot 2D object detection and segmentation with potential for practical use.
Abstract: In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
[77] GAN-based Content-Conditioned Generation of Handwritten Musical Symbols
Gerard Asbert,Pau Torras,Lei Kang,Alicia Fornés,Josep Lladós
Main category: cs.CV
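TL;DR: Uses a Generative Adversarial Network (GAN) and the Smashcima software to generate realistic handwritten-style musical scores, addressing the scarcity of real annotated data in Optical Music Recognition.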
Details
Motivation: Real annotated handwritten historical scores are scarce, which holds back Optical Music Recognition (OMR); synthetic data is needed to improve recognition models.
Method: Implements a Generative Adversarial Network (GAN) at the music-symbol level and assembles the generated symbols into full scores with the Smashcima engraving software.
Result: A systematic evaluation shows that the generated symbols are highly realistic, marking significant progress in synthetic score generation.
Conclusion: Synthesizing handwritten-style scores can improve data availability for OMR and offers a feasible way to train better-performing recognition models.
Abstract: The field of Optical Music Recognition (OMR) is currently hindered by the scarcity of real annotated data, particularly when dealing with handwritten historical musical scores. In similar fields, such as Handwritten Text Recognition, it was proven that synthetic examples produced with image generation techniques could help to train better-performing recognition architectures. This study explores the generation of realistic, handwritten-looking scores by implementing a music symbol-level Generative Adversarial Network (GAN) and assembling its output into a full score using the Smashcima engraving software. We have systematically evaluated the visual fidelity of these generated samples, concluding that the generated symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.
[78] Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach
Tadesse K Bahiru,Natnael Tilahun Sinshaw,Teshager Hailemariam Moges,Dheeraj Kumar Singh
Main category: cs.CV
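TL;DR: Finds that existing gender classification datasets suffer from significant intersectional underrepresentation and that models trained even on relatively balanced datasets still show gender and racial bias. The authors build BalancedFace, a dataset that equalizes 189 intersectional subgroups of age, race, and gender, which markedly reduces model bias and improves fairness while keeping accuracy high.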
Details
Motivation: Gender classification systems inherit and amplify bias from gender and racial imbalance in their training data, so fairer datasets are needed to mitigate the problem.
Method: First audits five widely used datasets for intersectional representation; then trains MobileNetV2 models on the two most balanced ones (UTKFace and FairFace) to assess bias; finally blends multiple data sources into a new balanced dataset, BalancedFace, and validates its effect on model fairness.
Result: The original models perform worse on female faces and on certain racial groups, showing clear bias; after training on BalancedFace, the maximum True Positive Rate gap across racial subgroups falls by over 50% and the average Disparate Impact score moves 63% closer to the ideal of 1.0, with minimal loss of overall accuracy.
Conclusion: The balance of the data itself is decisive for model fairness. BalancedFace is an effective, openly available resource for fairer gender classification and underlines the value of data-centric fairness interventions.
Abstract: Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
[79] 3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement
Xiaoxu Xu,Xuexun Liu,Jinlong Li,Yitian Yuan,Qiudan Zhang,Lin Ma,Nicu Sebe,Xu Wang
Main category: cs.CV
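TL;DR: Proposes a simple yet effective 3D weakly supervised semantic segmentation method that adds class-aware and geometry-aware pseudo-label refinement, combined with a self-training strategy, and significantly improves model performance.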
Details
Motivation: Existing 3D weakly supervised semantic segmentation methods are limited by low-quality pseudo-labels and insufficient use of 3D geometric priors, which keeps them from reaching high performance.
Method: Proposes a class-aware label refinement module that generates more balanced and accurate pseudo-labels, and a geometry-aware module that uses implicit 3D geometric constraints to filter out low-confidence labels that are not geometrically plausible; also introduces a self-training-based label update strategy to expand label coverage.
Result: Achieves state-of-the-art performance on ScanNet and S3DIS and shows strong generalization and competitive accuracy in unsupervised settings.
Conclusion: By integrating 3D geometric priors with a class-aware guidance mechanism, the method improves pseudo-label quality and model performance and provides a robust, efficient solution for 3D weakly supervised semantic segmentation.
Abstract: 3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs a Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models.
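The filtering step described above, dropping low-confidence pseudo labels that are not geometrically plausible, can be illustrated with a rough sketch. The abstract does not specify the geometric constraint, so this example substitutes a simple stand-in: a low-confidence point keeps its label only if the majority of its spatial neighbors agree. The function name, the threshold value, and the precomputed neighbor lists are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def refine_pseudo_labels(labels, confidences, neighbors, conf_threshold=0.8):
    """Keep confident labels; keep a low-confidence label only if the
    majority label among its neighbors agrees (stand-in geometric check)."""
    refined = []
    for i, (label, conf) in enumerate(zip(labels, confidences)):
        if conf >= conf_threshold:
            refined.append(label)
            continue
        neighbor_labels = [labels[j] for j in neighbors[i] if labels[j] is not None]
        majority = Counter(neighbor_labels).most_common(1)
        if label is not None and majority and majority[0][0] == label:
            refined.append(label)
        else:
            refined.append(None)  # discard: treated as unlabeled
    return refined


# Hypothetical toy scene: four points with precomputed neighbor indices.
labels = ["chair", "chair", "table", "chair"]
confidences = [0.9, 0.5, 0.4, 0.95]
neighbors = [[1, 3], [0, 3], [0, 1], [0, 1]]
refined = refine_pseudo_labels(labels, confidences, neighbors)
```

Here the low-confidence "table" point is surrounded by "chair" points and is discarded, while the low-confidence "chair" point survives because its neighbors agree. A self-training loop in this spirit would retrain on the refined labels and repeat, which is one way to read the label update strategy mentioned above.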
Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.
[80] Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods
Ghazal Danaee,Marc Niethammer,Jarrett Rushmore,Sylvain Bouix
Main category: cs.CV
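TL;DR: Evaluates three deep-learning segmentation models (UNesT, nnU-Net, CoTr) and one traditional atlas-based method (ANTs) on segmenting the nucleus accumbens in MRI across race and sex groups. Matching the race of the training data to the test subjects improves accuracy for some models, while nnU-Net is comparatively robust; model bias also affects volume analyses, preserving sex effects but removing race effects in most models.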
Details
Motivation: Examines fairness problems caused by data bias in medical image segmentation models, in particular performance differences by race and sex, in order to assess how fair and reliable existing models are across populations.
Method: Uses an MRI dataset with four subgroups (black female, black male, white female, white male) and manual annotations as the gold standard to evaluate UNesT, nnU-Net, CoTr, and ANTs; quantifies fairness with a purpose-built metric and uses linear mixed models to analyze the effects of race, sex, and their interaction on segmentation accuracy and volume estimates.
Result: When training and test data are race-matched, ANTs and UNesT segment significantly more accurately, whereas nnU-Net is stable and unaffected by demographic matching; sex-related volume effects appear with both manual and model segmentations, but race-related volume effects disappear in most models.
Conclusion: Race-matched data improves segmentation for some models, showing sensitivity to the training distribution; nnU-Net is more robust across populations; existing models may mask real biological differences between races, so model bias should be considered carefully when interpreting results in clinical use.
Abstract: Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models.
ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
[81] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy
Kazuki Kawamura,Kengo Nakai,Jun Rekimoto
Main category: cs.CV
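TL;DR: Presents ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, with facial video and audio from 241 participants watching professional performances. It reveals three distinct viewer response types and a positive viewing-order effect on appreciation, supporting emotion AI and personalized entertainment systems for non-Western contexts.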
Details
Motivation: Addresses the Western-centric bias in affective computing and promotes research on emotion recognition and entertainment systems in non-Western cultural contexts.
Method: Collects facial video and audio from 241 participants watching up to 10 manzai performances, applies k-means clustering to identify viewer types, analyzes viewing-order effects at the individual level, and combines automated humor classification with viewer response modeling.
Result: Identifies three viewer types: High and Stable Appreciators (72.8%), Low and Variable Decliners (13.2%), and Variable Improvers (14.0%); finds a significant positive viewing-order effect (p < 0.001), contradicting fatigue hypotheses; finds no significant differences between viewer types after FDR correction.
Conclusion: ManzaiSet supports the development of cross-cultural emotion AI and the design of personalized entertainment systems for non-Western cultures.
Abstract: We present ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched >= 8; analyses focus on n=228). This addresses the Western-centric bias in affective computing. Three key findings emerge: (1) k-means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown-Forsythe p < 0.001); (2) individual-level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer-level response modeling found no type-wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non-Western contexts.
[82] ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues
Prateek Gothwal,Deeptimaan Banerjee,Ashis Kumer Biswas
Main category: cs.CV
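TL;DR: Proposes ViBED-Net, a video-based student engagement detection framework with a dual-stream design that combines facial expressions with full-scene context, using EfficientNetV2 for spatial features and an LSTM or Transformer for temporal modeling; it reaches 73.43% accuracy on the DAiSEE dataset, surpassing existing methods.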
Details
Motivation: In online learning, accurately detecting student engagement is key to improving learning outcomes and personalizing instruction. Existing methods fall short in fusing facial and scene information and in handling temporal dynamics, so a more efficient and robust model is needed.
Method: Proposes ViBED-Net, a dual-stream architecture that processes face crops and full frames separately, extracts spatial features with EfficientNetV2, and models temporal change with an LSTM or a Transformer; it is trained on DAiSEE with data augmentation applied to low-frequency classes to ease class imbalance.
Result: On DAiSEE, the LSTM variant of ViBED-Net reaches 73.43% accuracy, outperforming current state-of-the-art methods and confirming the value of fusing facial and scene spatiotemporal features.
Conclusion: By integrating facial and scene spatiotemporal features, ViBED-Net markedly improves the accuracy of video-based engagement detection; its modular design eases adoption in education, user experience research, and related areas, offering a scalable, high-performing solution for real-world video-based affective computing.
Abstract: Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available at https://github.com/prateek-gothwal/ViBED-Net .
[83] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
Roberto Brusnicki,David Pop,Yuan Gao,Mattia Piccinini,Johannes Betz
Main category: cs.CV
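TL;DR: SAVANT is a structured reasoning framework that detects semantic anomalies in driving scenes with high accuracy through layered scene analysis and a two-phase pipeline.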
Details
Motivation: Autonomous driving systems remain highly vulnerable to rare, out-of-distribution scenarios with semantic anomalies, while existing Vision Language Model (VLM) approaches perform unreliably because they depend on expensive proprietary models and unstructured prompting, which limits practical deployment.
Method: Proposes the SAVANT framework, a two-phase pipeline of structured scene description extraction followed by multi-modal evaluation, with systematic analysis across four semantic layers (Street, Infrastructure, Movable Objects, and Environment), and uses a fine-tuned open-source 7B-parameter Qwen2.5VL model for efficient detection.
Result: On real-world driving scenarios SAVANT reaches 89.6% recall and 88.0% accuracy; the fine-tuned open-source model improves this to 90.8% recall and 93.8% accuracy, significantly outperforming baselines, and over 9,640 images are labeled automatically.
Conclusion: SAVANT turns VLM reasoning from ad-hoc prompting into systematic analysis, addresses data scarcity in anomaly detection, and offers a reliable, accessible path to semantic monitoring for autonomous driving systems.
Abstract: Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost.
By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
[84] TriggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation
Harshini Suresha,Kavitha SH
Main category: cs.CV
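TL;DR: Proposes TriggerNet, an explainable AI framework for early detection and classification of red palm mite infestation that combines several visualization techniques to make deep learning models for plant disease recognition more transparent and accurate.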
Details
Motivation: Red palm mite infestation severely reduces palm crop yields, so an efficient, accurate early detection method is needed to limit economic losses.
Method: The TriggerNet framework integrates the explainability techniques Grad-CAM, RISE, FullGrad, and TCAV; using an RGB image dataset covering 11 plant species, it applies deep learning models such as CNN, EfficientNet, and ResNet50 together with machine learning algorithms such as Random Forest and SVM for plant and disease classification, and uses Snorkel to label disease classes automatically through heuristic rules.
Result: The models perform well on plant classification and disease detection; TriggerNet provides useful visual explanations that increase trust in model decisions, and Snorkel substantially reduces manual annotation time and improves dataset reliability.
Conclusion: TriggerNet is an effective, explainable AI solution for early identification of red palm mite infestation with potential for wider use in agricultural disease management.
Abstract: The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing ML models for classifying the affected plants and detecting the infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species (Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant) was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.
[85] HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen,Marc Pollefeys,Daniel Barath,Iro Armeni
Main category: cs.CV
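TL;DR: Proposes HouseTour, a method that combines 3D camera trajectory generation with natural language description to produce spatially-aware 3D walkthrough videos and text summaries from a collection of images.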
Details
Motivation: Existing vision-language models perform poorly at geometric reasoning and struggle to generate spatially consistent videos and descriptions, so a method that incorporates 3D geometric information is needed.
Method: A diffusion model generates smooth camera trajectories constrained by known camera poses. The trajectory information is fed into a vision-language model to produce 3D-grounded descriptions, and 3D Gaussian splatting renders novel views to synthesize the final video.
Result: Experiments on a dataset of more than 1,200 house-tour videos show that incorporating 3D camera trajectories improves text generation quality, and a new joint metric is introduced to evaluate end-to-end performance.
Conclusion: HouseTour automatically generates high-quality property walkthrough videos and descriptions without specialized equipment or expertise, and is suited to practical applications such as real estate and tourism.
Abstract: We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and touristic applications without requiring specialized expertise or equipment.
[86] Chimera: Compositional Image Generation using Part-based Concepting
Shivam Singh,Yiming Chen,Agneet Chatterjee,Amit Raj,James Hays,Yezhou Yang,Chitra Baral
Main category: cs.CV
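TL;DR: Chimera is a personalized image generation model that composes new objects from specified parts of multiple source images according to textual instructions. With a dataset built on semantic atoms and a new PartEval metric, it outperforms baselines in part alignment, compositional accuracy, and visual quality.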
Details
Motivation: Existing personalized generative models cannot precisely control how specific parts of multiple source images are combined without user annotations, and they lack explicit part-level control.
Method: Chimera builds a taxonomy of 464 (part, subject) semantic atoms, generates 37k prompts from it, and synthesizes the corresponding training images. A diffusion prior model with part-conditional guidance steers the image-conditioning features to preserve semantic identity and spatial layout, and the PartEval metric quantifies generation quality.
Result: In human evaluations and on PartEval, Chimera outperforms the baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.
Conclusion: Chimera achieves fine-grained part-level compositional generation without mask inputs. Its semantic-atom data construction and part-aware training improve controllability and compositional ability in personalized generation.
Abstract: Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
[87] Big Data, Tiny Targets: An Exploratory Study in Machine Learning-enhanced Detection of Microplastic from Filters
Paul-Tiberiu Miclea,Martin Sboron,Hardik Vaghasiya,Hoang Thinh Nguyen,Meet Gadara,Thomas Schmid
Main category: cs.CV
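TL;DR: This exploratory study examines the potential, limitations, and future directions of combining scanning electron microscopy (SEM) imaging with machine-learning object detection to detect and quantify microplastics (MPs). In the filtration scenario studied, YOLO models differ in quality and optimized preprocessing matters, while the scarcity of expert-labeled data remains a major challenge.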
Details
Motivation: Microplastic pollution is widespread and may affect ecosystems and human health, but the particles' small size makes detection, classification, and removal difficult. Conventional methods rely on manual analysis and do not scale to large screening studies, so more efficient automated detection is needed.
Method: SEM imaging is combined with machine-learning object detection (YOLO models) to detect and quantify microplastic particles and fibres in a filtration scenario with symmetric, repetitive backgrounds. The study compares model performance and the effect of preprocessing optimization.
Result: YOLO models differ in performance on this task, and optimizing preprocessing has an important effect on detection quality. The main open problem is the lack of enough expert-labeled data for reliable training.
Conclusion: SEM combined with ML object detection shows promise for microplastic detection and can improve efficiency in specific settings, but models and preprocessing need further optimization, and the shortage of labeled data must be addressed before wide adoption.
Abstract: Microplastics (MPs) are ubiquitous pollutants with demonstrated potential to impact ecosystems and human health. Their microscopic size complicates detection, classification, and removal, especially in biological and environmental samples. While techniques like optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a sound basis for detection, applying these approaches usually requires manual analysis and prevents efficient use in large screening studies. To this end, machine learning (ML) has emerged as a powerful tool in advancing microplastic detection. In this exploratory study, we investigate the potential, limitations, and future directions of advancing the detection and quantification of MP particles and fibres using a combination of SEM imaging and machine learning-based object detection. For simplicity, we focus on a filtration scenario where image backgrounds exhibit a symmetric and repetitive pattern. Our findings indicate differences in the quality of YOLO models for the given task and the relevance of optimizing preprocessing. At the same time, we identify open challenges, such as limited amounts of expert-labeled data necessary for reliable training of ML models.
[88] Accelerating Vision Transformers with Adaptive Patch Sizes
Rohan Choudhury,JungEun Kim,Jinhyung Park,Eunho Yang,László A. Jeni,Kris M. Kitani
Main category: cs.CV
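TL;DR: Proposes the Adaptive Patch Transformer (APT), which uses patches of several different sizes within a single image to substantially speed up ViT training and inference while maintaining performance.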
Details
Motivation: Vision Transformers (ViTs) use fixed-size patches for every image region, which produces overly long input sequences for high-resolution images and wastes computation.
Method: An adaptive patching mechanism assigns patch sizes according to the complexity of the image content: larger patches in homogeneous regions and smaller patches in complex ones, which reduces the total number of tokens.
Result: Throughput increases by 40% on ViT-L and 50% on ViT-H. Training and inference are up to 30% faster on high-resolution dense tasks with no loss in performance, and APT can be applied to an already fine-tuned ViT, converging in as little as 1 epoch.
Conclusion: APT improves ViT efficiency across a range of vision tasks, balancing speed and performance.
Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
[89] From Volume Rendering to 3D Gaussian Splatting: Theory and Applications
Vitor Pereira Matias,Daniel Perazzo,Vinicius Silva,Alberto Raposo,Luiz Velho,Afonso Paiva,Tiago Novello
Main category: cs.CV
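TL;DR: This tutorial surveys 3D Gaussian Splatting (3DGS) for 3D reconstruction from posed images. It covers the efficient rendering mechanism built on explicit 3D Gaussian modeling and volumetric splatting, discusses limitations in memory footprint, baked-in lighting, and secondary-ray effects along with efforts to address them, and summarizes applications in surface reconstruction, avatar modeling, animation, and content generation.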
Details
Motivation: Although 3DGS renders novel views in real time, its high memory consumption, baked-in lighting, and limited support for secondary-ray effects restrict further use, which calls for a systematic review of its technical framework and improvement directions.
Method: Starting from the splatting formulation of 3DGS, the tutorial reviews the core pipeline and categorizes the main efforts that address its limitations.
Result: It gives an overview of the principles of 3DGS, existing improvements, and applications in several areas (such as surface reconstruction, animation, and content generation), highlighting efficient rendering and suitability for feed-forward pipelines.
Conclusion: As an emerging explicit 3D representation, 3DGS offers clear advantages in real-time rendering and integration with graphics pipelines. Future work should focus on reducing memory overhead, decoupling lighting, and supporting more complex light-transport effects.
Abstract: The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, thus offering seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation, highlighting its efficient rendering and suitability for feed-forward pipelines.
[90] Online In-Context Distillation for Low-Resource Vision Language Models
Zhiqi Kang,Rahaf Aljundi,Vaggelis Dorovatas,Karteek Alahari
Main category: cs.CV
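TL;DR: Proposes an online In-Context Distillation (ICD) method in which a small vision-language model distills knowledge from a stronger teacher through sparse demonstrations in low-resource settings. It improves performance by up to 33% using scarce teacher annotations (as low as 4%), and in-context learning outperforms fine-tuning under constrained compute budgets.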
Details
Motivation: Large vision-language models perform well but are impractical to deploy in low-resource settings, while small models are efficient but typically need costly fine-tuning to close the performance gap. A method that improves small models without extensive training is needed.
Method: Building on the in-context learning framework, online ICD transfers the teacher's knowledge to the student at inference time through sparse demonstrations. Cross-modal demonstration selection, teacher test-time scaling, and student uncertainty conditioning are used to populate the demonstration pool dynamically and reduce teacher queries.
Result: ICD boosts small VLM performance by up to 33% using scarce teacher annotations (as low as 4%) and competes with the teacher's zero-shot performance. The accompanying analysis shows ICL outperforming fine-tuning under limited compute budgets.
Conclusion: ICD offers an efficient, low-cost alternative for deploying vision-language models in low-resource scenarios and shows that combining inference-time distillation with in-context learning can narrow the gap between small and large models.
Abstract: As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
[91] SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving
Xiangbo Gao,Tzu-Hsiang Lin,Ruojing Song,Yuheng Wu,Kuan-Ru Huang,Zicheng Jin,Fangzhou Lin,Shinan Liu,Zhengzhong Tu
Main category: cs.CV
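TL;DR: This paper presents the first systematic study of full-stack safety and security in natural-language-based collaborative driving and proposes SafeCoop, an agentic defense pipeline that improves driving performance and attack detection under malicious attacks.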
Details
Motivation: Traditional V2X communication suffers from high bandwidth demands, semantic loss, and interoperability issues. Natural language is a promising new medium, but it introduces new vulnerabilities such as message loss, hallucinations, and semantic manipulation, which call for systematic safety research.
Method: The authors develop a comprehensive taxonomy of attack strategies and design the SafeCoop defense pipeline, which integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment.
Result: Evaluated in closed-loop CARLA simulation across 32 critical scenarios, SafeCoop achieves a 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection.
Conclusion: The study provides guidance for designing safe and reliable language-driven collaborative transportation systems and advances trustworthy intelligent transportation.
Abstract: Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is https://xiangbogaobarry.github.io/SafeCoop.
[92] World-in-World: World Models in a Closed-Loop World
Jiahan Zhang,Muqing Jiang,Nanru Dai,Taiming Lu,Arda Uzunoglu,Shunchi Zhang,Yana Wei,Jiahao Wang,Vishal M. Patel,Paul Pu Liang,Daniel Khashabi,Cheng Peng,Rama Chellappa,Tianmin Shu,Alan Yuille,Yilun Du,Jieneng Chen
Main category: cs.CV
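TL;DR: This paper introduces World-in-World, the first open platform that evaluates generative world models (WMs) in closed-loop environments for embodied decision making. It prioritizes task success over visual quality and reveals how visual quality, data scaling, and inference-time compute affect performance.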
Details
Motivation: Most existing benchmarks use open-loop protocols that focus only on the visual quality of world models and do not assess their actual utility in embodied tasks, so a closed-loop platform that reflects real agent-environment interaction is needed.
Method: The open World-in-World platform provides a unified online planning strategy and a standardized action API, curates four closed-loop environments, and uses task success as the primary metric to evaluate diverse world models. It also studies the effect of scaling post-training with action-observation data and of inference-time compute.
Result: Three findings: (1) visual quality alone does not guarantee task success, and controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generator; (3) allocating more inference-time compute substantially improves closed-loop performance. The paper also presents the first data scaling law for world models in embodied settings.
Conclusion: The value of world models for embodied decision making cannot be measured by visual quality alone; controllability, data scaling, and inference-time compute allocation matter more. World-in-World provides a standardized closed-loop benchmark for evaluating world models in practice.
Abstract: Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, and controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
[93] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset
Clementine Grethen,Simone Gasparini,Geraldine Morin,Jeremy Lebreton,Lucas Marti,Manuel Sanchez-Gestido
Main category: cs.CV
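TL;DR: This paper introduces LunarStereo, the first photorealistic stereo image dataset for 3D reconstruction of the lunar surface, and fine-tunes the MASt3R model on it, substantially improving 3D reconstruction and pose estimation in lunar conditions.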
Details
Motivation: Existing stereo reconstruction methods struggle with the Moon's low texture, difficult lighting, and atypical orbital trajectories, and mainstream deep learning models trained on human-scale terrestrial data do not transfer directly to lunar scenes.
Method: The LunarStereo dataset of simulated stereo images is built with ray tracing based on high-resolution topography and reflectance models, and MASt3R is adapted to the lunar domain by fine-tuning on it.
Result: Experiments on synthetic and real lunar data show significant improvements over zero-shot baselines in 3D surface reconstruction and relative pose estimation.
Conclusion: LunarStereo provides physically grounded data for lunar 3D reconstruction, and the fine-tuning strategy generalizes robustly across scales and environments, advancing perception in extraterrestrial settings.
Abstract: Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. The results demonstrate significant improvements over zero-shot baselines and pave the way for robust cross-scale generalization in extraterrestrial environments.
[94] VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis
Fatima AlGhamdi,Omar Alharbi,Abdullah Aldwyish,Raied Aljadaany,Muhammad Kamran J Khan,Huda Alamri
Main category: cs.CV
TL;DR: Proposes VelocityNet, a dual-pipeline framework for real-time detection of anomalous motion patterns in densely crowded scenes.
Details
Motivation: Existing methods adapt poorly to varying crowd densities and lack interpretable anomaly indicators, and they perform badly under severe occlusion and dynamic motion patterns.
Method: Head detection and dense optical flow are combined to extract person-specific velocities. Hierarchical clustering groups these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from normal patterns.
Result: Experiments show that the method detects diverse anomalous motion patterns in densely crowded environments, with good real-time performance and adaptability.
Conclusion: By introducing interpretable semantic velocity classes and adaptive anomaly scoring, VelocityNet achieves efficient and robust anomaly detection in complex crowds.
Abstract: Detecting anomalies in crowded scenes is challenging due to severe inter-person occlusions and highly dynamic, context-dependent motion patterns. Existing approaches often struggle to adapt to varying crowd densities and lack interpretable anomaly indicators. To address these limitations, we introduce VelocityNet, a dual-pipeline framework that combines head detection and dense optical flow to extract person-specific velocities. Hierarchical clustering categorizes these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from learned normal patterns. Experiments demonstrate the effectiveness of our framework in real-time detection of diverse anomalous motion patterns within densely crowded environments.
[95] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
Chengrun Li,Corentin Royer,Haozhe Luo,Bastian Wittmann,Xia Li,Ibrahim Hamamci,Sezgin Er,Anjany Sekuboyina,Bjoern Menze
Main category: cs.CV
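TL;DR: This paper introduces the RadDiagSeg-D dataset and the RadDiagSeg-M model for jointly generating diagnostic text and pixel-level segmentation masks to support assistive medical diagnosis.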
Details
Motivation: Existing medical vision-language models struggle to generate diagnostic text and pixel-level segmentation masks together, which limits their clinical value.
Method: The authors build RadDiagSeg-D, a unified hierarchical dataset covering multiple imaging modalities, and use it to develop RadDiagSeg-M, a vision-language model that jointly performs abnormality detection, diagnosis, and flexible segmentation.
Result: RadDiagSeg-M performs strongly across all components of multi-target text-and-mask generation, establishing a robust baseline.
Conclusion: RadDiagSeg-M provides rich contextual information and clinically useful outputs, moving medical vision-language models closer to clinical use.
Abstract: Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.
[96] EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation
Maryam Dialameh,Hossein Rajabzadeh,Jung Suk Sim,Hyock Ju Kwon
Main category: cs.CV
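TL;DR: This paper proposes EMA-SAM, a lightweight extension of SAM-2 that adds a confidence-weighted exponential moving average pointer to improve the temporal stability and accuracy of papillary thyroid microcarcinoma lesion segmentation in ultrasound video. It outperforms SAM-2 on a curated dataset and on external benchmarks with almost no extra compute, supporting real-time processing.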
Details
Motivation: Low contrast, probe-induced motion, and heat-related artifacts make lesion segmentation of papillary thyroid microcarcinoma (PTMC) during radio-frequency ablation difficult, and SAM-2 lacks temporal consistency, which causes unstable predictions and drift.
Method: EMA-SAM adds a confidence-weighted exponential moving average (EMA) pointer to the SAM-2 memory bank, forming a stable latent prototype of the tumour that preserves temporal coherence across frames and adapts quickly when new evidence appears.
Result: On the PTMC-RFA dataset, maxDice rises from 0.82 to 0.86 and maxIoU from 0.72 to 0.76, with 29% fewer false positives. On external video datasets including VTUS and colonoscopy, Dice improves by 2-5 points. The EMA pointer adds less than 0.1% FLOPs and keeps real-time throughput of about 30 FPS.
Conclusion: EMA-SAM is an efficient, robust tumour-tracking framework that addresses the temporal instability of foundation models in interventional ultrasound, combining accuracy with real-time performance and showing good clinical potential.
Abstract: Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce EMA-SAM, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves maxDice from 0.82 (SAM-2) to 0.86 and maxIoU from 0.72 to 0.76, while reducing false positives by 29%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2-5 Dice points over SAM-2. Importantly, the EMA pointer adds <0.1% FLOPs, preserving real-time throughput of ~30 FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound.
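The confidence-weighted exponential moving average described above can be illustrated with a minimal sketch. The abstract does not state the exact update rule, so the form used here (an effective step size of `alpha * confidence`), the function name `ema_update`, and the toy feature vectors are assumptions for illustration only, not the authors' implementation.

```python
from typing import List, Sequence


def ema_update(pointer: Sequence[float], feature: Sequence[float],
               confidence: float, alpha: float = 0.5) -> List[float]:
    """Blend a new frame feature into the running prototype (assumed form).

    The effective step size is alpha * confidence, so a low-confidence
    frame barely moves the pointer and a confident frame moves it quickly.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    step = alpha * confidence
    return [(1.0 - step) * p + step * f for p, f in zip(pointer, feature)]


# Toy 2-D "features" (hypothetical values): clear, clear, occluded, clear.
frames = [([1.0, 0.0], 0.9), ([1.0, 0.0], 0.9),
          ([0.0, 5.0], 0.05), ([1.0, 0.2], 0.9)]
pointer = list(frames[0][0])  # initialise from the first frame
for feature, confidence in frames[1:]:
    pointer = ema_update(pointer, feature, confidence)
```

In this toy run the occluded third frame has confidence 0.05, so its outlying feature shifts the pointer only slightly, while the confident fourth frame pulls it back toward the clear observations.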
Code is available at https://github.com/mdialameh/EMA-SAM.
[97] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar,Leon Gatys,Mona Abdelrahman,Mar Jacobo,Larry Lindsey,Rutika Moharir,Gunnar Lund,Yang Xu,Navid Shiee,Jeffrey Bigham,Charles Maalouf,Joseph Yitan Cheng
Main category: cs.CV
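TL;DR: This paper presents Vision Language Safety Understanding (VLSU), a framework for systematically evaluating the safety of multimodal foundation models, with a focus on risks that arise from joint interpretation of images and text. Using a large-scale benchmark of 8,187 samples, it finds that current models degrade sharply when joint reasoning is required, exposing missing compositional reasoning and gaps in safety alignment.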
Details
Motivation: Existing multimodal safety evaluations usually treat vision and language inputs separately, missing cases where benign content becomes harmful in combination. They also fail to distinguish clearly unsafe content from borderline cases, leading to over-blocking or to under-refusal of genuinely harmful content.
Method: VLSU combines fine-grained severity classification with combinatorial analysis across 17 safety patterns. A multi-stage pipeline with real-world images and human annotation yields a large-scale benchmark (8,187 samples spanning 15 harm categories), on which eleven state-of-the-art models are evaluated.
Result: Models exceed 90% accuracy on clear unimodal safety signals but drop to 20-55% when joint image-text reasoning is required. 34% of errors occur even though each modality is classified correctly on its own, indicating absent compositional reasoning. Models also struggle to balance refusing unsafe content against responding to borderline cases: instruction framing lowers over-blocking on borderline content but substantially lowers the refusal rate on unsafe content.
Conclusion: VLSU exposes key weaknesses of current multimodal models in joint image-text understanding and safety alignment, and provides a test bed for future research on robust vision-language safety.
Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
[98] Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis
Xinhao Cai,Liulei Li,Gensheng Pei,Tao Chen,Jinshan Pan,Yazhou Yao,Wenguan Wang
Main category: cs.CV
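TL;DR: Proposes a generation-based debiasing framework that introduces a representation score (RS) to diagnose representational gaps and uses visual blueprints with a generative alignment strategy to synthesize high-quality, unbiased layouts, improving detector performance on rare and large objects.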
Details
Motivation: Existing debiasing methods are limited by the representational diversity of samples. Simply adding data for rare classes does not meet a model's true data needs, and current layout-to-image synthesis lacks the fidelity and control required.
Method: A representation score (RS) quantifies under-representation and guides the creation of unbiased layouts. A precise visual blueprint replaces ambiguous text prompts, and a generative alignment strategy lets the detector and generator work together.
Result: Large/rare instances improve by 4.4/3.6 mAP over the baseline, and layout accuracy in generated images surpasses prior L2I models by 15.9 mAP.
Conclusion: The method mitigates bias in object detection and, through more precise generation control and detector-generator collaboration, improves detection of underrepresented object groups.
Abstract: This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, e.g., improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.
[99] DeepSeek-OCR: Contexts Optical Compression
Haoran Wei,Yaofeng Sun,Yukun Li
Main category: cs.CV
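TL;DR: DeepSeek-OCR is an initial investigation into compressing long contexts via optical 2D mapping. It consists of DeepEncoder and a DeepSeek3B-MoE-A570M decoder, achieves high OCR precision with efficient vision-token compression, and shows promise for historical long-context compression and memory mechanisms in LLMs.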
Details
Motivation: To explore the feasibility of long-context compression, reduce the computational burden caused by too many vision tokens at high input resolution, and make large-scale OCR systems more efficient.
Method: DeepEncoder is the core encoder, designed to keep activations low under high-resolution input while achieving high compression ratios. DeepSeek3B-MoE-A570M serves as the decoder for OCR, and the number of vision tokens is controlled to manage performance.
Result: OCR precision reaches 97% at compression ratios below 10x and stays at about 60% at 20x. On OmniDocBench, the model surpasses GOT-OCR2.0 and MinerU2.0 with fewer vision tokens, and a single A100-40G can process 200k+ pages per day.
Conclusion: DeepSeek-OCR demonstrates the feasibility of long-context compression via optical 2D mapping, is both efficient and practical, and opens new directions for generating LLM/VLM training data and studying memory mechanisms.
Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (on a single A100-40G). Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
[100] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining
Ajinkya Khoche,Gergő László Nagy,Maciej Wozniak,Thomas Gustafsson,Patric Jensfelt
Main category: cs.CV
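TL;DR: This paper proposes BlendCLIP, a multimodal pretraining framework that combines synthetic data with real LiDAR data. With a curriculum-based data mixing strategy, it substantially improves zero-shot 3D object classification using very few real samples and reaches state-of-the-art results on outdoor datasets such as nuScenes.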
Details
Motivation: Methods trained on synthetic data generalize poorly to real outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects, so the gap between synthetic and real domains must be bridged.
Method: A pipeline builds a large-scale dataset of object-level triplets (point cloud, image, text) from real driving data, and a curriculum-based data mixing strategy first pretrains on semantically rich synthetic CAD data and then progressively introduces real LiDAR data.
Result: Introducing as few as 1.5% real samples per batch raises zero-shot accuracy on nuScenes by 27%. The final model achieves state-of-the-art results on nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes while still generalizing well on synthetic benchmarks.
Conclusion: Effective domain adaptation, rather than full-scale real-world annotation, is the key to robust open-vocabulary 3D perception.
Abstract: Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, but it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets (each consisting of a point cloud, image, and text description) mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks.
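The curriculum-based mixing described above can be sketched as a schedule for the share of real samples in each batch. The abstract only says that training starts from synthetic CAD data and progressively adapts to real scans, so the linear ramp, the function names `real_fraction` and `mixed_batch`, and all parameter values below are hypothetical choices for illustration rather than the paper's actual schedule.

```python
import random
from typing import List, Sequence, Tuple


def real_fraction(step: int, warmup_steps: int, ramp_steps: int,
                  max_fraction: float) -> float:
    """Share of real samples per batch: 0 during warm-up, then a linear ramp.

    The linear shape is an assumption; ramp_steps must be positive.
    """
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_fraction * progress


def mixed_batch(synthetic: Sequence[str], real: Sequence[str], batch_size: int,
                fraction: float, rng: random.Random) -> List[Tuple[str, str]]:
    """Draw one batch containing round(batch_size * fraction) real samples."""
    n_real = round(batch_size * fraction)
    batch = [("real", rng.choice(real)) for _ in range(n_real)]
    batch += [("synthetic", rng.choice(synthetic))
              for _ in range(batch_size - n_real)]
    rng.shuffle(batch)  # avoid a fixed real/synthetic ordering inside the batch
    return batch


# Hypothetical sample identifiers standing in for (point cloud, image, text) triplets.
rng = random.Random(0)
batch = mixed_batch(["cad_chair", "cad_car"], ["lidar_truck", "lidar_bus"],
                    batch_size=200, fraction=0.015, rng=rng)
```

With a batch of 200 and a real fraction of 1.5%, the batch holds 3 real samples and 197 synthetic ones, which gives a sense of how small the real-data share quoted in the abstract is.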
Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance at https://github.com/kesu1/BlendCLIP.
[101] OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Tianyu Huang,Runnan Chen,Dongting Hu,Fengming Huang,Mingming Gong,Tongliang Liu
Main category: cs.CV
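TL;DR: This paper proposes OpenInsGaussian, an open-vocabulary instance Gaussian segmentation framework that uses context-aware cross-view fusion to achieve state-of-the-art results in open-vocabulary 3D Gaussian segmentation.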
Details
Motivation: Existing 3D semantic Gaussian Splatting methods that project features from 2D vision models lack sufficient contextual cues during preprocessing and suffer from inconsistencies and missing details when fusing multi-view features.
Method: Two modules are proposed: Context-Aware Feature Extraction, which enriches each mask with semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to reduce alignment errors and incompleteness.
Result: Experiments on benchmark datasets show that OpenInsGaussian outperforms existing baselines by a large margin in open-vocabulary 3D Gaussian segmentation.
Conclusion: The method is robust and general, advancing 3D scene understanding and its deployment in diverse real-world scenarios.
Abstract: Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
[102] Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery
Xiang Zhang,Suping Wu,Weibin Qiu,Zhaocheng Jin,Sheng Yang
Main category: cs.CV
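TL;DR: Proposes a video-based 3D human mesh recovery method built on hyperbolic space learning and temporal motion priors, which captures the hierarchical structure of human meshes and improves reconstruction accuracy and smoothness.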
Details
Motivation: Existing video-based 3D human mesh recovery methods learn features in Euclidean space, which makes it hard to accurately capture the body's inherent hierarchical structure (such as torso-limbs-fingers) and leads to reconstruction errors.
Method: A temporal motion prior extraction module fuses the input 3D pose sequences and image feature sequences to strengthen temporal representation. In hyperbolic space, this prior is combined with 3D pose and pose motion information to optimize the learned mesh features, and a hyperbolic mesh optimization loss is proposed to stabilize learning.
Result: Experiments on multiple public datasets show that the method outperforms most existing state-of-the-art methods and yields more accurate, smoother human mesh reconstruction.
Conclusion: Introducing hyperbolic space learning and temporal motion priors effectively models the hierarchical structure of 3D human meshes and significantly improves video-based 3D human mesh recovery.
Abstract: 3D human meshes show a natural hierarchical structure (like torso-limbs-fingers). However, existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space, where it is hard to capture this hierarchical structure accurately, so erroneous human meshes are reconstructed. To solve this problem, we propose a hyperbolic space learning method leveraging temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module. This module extracts the temporal motion features from the input 3D pose sequences and image feature sequences respectively, then combines them into the temporal motion prior. In this way, it strengthens the ability to express features in the temporal motion dimension. Since data representation in non-Euclidean space (especially hyperbolic space) has been shown to effectively capture hierarchical relationships in real-world datasets, we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior information to assist learning, and uses 3D pose and pose motion information respectively in the hyperbolic space to optimize and learn the mesh features. Then, we combine the optimized results to get an accurate and smooth human mesh. Besides, to make the optimization learning process of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority in comparison with most state-of-the-art methods.
[103] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Da Zhang,Chenggang Rong,Bingyu Li,Feiyu Wang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
Main category: cs.CV
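TL;DR: This paper presents UWBench, a large-scale benchmark designed for underwater vision-language understanding. It contains 15,003 high-resolution images with rich annotations, intended to advance research on and application of large vision-language models in underwater environments.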
Details
Motivation: Existing vision-language models mainly target natural scenes, whereas underwater imagery poses distinct challenges such as light attenuation, color distortion, and interference from suspended particles, and no dedicated benchmark exists. A high-quality multimodal benchmark for underwater environments is therefore needed.
Method: UWBench, a large-scale underwater vision-language benchmark, is built from 15,003 high-resolution images from diverse underwater environments, with 15,281 referring expressions and 124,983 question-answer pairs. Three evaluation tasks are defined on it: detailed image captioning, visual grounding, and visual question answering.
Result: Extensive experiments on state-of-the-art vision-language models show that current models perform poorly on underwater understanding and leave substantial room for improvement, confirming that the benchmark is both challenging and needed.
Conclusion: UWBench provides an important resource for underwater vision-language understanding and supports multimodal AI for marine science, ecological monitoring, and autonomous underwater exploration.
Abstract: Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
[104] Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization
Xiang Zhang,Suping Wu,Sheng Yang
Main category: cs.CV
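TL;DR: Proposes a two-stage network for 3D human mesh recovery based on latent information and low-dimensional learning, improving the accuracy of reconstructed details while reducing computational cost.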
Details
Motivation: Existing methods do not fully exploit latent information (such as human motion and shape alignment), which causes limb misalignment and insufficient local detail, and attention-based methods are computationally expensive.
Method: The first stage mines global and local information from the low- and high-frequency components of image features and aggregates it into a hybrid latent frequency domain feature. The second stage uses this feature for 3D pose estimation and applies a low-dimensional mesh-pose interaction mechanism to refine shape.
Result: Experiments on public datasets show that the method outperforms current state-of-the-art methods in reconstruction accuracy while significantly reducing computational cost.
Conclusion: The proposed two-stage low-dimensional learning framework improves both the quality and the efficiency of 3D human mesh recovery.
Abstract: Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low and high-frequency components of image features and aggregates this information into a hybrid latent frequency domain feature. This strategy effectively extracts latent information. Subsequently, the extracted hybrid latent frequency domain features are used to enhance 2D-to-3D pose learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh pose interaction methods, we design a low-dimensional mesh pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority compared with most state-of-the-art methods.
[105] TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation
Yucheng Song,Chenxi Li,Haokang Ding,Zhining Liao,Zhifang Liao
Main category: cs.CV
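TL;DR: This paper proposes TreeFedDG, a new federated domain generalization framework for medical image segmentation that addresses the global drift problem (FedDG-GD) in cross-domain scenarios.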
Details
Motivation: In cross-domain medical image segmentation, conventional federated learning aggregates information unevenly across clients, which causes the global model to drift and hurts generalization.
Method: The TreeFedDG framework uses a tree topology for hierarchical parameter aggregation, introduces a parameter difference-based style mixing method (FedStyle), adopts a progressive personalized fusion strategy, and at inference time uses feature similarity to retrieve the best model chain for ensemble decision-making.
Result: Experiments on two public datasets show that the method outperforms existing domain generalization methods and achieves more balanced cross-domain performance.
Conclusion: TreeFedDG effectively alleviates global drift in federated domain generalization and improves robustness and generalization on heterogeneous medical data.
Abstract: In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
[106] StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen,Keda Tao,Kele Shao,Huan Wang
Main category: cs.CV
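TL;DR: This paper proposes StreamingTOM, a training-free, plug-and-play two-stage framework that addresses the causality and accumulation bottlenecks in streaming video vision-language models, substantially reducing KV cache size, peak memory, and time-to-first-token latency while maintaining leading accuracy.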
Details
Motivation: Streaming video understanding faces two challenges: causality (future frames are inaccessible) and accumulation (tokens grow without bound). Existing methods only optimize the post-LLM KV cache and ignore the high cost of pre-LLM prefill.
Method: StreamingTOM has two stages: (1) Causal Temporal Reduction keeps only the key visual tokens of each frame, selected by adjacent-frame changes and saliency, to cut prefill cost; (2) Online Quantized Memory stores tokens in 4-bit format, retrieves them on demand, and dequantizes them, keeping the active KV cache bounded.
Result: Compared with the prior SOTA, the method achieves 15.7x KV cache compression, 1.2x lower peak memory, and 2x faster TTFT. Among training-free methods it reaches 63.8% accuracy on offline benchmarks and 55.8%/3.7 on RVS.
Conclusion: The two-stage design resolves the efficiency bottlenecks of streaming video processing, delivering predictable latency and bounded resource growth while preserving high accuracy, which makes it practically useful.
Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
[107] Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models
Vishal Vinod
Main category: cs.CV
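TL;DR: This paper proposes a few-shot, identity-preserving face attribute editing method built on 3D-aware generative models. It estimates attribute-specific edit directions in latent space to achieve high-quality, view-consistent 3D face editing.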
Details
Motivation: Existing 3D face editing methods are limited by the need for labelled data, the trade-off between resolution and editability, and the difficulty of staying consistent across poses, and they lack efficient identity-preserving editing.
Method: The approach combines 3D-aware generative models with 2D portrait editing techniques, identifies edit directions in latent space from a small number of attribute-labelled images (10 or fewer), uses a face dataset with masks to generate the synthetic samples needed to estimate those directions, and explores one-shot stylization and a continuous style manifold.
Result: Experiments show that ten or fewer labelled images suffice for effective 3D-aware attribute editing, covering illumination, eyeglasses, hairstyle, expression, age, and more, with good view consistency and identity preservation.
Conclusion: The method greatly reduces the dependence of 3D face attribute editing on large labelled datasets and enables efficient, linear, identity-preserving 3D-aware face editing in the few-shot setting.
Abstract: Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression, etc., while preserving the identity of the face. Recent progress in 2D generative models has enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions.
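The abstract does not spell out how an edit direction is estimated from the few labelled examples. A minimal sketch of one common, simple choice (an assumption here, not the paper's confirmed procedure) is the normalized difference between the mean latent code of images with the attribute and the mean latent code of images without it. The function names and the toy 3-dimensional latents below are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def estimate_edit_direction(with_attr, without_attr):
    """Unit vector pointing from the mean 'without' latent to the mean 'with' latent.

    Assumption: a mean-difference direction; the paper may estimate it differently.
    """
    mu_pos = mean_vector(with_attr)
    mu_neg = mean_vector(without_attr)
    diff = [a - b for a, b in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical mean latents")
    return [d / norm for d in diff]

def apply_edit(latent, direction, strength):
    """Move a latent code along the edit direction by the given strength."""
    return [z + strength * d for z, d in zip(latent, direction)]

# Toy latents standing in for a handful of labelled examples (hypothetical values).
with_glasses = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_glasses = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = estimate_edit_direction(with_glasses, without_glasses)
edited = apply_edit([0.5, 0.5, 0.5], direction, strength=2.0)
```

Scaling `strength` moves the latent further along the same line, which is the sense in which such edits are linear.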
Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
[108] GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation
Tuan Pham,Thanh-Tung Le,Xiaohui Xie,Stephan Mandt
Main category: cs.CV
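TL;DR: Proposes a new stereo-guided framework for metric depth estimation that combines pretrained latent diffusion models with stereo geometric constraints to improve the absolute-scale accuracy of monocular depth estimation without any training.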
Details
Motivation: Existing diffusion-based monocular depth estimation methods predict relative depth well but struggle with absolute metric depth because of scale ambiguity in single-image scenarios.
Method: Depth estimation is recast as an inverse problem that uses a pretrained latent diffusion model conditioned on RGB images, together with stereo-based geometric constraints, to learn the scale and shift parameters for depth recovery. The method is training-free and integrates directly into existing frameworks.
Result: Extensive experiments show strong performance in indoor, outdoor, and complex environments, with particular gains over existing methods in challenging scenes containing translucent and specular surfaces, all without retraining.
Conclusion: The proposed method is a general, training-free solution that improves diffusion models on absolute depth estimation and generalizes well across scenes.
Abstract: We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
[109] Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models
Lehan Wang,Yi Qin,Honglong Yang,Xiaomeng Li
Main category: cs.CV
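TL;DR: Proposes Med-RwR, the first multimodal medical reasoning-with-retrieval framework, which draws on both visual and textual information to proactively retrieve external knowledge and improves the reasoning ability and generalization of medical large models.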
Details
Motivation: Existing medical multimodal large models rely on internal knowledge when reasoning, which makes them prone to hallucination and factual errors. Existing retrieval-augmented methods are limited to a single modality and ignore the visual information that is critical during reasoning.
Method: A two-stage reinforcement learning strategy encourages the model to use both visual diagnostic findings and textual clinical information when retrieving external knowledge during reasoning, and a Confidence-Driven Image Re-retrieval (CDIR) method is proposed for test-time scaling.
Result: The method significantly outperforms baseline models on several public medical benchmarks and achieves an 8.8% gain on the authors' echocardiography benchmark ECBench, demonstrating strong generalization.
Conclusion: By combining multimodal information with external knowledge retrieval, Med-RwR improves the reasoning accuracy and cross-domain adaptability of medical multimodal large models.
Abstract: Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
[110] The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Liangyu Chen,James Burgess,Jeffrey J Nirschl,Orr Zohar,Serena Yeung-Levy
Main category: cs.CV
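TL;DR: This study examines how image resolution affects multimodal large language models (MLLMs) in biomedical applications. Native-resolution training and inference significantly improve performance, a mismatch between training and inference resolutions severely degrades it, and mixed-resolution training mitigates the mismatch while balancing computational cost against performance.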
Details
Motivation: Most existing multimodal large language models are designed for general-purpose low-resolution images and may lose critical information when analyzing high-resolution biomedical images, so the effect of resolution on model performance needs systematic study.
Method: Experiments on multiple biomedical tasks evaluate MLLM performance under different resolution settings, including native-resolution training and inference, mismatched resolutions, and mixed-resolution training.
Result: (1) Native-resolution training and inference significantly improve performance; (2) a mismatch between training and inference resolutions causes severe degradation; (3) mixed-resolution training effectively mitigates the mismatch and strikes a good balance between computational cost and performance.
Conclusion: Biomedical MLLMs should prioritize native-resolution inference and mixed-resolution training datasets to optimize performance in research and clinical applications.
Abstract: Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
[111] OmniNWM: Omniscient Driving Navigation World Models
Bohan Li,Zhuang Ma,Dalong Du,Baorui Peng,Zhujin Liang,Zhenqiang Liu,Chao Ma,Yueming Jin,Hao Zhao,Wenjun Zeng,Xin Jin
Main category: cs.CV
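TL;DR: This paper proposes OmniNWM, a unified panoramic navigation world model that performs strongly across state, action, and reward. It supports multimodal long-horizon generation, precise control, and rule-based dense rewards derived from 3D occupancy, significantly improving generation quality and closed-loop evaluation for autonomous driving world models.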
Details
Motivation: Existing autonomous driving world models are limited in state modalities, sequence length, action control precision, and reward awareness, and fall short of practical needs.
Method: OmniNWM (1) jointly generates panoramic videos of RGB, semantics, depth, and 3D occupancy; (2) uses a flexible forcing strategy for high-quality long-horizon auto-regressive generation; (3) introduces a normalized panoramic Plücker ray-map representation for pixel-level precise action control; and (4) uses the generated 3D occupancy to define rule-based dense rewards that improve driving compliance and safety.
Result: Experiments show that OmniNWM reaches SOTA in video generation quality, control accuracy, and long-horizon stability, and provides a reliable closed-loop evaluation framework through occupancy-grounded rewards.
Conclusion: OmniNWM integrates state, action, and reward modeling in a unified framework, offering a more comprehensive, reliable, and scalable solution for autonomous driving world models.
Abstract: Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plücker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
[112] Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Jinlin Li,Yuran Wang,Yifei Yuan,Xiao Zhou,Yingying Zhang,Xixian Yong,Yefeng Zheng,Xian Wu
Main category: cs.CV
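TL;DR: Proposes Adaptive Token Ensemble Decoding (ATED), a training-free method that aggregates the predictions of multiple large vision-language models during inference to reduce object hallucination.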
Details
Motivation: Large vision-language models (LVLMs) perform well on multimodal tasks but are prone to object hallucination, and existing methods are limited in scalability, adaptability, and model independence.
Method: Adaptive Token Ensemble Decoding (ATED) ensembles at the token level, dynamically computes uncertainty-based weights for each model, and fuses diverse decoding paths to strengthen contextual grounding and semantic consistency.
Result: Experiments on standard hallucination detection benchmarks show that ATED significantly outperforms existing state-of-the-art methods, reducing hallucination while keeping the generated content fluent and relevant.
Conclusion: ATED is an effective, training-free solution that improves the robustness of LVLMs and offers a new direction for the reliability of vision-language models in high-stakes applications.
Abstract: Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
[113] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net
Gao Yu Lee,Tanmoy Dam,Md Meftahul Ferdaus,Daniel Puiu Poenar,Vu Duong
Main category: cs.CV
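TL;DR: This paper proposes the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net) to address data scarcity, high intra-class variation, and high inter-class similarity in few-shot disaster image classification. By combining probability-distribution distance measures with contrastive learning, it achieves superior performance on several benchmark and disaster datasets.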
Details
Motivation: Disaster imagery is scarce and hard to collect, and it shows high intra-class variation and inter-class similarity, so conventional few-shot learning methods perform poorly in real disaster recognition and a more robust method is needed.
Method: ATTBHFA-Net linearly combines the Bhattacharyya coefficient and the Hellinger distance to compare and aggregate feature probability distributions into more robust prototypes. A contrastive loss based on these two measures is optimized jointly with cross-entropy to improve few-shot classification.
Result: Experiments on four general few-shot benchmarks and two disaster remote-sensing image datasets show that ATTBHFA-Net surpasses existing methods in classification accuracy and generalization.
Conclusion: By performing feature aggregation and contrastive learning in the space of probability distributions, ATTBHFA-Net improves few-shot disaster image classification and shows good application prospects.
Abstract: The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data from the difficulties in collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points.
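For reference, the two quantities named above have standard closed forms for discrete distributions: the Bhattacharyya coefficient is BC(p, q) = sum_i sqrt(p_i * q_i), and the Hellinger distance is H(p, q) = sqrt(1 - BC(p, q)). The sketch below computes both in plain Python. The `combined_score` mixing rule and its `alpha` weight are hypothetical illustrations of a linear combination, not the paper's actual formulation.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1 (a discrete distribution)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i): 1 for identical distributions, 0 for disjoint support."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """H(p, q) = sqrt(1 - BC(p, q)), bounded in [0, 1]."""
    bc = bhattacharyya_coefficient(p, q)
    # Clamp at zero to absorb floating-point round-off when BC is marginally above 1.
    return math.sqrt(max(0.0, 1.0 - bc))

def combined_score(p, q, alpha=0.5):
    """Hypothetical linear mix alpha * (1 - BC) + (1 - alpha) * H; lower means more similar."""
    bc = bhattacharyya_coefficient(p, q)
    return alpha * (1.0 - bc) + (1.0 - alpha) * hellinger_distance(p, q)

# Two toy feature distributions (hypothetical values).
p = normalize([4, 3, 2, 1])
q = normalize([1, 2, 3, 4])
```

Both measures are symmetric in their arguments, so the order of the query and prototype distributions does not matter in this sketch.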
Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
[114] ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation
Kaiyuan Tan,Yingying Shen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
Main category: cs.CV
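TL;DR: Proposes a four-stage method for realistic view extrapolation in autonomous driving that combines data-driven strategies with geometric priors. It took first place in the ICCV 2025 RealADSim challenge.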
Details
Motivation: Existing novel view synthesis methods tend to produce distortions and inconsistencies when extrapolating street views, and cannot meet the need for high-fidelity view extrapolation in closed-loop autonomous driving simulation.
Method: A four-stage pipeline is used: (1) data-driven initialization that generates a pseudo-LiDAR point cloud; (2) a two-dimensional signed distance field (2D-SDF) that models road geometry as a prior; (3) a generative prior that creates pseudo ground truth for extrapolated viewpoints to provide auxiliary supervision; (4) a data-driven adaptation network that removes time-specific artifacts.
Result: The method obtains a final score of 0.441 on the RealADSim-NVS benchmark, ranking first.
Conclusion: By fusing geometric and generative priors, the method significantly improves the stability and realism of street-view extrapolation and suits autonomous driving simulation systems.
Abstract: Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution, which took first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxiliary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.
[115] GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
Yudong Li,Hao Li,Xianxu Hou,Linlin Shen
Main category: cs.CV
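TL;DR: This paper proposes a generative pre-training model for facial knowledge learning built on large-scale web data. It jointly models text and images through self-supervised tasks and supports controllable generation and a variety of face editing tasks.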
Details
Motivation: Existing facial pre-training models depend on manually annotated data, which is costly and limits generalization, so a scalable pre-training method that does not need large-scale annotation is required.
Method: Large-scale image-text face data crawled from the internet is used for pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). At generation time, the ITM loss steers the generation distribution toward the control signal to enable controllable generation.
Result: The model performs comparably to the best existing pre-training models on downstream tasks such as attribute classification and expression recognition, and applies to editing tasks including face attribute editing, expression manipulation, mask removal, and image inpainting.
Conclusion: The proposed generative pre-training framework makes effective use of unlabelled web data for facial knowledge learning, offers both discriminative and generative capability, and is practical and extensible.
Abstract: Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribute classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
[116] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Jiayu Zhang,Qilang Ye,Shuo Ye,Xun Lin,Zihan Song,Zitong Yu
Main category: cs.CV
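TL;DR: Proposes AV-Master, a new framework that dynamically models the temporal and modality dimensions to strengthen the extraction of key information in audio-visual question answering.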
Details
Motivation: Existing methods lack flexibility and dynamic adaptability in temporal sampling and modality preference awareness, which makes it hard to focus on key information for a given question and limits reasoning in complex scenes.
Method: A dynamic adaptive focus sampling mechanism concentrates on the most relevant audio-visual segments along the temporal dimension, and a modality preference-aware strategy models each modality's contribution independently. A dual-path contrastive loss reinforces consistency and complementarity across the temporal and modality dimensions.
Result: Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially on complex reasoning tasks.
Conclusion: By dynamically adjusting temporal and modality focus, AV-Master improves cross-modal collaborative representation learning and reasoning in complex audio-visual question answering scenes.
Abstract: Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
[117] Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Yi-Lun Wu,Bo-Kai Ruan,Chiang Tseng,Hong-Han Shuai
Main category: cs.CV
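TL;DR: This paper proposes Diffusion-DRO, a preference learning framework grounded in inverse reinforcement learning for aligning text-to-image diffusion models with human preferences. By casting preference learning as a denoising ranking problem, it avoids both a reward model and the estimation issues caused by the sigmoid non-linearity, and it combines offline expert data with online-generated negative samples to improve generation quality.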
Details
Motivation: Existing DPO methods avoid the REINFORCE algorithm but still suffer from inaccurate image probability estimation and limited diversity in offline data, so a more stable and efficient preference learning method is needed.
Method: Diffusion-DRO models preference learning as a ranking problem with a denoising objective, and trains on offline expert demonstrations together with negative samples generated by the online policy.
Result: Experiments show that Diffusion-DRO outperforms existing state-of-the-art baselines on a range of quantitative metrics and in user studies, especially on challenging and unseen prompts.
Conclusion: Diffusion-DRO resolves the non-linear estimation and data diversity issues of existing preference learning and significantly improves the alignment of text-to-image generation models.
Abstract: Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
[118] Learning Human-Object Interaction as Groups
Jiajun Hong,Jianan Wei,Wenguan Wang
Main category: cs.CV
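TL;DR: Proposes GroupHOI, a framework that models human-object interaction from a group perspective, propagating contextual information through geometric proximity and semantic similarity and outperforming existing methods on multiple benchmarks.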
Details
Motivation: Existing methods focus mainly on pairwise relationships and overlook the collective interactions among multiple humans and objects in real-world scenes.
Method: A learnable proximity estimator clusters humans and objects based on spatial features, and self-attention within each group aggregates contextual cues. Local contextual cues are also introduced into the interaction decoder to strengthen semantic-similarity modeling.
Result: Achieves leading performance on HICO-DET, V-COCO, and the more challenging NVI-DET task.
Conclusion: GroupHOI captures higher-order interactions through group modeling and shows superior performance on human-object interaction detection.
Abstract: Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.
[119] FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Duoxun Tang,Xi Xiao,Guangwu Hu,Kangkang Sun,Xiao Yang,Dongyang Chen,Qing Li,Yongjie Yin,Jiyao Wang
Main category: cs.CV
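TL;DR: Proposes FeatureFool, a zero-query, feature-map-based black-box attack that perturbs DNN models in the video domain without any interaction, achieving a high attack success rate while producing high-quality, hard-to-perceive adversarial videos.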
Details
Motivation: Existing black-box adversarial attacks usually require many rounds of interaction with the model and consume large numbers of queries, which is unrealistic in practice, and no existing method directly uses feature maps to shift the feature space of clean videos.
Method: Proposes FeatureFool, which directly uses information extracted by a DNN to perturb the feature space of a video without any queries, yielding a zero-query black-box attack, and exploits the transferability of feature maps to attack Video-LLMs.
Result: Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers and can bypass Video-LLM recognition. The generated adversarial videos score well on SSIM, PSNR, and temporal consistency metrics and have high visual quality.
Conclusion: FeatureFool is an efficient, stealthy zero-query video adversarial attack and the first to manipulate the feature space without queries in the video domain. It threatens both traditional video models and Video-LLMs and exposes potential security vulnerabilities of deep neural networks in video understanding.
Abstract: The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
[120] Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Yuqing Luo,Yixiao Li,Jiang Liu,Jun Fu,Hadi Amirpour,Guanghui Yue,Baoquan Zhao,Padraig Corcoran,Hantao Liu,Wei Zhou
Main category: cs.CV
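TL;DR: This paper proposes CM-SSA, a new image complexity assessment method that improves performance through cross-modal scene semantic alignment, making predictions more consistent with subjective human perception.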
Details
Motivation: Existing image complexity assessment methods mostly rely on hand-crafted or shallow features of a single visual modality, which fail to capture perception-related complexity, and the potential of cross-modal semantic information for perceptual tasks has not yet been explored in this area.
Method: Proposes CM-SSA, which consists of a complexity regression branch and a scene semantic alignment branch. The alignment branch aligns images with text prompts through pair-wise learning and guides the regression branch in predicting complexity.
Result: Experiments on several image complexity assessment datasets show that CM-SSA significantly outperforms state-of-the-art methods.
Conclusion: Cross-modal scene semantic information helps improve the accuracy of image complexity assessment and brings it closer to subjective human perception.
Abstract: Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, cross-modal scene semantic information remains unexplored in the context of ICA. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
[121] S2AP: Score-space Sharpness Minimization for Adversarial Pruning
Giorgio Piras,Qi Zhao,Fabio Brau,Maura Pintor,Christian Wressnegger,Battista Biggio
Main category: cs.CV
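TL;DR: Proposes S2AP, a new adversarial pruning method that stabilizes mask selection by minimizing sharpness in score space, thereby improving the robustness of pruned models.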
Details
Motivation: Score-space optimization in existing adversarial pruning methods tends to fall into sharp minima, which makes mask selection unstable and hurts model robustness.
Method: Introduces a score-space sharpness-aware mechanism that, during mask search, perturbs the importance scores and minimizes the corresponding robust loss to optimize mask selection.
Result: Experiments across multiple datasets, models, and sparsity levels show that S2AP reduces sharpness in score space, stabilizes mask selection, and improves the robustness of pruned models.
Conclusion: By minimizing score-space sharpness, S2AP significantly improves the stability and robustness of adversarial pruning methods and serves as an effective plug-in improvement.
Abstract: Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
[122] Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer's Disease Classification
F. Ahmadi,B. Bidabad,H. Nasiri
Main category: cs.CV
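TL;DR: This study proposes and validates a new cortical surface representation based on conformal geometry and entropy features for the automated diagnosis of Alzheimer's disease (AD).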
Details
Motivation: Alzheimer's disease is accompanied by significant cortical atrophy, so 3D shape analysis is diagnostically valuable. Existing methods still leave room for improvement in the robustness and discriminative power of their features, which calls for a more precise and stable local surface representation.
Method: T1-weighted MRI data from 160 ADNI participants (80 AD patients and 80 healthy controls) are used, and cortical surface models are reconstructed with Freesurfer. Ricci flow provides a conformal parameterization from which area distortion and conformal factor are extracted, while Gaussian curvature is computed directly from the mesh. Shannon entropy is applied to these three geometric features to produce compact, informative feature vectors, which are used to train and evaluate several classifiers (such as XGBoost, MLP, and logistic regression).
Result: The MLP and logistic regression classifiers perform best, both reaching 98.62% accuracy and F$_1$ score. A paired Welch's t-test shows that the performance differences between classifiers are statistically significant.
Conclusion: The entropy of conformal geometric features is a powerful and robust measure for cortical morphometry. The method achieves high accuracy in automated AD diagnosis and shows good prospects for clinical research.
Abstract: Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer's disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g., XGBoost, MLP, and Logistic Regression). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch's t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F$_1$ Score of 98.62%.
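The entropy step in the Methods above can be illustrated with a minimal sketch. It assumes each geometric attribute is available as a list of per-vertex values and that entropy is taken over a fixed-width histogram; the bin count, the function name `shannon_entropy`, and the binning scheme are illustrative assumptions, not details given in the abstract.

```python
import math
from typing import Sequence


def shannon_entropy(values: Sequence[float], bins: int = 16) -> float:
    """Shannon entropy (in bits) of a fixed-width histogram of `values`.

    Hypothetical helper: the binning scheme is an assumption for illustration.
    """
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values fall in one bin, so the distribution carries no uncertainty.
        return 0.0
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_feature_vector(area_distortion, conformal_factor, gaussian_curvature, bins=16):
    """One entropy value per geometric attribute, giving a 3-element descriptor."""
    return [
        shannon_entropy(f, bins)
        for f in (area_distortion, conformal_factor, gaussian_curvature)
    ]
```

Under these assumptions each surface is summarised by three numbers, which is what makes the resulting feature vectors compact enough for simple classifiers.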
Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method's potential to enhance the study and diagnosis of Alzheimer's disease, offering a straightforward yet powerful tool for clinical research applications.
[123] Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion
Linsong Shan,Zecan Yang,Laurence T. Yang,Changlong Li,Honglu Zhao,Xin Nie
Main category: cs.CV
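TL;DR: Proposes a hyperspectral-multispectral image fusion method based on a Bayesian Fully-Connected Tensor Network (BFCTN). By introducing a hierarchical sparse prior and variational Bayesian inference, it preserves spatial-spectral structure and reduces manual parameter tuning, with strong robustness and fusion accuracy.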
Details
Motivation: Existing tensor decomposition methods for image fusion disrupt the data structure, require extensive parameter tuning, and are weak against noise, which makes it hard to model cross-dimensional correlations and preserve physical consistency.
Method: Proposes BFCTN, which uses a Bayesian framework with a hierarchical sparse prior to connect the factor tensors, estimates parameters with VB and the EM algorithm, and explicitly models the physical coupling among spatial structures, spectral signatures, and local scene homogeneity.
Result: Experiments show that BFCTN outperforms existing methods in fusion accuracy and in robustness to noise and spatial degradation, reaches state-of-the-art performance, and is practical in complex real-world scenes.
Conclusion: Through probabilistic modeling, BFCTN significantly improves the performance and robustness of hyperspectral-multispectral image fusion and reduces dependence on manual parameter tuning, offering a reliable solution for practical use.
Abstract: Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior characterizing the sparsity of physical elements establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning.
Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.
[124] Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling
Mst Jannatun Ferdous,Masum Billah,Joy Karmoker,Mohd Ruhul Ameen,Akif Islam,Md. Omar Faruqe
Main category: cs.CV
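TL;DR: Proposes a deep-learning-based automated system for cricket video analysis that uses YOLOv8 and OCR to detect wicket-taking moments, detect the ball, and model its trajectory.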
Details
Motivation: To make tactical analysis of cricket matches more efficient, a system is needed that automatically identifies key events (such as wicket-taking deliveries) and extracts the relevant visual information.
Method: Uses the YOLOv8 architecture for pitch and ball detection, combined with OCR on the scorecard to determine wicket-taking moments. Image preprocessing, including grayscale transformation, power transformation, and morphological operations, makes text extraction more robust.
Result: The pitch detection model reaches 99.5% mAP50 with a precision of 0.999. The ball detection model, trained with transfer learning, reaches 99.18% mAP50 with 0.968 precision and 0.978 recall. The system supports trajectory modeling on detected pitches for analyzing batting weaknesses.
Conclusion: The method is validated on multiple match videos and provides a reliable approach to automated cricket analysis, with significant potential to support coaching and strategic decision-making.
Abstract: This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
[125] ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Zhiwei Hao,Jianyuan Guo,Li Shen,Kai Han,Yehui Tang,Han Hu,Yunhe Wang
Main category: cs.CV
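TL;DR: This paper proposes ScaleNet, a method for efficiently scaling up vision transformers (ViTs) by inserting new layers with layer-wise weight sharing, significantly improving performance while keeping the parameter count nearly unchanged.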
Details
Motivation: Scaling up existing ViT models usually requires training from scratch, which is computationally expensive and slow, so a more efficient scaling method is needed.
Method: Inserts new layers into a pretrained ViT and limits the added parameters through layer-wise weight sharing. A small set of learnable adjustment parameters, implemented as parallel adapter modules, mitigates the performance drop caused by shared weights.
Result: On ImageNet-1K, a 2x depth-scaled DeiT-Base model improves accuracy by 7.42% over training from scratch while needing only one-third of the training epochs. The method also transfers well to object detection.
Conclusion: ScaleNet offers an efficient, low-cost way to scale ViT models that improves performance across vision tasks and has broad application potential.
Abstract: Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs.
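To see why weight sharing keeps the parameter increase small, the bookkeeping can be sketched as below. This is only an accounting illustration under stated assumptions: the classes `SharedWeights` and `Layer`, the interleaved placement of added layers, and the choice to attach adapters only to added layers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SharedWeights:
    """A parameter tensor that several layers may reference (hypothetical)."""
    n_params: int


@dataclass
class Layer:
    weights: SharedWeights   # possibly shared with another layer
    adapter_params: int = 0  # small layer-specific adjustment parameters


def expand_depth(pretrained: List[Layer], adapter_params: int) -> List[Layer]:
    """Double the depth: after each pretrained layer, add one that reuses its weights.

    Assumption: added layers are interleaved and only they carry adapters.
    """
    expanded: List[Layer] = []
    for layer in pretrained:
        expanded.append(layer)
        expanded.append(Layer(weights=layer.weights, adapter_params=adapter_params))
    return expanded


def count_params(layers: List[Layer]) -> int:
    """Count each shared tensor once, then add all adapter parameters."""
    unique = {id(layer.weights): layer.weights.n_params for layer in layers}
    return sum(unique.values()) + sum(layer.adapter_params for layer in layers)
```

With illustrative numbers (12 layers of 7M parameters each and 50K adapter parameters per added layer), doubling the depth adds 0.6M parameters to an 84M base, which is the kind of negligible increase the abstract describes.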
Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by validation on an object detection task.
[126] ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
Yuanhe Guo,Linxi Xie,Zhuoran Chen,Kangrui Yu,Ryan Po,Guandao Yang,Gordon Wetzstein,Hongyi Wen
Main category: cs.CV
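TL;DR: This paper introduces ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. It addresses the lack of in-the-wild user preference annotations, is used to train better preference alignment models, and supports an end-to-end framework for editing diffusion models, enabling a new paradigm for generative model personalization.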
Details
Motivation: The development of personalized generative models is limited by the lack of fine-grained, in-the-wild user preference annotations, so a dataset with large-scale user interaction data is needed to advance the field.
Method: Collects real interaction data from 57K users, including 242K customized LoRAs, 3M text prompts, and 5M generated images. The data is used to train preference alignment models and to study personalized image retrieval and generative model recommendation, and an end-to-end framework is proposed for editing diffusion models in a latent weight space.
Result: Preference alignment models trained on ImageGem perform better, personalized retrieval and recommendation improve, and customized diffusion models are successfully edited to match individual preferences.
Conclusion: ImageGem supports, for the first time, a new research paradigm for generative model personalization and provides an important foundation for personalized generative models.
Abstract: We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
[127] Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
Ji Du,Xin Wang,Fangwei Hao,Mingyang Yu,Chunyuan Chen,Jiesheng Wu,Bin Wang,Jing Xu,Ping Li
Main category: cs.CV
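TL;DR: Proposes RISE, a retrieval-based self-augmented paradigm that uses the entire training dataset to generate pseudo-labels for training camouflaged object detection models without relying on manual annotations.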
Details
Motivation: Existing methods rely mainly on image-level modeling or on optimization strategies that need many annotations, so they make little use of dataset-level context and incur high annotation costs.
Method: Builds unsupervised prototype libraries for environments and camouflaged objects, uses a Clustering-then-Retrieval (CR) strategy to produce high-quality pseudo-masks, and proposes Multi-View KNN Retrieval (MVKR) to make the pseudo-labels more robust.
Result: Experiments show that RISE outperforms state-of-the-art unsupervised and prompt-based methods and generates high-quality pseudo-labels.
Conclusion: By exploiting dataset-wide context and self-augmented pseudo-label generation, RISE significantly improves camouflaged object detection and reduces dependence on manual annotation.
Abstract: At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice either hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE.
[128] LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance
Anna Oliveras,Roger Marí,Rafael Redondo,Oriol Guardià,Ana Tost,Bhalaji Nagarajan,Carolina Migliorelli,Vicent Ribas,Petia Radeva
Main category: cs.CV
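TL;DR: Proposes a latent diffusion model conditioned on 3D anatomical masks that generates high-quality 3D chest CT scans at 256x256x256 resolution with low computational cost, supporting diverse generation with or without lung nodules.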
Details
Motivation: Existing methods for generating 3D medical images are computationally expensive and lack precise control over anatomical structures, so an efficient, conditionable generative model is needed.
Method: Uses a latent diffusion model conditioned on 3D anatomical masks (covering lung and nodule regions) to generate 256x256x256 CT volumes at 1 mm isotropic resolution on a single mid-range GPU.
Result: The model efficiently generates high-quality CT volumes. Experiments show that conditioning on nodule masks alone leads to anatomical errors and that the global lung structure is required for accurate synthesis. The model can generate diverse CT images with nodules of varying attributes.
Conclusion: The method achieves precisely controllable 3D medical image generation at reduced computational cost and is valuable for AI training and medical applications.
Abstract: This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.
[129] Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Tianci Bi,Xiaoyi Zhang,Yan Lu,Nanning Zheng
Main category: cs.CV
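TL;DR: Proposes VFM-VAE, which integrates a vision foundation model directly into a diffusion model to avoid the semantic drift caused by distillation. Multi-scale feature fusion and progressive reconstruction enable high-quality image generation, with markedly faster training and better performance.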
Details
Motivation: Existing methods incorporate vision foundation models (VFMs) into latent diffusion models (LDMs) via distillation, which weakens the robustness of alignment with the original VFM and causes semantic deviation under distribution shifts.
Method: Proposes VFM-VAE, which bypasses distillation. The decoder is redesigned with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks to achieve high-quality reconstruction from coarse VFM features. The SE-CKNNA metric is introduced to analyze representation dynamics during diffusion training, and a joint tokenizer-diffusion alignment strategy is designed.
Result: Reaches a gFID (w/o CFG) of 2.20 within 80 epochs, a 10x speedup over prior methods, and further improves to 1.62 after training for 640 epochs.
Conclusion: Direct VFM integration outperforms distillation. VFM-VAE delivers significant gains in both performance and efficiency and offers a new paradigm for LDMs.
Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the proposed SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
[130] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
Jinfeng Liu,Lingtong Kong,Mi Zhou,Jinwen Chen,Dan Xu
Main category: cs.CV
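TL;DR: This paper presents Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos.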
Details
Motivation: Existing methods cannot reconstruct 4D HDR scenes from alternating-exposure monocular LDR videos, and related research and datasets are lacking, so a new method is needed that reconstructs high-quality HDR video efficiently without requiring camera pose estimation.
Method: Proposes a two-stage optimization framework based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, avoiding dependence on camera poses. The second stage transforms the Gaussians into world space and jointly optimizes the world Gaussians and camera poses, with a temporal luminance regularization strategy that improves the temporal consistency of HDR appearance.
Result: Experiments show that Mono4DGS-HDR significantly outperforms adapted versions of state-of-the-art methods in rendering quality and speed, and a new evaluation benchmark for HDR video reconstruction is constructed.
Conclusion: Mono4DGS-HDR achieves high-quality reconstruction of 4D HDR scenes from unposed monocular LDR videos, providing an effective solution and a new benchmark for the field.
Abstract: We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with a two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.
[131] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Wei-Chia Chang,Yan-Ann Chen
Main category: cs.CV
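TL;DR: Proposes a zero-shot vehicle make and model recognition method that combines vision language models (VLMs) with Retrieval-Augmented Generation (RAG), using text-based reasoning to avoid large-scale retraining and improving recognition by nearly 20%.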
Details
Motivation: Existing vehicle make and model recognition methods struggle to adapt to newly released models, and pretrained models such as CLIP are limited by their fixed weights and require costly finetuning.
Method: A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined into a prompt, and a language model infers the vehicle's make and model.
Result: Recognition accuracy improves by nearly 20% over the CLIP baseline, and new vehicle models can be added quickly without retraining.
Conclusion: Through RAG-enhanced language model reasoning, the method achieves scalable zero-shot vehicle recognition suited to intelligent transportation systems in smart-city applications.
Abstract: Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
[132] DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
Suman Kunwar
Main category: cs.CV
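TL;DR: This paper presents DWaste, a computer vision platform for real-time waste sorting on resource-constrained smartphones and edge devices, with offline support. A comparison of several image classification and object detection models shows that lightweight object detectors (such as the YOLO series) strike a good balance among accuracy, speed, and energy use, and that model quantization further improves efficiency, enabling sustainable waste sorting driven by "Greener AI".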
Details
Motivation: As convenience packaging spreads, waste generation has surged, and efficient waste sorting has become essential to sustainable waste management. Existing systems often depend on cloud computing and cannot run in resource-constrained or offline settings, so an intelligent sorting solution that runs in real time at low power on edge devices is needed.
Method: Develops the DWaste platform and benchmarks image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n) on a subset of a self-built waste dataset annotated with the custom tool Annotated Lab. The models are quantized to reduce resource consumption.
Result: EfficientNetV2S achieves the highest classification accuracy (~96%) but has higher latency (~0.22 s) and high carbon emissions. Lightweight object detection models perform well (up to 77% mAP), infer quickly (~0.03 s), and are small (< 7 MB), which suits real-time, low-power use. Quantization reduces model size and VRAM usage by up to 75%.
Conclusion: Lightweight object detection models combined with quantization enable efficient, real-time waste sorting on edge devices, confirming the feasibility and benefits of "Greener AI" for sustainable waste management.
Abstract: The rise of convenience packaging has led to the generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n) using a subset of our own waste data set and annotated it using the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~96%) but suffered from high latency (~0.22 s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~0.03 s) and significantly smaller model sizes (< 7 MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.
[133] RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
Junwen Huang,Shishir Reddy Vutukur,Peter KT Yu,Nassir Navab,Slobodan Ilic,Benjamin Busam
Main category: cs.CV
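TL;DR: Proposes a diffusion-transformer-based template alignment method that improves object pose estimation for unposed images by learning to align template viewing directions with the query image.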
Details
Motivation: Traditional template-based methods produce inaccurate pose predictions when the correct template is not retrieved, so a more robust matching and alignment mechanism is needed.
Method: Reformulates object pose estimation as a ray alignment problem and uses a diffusion transformer architecture to align the query image with multiple posed template images. Rotation is reparameterized with object-centered camera rays, and scale-invariant translation estimation is extended to model dense translation offsets.
Result: Extensive experiments on multiple benchmark datasets show that the method is competitive on unseen object pose estimation.
Conclusion: Guided by geometric priors and a coarse-to-fine training strategy, the proposed method improves pose estimation accuracy without changing the network architecture.
Abstract: Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.
[134] GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization
Dušan Malić,Christian Fruhwirth-Reisinger,Alexander Prutsch,Wei Lin,Samuel Schulter,Horst Possegger
Main category: cs.CV
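TL;DR: This paper proposes a 3D object detection method based on GBlobs, a local point cloud feature descriptor, which addresses the geometric shortcut that arises when conventional LiDAR detectors rely on absolute coordinates and significantly improves generalization across sensor placements.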
Details
Motivation: Existing LiDAR-based 3D detectors are trained on global coordinate features, which makes models over-rely on the absolute position of objects (a geometric shortcut) and degrades generalization when sensor placement changes the point cloud distribution.
Method: Uses GBlobs as a local point cloud feature descriptor and as the network input, avoiding global features such as absolute Cartesian coordinates and forcing the network to learn more robust, object-centric representations.
Result: The method achieves state-of-the-art 3D object detection performance in Track 3 of the RoboSense 2025 challenge and generalizes well across sensor configurations.
Conclusion: Introducing the GBlobs descriptor mitigates the geometric shortcut, strengthens the learning of shape and appearance features, and significantly improves 3D detection generalization across sensor placements.
Abstract: This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a "geometric shortcut" when trained on conventional global features (i.e., absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model's ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
[135] Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Sanjay Kumar,Tim Brophy,Reenu Mohandas,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
Main category: cs.CV
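TL;DR: Introduces the Occluded nuScenes dataset, an extension of the nuScenes benchmark that supports robustness evaluation of multi-sensor perception models under controlled, reproducible conditions.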
Details
Motivation: Existing autonomous driving datasets lack support for controlled, parameterised, and reproducible degradations across multiple sensing modalities, which limits systematic evaluation of perception and fusion architectures under adverse conditions.
Method: For the camera modality, four types of occlusion are provided (two adapted from public implementations, two newly designed); for radar and LiDAR, parameterised occlusion scripts are developed, each implementing three types of degradation, to generate occluded data flexibly.
Result: Full and mini versions of the Occluded nuScenes dataset are released, supporting reproducible evaluation under partial sensor failures and environmental interference across sensing modalities.
Conclusion: The dataset is a resource for research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
Abstract: Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
[136] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
Zhenxing Zhang,Jiayan Teng,Zhuoyi Yang,Tiankun Cao,Cheng Wang,Xiaotao Gu,Jie Tang,Dan Guo,Meng Wang
Main category: cs.CV
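TL;DR: Proposes the Kaleido framework, which improves data construction and introduces Reference Rotary Positional Encoding (R-RoPE) to achieve highly consistent subject-to-video generation under multi-subject, multi-image conditioning.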
Details
Motivation: Existing S2V methods fall short on multi-subject consistency and background disentanglement, leading to low reference fidelity and semantic drift, mainly because the training data lack diversity and the mechanism for fusing multiple images is suboptimal.
Method: Designs a dedicated data construction pipeline that filters low-quality samples and synthesizes diverse data, and proposes Reference Rotary Positional Encoding (R-RoPE) to integrate multiple reference images stably and precisely.
Result: Experiments on multiple benchmarks show that Kaleido significantly outperforms prior methods in consistency, fidelity, and generalization.
Conclusion: Through high-quality data construction and a new multi-image integration mechanism, Kaleido improves multi-subject S2V generation.
Abstract: We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
[137] CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Yongmin Lee,Hye Won Chung
Main category: cs.CV
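TL;DR: Proposes CovMatch, a scalable dataset distillation framework that jointly optimizes the image and text encoders for stronger cross-modal alignment, substantially improving retrieval accuracy on Flickr30K and COCO with only 500 synthetic pairs.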
Details
Motivation: Existing methods freeze the text encoder, which limits semantic alignment and caps the performance of multimodal contrastive learning.
Method: Proposes CovMatch, which matches the cross-covariance of real and synthetic features and regularizes the feature distribution within each modality, enabling joint optimization of both encoders.
Result: On Flickr30K and COCO, with only 500 synthetic image-text pairs, it improves retrieval accuracy by up to 6.8% over state-of-the-art methods.
Conclusion: CovMatch addresses the challenges of cross-modal alignment and computational cost in multimodal dataset distillation, enabling more efficient training of vision-language models.
Abstract: Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
[138] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Zhangquan Chen,Manyuan Zhang,Xinlei Yu,Xufang Luo,Mingze Sun,Zihao Pan,Yan Feng,Peng Pei,Xunliang Cai,Ruqi Huang
Main category: cs.CV
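TL;DR: Proposes 3DThinker, a vision-language model framework that performs 3D mental simulation during reasoning without 3D prior inputs or labeled 3D data; two-stage training markedly improves 3D spatial understanding in multimodal tasks.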
Details
Motivation: Existing methods rely on pure text or 2D visual cues, whose limited representational capacity makes tasks requiring 3D spatial imagination hard, so a new approach is needed that exploits the geometric information in images for 3D reasoning.
Method: Proposes the 3DThinker framework with two training stages: supervised training first aligns the 3D latent generated by the VLM with that of a 3D foundation model; the entire reasoning trajectory is then optimized on outcome signals alone to refine the 3D mental simulation.
Result: Experiments on multiple benchmarks show that 3DThinker consistently outperforms strong baselines on 3D spatial understanding tasks.
Conclusion: 3DThinker is the first to enable 3D mental simulation without 3D prior inputs, offering a new perspective on integrating 3D representations into multimodal reasoning.
Abstract: Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
[139] C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression
Baptiste Bauvin,Loïc Baret,Ola Ahmad
Main category: cs.CV
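TL;DR: Proposes a one-shot pruning framework based on explainable deep learning; its causal-aware pruning reduces model size while preserving performance, with no fine-tuning required.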
Details
Motivation: Conventional structured pruning is computationally expensive, and one-shot pruning causes a considerable drop in performance; the paper aims for a one-shot pruning method that is both efficient and accurate.
Method: Proposes a causal-aware progressive pruning approach that uses cause-effect relations between model predictions and network structures to identify and remove structures with little impact on performance, enabling one-shot pruning without fine-tuning.
Result: Experiments on convolutional neural networks and vision transformers show that the method substantially compresses models while maintaining performance, without fine-tuning, and outperforms existing methods.
Conclusion: The method achieves the best trade-off between compression and performance across several baseline models, offering an effective solution for one-shot structured pruning of neural networks.
Abstract: Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that promotes sparsity in model structures, e.g. weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods consider the one-shot setting, which applies pruning directly post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not degrade the performance of the model. Then, through experiments conducted on convolutional neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.
[140] ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data
Sheida Rahnamai Kordasiabi,Damian Dalle Nogare,Florian Jug
Main category: cs.CV
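TL;DR: Proposes ε-Seg, a semantic segmentation method based on a hierarchical variational autoencoder (HVAE) that combines center-region masking, sparse-label contrastive learning, a Gaussian mixture model prior, and clustering-free label prediction to segment complex biological electron microscopy images under extremely sparse labels.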
Details
Motivation: Semantic segmentation of electron microscopy (EM) images remains challenging in the life sciences, especially when the data are structurally complex and annotations are scarce, so existing methods struggle to learn useful representations.
Method: Uses a hierarchical variational autoencoder (HVAE) with center-region masking and an inpainting loss to strengthen feature learning, shapes the latent space with contrastive learning and a GMM prior, and predicts class labels directly from latent embeddings with an MLP head, avoiding a clustering step.
Result: ε-Seg is evaluated on two dense EM datasets of biological tissue; it achieves competitive segmentation with extremely few labels (≤0.05%) and also applies to fluorescence microscopy data.
Conclusion: By jointly optimizing the latent representation and the classification head, ε-Seg achieves efficient, accurate semantic segmentation of biological images under extremely sparse supervision, offering a solution for biological image analysis with low annotation cost.
Abstract: Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce ε-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster with respect to the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of ε-Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that ε-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.
[141] Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression
Kyo Kuroki,Yasuyuki Okoshi,Thiem Van Chu,Kazushi Kawamura,Masato Motomura
Main category: cs.CV
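TL;DR: Proposes Binary Quadratic Quantization (BQQ), a matrix quantization method that uses binary quadratic expressions to increase expressive power while keeping an extremely compact data format; it outperforms conventional methods in both matrix compression and post-training quantization of Vision Transformers.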
Details
Motivation: Conventional quantization based on linear combinations of binary bases has limited expressive power and struggles to keep reconstruction error low at high compression rates, so a more efficient matrix approximation is needed.
Method: Proposes Binary Quadratic Quantization (BQQ), which approximates real-valued matrices with quadratic combinations of binary matrices, increasing expressive power while keeping storage compact.
Result: On a matrix compression benchmark and post-training quantization of Vision Transformers, BQQ achieves a better trade-off between memory efficiency and reconstruction error; on ImageNet, it improves on the state-of-the-art PTQ method by up to 2.2% in the calibration-based scenario and 59.1% in the data-free scenario (quantization equivalent to 2 bits).
Conclusion: Binary quadratic expressions show considerable potential for efficient matrix approximation and neural network compression, and BQQ offers a new direction for accurate low-bit quantization.
Abstract: This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a better trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
[142] Image augmentation with invertible networks in interactive satellite image change detection
Hichem Sahbi
Main category: cs.CV
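TL;DR: Proposes an interactive satellite image change detection algorithm based on active learning, which augments displayed data with an invertible network to improve detection performance.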
Details
Motivation: To improve the accuracy of satellite image change detection and reduce reliance on large amounts of labeled data, the work uses the human-in-the-loop interaction of an active learning framework to optimize model training.
Method: Uses an iterative active learning framework built on a question-and-answer model and introduces a novel invertible network that maps displays from the nonlinear input space to a latent space where augmentation becomes linear, then maps the augmented data back to the input space to retrain the change detection model.
Result: Experiments show that the proposed method outperforms related work in change detection performance.
Conclusion: By combining data augmentation through an invertible network with active learning, the method improves the accuracy and efficiency of satellite image change detection.
Abstract: This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed the display), and based on the oracle's responses, the change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
[143] Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification
Hanif Rasyidi,Moshiur Farazi
Main category: cs.CV
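TL;DR: Studies the factors that affect end-to-end deep learning for historical writer identification (HWI), finding that most models generalize poorly in realistic and zero-shot settings, mainly because of weak low-level visual feature extraction, inconsistent patch representations, and sensitivity to content noise; one end-to-end configuration with a simpler design nonetheless performs on par with the top-performing system.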
Details
Motivation: Historical writer identification is challenging because handwriting styles are diverse, documents are heavily degraded, and labeled samples per writer are limited; traditional methods rely on handcrafted features and are confined to small datasets, while end-to-end methods must overcome generalization problems in practical use.
Method: Explores combinations of pre-processing methods, backbone architectures, and post-processing strategies (such as text segmentation, patch sampling, and feature aggregation), and evaluates them in document-level, zero-shot settings.
Result: Most configurations perform poorly, owing to weak capture of low-level features, inconsistent patch representations, and sensitivity to content noise; one simple end-to-end setup, however, achieves results comparable to the top-performing system.
Conclusion: Building robust end-to-end HWI systems still faces key challenges; careful choices of pre-processing, network architecture, and feature aggregation can improve performance and inform future work.
Abstract: This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
[144] See the Text: From Tokenization to Visual Reading
Ling Xing,Alex Jinpeng Wang,Rui Yan,Hongyu Qu,Zechao Li,Jinhui Tang
Main category: cs.CV
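TL;DR: Proposes SeeTok, which renders text as images and interprets them with multimodal LLMs, challenging the subword tokenization paradigm and yielding language processing that is more efficient, more robust, and closer to human visual reading.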
Details
Motivation: Modern LLMs rely on subword tokenization, which over-segments low-resource languages, inflating computation and producing sequences with little linguistic meaning. Inspired by how humans recognize text visually, the authors explore a more natural and efficient way to process text.
Method: Renders text as images (visual-text) and processes them with pretrained multimodal LLMs, reusing the OCR and text-vision alignment abilities learned from large-scale multimodal training.
Result: Across three different language tasks, SeeTok matches or surpasses subword tokenizers while using 4.43 times fewer tokens and 70.5% fewer FLOPs, with better cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy.
Conclusion: SeeTok marks a shift from symbolic tokenization to human-like visual reading, a step toward more natural, cognitively inspired language models.
Abstract: People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
[145] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Weinan Jia,Yuning Lu,Mengqi Huang,Hualiang Wang,Binyuan Huang,Nan Chen,Mu Liu,Jidong Jiang,Zhendong Mao
Main category: cs.CV
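TL;DR: Proposes Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that addresses the high cost of full attention in long video generation. MoGA uses a lightweight, learnable token router for semantic-aware, precise token matching without blockwise estimation, supports long-range interactions, and is compatible with modern attention stacks. A model built on MoGA generates minute-level, multi-shot, 480p videos at 24 fps end to end, with a context length of about 580k.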
Details
Motivation: Full attention in Transformers scales quadratically with sequence length, limiting its use in long video generation; existing sparse attention methods rely on coarse blockwise estimation, whose accuracy and efficiency are constrained by block size.
Method: Proposes Mixture-of-Groups Attention (MoGA), which assigns sparse attention through a lightweight, learnable token router with semantic awareness, avoids blockwise estimation, improves matching precision while staying efficient, and is compatible with FlashAttention and sequence parallelism.
Result: MoGA improves the efficiency and quality of long video generation, enables training with a context length of about 580k, and generates minute-level, multi-shot, 480p, 24 fps videos end to end; its effectiveness is validated on various video generation tasks.
Conclusion: MoGA is an efficient, flexible, and scalable sparse attention mechanism that overcomes the limitations of earlier sparse methods and shows strong performance and practical potential for long video generation.
Abstract: Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
[146] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang,Yuhao Wang,Tao Zhang,Yikang Zhou,Yanwei Li,Jiacong Wang,Ye Tian,Jiahao Meng,Zilong Huang,Guangcan Mai,Anran Wang,Yunhai Tong,Zhuochen Wang,Xiangtai Li,Zhaoxiang Zhang
Main category: cs.CV
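TL;DR: Proposes Grasp Any Region (GAR), a region-level visual understanding method that uses RoI-aligned feature replay together with global context and multi-prompt interaction for fine-grained perception and complex reasoning, and builds GAR-Bench to evaluate single-region and multi-region understanding. Experiments show that GAR performs strongly on multiple benchmarks and transfers to video tasks.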
Details
Motivation: Existing multimodal LLMs struggle with fine-grained analysis in complex scenes, and region-level MLLMs typically understand regions in isolation, ignoring global context and failing to model relationships among multiple regions or complex reasoning.
Method: Proposes the GAR framework, which uses RoI-aligned feature replay to incorporate global context for more precise perception and models interactions among multiple prompts, supporting free-form question answering about any region and shifting from passive description to active dialogue.
Result: GAR-1B outperforms DAM-3B by 4.5 points on DLC-Bench and surpasses InternVL3-78B on GAR-Bench-VQA; zero-shot GAR-8B outperforms the in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating strong generalization and transfer.
Conclusion: By introducing global context and multi-prompt interaction, GAR improves fine-grained perception and complex reasoning in region-level visual understanding.
Abstract: While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
[147] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Yibin Wang,Zhimin Li,Yuhang Zang,Jiazi Bu,Yujie Zhou,Yi Xin,Junjun He,Chunyu Wang,Qinglin Lu,Cheng Jin,Jiaqi Wang
Main category: cs.CV
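TL;DR: Introduces UniGenBench++, a unified semantic evaluation benchmark for text-to-image generation that covers diverse scenarios, multiple languages, and fine-grained evaluation dimensions, with an evaluation pipeline built on a multimodal LLM.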
Details
Motivation: Existing T2I benchmarks lack diverse prompt scenarios, multilingual support, and fine-grained sub-dimension assessment, and so fall short of real-world needs.
Method: Builds a benchmark of 600 hierarchically organized prompts covering 5 main themes and 20 subthemes, with 10 primary and 27 sub evaluation criteria; provides each prompt in English and Chinese, in short and long forms; builds the evaluation pipeline with Gemini-2.5-Pro and trains a robust evaluation model for offline assessment.
Result: Systematically evaluates a range of T2I models across different aspects, validating the benchmark's comprehensiveness and reliability, and releases an evaluation model for offline use by the community.
Conclusion: UniGenBench++ broadens and deepens the evaluation of semantic consistency in T2I generation models and is practical and extensible.
Abstract: Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
[148] Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin,Alex Jinpeng Wang,Linjie Li,Zhengyuan Yang,Mike Zheng Shou
Main category: cs.CV
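TL;DR: Proposes Vision-Centric Contrastive Learning (VC2L), a unified framework that renders text, images, and their combinations as pixel images and models them with a single vision transformer, with no OCR or text tokenization, improving understanding of cross-modal relationships in complex web documents.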
Details
Motivation: Existing contrastive vision-language models (such as CLIP) are limited when handling real-world web documents in which text and images are interleaved, loosely aligned, or embedded in visual form, and struggle to capture complex cross-modal relationships.
Method: VC2L renders all inputs (text, images, or combinations) as images and operates entirely in pixel space, using a snippet-level contrastive objective that aligns consecutive multimodal segments and exploits the inherent coherence of documents, without explicitly paired image-text data.
Result: On the three proposed retrieval benchmarks (AnyCIR, SeqCIR, and CSR) and on established datasets such as M-BEIR and MTEB, VC2L performs competitively with or better than CLIP-style models, supporting its effectiveness and generalization in cross-modal retrieval and sequential understanding.
Conclusion: VC2L demonstrates a scalable, vision-centric approach to multimodal representation learning and the potential of real-world web documents for contrastive learning, while avoiding the complex processing of conventional multimodal fusion.
Abstract: Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
[149] A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Peiqin Zhuang,Lei Bai,Yichao Wu,Ding Liang,Luping Zhou,Yali Wang,Wanli Ouyang
Main category: cs.CV
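TL;DR: Proposes EMIM, an explicit motion information mining module that combines the cost volume of traditional action recognition with Transformers, improving action recognition on motion-sensitive data.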
Details
Motivation: Existing Transformer-based action recognition methods perform poorly on motion-sensitive datasets because they lack elaborate motion modeling.
Method: Proposes the EMIM module, which builds the affinity matrix of self-attention in a cost volume style, sampling key candidate tokens in a sliding-window manner from the query-based neighboring area in the next frame, and uses the matrix for both appearance modeling and motion feature extraction.
Result: The method is validated on four widely used datasets and outperforms state-of-the-art approaches, especially on motion-sensitive datasets such as Something-Something V1 & V2.
Conclusion: EMIM combines the motion modeling of cost volumes with the contextual aggregation of Transformers, improving action recognition in motion-sensitive scenarios.
Abstract: Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 & V2.
[150] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting
Changkun Liu,Bin Tan,Zeran Ke,Shangzhan Zhang,Jiachen Liu,Ming Qian,Nan Xue,Yujun Shen,Tristan Braud
Main category: cs.CV
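TL;DR: Proposes PLANA3R, a pose-free planar 3D reconstruction framework that uses Vision Transformers to extract sparse planar primitives from unposed two-view images and supervises geometry learning through planar splatting, achieving metric 3D reconstruction of indoor scenes.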
Details
Motivation: Exploits the inherent geometric regularities of indoor scenes to remove the need for 3D plane annotations in prior methods, enabling scalable training on large-scale stereo datasets.
Method: Uses Vision Transformers to extract sparse planar primitives and estimate relative camera poses, and supervises geometry learning via planar splatting, propagating gradients through high-resolution rendered depth and normal maps.
Result: PLANA3R is validated on multiple indoor-scene datasets, generalizes well to out-of-domain environments, performs strongly on 3D surface reconstruction, depth estimation, and relative pose estimation, and segments planes accurately.
Conclusion: PLANA3R learns planar 3D structure without explicit plane supervision, is suited to large-scale training, and performs well across metric evaluation tasks with good generalization.
Abstract: This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating reconstruction with a planar 3D representation, our method gains an emergent ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r
[151] SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
Siyong Jian,Huan Wang
Main category: cs.CV
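TL;DR: Proposes a new KV cache compression framework that adaptively separates attention heads into two types to cut memory use and speed up generation, achieving a 5x memory reduction and a 6.6x throughput speedup in image generation.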
Details
Motivation: Autoregressive image generation models incur high memory and compute costs because of the large number of visual tokens, and KV cache compression remains largely unexplored for image generation.
Method: Building on the observed "spatial locality" and "semantic sink" phenomena, attention heads are adaptively decoupled into two types: spatial-locality heads keep a short window of recent tokens, while semantic-sink heads keep a small set of highly attended tokens, thereby compressing the KV cache.
Result: A 5x reduction in memory usage and a 6.6x increase in overall throughput, with only minor loss of visual quality.
Conclusion: The method substantially improves the efficiency of autoregressive image generation and supports efficient native image generation on resource-constrained hardware.
Abstract: Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a notable 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
[152] IF-VidCap: Can Video Caption Models Follow Instructions?
Shihao Li,Yuanxing Zhang,Jiangtao Wu,Zhide Lei,Yiwen He,Runzhe Wen,Chenxi Liao,Chengkang Jiang,An Ping,Shuo Gao,Suhan Wang,Zhaozhou Bian,Zijun Zhou,Jingyi Xie,Jiayi Zhou,Jing Wang,Yifan Yao,Weihao Xie,Yingshui Tan,Yanghai Wang,Qianqian Xie,Zhaoxiang Zhang,Jiaheng Liu
Main category: cs.CV
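TL;DR: This paper introduces IF-VidCap, a new benchmark for evaluating controllable video captioning that emphasizes format and content correctness, filling the gap left by existing benchmarks in assessing instruction following.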
Details
Motivation: Existing video captioning benchmarks mainly assess descriptive comprehensiveness and overlook the ability to follow user instructions, which limits practical use; a new evaluation framework is needed to measure controllable captioning.
Method: Builds IF-VidCap, a new benchmark of 1,400 high-quality samples, with a systematic framework that evaluates generated captions on two dimensions: format correctness and content correctness.
Result: An evaluation of more than 20 prominent models shows that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions; proprietary models still lead, but top open-source models are close to parity.
Conclusion: Future video captioning research should improve descriptive richness and instruction-following accuracy together to better meet practical needs.
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
[153] Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting
Hao Wang,Ying Zhou,Haoyu Zhao,Rui Wang,Qiang Hu,Xing Zhang,Qiang Li,Zhiwei Wang
Main category: cs.CV
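TL;DR: This paper proposes ColIAGS, an improved 3D Gaussian Splatting framework tailored to colonoscopy. By adding illumination attenuation modeling and improved geometry modeling, it addresses the drop in reconstruction quality of conventional 3DGS under dynamic lighting and performs better in both novel view synthesis and geometric reconstruction.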
Details
Motivation: Conventional 3D Gaussian Splatting (3DGS) assumes static illumination and appearance that depends only on viewing angle, so it cannot handle the illumination differences caused by the moving light source and camera in colonoscopy; it introduces structure-violating vaporous Gaussian blobs that degrade reconstruction quality.
Method: The ColIAGS framework combines Improved Appearance Modeling (two types of illumination attenuation factors) with Improved Geometry Modeling (a high-dimensional view embedding, plus a cosine embedding that implicitly generates illumination attenuation solutions) to adapt to illumination changes while preserving geometric accuracy.
Result: Experiments on standard benchmarks show that ColIAGS surpasses existing methods in rendering fidelity and significantly reduces Depth MSE, achieving high-quality novel view synthesis and accurate geometric reconstruction.
Conclusion: By jointly improving appearance and geometry modeling, ColIAGS handles the challenges of dynamic illumination in colonoscopic scenes and provides more reliable 3D reconstruction for applications such as virtual colonoscopy and lesion tracking.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characters of light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner.
Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.
[154] SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery
Zhenqi He,Yuanpei Liu,Kai Han
Main category: cs.CV
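TL;DR: This paper proposes SEAL, a semantic-aware hierarchical learning framework for Generalized Category Discovery (GCD) that exploits naturally occurring hierarchies; it achieves state-of-the-art performance on fine-grained benchmarks and generalizes to coarse-grained datasets.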
Details
Motivation: Existing methods depend on either single-level semantics or manually designed abstract hierarchies, which limits their generalizability and scalability; a more flexible, scalable GCD solution is needed.
Method: SEAL comprises a Hierarchical Semantic-Guided Soft Contrastive Learning approach and a Cross-Granularity Consistency (CGC) module, using natural hierarchies to generate soft negatives and to align predictions across granularity levels.
Result: SEAL achieves state-of-the-art performance on fine-grained benchmarks such as SSB, Oxford-Pet, and Herbarium19, and generalizes well to coarse-grained datasets.
Conclusion: By introducing natural hierarchies and a cross-granularity consistency mechanism, SEAL improves both performance and generalization on GCD and scales well.
Abstract: This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/
[155] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction
Jannis Fleckenstein,David Kreismann,Tamara Rosemary Govindasamy,Thomas Brunschwiler,Etienne Vos,Mattia Rigotti
Main category: cs.CV
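TL;DR: This study uses a geospatial foundation model that, with minimal fine-tuning, accurately predicts land surface temperatures under urban heat island effects, and demonstrates its potential to support climate-adaptive urban planning in data-scarce regions.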
Details
Motivation: Urbanization and climate change are intensifying urban heat islands, and conventional machine learning models predict poorly where data are scarce, so more reliable methods are needed to support effective mitigation planning.
Method: Uses a geospatial foundation model trained on global unstructured data; evaluates against empirical measurements how well the model predicts the cooling effect of green spaces; fine-tunes the model to predict land surface temperatures under future climate scenarios; and demonstrates its practical value through a simulated inpainting.
Result: With minimal fine-tuning, the foundation model accurately predicts the urban thermal environment and identifies the cooling effect of green spaces in simulation, showing strong potential for evaluating heat island mitigation strategies in data-scarce regions.
Conclusion: Geospatial foundation models give data-scarce urban areas an effective tool for evaluating and optimizing heat island mitigation measures, helping to build more climate-resilient cities.
Abstract: As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model's accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.
[156] UltraGen: High-Resolution Video Generation with Hierarchical Attention
Teng Hu,Jiangning Zhang,Zihan Su,Ran Yi
Main category: cs.CV
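TL;DR: This paper presents UltraGen, an efficient end-to-end framework for native high-resolution video generation. Using global-local attention decomposition and a spatial compression strategy, it scales pre-trained low-resolution models to 1080P and even 4K resolution for the first time.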
Details
Motivation: Because attention has quadratic computational complexity, existing diffusion-transformer video generation models struggle to produce video above 720P, which limits their use in high-quality content creation.
Method: The UltraGen framework uses a hierarchical dual-branch attention architecture that decouples full attention into a local attention branch (for high-fidelity regional content) and a global attention branch (for semantic consistency); it adds a spatially compressed global modeling strategy and a cross-window local attention mechanism to reduce computational cost.
Result: Experiments show that UltraGen scales pre-trained low-resolution models to 1080P and 4K resolution, outperforming existing state-of-the-art methods and two-stage super-resolution pipelines in both qualitative and quantitative evaluations.
Conclusion: UltraGen removes the computational bottleneck in high-resolution video generation and achieves efficient, end-to-end native high-resolution video synthesis, advancing video generation for high-definition use cases.
Abstract: Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows.
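To make the quadratic bottleneck concrete, the following minimal sketch counts the query-key pairs scored by full attention and by a decomposition into windowed local attention plus spatially compressed global attention. The window size, the compression factor, the token-grid sizes, and all function names are illustrative assumptions, not values or APIs from UltraGen.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def full_cost(frames: int, height: int, width: int) -> int:
    """Full attention over every token of a frames x height x width grid."""
    return attention_pairs(frames * height * width)


def decomposed_cost(frames: int, height: int, width: int,
                    window: int, compress: int) -> int:
    """Local attention inside spatial windows plus global attention
    over a spatially downsampled grid (both settings are hypothetical)."""
    if height % window or width % window:
        raise ValueError("window must divide height and width")
    if height % compress or width % compress:
        raise ValueError("compress must divide height and width")
    num_windows = (height // window) * (width // window)
    local = num_windows * attention_pairs(frames * window * window)
    global_tokens = frames * (height // compress) * (width // compress)
    return local + attention_pairs(global_tokens)


# Doubling height and width multiplies the full-attention cost by 16.
small_full = full_cost(4, 16, 16)
large_full = full_cost(4, 32, 32)
small_split = decomposed_cost(4, 16, 16, window=4, compress=4)
large_split = decomposed_cost(4, 32, 32, window=4, compress=4)
```

In this toy setting the local term grows linearly with the number of windows, while the global term is still quadratic but acts on far fewer tokens, which is why the decomposed total grows much more slowly than the full-attention count.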
Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
[157] Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
Wenping Jin,Yuyang Tang,Li Zhu,Fei Guo
Main category: cs.CV
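TL;DR: Proposes a "Rebellious Student" framework in which a spectral enhancement network and a spatial network learn complementary features, enabling efficient hyperspectral anomaly detection without per-scene retraining or parameter tuning.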
Details
Motivation: Existing methods need retraining or parameter tuning when deployed across scenes, which limits their practicality; the goal is an efficient anomaly detection model that deploys universally without scene-specific optimization.
Method: A two-stage learning strategy: first, a spectral enhancement network is trained via reverse distillation to serve as the teacher; then a spatial branch (the rebellious student) is trained with decorrelation losses that make its features orthogonal to the teacher's, so that it learns complementary spatial patterns the teacher fails to capture.
Result: Substantially outperforms several baselines on the HAD100 benchmark with little computational overhead, and shows good generalization and robustness.
Conclusion: The "Rebellious Student" framework fuses spectral and spatial cues effectively and enables anomaly detection without parameter tuning, validating the complementary learning paradigm.
Abstract: A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
[158] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu,Kaicheng Yang,Ziyong Feng,Qi Ming,Zonghao Guo,Xiang An,Ziyong Feng,Junchi Yan,Xue Yang
Main category: cs.CV
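TL;DR: Proposes ProCLIP, a curriculum learning-based progressive vision-language alignment framework that aligns an LLM-based embedder with the CLIP image encoder through knowledge distillation and contrastive tuning, addressing long-text, multilingual, and fine-grained semantic understanding.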
Details
Motivation: The original CLIP text encoder is limited to 77 input tokens and does not support multilingual input, which restricts its use across tasks; existing methods replace it directly with an LLM but lack alignment priors, which disrupts the original vision-language alignment.
Method: First, knowledge from the CLIP text encoder is distilled into the LLM-based embedder to establish an initial alignment; then image-text contrastive tuning with self-distillation regularization aligns the two further, with an instance semantic alignment loss and an embedding structure alignment loss applied during representation inheritance and contrastive tuning.
Result: Aligns the CLIP image encoder with the LLM-based embedder, improving the handling of long texts, multilingual input, and fine-grained semantics while retaining pretrained knowledge.
Conclusion: Through progressive alignment, ProCLIP avoids the disruption caused by directly replacing the text encoder and strengthens text processing while keeping the CLIP image encoder effective.
Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting.
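The two stages described above can be sketched with generic stand-in objectives. The abstract does not give the exact loss forms, so this example assumes a cosine-distance distillation loss for the first stage and a symmetric InfoNCE loss for the second; the self-distillation regularizer and the two alignment losses are omitted, and every name and the temperature value are hypothetical.

```python
import math


def normalize(vec):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def distillation_loss(student_embs, teacher_embs):
    """Stage 1 (assumed form): mean cosine distance between the LLM-embedder
    outputs (student) and the CLIP text-encoder outputs (teacher)."""
    pairs = zip(student_embs, teacher_embs)
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(student_embs)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Stage 2 (assumed form): symmetric InfoNCE over a batch in which
    image k and text k are the matched pair."""
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs]
              for i in image_embs]

    def diagonal_cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[k]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


teacher = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

A student that reproduces the teacher exactly has zero distillation loss, and the contrastive loss is much smaller when each text embedding points the same way as its paired image embedding than when the pairs are swapped.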
To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
[159] A Geometric Approach to Steerable Convolutions
Soumyabrata Kundu,Risi Kondor
Main category: cs.CV
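TL;DR: Presents a new, intuitive derivation of steerable convolutional neural networks in d dimensions, based on geometric arguments and fundamental principles of pattern matching, and suggests building steerable convolution layers from interpolation kernels, which improves robustness to noisy data.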
Details
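Motivation: Provide a more intuitive understanding than existing group-theoretic approaches and improve the design and implementation of steerable convolutional neural networks.
Method: The derivation rests on geometric arguments and fundamental principles of pattern matching, and interpolation kernels are introduced to construct a new kind of steerable convolution layer.
Result: Gives an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions; the new method is more robust to noisy data.
Conclusion: The work gives steerable convolutional networks a more intuitive theoretical basis and, through interpolation kernels, an improved implementation.
Abstract: In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch--Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
[160] An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection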
Neel Patel,Alexander Wong,Ashkan Ebadi
Main category: cs.CV
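TL;DR: Proposes a teacher-student framework that combines supervised and self-supervised learning for tuberculosis and symptom detection on chest X-rays, achieving high accuracy and explainability.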
Details
Motivation: Resource-limited regions lack skilled radiologists, and large, high-quality datasets are costly to obtain, so reliable AI models are needed for early tuberculosis detection.
Method: A teacher-student framework that integrates two supervised heads and one self-supervised head to improve disease and symptom detection.
Result: Achieves 98.85% accuracy in distinguishing COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% on multilabel symptom detection, significantly outperforming baselines.
Conclusion: The model is accurate and explainable, basing its predictions on relevant anatomical features, and shows promise for clinical screening and triage.
Abstract: Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher--student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.
[161] SAM 2++: Tracking Anything at Any Granularity
Jiaming Zhang,Cheng Liang,Yichun Yang,Chenkai Zeng,Yutao Cui,Xinwen Zhang,Xin Zhou,Kai Ma,Gangshan Wu,Limin Wang
Main category: cs.CV
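TL;DR: Proposes SAM 2++, a unified video tracking model that tracks targets at any granularity (masks, boxes, and points) using task-specific prompts, a unified decoder, and a task-adaptive memory mechanism. The authors also build Tracking-Any-Granularity, a large-scale multi-granularity dataset, and report state-of-the-art results on multiple benchmarks.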
Details
Motivation: Existing video trackers are usually designed for a single task and rely on custom modules, which leads to poor generalization and redundant models that cannot handle target states of different granularity. A general framework that unifies multi-granularity tracking is therefore needed.
Method: 1) Task-specific prompts encode different inputs into unified prompt embeddings; 2) a unified decoder brings diverse outputs into a consistent form; 3) a task-adaptive memory mechanism unifies memory matching across granularities; 4) a customized data engine produces Tracking-Any-Granularity, a dataset annotated at multiple granularities.
Result: Experiments on multiple video tracking benchmarks show that SAM 2++ reaches state-of-the-art performance across tasks at different granularities and clearly outperforms existing specialized models, confirming its unification and robustness.
Conclusion: SAM 2++ achieves unified video tracking at any granularity; its unified architecture and mechanisms reduce model redundancy and improve generalization, offering an effective paradigm for general-purpose visual tracking.
Abstract: Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
[162] Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework
Yujie Xing,Xiao Wang,Bin Wu,Hai Huang,Chuan Shi
Main category: cs.CV
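TL;DR: Proposes a unified framework based on hierarchical masks, together with the M3Dphormer model, which uses multi-level masking and dual attention computation to build a flexible and efficient Graph Transformer.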
Details
Motivation: Existing Graph Transformers rely on intricate, interaction-specific architectural designs, which limits their flexibility and makes it hard to model diverse node interactions in a unified way.
Method: Proposes a unified hierarchical mask framework that reveals an equivalence between model architecture and attention mask construction; designs M3Dphormer, which combines three theoretically grounded hierarchical masks with a mixture-of-experts mechanism, and adds a dual-mode attention computation for scalability.
Result: Theoretical analysis shows that an effective mask must provide both a large receptive field and high label consistency; experiments show that M3Dphormer achieves state-of-the-art performance on multiple benchmarks.
Conclusion: The unified view through hierarchical masks models diverse node interactions effectively, and M3Dphormer is both flexible and efficient, validating the proposed framework and design principle.
Abstract: Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity.
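The idea of switching between dense and sparse attention can be illustrated with a small sketch in which both paths compute the same masked softmax and only the work done differs. This is not the M3Dphormer implementation: it uses scalar node values, a single global density threshold instead of local sparsity, and hypothetical names throughout, and it assumes every node may attend to at least one node (for example itself).

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf entries receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_masked_attention(scores, values, mask):
    """Score every pair, then set disallowed pairs to -inf before softmax."""
    out = []
    for i, row in enumerate(scores):
        masked = [s if mask[i][j] else float("-inf")
                  for j, s in enumerate(row)]
        weights = softmax(masked)
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out


def sparse_masked_attention(scores, values, neighbors):
    """Touch only the allowed pairs listed for each node."""
    out = []
    for i, allowed in enumerate(neighbors):
        weights = softmax([scores[i][j] for j in allowed])
        out.append(sum(w * values[j] for w, j in zip(weights, allowed)))
    return out


def dual_attention(scores, values, mask, density_threshold=0.5):
    """Pick the dense path for well-filled masks, else the sparse path.
    The threshold is an illustrative choice."""
    n = len(mask)
    density = sum(1 for row in mask for keep in row if keep) / (n * n)
    if density >= density_threshold:
        return "dense", dense_masked_attention(scores, values, mask)
    neighbors = [[j for j, keep in enumerate(row) if keep] for row in mask]
    return "sparse", sparse_masked_attention(scores, values, neighbors)


scores = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
values = [1.0, 2.0, 3.0, 4.0]
mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
mode, output = dual_attention(scores, values, mask)
reference = dense_masked_attention(scores, values, mask)
full_mode, _ = dual_attention(scores, values, [[1] * 4 for _ in range(4)])
```

Because the two paths agree numerically, the switch changes only cost: the sparse path does work proportional to the number of allowed pairs, while the dense path always does work proportional to the square of the number of nodes.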
Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
[163] FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Yubin Zheng,Pak-Hei Yeung,Jing Xia,Tianjie Ju,Peng Tang,Weidong Qiu,Jagath C. Rajapakse
Main category: cs.CV
TL;DR: Proposes FedDEAP, an adaptive federated prompt tuning framework that improves the generalization of CLIP in multi-domain scenarios.
Details
Motivation: Address the poor generalization of the global model in federated learning caused by domain shift and label heterogeneity, and explore how to fine-tune large-scale vision-language models such as CLIP effectively in a federated setting.
Method: Disentangles semantic and domain-specific features, introduces a dual-prompt design (a global semantic prompt and a local domain prompt), and aligns textual and visual representations through semantic and domain transformation networks.
Result: Experiments on four datasets show that the method improves the generalization of CLIP in cross-domain federated image recognition.
Conclusion: Through feature disentanglement, the dual-prompt design, and cross-modal alignment, FedDEAP markedly improves the adaptability and performance of CLIP in multi-domain federated learning.
Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
[164] DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution
Rongyuan Wu,Lingchen Sun,Zhengqiang Zhang,Shihao Wang,Tianhe Wu,Qiaosi Yi,Shuai Li,Lei Zhang
Main category: cs.CV
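TL;DR: This paper proposes DP$^2$O-SR, a framework that builds a hybrid reward signal from full-reference and no-reference image quality assessment models and directly optimizes perceptual preferences in real-world image super-resolution (Real-ISR) without human annotation. It constructs multiple preference pairs from diverse outputs of the same model and introduces hierarchical preference optimization for more efficient and stable training, significantly improving perceptual quality across several T2I models.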
Details
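Motivation: Real-ISR methods built on pre-trained text-to-image diffusion models produce outputs of varying perceptual quality because of model stochasticity. This is usually seen as a limitation, but it also offers an opportunity to improve perceptual quality. Mechanisms to exploit this diversity are lacking, and perceptual optimization that depends on human annotation is costly, so a method is needed that requires no human labels and makes full use of generation diversity.
Method: The Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR) framework: 1) builds a hybrid reward signal from full-reference and no-reference IQA models trained on large-scale human preference data; 2) constructs multiple preference pairs from different outputs of the same model, going beyond the conventional best-vs-worst selection; 3) introduces hierarchical preference optimization, which adaptively weights training samples according to intra-group reward gaps and inter-group diversity.
Result: Experiments on both diffusion-based and flow-based T2I backbones show that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks. The study also finds that model capacity affects the optimal selection strategy: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision.
Conclusion: DP$^2$O-SR offers a new paradigm for perceptual preference optimization without human annotation and uses generation diversity to improve Real-ISR performance. Its multi-pair construction and hierarchical optimization strategies give follow-up work a scalable direction and advance image restoration with high perceptual quality.
Abstract: Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision.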
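One way to picture the step from best-vs-worst selection to multiple preference pairs is the sketch below. It assumes the hybrid reward is a weighted sum of a full-reference and a no-reference score, and that pairs are formed between the top and bottom fraction of outputs ranked by reward; the weights, the pairing rule, and all function names are illustrative assumptions, not the paper's definitions.

```python
from itertools import product


def hybrid_reward(fr_score: float, nr_score: float,
                  fr_weight: float = 0.5) -> float:
    """Weighted mix of a full-reference and a no-reference quality score
    (the equal weighting is a hypothetical default)."""
    return fr_weight * fr_score + (1.0 - fr_weight) * nr_score


def build_preference_pairs(rewards, selection_ratio):
    """Return (preferred, rejected) index pairs between the top and bottom
    fraction of samples generated for one input image."""
    n = len(rewards)
    k = max(1, int(n * selection_ratio))
    if 2 * k > n:
        raise ValueError("selection_ratio must be at most 0.5")
    ranked = sorted(range(n), key=lambda i: rewards[i], reverse=True)
    winners, losers = ranked[:k], ranked[-k:]
    return list(product(winners, losers))


# Eight outputs of one model for the same low-resolution input.
rewards = [0.1, 0.9, 0.5, 0.7, 0.3, 0.2, 0.8, 0.6]
best_vs_worst = build_preference_pairs(rewards, selection_ratio=0.125)
broader = build_preference_pairs(rewards, selection_ratio=0.25)
```

A small ratio keeps only pairs with a large reward gap (stronger contrast), while a larger ratio yields more pairs with smaller gaps (broader coverage), which is the trade-off the abstract relates to model capacity.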
Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
[165] DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Ziang Zhang,Zehan Wang,Guanghao Zhang,Weilong Dai,Yan Xia,Ziang Yan,Minjie Hong,Zhou Zhao
Main category: cs.CV
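TL;DR: This paper introduces the concept of Dynamic Spatial Intelligence and builds DSI-Bench, a benchmark of nearly 1,000 dynamic videos and more than 1,700 annotated questions, to evaluate how well models understand the dynamic relationship between observer and objects under nine decoupled motion patterns. Experiments show that existing vision-language models commonly confuse self-motion with object motion and exhibit semantic bias, among other problems.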