cs.CL

[1] Teaching People LLM's Errors and Getting it Right

Nathan Stringham,Fateme Hashemi Chaleshtori,Xinyuan Yan,Zhichao Xu,Bei Wang,Ana Marasović

Main category: cs.CL

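TL;DR: This paper examines why teaching users an LLM's failure patterns has not yet succeeded in reducing overreliance. It finds that failure patterns do exist, that current automated discovery methods are only partly effective, and that the evaluation metric needs to be improved.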

Details Motivation: People see LLMs complete complex tasks and wrongly assume they will not err on simple ones, which leads to overreliance. Prior work tried to mitigate this by identifying an LLM's failure patterns and teaching them to users, with limited success; this paper investigates why. Method: First, group instances by meta-label and analyze the LLM's performance on each group to confirm that failure patterns exist. Second, test whether prompting and embedding-based methods can surface these known failure patterns. Finally, redesign the evaluation metric and run a user study to check whether teaching is effective under the new metric. Result: Sizable, error-prone meta-label groups that could be taught to users were found, so failure patterns do exist. The automated discovery methods gave mixed results, which partly explains the earlier negative findings. Under the proposed metric, teaching shows a positive effect that the traditional human-AI team accuracy does not reveal. Conclusion: Teaching failure patterns has the potential to reduce overreliance on LLMs, but its success depends on better automated failure-discovery methods and more suitable evaluation metrics. Abstract: People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won't stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing patterns in these regions. The found failure patterns are taught to users to mitigate their overreliance. Yet, this approach has not fully succeeded. In this analysis paper, we aim to understand why. We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and where the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are the LLM's failure patterns that could be taught to users, so they do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the final metric that measures teaching effectiveness. We propose to assess a user's ability to effectively use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching with this metric, unlike the human-AI team accuracy.
Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and using metrics like ours.
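The grouping step described in the abstract (group instances by meta-label, then flag groups that are both sizable and error-prone) can be illustrated with a minimal Python sketch. The function name `flag_failure_groups`, the thresholds `min_size` and `min_error_rate`, and the toy data are assumptions made for illustration; the abstract does not state the paper's actual criteria.

```python
from collections import defaultdict


def flag_failure_groups(instances, min_size=3, min_error_rate=0.5):
    """Return {meta_label: error_rate} for groups that are sizable and error-prone.

    instances: iterable of (meta_label, is_correct) pairs.
    min_size and min_error_rate are hypothetical thresholds, not the paper's.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for meta_label, is_correct in instances:
        totals[meta_label] += 1
        if not is_correct:
            errors[meta_label] += 1

    flagged = {}
    for label, n in totals.items():
        rate = errors[label] / n
        # A group is flagged only if it is large enough AND wrong often enough.
        if n >= min_size and rate >= min_error_rate:
            flagged[label] = rate
    return flagged


# Toy data: (meta-label, whether the LLM's prediction was correct).
toy = [
    ("arithmetic", False), ("arithmetic", False), ("arithmetic", True),
    ("poetry", True), ("poetry", True), ("poetry", True),
    ("dates", False),  # error-prone but too small to be flagged
]
flagged_groups = flag_failure_groups(toy)
print(flagged_groups)
```

With these toy thresholds only the "arithmetic" group is flagged: "dates" has a 100% error rate but a single instance, which is why a size criterion is applied alongside the error rate.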

[2] Morality is Contextual: Learning Interpretable Moral Contexts from Human Data with Probabilistic Clustering and Large Language Models

Geoffroy Morlat,Marceau Nahon,Augustin Chartouny,Raja Chatila,Ismael T. Freire,Mehdi Khamassi

Main category: cs.CL

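TL;DR: This paper presents COMETH, a framework that combines probabilistic context learning, LLM-based semantic abstraction, and human moral judgments to model how context affects the acceptability of ambiguous actions.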

Details Motivation: Moral judgments depend not only on the outcome of an action but also on its context. Existing methods struggle to capture the role of context accurately, so a model is needed that is interpretable and consistent with human judgments. Method: Build a dataset of 300 scenarios covering six core moral actions; collect ternary judgments (Blame/Neutral/Support) from 101 participants; standardize and cluster actions with an LLM filter and MiniLM embeddings; learn contexts online from human judgment distributions; and use an interpretable probabilistic model to extract and weight binary contextual features. Result: COMETH reaches about 60% agreement with majority human judgments, roughly double that of end-to-end LLM prompting (about 30%), and reveals the key contextual features that drive its predictions. Conclusion: COMETH offers a reproducible, interpretable way to model context-dependent moral judgments that outperforms pure LLM approaches, and it contributes an empirical dataset and a pipeline that combines human judgments with model learning. Abstract: Moral actions are judged not only by their outcomes but by the context in which they occur. We present COMETH (Contextual Organization of Moral Evaluation from Textual Human inputs), a framework that integrates a probabilistic context learner with LLM-based semantic abstraction and human moral evaluations to model how context shapes the acceptability of ambiguous actions. We curate an empirically grounded dataset of 300 scenarios across six core actions (violating Do not kill, Do not deceive, and Do not break the law) and collect ternary judgments (Blame/Neutral/Support) from N=101 participants. A preprocessing pipeline standardizes actions via an LLM filter and MiniLM embeddings with K-means, producing robust, reproducible core-action clusters. COMETH then learns action-specific moral contexts by clustering scenarios online from human judgment distributions using principled divergence criteria. To generalize and explain predictions, a Generalization module extracts concise, non-evaluative binary contextual features and learns feature weights in a transparent likelihood-based model. Empirically, COMETH roughly doubles alignment with majority human judgments relative to end-to-end LLM prompting (approx. 60% vs. approx. 30% on average), while revealing which contextual features drive its predictions. The contributions are: (i) an empirically grounded moral-context dataset, (ii) a reproducible pipeline combining human judgments with model-based context learning and LLM semantics, and (iii) an interpretable alternative to end-to-end LLMs for context-sensitive moral prediction and explanation.
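To make the idea of clustering scenarios online from judgment distributions more concrete, here is a small Python sketch. It is not COMETH's algorithm: the choice of Jensen-Shannon divergence, the threshold of 0.1, the running-mean centroid update, and the names `js_divergence` and `assign_online` are all assumptions, since the abstract only mentions "principled divergence criteria".

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (symmetric, finite even with zero entries)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def assign_online(distributions, threshold=0.1):
    """Process scenarios one at a time; join the closest cluster or open a new one.

    distributions: list of [P(Blame), P(Neutral), P(Support)] per scenario.
    threshold: hypothetical divergence cut-off for joining an existing cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [scenario indices]}
    for idx, dist in enumerate(distributions):
        best, best_div = None, float("inf")
        for cluster in clusters:
            div = js_divergence(dist, cluster["centroid"])
            if div < best_div:
                best, best_div = cluster, div
        if best is not None and best_div <= threshold:
            n = len(best["members"])
            # Running mean keeps the centroid a valid probability distribution.
            best["centroid"] = [(c * n + d) / (n + 1)
                                for c, d in zip(best["centroid"], dist)]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": list(dist), "members": [idx]})
    return clusters


# Toy judgment distributions over (Blame, Neutral, Support).
toy_judgments = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
contexts = assign_online(toy_judgments)
print([c["members"] for c in contexts])
```

In this toy run the first two scenarios, which are mostly blamed, fall into one context, and the mostly supported third scenario opens a second one.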

[3] Oogiri-Master: Benchmarking Humor Understanding via Oogiri

Soichiro Murakami,Hidetaka Kamigaito,Hiroya Takamura,Manabu Okumura

Main category: cs.CL

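TL;DR: This paper introduces Oogiri-Master and Oogiri-Corpus, a new benchmark and dataset for evaluating humor understanding in large language models. Large numbers of independent human ratings reduce popularity bias, and an analysis of the linguistic factors related to funniness finds that state-of-the-art models approach human performance.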

Details Motivation: Existing humor datasets have few candidate responses, ratings influenced by popularity, and no comparable objective metrics, which makes it hard to answer reliably what makes a response funny. Method: Build Oogiri-Corpus with about 100 candidate responses per prompt and about 100 independent raters per response; reduce bias through independent human ratings; extract linguistic features such as text length, ambiguity, and incongruity resolution; derive objective metrics for predicting funniness; and benchmark a range of LLMs and human baselines on Oogiri-Master. Result: State-of-the-art LLMs approach human-level performance on humor understanding, and insight-augmented prompting improves model performance. The study also identifies linguistic factors significantly associated with perceived funniness, enabling quantitative prediction of humor judgments. Conclusion: The study provides a reproducible, debiased evaluation framework for humor understanding in LLMs, maps the limits of current models, and offers a quantifiable path toward improving AI creativity and humor. Abstract: Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, which are a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves the model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.

[4] Beyond Heuristics: A Decision-Theoretic Framework for Agent Memory Management

Changzhi Sun,Xiangyu Chen,Jixiang Luo,Dell Zhang,Xuelong Li

Main category: cs.CL

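TL;DR: This paper proposes DAM (Decision-theoretic Agent Memory), a framework that recasts external memory management as a sequential decision problem under uncertainty and uses value functions and uncertainty estimates to trade off long-term utility against risk.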

Details Motivation: Memory management in current LLM systems relies on hand-designed heuristics, which cannot predict the long-term effect of memory operations on future behavior and do not account for uncertainty. Method: Propose the DAM framework, which decomposes memory management into immediate information access and hierarchical storage maintenance, evaluates candidate operations with value functions and uncertainty estimators, and makes decisions through an aggregate policy. Result: The work does not propose a new algorithm. It provides a principled framework that exposes the limitations of heuristic methods and lays a foundation for future research on uncertainty-aware memory systems. Conclusion: Treating memory management as a decision problem is the better paradigm, and DAM provides a theoretical basis and research directions for building adaptive LLM memory systems for long-term interaction. Abstract: External memory is a key component of modern large language model (LLM) systems, enabling long-term interaction and personalization. Despite its importance, memory management is still largely driven by hand-designed heuristics, offering little insight into the long-term and uncertain consequences of memory decisions. In practice, choices about what to read or write shape future retrieval and downstream behavior in ways that are difficult to anticipate. We argue that memory management should be viewed as a sequential decision-making problem under uncertainty, where the utility of memory is delayed and dependent on future interactions. To this end, we propose DAM (Decision-theoretic Agent Memory), a decision-theoretic framework that decomposes memory management into immediate information access and hierarchical storage maintenance. Within this architecture, candidate operations are evaluated via value functions and uncertainty estimators, enabling an aggregate policy to arbitrate decisions based on estimated long-term utility and risk. Our contribution is not a new algorithm, but a principled reframing that clarifies the limitations of heuristic approaches and provides a foundation for future research on uncertainty-aware memory systems.
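As a rough illustration of arbitrating between candidate memory operations by estimated utility and risk, the following Python sketch scores each candidate as value minus a risk penalty. The abstract explicitly says DAM is a framing rather than an algorithm, so everything here (the `Candidate` class, the linear scoring rule, the `risk_aversion` parameter, and the toy numbers) is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical memory operation
    value: float        # estimated long-term utility of the operation
    uncertainty: float  # how unreliable that estimate is believed to be


def choose_operation(candidates, risk_aversion=1.0):
    """Pick the candidate with the best risk-adjusted score.

    score = value - risk_aversion * uncertainty (an assumed, simple rule).
    """
    def score(candidate):
        return candidate.value - risk_aversion * candidate.uncertainty

    return max(candidates, key=score)


toy_candidates = [
    Candidate("write_summary", value=0.9, uncertainty=0.6),
    Candidate("keep_raw_turn", value=0.6, uncertainty=0.1),
    Candidate("skip_write", value=0.2, uncertainty=0.0),
]
cautious = choose_operation(toy_candidates, risk_aversion=1.0)
risk_neutral = choose_operation(toy_candidates, risk_aversion=0.0)
print(cautious.name, risk_neutral.name)
```

The same candidates lead to different decisions depending on how much the policy penalizes uncertainty, which is the trade-off a fixed heuristic cannot express.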

[5] A Unified Definition of Hallucination, Or: It's the World Model, Stupid

Emmy Liu,Varun Gangal,Chelsea Zou,Xiaoqi Huang,Michael Yu,Alex Chang,Zhuofu Tao,Sachin Kumar,Steven Y. Feng

Main category: cs.CL

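TL;DR: This paper proposes a unified definition of hallucination in large language models: hallucination is inaccurate internal world modeling that shows up in a form observable to the user. The framework subsumes earlier definitions and stresses making the reference "world" and the knowledge conflict policy explicit. This helps clarify evaluation standards, separates hallucination from other error types, and guides the construction of benchmarks based on synthetic world models.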

Details Motivation: Despite extensive work on hallucination in language models, the problem remains widespread. Existing definitions are numerous and lack a unifying view, which makes evaluation and comparison difficult. A unified theoretical framework is needed to clarify what hallucination is and how it manifests. Method: Review historical definitions of hallucination and fold them into one framework: hallucination is inaccurate internal world modeling, and its concrete form depends on the reference world model adopted and the knowledge conflict policy (for example, knowledge base first or in-context first). Result: A unified framework that covers the various existing definitions and makes clear that the core of hallucination is inconsistency with a specified "true world". It can guide future benchmark design, for example using synthetic but fully specified world models to test and improve the world modeling abilities of language models. Conclusion: The unified definition clarifies the nature of hallucination, improves the consistency and comparability of evaluations, and provides a theoretical basis for more effective mitigation strategies and stress-test benchmarks. Abstract: Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.
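The point that the same output can count as a hallucination or not depending on the reference world and the knowledge conflict policy can be shown with a toy Python sketch. The representation of a world as a dictionary of facts, the policy names `"context_first"` and `"kb_first"`, and the invented country "Freedonia" are all assumptions for illustration; the paper's planned benchmarks are not specified at this level in the abstract.

```python
def find_hallucinations(claims, knowledge_base, context, policy="context_first"):
    """Return the claimed facts that contradict the reference world.

    claims, knowledge_base, context: dicts mapping (relation, subject) -> value.
    policy: which source wins when the knowledge base and the context disagree
            (hypothetical names, chosen for this sketch).
    """
    if policy == "context_first":
        world = {**knowledge_base, **context}  # context overrides the knowledge base
    elif policy == "kb_first":
        world = {**context, **knowledge_base}  # knowledge base overrides the context
    else:
        raise ValueError(f"unknown policy: {policy}")

    # Only claims about facts the world specifies can be judged as mismatches.
    return [key for key, value in claims.items()
            if key in world and world[key] != value]


# A synthetic, fully specified toy world with a deliberate conflict.
kb = {("capital_of", "Freedonia"): "Alba"}
ctx = {("capital_of", "Freedonia"): "Borz"}    # the prompt states something else
model_claims = {("capital_of", "Freedonia"): "Borz"}

under_context = find_hallucinations(model_claims, kb, ctx, policy="context_first")
under_kb = find_hallucinations(model_claims, kb, ctx, policy="kb_first")
print(under_context, under_kb)
```

The model's claim is faithful under the context-first policy and a mismatch under the knowledge-base-first policy, so an evaluation has to state which policy it assumes before labeling the output.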

[6] Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Alexander Podolskiy,Semen Molokov,Timofey Gerasin,Maksim Titov,Alexey Rukhovich,Artem Khrapov,Kirill Morozov,Evgeny Tetin,Constantine Korikov,Pavel Efimov,Polina Lazukova,Yuliya Skripkar,Nikita Okhotnikov,Irina Piontkovskaya,Meng Xiaojun,Zou Xueyi,Zhang Zhenhe

Main category: cs.CL

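TL;DR: Gamayun is a 1.5B-parameter multilingual model trained from scratch with a two-stage pre-training strategy. It achieves strong multilingual performance at small scale and with limited resources, and leads models of comparable size on Russian tasks.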

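Details Motivation: Small, non-English-centric multilingual LLMs are under-researched, especially given the need for deployment in resource-constrained environments. Gamayun explores efficient training of small multilingual models with cross-lingual alignment. Method: A novel two-stage pre-training strategy: the first stage uses balanced multilingual training for cross-lingual alignment, and the second stage adds high-quality English data to transfer performance gains. The model has 1.5B parameters, is trained from scratch on 2.5T tokens, and supports 12 languages with a focus on Russian. Result: Although its training budget (2.5T tokens) is far smaller than that of LLaMA3.2-1B (9T) and Qwen2.5-1.5B (18T), Gamayun outperforms the former on all benchmarks and surpasses the latter on most English and multilingual tasks. On tasks outside advanced STEM it matches or exceeds Qwen3, which was trained on 36T tokens, and it achieves the best results among models of comparable size on Russian tasks such as the MERA benchmark. Conclusion: Gamayun shows that with a carefully designed two-stage pre-training strategy, a small multilingual model can achieve strong cross-lingual transfer and overall performance even on a small training budget, offering an efficient and practical option for multilingual AI deployment in resource-constrained environments. Abstract: We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).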

[7] Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

Xinyu Tang,Yuliang Zhan,Zhixun Li,Wayne Xin Zhao,Zhenduo Zhang,Zujie Wen,Zhiqiang Zhang,Jun Zhou

Main category: cs.CL

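TL;DR: This paper studies how positive and negative samples affect the training of large reasoning models in reinforcement learning with verifiable rewards (RLVR), and proposes A3PO, an adaptive and asymmetric token-level advantage shaping method that allocates advantage signals to key tokens more precisely across polarities to improve reasoning.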

Details Motivation: To understand the role that positive and negative self-generated samples play in RLVR training, and to improve how existing methods allocate advantage signals so that large reasoning models train better. Method: Systematically analyze how adjusting the advantage values of positive and negative samples, at the sample level and at the token level, affects RLVR training dynamics; then propose A3PO, which shapes advantages at the token level adaptively and asymmetrically according to sample polarity. Result: Experiments on five reasoning benchmarks show that A3PO improves model performance and outperforms conventional RLVR methods. Conclusion: Properly regulating the advantage signals of positive and negative samples, especially with differentiated treatment at the token level, helps optimize both the training process and the final performance of reasoning models. Abstract: Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
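The sample-level adjustment mentioned in the abstract (treating the advantages of positive and negative rollouts differently) can be sketched in a few lines of Python. This is not A3PO, whose token-level rule is not given in the abstract; the function `scale_by_polarity` and the weights `pos_weight` and `neg_weight` are hypothetical and only show what an asymmetric treatment of polarities might look like.

```python
def scale_by_polarity(advantages, pos_weight=1.0, neg_weight=1.0):
    """Rescale per-rollout advantages asymmetrically by their sign.

    advantages: one scalar advantage per rollout; a positive value marks a
                positive sample, a negative value marks a negative sample.
    pos_weight / neg_weight: hypothetical multipliers for each polarity.
    """
    return [a * (pos_weight if a > 0 else neg_weight) for a in advantages]


# Toy advantages for four rollouts of the same prompt.
toy_advantages = [1.0, -1.0, 0.5, -0.5]

# Halve the push away from negative samples, keep positive samples unchanged.
shaped = scale_by_polarity(toy_advantages, pos_weight=1.0, neg_weight=0.5)
print(shaped)
```

A token-level variant would apply weights per token within each rollout instead of one weight per rollout; how those weights are chosen is exactly what the paper's method addresses.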

[8] Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

Chengxu Yang,Jingling Yuan,Siqi Cai,Jiawei Jiang,Chuang Hu

Main category: cs.CL

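TL;DR: This paper proposes HIC-Bench, a new framework for evaluating "Intelligent Hallucinations" (IH) and "Defective Hallucinations" (DH) in large language models. It aims to quantify the creative value of hallucinations in scientific innovation and reveals a nonlinear relationship between creativity and accuracy.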

Details Motivation: Existing hallucination detection methods focus mainly on factual consistency, struggle to balance creativity with accuracy, and cannot handle diverse scientific tasks well. A new framework is needed to separate hallucinations with creative potential from harmful errors. Method: Propose HIC-Bench, which combines TTCT creativity metrics with hallucination-specific dimensions in a multi-dimensional evaluation matrix, covers ten scientific domains, uses the Dynamic Hallucination Prompt (DHP) to guide model outputs, and relies on multiple LLM judges plus human annotation for scoring and classification. Result: Experiments find a nonlinear relationship between IH and DH, indicating that creativity and correctness can be optimized together; IH can foster scientific innovation, and some hallucinations have epistemic value. Conclusion: Intelligent hallucinations should not simply be treated as errors; they can serve as a resource for creative thinking. HIC-Bench offers an effective platform for studying the creative intelligence of LLMs. Abstract: Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized.
These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.

[9] Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

Shuchang Pan,Siddharth Banerjee,Dhruv Hebbar,Siddhant Patel,Akshaj Gupta,Kan Jen Cheng,Hanjo Kim,Zeyi Austin Li,Martin Q. Ma,Tingle Li,Gopala Anumanchipalli,Jiachen Lian

Main category: cs.CL

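TL;DR: This paper proposes a framework based on a Graph-of-Thoughts (GoT) for modeling causal reasoning in full-duplex dialogue systems. It predicts communicative intents and speech acts through hierarchical labels and trains on both simulated and real data, yielding interpretable, dynamically refined predictions of conversational behavior.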

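Details Motivation: Human conversation is driven by an implicit chain of thoughts that surfaces as ordered speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems, yet existing methods do not explicitly model the causal dependency from intent to action. Method: Propose the GoT framework, which models conversational behavior as a causal inference process; use a hierarchical labeling scheme to predict high-level communicative intents and low-level speech acts; build a hybrid corpus of controllable simulated scenarios and human-annotated rationales; and use a multimodal transformer for streaming prediction of future speech acts together with justifications for its decisions. Result: Experiments on synthetic and real duplex dialogue data show that the framework detects conversational behaviors robustly, produces interpretable reasoning chains, and supports benchmarking of conversational reasoning. Conclusion: The GoT framework integrates the causal relationship between intent and action, improves the interpretability and reasoning ability of full-duplex dialogue systems, and offers a new modeling paradigm for future conversational systems. Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.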

[10] MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

Jing Han,Binwei Yan,Tianyu Guo,Zheyuan Bai,Mengyu Zheng,Hanting Chen,Ying Nie

Main category: cs.CL

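TL;DR: This paper proposes Mixture-of-Roles (MoR), a parameter-efficient fine-tuning (PEFT) framework for agent tasks. It decomposes agent capabilities into three roles (reasoning, execution, and summarization), handles them with three specialized LoRA groups that cooperate to complete the task, and adds a multi-role data generation pipeline. The method is validated on multiple LLMs and benchmarks.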

Details Motivation: Existing LLM fine-tuning methods are not parameter-efficient for agent tasks and lack a structured fine-tuning strategy for the diverse capabilities that agents need, so a more efficient PEFT method with a clear division of roles is required. Method: Decompose agent tasks into three roles (reasoner, executor, and summarizer), design the MoR framework with three specialized LoRA groups, and build a multi-role data generation pipeline with role-specific content completion and reliability verification to fine-tune each role efficiently. Result: Extensive experiments and ablation studies on various LLMs and agent benchmarks show that MoR outperforms existing methods in both parameter efficiency and task performance. Conclusion: Through role decomposition and cooperating specialized LoRAs, MoR provides an efficient and scalable parameter-efficient fine-tuning scheme for agent tasks and advances PEFT for agents. Abstract: Despite recent advancements in fine-tuning large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agents remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method. This project is publicly available at https://mor-agent.github.io.

[11] Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers

Md. Rakibul Islam,Most. Sharmin Sultana Samu,Md. Zahid Hossain,Farhad Uz Zaman,Md. Kamrozzaman Bhuiyan

Main category: cs.CL

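TL;DR: This study evaluates five transformer-based models on detecting AI-generated text in Bengali. In zero-shot evaluation the models perform near chance, while task-specific fine-tuning raises performance to about 91% accuracy and F1-score.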

Details Motivation: Because large language models can be misused to produce disinformation, detecting AI-generated text is essential. Research on Bengali, a language with rich vocabulary and complex structure, is still lacking, and effective detection methods need to be explored. Method: Evaluate five pretrained transformer models (XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base, and MultilingualBERT-Base) in a zero-shot setting, then apply task-specific fine-tuning to improve detection performance. Result: In zero-shot evaluation all models score close to 50% accuracy. After fine-tuning, XLM-RoBERTa, mDeBERTa, and MultilingualBERT improve sharply, reaching about 91% on both accuracy and F1-score, while IndicBERT performs comparatively poorly. Conclusion: Fine-tuning substantially improves multilingual models on detecting AI-generated Bengali text. The study provides a solid foundation for countering AI-generated content and advances research on low-resource languages in this area. Abstract: Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali's rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy) and highlights the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT demonstrates comparatively weaker performance, indicating limited effectiveness in fine-tuning for this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.

[12] Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

Yuyi Zhang,Boyu Tang,Tianjie Ju,Sufeng Duan,Gongshen Liu

Main category: cs.CL

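TL;DR: This paper studies the reasoning mechanism of latent tokens in large language models (such as COCONUT) and finds that they lack genuine reasoning ability, rely on shortcuts, and exploit dataset biases, which makes them in essence a pseudo-reasoning mechanism.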

Details Motivation: Latent tokens are used to improve reasoning in large models, but their internal mechanisms are unclear. This paper aims to expose their potential weaknesses from a reliability perspective. Method: Two analyses: intervention (steering) experiments that perturb specific tokens (COCONUT and explicit CoT), and shortcut experiments that evaluate model behavior under biased and out-of-distribution settings. Result: COCONUT tokens are insensitive to intervention and lack reasoning-critical information, and on MMLU and HotpotQA they depend on dataset artifacts, which inflates measured performance. Conclusion: COCONUT does not achieve genuine reasoning; it is a pseudo-reasoning mechanism that conceals reliance on shortcuts, which suggests that current latent reasoning methods need stricter validation. Abstract: Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

[13] CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

Rui Ke,Jiahui Xu,Shenghao Yang,Kuang Wang,Feng Jiang,Haizhou Li

Main category: cs.CL

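TL;DR: This paper proposes CATCH, a framework for theme detection in user dialogue systems without predefined schemas. It combines context-aware representation, preference-guided clustering, and hierarchical generation, and achieves strong performance on multi-domain dialogue data.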

Details Motivation: Existing methods struggle to represent topics accurately from sparse, short utterances and cannot capture users' personalized thematic preferences across dialogues, so a theme detection method is needed that ensures both semantic consistency and alignment with user preferences. Method: Propose the CATCH framework with three core components: context-aware topic representation, preference-guided topic clustering, and a hierarchical theme generation mechanism, modeling themes from both semantic similarity and user feedback. Result: Experiments on the DSTC-12 multi-domain customer dialogue benchmark show that CATCH with an 8B LLM performs well in both theme clustering and topic generation quality. Conclusion: CATCH addresses theme detection for short texts and across dialogues, improving thematic consistency and personalized alignment. Abstract: Theme detection is a fundamental task in user-centric dialogue systems, aiming to identify the latent topic of each utterance without relying on predefined schemas. Unlike intent induction, which operates within fixed label spaces, theme detection requires cross-dialogue consistency and alignment with personalized user preferences, posing significant challenges. Existing methods often struggle with sparse, short utterances for accurate topic representation and fail to capture user-level thematic preferences across dialogues. To address these challenges, we propose CATCH (Controllable Theme Detection with Contextualized Clustering and Hierarchical Generation), a unified framework that integrates three core components: (1) context-aware topic representation, which enriches utterance-level semantics using surrounding topic segments; (2) preference-guided topic clustering, which jointly models semantic proximity and personalized feedback to align themes across dialogue; and (3) a hierarchical theme generation mechanism designed to suppress noise and produce robust, coherent topic labels. Experiments on a multi-domain customer dialogue benchmark (DSTC-12) demonstrate the effectiveness of CATCH with an 8B LLM in both theme clustering and topic generation quality.

[14] Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

Abdullah Alabdullah,Lifeng Han,Chenghua Lin

Main category: cs.CL

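TL;DR: This paper proposes Ara-HOPE, a human-centric post-editing evaluation framework for Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation. It includes a five-category error taxonomy and a decision-tree annotation protocol, and it reveals systematic differences between machine translation systems.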

Details Motivation: Existing automatic metrics and general-purpose human evaluation frameworks struggle to capture dialect-related machine translation errors, which hinders accurate assessment of DA-MSA translation quality. Method: Propose the Ara-HOPE framework, comprising a five-category error taxonomy and a decision-tree annotation procedure, and use it to compare three MT systems (Jais, GPT-3.5, and NLLB-200). Result: Current MT systems still face significant challenges in handling dialect-specific terminology and preserving meaning, and Ara-HOPE distinguishes performance differences between systems. Conclusion: Ara-HOPE provides a new framework for evaluating Dialectal Arabic MT quality and offers actionable guidance for improving dialect-aware MT systems. Abstract: Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. The results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems.

[15] Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

Ting-Hao K. Huang,Ryan A. Rossi,Sungchul Kim,Tong Yu,Ting-Yao E. Hsu,Ho Yin Ng,C. Lee Giles

Main category: cs.CL

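TL;DR: Between 2021 and 2025 the SciCap project grew into a central effort in scientific figure captioning. This piece summarizes five years of technical and methodological lessons and proposes five major directions for future research.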

Details Motivation: To test whether domain-specific training (as in SciBERT) is effective for scientific figure captioning, and to improve the accessibility and quality of scientific visual content. Method: Build and continually update a large dataset of figure-caption pairs from arXiv papers, run automatic and human evaluations, organize annual challenges, develop interactive systems, and adapt to the rapid rise of large language models. Result: A widely used scientific figure-caption dataset, progress in automatic and human evaluation methods, multi-institution collaboration, and released tools and challenges that help scientists write better captions. Conclusion: Domain-specific data and models are essential for scientific figure captioning; future work needs to address caption precision, contextual consistency, diversity, interpretability, and integration with real research practice. Abstract: Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

[16] On The Conceptualization and Societal Impact of Cross-Cultural Bias

Vitthal Bhandari

Main category: cs.CL

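TL;DR: This paper reviews 20 papers from 2025 on identifying and evaluating cultural bias in natural language processing. It notes that current large language models generalize across cultures, criticizes existing research for often neglecting real users, and calls for more systematic assessment of the societal impact of language technologies.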

Details Motivation: Although large language models can generate responses based on cultural context, they tend to overgeneralize across cultures. Existing bias evaluations also mostly leave out the people who actually use the technology, so they fail to address the cultural bias problem itself. Method: Inspired by the work in arXiv:2005.14050v2, the author analyzes 20 papers on cultural bias in NLP published in 2025 and distills a set of observations. Result: Current cultural bias research falls short in conceptualizing bias and in evaluating its societal harms, in particular through the lack of engagement with actual stakeholders. Conclusion: Future research should define cultural bias more concretely and build more robust societal impact assessment frameworks that involve real users. Abstract: Research has shown that while large language models (LLMs) can generate their responses based on cultural context, they are not perfect and tend to generalize across cultures. However, when evaluating the cultural bias of a language technology on any dataset, researchers may choose not to engage with stakeholders actually using that technology in real life, which evades the very fundamental problem they set out to address. Inspired by the work done by arXiv:2005.14050v2, I set out to analyse recent literature about identifying and evaluating cultural bias in Natural Language Processing (NLP). I picked out 20 papers published in 2025 about cultural bias and came up with a set of observations to allow NLP researchers in the future to conceptualize bias concretely and evaluate its harms effectively. My aim is to advocate for a robust assessment of the societal impact of language technologies exhibiting cross-cultural bias.

[17] Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments

Hong Su

Main category: cs.CL

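TL;DR: This paper proposes Method Decoration (DeMe), a general framework that dynamically modifies the task-execution path of a large language model (LLM) using explicit "decorations" extracted from hidden goals, learned methods, and environmental feedback, improving the adaptability and safety of intelligent IoT systems in unknown or abnormal environments.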

Details Motivation: Existing LLM-based intelligent IoT systems cannot systematically generate new execution methods when they meet new situations, and they depend on fixed device logic that does not adapt to dynamic environments. Method: The DeMe framework uses universal behavioral principles drawn from hidden goals, past experience, and environmental differences to generate decoration rules dynamically, and restructures the LLM's method-generation path through pre-decoration, post-decoration, intermediate-step modification, and step insertion. Result: Experiments show that DeMe lets IoT devices generate more appropriate task-execution methods under unknown or faulty operating conditions, improving adaptability and robustness. Conclusion: Through a decoration mechanism that is not hardcoded, DeMe flexibly adjusts the LLM's method-generation path and gives intelligent IoT systems stronger environmental adaptability and safety alignment. Abstract: Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task-execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device-specific logic that cannot adapt to changing environmental conditions. In this paper, we propose Method Decoration (DeMe), a general framework that modifies the method-generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path (through pre-decoration, post-decoration, intermediate-step modification, and step insertion), thereby producing context-aware, safety-aligned, and environment-adaptive methods. Experimental results show that method decoration allows IoT devices to derive more appropriate methods when confronting unknown or faulty operating conditions.

[18] Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco

Siyu Li,Chenwei Song,Wan Zhou,Xinyi Liu

Main category: cs.CL

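TL;DR: Proposes a large language model approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. By building a domain knowledge graph and combining a graph neural network with the Transformer architecture, it significantly improves the accuracy and depth of complex reasoning.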

Details Motivation: To improve the accuracy and depth of knowledge reasoning in tobacco pest and disease control, especially for complex multi-hop and comparative questions, structured knowledge needs to be integrated effectively. Method: Building on the GraphRAG framework, the authors use LLMs to construct a tobacco pest and disease knowledge graph, learn node representations with a GNN, and perform inference with a ChatGLM-based model fine-tuned using LoRA. Result: Experiments show that the method outperforms baseline models on multiple evaluation metrics, most notably in multi-hop and comparative reasoning scenarios. Conclusion: Combining graph-structured information with large language models effectively improves knowledge reasoning in specialized domains and offers a workable approach to intelligent decision-making in agriculture. Abstract: This paper proposes a large language model (LLM) approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built upon the GraphRAG framework, the proposed method enhances knowledge retrieval and reasoning by explicitly incorporating structured information from a domain-specific knowledge graph. Specifically, LLMs are first leveraged to assist in the construction of a tobacco pest and disease knowledge graph, which organizes key entities such as diseases, symptoms, control methods, and their relationships. Based on this graph, relevant knowledge is retrieved and integrated into the reasoning process to support accurate answer generation. The Transformer architecture is adopted as the core inference model, while a graph neural network (GNN) is employed to learn expressive node representations that capture both local and global relational information within the knowledge graph. A ChatGLM-based model serves as the backbone LLM and is fine-tuned using LoRA to achieve parameter-efficient adaptation. Extensive experimental results demonstrate that the proposed approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.

Baorong Huang,Ali Asiri

Main category: cs.CL

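TL;DR: This paper presents AlignAR, a generative sentence alignment method, and a new Arabic-English dataset of complex legal and literary texts. Reducing one-to-one mappings exposes the limitations of traditional alignment methods, whereas LLM-based approaches prove more robust, reaching an F1-score of 85.5%, a 9% improvement over previous methods.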

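Details Motivation: High-quality parallel corpora are essential for machine translation research and teaching, but Arabic-English resources are scarce and consist mostly of simple one-to-one mappings, which makes it hard to evaluate alignment methods properly. Method: The authors propose AlignAR, a generative sentence alignment method, and build a new Arabic-English dataset of complex legal and literary texts, including a "Hard" subset designed with fewer one-to-one mappings so that alignment methods are evaluated more strictly. Result: Experiments show that "Easy" datasets lack discriminatory power; on the "Hard" subset traditional methods are limited, while LLM-based methods perform better, with an overall F1-score of 85.5%, a 9% improvement over previous methods. Conclusion: LLM-based alignment is more robust and effective on complex texts, and the new dataset and method are a useful resource for future machine translation and sentence alignment research. Abstract: High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising complex legal and literary texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our "Hard" subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated superior robustness, achieving an overall F1-score of 85.5%, a 9% improvement over previous methods. Our datasets and code are open-sourced at https://github.com/XXX.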

[20] HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

Jiaxin Liu,Peiyi Tu,Wenyu Chen,Yihong Zhuang,Xinxia Ling,Anji Zhou,Chenxi Wang,Zhuo Han,Zhengkai Yang,Junbo Zhao,Zenan Huang,Yuanyuan Wang

Main category: cs.CL

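TL;DR: This paper presents HeartBench, a benchmark framework for evaluating Chinese large language models on emotional, cultural, and ethical dimensions. Grounded in real psychological counseling scenarios and developed with clinical experts, it covers five primary dimensions and 15 secondary capabilities and uses a "reasoning-before-scoring" evaluation protocol. Tests on 13 state-of-the-art models show that even leading models reach only 60% of the expert-defined ideal score, and that performance drops markedly on hard cases involving subtle emotions and complex ethical dilemmas.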

Details Motivation: Current large language models do well on cognitive and reasoning tasks but fall clearly short in understanding human qualities such as social, emotional, and ethical nuance. In the Chinese context in particular, the lack of dedicated evaluation frameworks and high-quality socio-emotional data holds back progress on human-like intelligence. Method: The HeartBench framework builds a theory-driven taxonomy of five dimensions and 15 capabilities, draws on real psychological counseling cases, applies a case-specific, rubric-based evaluation method, and introduces a "reasoning-before-scoring" mechanism that turns abstract human-like traits into measurable criteria. Result: Across 13 mainstream Chinese LLMs, the best model reaches only 60% of the expert-defined ideal score; on the difficulty-stratified "Hard Set", performance drops significantly when models handle implicit emotions and complex ethical conflicts. Conclusion: HeartBench provides a standardized way to evaluate the anthropomorphic intelligence of Chinese LLMs and a methodological template for building high-quality, human-centered training data, pushing AI toward greater emotional and ethical sensitivity. Abstract: While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence, that is, the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a "reasoning-before-scoring" evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.

[21] TimeBill: Time-Budgeted Inference for Large Language Models

Qi Fan,An Zou,Yehan Ma

Main category: cs.CL

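TL;DR: This paper proposes TimeBill, a time-budgeted inference framework that balances the inference efficiency and response performance of large language models under a given time limit. It uses fine-grained response length prediction and execution time estimation to adaptively adjust the KV cache eviction ratio.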

Details Motivation: LLMs are increasingly used in time-critical systems, but their auto-regressive generation makes end-to-end execution time hard to estimate, and existing fixed KV cache compression strategies cannot adapt to tasks with different time budgets. Method: The authors propose a response length predictor (RLP) and an execution time estimator (ETE), and dynamically adjust the KV cache eviction ratio based on these predictions and the given time budget, enabling time-aware efficient inference. Result: Experiments show that TimeBill improves task completion rate while maintaining good response performance under various overrun strategies. Conclusion: TimeBill effectively addresses the trade-off between inference efficiency and output quality for LLMs in time-constrained settings and has potential for practical deployment. Abstract: Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
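The idea of adapting the KV cache eviction ratio to a time budget can be illustrated with a minimal sketch. It is not TimeBill's actual predictor or estimator: the linear latency model, its cost constants, the candidate ratios, and the function names `estimate_time` and `pick_eviction_ratio` are all assumptions made up for illustration, and the predicted response length is simply passed in.

```python
def estimate_time(prompt_len, response_len, eviction_ratio,
                  prefill_cost=0.002, decode_cost=0.00005):
    """Toy latency model (hypothetical constants, in seconds).

    Prefill is linear in the prompt length; each decode step costs time
    proportional to the number of KV entries it attends to.
    """
    kept = prompt_len * (1.0 - eviction_ratio)  # prompt KV entries retained
    prefill = prefill_cost * prompt_len
    # Decode step i attends to the kept prompt entries plus i generated tokens.
    decode = decode_cost * sum(kept + i for i in range(response_len))
    return prefill + decode


def pick_eviction_ratio(prompt_len, predicted_len, budget_s,
                        candidates=(0.0, 0.25, 0.5, 0.75, 0.9)):
    """Return the smallest eviction ratio whose estimated time fits the budget.

    Smaller ratios keep more context, so they are tried first.
    Returns None when no candidate meets the budget.
    """
    for ratio in sorted(candidates):
        if estimate_time(prompt_len, predicted_len, ratio) <= budget_s:
            return ratio
    return None


# Hypothetical request: 1000 prompt tokens, 100 predicted response tokens.
for budget in (10.0, 5.0, 1.0):
    print(budget, pick_eviction_ratio(1000, 100, budget))
```

A generous budget keeps the full cache, a tighter one forces more eviction, and a budget below the prefill cost cannot be met at all, which is where an overrun strategy would have to take over.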

Naen Xu,Jinghuai Zhang,Changjiang Li,Hengyu An,Chunyi Zhou,Jun Wang,Boyu Xu,Yuyuan Li,Tianyu Du,Shouling Ji

Main category: cs.CL

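TL;DR: This paper introduces a large-scale multimodal benchmark dataset for evaluating how well large vision-language models (LVLMs) comply with copyright when handling copyrighted content. It finds that existing models show significant deficiencies both with and without a copyright notice, and proposes a tool-augmented defense framework to reduce infringement risk.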

Details Motivation: As LVLMs are widely used in multimodal tasks, responses generated from copyrighted content can raise legal and ethical problems, so there is a pressing need to evaluate and improve how models recognize and comply with copyright. Method: The authors build a benchmark of 50,000 multimodal query-content pairs covering four kinds of copyrighted content (book excerpts, news articles, music lyrics, and code documentation) in two scenarios (with and without a copyright notice), systematically evaluate a range of LVLMs on it, and propose a tool-augmented defense framework to improve copyright compliance. Result: Even state-of-the-art closed-source LVLMs struggle to recognize and respect copyright notices, and they do worse when no explicit notice is present; the proposed tool-augmented defense framework significantly reduces infringement risk in all scenarios. Conclusion: Copyright-aware LVLMs are needed to ensure lawful and responsible use of copyrighted content, and tool augmentation is one effective route to that goal. Abstract: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (e.g., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content (such as book excerpts, news articles, music lyrics, and code documentation) when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting the copyrighted content, even when presented with the copyright notice.
To solve this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.

[23] CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Vaibhav Devraj,Dhruv Kumar,Jagat Sesh Challa

Main category: cs.CL

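TL;DR: This paper presents CricBench, a benchmark for evaluating the Text-to-SQL capabilities of large language models on cricket data analytics in English and Hindi, revealing a gap between general-purpose performance and performance in a specialized domain.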

Details Motivation: To investigate how well LLMs handle complex schemas and multilingual requirements in specialized domains such as sports analytics, an area that remains under-explored. Method: The authors build a "Gold Standard" dataset of complex queries hand-written by cricket and SQL experts, set up a multilingual evaluation framework in English and Hindi, and strictly evaluate six state-of-the-art models. Result: DeepSeek R1 performs best on CricBench (50.6%), ahead of Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), but all models lose accuracy when moving from general benchmarks to this specialized task; code-mixed Hindi queries sometimes outperform English ones. Conclusion: High performance on general benchmarks does not guarantee effectiveness in specialized domains, and multilingual prompts, particularly in local languages, may hold an advantage on specific tasks. Abstract: Cricket is the second most popular sport globally, commanding a massive following of over 2.5 billion fans. Enthusiasts and analysts frequently seek advanced statistical insights, such as long-term historical performance trends or complex player comparisons, that are often unavailable through standard web searches. While Large Language Models (LLMs) have advanced significantly in Text-to-SQL tasks, their capability to handle the domain-specific nuances, complex schema variations, and multilingual requirements inherent to sports analytics remains under-explored. To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. To curate a "Gold Standard" dataset, we collaborate with domain experts in cricket and SQL to manually author complex queries, ensuring logical correctness. Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol. Our results reveal that high performance on general benchmarks does not guarantee success in specialized domains. While the open-weights reasoning model DeepSeek R1 achieves state-of-the-art performance (50.6%), surpassing proprietary giants like Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), it still exhibits a significant accuracy drop when moving from general benchmarks (BIRD) to CricBench.
Furthermore, we observe that code-mixed Hindi queries frequently yield parity or higher accuracy compared to English, challenging the assumption that English is the optimal prompt language for specialized SQL tasks.

[24] Explainable Statute Prediction via Attention-based Model and LLM Prompting

Sachin Pawar,Girish Keshav Palshikar,Anindita Sinha Banerjee,Nitin Ramrakhiyani,Basit Ali

Main category: cs.CL

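TL;DR: This paper studies automatic statute prediction from case descriptions and proposes two explainable methods: AoS (attention over sentences) and LLMPrompt (zero-shot prompting of large language models). Their performance is compared on two datasets, and explanation quality is assessed both automatically and by humans.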

Details Motivation: For users to accept Legal AI systems, statute predictions need to come with human-understandable explanations, supporting applications such as AI assistants for lawyers and legal question answering systems. Method: Two methods are proposed. AoS is trained in a supervised manner and predicts statutes using sentence-level attention over sentence embedding models; LLMPrompt uses large language models with standard and Chain-of-Thought (CoT) prompting for zero-shot prediction and explanation generation. Result: On two popular datasets, both methods match or outperform the baselines in statute prediction and produce understandable explanations; explanation quality is verified through automated counterfactual evaluation and human evaluation. Conclusion: Methods based on attention mechanisms or large language models can predict statutes effectively and provide high-quality explanations, which helps make Legal AI systems more transparent and trustworthy. Abstract: In this paper, we explore the problem of automatic statute prediction where, for a given case description, a subset of relevant statutes is to be predicted. Here, the term "statute" refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications such as AI assistants for lawyers and legal question answering systems. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human-understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations: (i) AoS (Attention-over-Sentences), which uses attention over sentences in a case description to predict statutes relevant to it, and (ii) LLMPrompt, which prompts an LLM to predict as well as explain the relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers, and is trained in a supervised manner, whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard and Chain-of-Thought (CoT) prompting techniques. Both these models produce explanations for their predictions in human-understandable forms. We compare the statute prediction performance of both proposed techniques with each other as well as with a set of competent baselines, across two popular datasets. Also, we evaluate the quality of the generated explanations in an automated counterfactual manner as well as through human evaluation.

[25] Accelerate Speculative Decoding with Sparse Computation in Verification

Jikai Wang,Jianchao Tan,Yuxuan Hu,Jiayu Qin,Yerui Sun,Yuchen Xie,Xunliang Cai,Juntao Li,Min Zhang

Main category: cs.CL

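TL;DR: This paper proposes a sparsification framework for the verification stage of speculative decoding. It jointly sparsifies the attention, FFN, and MoE components and adds an inter-draft-token and inter-layer retrieval reuse strategy, substantially reducing computation while preserving accuracy and acceptance length.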

Details Motivation: The verification stage is the computational bottleneck of speculative decoding, especially for long contexts and MoE models, and existing sparsification methods mainly target autoregressive decoding rather than optimizing verification systematically. Method: The authors apply several sparse methods to analyze structured redundancy in the verification stage, propose a framework that jointly sparsifies attention, FFN, and MoE, and introduce an inter-draft-token and inter-layer retrieval reuse strategy to cut repeated computation. Result: Experiments on several tasks (summarization, question answering, mathematical reasoning) show that the method significantly improves inference efficiency and achieves a better efficiency-accuracy trade-off while keeping acceptance length stable. Conclusion: The proposed sparse verification framework effectively reduces the dominant cost of verification in speculative decoding and speeds up inference for large language models without additional training. Abstract: Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding to remove substantial computational redundancy in LLMs. This work systematically applies different sparse methods to the verification stage of speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework also incorporates an inter-draft token and inter-layer retrieval reuse strategy to further reduce redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs, while maintaining stable acceptance length.
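For readers new to the term, the "acceptance length" mentioned in the abstract can be illustrated with a minimal sketch of greedy verification. This is the generic greedy speculative-decoding check, not the paper's sparse verification framework; the function name `accepted_length` and the token ids are hypothetical, and the target model's choices are assumed to be already available from one parallel verification pass.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count how many leading draft tokens the target model agrees with.

    draft_tokens: tokens proposed by the small draft model.
    target_tokens: the target model's own greedy choice at each draft
        position (assumed given; in practice one parallel forward pass).
    """
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # the first disagreement ends the accepted prefix
        accepted += 1
    return accepted


# Hypothetical token ids: the first three drafts match, the fourth does not.
print(accepted_length([11, 42, 7, 99, 5], [11, 42, 7, 13, 5]))
```

A sparsified verifier is only useful if it keeps this number close to what dense verification would give, which is why the abstract reports stable acceptance length alongside efficiency.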

[26] SWE-RM: Execution-free Feedback For Software Engineering Agents

KaShun Shum,Binyuan Hui,Jiawei Chen,Lei Zhang,X. W.,Jiaxi Yang,Yuzhen Huang,Junyang Lin,Junxian He

Main category: cs.CL

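TL;DR: This paper proposes SWE-RM, a new reward model that improves software engineering agents in both test-time scaling and reinforcement learning. By accounting for classification accuracy and calibration together, it outperforms existing open-source models.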

Details Motivation: Execution-based feedback for training coding agents is sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful, while execution-free feedback is promising but underexplored. Moreover, two verifiers with similar TTS performance can behave very differently in RL training, which shows that the ability to select the best trajectory is not enough to guarantee good RL performance. Method: The authors identify two additional aspects that are crucial for RL training, classification accuracy and calibration, and run comprehensive controlled experiments on how to train a robust reward model that does well on all of these metrics. In particular, they analyze training data scale, policy mixtures, and data source composition. Based on these findings they introduce SWE-RM, which adopts a mixture-of-experts architecture. Result: SWE-RM substantially improves SWE agents on both TTS and RL. For example, on SWE-Bench Verified with TTS, it raises the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and of Qwen3-Coder-Max from 67.0% to 74.6%, the state of the art among open-source models. Conclusion: A more fine-grained execution-free feedback mechanism that pays attention to classification accuracy and calibration can effectively improve the learning efficiency and final performance of software engineering agents, and SWE-RM is a successful example. Abstract: Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition.
Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
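The distinction the abstract draws between classification accuracy and calibration can be made concrete with a minimal sketch. The scores and labels below are made up, and the binned calibration error is a common reliability-diagram style estimate, not necessarily the metric used in the paper; `accuracy` and `expected_calibration_error` are illustrative names.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of trajectories whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def expected_calibration_error(scores, labels, n_bins=10):
    """Binned gap between mean predicted success and observed success rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(scores) * abs(mean_score - success_rate)
    return ece


# Two hypothetical verifiers scoring the same four trajectories (1 = resolved).
labels = [1, 1, 0, 0]
verifier_a = [0.95, 0.95, 0.05, 0.05]
verifier_b = [0.65, 0.65, 0.35, 0.35]
for name, scores in (("A", verifier_a), ("B", verifier_b)):
    print(name, accuracy(scores, labels), expected_calibration_error(scores, labels))
```

Both verifiers rank and classify the trajectories identically, so a best-of-n selection would not tell them apart, yet verifier B's scores are much further from the observed success rates. When the score itself is used as a reward signal, that difference can matter.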

[27] Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Sachin Pawar,Manoj Apte,Kshitij Jadhav,Girish Keshav Palshikar,Nitin Ramrakhiyani

Main category: cs.CL

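TL;DR: This paper studies how splitting natural words into multiple subwords, a consequence of the limited vocabulary of large language models (LLMs), hurts model performance. It proposes penalty functions that quantify the effect of such poor tokenization and establishes statistical significance across multiple NLP tasks and LLMs.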

Details Motivation: LLM tokenization can split a natural word into several subwords, which may harm the model's language understanding; the paper sets out to measure the actual effect of this on performance. Method: The authors propose a set of penalty functions that quantify tokenization quality for a given text under a specific LLM, compute a tokenization penalty to judge how good or bad the tokenization is, and analyze the relationship between this penalty and model performance on several NLP tasks. Result: Experiments show a statistically significant relationship between the tokenization penalty and model performance on multiple NLP tasks, supporting the hypothesis that poor tokenization lowers LLM performance. Conclusion: Tokenization strategies that split natural words into subwords may hurt LLM performance, and a tokenization penalty can serve as a tool for evaluating and improving tokenizer design. Abstract: Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
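The abstract does not define the penalty functions, so the following is only a hypothetical stand-in that shows the kind of quantity involved: the fraction of whitespace-separated words that a tokenizer breaks into more than one token. The toy vocabulary, the greedy longest-match tokenizer, and the names `greedy_tokenize` and `split_word_penalty` are assumptions for illustration; a real study would use the LLM's own tokenizer.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def split_word_penalty(text, vocab):
    """Hypothetical penalty: share of words split into more than one token."""
    words = text.lower().split()
    if not words:
        return 0.0
    broken = sum(len(greedy_tokenize(w, vocab)) > 1 for w in words)
    return broken / len(words)


# Toy vocabulary chosen so that "martial" splits as in the abstract's example.
toy_vocab = {"the", "mart", "ial", "arts"}
print(greedy_tokenize("martial", toy_vocab))
print(split_word_penalty("the martial arts", toy_vocab))
```

Under this stand-in, a text whose words all exist in the vocabulary scores 0 and a text where every word is fragmented scores 1; the paper's actual penalty functions may weight splits differently.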

[28] Self-attention vector output similarities reveal how machines pay attention

Tal Halevi,Yarden Tzach,Ronit D. Gross,Shalom Rosner,Ido Kanter

Main category: cs.CL

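TL;DR: This study proposes a new method for quantifying information processing in the self-attention mechanism. An analysis of the BERT-12 architecture finds that different attention heads specialize in different linguistic features and reveals a layer-wise progression from long-range to short-range semantic similarity.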

Details Motivation: Although self-attention is widely used in natural language processing, how it processes information internally is still unclear, and its learning process lacks a quantitative characterization. Method: By analyzing attention maps in BERT-12, the authors build a context similarity matrix from the vector space of the self-attention heads, measure scalar products between token vectors, and study how the similarity distribution changes across layers and heads. Result: Attention in the final layers concentrates on sentence separator tokens; different heads focus on different linguistic features (such as detecting repetitions or contextual co-occurrence); as the network deepens, similarity shifts from long-range to short-range and strengthens within the same sentence; each head tends to build associations around a particular high-similarity token. Conclusion: Self-attention organizes information processing hierarchically, with different heads specializing in different linguistic patterns, which supports semantics-based text segmentation and structure modeling. Abstract: The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of self-attention underlying its advanced learning and the quantitative characterization of this learning process remain open research questions. This study introduces a new approach for quantifying information processing within the self-attention mechanism. The analysis conducted on the BERT-12 architecture reveals that, in the final layers, the attention map focuses on sentence separator tokens, suggesting a practical approach to text segmentation based on semantic features. Based on the vector space emerging from the self-attention heads, a context similarity matrix, measuring the scalar product between two token vectors, was derived, revealing distinct similarities between different token vector pairs within each head and layer. The findings demonstrated that different attention heads within an attention block focused on different linguistic characteristics, such as identifying token repetitions in a given text or recognizing a token of common appearance in the text and its surrounding context. This specialization is also reflected in the distribution of distances between token vectors with high similarity as the architecture progresses. The initial attention layers exhibit substantially long-range similarities; however, as the layers progress, a more short-range similarity develops, culminating in a preference for attention heads to create strong similarities within the same sentence.
Finally, the behavior of individual heads was analyzed by examining the uniqueness of their most common tokens in their high similarity elements. Each head tends to focus on a unique token from the text and builds similarity pairs centered around it.
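The context similarity matrix and the distance distribution described above can be sketched in a few lines. The token vectors here are made-up 2-dimensional values; in the study they would be the output vectors of one self-attention head in BERT-12. The threshold and the names `similarity_matrix` and `high_similarity_distances` are assumptions for illustration.

```python
def similarity_matrix(vectors):
    """Scalar product between every pair of token vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def high_similarity_distances(sim, threshold):
    """Position distances |i - j| of token pairs whose similarity is high.

    Only pairs with i < j are visited, so each pair is counted once and
    the diagonal (a token with itself) is skipped.
    """
    n = len(sim)
    return [j - i for i in range(n) for j in range(i + 1, n)
            if sim[i][j] >= threshold]


# Hypothetical head outputs for four token positions.
head_output = [[1, 0], [0, 1], [1, 0], [0.9, 0.1]]
sim = similarity_matrix(head_output)
print(high_similarity_distances(sim, threshold=0.8))
```

Collecting these distances per head and per layer gives the kind of distribution the abstract refers to: mostly large distances would indicate long-range similarity, mostly small ones short-range similarity.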

[29] Context as a Tool: Context Management for Long-Horizon SWE-Agents

Shukai Liu,Jian Yang,Bo Jiang,Yizhi Li,Jinyang Guo,Xianglong Liu,Bryan Dai

Main category: cs.CL

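TL;DR: This paper proposes CAT, a new context management paradigm that integrates context maintenance into the agent's decision-making process as a callable tool, addressing context explosion, semantic drift, and degraded reasoning in long-running interactions.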

Details Motivation: Existing LLM-based software engineering agents handling long-horizon tasks rely on append-only context maintenance or passively triggered compression, which leads to context bloat and semantic distortion and makes efficient reasoning hard to sustain. Method: The authors propose the CAT paradigm, which builds a structured context workspace of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and design the CAT-GENERATOR framework, which injects context-management actions into offline data to train the SWE-Compressor model. Result: On SWE-Bench-Verified, SWE-Compressor reaches a 57.6% solved rate, significantly outperforming ReAct baselines and static compression methods, while maintaining stable and scalable long-horizon reasoning under a bounded context budget. Conclusion: CAT gives LLM agents an efficient, proactive context management mechanism that supports long-running interaction and reasoning in complex software engineering tasks. Abstract: Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.

[30] Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

Duygu Altinok

Main category: cs.CL

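TL;DR: This paper introduces TrGLUE and SentiTurca, two natural language understanding (NLU) benchmarks for Turkish, filling a gap in existing language evaluation suites.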

Details Motivation: There is no comprehensive NLU benchmark for Turkish, and existing GLUE-style benchmarks for other languages do not cover it, which limits progress on Turkish in NLP research. Method: The authors build TrGLUE, a GLUE-style Turkish benchmark covering a variety of NLU tasks, with labels produced by a semi-automated pipeline that combines LLM-based annotation, cross-model agreement checks, and human validation. They also release SentiTurca, a benchmark dedicated to sentiment analysis, along with fine-tuning and evaluation code for transformer-based models. Result: The work delivers TrGLUE, a multi-domain Turkish NLU benchmark, and SentiTurca, a sentiment analysis benchmark, using a labeling workflow that yields natural language with few translation artifacts and is scalable and reproducible. Conclusion: TrGLUE provides a strong evaluation framework for Turkish NLU research, advances methods for building high-quality datasets for low-resource languages, and helps the development of Turkish in NLP. Abstract: Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow.
With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.

cs.CV [Back]

[31] Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation

Arnav Gupta,Gurekas Singh Sahney,Hardik Rathi,Abhishek Chandwani,Ishaan Gupta,Pratik Narang,Dhruv Kumar

Main category: cs.CV

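TL;DR: Proposes a data-driven framework based on vision-language models (VLMs) that extracts audiovisual features without supervision, clusters them into interpretable factors, and trains a regression model to predict audience engagement for short-form videos, offering more interpretable and scalable assessments than traditional metrics.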

Details Motivation: Existing video evaluation frameworks (such as VideoScore-2) do not capture how specific audiovisual attributes drive real audience engagement; evaluation needs to move toward human-aligned, multimodal reasoning. Method: The authors use VLMs to extract audiovisual features without supervision, cluster them into interpretable factors, and train a regression model on a curated YouTube Shorts dataset to predict user engagement. Result: Experiments show strong correlation between predicted and actual engagement, and the lightweight feature-based evaluator is more interpretable and scalable than traditional metrics such as SSIM and FID. Conclusion: By combining multimodal feature importance with human-centered engagement signals, the method advances robust and explainable evaluation of video understanding. Abstract: Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.

[32] A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding

Christina Liu,Alan Q. Wang,Joy Hsu,Jiajun Wu,Ehsan Adeli

Main category: cs.CV

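TL;DR: Proposes the Tool Bottleneck Framework (TBF), a tool-use framework for medical image understanding that composes tools selected by a vision-language model through a learned Tool Bottleneck Model (TBM), improving performance in data-limited settings.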

Details Motivation: Existing methods that compose tools through text perform poorly on medical image understanding, because the key information lies in spatially localized features that are hard to fuse through text alone. Method: An off-the-shelf medical vision-language model selects tools from a toolbox to extract clinically relevant features; a neural network computes and fuses the tool outputs, and the TBM produces the final prediction. Result: Evaluated on histopathology and dermatology tasks, TBF performs on par with or better than deep learning classifiers, vision-language models, and state-of-the-art tool-use frameworks, with particular gains when data is limited. Conclusion: TBF not only improves tool use in medical imaging but also yields more interpretable, clinically grounded predictors. Abstract: Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors.
We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.

[33] Scalable Deep Subspace Clustering Network

Nairouz Mrabah,Mohamed Bouguessa,Sihem Sami

Main category: cs.CV

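TL;DR: Proposes SDSNet, a scalable deep subspace clustering framework with O(n) complexity, made efficient through landmark-based approximation, joint optimization, and direct spectral clustering.

Details Motivation: Existing subspace clustering methods incur O(n^3) cost from building the full n×n affinity matrix and performing spectral decomposition, so they scale poorly; deep learning improves feature extraction but keeps the pairwise-similarity bottleneck. Method: Uses landmark-based approximation to avoid the full affinity matrix, jointly optimizes auto-encoder reconstruction with self-expression objectives, performs spectral clustering directly on factorized representations, and combines convolutional auto-encoders with subspace-preserving constraints. Result: Experiments show that SDSNet reaches clustering quality comparable to state-of-the-art methods with significantly better computational efficiency. Conclusion: SDSNet addresses the scalability problem of traditional subspace clustering, achieves linear time complexity, and suits large-scale clustering tasks. Abstract: Subspace clustering methods face inherent scalability limits due to the $O(n^3)$ cost (with $n$ denoting the number of data samples) of constructing full $n\times n$ affinities and performing spectral decomposition. While deep learning-based approaches improve feature extraction, they maintain this computational bottleneck through exhaustive pairwise similarity computations. We propose SDSNet (Scalable Deep Subspace Network), a deep subspace clustering framework that achieves $\mathcal{O}(n)$ complexity through (1) landmark-based approximation, avoiding full affinity matrices, (2) joint optimization of auto-encoder reconstruction with self-expression objectives, and (3) direct spectral clustering on factorized representations. The framework combines convolutional auto-encoders with subspace-preserving constraints. Experimental results demonstrate that SDSNet achieves comparable clustering quality to state-of-the-art methods with significantly improved computational efficiency.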
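
The abstract above names landmark-based approximation and spectral clustering on a factorized representation as the route to linear cost. The following is a minimal sketch of that general idea on raw features, not SDSNet's implementation: the function name, the Gaussian affinity, the `gamma` bandwidth, and the evenly spaced landmark choice are all assumptions made for illustration.

```python
import numpy as np

def landmark_spectral_embedding(X, n_landmarks, n_components, gamma=1.0):
    """Spectral embedding from an n x m landmark affinity (m << n)."""
    n = X.shape[0]
    # Assumed landmark choice: evenly spaced rows (k-means centers are a common alternative).
    idx = np.linspace(0, n - 1, n_landmarks).astype(int)
    landmarks = X[idx]
    # n x m Gaussian affinities between every point and every landmark.
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=-1)
    Z = np.exp(-gamma * d2)
    Z /= Z.sum(axis=1, keepdims=True)  # each row sums to 1
    # Scale columns so that Z_hat @ Z_hat.T is a normalized n x n affinity;
    # that n x n matrix is never formed explicitly.
    Z_hat = Z / np.sqrt(Z.sum(axis=0, keepdims=True))
    # Thin SVD costs O(n * m^2), i.e. linear in n for a fixed number of landmarks.
    U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)
    return U[:, :n_components]

# Toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(10.0, 0.1, (20, 2))])
emb = landmark_spectral_embedding(X, n_landmarks=10, n_components=2)
```

Rows of `emb` that belong to the same blob coincide, so any simple clustering step (for example k-means on the rows) recovers the groups. In a deep variant the affinities would be computed on learned latent codes rather than on `X` directly.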

[34] Intelligent recognition of GPR road hidden defect images based on feature fusion and attention mechanism

Haotian Lv,Yuhui Zhang,Jiangbo Dai,Hanli Wu,Jiaji Wang,Dawei Wang

Main category: cs.CV

TL;DR: Proposes an automated GPR image defect detection framework based on DCGAN and MCGA-Net that combines data augmentation, multi-modal feature fusion, and a global attention mechanism to achieve accurate, robust road defect recognition in complex environments.

Details Motivation: Conventional GPR image interpretation depends on human expertise, which is inefficient and error-prone, and data scarcity limits the use of deep learning for automated defect detection. Method: (1) DCGAN-based data augmentation generates high-quality GPR images; (2) the proposed MCGA-Net combines Multi-modal Chain Feature Fusion (MCFF) with a Global Attention Mechanism (GAM); (3) transfer learning from an MS COCO pretrained model speeds up convergence and improves generalization. Result: MCGA-Net reaches 92.8% precision, 92.5% recall, and 95.9% mAP@50 in testing, is robust to noise, weak signals, and small targets, and outperforms existing models. Conclusion: The framework markedly improves the accuracy and stability of automated road defect detection in GPR images and offers a new paradigm for non-destructive testing in complex subsurface environments. Abstract: Ground Penetrating Radar (GPR) has emerged as a pivotal tool for non-destructive evaluation of subsurface road defects. However, conventional GPR image interpretation remains heavily reliant on subjective expertise, introducing inefficiencies and inaccuracies. This study introduces a comprehensive framework to address these limitations: (1) A DCGAN-based data augmentation strategy synthesizes high-fidelity GPR images to mitigate data scarcity while preserving defect morphology under complex backgrounds; (2) A novel Multi-modal Chain and Global Attention Network (MCGA-Net) is proposed, integrating Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and Global Attention Mechanism (GAM) for context-aware feature enhancement; (3) MS COCO transfer learning fine-tunes the backbone network, accelerating convergence and improving generalization. Ablation and comparison experiments validate the framework's efficacy. MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). Under Gaussian noise, weak signals, and small targets, MCGA-Net maintains robustness and outperforms other models. This work establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments.

[35] CCAD: Compressed Global Feature Conditioned Anomaly Detection

Xiao Jin,Liang Diao,Qixin Xiao,Yifan Hu,Ziqi Zhang,Yuchen Liu,Haisong Gu

Main category: cs.CV

TL;DR: Proposes CCAD, a new anomaly detection method that combines the strengths of reconstruction-based and unsupervised representation-based approaches, uses an adaptive compression mechanism to improve generalization and training efficiency, and outperforms state-of-the-art methods in AUC.

Details Motivation: Existing unsupervised representation-based methods extract weak features under domain shift, while reconstruction-based methods suffer from low training efficiency and degraded performance, so a more efficient and robust anomaly detection method is needed. Method: Proposes Compressed Global Feature Conditioned Anomaly Detection (CCAD), which uses global features as a new modality condition for the reconstruction model and adds an adaptive compression mechanism to improve performance and training efficiency. Result: Experiments show that CCAD consistently exceeds state-of-the-art methods in AUC and converges faster; the DAGM 2007 dataset is reorganized and re-annotated to validate the method. Conclusion: CCAD combines the strengths of the two mainstream paradigms, improves both the performance and the efficiency of anomaly detection, and has practical potential. Abstract: Anomaly detection holds considerable industrial significance, especially in scenarios with limited anomalous data. Currently, reconstruction-based and unsupervised representation-based approaches are the primary focus. However, unsupervised representation-based methods struggle to extract robust features under domain shift, whereas reconstruction-based methods often suffer from low training efficiency and performance degradation due to insufficient constraints. To address these challenges, we propose a novel method named Compressed Global Feature Conditioned Anomaly Detection (CCAD). CCAD synergizes the strengths of both paradigms by adapting global features as a new modality condition for the reconstruction model. Furthermore, we design an adaptive compression mechanism to enhance both generalization and training efficiency. Extensive experiments demonstrate that CCAD consistently outperforms state-of-the-art methods in terms of AUC while achieving faster convergence. In addition, we contribute a reorganized and re-annotated version of the DAGM 2007 dataset to further validate our method's effectiveness. The code for reproducing main results is available at https://github.com/chloeqxq/CCAD.

[36] IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

Kumar Abhishek,Jeremy Kawahara,Ghassan Hamarneh

Main category: cs.CV

TL;DR: Introduces ISIC MultiAnnot++, currently the largest public multi-annotator skin lesion segmentation dataset, with 17,684 segmentation masks covering 14,967 dermoscopic images, of which 2,394 images have 2-5 annotations, plus metadata such as annotator skill level and segmentation tool.

Details Motivation: The lack of large-scale public multi-annotator skin lesion segmentation datasets limits research on multi-annotator medical image segmentation. Method: Builds the ISIC MultiAnnot++ dataset from the ISIC Archive, collecting segmentations from multiple annotators and recording metadata such as annotator skill level and tool used. Result: The resulting large-scale dataset contains 17,684 segmentation masks for 14,967 images, of which 2,394 images have multiple annotations; data partitions, consensus segmentation masks, and metadata are provided. Conclusion: ISIC MultiAnnot++ is currently the largest public multi-annotator skin lesion segmentation dataset; it supports annotator preference modeling and metadata analysis and advances multi-annotator medical image segmentation research. Abstract: Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernible from regular clinical photographs. However, currently there are no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator-labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators' skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis on the characteristics of this dataset, curated data partitions, and consensus segmentation masks.

[37] GPF-Net: Gated Progressive Fusion Learning for Polyp Re-Identification

Suncheng Xiang,Xiaoyang Wang,Junjie Jiang,Hejia Wang,Dahong Qian

Main category: cs.CV

TL;DR: Proposes a new architecture, the Gated Progressive Fusion network, for colonoscopic polyp re-identification; gated fusion of multi-level features improves recognition of small polyps.

Details Motivation: The coarse resolution of high-level features works poorly for small polyps, where detailed information matters, which hurts the accuracy of polyp re-identification. Method: Designs the Gated Progressive Fusion Network, which uses gates to selectively fuse multi-level features in a fully connected way, and introduces a layer-wise refinement strategy to achieve multi-level feature interaction. Result: Experiments on standard benchmarks show that the proposed method outperforms existing unimodal ReID models, with a clear advantage when combined with the specialized multimodal fusion strategy. Conclusion: The Gated Progressive Fusion network improves the accuracy of colonoscopic polyp re-identification and shows promise for cases such as small polyps that need fine-grained features. Abstract: Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the coarse resolution of high-level features of a specific polyp often leads to inferior results for small objects where detailed information is important. To address this challenge, we propose a novel architecture, named Gated Progressive Fusion network, to selectively fuse features from multiple levels using gates in a fully connected way for polyp ReID. Building on this architecture, a gated progressive fusion strategy is introduced to achieve layer-wise refinement of semantic information through multi-level feature interactions. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.

[38] Generative Multi-Focus Image Fusion

Xinzhe Xie,Buyu Guo,Bolin Li,Shuangyan He,Yanzhen Gu,Qingyan Jiang,Peiliang Li

Main category: cs.CV

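TL;DR: Proposes GMFF, a generative multi-focus image fusion framework whose two-stage cascade of deterministic fusion and generative restoration addresses the edge artifacts and missing focal planes that trouble traditional methods in complex scenes, achieving state-of-the-art fusion performance.

Details Motivation: Existing fusion algorithms usually assume that every spatial location is in focus in at least one input image, and in complex real-world scenes they tend to produce edge artifacts because of uncertain focus estimation or hard-selection operations. The paper proposes a new fusion framework to handle missing focus information and improve fusion quality. Method: GMFF has two stages. The first stage performs deterministic fusion with the latest StackMFF V4 model, combining the available focal-plane information into an initial fused image. The second stage performs generative restoration with IFControlNet, which is based on latent diffusion models, to reconstruct content from missing focal planes, restore details, and remove edge artifacts. The two stages are developed independently and run as a cascade. Result: Experiments show that GMFF achieves state-of-the-art fusion performance on multiple datasets, with better robustness and visual quality on complex multi-focal content, and that it removes edge artifacts and restores sharp details. Conclusion: By combining deterministic fusion with generative restoration, GMFF markedly improves the quality and practicality of multi-focus image fusion, offers a new approach for fusion in complex scenes, and has broad practical potential. Abstract: Multi-focus image fusion aims to generate an all-in-focus image from a sequence of partially focused input images. Existing fusion algorithms generally assume that, for every spatial location in the scene, there is at least one input image in which that location is in focus. Furthermore, current fusion models often suffer from edge artifacts caused by uncertain focus estimation or hard-selection operations in complex real-world scenarios. To address these limitations, we propose a generative multi-focus image fusion framework, termed GMFF, which operates in two sequential stages. In the first stage, deterministic fusion is implemented using StackMFF V4, the latest version of the StackMFF series, and integrates the available focal plane information to produce an initial fused image. The second stage, generative restoration, is realized through IFControlNet, which leverages the generative capabilities of latent diffusion models to reconstruct content from missing focal planes, restore fine details, and eliminate edge artifacts. Each stage is independently developed and functions seamlessly in a cascaded manner. Extensive experiments demonstrate that GMFF achieves state-of-the-art fusion performance and exhibits significant potential for practical applications, particularly in scenarios involving complex multi-focal content. The implementation is publicly available at https://github.com/Xinzhe99/StackMFF-Series.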

[39] SVBench: Evaluation of Video Generation Models on Social Reasoning

Wenshuo Peng,Gongxuan Wang,Tianmeng Yang,Chuanhao Li,Xiaojie Xu,Hui He,Kaipeng Zhang

Main category: cs.CV

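TL;DR: Presents the first benchmark for evaluating social reasoning in video generation models, showing that beyond surface visual quality, current models fall well short on deeper social cognition such as understanding human intentions, beliefs, joint attention, and social norms.

Details Motivation: Existing text-to-video models have advanced in visual realism and motion fidelity but do not capture the causal and psychological logic behind social behavior, so they struggle to generate videos that match social common sense. Method: Builds a social reasoning benchmark with seven core dimensions from 30 classic social cognition paradigms in developmental and social psychology, and designs a training-free agent-based pipeline that distills the reasoning mechanism of each experiment, generates diverse scenarios, controls conceptual neutrality and difficulty, and uses a high-capacity vision-language model as a judge to score generated videos along five interpretable dimensions. Result: A large-scale test of seven state-of-the-art video generation systems shows that although these models produce superficially plausible videos, they perform poorly on intention recognition, belief reasoning, joint attention, and prosocial reasoning, exposing fundamental gaps in social cognition. Conclusion: Current video generation models need social reasoning ability; the proposed benchmark gives future work a systematic evaluation tool and pushes generative models toward behavior that better matches human social cognition. Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems.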
Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

[40] Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification

Md Ashik Khan,Md Nahid Siddique

Main category: cs.CV

TL;DR: Studies the effectiveness of parameter-efficient training (PET) for multimodal chest X-ray analysis and finds that strategies such as frozen encoders beat full fine-tuning in classification performance at a much lower computational cost.

Details Motivation: To avoid the high computational cost of fully fine-tuning large vision-language models and to reduce the risk of data leakage. Method: Applies several parameter-efficient training methods (frozen encoders, BitFit, LoRA, adapters) to multi-label classification under a fixed parameter budget, and redacts pathology terms from the text inputs to prevent data leakage. Result: On the Indiana University dataset, all PET methods reach 0.892-0.908 AUROC, well above full fine-tuning (0.770). External validation on CheXpert shows adapters performing best (0.7214 AUROC). Vision-only models outperform multimodal models at the same parameter budget, indicating that the gains come mainly from parameter allocation rather than modality fusion. PET methods are poorly calibrated (ECE 0.29-0.34), which post-hoc calibration can address. Conclusion: PET strategies such as frozen encoders deliver superior discrimination with very few trainable parameters and suit resource-constrained settings, but they need calibration methods to be clinically usable. Abstract: Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters; the PET budget is roughly a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve >0.69 AUROC with <9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.
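
The abstract reports calibration as expected calibration error (ECE). As a reminder of what that number measures, here is a minimal sketch of a common binned ECE for a single binary label; the bin count, the equal-width binning, and the per-label treatment are assumptions, since the digest does not say which ECE variant the paper uses.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for one binary label: weighted mean of |mean prob - positive rate| per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each probability to an equal-width bin (right edge inclusive).
    bin_ids = np.clip(np.digitize(probs, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Four predictions of 0.9, of which three are truly positive: |0.9 - 0.75| = 0.15.
ece_toy = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

An ECE near 0.3, as reported for the PET variants, would mean predicted probabilities are off by about 30 points on average under this kind of definition, which is why the abstract recommends post-hoc calibration before deployment.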

[41] Fixed-Threshold Evaluation of a Hybrid CNN-ViT for AI-Generated Image Detection Across Photos and Art

Md Ashik Khan,Arafat Alam Jion

Main category: cs.CV

TL;DR: Proposes a fixed-threshold evaluation protocol for AI-generated image detection, shows that existing methods overstate robustness to common post-processing, and uses a hybrid CNN-ViT model to expose how different architectures behave on photos versus art, giving guidance for real deployments.

Details Motivation: Existing AI-generated image detectors usually retune the decision threshold after each transformation, which distorts robustness estimates and does not reflect performance in real deployments; an evaluation closer to actual use is needed. Method: Introduces a fixed-threshold evaluation protocol in which the threshold is chosen once on clean validation data and then held fixed across all post-processing conditions; uses a lightweight CNN-ViT hybrid with gated fusion and optional frequency enhancement, and reports performance at several operating points (Low-FPR, ROC-optimal, Best-F1). Result: Frequency-aided CNNs do well on pristine photos but collapse after compression (93.33% to 61.49%), whereas ViTs drop less (92.86% to 88.36%); all models score about 15 percentage points higher AUROC on artistic content; the hybrid reaches 91.4% on tiny-genimage, 89.7% on AiArtData, and 98.3% on CIFAKE. Conclusion: Fixed-threshold evaluation removes the inflated performance caused by retuning thresholds and reveals the real robustness gaps; the paper recommends CNNs for clean photos, ViTs for compressed content, and hybrids for art and graphics screening. Abstract: AI image generators create both photorealistic images and stylized art, necessitating robust detectors that maintain performance under common post-processing transformations (JPEG compression, blur, downscaling). Existing methods optimize single metrics without addressing deployment-critical factors such as operating point selection and fixed-threshold robustness. This work addresses misleading robustness estimates by introducing a fixed-threshold evaluation protocol that holds decision thresholds, selected once on clean validation data, fixed across all post-processing transformations. Traditional methods retune thresholds per condition, artificially inflating robustness estimates and masking deployment failures. We report deployment-relevant performance at three operating points (Low-FPR, ROC-optimal, Best-F1) under systematic degradation testing using a lightweight CNN-ViT hybrid with gated fusion and optional frequency enhancement. Our evaluation exposes a statistically validated forensic-semantic spectrum: frequency-aided CNNs excel on pristine photos but collapse under compression (93.33% to 61.49%), whereas ViTs degrade minimally (92.86% to 88.36%) through robust semantic pattern recognition. Multi-seed experiments demonstrate that all architectures achieve 15% higher AUROC on artistic content (0.901-0.907) versus photorealistic images (0.747-0.759), confirming that semantic patterns provide fundamentally more reliable detection cues than forensic artifacts.
Our hybrid approach achieves balanced cross-domain performance: 91.4% accuracy on tiny-genimage photos, 89.7% on AiArtData art/graphics, and 98.3% (competitive) on CIFAKE. Fixed-threshold evaluation eliminates retuning inflation, reveals genuine robustness gaps, and yields actionable deployment guidance: prefer CNNs for clean photo verification, ViTs for compressed content, and hybrids for art/graphics screening.
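
The fixed-threshold protocol described above can be illustrated with a small sketch: pick the threshold once on clean validation scores, then reuse it unchanged on degraded inputs, and compare against per-condition retuning. The helper names, the Low-FPR selection rule, and all scores below are made up for illustration and are not taken from the paper.

```python
import numpy as np

def threshold_at_fpr(scores, labels, target_fpr):
    """Pick a score on real images (label 0) that at most target_fpr of them exceed."""
    real = np.sort(scores[labels == 0])
    k = int(np.ceil((1.0 - target_fpr) * len(real))) - 1
    return real[min(max(k, 0), len(real) - 1)]

def accuracy(scores, labels, threshold):
    """An image is called AI-generated (label 1) when its score exceeds the threshold."""
    return float(((scores > threshold).astype(int) == labels).mean())

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clean = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90])
# Hypothetical scores after JPEG compression: everything shifts downward.
jpeg = np.array([0.00, 0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.55])

thr = threshold_at_fpr(clean, labels, target_fpr=0.0)   # chosen once, on clean data
fixed_clean = accuracy(clean, labels, thr)
fixed_jpeg = accuracy(jpeg, labels, thr)                 # same threshold, degraded data
retuned_jpeg = max(accuracy(jpeg, labels, t) for t in jpeg)  # per-condition retuning
```

In this toy case the retuned accuracy on compressed images stays perfect while the fixed-threshold accuracy drops to 0.75, which is the kind of gap that retuning hides.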

[42] MuS-Polar3D: A Benchmark Dataset for Computational Polarimetric 3D Imaging under Multi-Scattering Conditions

Puyun Wang,Kaimin Yu,Huayang He,Xianyu Wu

Main category: cs.CV

TL;DR: Presents MuS-Polar3D, the first public benchmark dataset for polarization-based underwater 3D imaging, with polarization images of 42 objects under seven scattering conditions and five viewpoints plus high-precision 3D models; it supports multiple vision tasks, and a two-stage descattering-then-reconstruction pipeline is shown to be effective under complex scattering.

Details Motivation: Existing public datasets lack diversity in scattering and observation conditions, which prevents fair comparison among polarization-based underwater 3D reconstruction methods, especially between single-view and multi-view methods. Method: Builds the MuS-Polar3D dataset with multi-condition polarization images and high-precision 3D annotations; from an imaging-chain perspective, decouples underwater 3D reconstruction into descattering and 3D reconstruction stages, and evaluates several baseline methods. Result: MuS-Polar3D is the first public polarization-based underwater 3D imaging benchmark that allows the effect of turbidity to be evaluated quantitatively; the two-stage approach performs well under complex scattering, with a best mean angular error of 15.49 degrees. Conclusion: MuS-Polar3D fills the gap left by the lack of standardized, diverse datasets in polarization-based underwater 3D imaging, supports multiple tasks, enables fair algorithm comparison, and advances the field. Abstract: Polarization-based underwater 3D imaging exploits polarization cues to suppress background scattering, exhibiting distinct advantages in turbid water. Although data-driven polarization-based underwater 3D reconstruction methods show great potential, existing public datasets lack sufficient diversity in scattering and observation conditions, hindering fair comparisons among different approaches, including single-view and multi-view polarization imaging methods. To address this limitation, we construct MuS-Polar3D, a benchmark dataset comprising polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high-precision 3D models (±0.05 mm accuracy), normal maps, and foreground masks. The dataset supports multiple vision tasks, including normal estimation, object segmentation, descattering, and 3D reconstruction. Inspired by computational imaging, we further decouple underwater 3D reconstruction under scattering into a two-stage pipeline, namely descattering followed by 3D reconstruction, from an imaging-chain perspective. Extensive evaluations using multiple baseline methods under complex scattering conditions demonstrate the effectiveness of the proposed benchmark, achieving a best mean angular error of 15.49 degrees. To the best of our knowledge, MuS-Polar3D is the first publicly available benchmark dataset for quantitative turbidity underwater polarization-based 3D imaging, enabling accurate reconstruction and fair algorithm evaluation under controllable scattering conditions.
The dataset and code are publicly available at https://github.com/WangPuyun/MuS-Polar3D.
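
The headline number for this benchmark is a mean angular error in degrees between predicted and ground-truth surface normals. A minimal sketch of that metric follows, assuming normals are stored as arrays whose last axis has length 3 and that an optional boolean foreground mask selects the pixels to score; the function name and array layout are illustrative, not taken from the released code.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, mask=None):
    """Mean angle in degrees between predicted and ground-truth normals."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip((pred * gt).sum(axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    if mask is not None:
        angles = angles[mask]  # score foreground pixels only
    return float(angles.mean())

pred = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 1.0]])
mae_all = mean_angular_error_deg(pred, gt)                        # angles 0 and 90
mae_fg = mean_angular_error_deg(pred, gt, np.array([True, False]))
```

The same function applies to full normal maps of shape (H, W, 3) with a mask of shape (H, W).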

[43] DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

Henglin Liu,Huijuan Huang,Jing Wang,Chang Liu,Xiu Li,Xiangyang Ji

Main category: cs.CV

TL;DR: Proposes an improved GRPO reinforcement learning method that uses a distributional creativity bonus and structure-aware regularization to improve the quality-diversity trade-off in image generation, with a marked gain in semantic diversity.

Details Motivation: Standard GRPO tends to produce homogenized outputs late in training, ignores distribution-level diversity, and its regularization fails to preserve generation diversity. Method: At the reward level, introduces a distributional creativity bonus based on semantic grouping, building a distribution-level representation with spectral clustering and allocating exploratory rewards according to group size; at the generation level, introduces structure-aware regularization that strengthens constraints in the early denoising stage to preserve diversity. Result: Experiments show a 13%-18% improvement in semantic diversity at matched image quality, establishing a new Pareto frontier between quality and diversity for GRPO-based image generation. Conclusion: The method mitigates diversity degradation from both the reward modeling and the generation dynamics side, and clearly improves the creativity and diversity of generated images. Abstract: Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, which restricts its application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality-diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency.
Experiments demonstrate that our method achieves a 13%-18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
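
To make the reward-level idea concrete, the sketch below adds a bonus that grows as a sample's semantic cluster gets smaller, then computes group-normalized advantages in the usual GRPO style. The bonus formula `beta * (1 - cluster fraction)`, the `beta` parameter, and the function name are assumptions for illustration; the abstract only states that exploratory rewards are allocated according to group sizes, and the cluster ids are assumed to come from a separate clustering step.

```python
import numpy as np

def diversity_aware_advantages(quality, cluster_ids, beta=0.5, eps=1e-8):
    """Add a rarity bonus per sample, then normalize rewards within the group."""
    quality = np.asarray(quality, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    frac = counts[inverse] / len(cluster_ids)   # share of the group in each sample's cluster
    bonus = beta * (1.0 - frac)                 # assumed form: rarer cluster, larger bonus
    reward = quality + bonus
    advantages = (reward - reward.mean()) / (reward.std() + eps)
    return bonus, advantages

# Four images from one caption with equal quality; the last one sits alone in its cluster.
bonus, adv = diversity_aware_advantages([1.0, 1.0, 1.0, 1.0], [0, 0, 0, 1])
```

With equal quality scores, plain GRPO would give every sample zero advantage; with the bonus, the sample in the rare cluster receives the only positive advantage, which pushes the policy toward under-represented modes.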

[44] Hierarchy-Aware Fine-Tuning of Vision-Language Models

Jiayu Li,Rajesh Gangireddy,Samet Akcay,Wei Cheng,Juhua Hu

Main category: cs.CV

TL;DR: Proposes an efficient hierarchy-aware fine-tuning framework for adapting vision-language models to hierarchical classification; structural consistency losses and lightweight adaptation cut parameter overhead and improve classification consistency.

Details Motivation: Existing vision-language models usually treat hierarchical classification as flat categories and require full fine-tuning, which leads to inconsistent predictions across levels and high computational cost. Method: Proposes two new loss functions: Tree-Path KL Divergence (TP-KL) enforces prediction consistency along the ground-truth label path, and Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) improves prediction consistency among sibling classes; both are combined with LoRA for lightweight fine-tuning. Result: Experiments on multiple benchmarks show that the method improves Full-Path Accuracy and reduces Tree-based Inconsistency Error while adding only minimal parameter overhead. Conclusion: The method offers an effective and scalable way to adapt vision-language models efficiently to structured taxonomies. Abstract: Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
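
The two evaluation metrics named above can be sketched as follows. The definitions are assumptions based on the metric names, since the digest does not give the paper's formulas: full-path accuracy counts a sample as correct only when every level of its predicted path matches, and the inconsistency rate counts predictions whose child label does not belong to the predicted parent. The tiny taxonomy is made up.

```python
def full_path_accuracy(pred_paths, true_paths):
    """Fraction of samples whose predicted root-to-leaf path matches at every level."""
    hits = [tuple(p) == tuple(t) for p, t in zip(pred_paths, true_paths)]
    return sum(hits) / len(hits)

def inconsistency_rate(pred_paths, parent_of):
    """Fraction of predicted paths containing a child whose true parent differs from the predicted one."""
    def broken(path):
        return any(parent_of[child] != parent for parent, child in zip(path, path[1:]))
    return sum(broken(p) for p in pred_paths) / len(pred_paths)

# Hypothetical two-level taxonomy: coarse label, then fine label.
parent_of = {"dog": "animal", "cat": "animal", "car": "vehicle"}
true_paths = [("animal", "dog"), ("animal", "cat"), ("vehicle", "car")]
pred_paths = [("animal", "dog"), ("vehicle", "cat"), ("vehicle", "car")]

fpa = full_path_accuracy(pred_paths, true_paths)
tie = inconsistency_rate(pred_paths, parent_of)
```

The second prediction gets the fine label right but pairs it with the wrong coarse label, so it counts against both metrics even though a flat leaf-level accuracy would score it as correct.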

[45] Vision Transformers are Circulant Attention Learners

Dongchen Han,Tianyu Li,Ziyi Wang,Gao Huang

Main category: cs.CV

TL;DR: Proposes a new attention mechanism, Circulant Attention, which exploits the observation that self-attention matrices in vision Transformers approximate block circulant matrices to reach O(N log N) computational complexity while keeping the model capacity of standard self-attention.

Details Motivation: Self-attention is central to vision Transformers, but its quadratic complexity is a heavy computational burden at high resolution and limits practical use. Existing methods ease this by introducing handcrafted sparsity or locality patterns, which hurts model capacity. Method: Observes that the self-attention matrix in vision Transformers often approximates a Block Circulant matrix with Circulant Blocks (BCCB), explicitly models the attention map as its nearest BCCB matrix, and proposes an efficient algorithm for fast computation. Result: Extensive experiments on a range of vision tasks confirm the method's effectiveness: it lowers computational complexity while maintaining good model performance. Conclusion: Circulant attention is a promising alternative to self-attention that greatly improves computational efficiency without a significant loss of model capacity. Abstract: The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed Circulant Attention by exploiting the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates the Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this interesting pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient computation algorithm for fast calculation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design is inspired by the inherent efficient paradigm, it not only delivers $\mathcal{O}(N\log N)$ computation complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention for vision Transformer architectures. Code is available at https://github.com/LeapLabTHU/Circulant-Attention.
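
The $\mathcal{O}(N\log N)$ claim rests on a standard property: multiplying by a circulant matrix is a circular convolution, which the FFT computes without ever building the matrix. The sketch below shows the one-dimensional case and checks it against a dense product; it is not the paper's algorithm, and the helper names are illustrative. A BCCB matrix is handled the same way with a two-dimensional FFT over the image grid.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x in O(N log N)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_dense(c):
    """Dense N x N circulant matrix, used for checking only (O(N^2) memory)."""
    return np.stack([np.roll(c, j) for j in range(len(c))], axis=1)

rng = np.random.default_rng(0)
c = rng.normal(size=8)   # first column of the circulant matrix
x = rng.normal(size=8)

fast = circulant_matvec(c, x)
slow = circulant_dense(c) @ x
```

Both results agree to floating-point precision, but the FFT path needs only the N entries of `c` instead of the full N x N matrix.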

[46] EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal

Sanghyun Jo,Donghwan Lee,Eunji Jung,Seong Je Oh,Kyungsu Kim

Main category: cs.CV

TL;DR: Proposes EraseLoRA, a new framework for high-quality object removal without a dataset, which uses background-aware reasoning and test-time adaptation to avoid the problems caused by conventional attention manipulation.

Details Motivation: Existing dataset-free methods based on attention redirection misjudge non-target foregrounds as background and damage fine details during object removal, so a more reliable way to exclude the target while preserving background structure is needed. Method: Proposes EraseLoRA with two modules: (1) Background-aware Foreground Exclusion (BFE), which uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair; (2) Background-aware Reconstruction with Subtype Aggregation (BRSA), which integrates the inferred background subtypes during test-time optimization and preserves local detail and global structure through reconstruction and alignment objectives. Result: EraseLoRA is validated on multiple object removal benchmarks, outperforms existing dataset-free baselines, and is competitive with dataset-driven methods. Conclusion: EraseLoRA provides a new dataset-free framework that achieves precise object removal without modifying the attention mechanism, and it is general and practical. Abstract: Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.

[47] Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration

Unnati Saraswat,Tarun Rao,Namah Gupta,Shweta Swami,Shikhar Sharma,Prateek Narang,Dhruv Kumar

Main category: cs.CV

TL;DR: Introduces two new tasks, context-aware object insertion and sponsor-product logo augmentation, and builds two new datasets to support them.

Details Motivation: Existing image editing work rarely ensures that inserted objects fit their context, so new approaches are needed to address this. Method: Introduces the two new tasks of context-aware object insertion and sponsor-product logo augmentation, and builds two new datasets with category annotations, placement regions, and sponsor-product labels. Result: Two new datasets were created that support the context-aware object insertion and sponsor-product logo augmentation tasks. Conclusion: The proposed tasks and datasets help improve contextual appropriateness in intelligent image editing. Abstract: Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are contextually appropriate. We introduce two new tasks for advertising and digital media: (1) context-aware object insertion, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) sponsor-product logo augmentation, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.

[48] Exploration of Reproducible Generated Image Detection

Yihang Duan

Main category: cs.CV

TL;DR: By reviewing 7 AIGC image detection papers and reproducing a representative method, this study finds serious reproducibility and generalizability problems in current methods, caused mainly by missing experimental details and overfitting to the features of specific generators.

Details Motivation: To address the core problems of poor reproducibility and insufficient generalizability that AIGC image detection faces in practice. Method: Reviews 7 key papers, builds a lightweight test dataset, reproduces one representative detection method, and analyzes its behavior under different preprocessing and generators. Result: Basic performance can be reproduced when the original paper's steps are followed strictly, but performance drops sharply when preprocessing disrupts key features or when testing across generators. Conclusion: Improving the reproducibility of AIGC detection requires fuller disclosure of experimental details and verification of generalizability, which gives future research directions for improvement. Abstract: While the technology for detecting AI-Generated Content (AIGC) images has advanced rapidly, the field still faces two core issues: poor reproducibility and insufficient generalizability, which hinder the practical application of such technologies. This study addresses these challenges by reviewing 7 key papers on AIGC detection, constructing a lightweight test dataset, and reproducing a representative detection method. Through this process, we identify the root causes of the reproducibility dilemma in the field: firstly, papers often omit implicit details such as preprocessing steps and parameter settings; secondly, most detection methods overfit to exclusive features of specific generators rather than learning universal intrinsic features of AIGC images. Experimental results show that basic performance can be reproduced when strictly following the core procedures described in the original papers. However, detection performance drops sharply when preprocessing disrupts key features or when testing across different generators. This research provides empirical evidence for improving the reproducibility of AIGC detection technologies and offers reference directions for researchers to disclose experimental details more comprehensively and verify the generalizability of their proposed methods.

[49] Towards Long-window Anchoring in Vision-Language Model Distillation

Haoyi Zhou,Shuo Li,Tianyu Chen,Qi Song,Chonghan Gao,Jianxin Li

Main category: cs.CV

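TL;DR: This paper proposes LAid, a method that uses knowledge distillation and a modified attention mechanism to markedly improve long-context understanding in small vision-language models, achieving effective context windows up to 3.2 times longer than baseline models while maintaining performance on standard benchmarks.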

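Details Motivation: Existing small vision-language models perform poorly on language-image alignment because of their limited context windows, while large models have strong long-context understanding but are hard to deploy directly. An effective way to transfer the long-range attention mechanisms of large models to small ones is therefore needed. Method: Proposes LAid, with two core components: (1) progressive distance-weighted attention matching, which dynamically emphasizes longer position differences during training; and (2) learnable RoPE response gain modulation, which selectively amplifies position sensitivity where needed. Long-range attention is transferred through knowledge distillation. Result: Experiments across multiple model families show that LAid-distilled models reach effective context lengths up to 3.2 times those of baseline small models while maintaining or improving performance on standard vision-language benchmarks; spectral analysis suggests that LAid preserves low-frequency attention components that conventional methods fail to transfer. Conclusion: LAid provides practical techniques for building efficient long-context vision-language models and offers theoretical insight into how positional understanding emerges and transfers during distillation. Abstract: While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.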
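
To make the first component more concrete, here is a minimal sketch of a distance-weighted attention matching loss. The abstract does not give the exact weighting or schedule, so the linear weight in `distance_weight`, the ramp in `progressive_alpha`, and the parameter names `alpha` and `alpha_max` are all illustrative assumptions rather than LAid's actual formulation.

```python
def distance_weight(i, j, seq_len, alpha):
    """Weight that grows linearly with the query-key distance |i - j| (assumed form)."""
    if seq_len <= 1:
        return 1.0
    return 1.0 + alpha * abs(i - j) / (seq_len - 1)


def attention_matching_loss(student, teacher, alpha):
    """Distance-weighted mean squared error between two attention maps.

    student, teacher: square lists of lists (rows are queries, columns are keys).
    alpha: how strongly long-range pairs are emphasized (0 gives plain MSE).
    """
    n = len(student)
    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            w = distance_weight(i, j, n, alpha)
            total += w * (student[i][j] - teacher[i][j]) ** 2
            weight_sum += w
    return total / weight_sum


def progressive_alpha(step, total_steps, alpha_max=4.0):
    """Hypothetical schedule: emphasis on distant pairs ramps up during training."""
    return alpha_max * min(1.0, step / total_steps)
```

With `alpha = 0` every query-key pair counts equally; as `alpha` grows, a mismatch between distant positions costs more than the same mismatch between neighbors, which is the behavior the phrase "emphasizes longer position differences" suggests.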

[50] LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

Shinnosuke Hirano,Yuiga Wada,Kazuki Matsuda,Seitaro Otsuki,Komei Sugiura

Main category: cs.CV

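TL;DR: This paper proposes Pearl, an LLM-free supervised metric for evaluating image captions that works in both reference-based and reference-free settings. It learns image-text and text-text similarity representations through a new mechanism and is supported by a large-scale dataset of roughly 333k human judgments; experiments show that Pearl outperforms existing LLM-free metrics on multiple datasets.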

Details Motivation: Existing LLM-based automatic metrics for image captioning favor their own generations and so lack neutrality, while most LLM-free metrics avoid this problem but fall short on performance. A metric that is both neutral and high-performing is needed. Method: Proposes Pearl, a supervised metric that does not require an LLM, introduces a new mechanism for learning image-caption and caption-caption similarities, and builds a large-scale human-annotated dataset (about 333k judgments from 2,360 annotators covering over 75k images) for training and evaluation. Result: Pearl outperforms existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Conclusion: Pearl is an effective and neutral automatic metric for image captioning that combines strong performance with freedom from generation bias, offering a reliable new option for caption evaluation. Abstract: We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image-caption and caption-caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.

[51] UltraLBM-UNet: Ultralight Bidirectional Mamba-based Model for Skin Lesion Segmentation

Linxuan Fan,Juntao Jiang,Weixuan Liu,Zhucun Xue,Jiajun Lv,Jiangning Zhang,Yong Liu

Main category: cs.CV

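TL;DR: Proposes UltraLBM-UNet, a lightweight U-Net variant that combines a bidirectional Mamba mechanism with multi-branch local feature perception for efficient and accurate skin lesion segmentation.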

Details Motivation: Existing skin lesion segmentation methods struggle to balance performance against computational complexity, and lightweight, efficient models suitable for point-of-care diagnosis are lacking. Method: Designs UltraLBM-UNet, which fuses bidirectional Mamba state-space modeling with local feature injection, and uses hybrid knowledge distillation to train a smaller student model, UltraLBM-UNet-T. Result: Reaches state-of-the-art results on the ISIC 2017, ISIC 2018, and PH2 datasets with only 0.034M parameters and 0.060 GFLOPs; the distilled UltraLBM-UNet-T remains competitive with only 0.011M parameters. Conclusion: UltraLBM-UNet delivers high-performance skin lesion segmentation at very low resource cost and is suited to point-of-care deployment. Abstract: Skin lesion segmentation is a crucial step in dermatology for guiding clinical decision-making. However, existing methods for accurate, robust, and resource-efficient lesion analysis have limitations, including low performance and high computational complexity. To address these limitations, we propose UltraLBM-UNet, a lightweight U-Net variant that integrates a bidirectional Mamba-based global modeling mechanism with multi-branch local feature perception. The proposed architecture integrates efficient local feature injection with bidirectional state-space modeling, enabling richer contextual interaction across spatial dimensions while maintaining computational compactness suitable for point-of-care deployment. Extensive experiments on the ISIC 2017, ISIC 2018, and PH2 datasets demonstrate that our model consistently achieves state-of-the-art segmentation accuracy, outperforming existing lightweight and Mamba counterparts with only 0.034M parameters and 0.060 GFLOPs. In addition, we introduce a hybrid knowledge distillation strategy to train an ultra-compact student model, where the distilled variant UltraLBM-UNet-T, with only 0.011M parameters and 0.019 GFLOPs, achieves competitive segmentation performance. These results highlight the suitability of UltraLBM-UNet for point-of-care deployment, where accurate and robust lesion analyses are essential. The source code is publicly available at https://github.com/LinLinLin-X/UltraLBM-UNet.

[52] From Shallow Humor to Metaphor: Towards Label-Free Harmful Meme Detection via LMM Agent Self-Improvement

Jian Lang,Rongpei Hong,Ting Zhong,Leiting Chen,Qiang Gao,Fan Zhou

Main category: cs.CV

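TL;DR: Proposes ALARM, the first label-free harmful meme detection framework, which uses Large Multimodal Model (LMM) agent self-improvement and contrastive learning on explicit memes to progressively strengthen recognition of more complex memes.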

Details Motivation: Existing harmful meme detection methods depend on large amounts of manually labeled data, which makes them hard to adapt to rapidly evolving harmful content, costly to annotate, and weak at generalization. Method: Designs a confidence-based explicit meme identification mechanism that automatically selects explicit memes and assigns them pseudo-labels, and introduces a pairwise-learning-guided agent self-improvement paradigm in which explicit memes are organized into positive/negative contrastive pairs to train an LMM agent that autonomously derives high-level detection cues, improving its ability to detect complex memes. Result: Experiments on three different datasets show that ALARM outperforms existing methods, adapts strongly to newly emerging meme types, and even surpasses supervised methods. Conclusion: ALARM offers a scalable, label-free paradigm for harmful meme detection and shows strong potential for handling new kinds of harmful content in dynamic online environments. Abstract: The proliferation of harmful memes on online media poses significant risks to public health and stability. Existing detection methods heavily rely on large-scale labeled data for training, which necessitates substantial manual annotation efforts and limits their adaptability to the continually evolving nature of harmful content. To address these challenges, we present ALARM, the first lAbeL-free hARmful Meme detection framework powered by Large Multimodal Model (LMM) agent self-improvement. The core innovation of ALARM lies in exploiting the expressive information from "shallow" memes to iteratively enhance its ability to tackle more complex and subtle ones. ALARM consists of a novel Confidence-based Explicit Meme Identification mechanism that isolates the explicit memes from the original dataset and assigns them pseudo-labels. Besides, a new Pairwise Learning Guided Agent Self-Improvement paradigm is introduced, where the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. This agent autonomously derives high-level detection cues from these pairs, which in turn empower the agent itself to handle complex and challenging memes effectively. Experiments on three diverse datasets demonstrate the superior performance and strong adaptability of ALARM to newly evolved memes. Notably, our method even outperforms label-driven methods. These results highlight the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments.
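
The confidence-based selection and pairing steps can be pictured with a short sketch. The abstract does not specify how confidence is computed or how pairs are formed, so the threshold value, the `max(p, 1 - p)` confidence rule, and the function names below are hypothetical choices for illustration only.

```python
def select_explicit_memes(predictions, threshold=0.9):
    """Keep only memes the model is confident about and pseudo-label them.

    predictions: list of (meme_id, p_harmful) with p_harmful in [0, 1].
    Returns a dict mapping meme_id to "harmful" or "harmless".
    """
    pseudo_labels = {}
    for meme_id, p_harmful in predictions:
        confidence = max(p_harmful, 1.0 - p_harmful)
        if confidence >= threshold:
            pseudo_labels[meme_id] = "harmful" if p_harmful >= 0.5 else "harmless"
    return pseudo_labels


def make_contrastive_pairs(pseudo_labels):
    """Pair each pseudo-labeled harmful meme with a harmless one (assumed pairing rule)."""
    harmful = [m for m, y in pseudo_labels.items() if y == "harmful"]
    harmless = [m for m, y in pseudo_labels.items() if y == "harmless"]
    return list(zip(harmful, harmless))
```

Memes with mid-range scores are left out of the pseudo-labeled set; in the framework described above, those are the "complex" cases the agent is expected to handle after learning cues from the explicit pairs.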

[53] GaussianEM: Model compositional and conformational heterogeneity using 3D Gaussians

Bintao He,Yiran Cheng,Hongjia Li,Xiang Gao,Xin Gao,Fa Zhang,Renmin Han

Main category: cs.CV

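TL;DR: Proposes GaussianEM, a method built on a Gaussian pseudo-atomic framework that simultaneously models compositional and conformational heterogeneity of proteins from cryo-EM images, resolving both continuous motions and discrete states.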

Details Motivation: Analyzing cryo-EM datasets that contain both continuous motions and discrete states is challenging, and a better understanding of protein flexibility and its role in function is needed. Method: Uses a two-encoder-one-decoder architecture that maps images to Gaussian components and represents structural variability through changes in the Gaussian parameters. Result: GaussianEM is validated on simulated and experimental datasets; it gives an intuitive account of conformational changes, preserves local structural consistency, and links density maps to atomic models. Conclusion: GaussianEM provides an interpretable and efficient computational framework for resolving dynamic structural heterogeneity in proteins that bridges density-based and atomic models. Abstract: Understanding protein flexibility and its dynamic interactions with other molecules is essential for protein function study. Cryogenic electron microscopy (cryo-EM) provides an opportunity to directly observe macromolecular dynamics. However, analyzing datasets that contain both continuous motions and discrete states remains highly challenging. Here we present GaussianEM, a Gaussian pseudo-atomic framework that simultaneously models compositional and conformational heterogeneity from experimental cryo-EM images. GaussianEM employs a two-encoder-one-decoder architecture to map an image to its individual Gaussian components, and represents structural variability through changes in Gaussian parameters. This approach provides an intuitive and interpretable description of conformational changes, preserves local structural consistency along the transition trajectories, and naturally bridges the gap between density-based models and corresponding atomic models. We demonstrate the effectiveness of GaussianEM on both simulated and experimental datasets.

[54] TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant

Rongpei Hong,Jian Lang,Ting Zhong,Yong Wang,Fan Zhou

Main category: cs.CV

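TL;DR: This paper proposes LCMP, the first benchmark for evaluating the long-context personalization ability of multimodal large language models, and TAME, a training-free framework with a double-memory mechanism that, combined with the Retrieve-then-Align Augmented Generation (RA2G) paradigm, produces better personalized responses in long-context dialogues.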

Details Motivation: Existing personalization methods are limited to simple, context-agnostic visual identification and text replacement, overlooking the dynamic personalization needs of multi-turn long-context dialogue, and no benchmark exists to evaluate this ability. Method: Proposes the LCMP benchmark and the TAME framework. TAME uses double memories to handle the temporal variations and persistent characteristics of personalized concepts separately, and introduces the training-free RA2G paradigm, which generates personalized responses through retrieval followed by alignment. Result: Experiments on LCMP show that TAME achieves the best performance and delivers strong, continually evolving interaction in long-context scenarios. Conclusion: TAME provides an effective, training-free solution for long-context personalization of multimodal large language models and advances the development of personalized assistants. Abstract: Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" -> "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.

[55] CausalFSFG: Rethinking Few-Shot Fine-Grained Visual Categorization from Causal Perspective

Zhiwen Yang,Jinglin Xu,Yuxin Pen

Main category: cs.CV

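TL;DR: This paper proposes CausalFSFG, a causal-inference approach to the biased data distribution problem in few-shot fine-grained visual categorization. It removes spurious correlations through sample-level and feature-level interventions and achieves state-of-the-art performance on several public datasets.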

Details Motivation: In existing few-shot fine-grained classification methods, the support set often acts as a confounding variable that biases the data distribution, interferes with the extraction of discriminative features, and lowers classification performance. Method: Based on the structural causal model (SCM), proposes CausalFSFG with two key components: an interventional multi-scale encoder (IMSE) for sample-level intervention and interventional masked feature reconstruction (IMFR) for feature-level intervention, which together reveal the real causal relationship from inputs to subcategories. Result: Extensive experiments on standard datasets including CUB-200-2011, Stanford Dogs, and Stanford Cars show that CausalFSFG reaches new state-of-the-art performance on few-shot fine-grained classification. Conclusion: Introducing causal inference and intervention mitigates spurious correlations in few-shot fine-grained classification and improves generalization and classification accuracy. Abstract: Few-shot fine-grained visual categorization (FS-FGVC) focuses on identifying various subcategories within a common superclass given just one or few support examples. Most existing methods aim to boost classification accuracy by enriching the extracted features with discriminative part-level details. However, they often overlook the fact that the set of support samples acts as a confounding variable, which hampers the FS-FGVC performance by introducing biased data distribution and misguiding the extraction of discriminative features. To address this issue, we propose a new causal FS-FGVC (CausalFSFG) approach inspired by causal inference for addressing biased data distributions through causal intervention. Specifically, based on the structural causal model (SCM), we argue that FS-FGVC infers the subcategories (i.e., effect) from the inputs (i.e., cause), whereas both the few-shot condition disturbance and the inherent fine-grained nature (i.e., large intra-class variance and small inter-class variance) lead to unobservable variables that bring spurious correlations, compromising the final classification performance. To further eliminate the spurious correlations, our CausalFSFG approach incorporates two key components: (1) Interventional multi-scale encoder (IMSE) conducts sample-level interventions, (2) Interventional masked feature reconstruction (IMFR) conducts feature-level interventions, which together reveal real causalities from inputs to subcategories. 
Extensive experiments and thorough analyses on the widely-used public datasets, including CUB-200-2011, Stanford Dogs, and Stanford Cars, demonstrate that our CausalFSFG achieves new state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/CausalFSFG_TMM.

[56] SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

Zhiyuan Liu,Daocheng Fu,Pinlong Cai,Lening Wang,Ying Liu,Yilong Ren,Botian Shi,Jianqiang Wang

Main category: cs.CV

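TL;DR: Proposes SymDrive, a unified diffusion-based framework for high-quality 3D rendering and scene editing that supports large-angle novel view synthesis and realistic vehicle insertion.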

Details Motivation: Existing autonomous driving simulation methods struggle to achieve photorealistic rendering and interactive traffic editing at the same time, and show geometric or lighting artifacts in large-angle novel view generation and asset manipulation. Method: Proposes a Symmetric Auto-regressive Online Restoration paradigm that constructs paired symmetric views, recovers details through a ground-truth-guided dual-view formulation, and uses an auto-regressive strategy to generate consistent lateral views; this restoration capability is then used for a training-free harmonization mechanism that treats vehicle insertion as context-aware inpainting. Result: Experiments show that SymDrive reaches state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion. Conclusion: SymDrive combines high-fidelity rendering with controllable editing and offers an efficient solution for generating long-tail scenarios in autonomous driving. Abstract: High-fidelity and controllable 3D simulation is essential for addressing the long-tail data scarcity in Autonomous Driving (AD), yet existing methods struggle to simultaneously achieve photorealistic rendering and interactive traffic editing. Current approaches often falter in large-angle novel view synthesis and suffer from geometric or lighting artifacts during asset manipulation. To address these challenges, we propose SymDrive, a unified diffusion-based framework capable of joint high-quality rendering and scene editing. We introduce a Symmetric Auto-regressive Online Restoration paradigm, which constructs paired symmetric views to recover fine-grained details via a ground-truth-guided dual-view formulation and utilizes an auto-regressive strategy for consistent lateral view generation. Furthermore, we leverage this restoration capability to enable a training-free harmonization mechanism, treating vehicle insertion as context-aware inpainting to ensure seamless lighting and shadow consistency. Extensive experiments demonstrate that SymDrive achieves state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.

[57] Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints

Mutiara Shabrina,Nova Kurnia Putri,Jefri Satria Ferdiansyah,Sabita Khansa Dewi,Novanto Yudistira

Main category: cs.CV

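TL;DR: This paper analyzes the attribute disentanglement problem of the PPE framework in text-driven image editing and proposes L1 regularization to sparsify latent-space edits, reducing semantic leakage and making edits more precise.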

Details Motivation: To address attribute entanglement and semantic leakage in existing text-driven image editing methods, which arise because latent-space updates are dense. Method: Building on an architecture with BERT-based attribute prediction and StyleGAN2-based image generation, adds an L1 regularization constraint on latent-space edits to encourage sparsity. Result: Experiments show that the proposed approach focuses edits more effectively on the target attribute, reduces unintended changes to non-target attributes, and keeps identity stable. Conclusion: Adding a sparsity constraint noticeably improves the control and editing quality of the PPE framework for disentangled editing. Abstract: Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
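
To illustrate why an L1 term yields sparse latent edits, the sketch below computes the L1 penalty on a latent offset and applies soft-thresholding, the proximal operator of the L1 norm. Soft-thresholding is a standard textbook device swapped in here for illustration; the abstract only states that L1 regularization is used and does not say how the paper optimizes it. The names `delta`, `lam`, and `tau` are assumptions.

```python
def l1_penalty(delta, lam):
    """L1 regularization term lam * ||delta||_1 for a latent offset delta."""
    return lam * sum(abs(d) for d in delta)


def soft_threshold(delta, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate toward zero.

    Coordinates with magnitude at most tau become exactly zero, so only a few
    latent directions remain active after the update.
    """
    shrunk = []
    for d in delta:
        magnitude = abs(d) - tau
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if d > 0 else -magnitude)
    return shrunk
```

Applied to an offset such as `[0.5, -0.05, 0.02, -0.8]` with `tau = 0.1`, the two small coordinates are zeroed while the two large ones survive (slightly shrunk), which is the "focused edit" behavior the abstract describes, in contrast to an L2 penalty that would scale all coordinates down without zeroing any.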

[58] TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

Jiahong Yu,Ziqi Wang,Hailiang Zhao,Wei Zhai,Xueqiang Yan,Shuiguang Deng

Main category: cs.CV

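TL;DR: This paper proposes TrackTeller, a multimodal framework for temporal language-based 3D grounding that combines LiDAR-image fusion, language-conditioned decoding, and temporal reasoning, and substantially improves performance on the NuPrompt benchmark.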

Details Motivation: Natural-language references to objects in dynamic 3D driving scenes often depend on short-term motion or interactions and cannot be resolved from static appearance or geometry alone, so temporal reasoning over multi-frame observations is needed. Method: Proposes the TrackTeller framework, which builds a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics, integrating LiDAR-image fusion with temporal reasoning. Result: On the NuPrompt benchmark, TrackTeller achieves a 70% relative improvement in Average Multi-Object Tracking Accuracy over strong baselines and a 3.15-3.4 times reduction in False Alarm Frequency. Conclusion: TrackTeller effectively combines multimodal perception, language understanding, and temporal reasoning, markedly improves language-based object grounding in dynamic 3D scenes, and is suited to interactive autonomous driving systems. Abstract: Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.

[59] Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding

Zhiwang Zhou,Yuandong Pu,Xuming He,Yidi Liu,Yixin Chen,Junchao Gong,Xiang Zhuang,Wanghan Xu,Qinglong Cao,Shixiang Tang,Yihao Liu,Wenlong Zhang,Lei Bai

Main category: cs.CV

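TL;DR: Omni-Weather is the first multimodal foundation model to unify weather generation and understanding. Using a shared self-attention mechanism and a Chain-of-Thought dataset, it reaches state-of-the-art results on both generation and understanding tasks and shows that the two can reinforce each other.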

Details Motivation: Existing weather modeling methods separate predictive generation from mechanistic understanding and lack a unified framework, so generated results lack interpretability and understanding tasks are detached from the actual generation process. Method: Proposes Omni-Weather, which uses a radar encoder for weather generation, applies a shared self-attention mechanism for unified processing, and builds a Chain-of-Thought dataset to support causal reasoning and interpretable outputs. Result: Achieves state-of-the-art performance on multiple weather generation and understanding tasks, with higher perceptual quality and stronger causal interpretability; the findings indicate that generation and understanding tasks can improve each other. Conclusion: Unifying weather generation and understanding is both feasible and mutually beneficial, and Omni-Weather offers a new paradigm for interpretable weather modeling. Abstract: Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.

[60] The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds

Subramanyam Sahoo,Jared Junkin

Main category: cs.CV

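TL;DR: Proposes a mechanistic interpretability framework for deepfake detection in a vision-language model, combining sparse autoencoders with forensic manifold analysis to reveal how model features relate to forgery artifacts.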

Details Motivation: Deepfake detection models are highly accurate, but their decision processes are opaque and lack interpretability. Method: Uses a sparse autoencoder (SAE) to analyze the network's internal representations and introduces a new forensic manifold analysis that studies how model features respond to controlled manipulations of forensic artifacts. Result: Only a small fraction of latent features are active in each layer, and the geometric properties of the feature manifold (such as intrinsic dimensionality, curvature, and feature selectivity) vary systematically with the type of deepfake artifact. Conclusion: The framework helps open the "black box" of deepfake detectors, identifies learned features that correspond to specific forensic artifacts, and supports the development of more interpretable and robust models. Abstract: Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model's features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model's feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the "black box" of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.

[61] Comparative Analysis of Deep Learning Models for Perception in Autonomous Vehicles

Jalal Khan

Main category: cs.CV

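TL;DR: This paper compares YOLO-NAS and YOLOv8 on a perception task for autonomous driving using a custom dataset, and finds that YOLOv8s cuts training time by 75% while also achieving higher detection accuracy (83%) than YOLO-NAS (81%).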

Details Motivation: To improve the efficiency, safety, and reliability of autonomous vehicles in real-world scenarios, the practical performance of emerging deep learning models on detection-based perception tasks needs to be evaluated. Method: Captures a custom dataset and experimentally compares the object detection performance of the YOLO-NAS and YOLOv8 models on it. Result: The YOLOv8s model saves 75% of training time compared with YOLO-NAS and reaches 83% object detection accuracy, above YOLO-NAS at 81%. Conclusion: YOLOv8s outperforms YOLO-NAS in both training efficiency and detection accuracy, is better suited to real-time perception in autonomous driving, and gives related research a practical reference. Abstract: Recently, a plethora of machine learning (ML) and deep learning (DL) algorithms have been proposed to achieve the efficiency, safety, and reliability of autonomous vehicles (AVs). The AVs use a perception system to detect, localize, and identify other vehicles, pedestrians, and road signs to perform safe navigation and decision-making. In this paper, we compare the performance of DL models, including YOLO-NAS and YOLOv8, for a detection-based perception task. We capture a custom dataset and experiment with both DL models using our custom dataset. Our analysis reveals that the YOLOv8s model saves 75% of training time compared to the YOLO-NAS model. In addition, the YOLOv8s model (83%) outperforms the YOLO-NAS model (81%) when the target is to achieve the highest object detection accuracy. This comparative analysis of these emerging DL models will allow the relevant research community to understand the models' performance under real-world use case scenarios.

[62] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

Shuo Cao,Jiayang Li,Xiaohui Li,Yuandong Pu,Kaiwen Zhu,Yuanting Gao,Siqi Luo,Yi Xin,Qi Qin,Yu Zhou,Xiangyu Chen,Wenlong Zhang,Bin Fu,Yu Qiao,Yihao Liu

Main category: cs.CV

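TL;DR: This paper proposes UniPercept-Bench, a unified framework for evaluating how well multimodal large models understand images at the perceptual level (aesthetics, quality, structure, and texture), and builds large-scale datasets and a strong baseline model, UniPercept, advancing perceptual-level visual understanding in MLLMs.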

Details Motivation: Existing multimodal large language models perform well on high-level vision tasks but have limited understanding of perceptual-level image properties such as aesthetics, quality, structure, and texture, and a unified evaluation scheme and training method are missing. Method: Proposes the UniPercept-Bench framework, establishes a hierarchical definition system, and builds large-scale datasets; trains the UniPercept model with Domain-Adaptive Pre-Training and Task-Aligned RL, supporting both Visual Rating (VR) and Visual Question Answering (VQA) tasks. Result: UniPercept outperforms existing MLLMs on perceptual-level understanding tasks and can serve as a plug-and-play reward model for text-to-image generation. Conclusion: This work defines perceptual-level image understanding in the MLLM era and provides a comprehensive benchmark and a strong baseline, laying a foundation for progress in multimodal perceptual understanding. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.

[63] Contrastive Graph Modeling for Cross-Domain Few-Shot Medical Image Segmentation

Yuntian Bo,Tao Zhou,Zechao Li,Haofeng Zhang,Ling Shao

Main category: cs.CV

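TL;DR: Proposes C-Graph, a contrastive graph modeling framework that addresses the performance loss caused by filtering out domain-specific information in cross-domain few-shot medical image segmentation, using structural prior graphs and a subgraph matching decoding mechanism to markedly improve both cross-domain and source-domain segmentation.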

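Details Motivation: Existing cross-domain few-shot medical image segmentation methods usually filter out domain-specific information to improve generalization, but this harms cross-domain performance and lowers source-domain accuracy, so a method that retains domain knowledge and exploits the structural consistency of medical images is needed. Method: Proposes the C-Graph framework, which represents image features as graphs (pixels as nodes, semantic affinities as edges), designs a Structural Prior Graph (SPG) layer to capture and transfer node dependencies of the target category, and introduces a Subgraph Matching Decoding (SMD) mechanism that uses semantic relations among nodes to guide prediction. A Confusion-minimizing Node Contrast (CNC) loss is also designed to enhance node discriminability in graph space, reducing node ambiguity and subgraph heterogeneity. Result: Significantly outperforms prior CD-FSMIS methods on multiple cross-domain benchmarks, achieving state-of-the-art performance while keeping strong segmentation accuracy on the source domain. Conclusion: By using the structural consistency of medical images as a transferable prior, C-Graph resolves the trade-off between domain transfer and source-domain performance in cross-domain few-shot segmentation and offers an efficient, robust solution for multimodal medical image analysis under data scarcity. Abstract: Cross-domain few-shot medical image segmentation (CD-FSMIS) offers a promising and data-efficient solution for medical applications where annotations are severely scarce and multimodal analysis is required. However, existing methods typically filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. To address this, we present Contrastive Graph Modeling (C-Graph), a framework that leverages the structural consistency of medical images as a reliable domain-transferable prior. We represent image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph (SPG) layer is proposed to capture and transfer target-category node dependencies and enable global structure modeling through explicit node interactions. Building upon SPG layers, we introduce a Subgraph Matching Decoding (SMD) mechanism that exploits semantic relations among nodes to guide prediction. Furthermore, we design a Confusion-minimizing Node Contrast (CNC) loss to mitigate node ambiguity and subgraph heterogeneity by contrastively enhancing node discriminability in the graph space. Our method significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.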
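
The graph representation described above (pixels as nodes, semantic affinities as edges) can be sketched in a few lines. The abstract does not define the affinity measure, so cosine similarity and the fixed `threshold` used below are assumptions chosen for illustration, not the paper's construction.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_affinity_graph(features, threshold=0.8):
    """Build an undirected graph over pixels from their feature vectors.

    features: list of per-pixel feature vectors (one node per pixel).
    Returns an adjacency dict {node: {neighbor: affinity}} that keeps only
    edges whose affinity is at least `threshold`.
    """
    n = len(features)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            affinity = cosine(features[i], features[j])
            if affinity >= threshold:
                graph[i][j] = affinity
                graph[j][i] = affinity
    return graph
```

Pixels with similar features end up connected while dissimilar ones stay isolated, which gives later stages (such as the SPG layer and subgraph matching named in the abstract) an explicit structure to operate on.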

[64] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration

Md Motaleb Hossen Manik,Md Zabirul Islam,Ge Wang

Main category: cs.CV

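TL;DR: This paper proposes SlideChain, a blockchain-based provenance framework for verifying the integrity of multimodal semantic extraction at scale, aimed at ensuring reliability and auditability when vision-language models (VLMs) are applied to medical education content.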

Details Motivation: Instructional content generated by different vision-language models in high-stakes, quantitative STEM domains is inconsistent and hard to verify, so a traceable, reproducible mechanism is needed to make AI-generated content trustworthy. Method: Builds a dataset of 1,117 medical imaging lecture slides (the SlideChain Slides Dataset), extracts concepts and relational triples with four state-of-the-art VLMs, and creates a structured provenance record for every slide; cryptographic hashes of these records are anchored on a local EVM-compatible blockchain, giving tamper-evident auditing and persistent semantic baselines. Result: Experiments reveal pronounced semantic differences across models, including low concept overlap and near-zero agreement on relational triples; system evaluation shows good scalability, perfect tamper detection, and deterministic reproducibility across independent extraction runs. Conclusion: SlideChain offers a practical and scalable solution for trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and content integrity for AI-assisted instructional systems. Abstract: Modern vision-language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. 
Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
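
The tamper-evidence idea rests on hashing a canonical form of each provenance record and comparing it later against the anchored value. The sketch below shows that step only, using SHA-256 from the Python standard library; the record fields, the choice of SHA-256, and the JSON canonicalization are assumptions, since the abstract does not describe SlideChain's record schema or hash function, and no blockchain interaction is modeled.

```python
import hashlib
import json


def record_hash(record):
    """Hash a provenance record deterministically.

    Keys are sorted and whitespace is fixed so that the same content always
    produces the same digest, regardless of key order in the dict.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(record, anchored_hash):
    """Return True if the record still matches the hash anchored earlier."""
    return record_hash(record) == anchored_hash


# Hypothetical record layout, for illustration only.
example_record = {
    "slide_id": "lecture03_slide12",
    "model": "vlm-a",
    "concepts": ["attenuation", "projection"],
    "triples": [["x-ray", "attenuated_by", "tissue"]],
}
anchored = record_hash(example_record)
```

Any later change to a concept, triple, or model name changes the digest, so `is_untampered` returns `False`; canonicalization is what makes repeated extraction runs with identical output reproduce the same hash.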

[65] Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective

Huan Li,Longjun Luo,Yuling Shi,Xiaodong Gu

Main category: cs.CV

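TL;DR: This paper presents a mathematical theory that treats the global self-attention mechanism in the Visual Geometry Grounded Transformer (VGGT) as a degenerate diffusion process. It explains the attention collapse seen with long input sequences and derives a mean-field partial differential equation that predicts the decay of attention rank, giving theoretical guidance for designing scalable 3D vision transformers.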

Details Motivation: VGGT performs very well in feed-forward 3D reconstruction, but with long input sequences its global self-attention layer suffers a severe "collapse": attention matrices become nearly rank-one, features degenerate, and reconstruction error surges. The lack of a theoretical explanation hinders further scaling and optimization of the model. Method: Models the global self-attention iteration in VGGT as a degenerate diffusion process, analyzes the dynamics of the token-feature flow, proves that it converges toward a Dirac-type measure at an O(1/L) rate, and derives a closed-form mean-field partial differential equation that quantitatively predicts the rank profile of the attention matrices and the evolution of attention heat maps. Result: The theory explains the attention collapse phenomenon, matches the observed attention heat-map evolution and a range of reported experimental outcomes, and shows that token merging delays collapse by lowering the effective diffusion coefficient. Conclusion: The analysis offers a principled lens on attention dynamics in large-scale 3D vision transformers; the diffusion framework can inform scalable model design and has potential to generalize to multimodal tasks. Abstract: Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at a $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and a series of experimental outcomes reported in relevant works and explains why its token-merging remedy, which periodically removes redundant tokens, slows the effective diffusion coefficient and thereby delays collapse without additional training. We believe the analysis provides a principled lens for interpreting future scalable 3D-vision transformers, and we highlight its potential for multi-modal generalization.
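
A toy simulation can give some intuition for why repeated global attention pulls tokens together. The sketch below is not the paper's mean-field PDE or VGGT itself: it assumes scalar tokens, no learned projections, and no residual connections, so each layer simply replaces every token with a softmax-weighted average of all tokens. Under those assumptions the spread of the token set can only shrink from layer to layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(tokens):
    """One parameter-free global self-attention update on scalar tokens.

    Query, key, and value are all the token itself (an assumed simplification),
    so each output is a convex combination of the current tokens.
    """
    updated = []
    for q in tokens:
        weights = softmax([q * k for k in tokens])
        updated.append(sum(w * v for w, v in zip(weights, tokens)))
    return updated


def spread(tokens):
    """Range of the token set; it reaches 0 when all tokens coincide."""
    return max(tokens) - min(tokens)


def simulate(tokens, num_layers):
    """Return the spread before any layer and after each attention layer."""
    history = [spread(tokens)]
    for _ in range(num_layers):
        tokens = attention_step(tokens)
        history.append(spread(tokens))
    return history
```

Because every softmax weight is strictly positive, each output lies strictly inside the interval spanned by the inputs, so the values returned by `simulate` never increase. This only illustrates the averaging mechanism behind collapse; the rate and rank profile discussed in the abstract come from the paper's analysis, not from this toy.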

[66] ShinyNeRF: Digitizing Anisotropic Appearance in Neural Radiance Fields

Albert Barreiro,Roger Marí,Rafael Redondo,Gloria Haro,Carles Bosch

Main category: cs.CV

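TL;DR: This paper proposes ShinyNeRF, a new framework for photorealistic digitization of isotropic and anisotropic reflective surfaces that performs especially well on complex materials such as brushed metal.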

Details Motivation: Existing NeRF methods struggle to model anisotropic specular surfaces (such as brushed metal) accurately, which limits the realism of cultural heritage digitization. Method: The outgoing radiance is approximated as an encoded mixture of von Mises-Fisher distributions, from which surface normals, tangents, specular concentration, and anisotropy magnitude are jointly estimated under an Anisotropic Spherical Gaussian (ASG) model. Result: ShinyNeRF achieves state-of-the-art performance in 3D reconstruction of anisotropic specular reflections and supports physically plausible interpretation and editing of material properties. Conclusion: ShinyNeRF effectively addresses the modeling of complex reflective surfaces and improves the realism and editability of digitally preserved cultural heritage. Abstract: Recent advances in digitization technologies have transformed the preservation and dissemination of cultural heritage. In this vein, Neural Radiance Fields (NeRF) have emerged as a leading technology for 3D digitization, delivering representations with exceptional realism. However, existing methods struggle to accurately model anisotropic specular surfaces, typically observed, for example, on brushed metals. In this work, we introduce ShinyNeRF, a novel framework capable of handling both isotropic and anisotropic reflections. Our method is capable of jointly estimating surface normals, tangents, specular concentration, and anisotropy magnitudes of an Anisotropic Spherical Gaussian (ASG) distribution, by learning an approximation of the outgoing radiance as an encoded mixture of isotropic von Mises-Fisher (vMF) distributions. Experimental results show that ShinyNeRF not only achieves state-of-the-art performance on digitizing anisotropic specular reflections, but also offers plausible physical interpretations and editing of material properties compared to existing methods.

[67] Prior-AttUNet: Retinal OCT Fluid Segmentation Based on Normal Anatomical Priors and Attention Gating

Li Yang,Yuting Liu

Main category: cs.CV

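TL;DR: This study proposes Prior-AttUNet, a new retinal edema segmentation model that combines generative anatomical priors with a triple-attention mechanism and achieves accurate, efficient lesion segmentation on OCT images from multiple devices.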

Details Motivation: Fluid regions in optical coherence tomography (OCT) images have ambiguous boundaries and vary across devices, so a robust and efficient automatic segmentation method is needed to support clinical diagnosis. Method: Prior-AttUNet adopts a hybrid dual-path architecture: one path uses a variational autoencoder to generate multi-scale normal anatomical priors, and the other is the segmentation backbone, which combines densely connected blocks, spatial pyramid pooling modules, and a prior-guided triple-attention mechanism that dynamically adjusts feature weights during decoding. Result: On the public RETOUCH benchmark, Prior-AttUNet reaches mean Dice similarity coefficients of 93.93%, 95.18%, and 93.47% on three OCT devices (Cirrus, Spectralis, and Topcon) at a computational cost of only 0.37 TFLOPs, a favorable balance of accuracy and efficiency. Conclusion: By fusing anatomical priors with attention, Prior-AttUNet segments retinal edema accurately and efficiently in multi-source OCT images and shows strong potential for clinical use. Abstract: Accurate segmentation of macular edema, a hallmark pathological feature in vision-threatening conditions such as age-related macular degeneration and diabetic macular edema, is essential for clinical diagnosis and management. To overcome the challenges of segmenting fluid regions in optical coherence tomography (OCT) images, notably ambiguous boundaries and cross-device heterogeneity, this study introduces Prior-AttUNet, a segmentation model augmented with generative anatomical priors. The framework adopts a hybrid dual-path architecture that integrates a generative prior pathway with a segmentation network. A variational autoencoder supplies multi-scale normative anatomical priors, while the segmentation backbone incorporates densely connected blocks and spatial pyramid pooling modules to capture richer contextual information. Additionally, a novel triple-attention mechanism, guided by anatomical priors, dynamically modulates feature importance across decoding stages, substantially enhancing boundary delineation. Evaluated on the public RETOUCH benchmark, Prior-AttUNet achieves excellent performance across three OCT imaging devices (Cirrus, Spectralis, and Topcon), with mean Dice similarity coefficients of 93.93%, 95.18%, and 93.47%, respectively. The model maintains a low computational cost of 0.37 TFLOPs, striking an effective balance between segmentation precision and inference efficiency. These results demonstrate its potential as a reliable tool for automated clinical analysis.
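The Dice similarity coefficient reported in this entry is defined as twice the overlap between prediction and ground truth divided by the sum of their sizes. The sketch below shows that formula on flattened binary masks; the function name and the convention for two empty masks are assumptions for illustration, not taken from the paper or the RETOUCH evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two equal-length binary masks (flattened)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Pixels labeled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_coefficient(prediction, ground_truth))
```

In the usage above, one pixel overlaps and the masks contain two and one positive pixels, so the score is 2 * 1 / (2 + 1).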

[68] BeHGAN: Bengali Handwritten Word Generation from Plain Text Using Generative Adversarial Networks

Md. Rakibul Islam,Md. Kamrozzaman Bhuiyan,Safwan Muntasir,Arifur Rahman Jawad,Most. Sharmin Sultana Samu

Main category: cs.CV

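TL;DR: This paper proposes a method for generating Bengali handwritten text that uses a self-collected large-scale dataset to produce diverse handwritten outputs from plain text, filling a research gap in handwriting generation for this language.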

Details Motivation: Bengali is the fifth most spoken language in the world, yet handwritten text generation for it has received little study, and the lack of high-quality, diverse datasets has held back progress. Method: The authors build a Bengali handwriting dataset with samples from about five hundred writers, pre-process the images, and use a deep-learning-based generative model to produce handwritten images from input text. Result: The method generates diverse, realistic Bengali handwritten text, demonstrating its effectiveness for handwriting generation in this language. Conclusion: The study advances Bengali handwritten text generation and provides data and methods to support follow-up research. Abstract: Handwritten Text Recognition (HTR) is a well-established research area. In contrast, Handwritten Text Generation (HTG) is an emerging field with significant potential. This task is challenging due to the variation in individual handwriting styles. A large and diverse dataset is required to generate realistic handwritten text. However, such datasets are difficult to collect and are not readily available. Bengali is the fifth most spoken language in the world. While several studies exist for languages such as English and Arabic, Bengali handwritten text generation has received little attention. To address this gap, we propose a method for generating Bengali handwritten words. We developed and used a self-collected dataset of Bengali handwriting samples. The dataset includes contributions from approximately five hundred individuals across different ages and genders. All images were pre-processed to ensure consistency and quality. Our approach demonstrates the ability to produce diverse handwritten outputs from input plain text. We believe this work contributes to the advancement of Bengali handwriting generation and can support further research in this area.

[69] FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection

Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Kamrozzaman Bhuiyan,Farhad Uz Zaman,Md. Rakibul Islam

Main category: cs.CV

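TL;DR: This paper proposes FUSE, a hybrid system that combines spectral and semantic features to detect AI-generated images and shows strong generalization and performance across multiple datasets.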

Details Motivation: With generative models advancing rapidly, reliable methods for detecting AI-generated images are urgently needed, especially because existing methods perform poorly on high-fidelity images. Method: FUSE extracts spectral features with the Fast Fourier Transform and semantic features with CLIP's vision encoder, then fuses them into a joint representation trained progressively in two stages. Result: FUSE (Stage 1) reaches state-of-the-art results on the Chameleon benchmark, 91.36% mean accuracy on GenImage, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%; Stage 2 further improves detection for most generators. Conclusion: Fusing spectral and semantic features improves the robustness and generalization of AI-generated image detection and outperforms existing methods. Abstract: The fast evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through Fast Fourier Transform with semantic features obtained from CLIP's vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on GenImage, WildFake, DiTFake, GPT-ImgEval and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model demonstrates state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of images generated by AI.

[70] Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction

Zheng Yin,Chengjian Li,Xiangbo Shu,Meiqi Cao,Rui Yan,Jinhui Tang

Main category: cs.CV

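TL;DR: The paper proposes ST-MoE, a new model that introduces four types of spatiotemporal experts and a bidirectional spatiotemporal Mamba mechanism to flexibly capture complex spatiotemporal dependencies in human motion while substantially reducing computational cost.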

Details Motivation: Existing methods rely on positional encodings to capture spatiotemporal information, which makes their representations inflexible, and conventional attention has high computational complexity, so they struggle to model the complex dependencies in multi-person motion prediction efficiently. Method: Four types of spatiotemporal experts are designed, each specialized in different spatial or temporal dependencies; the experts share bidirectional spatial and temporal Mamba blocks in different combinations, giving an efficient, parameter-economical mixture of experts. Result: On four multi-person motion benchmarks, the method is more accurate than prior state-of-the-art approaches while using 41.38% fewer parameters and training 3.6x faster. Conclusion: With a flexible multi-expert architecture and efficient Mamba blocks, ST-MoE addresses the bottlenecks of existing methods in representational flexibility and computational efficiency and offers a better solution for multi-person motion prediction. Abstract: Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: i) inflexible spatiotemporal representation due to reliance on positional encodings for capturing spatiotemporal information; ii) high computational costs stemming from the quadratic time complexity of conventional attention mechanisms. To overcome these limitations, we propose the Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE), which flexibly explores complex spatio-temporal dependencies in human motion and significantly reduces computational cost. To adaptively mine complex spatio-temporal patterns from human motion, our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies. To reduce the potential computational overhead while integrating multiple experts, we introduce bidirectional spatiotemporal Mamba as experts, each sharing bidirectional temporal and spatial Mamba in distinct combinations to achieve model efficiency and parameter economy. Extensive experiments on four multi-person benchmark datasets demonstrate that our approach not only outperforms the state of the art in accuracy but also reduces model parameters by 41.38% and achieves a 3.6x speedup in training. The code is available at https://github.com/alanyz106/ST-MoE.

[71] RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention

Zhan Chen,Zile Guo,Enze Zhu,Peirong Zhang,Xiaoxuan Liu,Lei Wang,Yidan Zhang

Main category: cs.CV

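TL;DR: RAPTOR is a new architecture for high-resolution, real-time video prediction. Its single-pass design and Efficient Video Attention (EVA) module address the usual trade-off among speed, resolution, and quality, and it markedly improves UAV navigation performance.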

Details Motivation: Existing video prediction methods cannot deliver high resolution, high quality, and real-time speed at once, and in particular cannot meet the safety needs of low-latency applications on edge devices, such as autonomous UAVs in urban environments. Method: RAPTOR uses a single-pass design and a new Efficient Video Attention (EVA) module that factorizes spatiotemporal modeling into alternating operations along the spatial and temporal axes, reducing computational complexity to O(S + T) and operating directly on dense feature maps without patching; a three-stage training strategy progressively refines the predictions. Result: RAPTOR predicts 512^2 video at more than 30 FPS on a Jetson AGX Orin, reaches state-of-the-art PSNR, SSIM, and LPIPS on UAVid, KTH, and a custom dataset, and raises mission success rate by 18% in a real-world UAV navigation task. Conclusion: RAPTOR breaks the trilemma among resolution, quality, and speed in video prediction and offers a practical path toward safe, anticipatory embodied agents. Abstract: Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. 
Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.
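The idea of alternating operations along the spatial and temporal axes can be illustrated with a toy example. The sketch below is not EVA: it replaces attention with plain averaging (a hypothetical stand-in) on a small T x S grid, and only shows that one spatial pass followed by one temporal pass already lets every position receive information from every other position.

```python
def mix_spatial(grid):
    """Replace each cell with the mean of its frame (row): mixes along S only."""
    return [[sum(row) / len(row)] * len(row) for row in grid]

def mix_temporal(grid):
    """Replace each cell with the mean of its column: mixes along T only."""
    num_frames, num_positions = len(grid), len(grid[0])
    col_means = [
        sum(grid[t][s] for t in range(num_frames)) / num_frames
        for s in range(num_positions)
    ]
    return [list(col_means) for _ in range(num_frames)]

def factorized_mix(grid):
    """Spatial pass followed by temporal pass (averaging is a toy stand-in for attention)."""
    return mix_temporal(mix_spatial(grid))

if __name__ == "__main__":
    # T = 3 frames, S = 4 spatial positions, a single non-zero cell.
    grid = [[0.0] * 4 for _ in range(3)]
    grid[0][0] = 1.0
    for row in factorized_mix(grid):
        print(row)
```

In this toy, each output of the spatial pass reads S values and each output of the temporal pass reads T values, yet the single non-zero input ends up influencing all S * T cells.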

[72] AstraNav-World: World Model for Foresight Control and Consistency

Junjun Hu,Jintao Chen,Haochen Bai,Minghua Luo,Shichao Xie,Ziyi Chen,Fei Liu,Zedong Chu,Xinda Xue,Botao Ren,Xiaolong Wu,Mu Xu,Shanghang Zhang

Main category: cs.CV

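TL;DR: AstraNav-World is an end-to-end world model that jointly reasons about visual states and action sequences within a unified diffusion-based generative framework, synchronizing visual prediction and action planning for embodied navigation and markedly improving trajectory accuracy and task success rates.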

Details Motivation: Existing embodied navigation methods often use a decoupled "envision-then-plan" pipeline, which tends to make visual prediction and action planning inconsistent, accumulates errors, and is hard to run reliably in open, dynamic environments. Method: AstraNav-World combines a diffusion-based video generator with a vision-language policy; within a unified probabilistic framework it generates action-conditioned multi-step visual predictions and plans trajectories from them in sync, with a bidirectional constraint that tightly couples vision and action. Result: The model achieves higher trajectory accuracy and task success rates on multiple embodied navigation benchmarks; ablations show that vision-action coupling and joint training are essential; real-world tests show strong zero-shot transfer. Conclusion: A unified generative world model can integrate visual prediction and action planning effectively, improving the robustness, interpretability, and generalization of embodied agents in open, dynamic environments. Abstract: Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. 
Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.

[73] Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Steven Xiao,Xindi Zhang,Dechao Meng,Qi Wang,Peng Zhang,Bang Zhang

Main category: cs.CV

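TL;DR: The paper proposes Knot Forcing, a streaming framework for real-time, high-fidelity, temporally consistent portrait animation that supports infinite-length sequence generation and runs in real time on consumer-grade GPUs.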

Details Motivation: Existing diffusion models are ill-suited to streaming deployment, and autoregressive methods suffer from error accumulation and inter-frame discontinuities, so neither meets the low-latency and long-term consistency needs of real-time portrait animation. Method: The Knot Forcing framework has three parts: (1) a chunk-wise generation strategy that caches the reference image's KV states to preserve global identity and uses sliding-window attention for local temporal modeling; (2) a temporal knot module that overlaps adjacent chunks and passes spatio-temporal information through image-to-video conditioning to smooth transitions; (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate at inference time so that its semantic context stays ahead of the current frame. Result: The framework produces high-quality, temporally consistent, and responsive portrait animation, supports infinite-length sequences, runs in real time on consumer-grade GPUs, and mitigates the motion discontinuities and long-term degradation of chunk-wise generation. Conclusion: Knot Forcing offers an efficient, scalable solution for real-time portrait animation that balances generation quality, temporal coherence, and computational efficiency, making it suitable for deployment in interactive applications. Abstract: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

[74] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

Xindi Zhang,Dechao Meng,Steven Xiao,Qi Wang,Peng Zhang,Bang Zhang

Main category: cs.CV

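TL;DR: This paper proposes SyncAnyone, a two-stage learning framework for high-quality AI video dubbing that generates precisely synchronized lip motion while keeping identity and background consistent.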

Details Motivation: Existing methods rely on mask-based training, which helps lip-sync accuracy but disrupts spatiotemporal context and leads to inconsistent facial dynamics and backgrounds. Method: Stage 1 trains a diffusion-based video transformer for masked mouth inpainting to generate accurate audio-driven lip motion; Stage 2 applies a mask-free tuning pipeline that uses synthesized pseudo-paired data to remove mask-induced artifacts. Result: Experiments show state-of-the-art visual quality, temporal coherence, and identity preservation. Conclusion: SyncAnyone addresses the structural instability and background distortion caused by conventional mask-based training and performs strongly on in-the-wild lip-syncing. Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.

[75] A-QCF-Net: An Adaptive Quaternion Cross-Fusion Network for Multimodal Liver Tumor Segmentation from Unpaired Datasets

Arunkumar V,Firos V M,Senthilkumar S,Gangadharan G R

Main category: cs.CV

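TL;DR: This paper proposes the Adaptive Quaternion Cross-Fusion Network (A-QCF-Net), which jointly trains a single unified segmentation model on unpaired CT and MRI datasets and significantly improves tumor segmentation performance.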

Details Motivation: Paired, spatially aligned multimodal medical imaging datasets are scarce, which limits the use of deep learning in medical image segmentation. This paper addresses the problem by using separate, unpaired CT and MRI datasets. Method: A-QCF-Net builds a shared feature space with quaternion neural networks and introduces an Adaptive Quaternion Cross-Fusion (A-QCF) block that enables dynamic information exchange and knowledge transfer between the CT and MRI streams. Result: After joint training on the unpaired LiTS (CT) and ATLAS (MRI) datasets, the model reaches tumor Dice scores of 76.7% on CT and 78.3% on MRI, 5.4% and 4.7% above the unimodal nnU-Net baseline, respectively. Grad-CAM analysis indicates that the model attends to clinically relevant pathological structures. Conclusion: A-QCF-Net can exploit unpaired multimodal medical images for joint training and improve segmentation performance, offering a practical way to use the unpaired imaging archives that are widespread in healthcare systems. Abstract: Multimodal medical imaging provides complementary information that is crucial for accurate delineation of pathology, but the development of deep learning models is limited by the scarcity of large datasets in which different modalities are paired and spatially aligned. This paper addresses this fundamental limitation by proposing an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified segmentation model from completely separate and unpaired CT and MRI cohorts. The architecture exploits the parameter efficiency and expressive power of Quaternion Neural Networks to construct a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data-driven attention module that enables bidirectional knowledge transfer between the two streams. By learning to modulate the flow of information dynamically, the A-QCF block allows the network to exchange abstract modality-specific expertise, such as the sharp anatomical boundary information available in CT and the subtle soft tissue contrast provided by MRI. This mutual exchange regularizes and enriches the feature representations of both streams. We validate the framework by jointly training a single model on the unpaired LiTS (CT) and ATLAS (MRI) datasets. The jointly trained model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, significantly exceeding the strong unimodal nnU-Net baseline by margins of 5.4% and 4.7% respectively. 
Furthermore, comprehensive explainability analysis using Grad-CAM and Grad-CAM++ confirms that the model correctly focuses on relevant pathological structures, ensuring the learned representations are clinically meaningful. This provides a robust and clinically viable paradigm for unlocking the large unpaired imaging archives that are common in healthcare.
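Quaternion neural networks, which this entry builds on, replace real-valued multiplication with the Hamilton product, so a single quaternion weight (four real parameters) mixes all four components of a quaternion feature. The sketch below shows only that standard product; it does not reproduce the A-QCF block, and the function name and tuple layout `(real, i, j, k)` are choices made for illustration.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )

if __name__ == "__main__":
    i = (0.0, 1.0, 0.0, 0.0)
    j = (0.0, 0.0, 1.0, 0.0)
    print(hamilton_product(i, j))  # i * j
    print(hamilton_product(j, i))  # j * i, which differs in sign
```

The product is not commutative, which is why the order of weight and feature matters in a quaternion layer.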

[76] BertsWin: Resolving Topological Sparsity in 3D Masked Autoencoders via Component-Balanced Structural Optimization

Evgeny Alves Limarenko,Anastasiia Studenikina

Main category: cs.CV

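TL;DR: This paper proposes BertsWin, a new architecture that combines BERT-style masking with Swin Transformer windows for self-supervised learning on 3D medical images, substantially improving spatial context learning and training convergence speed.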

Details Motivation: Existing Masked Autoencoders struggle to capture three-dimensional spatial relationships in 3D volumetric data, especially under a high token-masking ratio, so a self-supervised method that preserves 3D topology and remains computationally efficient is needed. Method: BertsWin keeps a complete 3D token grid (masked and visible tokens), uses Swin Transformer local windows to reduce computational complexity, and introduces a structural priority loss and the GradientConductor optimizer to improve training efficiency. Result: BertsWin speeds up semantic convergence by 5.8x over the standard ViT-MAE baseline; combined with GradientConductor, it needs 15 times fewer training epochs (44 vs 660) to reach state-of-the-art reconstruction fidelity, while keeping FLOPs on par with sparse ViT baselines and substantially reducing total compute. Conclusion: By preserving the complete 3D spatial topology, BertsWin greatly accelerates the convergence of self-supervised pre-training on 3D medical images without added computational cost, pointing to a new direction for efficient 3D SSL. Abstract: The application of self-supervised learning (SSL) and Vision Transformers (ViTs) approaches demonstrates promising results in the field of 2D medical imaging, but the use of these methods on 3D volumetric images is fraught with difficulties. Standard Masked Autoencoders (MAE), which are the state-of-the-art solution for 2D, have a hard time capturing three-dimensional spatial relationships, especially when 75% of tokens are discarded during pre-training. We propose BertsWin, a hybrid architecture combining full BERT-style token masking with Swin Transformer windows, to enhance spatial context learning in 3D during SSL pre-training. Unlike the classic MAE, which processes only visible areas, BertsWin introduces a complete 3D grid of tokens (masked and visible), preserving the spatial topology. To smooth out the quadratic complexity of ViT, single-level local Swin windows are used. We introduce a structural priority loss function and evaluate the results on cone beam computed tomography of the temporomandibular joints (TMJ). The subsequent assessment includes TMJ segmentation on 3D CT scans. We demonstrate that the BertsWin architecture, by maintaining a complete three-dimensional spatial topology, inherently accelerates semantic convergence by a factor of 5.8x compared to standard ViT-MAE baselines. Furthermore, when coupled with our proposed GradientConductor optimizer, the full BertsWin framework achieves a 15-fold reduction in training epochs (44 vs 660) required to reach state-of-the-art reconstruction fidelity. 
Analysis reveals that BertsWin achieves this acceleration without the computational penalty typically associated with dense volumetric processing. At canonical input resolutions, the architecture maintains theoretical FLOP parity with sparse ViT baselines, resulting in a significant net reduction in total computational resources due to faster convergence.

[77] Inference-based GAN Video Generation

Jingbo Yang,Adrian G. Bors

Main category: cs.CV

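TL;DR: This paper proposes a new video generation model that combines a VAE-GAN structure with a Markov chain memory mechanism and can efficiently generate long video sequences with temporal continuity and consistent dynamics.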

Details Motivation: Existing video generation models are hard to extend in temporal length; quality drops markedly on long sequences, and the models lack the ability to capture long-term dynamics and semantic coherence. Method: The paper proposes an unconditional video generation model based on a hybrid VAE-GAN structure and introduces a Markov chain framework with a recall mechanism that chains multiple short-clip generators, enabling segment-wise generation of long videos with temporal dependencies. Result: The model generates long videos of hundreds to thousands of frames, models content and motion separately, and maintains temporal continuity, consistency, and dynamics. Conclusion: The method overcomes the video-length limitation of conventional generative models and provides a memory-efficient, well-structured solution for long video generation. Abstract: Video generation has seen remarkable progress thanks to advancements in generative deep learning. Generated videos should not only display coherent and continuous movement but also meaningful movement in successions of scenes. Generating models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) and more recently Diffusion Networks have been used for generating short video sequences, usually of up to 16 frames. In this paper, we first propose a new type of video generator by enabling adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure, in order to enable the generation process with inference capabilities. The proposed model, as in other video deep learning-based processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. In classical approaches when aiming to increase the generated video length, the resulting video quality degrades, particularly when considering generating significantly long sequences. To overcome this limitation, our research study extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames ensuring their temporal continuity, consistency and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, with each state representing a VAE-GAN short-length video generator. 
This setup allows for the sequential connection of generated video sub-sequences, enabling temporal dependencies, resulting in meaningful long video sequences.

[78] Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

Nimrod Berman,Adam Botach,Emanuel Ben-Baruch,Shunit Haviv Hakimi,Asaf Gendler,Ilan Naiman,Erez Yosef,Igor Kviatkovsky

Main category: cs.CV

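TL;DR: This paper presents Scene-VLM, the first fine-tuned vision-language model framework for video scene segmentation; by combining visual and textual cues and modeling temporal dependencies between shots, it achieves state-of-the-art performance on standard benchmarks.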

Details Motivation: Existing encoder-based methods have a visual-centric bias, classify shots in isolation, and lack narrative understanding and explainability, which limits semantic segmentation of long videos. Method: Scene-VLM jointly processes multimodal inputs (frames, transcripts, and metadata), predicts sequentially with causal dependencies, and introduces a context-focus window mechanism to keep enough temporal context; it also extracts confidence scores from the VLM's token-level logits and generates natural-language rationales with a small amount of supervision. Result: The model reaches state-of-the-art results on standard datasets such as MovieNet, improving on the previous best method by +6 AP and +13.7 F1; it supports a controllable precision-recall trade-off and produces coherent natural-language explanations. Conclusion: Through joint multimodal modeling and temporal dependency modeling, Scene-VLM significantly improves video scene segmentation, combines high performance with explainability, and suggests a new direction for future video understanding systems. Abstract: Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.
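One plausible way to turn token-level logits into a confidence score, and then into a tunable precision-recall trade-off, is sketched below. The abstract does not give the paper's exact scheme, so this is an assumption: the score is a softmax restricted to two hypothetical answer tokens (one for "boundary", one for "no boundary"), and the function names are illustrative.

```python
import math

def boundary_confidence(logit_boundary, logit_no_boundary):
    """Softmax probability of the 'boundary' token over two candidate tokens (assumed scheme)."""
    m = max(logit_boundary, logit_no_boundary)
    e_yes = math.exp(logit_boundary - m)
    e_no = math.exp(logit_no_boundary - m)
    return e_yes / (e_yes + e_no)

def predict_boundaries(logit_pairs, threshold=0.5):
    """Mark a shot as a scene boundary when its confidence reaches the threshold."""
    return [boundary_confidence(y, n) >= threshold for y, n in logit_pairs]

if __name__ == "__main__":
    # Hypothetical (boundary, no-boundary) logits for three shots.
    shots = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
    print(predict_boundaries(shots, threshold=0.5))
    print(predict_boundaries(shots, threshold=0.7))
```

Raising the threshold keeps only the most confident boundaries, which generally favors precision over recall; lowering it does the opposite.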

[79] InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

Jinqi Xiao,Qing Yan,Liming Jiang,Zichuan Liu,Hao Kang,Shen Sang,Tiancheng Zhi,Jing Liu,Cheng Yang,Xin Lu,Bo Yuan

Main category: cs.CV

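TL;DR: The paper proposes InstructMoLE, a framework that uses instruction-guided global routing and an orthogonality loss to improve Diffusion Transformers on multi-conditional generation tasks.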

Details Motivation: Existing MoLE architectures route per token, which conflicts with the global nature of user instructions and causes problems such as spatial fragmentation and semantic drift in generated images. Method: The paper introduces Instruction-Guided Routing (IGR), which routes globally based on the full user instruction, and an output-space orthogonality loss that increases expert diversity. Result: InstructMoLE significantly outperforms LoRA and existing MoLE variants on multi-conditional generation benchmarks, improving structural integrity and semantic consistency. Conclusion: InstructMoLE provides a robust and generalizable framework for instruction-driven fine-tuning of generative models, with better compositional control and higher fidelity to user intent. Abstract: Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.
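The contrast between per-token routing and a single instruction-level routing decision can be sketched as follows. Everything here is a hypothetical illustration rather than the paper's implementation: the linear router, the top-k choice, the scalar "experts", and the use of squared cosine similarity as an orthogonality penalty are all assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_route(instruction_embedding, router_weights, top_k=2):
    """Choose one set of experts from the whole instruction, not per token."""
    logits = [sum(w * x for w, x in zip(row, instruction_embedding)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))

def apply_globally(tokens, routes, experts):
    """Apply the same gated expert mix to every token."""
    return [sum(gate * experts[i](t) for i, gate in routes) for t in tokens]

def orthogonality_penalty(expert_outputs):
    """Sum of squared cosine similarities between distinct experts' output vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm
    n = len(expert_outputs)
    return sum(
        cosine(expert_outputs[i], expert_outputs[j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

if __name__ == "__main__":
    router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]       # 3 hypothetical experts
    routes = global_route([2.0, 1.0], router, top_k=2)
    experts = [lambda t: 2 * t, lambda t: t + 1, lambda t: 0.0]
    print(routes)
    print(apply_globally([1.0, 3.0], routes, experts))
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

The penalty is zero when expert outputs are mutually orthogonal and grows as they align, which is one simple way to encourage the functional diversity the entry describes.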

[80] AI for Mycetoma Diagnosis in Histopathological Images: The MICCAI 2024 Challenge

Hyam Omar Ali,Sahar Alhesseen,Lamis Elkhair,Adrian Galdran,Ming Feng,Zhixiang Xiong,Zengming Lin,Kele Xu,Liang Hu,Benjamin Keel,Oliver Mills,James Battye,Akshay Kumar,Asra Aslam,Prasad Dutande,Ujjwal Baid,Bhakti Baheti,Suhas Gajre,Aravind Shrenivas Murali,Eung-Joo Lee,Ahmed Fahal,Rachid Jennane

Main category: cs.CV

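TL;DR: This paper presents the mAIcetoma challenge, which aims to advance mycetoma diagnosis with artificial intelligence by using deep learning models to segment and classify grains in histopathological images.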

Details Motivation: Expert pathologists are scarce in low-resource regions, which makes mycetoma diagnosis a major challenge and creates an urgent need for automated solutions. Method: The mAIcetoma challenge was organized around the standardized MyData dataset; participating teams applied a range of deep learning architectures to grain segmentation and disease-type classification. Result: All models achieved high segmentation accuracy, and the top models performed strongly on classification, confirming the key role of grain detection in diagnosis. Conclusion: AI-based automated diagnosis has great potential for identifying mycetoma and can help improve care and reduce the healthcare burden in endemic regions. Abstract: Mycetoma is a neglected tropical disease caused by fungi or bacteria leading to severe tissue damage and disabilities. It affects poor and rural communities and presents medical challenges and socioeconomic burdens on patients and healthcare systems in endemic regions worldwide. Mycetoma diagnosis is a major challenge in mycetoma management, particularly in low-resource settings where expert pathologists are limited. To address this challenge, this paper presents an overview of the Mycetoma MicroImage: Detect and Classify Challenge (mAIcetoma) which was organized to advance mycetoma diagnosis through AI solutions. mAIcetoma focused on developing automated models for segmenting mycetoma grains and classifying mycetoma types from histopathological images. The challenge attracted the attention of several teams worldwide to participate and five finalist teams fulfilled the challenge objectives. The teams proposed various deep learning architectures for the ultimate goal of this challenge. Mycetoma database (MyData) was provided to participants as a standardized dataset to run the proposed models. Those models were evaluated using evaluation metrics. Results showed that all the models achieved high segmentation accuracy, emphasizing the necessity of grain detection as a critical step in mycetoma diagnosis. In addition, the top-performing models showed strong performance in classifying mycetoma types.

[81] Diffusion Posterior Sampling for Super-Resolution under Gaussian Measurement Noise

Abu Hanif Muhammad Syarubany

Main category: cs.CV

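TL;DR: This report studies diffusion posterior sampling (DPS) for single-image super-resolution (SISR) under a known degradation model, combining an unconditional diffusion prior with gradient-based conditioning to achieve 4x super-resolution reconstruction under additive Gaussian noise.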

Details Motivation: To achieve high-quality, stable posterior-sampling super-resolution for a given degradation operator without retraining the diffusion model. Method: A likelihood-guided sampling procedure combines an unconditional diffusion prior with a measurement-consistency gradient constraint; posterior sampling is run across guidance scales and noise levels, with PSNR and SSIM as evaluation metrics. Result: Experiments show that moderate guidance improves reconstruction quality; the best performance occurs at PS scale 0.95 and noise standard deviation σ=0.01 (combined score 1.45231), a setting that restores sharper edges and more coherent facial details. Conclusion: Balancing the diffusion prior against the strength of the measurement gradient is essential for high-quality reconstruction, and the approach gives stable inference without retraining the model for each degradation operator. Abstract: This report studies diffusion posterior sampling (DPS) for single-image super-resolution (SISR) under a known degradation model. We implement a likelihood-guided sampling procedure that combines an unconditional diffusion prior with gradient-based conditioning to enforce measurement consistency for $4\times$ super-resolution with additive Gaussian noise. We evaluate posterior sampling (PS) conditioning across guidance scales and noise levels, using PSNR and SSIM as fidelity metrics and a combined selection score $(\mathrm{PSNR}/40)+\mathrm{SSIM}$. Our ablation shows that moderate guidance improves reconstruction quality, with the best configuration achieved at PS scale $0.95$ and noise standard deviation $\sigma=0.01$ (score $1.45231$). Qualitative results confirm that the selected PS setting restores sharper edges and more coherent facial details compared to the downsampled inputs, while alternative conditioning strategies (e.g., MCG and PS-annealed) exhibit different texture fidelity trade-offs. These findings highlight the importance of balancing diffusion priors and measurement-gradient strength to obtain stable, high-quality reconstructions without retraining the diffusion model for each operator.
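The combined selection score quoted in the abstract, (PSNR/40)+SSIM, can be sketched as a small helper for ranking ablation runs. This is only an illustration: the function names, the tuple layout, and the numbers in `runs` are placeholders chosen here, not values or code from the report.

```python
def selection_score(psnr_db: float, ssim: float) -> float:
    """Combined score as stated in the abstract: (PSNR / 40) + SSIM."""
    return psnr_db / 40.0 + ssim


def best_config(results):
    """Pick the run with the highest combined score.

    `results` is an iterable of (ps_scale, sigma, psnr_db, ssim) tuples;
    this layout is a hypothetical convenience, not the report's format.
    """
    return max(results, key=lambda r: selection_score(r[2], r[3]))


# Placeholder numbers for illustration only (not results from the report).
runs = [
    (0.50, 0.01, 20.0, 0.50),
    (0.95, 0.01, 24.0, 0.60),
]
winner = best_config(runs)
```

Dividing PSNR by 40 puts it on roughly the same scale as SSIM (which is at most 1), so neither metric dominates the sum for typical PSNR values.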

[82] CellMamba: Adaptive Mamba for Accurate and Efficient Cell Detection

Ruochen Liu,Yi Tian,Jiahao Wang,Hongbin Liu,Xianxu Hou,Jingxin Liu

Main category: cs.CV

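TL;DR: This paper proposes CellMamba, a lightweight and efficient one-stage detector for fine-grained cell instance detection in pathological images; it combines Mamba with attention mechanisms and outperforms existing methods in both accuracy and efficiency.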

Details Motivation: Cells in pathological images are densely packed, inter-class differences are subtle, and backgrounds are cluttered, so conventional detectors struggle to balance accuracy and efficiency. Method: Built on a VSSD backbone, the model introduces CellMamba Blocks that couple NC-Mamba or MSA with a novel Triple-Mapping Adaptive Coupling (TMAC) module, plus an Adaptive Mamba Head for multi-scale feature fusion. Result: Experiments on two public datasets, CoNSeP and CytoDArk0, show that CellMamba is more accurate than CNN, Transformer, and Mamba baselines while significantly reducing model size and inference latency. Conclusion: CellMamba is an efficient and accurate solution for high-resolution cell detection, suited to complex pathological image analysis. Abstract: Cell detection in pathological images presents unique challenges due to densely packed objects, subtle inter-class differences, and severe background clutter. In this paper, we propose CellMamba, a lightweight and accurate one-stage detector tailored for fine-grained biomedical instance detection. Built upon a VSSD backbone, CellMamba integrates CellMamba Blocks, which couple either NC-Mamba or Multi-Head Self-Attention (MSA) with a novel Triple-Mapping Adaptive Coupling (TMAC) module. TMAC enhances spatial discriminability by splitting channels into two parallel branches, equipped with dual idiosyncratic and one consensus attention map, adaptively fused to preserve local sensitivity and global consistency. Furthermore, we design an Adaptive Mamba Head that fuses multi-scale features via learnable weights for robust detection under varying object sizes. Extensive experiments on two public datasets (CoNSeP and CytoDArk0) demonstrate that CellMamba outperforms CNN-based, Transformer-based, and Mamba-based baselines in accuracy, while significantly reducing model size and inference latency. Our results validate CellMamba as an efficient and effective solution for high-resolution cell detection.

[83] S&P 500 Stock's Movement Prediction using CNN

Rahul Gupta

Main category: cs.CV

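TL;DR: This paper proposes modeling multivariate raw financial data for S&P 500 constituent stocks with a convolutional neural network (CNN), treating multi-dimensional price data as "images" for prediction, and reports promising results.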

Details Motivation: Existing studies mostly use single-dimension or engineered financial data and overlook the complexity of financial data; this paper aims to improve prediction by using multivariate raw data that is closer to the real market (for example, including stock split/dividend events). Method: A CNN is trained directly on multi-dimensional historical price data arranged as image-like matrices, without traditional feature engineering; predictions can be made at the level of a single stock, a sector, or a portfolio. Result: Using unprocessed multivariate raw data, the model achieves competitive prediction performance, supporting the idea of treating financial time series as images. Conclusion: CNNs can capture complex patterns in multi-dimensional financial data; the approach offers a new perspective on stock movement prediction and lays groundwork for later deep-learning-based financial forecasting research. Abstract: This paper is about predicting the movement of the stocks that make up the S&P 500 index. Historically, many approaches have been tried to predict stock movement using traditional mathematical methods, and they are currently used in the market for algorithmic trading and alpha-generating systems [1, 2]. The recent success of artificial neural networks has created a lot of interest and paved the way for prediction using cutting-edge research in machine learning and deep learning. Some of these papers have done a great job of implementing and explaining the benefits of these new technologies. Although most of these papers do not go into the complexity of financial data and mostly use single-dimension data, they were still successful in laying the ground for future research in this comparatively new area. In this paper, I try to use multivariate raw data, including stock split/dividend events (as-is), present in real-world market data instead of engineered financial data. A Convolutional Neural Network (CNN), the best-known tool so far for image classification, is applied to the multi-dimensional stock numbers taken from the market, treating them as a vector of historical data matrices (read: images), and the model achieves promising results. The predictions can be made stock by stock (i.e., for a single stock), sector-wise, or for a portfolio of stocks.

[84] Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He,Xinyu Tian,Xin Shen,Jinhong Ni,Shu Zou,Zhaoyuan Yang,Jing Zhang

Main category: cs.CV

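TL;DR: This paper proposes an entropy-guided adversarial attack (EGA) that concentrates perturbations on high-entropy critical positions in vision-language model (VLM) generation, achieving semantic degradation comparable to global methods and high harmful-content conversion rates with a smaller perturbation budget, and exposing new weaknesses in current VLM safety mechanisms.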

Details Motivation: Existing entropy-based adversarial attacks assume every generation step contributes equally to instability, but the authors find that only a few high-entropy tokens are the critical points that steer the output trajectory, so a more precise and efficient attack is needed to expose VLM safety risks. Method: The authors identify high-entropy critical decision points in autoregressive generation and concentrate adversarial perturbations there; building on this, they propose Entropy-bank Guided Adversarial attacks (EGA), which exploit cross-model transferability to attack multiple VLMs effectively. Result: Across several mainstream VLMs the selective attacks convert 35-49% of benign outputs into harmful ones, with attack success rates of 93-95% and a smaller perturbation budget; on unseen target models they reach 17-26% harmful conversion, indicating good transferability. Conclusion: High-entropy decision points are central to VLM vulnerability; concentrating attacks there efficiently undermines model safety, which shows that existing defenses have major gaps and that critical generation steps need stronger protection. Abstract: Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLMs. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.
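The idea of singling out the roughly 20% highest-entropy decoding positions can be illustrated with a short sketch. It only shows the selection step (Shannon entropy per step, then a top-fraction cut); the function names and the `fraction` default are assumptions made here, and nothing below reproduces the EGA attack itself.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs, fraction=0.2):
    """Return indices of the top `fraction` decoding steps by entropy.

    `step_probs` is a list with one probability distribution per decoding
    step. The 0.2 default mirrors the "about 20%" figure in the abstract;
    the exact selection rule used by the authors is not given there.
    """
    entropies = [token_entropy(p) for p in step_probs]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy distributions: step 2 is uniform (most uncertain), the rest are peaked.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
positions = high_entropy_positions([peaked, peaked, uniform, peaked, peaked])
```

With five steps and `fraction=0.2`, exactly one position is selected, and it is the uniform step because a uniform distribution has maximal entropy.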

[85] End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration

Zhenwei Yang,Yibo Ai,Weidong Zhang

Main category: cs.CV

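TL;DR: This paper proposes XET-V2X, a multimodal-fusion end-to-end perception and tracking framework that addresses the 3D spatiotemporal understanding problems caused by occlusion, limited viewpoints, and communication delays in V2X collaboration. A dual-layer spatial cross-attention mechanism aligns and fuses multi-view images with point clouds, and the method shows strong detection and tracking performance on both real and simulated datasets.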

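Details Motivation: In autonomous-driving V2X scenarios, occlusion, limited viewpoints, and communication delays make it hard for single-view or multi-source perception to reach stable, reliable 3D spatiotemporal understanding, so a framework is needed that unifies multi-view, multimodal sensing and fuses it efficiently to make perception and tracking more robust. Method: XET-V2X uses a dual-layer spatial cross-attention module based on multi-scale deformable attention: multi-view image features are aggregated first to strengthen semantic consistency and then guide point cloud fusion, with cross-modal interaction carried out in a shared spatiotemporal representation, improving fusion efficiency while reducing computational overhead. Result: The method is validated on three benchmarks, V2X-Seq-SPD, V2X-Sim-V2V, and V2X-Sim-V2I, with consistent gains in detection and tracking that hold up under varying communication delays; visualizations indicate good temporal stability. Conclusion: With a unified multimodal fusion architecture, XET-V2X delivers efficient, robust cooperative perception and tracking, handles the perception challenges of V2X environments in complex traffic scenes, and has potential for practical use. Abstract: Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal fused end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.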

[86] Scalable Class-Incremental Learning Based on Parametric Neural Collapse

Chuangxin Zhang,Guangfeng Lin,Enhui Zhao,Kaiyang Liao,Yajun Chen

Main category: cs.CV

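TL;DR: Proposes scalable class-incremental learning based on parametric neural collapse (SCL-PNC), which uses an adapt-layer and a dynamic ETF classifier for on-demand, low-cost model expansion and combines knowledge distillation to mitigate feature drift, addressing catastrophic forgetting, class misalignment, and structural efficiency.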

Details Motivation: Existing incremental learning methods freeze the old model's parameters but overlook structural efficiency, which leads to feature differences between modules and to class misalignment as class distributions evolve. Method: SCL-PNC uses an expandable backbone with an adapt-layer for on-demand expansion, a dynamic parametric Equiangular Tight Frame (ETF) classifier that adapts to newly added classes, and a parallel expansion framework with knowledge distillation to align features across modules. Result: Experiments on standard benchmarks indicate that the method outperforms existing approaches in handling model expansion, feature consistency, and classification performance, with higher accuracy and lower storage overhead. Conclusion: By building on neural collapse theory, SCL-PNC achieves dynamic, efficient, and structurally sound incremental learning, performing well against catastrophic forgetting, shifting class distributions, and model expansion cost. Abstract: Incremental learning often encounters challenges such as overfitting to new data and catastrophic forgetting of old data. Existing methods can effectively extend the model for new tasks while freezing the parameters of the old model, but they overlook structural efficiency, which leads to feature differences between modules and to class misalignment as class distributions evolve. To address these issues, we propose scalable class-incremental learning based on parametric neural collapse (SCL-PNC), which enables demand-driven, minimal-cost backbone expansion through an adapt-layer and turns the static Equiangular Tight Frame (ETF) classifier into a dynamic parametric ETF framework as classes are added. This method can efficiently handle the model expansion problem as the number of categories grows in real-world scenarios. Additionally, to counteract feature drift in serial expansion models, a parallel expansion framework is presented with a knowledge distillation algorithm to align features across expansion modules. Therefore, SCL-PNC can not only design a dynamic and extensible ETF classifier to address class misalignment due to evolving class distributions, but also ensure feature consistency through an adapt-layer with knowledge distillation between extended modules. By leveraging neural collapse, SCL-PNC induces the convergence of the incremental expansion model through a structured combination of the expandable backbone, adapt-layer, and the parametric ETF classifier. Experiments on standard benchmarks demonstrate the effectiveness and efficiency of our proposed method. 
Our code is available at https://github.com/zhangchuangxin71-cyber/dynamic_ ETF2. Keywords: Class incremental learning; Catastrophic forgetting; Neural collapse; Knowledge distillation; Expanded model.
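For readers unfamiliar with the Equiangular Tight Frame (ETF) classifier mentioned above, the following sketch builds the standard fixed simplex ETF from the neural collapse literature and checks its defining property: unit-norm class vectors with pairwise cosine -1/(K-1). It is not the paper's dynamic parametric ETF; it takes the feature dimension equal to the number of classes for simplicity, and the function names are chosen here for illustration.

```python
import math


def simplex_etf(num_classes):
    """Rows are class prototypes w_i = sqrt(K/(K-1)) * (e_i - 1/K)."""
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


prototypes = simplex_etf(4)
pairwise = cosine(prototypes[0], prototypes[1])  # equals -1/(K-1) for K = 4
```

Because the target geometry depends on K, adding classes changes every pairwise angle, which is one way to see why a static ETF does not extend directly to the class-incremental setting.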

[87] Breaking Alignment Barriers: TPS-Driven Semantic Correlation Learning for Alignment-Free RGB-T Salient Object Detection

Lupiao Hu,Fasheng Wang,Fangmei Chen,Fuming Sun,Haojie Li

Main category: cs.CV

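TL;DR: This paper proposes TPS-SCL, a salient object detection method for unaligned RGB-T image pairs in real-world scenes; by introducing a thin-plate spline alignment module and semantic correlation constraints, it achieves state-of-the-art performance while remaining lightweight.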

Details Motivation: Existing RGB-T salient object detection methods depend on manually aligned datasets and degrade sharply on the unaligned image pairs found in practice, mainly because of spatial misalignment, scale variation, and viewpoint differences between modalities. Method: A dual-stream MobileViT encoder is combined with efficient Mamba scanning to model cross-modal relationships; a Semantic Correlation Constraint Module (SCCM) suppresses background interference; a Thin-Plate Spline Alignment Module (TPSAM) mitigates spatial discrepancies; and a Cross-Modal Correlation Module (CMCM) strengthens modality fusion. Result: Experiments on multiple datasets show that TPS-SCL reaches state-of-the-art performance among lightweight SOD methods and outperforms mainstream RGB-T SOD methods. Conclusion: TPS-SCL effectively handles salient object detection on unaligned RGB-T image pairs, markedly improving detection in real-world scenes while keeping computational overhead low. Abstract: Existing RGB-T salient object detection methods predominantly rely on manually aligned and annotated datasets, struggling to handle real-world scenarios with raw, unaligned RGB-T image pairs. In practical applications, due to significant cross-modal disparities such as spatial misalignment, scale variations, and viewpoint shifts, the performance of current methods drastically deteriorates on unaligned datasets. To address this issue, we propose an efficient RGB-T SOD method for real-world unaligned image pairs, termed Thin-Plate Spline-driven Semantic Correlation Learning Network (TPS-SCL). We employ a dual-stream MobileViT as the encoder, combined with efficient Mamba scanning mechanisms, to effectively model correlations between the two modalities while maintaining low parameter counts and computational overhead. To suppress interference from redundant background information during alignment, we design a Semantic Correlation Constraint Module (SCCM) to hierarchically constrain salient features. Furthermore, we introduce a Thin-Plate Spline Alignment Module (TPSAM) to mitigate spatial discrepancies between modalities. Additionally, a Cross-Modal Correlation Module (CMCM) is incorporated to fully explore and integrate inter-modal dependencies, enhancing detection performance. Extensive experiments on various datasets demonstrate that TPS-SCL attains state-of-the-art (SOTA) performance among existing lightweight SOD methods and outperforms mainstream RGB-T SOD approaches.

[88] Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees

Haodong Lei,Hongsong Wang,Xin Geng,Liang Wang,Pan Zhou

Main category: cs.CV

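TL;DR: Proposes ADT-Tree, an adaptive dynamic draft tree method for accelerating speculative decoding in autoregressive image generation models; it adjusts the tree structure according to the prediction difficulty of each image region and achieves roughly 3x speedups on MS-COCO and PartiPrompts.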

Details Motivation: Autoregressive image models generate high-quality images but infer slowly; existing speculative decoding methods have unstable acceptance rates on visual tasks because prediction difficulty varies across image regions, which limits the speedup. Method: ADT-Tree uses adjacent token states and prior acceptance rates to adjust draft tree depth and width dynamically, with deeper trees in simple regions and wider trees in complex ones; it initializes via horizontal adjacency and then refines the structure through bisectional adaptation. Result: Speedups of 3.13x on MS-COCO 2017 and 3.05x on PartiPrompts, with further acceleration when combined with relaxed sampling methods such as LANTERN. Conclusion: ADT-Tree addresses inconsistent speculative-decoding acceptance rates in visual AR models, significantly improves generation efficiency, and is compatible with other methods. Abstract: Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative decoding to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive dynamic draft tree that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones. The empirical evaluations on MS-COCO 2017 and PartiPrompts demonstrate that ADT-Tree achieves speedups of 3.13x and 3.05x, respectively. Moreover, it integrates seamlessly with relaxed sampling methods such as LANTERN, enabling further acceleration. Code is available at https://github.com/Haodong-Lei-Ray/ADT-Tree.

[89] Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

Masayuki Kawarada,Kosuke Yamada,Antonio Tejero-de-Pablos,Naoto Inoue

Main category: cs.CV

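TL;DR: This paper proposes DIOR, a training-free method that uses a large vision-language model (LVLM) to generate conditional image embeddings: the LVLM is prompted to describe the image with a single word related to the given condition, and the hidden state of the last token is taken as the embedding. DIOR outperforms existing methods across multiple experiments.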

Details Motivation: Existing vision foundation models such as CLIP cannot focus on the specific image features indicated by a textual condition, and effective conditional image embedding methods are lacking. Method: DIOR is a training-free method that prompts an LVLM to describe an image with one word related to the condition and extracts the hidden state vector of its last token as the conditional image embedding. Result: Experiments on conditional image similarity tasks show that DIOR outperforms existing training-free baselines, including CLIP, and in multiple settings surpasses methods that require additional training. Conclusion: DIOR offers a general, flexible, and efficient way to obtain conditional image embeddings for any image and condition, without training or task-specific priors. Abstract: Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), and producing them has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.

[90] Balancing Accuracy and Efficiency: CNN Fusion Models for Diabetic Retinopathy Screening

Md Rafid Islam,Rafsan Jany,Akib Ahmed,Mohammad Ashrafuzzaman Khan

Main category: cs.CV

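TL;DR: This paper studies feature-level fusion of multiple convolutional neural network (CNN) backbones to improve the accuracy and efficiency of binary diabetic retinopathy (DR) screening on globally sourced fundus images. Using 11,156 images from five public datasets, it compares three pretrained models and their fusion variants; the EfficientNet-B0 and DenseNet121 fusion performs best in accuracy and class balance while keeping computational overhead reasonable.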

Details Motivation: Large-scale diabetic retinopathy screening is limited by scarce specialist availability and by large variation in image quality across devices and populations, so an efficient and accurate method is needed to improve DR screening across devices and populations. Method: A feature-level fusion strategy combines pretrained CNN backbones (ResNet50, EfficientNet-B0, and DenseNet121) pairwise or all three together and evaluates them on binary DR detection over five public fundus image datasets. Experiments use five independent runs for stability, and inference speed and computational cost are analyzed. Result: Fusion models consistently outperform single backbones; the EfficientNet-B0 + DenseNet121 (Eff+Den) fusion achieves the best mean performance (accuracy 82.89%) with balanced F1-scores for normal and diabetic cases (83.60% and 82.60%). The tri-fusion is close in performance but substantially more expensive. EfficientNet-B0 alone is the fastest (about 1.16 ms/image at batch size 1000), while the Eff+Den fusion offers a better accuracy-latency trade-off. Conclusion: Lightweight feature fusion improves generalization across heterogeneous datasets and supports large-scale binary DR screening where both accuracy and throughput matter. Abstract: Diabetic retinopathy (DR) remains a leading cause of preventable blindness, yet large-scale screening is constrained by limited specialist availability and variable image quality across devices and populations. This work investigates whether feature-level fusion of complementary convolutional neural network (CNN) backbones can deliver accurate and efficient binary DR screening on globally sourced fundus images. Using 11,156 images pooled from five public datasets (APTOS, EyePACS, IDRiD, Messidor, and ODIR), we frame DR detection as a binary classification task and compare three pretrained models (ResNet50, EfficientNet-B0, and DenseNet121) against pairwise and tri-fusion variants. Across five independent runs, fusion consistently outperforms single backbones. The EfficientNet-B0 + DenseNet121 (Eff+Den) fusion model achieves the best overall mean performance (accuracy: 82.89%) with balanced class-wise F1-scores for normal (83.60%) and diabetic (82.60%) cases. While the tri-fusion is competitive, it incurs a substantially higher computational cost. Inference profiling highlights a practical trade-off: EfficientNet-B0 is the fastest (approximately 1.16 ms/image at batch size 1000), whereas the Eff+Den fusion offers a favorable accuracy-latency balance. 
These findings indicate that lightweight feature fusion can enhance generalization across heterogeneous datasets, supporting scalable binary DR screening workflows where both accuracy and throughput are critical.

[91] EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

Yihan Hu,Xuelin Chen,Xiaodong Cun

Main category: cs.CV

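TL;DR: This paper proposes EasyOmnimatte, the first end-to-end video omnimatte method; it finetunes a video inpainting diffusion model into a dual-expert structure (Effect Expert and Quality Expert), significantly improving efficiency while maintaining high quality.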

Details Motivation: Existing video omnimatte methods rely on multi-stage or inference-time optimization, are slow, and do not fully exploit generative priors, which leads to poor decompositions. The authors want a unified, efficient, end-to-end model that captures the foreground together with its associated effects. Method: Starting from a pretrained video inpainting diffusion model, the method introduces two experts: an Effect Expert, with LoRA applied only to effect-sensitive DiT blocks to capture the foreground and associated effects, and a Quality Expert, finetuned with LoRA on all blocks to refine the alpha matte. During sampling, the Effect Expert denoises at early, high-noise steps and the Quality Expert takes over at later, low-noise steps. Result: EasyOmnimatte achieves state-of-the-art video omnimatte performance on multiple benchmarks, significantly outperforming baselines, and supports various downstream tasks. Ablations validate the dual-expert strategy and show a significant reduction in computational cost. Conclusion: With a carefully designed dual-expert finetuning strategy, EasyOmnimatte combines high quality with high efficiency as the first end-to-end video omnimatte method and points to new directions for related research. Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert that learns to refine the alpha matte. During sampling, the Effect Expert is used for denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. 
This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
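The sampling-time hand-off between the two experts can be pictured as a simple schedule over denoising steps. The sketch below is only a schematic under stated assumptions: the abstract does not say where the switch happens, so `switch_fraction` and the function name `pick_expert` are hypothetical, and step 0 is taken to be the noisiest step.

```python
def pick_expert(step_index, num_steps, switch_fraction=0.5):
    """Choose which expert denoises at a given sampling step.

    Step 0 is assumed to be the highest-noise step. `switch_fraction`
    is a hypothetical knob; the abstract does not give the switch point.
    """
    if not 0 <= step_index < num_steps:
        raise ValueError("step_index must be in [0, num_steps)")
    if step_index < switch_fraction * num_steps:
        return "effect"   # early, high-noise steps: coarse structure + effects
    return "quality"      # later, low-noise steps: alpha matte refinement


# Each step is handled by exactly one expert, so the total number of
# denoiser calls equals a single diffusion pass.
schedule = [pick_expert(i, 4) for i in range(4)]
```

Since every step is assigned to exactly one expert, the cost stays at one pass over the step schedule, which matches the abstract's remark that two full diffusion passes are not needed.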

[92] DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

Divyansh Srivastava,Akshay Mehra,Pranav Maneriker,Debopam Sanyal,Vishnu Raj,Vijay Kamarshi,Fan Du,Joshua Kimball

Main category: cs.CV

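TL;DR: DPAR is a novel decoder-only autoregressive image generation model that improves generation efficiency by dynamically aggregating image tokens into a variable number of patches.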

Details Motivation: With fixed-length tokenization schemes, compute and memory demands rise sharply as resolution increases, so a more efficient image generation method is needed. Method: The next-token prediction entropy of a lightweight, unsupervised autoregressive model serves as the criterion for merging tokens; patch size is adjusted dynamically according to information content, with minimal changes to the standard decoder architecture. Result: Token count is reduced by 1.81x and 2.06x at ImageNet 256 and 384 resolution respectively, training cost drops by up to 40% in FLOPs, and FID improves by up to 27.1% relative to baseline models. Conclusion: DPAR enables efficient, scalable image generation, is compatible with multimodal frameworks, allocates more compute to high-information regions, and shows faster convergence and better performance. Abstract: Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on ImageNet 256 and 384 generation resolution respectively, leading to a reduction of up to 40% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
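One plausible way to turn per-token prediction entropy into variable-size patches is a threshold rule: start a new patch whenever the next token is hard to predict, and otherwise keep extending the current patch up to a size cap. The sketch below illustrates that rule only; the threshold, the cap, and the function name are assumptions made here, and DPAR's actual merging criterion may differ.

```python
def merge_tokens_by_entropy(entropies, threshold, max_patch_size=4):
    """Group consecutive token indices into variable-size patches.

    A new patch starts when a token's entropy exceeds `threshold`
    (a high-information position) or when the current patch is full.
    Both `threshold` and `max_patch_size` are hypothetical parameters.
    """
    patches = []
    current = []
    for idx, entropy in enumerate(entropies):
        if current and (entropy > threshold or len(current) >= max_patch_size):
            patches.append(current)
            current = []
        current.append(idx)
    if current:
        patches.append(current)
    return patches


# Toy entropies: only token 2 is hard to predict.
toy_entropies = [0.1, 0.2, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1]
patches = merge_tokens_by_entropy(toy_entropies, threshold=1.0)
reduction = len(toy_entropies) / len(patches)  # tokens per patch on average
```

In this toy case eight tokens collapse into three patches, so the decoder would process fewer sequence elements in predictable regions while boundaries still fall at high-entropy positions.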

[93] SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis

Mo Wang,Junfeng Xia,Wenhao Ye,Enyu Liu,Kaining Peng,Jianfeng Feng,Quanying Liu,Hongkai Wen

Main category: cs.CV

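TL;DR: SLIM-Brain is an efficient fMRI foundation model whose two-stage adaptive design preserves spatial detail while significantly improving data and training efficiency.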

Details Motivation: Current foundation models for fMRI analysis face bottlenecks in both data efficiency and training efficiency: atlas-based methods lose fine-grained spatial information, while atlas-free methods keep voxel-level information at excessive compute and memory cost. Method: SLIM-Brain uses a two-stage design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency; (ii) a 4D hierarchical encoder (Hiera-JEPA) learns voxel-level representations only from the top-k windows and deletes about 70% of masked patches. Result: Experiments on seven public benchmarks show that SLIM-Brain reaches state-of-the-art performance on diverse tasks while needing only about 4 thousand pre-training sessions and about 30% of the GPU memory of traditional voxel-level methods. Conclusion: SLIM-Brain enables efficient, low-memory, atlas-free fMRI foundation modeling that balances performance with resource use and makes large-scale fMRI pre-training more feasible. Abstract: Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models. Atlas-free methods, on the other hand, operate directly on voxel-level information, preserving spatial fidelity, but are prohibitively memory- and compute-intensive, making large-scale pre-training infeasible. We introduce SLIM-Brain (Sample-efficient, Low-memory fMRI Foundation Model for Human Brain), a new atlas-free foundation model that simultaneously improves both data- and training-efficiency. SLIM-Brain adopts a two-stage adaptive design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency, and (ii) a 4D hierarchical encoder (Hiera-JEPA) learns fine-grained voxel-level representations only from the top-$k$ selected windows, while deleting about 70% of masked patches. Extensive experiments across seven public benchmarks show that SLIM-Brain establishes new state-of-the-art performance on diverse tasks, while requiring only 4 thousand pre-training sessions and approximately 30% of the GPU memory compared to traditional voxel-level methods.
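The hand-off between the two stages amounts to a top-k selection over window saliency scores. The sketch below shows just that selection step, assuming the first stage has already produced one score per window; the function name and the toy scores are illustrative and not taken from the paper.

```python
def select_top_k_windows(saliency, k):
    """Return indices of the k most salient windows, in temporal order.

    `saliency` holds one score per data window, assumed to come from the
    lightweight temporal extractor; how those scores are computed is not
    described in the abstract.
    """
    if not 0 < k <= len(saliency):
        raise ValueError("k must be between 1 and the number of windows")
    ranked = sorted(range(len(saliency)),
                    key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores for four windows; only the selected ones would reach the
# heavier voxel-level encoder.
selected = select_top_k_windows([0.1, 0.9, 0.4, 0.7], k=2)
```

Returning the indices in temporal order keeps the selected windows usable as a sequence, while the expensive encoder sees only k of the original windows.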

[94] Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer

Tianchen Deng,Wenhua Wu,Kunzhen Wu,Guangming Wang,Siting Zhu,Shenghai Yuan,Xun Chen,Guole Shen,Zhe Liu,Hesheng Wang

Main category: cs.CV

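TL;DR: This paper proposes Reloc-VGGT, a multi-view visual localization framework based on an early-fusion mechanism; it uses a VGGT backbone and a sparse mask attention strategy for efficient, robust, real-time camera pose estimation.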

Details Motivation: Traditional visual localization methods use pair-wise matching and late fusion, which integrate multi-view spatial information poorly and degrade in complex environments, so a more effective multi-view spatial fusion mechanism is needed. Method: An early-fusion multi-view localization framework uses the VGGT backbone to encode multi-view 3D geometry, adds a pose tokenizer and projection module to make better use of spatial relationships, and designs a sparse mask attention mechanism to reduce computational complexity. Result: Trained on approximately eight million posed image pairs, Reloc-VGGT achieves high accuracy, strong generalization, and real-time performance on multiple public datasets, significantly outperforming existing methods. Conclusion: Through early fusion and sparse attention, the framework improves the accuracy and efficiency of visual localization in complex and unseen environments and has practical value. Abstract: Visual localization has traditionally been formulated as a pair-wise pose regression problem. Existing approaches mainly estimate relative poses between two images and employ a late-fusion strategy to obtain absolute pose estimates. However, the late motion average is often insufficient for effectively integrating spatial information, and its accuracy degrades in complex environments. In this paper, we present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism, enabling robust operation in both structured and unstructured environments. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry, and we introduce a pose tokenizer and projection module to more effectively exploit spatial relationships from multiple database views. Furthermore, we propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention, thereby enabling real-time performance at scale. Trained on approximately eight million posed image pairs, Reloc-VGGT demonstrates strong accuracy and remarkable generalization ability. Extensive experiments across diverse public datasets consistently validate the effectiveness and efficiency of our approach, delivering high-quality camera pose estimates in real time while maintaining robustness to unseen environments. Our code and models will be publicly released upon acceptance. https://github.com/dtc111111/Reloc-VGGT.

[95] CrownGen: Patient-customized Crown Generation via Point Diffusion Model

Juyoung Bae,Moo Hyun Son,Jiale Peng,Wanting Qu,Wener Chen,Zelin Qiu,Kaixin Li,Xiaojuan Chen,Yifan Lin,Hao Chen

Main category: cs.CV

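TL;DR: CrownGen is a generative framework based on a denoising diffusion model that automatically designs patient-customized dental crowns, significantly improving design efficiency and geometric accuracy in restorative dentistry.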

Details Motivation: Conventional crown design depends on manual work, is time-consuming and inefficient, and has become a bottleneck in restorative dentistry, so an automated, high-accuracy solution is needed. Method: The CrownGen framework uses a tooth-level point cloud representation, a boundary prediction module that provides spatial priors, and a diffusion-based generative module that synthesizes high-fidelity morphology for multiple teeth in a single inference pass. Result: The method is evaluated quantitatively on 496 external scans and validated in a clinical study of 26 restoration cases; CrownGen surpasses state-of-the-art models in geometric fidelity and significantly reduces active design time, and clinical assessment indicates that CrownGen-assisted crowns are non-inferior in quality to expert manual designs. Conclusion: By automating complex prosthetic modeling, CrownGen offers a scalable way to lower costs, shorten turnaround times, and widen access to high-quality dental care. Abstract: Digital crown design remains a labor-intensive bottleneck in restorative dentistry. We present CrownGen, a generative framework that automates patient-customized crown design using a denoising diffusion model on a novel tooth-level point cloud representation. The system employs two core components: a boundary prediction module to establish spatial priors and a diffusion-based generative module to synthesize high-fidelity morphology for multiple teeth in a single inference pass. We validated CrownGen through a quantitative benchmark on 496 external scans and a clinical study of 26 restoration cases. Results demonstrate that CrownGen surpasses state-of-the-art models in geometric fidelity and significantly reduces active design time. Clinical assessments by trained dentists confirmed that CrownGen-assisted crowns are statistically non-inferior in quality to those produced by expert technicians using manual workflows. By automating complex prosthetic modeling, CrownGen offers a scalable solution to lower costs, shorten turnaround times, and enhance patient access to high-quality dental care.

[96] High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer

Shen Zheng,Jiaran Cai,Yuansheng Guan,Shenneng Huang,Xingpei Ma,Junjie Cao,Hanfeng Zhao,Qiang Zhang,Shunsi Zhang,Xiao-Ping Zhang

Main category: cs.CV

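TL;DR: Proposes a diffusion transformer (DiT)-based framework for generating high-fidelity, long-duration human animation videos; with hybrid implicit guidance signals, a time-aware position shift fusion module, and a data augmentation strategy, it brings significant gains in facial and hand detail and in video length.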

Details Motivation: Existing methods fall short in generating long videos and fine facial and hand details, which limits their use in high-quality real-world applications. Method: Hybrid implicit guidance signals and a sharpness guidance factor enhance detail; a time-aware position shift fusion module (the Position Shift Adaptive Module) supports video generation of arbitrary length; and a new data augmentation strategy with a skeleton alignment model reduces the effect of shape differences across identities. Result: Experiments show that the method outperforms existing state-of-the-art approaches in high-fidelity, long-duration human animation generation. Conclusion: The proposed DiT-based framework addresses the challenges of long-duration generation and fine detail, significantly improving the quality and practicality of human animation. Abstract: Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, significant challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module, which enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving superior performance in both high-fidelity and long-duration human image animation.

[97] Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition

Zeyu Liang,Hailun Xia,Naichuan Zheng

Main category: cs.CV

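TL;DR: Proposes PAN, a human-centric graph representation learning framework for multimodal action recognition. It models RGB image patches containing human joints as spatiotemporal graphs to fuse RGB and skeleton information, and its two variants, PAN-Ensemble and PAN-Unified, achieve state-of-the-art performance on multiple datasets.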

Details Motivation: Existing multimodal action recognition methods are limited by the heterogeneity between RGB and skeleton modalities and fail to fully exploit their complementarity, so a more effective fusion mechanism is needed. Method: Proposes the PAN framework, which converts RGB patches containing human joints into spatiotemporal graphs; introduces attention-based post calibration to reduce the dependency on high-quality skeletal data; designs two variants, PAN-Ensemble (dual-path GCNs with late fusion) and PAN-Unified (unified graph learning in a single network). Result: On three mainstream multimodal action recognition datasets, PAN-Ensemble and PAN-Unified reach state-of-the-art performance in the separate-modeling and unified-modeling settings, respectively. Conclusion: Through a human-centric graph modeling paradigm, PAN achieves efficient and semantically coherent fusion of RGB and skeleton modalities, improving multimodal action recognition performance while remaining practical and extensible. Abstract: While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.
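
The abstract says that token embeddings are sampled from RGB patches containing human joints, guided by 2D skeletal data. As a rough illustration of that sampling step only, the sketch below maps 2D joint coordinates to token indices on a ViT-style patch grid. The 16-pixel patch size, the row-major token layout, and the function name `joint_patch_indices` are assumptions made for this example, not details taken from the paper.

```python
def joint_patch_indices(joints_xy, image_w, image_h, patch=16):
    """Map 2D joint coordinates (in pixels) to row-major patch-token indices."""
    cols = image_w // patch
    rows = image_h // patch
    indices = []
    for x, y in joints_xy:
        # Clamp so joints slightly outside the frame map to a border patch.
        col = min(max(int(x) // patch, 0), cols - 1)
        row = min(max(int(y) // patch, 0), rows - 1)
        indices.append(row * cols + col)
    return indices


# Hypothetical joints on a 224x224 frame (a 14x14 grid of 16-pixel patches).
joints = [(10.0, 5.0), (100.0, 40.0), (223.9, 223.9)]
selected = joint_patch_indices(joints, image_w=224, image_h=224)
print(selected)
```

The selected indices would then pick out the patch tokens that become graph nodes; how PAN builds edges and applies post calibration is not specified in the abstract and is not modeled here.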

[98] AutoPP: Towards Automated Product Poster Generation and Optimization

Jiahao Fan,Yuxin Qin,Wei Feng,Yanyin Chen,Yaoyu Li,Ao Ma,Yixiu Li,Li Zhuang,Haoyi Bian,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law

Main category: cs.CV

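TL;DR: Proposes AutoPP, an automated pipeline for product poster generation and optimization that generates high-quality posters from basic product information and keeps optimizing them with online feedback such as click-through rate, greatly reducing manual intervention.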

Details Motivation: Designing and optimizing product posters by hand is slow, labor-intensive, and dependent on designer experience, and it cannot respond quickly to online performance data, so an automated method is needed to improve efficiency and effectiveness. Method: AutoPP has two core modules, a generator and an optimizer. The generator uses a unified design module to integrate background, text, and layout, and an element rendering module encodes these elements into condition tokens to generate the poster. The optimizer uses online feedback: it systematically replaces elements and applies IDPO (Isolated Direct Preference Optimization) to attribute click-through-rate gains to specific elements, enabling targeted optimization. Result: Experiments show that AutoPP reaches state-of-the-art performance in both offline and online settings. The work also releases AutoPP1M, currently the largest product poster dataset, with one million high-quality posters and feedback from over one million users. Conclusion: AutoPP delivers efficient, automated poster generation and optimization, cutting labor costs while improving advertising effectiveness, and has good practical value. Abstract: Product posters blend striking visuals with informative text to highlight the product and capture customer attention. However, crafting appealing posters and manually optimizing them based on online performance is laborious and resource-consuming. To address this, we introduce AutoPP, an automated pipeline for product poster generation and optimization that eliminates the need for human intervention. Specifically, the generator, relying solely on basic product information, first uses a unified design module to integrate the three key elements of a poster (background, text, and layout) into a cohesive output. Then, an element rendering module encodes these elements into condition tokens, efficiently and controllably generating the product poster. Based on the generated poster, the optimizer enhances its Click-Through Rate (CTR) by leveraging online feedback. It systematically replaces elements to gather fine-grained CTR comparisons and utilizes Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements. Our work is supported by AutoPP1M, the largest dataset specifically designed for product poster generation and optimization, which contains one million high-quality posters and feedback collected from over one million users. Experiments demonstrate that AutoPP achieves state-of-the-art results in both offline and online settings. Our code and dataset are publicly available at: https://github.com/JD-GenX/AutoPP

[99] Unsupervised Anomaly Detection in Brain MRI via Disentangled Anatomy Learning

Tao Yang,Xiuying Wang,Hao Liu,Guanzhong Gong,Lian-Ming Wu,Yu-Ping Wang,Lisheng Wang

Main category: cs.CV

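TL;DR: Proposes a new pseudo-healthy image reconstruction framework that uses a disentangled representation module and an edge-to-image restoration module to improve the generalizability and performance of anomaly detection in brain MRI.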

Details Motivation: Existing unsupervised methods generalize poorly to multi-modality, multi-center MRI, and their detection performance is limited by abnormal residuals that propagate into the reconstruction. Method: Designs a disentangled representation module (separating imaging information from anatomical structure) and an edge-to-image restoration module (reconstructing pseudo-healthy images from edge information), and introduces brain anatomical priors and a differentiable one-hot encoding to improve stability. Result: On nine public datasets (4,443 patients), the method outperforms 17 SOTA methods, with gains of +18.32% in AP and +13.64% in DSC. Conclusion: The proposed framework improves the accuracy and robustness of brain lesion detection and shows good potential for clinical use. Abstract: Detection of various lesions in brain MRI is clinically critical, but challenging due to the diversity of lesions and variability in imaging conditions. Current unsupervised learning methods detect anomalies mainly by reconstructing abnormal images into pseudo-healthy images (PHIs), using a model learned from normal samples, and then analyzing the differences between the images. However, these unsupervised models face two significant limitations: restricted generalizability to multi-modality and multi-center MRIs due to their reliance on the specific imaging information in normal training data, and constrained performance due to abnormal residuals propagated from input images to reconstructed PHIs. To address these limitations, two novel modules are proposed, forming a new PHI reconstruction framework. Firstly, the disentangled representation module is proposed to improve generalizability by decoupling brain MRI into imaging information and essential imaging-invariant anatomical images, ensuring that the reconstruction focuses on the anatomy. Specifically, brain anatomical priors and a differentiable one-hot encoding operator are introduced to constrain the disentanglement results and enhance the disentanglement stability. Secondly, the edge-to-image restoration module is designed to reconstruct high-quality PHIs by restoring the anatomical representation from the high-frequency edge information of anatomical images, and then recoupling the disentangled imaging information. This module not only suppresses abnormal residuals in PHIs by reducing the abnormal pixels passed in through edge-only input, but also effectively reconstructs normal regions using the preserved structural details in the edges.
Evaluated on nine public datasets (4,443 patients' MRIs from multiple centers), our method outperforms 17 SOTA methods, achieving absolute improvements of +18.32% in AP and +13.64% in DSC.
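
The abstract describes detecting anomalies from the difference between an input image and its pseudo-healthy reconstruction, and reports results in DSC (the Dice similarity coefficient). The sketch below illustrates just those two generic steps on a toy 2x3 image. The threshold value, the pixel values, and the function names are assumptions for illustration; the reconstruction model itself is not reproduced.

```python
def residual_mask(image, pseudo_healthy, threshold):
    """Flag pixels whose absolute reconstruction residual exceeds the threshold."""
    return [[abs(a - b) > threshold for a, b in zip(row_i, row_p)]
            for row_i, row_p in zip(image, pseudo_healthy)]


def dice(pred, target):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for boolean masks."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy input, hypothetical pseudo-healthy reconstruction, and ground-truth lesion mask.
image = [[0.1, 0.9, 0.8], [0.1, 0.7, 0.1]]
pseudo_healthy = [[0.1, 0.2, 0.2], [0.1, 0.6, 0.1]]
truth = [[False, True, True], [False, True, False]]

mask = residual_mask(image, pseudo_healthy, threshold=0.3)
score = dice(mask, truth)
print(score)
```

Here two of the three lesion pixels exceed the residual threshold, so the predicted mask overlaps the ground truth only partially and the score falls below 1.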

[100] Data relativistic uncertainty framework for low-illumination anime scenery image enhancement

Yiquan Gao,John See

Main category: cs.CV

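TL;DR: Proposes a low-light enhancement method for anime scenery images. It builds an unpaired dataset and introduces a Data Relativistic Uncertainty (DRU) framework, inspired by Relativistic GAN, that uses illumination uncertainty to dynamically adjust learning; experiments on several versions of EnlightenGAN show it outperforms existing methods.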

Details Motivation: Existing low-light enhancement research focuses mainly on natural images. Anime scenery images are under-studied because data is scarce and illumination conditions are diverse, so a dedicated method is needed to fill this gap. Method: Builds an unpaired anime scenery dataset covering diverse environments and illumination conditions, and proposes the Data Relativistic Uncertainty (DRU) framework. Inspired by Relativistic GAN and drawing an analogy with the wave-particle duality of light, it defines and quantifies the illumination uncertainty of dark and bright samples and uses it to dynamically adjust the objective functions under data uncertainty. Result: Experiments on several versions of EnlightenGAN show that the DRU framework surpasses state-of-the-art methods in perceptual and aesthetic quality and learns more effectively under data uncertainty. Conclusion: The DRU framework offers an effective solution for low-light anime image enhancement and points to the potential of data-centric learning in vision and possibly language domains. Abstract: By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent in the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of the DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from a data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.

[101] Automated Discovery of Parsimonious Spectral Indices via Normalized Difference Polynomials

Ali Lotfi,Adam Carter,Thuan Ha,Mohammad Meysami,Kwabena Nketia,Steve Shirtliffe

Main category: cs.CV

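TL;DR: Proposes an automated method that builds polynomial combinations of normalized differences and applies feature selection to produce compact spectral indices for vegetation classification. It reaches high accuracy on Kochia detection and is easy to interpret and deploy.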

Details Motivation: To automatically discover compact, interpretable, and effective spectral indices for vegetation classification while preserving the illumination invariance needed in remote sensing. Method: Builds polynomial combinations up to degree 2 from the normalized differences between all band pairs, then uses feature selection (ANOVA filtering, recursive elimination, and L1-regularized SVM) to pick the best compact set of indices. Result: On Sentinel-2 data, a single degree-2 index reaches 96.26% accuracy and eight indices raise this to 97.70%; all selected features are degree-2 interaction terms from the red-edge bands. Conclusion: Spectral interactions, rather than individual band ratios, drive the classification. The method extends to other sensors and tasks and can be deployed directly on platforms such as Google Earth Engine. Abstract: We introduce an automated way to find compact spectral indices for vegetation classification. The idea is to take all pairwise normalized differences from the spectral bands and then build polynomial combinations up to a fixed degree, which gives a structured search space that still keeps the illumination invariance needed in remote sensing. For a sensor with $n$ bands this produces $\binom{n}{2}$ base normalized differences, and the degree-2 polynomial expansion gives 1,080 candidate features for the 10-band Sentinel-2 configuration we use here. Feature selection methods (ANOVA filtering, recursive elimination, and $L_1$-regularized SVM) then pick out small sets of indices that reach the desired accuracy, so the final models stay simple and easy to interpret. We test the framework on Kochia (Bassia scoparia) detection using Sentinel-2 imagery from Saskatchewan, Canada ($N = 2{,}318$ samples, 2022-2024). A single degree-2 index, the product of two normalized differences from the red-edge bands, already reaches 96.26% accuracy, and using eight indices only raises this to 97.70%. In every case the chosen features are degree-2 products built from bands $b_4$ through $b_8$, which suggests that the discriminative signal comes from spectral interactions rather than individual band ratios. Because the indices involve only simple arithmetic, they can be deployed directly in platforms like Google Earth Engine. The same approach works for other sensors and classification tasks, and an open-source implementation (ndindex) is available.
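
The candidate-feature construction in the abstract is concrete enough to sketch. The example below forms all pairwise normalized differences for 10 bands and expands them to degree 2. It assumes the candidate set consists of the 45 degree-1 terms plus all degree-2 products including squares, a counting convention that reproduces the stated 1,080 features but is not confirmed by the abstract. The band values and function names are made up for illustration, and this is not the `ndindex` implementation.

```python
from itertools import combinations, combinations_with_replacement


def normalized_differences(bands):
    """All pairwise normalized differences (b_i - b_j) / (b_i + b_j) with i < j."""
    return [(bi - bj) / (bi + bj) for bi, bj in combinations(bands, 2)]


def degree2_features(nds):
    """Degree-1 terms followed by all degree-2 products (squares included)."""
    products = [a * b for a, b in combinations_with_replacement(nds, 2)]
    return list(nds) + products


# Hypothetical reflectances for a 10-band pixel.
bands = [0.05 + 0.03 * k for k in range(10)]
nds = normalized_differences(bands)
features = degree2_features(nds)
print(len(nds), len(features))

# Scaling every band by the same illumination factor leaves the features unchanged.
scaled = degree2_features(normalized_differences([2.0 * b for b in bands]))
print(max(abs(a - b) for a, b in zip(features, scaled)))
```

Because each normalized difference is a ratio of band sums and differences, a common multiplicative illumination factor cancels, which is why products of these terms inherit the invariance mentioned in the abstract.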

[102] Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

Dunyuan XU,Xikai Yang,Yaoqian Li,Juzheng Miao,Jinpeng Li,Pheng-Ann Heng

Main category: cs.CV

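TL;DR: Proposes a training-free multi-modal calibration framework (IMC) that improves the robustness of medical multi-modal large language models (MLLMs) to image and text noise. It designs denoising methods for the visual and textual modalities under a perceive-and-calibrate principle and builds a benchmark with 11 noise types for validation.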

Details Motivation: In real clinical use, medical MLLMs are sensitive to input perturbations such as imaging artifacts and textual errors. Existing studies focus mostly on the text modality and rely on costly fine-tuning, which falls short of medicine's requirements for safety and for handling complex noise. Method: Proposes the training-free IMC framework, which follows the perceive-and-calibrate principle: for the visual modality, a Perturbation-aware Denoising Calibration (PDC) that performs prototype-guided feature calibration; for the text modality, a Self-instantiated Multi-agent System (SMS) that uses the model's self-assessment ability for cooperative error correction. Result: A benchmark with 11 noise types is built on two datasets, and experiments show the method reaches state-of-the-art performance across modalities, improving stability and accuracy under noise. Conclusion: The IMC framework strengthens the robustness of medical MLLMs to real-world noise without extra training, showing potential for deployment in clinical settings. Abstract: Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets.
Experimental results demonstrate that our method achieves state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.

[103] A Lightweight Multi-Scale Attention Framework for Real-Time Spinal Endoscopic Instance Segmentation

Qi Lai,JunYan Li,Qiang Cai,Lei Wang,Tao Yan,XiaoKun Liang

Main category: cs.CV

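TL;DR: Proposes LMSF-A, a lightweight multi-scale attention framework for real-time instance segmentation in spinal endoscopy, offering high accuracy, a low parameter count, and stable small-batch training, and releases the clinically annotated PELD dataset.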

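Details Motivation: Instance segmentation in spinal endoscopic surgery is difficult because of the narrow field of view, specular highlights, bleeding and smoke, and blurred boundaries. Surgical hardware is also limited, so the model must balance accuracy and speed and stay stable when trained with small batches, even a batch size of one. Method: Designs the LMSF-A framework, co-optimized across backbone, neck, and head. The backbone uses a C2f-Pro module combining re-parameterized convolution (RVB) with efficient multi-scale attention (EMA); the neck uses SSFF and TFE to strengthen cross-scale consistency and boundary detail; the head is a Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm, reducing parameters and improving batch-1 training stability. Result: LMSF-A needs only 1.8M parameters and 8.8 GFLOPs, performs strongly on all metrics, outperforms most existing methods, and generalizes well to a public teeth dataset. The PELD dataset (61 patients, 610 images) is released with instance masks for adipose tissue, bone, ligamentum flavum, and nerve. Conclusion: LMSF-A keeps accuracy high while substantially cutting computation, making it suitable for resource-constrained surgical settings; its modular design separates the training and inference paths, offering an effective solution for real-time medical instance segmentation. Abstract: Real-time instance segmentation for spinal endoscopy is important for identifying and protecting critical anatomy during surgery, but it is difficult because of the narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is also constrained by limited surgical hardware, so the model must balance accuracy and speed and remain stable under small-batch (even batch-1) training. We propose LMSF-A, a lightweight multi-scale attention framework co-designed across backbone, neck, and head. The backbone uses a C2f-Pro module that combines RepViT-style re-parameterized convolution (RVB) with efficient multi-scale attention (EMA), enabling multi-branch training while collapsing into a single fast path for inference. The neck improves cross-scale consistency and boundary detail using Scale-Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE), which strengthens high-resolution features. The head adopts a Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm to reduce parameters and support batch-1 stability. We also release the clinically reviewed PELD dataset (61 patients, 610 images) with instance masks for adipose tissue, bone, ligamentum flavum, and nerve.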
Experiments show that LMSF-A is highly competitive with, or better than, existing instance segmentation methods on all evaluation metrics while being much lighter than most of them, requiring only 1.8M parameters and 8.8 GFLOPs; it also generalizes well to a public teeth benchmark. Code and dataset: https://github.com/hhwmortal/PELD-Instance-segmentation.
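
The abstract ties batch-1 stability to the use of GroupNorm in the head. A minimal sketch of why this helps: group normalization computes its statistics within a single sample, so nothing depends on the batch size. The plain-Python function below omits the learnable scale and shift, and its name and toy input are assumptions for illustration, not the authors' code.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample (a list of channels, each a flat list of values) per channel group."""
    channels = len(sample)
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        # Statistics come from this sample alone, so a batch of one is enough.
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out


# One sample with 4 channels of 2 values each, split into 2 groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normed = group_norm(sample, num_groups=2)
print(normed[0], normed[2])
```

Batch normalization, by contrast, estimates mean and variance across the batch dimension, which is why it tends to become unreliable at very small batch sizes.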

[104] LVLM-Aided Alignment of Task-Specific Vision Models

Alexander Koebler,Lukas Kuhn,Ingo Thon,Florian Buettner

Main category: cs.CV

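TL;DR: Proposes LVLM-VA, a new method that uses a large vision language model (LVLM) to align small task-specific vision models with human domain knowledge, providing a bidirectional interface that brings model behavior closer to human specifications.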

Details Motivation: Small vision models often rely on spurious correlations, which puts them at odds with human domain knowledge and undermines robustness in deployment. Method: Designs an LVLM-Aided Visual Alignment (LVLM-VA) framework that explains model behavior to human experts in natural language and converts their class-level specifications into image-level critiques, enabling two-way interaction. Result: Validation on synthetic and real-world datasets shows that the method substantially reduces the model's reliance on spurious features and group-specific biases and improves agreement with human specifications. Conclusion: LVLM-VA is an efficient method that needs no fine-grained feedback and effectively aligns small vision models with human domain knowledge, improving their interpretability and reliability. Abstract: In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.

[105] Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs

Jiayu Hu,Beibei Li,Jiangwei Xia,Yanjun Qin,Bing Ji,Zhongshi He

Main category: cs.CV

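TL;DR: Proposes ALEAHallu, an adversarial parametric editing framework that mitigates hallucination in vision-language models by activating, locating, and adversarially editing critical parameter clusters, increasing the model's reliance on visual evidence.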

Details Motivation: Vision-language models (VLMs) hallucinate because they over-rely on linguistic priors. Existing decoding calibration strategies are not trainable, which limits how far they can be optimized, so a learnable, parameter-level intervention is needed. Method: Builds an activation dataset of grounded and hallucinatory responses, analyzes the hidden-state differences of response pairs to locate hallucination-prone parameter clusters, and fine-tunes them with adversarially optimized prefixes, forcing the model to attend to visual features rather than linguistic priors. Result: Experiments on generative and discriminative VLM tasks show that ALEAHallu effectively reduces hallucination and improves alignment with visual inputs. Conclusion: ALEAHallu offers a trainable, parameter-level approach to hallucination mitigation; adversarial editing strengthens VLMs' use of visual information and suppresses the bias introduced by linguistic priors. Abstract: While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.

[106] iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception

Sarthak Mehrotra,Sairam V C Rebbapragada,Mani Hemanth Reddy Bonthu,Vineeth N Balasubramanian

Main category: cs.CV

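TL;DR: Proposes iSHIFT, a lightweight multimodal large language model agent that combines implicit slow-fast hybrid inference with flexible perception tokens for efficient and precise interaction in GUI environments.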

Details Motivation: Existing GUI agents perform poorly on tasks that require precise visual grounding, and the underlying models are large and cannot adjust reasoning depth to the task. Method: Proposes the iSHIFT framework, which uses implicit chain-of-thought and a perception control module so the model can switch between a slow mode that relies on detailed visual grounding and a fast mode that uses global cues, with special perception tokens steering attention to key screen regions. Result: With only 2.5B parameters, iSHIFT performs on par with current state-of-the-art methods on multiple benchmarks. Conclusion: Through flexible dual-mode reasoning and attention guidance, iSHIFT strikes a good balance between efficiency and precision and suits agents operating in complex GUI environments. Abstract: Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision, and a fast mode, which uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.

[107] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

Wen Jiang,Li Wang,Kangyao Huang,Wei Fan,Jinyuan Liu,Shaoyu Liu,Hongwei Duan,Bin Xu,Xiangyang Ji

Main category: cs.CV

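TL;DR: Proposes LongFly, a spatiotemporal context modeling framework that addresses inaccurate semantic alignment and unstable path planning in long-horizon UAV vision-and-language navigation (VLN) in complex environments.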

Details Motivation: Existing UAV VLN methods struggle to model long-horizon spatiotemporal context in complex environments, leading to inaccurate semantic alignment and unstable path planning, especially in dynamic, information-dense scenarios such as post-disaster search and rescue. Method: LongFly adopts a history-aware spatiotemporal modeling strategy: 1) a slot-based historical image compression module distills multi-view historical observations into fixed-length contextual representations; 2) a spatiotemporal trajectory encoding module captures the temporal dynamics and spatial structure of UAV trajectories; 3) a prompt-guided multimodal integration module combines current observations with historical context to support time-based reasoning and robust waypoint prediction. Result: Experiments show that LongFly outperforms state-of-the-art methods in both seen and unseen environments, improving success rate by 7.89% and success weighted by path length by 6.33%. Conclusion: Through efficient compression of historical information and spatiotemporal context modeling, LongFly improves navigation performance on long-horizon UAV VLN tasks and shows good generalization and practical potential. Abstract: Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
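
The abstract reports gains in success rate and in success weighted by path length (SPL). For readers unfamiliar with SPL, the sketch below computes it in its standard form, the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually taken. The episode values are made up, and the paper's exact evaluation protocol may differ.

```python
def spl(episodes):
    """Success weighted by Path Length over (success, shortest_len, taken_len) tuples."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode earns credit in (0, 1], reduced by detours.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Hypothetical episodes: an optimal success, a success with a 2x detour, and a failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
score = spl(episodes)
print(score)
```

A failed episode contributes zero regardless of how short its path was, so SPL never exceeds the plain success rate.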

[108] Patch-Discontinuity Mining for Generalized Deepfake Detection

Huanhuan Yuan,Yang Ping,Zhengqin Xu,Junyi Cao,Shuai Jia,Chao Ma

Main category: cs.CV

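TL;DR: Proposes GenDF, an efficient deepfake detection framework that transfers a large-scale vision model with a compact network design, achieving state-of-the-art generalization in cross-domain and cross-manipulation settings with only 0.28M trainable parameters.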

Details Motivation: Existing deepfake detection methods generalize poorly to unseen forgery patterns and depend on handcrafted features and complex architectures, making it hard to cope with the diversity of real-world scenarios. Method: GenDF transfers a large-scale vision model and introduces deepfake-specific representation learning, feature space redistribution, and a classification-invariant feature augmentation strategy to improve detection performance and cross-domain generalization while keeping the network minimal. Result: Experiments show that GenDF achieves the best generalization across multiple cross-domain and cross-manipulation settings, clearly outperforming existing methods with only 0.28M trainable parameters. Conclusion: GenDF achieves strong generalization in deepfake detection with a simple and efficient framework, offering a solution that combines high robustness with low complexity for practical use. Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of highly realistic fake facial images, posing serious threats to personal privacy and the integrity of online information. Existing deepfake detection methods often rely on handcrafted forensic cues and complex architectures, achieving strong performance in intra-domain settings but suffering significant degradation when confronted with unseen forgery patterns. In this paper, we propose GenDF, a simple yet effective framework that transfers a powerful large-scale vision model to the deepfake detection task with a compact and neat network design. GenDF incorporates deepfake-specific representation learning to capture discriminative patterns between real and fake facial images, feature space redistribution to mitigate distribution mismatch, and a classification-invariant feature augmentation strategy to enhance generalization without introducing additional trainable parameters. Extensive experiments demonstrate that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, validating the effectiveness and efficiency of the proposed framework.

[109] Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang,Zhen Sun,Yifan Liao,Wenhan Dong,Xinlei He,Xingshuo Han,Shengmin Xu,Xinyi Huang

Main category: cs.CV

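TL;DR: Proposes BadVSFM, the first backdoor attack framework targeting prompt-driven video segmentation foundation models (VSFMs), addressing the failure of conventional backdoor attacks on these models.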

Details Motivation: Classic backdoor attacks (such as BadNet) are found to be almost ineffective on VSFMs (ASR below 5%), which calls for understanding the cause and designing a tailored attack. Method: Proposes a two-stage strategy: 1) steer the image encoder so that triggered frames map to a target embedding while clean frames stay aligned with a reference encoder; 2) train the mask decoder so that triggered frame-prompt pairs produce a shared target mask while clean output quality is preserved. Result: Experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects while preserving clean segmentation performance; gradient and attention analyses support its effectiveness. Conclusion: BadVSFM exposes an under-recognized security vulnerability in current VSFMs that existing defenses struggle to counter, underscoring the need to attend to the security of prompt-driven models. Abstract: Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design.
Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.

[110] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Hanzhang Zhou,Xu Zhang,Panrong Tong,Jianan Zhang,Liangyu Chen,Quyu Kong,Chenglin Cai,Chen Liu,Yue Wang,Jingren Zhou,Steven Hoi

Main category: cs.CV

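TL;DR: MAI-UI is a new family of GUI agents that uses a self-evolving data pipeline, a device-cloud collaboration system, and an online reinforcement learning framework to address key challenges of existing GUI agents in interactivity, deployment architecture, and adaptation to dynamic environments, reaching SOTA performance on multiple benchmarks.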

Details Motivation: Existing GUI agents lack native user interaction, are limited to UI-only operation, have no practical deployment architecture, and are brittle in dynamic environments, all of which restrict real-world use. Method: Proposes MAI-UI, built on four elements: 1) a self-evolving data pipeline that expands navigation data with user interaction and MCP tool calls; 2) a native device-cloud collaboration system that routes execution according to task state; 3) an online reinforcement learning framework that supports large-scale parallel environments and long-context training; 4) scalable model sizes (2B to 235B). Result: Reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision; navigation success is 76.7% on AndroidWorld and 41.7% on MobileWorld; on-device performance improves by 33% and cloud calls drop by over 40%; scaling parallel environments from 32 to 512 yields +5.2 points, and raising the step budget from 15 to 50 yields +4.3 points. Conclusion: With a unified methodology, MAI-UI balances performance, efficiency, and privacy, makes real-world deployment of GUI agents more feasible, and surpasses Gemini and Seed series models on several benchmarks to set a new SOTA. Abstract: The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks.
Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.

[111] StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars

Zhiyao Sun,Ziqiao Peng,Yifeng Ma,Yi Chen,Zhengguang Zhou,Zixiang Zhou,Guozhen Zhang,Youliang Zhang,Yuan Zhou,Qinglin Lu,Yong-Jin Liu

Main category: cs.CV

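TL;DR: Proposes a two-stage autoregressive adaptation and acceleration framework for real-time, interactive, streaming full-body human avatar generation, addressing the non-causal architecture and high computational cost of existing diffusion models.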

Details Motivation: Existing diffusion models are hard to use for real-time streaming interaction because of their non-causal architecture and high computational cost, and most interactive methods are confined to the head-and-shoulder region and cannot express body motion. Method: Proposes a two-stage autoregressive adaptation and acceleration framework comprising autoregressive distillation and adversarial refinement, and introduces a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator to improve long-term stability and consistency. Result: Experiments show that the method surpasses existing approaches in generation quality, real-time efficiency, and interaction naturalness, and supports one-shot generation of natural talking and listening behaviors with coherent gestures. Conclusion: The proposed framework achieves high-quality, real-time, interactive full-body avatar generation and advances digital human technology for streaming applications. Abstract: Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to the head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io

[112] Yume-1.5: A Text-Controlled Interactive World Generation Model

Xiaofeng Mao,Zhen Li,Chuanhao Li,Xiaojie Xu,Kaining Ying,Tong He,Jiangmiao Pang,Yu Qiao,Kaipeng Zhang

Main category: cs.CV

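TL;DR: Proposes Yume-1.5, a new framework that generates realistic, interactive, and continuous worlds from a single image or text prompt and supports keyboard-based exploration, addressing the bottlenecks of existing diffusion models in parameter count, number of inference steps, and growing historical context.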

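Details Motivation: Existing diffusion-based generation methods suffer from excessively large parameter counts, reliance on multi-step inference, and rapidly growing historical context, which limit real-time performance and text-controlled generation. Method: Yume-1.5 comprises three core components: (1) a long-video generation framework combining unified context compression with linear attention; (2) a real-time streaming acceleration strategy based on bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. Result: The framework generates interactive, continuous, dynamic worlds from a single image or text prompt, significantly improves generation efficiency, and supports real-time keyboard exploration and text-guided world evolution. Conclusion: Yume-1.5 overcomes the real-time and controllability limitations of conventional diffusion models, offering a feasible route to lightweight, efficient, text-controllable virtual world generation systems. Abstract: Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.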

[113] Learning Association via Track-Detection Matching for Multi-Object Tracking

Momir Adžemović

Main category: cs.CV

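TL;DR: This paper proposes Track-Detection Link Prediction (TDLP), a multi-object tracking method that performs association by predicting links between tracks and detections at every frame; it is both efficient and learned from data, and it surpasses state-of-the-art methods on multiple benchmarks.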

Details Motivation: Existing multi-object tracking methods either rely on handcrafted heuristics (tracking-by-detection) or incur high computational complexity (end-to-end). This paper aims to design a method that learns association directly from data while remaining modular and computationally efficient. Method: Proposes TDLP, which follows a link-prediction paradigm and predicts, at every frame, the correct continuation between each track and the detections; it relies mainly on geometric features (such as bounding boxes), optionally incorporates additional cues such as pose and appearance, and learns in a data-driven way without handcrafted rules. Result: Experiments on multiple benchmarks show that TDLP consistently surpasses state-of-the-art methods among both tracking-by-detection and end-to-end approaches; the analysis also shows that link prediction is more effective than metric learning-based association when handling heterogeneous features such as detection bounding boxes. Conclusion: TDLP is an efficient, modular, and high-performing multi-object tracking method that achieves high-quality data-driven association through link prediction and provides a strong new baseline for the tracking-by-detection framework. Abstract: Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in the literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.
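As a rough illustration of per-frame association from link predictions, the sketch below turns a matrix of track-to-detection link probabilities into one-to-one matches. The abstract does not say how TDLP resolves predicted links into assignments, so the greedy matching rule, the `associate` function, and the `min_prob` threshold are all assumptions made for illustration, not the authors' procedure.

```python
def associate(link_probs, min_prob=0.5):
    """Greedy one-to-one matching of tracks (rows) to detections (columns).

    link_probs[t][d] is a hypothetical predicted probability that
    detection d is the correct continuation of track t.
    """
    # Collect every candidate link that clears the (assumed) threshold.
    candidates = sorted(
        (
            (prob, track, det)
            for track, row in enumerate(link_probs)
            for det, prob in enumerate(row)
            if prob >= min_prob
        ),
        reverse=True,  # most confident links first
    )
    used_tracks, used_dets, matches = set(), set(), {}
    for prob, track, det in candidates:
        if track in used_tracks or det in used_dets:
            continue  # each track and each detection is used at most once
        matches[track] = det
        used_tracks.add(track)
        used_dets.add(det)
    return matches


# Three tracks, three detections; track 2 has no confident continuation.
link_probs = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.1],
    [0.1, 0.2, 0.3],
]
matches = associate(link_probs)
print(matches)
```

Tracks left out of `matches` would be handled by whatever track-management policy the tracker uses (for example, marking them as lost), which is outside this sketch.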

[114] ProEdit: Inversion-based Editing From Prompts Done Right

Zhi Ouyang,Dian Zheng,Xiao-Ming Wu,Jian-Jian Jiang,Kun-Yu Lin,Jingke Meng,Wei-Shi Zheng

Main category: cs.CV

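TL;DR: This paper proposes ProEdit, an image and video editing method that improves existing inversion-based visual editing in both the attention and latent aspects, addressing edits that are limited by over-reliance on source image information.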

Details Motivation: Existing inversion-based visual editing methods rely too heavily on source image information during sampling, which leads to poor edits in the target image (e.g., changes to attributes such as pose, number, or color). This paper therefore aims to reduce the influence of the source image on the edited region while maintaining background consistency. Method: In the attention aspect, introduces KV-mix, which mixes the KV features of the source and target images in the edited region; in the latent aspect, proposes Latents-Shift, which perturbs the edited region of the source latent to eliminate the influence of the inverted latent on sampling. The method is plug-and-play and can be integrated into existing editing frameworks. Result: Experiments on multiple image and video editing benchmarks show that ProEdit achieves state-of-the-art performance, carrying out user-specified edits more accurately while preserving background consistency and editing flexibility. Conclusion: By optimizing how information is injected at both the attention and latent levels, ProEdit mitigates over-reliance on the source image, improves editing quality, and generalizes well across a range of existing editing methods. Abstract: Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue both in the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play, which can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
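The idea of mixing source and target KV features only inside the edited region can be sketched as a masked blend. The abstract does not give the actual mixing rule, so the linear interpolation, the `kv_mix` function, the `mix_ratio` parameter, and the choice to keep pure source features outside the mask are assumptions for illustration only.

```python
import numpy as np


def kv_mix(src, tgt, edit_mask, mix_ratio=0.5):
    """Blend source and target features inside the edited region.

    src, tgt  : arrays of shape (num_tokens, dim), e.g. keys or values
    edit_mask : boolean array of shape (num_tokens,), True where editing
    mix_ratio : assumed weight on the source features inside the mask
    """
    mask = edit_mask[:, None].astype(src.dtype)
    mixed = mix_ratio * src + (1.0 - mix_ratio) * tgt
    # Edited tokens get the blend; background tokens keep source features,
    # which is one simple way to preserve background consistency.
    return mask * mixed + (1.0 - mask) * src


# Toy example: 4 tokens, 2 feature dims, tokens 1 and 2 are edited.
src_k = np.zeros((4, 2))
tgt_k = np.ones((4, 2))
edit_mask = np.array([False, True, True, False])
mixed_k = kv_mix(src_k, tgt_k, edit_mask, mix_ratio=0.25)
print(mixed_k)
```

The same function would be applied separately to keys and values; lowering `mix_ratio` reduces how much the source image influences the edited tokens.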

[115] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Shuoshuo Zhang,Yizhen Zhang,Jingjing Fu,Lei Song,Jiang Bian,Yujiu Yang,Rui Wang

Main category: cs.CV

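TL;DR: This paper proposes Bi-directional Perceptual Shaping (BiPS), a new method that uses question-conditioned masked views to generate bidirectional where-to-look signals during training, strengthening vision-language models' reliance on fine-grained visual evidence and improving cross-domain generalization.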

Details Motivation: Existing vision-language models benefit from intermediate visual cues but often overlook fine-grained visual evidence, generalize poorly across domains, and incur high inference cost. A more efficient and more generalizable way to improve model performance is therefore needed. Method: Proposes Bi-directional Perceptual Shaping (BiPS), which introduces two KL constraints during training: a KL-consistency constraint that keeps the original image consistent with an evidence-preserving view retaining only question-relevant regions, and a KL-separation constraint that pushes the original image apart from an evidence-ablated view in which critical pixels are masked, preventing the model from relying on text shortcuts. Result: Across eight benchmarks, BiPS improves Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types. Conclusion: By guiding the model to attend to fine-grained visual evidence and avoid text shortcuts, BiPS improves the performance and generalization of vision-language models, with low inference cost and broad application potential. Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
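The two KL constraints can be illustrated with a toy objective over answer distributions. The abstract does not specify the KL direction, the weighting, or how the terms are combined with the main training loss, so the function `bips_style_loss`, the `sep_weight` parameter, and the simple "consistency minus separation" form below are assumptions rather than the paper's formulation.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def bips_style_loss(logits_orig, logits_preserved, logits_ablated, sep_weight=1.0):
    """Toy objective: pull the original view toward the evidence-preserving
    view and push it away from the evidence-ablated view (assumed form)."""
    p_orig = softmax(logits_orig)
    consistency = kl(p_orig, softmax(logits_preserved))  # minimized
    separation = kl(p_orig, softmax(logits_ablated))     # maximized
    return consistency - sep_weight * separation


# Original and evidence-preserving views agree; the ablated view is
# uninformative, so the separation term dominates and the loss is negative.
logits_orig = np.array([2.0, 0.0, 0.0])
logits_preserved = np.array([2.0, 0.0, 0.0])
logits_ablated = np.array([0.0, 0.0, 0.0])
loss = bips_style_loss(logits_orig, logits_preserved, logits_ablated)
print(loss)
```

In practice an unbounded separation term would likely need clipping or a margin to keep training stable; that detail is left out of this sketch.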