cs.CL

[1] SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

Gyubok Lee,Woosog Chay,Edward Choi

Main category: cs.CL

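TL;DR: This paper presents SCARE, a benchmark for evaluating post-hoc safety verification mechanisms in electronic health record (EHR) question answering systems, targeting the safety and reliability of generated SQL queries.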

Details Motivation: Deploying text-to-SQL models in clinical settings is challenging because incorrect SQL queries can affect clinical decision-making and patient safety, and existing work lacks a unified standard for evaluating post-hoc verification mechanisms. Method: The authors build SCARE, a benchmark of 4,200 samples comprising questions, candidate SQL queries generated by seven different text-to-SQL models, and expected outputs, and evaluate approaches ranging from two-stage methods to agentic frameworks. Result: Experiments reveal a critical trade-off between question answerability classification and SQL error correction, showing that current methods still struggle to ensure safety. Conclusion: SCARE provides a standardized evaluation platform for safety verification in EHR QA systems and highlights future research directions in safety-layer design. Abstract: Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging: incorrect SQL queries, whether caused by model errors or problematic user inputs, can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is a lack of a unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., a component that inspects and validates the generated SQL before execution), which is crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that function as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions and corresponding candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches, from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.

[2] $A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving

Yuechi Zhou,Yi Su,Jianxin Zhang,Juntao Li,Qingrong Xia,Zhefeng Wang,Xinyu Duan,Baoxing Huai

Main category: cs.CL

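TL;DR: Proposes $A^3$, an attention-aware accurate KV cache fusion algorithm that precomputes and selectively fuses the KV cache of text chunks based on their relevance to the question, reducing decoding latency while markedly improving performance on long-context tasks.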

Details Motivation: Existing KV cache reuse methods, which aim to cut the decoding latency and memory overhead of LLMs on long contexts, suffer from performance degradation, mainly because the recomputed tokens fail to align with the context most relevant to the question. Method: The proposed $A^3$ algorithm scores text chunks by their relevance to the question, then precomputes and selectively fuses the KV cache of the key chunks, giving more accurate cache integration at low computational cost. Result: $A^3$ outperforms four baselines across multiple benchmarks and LLMs while reducing time-to-first-token (TTFT) by 2x. Conclusion: $A^3$ addresses the context-alignment problem in KV cache reuse and improves both inference efficiency and task performance in long-context settings. Abstract: Large language models (LLMs) have demonstrated strong capabilities in processing long contexts, enabling them to tackle tasks involving long textual inputs such as multi-turn conversations, legal documents, or retrieved documents in Retrieval-Augmented Generation (RAG) systems. However, despite their ability to handle long sequences, the resulting decoding latency and memory overhead remain substantial, posing challenges for real-world deployment. Recent advances in KV Cache reuse have shown potential to mitigate these costs, but still suffer from notable performance degradation. To address this issue, we conduct an in-depth investigation of recomputation-based reuse methods and observe that the recomputed tokens often fail to align with the context segments most relevant to the question. This misalignment hinders proper updates to the critical contextual representations. Therefore, we propose the $\textbf{A}$ttention-$\textbf{A}$ware $\textbf{A}$ccurate KV Cache Fusion algorithm ($A^3$), which precomputes and selectively fuses the KV Cache of text chunks based on their relevance to the question, achieving accurate integration with minimal computational overhead. Extensive experiments on various benchmarks and LLMs demonstrate that $A^3$ achieves the best task performance compared to four baselines while reducing the time-to-first-token (TTFT) by 2$\times$.

[3] LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Huimin Ren,Yan Liang,Baiqiao Su,Chaobo Sun,Hengtong Lu,Kaike Zhang,Chen Wei

Main category: cs.CL

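TL;DR: This paper presents LexInstructEval, a new benchmark and framework for evaluating fine-grained lexical instruction following in large language models, which uses a formal rule-based grammar to decompose complex instructions and verify them objectively.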

Details Motivation: Existing ways of evaluating how well LLMs follow complex lexical instructions are either subjective and costly (human evaluation) or biased (automated judges), and they lack fine-grained compositional tests. Method: The authors propose a formal rule grammar based on triplets, build a multi-stage human-in-the-loop pipeline that systematically generates diverse data, and develop a transparent programmatic engine for objective verification. Result: The resulting LexInstructEval benchmark and evaluation tools support fine-grained, interpretable, and scalable evaluation of instruction following. Conclusion: The framework offers a more objective and more expressive way to evaluate the controllability and reliability of LLMs. Abstract: The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.

[4] ChineseErrorCorrector3-4B: State-of-the-Art Chinese Spelling and Grammar Corrector

Wei Tian,Yuhao Zhou

Main category: cs.CL

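TL;DR: This paper presents ChineseErrorCorrector3-4B, a unified Chinese spelling and grammar correction model built on Qwen3-4B. On several authoritative benchmarks its F1 and F0.5 scores surpass existing publicly available models, ranking first in both spelling and grammatical error correction.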

Details Motivation: Improving Chinese spelling and grammatical error correction calls for a unified, efficient model that handles both error types. Method: The authors build ChineseErrorCorrector3-4B, a unified Chinese spelling and grammar correction model based on Qwen3-4B, and train and evaluate it on several standard datasets. Result: On authoritative benchmarks including SIGHAN-2015, EC-LAW, MCSC, and NaCGEC, the model's F1 and F0.5 scores significantly exceed those of existing publicly available models, placing it first. Conclusion: ChineseErrorCorrector3-4B is an efficient unified Chinese correction model that reaches the state of the art on both spelling and grammar correction. Abstract: This paper introduces ChineseErrorCorrector3-4B, a unified model for Chinese spelling and grammatical error correction based on Qwen3-4B. The model demonstrates outstanding performance in general text correction tasks and achieves state-of-the-art results in both spelling correction (CSC) and grammatical correction (CGC). On several authoritative benchmark datasets (including SIGHAN-2015, EC-LAW, MCSC, and NaCGEC), the model's F1 and F0.5 scores significantly surpass existing publicly available models, ranking first in both spelling and grammatical error correction tasks.
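
For readers unfamiliar with the two metrics reported above, the standard definitions are sketched below. This is the textbook $F_\beta$ formula rather than anything stated in the abstract; $P$ and $R$ denote precision and recall, and how corrections are matched to compute them depends on each benchmark's scorer, which the abstract does not specify.

```latex
F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\, P\, R}{P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
```

With $\beta = 0.5$ precision is weighted more heavily than recall, so a system that makes few spurious corrections is favored over one that corrects aggressively.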

[5] Generative Caching for Structurally Similar Prompts and Responses

Sarthak Chakraborty,Suman Nath,Xuchao Zhang,Chetan Bansal,Indranil Gupta

Main category: cs.CL

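TL;DR: Proposes \ourmethod{}, a generative caching method that identifies reusable response patterns across structurally similar prompts and generates customized outputs for new requests, substantially raising cache hit rate and execution efficiency while keeping the error rate low.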

Details Motivation: Existing caching methods handle structurally similar but slightly varied prompts poorly: exact matching cannot cover the variants, and semantic caching may ignore critical differences and return wrong responses. Method: \ourmethod{} identifies reusable response patterns across structurally similar prompts and generates responses tailored to each new request, making efficient cache use possible for structured, repeated prompts. Result: \ourmethod{} reaches an 83% cache hit rate with very few incorrect hits on datasets without prompt repetition; in agentic workflows it raises the hit rate by about 20% and cuts end-to-end execution latency by about 34% relative to standard prompt matching. Conclusion: \ourmethod{} balances cache efficiency against response accuracy and suits scenarios that repeatedly execute structured tasks, such as agentic and repeatable workflows. Abstract: Large Language Models (LLMs) are increasingly being used to plan, reason, and execute tasks across diverse scenarios. In use cases like repeatable workflows and agentic settings, prompts are often reused with minor variations while having a similar structure for recurring tasks. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce \ourmethod{}, a generative cache that produces variation-aware responses for structurally similar prompts. \ourmethod{} identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that \ourmethod{} achieves 83\% cache hit rate, while having minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves cache hit rate by $\sim$20\% and reduces end-to-end execution latency by $\sim$34\% compared to standard prompt matching.

[6] Community-Aligned Behavior Under Uncertainty: Evidence of Epistemic Stance Transfer in LLMs

Patrick Gerard,Aiden Chang,Svitlana Volkova

Main category: cs.CL

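TL;DR: The study finds that aligned LLMs, even after knowledge of specific events is deleted, retain the uncertainty-handling behavior of the online community they were aligned to, indicating that alignment encodes structured, generalizable behaviors beyond surface mimicry.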

Details Motivation: The paper asks whether aligned LLMs merely reproduce patterns from their training data or actually acquire a community's attitudes and behavior when facing new uncertainty. Method: The authors propose a framework for detecting epistemic stance transfer: event knowledge is deleted in a targeted way and the deletion is validated with multiple probes, then the model is tested under ignorance to see whether it still reproduces the community's characteristic response patterns. Experiments use Russian-Ukrainian military discourse and U.S. partisan Twitter data. Result: Even after aggressive fact removal, aligned LLMs consistently show community-specific patterns for handling uncertainty. Conclusion: Alignment leads models to encode structured, generalizable behaviors rather than surface mimicry alone; the framework helps detect behavioral biases that persist under ignorance, supporting safer and more transparent LLM deployment. Abstract: When large language models (LLMs) are aligned to a specific online community, do they exhibit generalizable behavioral patterns that mirror that community's attitudes and responses to new uncertainty, or are they simply recalling patterns from training data? We introduce a framework to test epistemic stance transfer: targeted deletion of event knowledge, validated with multiple probes, followed by evaluation of whether models still reproduce the community's organic response patterns under ignorance. Using Russian-Ukrainian military discourse and U.S. partisan Twitter data, we find that even after aggressive fact removal, aligned LLMs maintain stable, community-specific behavioral patterns for handling uncertainty. These results provide evidence that alignment encodes structured, generalizable behaviors beyond surface mimicry. Our framework offers a systematic way to detect behavioral biases that persist under ignorance, advancing efforts toward safer and more transparent LLM deployments.

[7] Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models

Vladimir Berman

Main category: cs.CL

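TL;DR: This paper proposes a fully non-linguistic model of text based only on random generation of letters and a space symbol, and derives structural results on word-length distribution, vocabulary growth, critical length, and Zipf's law, showing that Zipf-like behavior can arise from combinatorics and segmentation alone, without linguistic structure or optimization principles.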

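Details Motivation: The paper explores whether statistical regularities of natural language, such as the word-frequency distribution, can arise without linguistic structure (grammar, semantics) purely from the combination and segmentation of symbol sequences, yielding a structured null model that separates out the phenomena needing deeper linguistic explanation. Method: The author builds a model of independent random symbols drawn from a finite alphabet plus a space symbol, defines a word as a maximal block of non-space symbols, and uses the geometric distribution, the coupon-collector problem, and combinatorial probability to derive closed-form expressions for the word-length distribution, the expected number of words, the number of distinct words, and the rank-frequency relation. Result: (1) Word lengths follow a geometric distribution determined by the space probability; (2) the number of words of a given length and the number of distinct types of that length have closed forms, and there is a critical length $k^*$ beyond which most words occur only once; (3) combining the exponential growth in the number of strings with the exponential decay of their probabilities gives a Zipf-type power law whose parameters are determined by the alphabet size and the space probability. Conclusion: Zipf's law and related statistics can emerge from the combinatorial structure of random symbol sequences and the segmentation mechanism alone, without internal organizing principles or optimization, which gives a conceptually clean null baseline for analyzing token statistics in large language models. Abstract: We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length $k^*$ at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length $k$ with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law $p(r) \propto r^{-\alpha}$, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.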
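
The first structural claim above (geometric word lengths governed only by the space probability) is easy to check numerically. The following is a minimal simulation sketch, not the author's code: the function names, the four-letter alphabet, the uniform letter probabilities, and the space probability of 0.2 are all illustrative assumptions. Under the model, a word that has started continues with probability $1 - q$ at each symbol, so $P(L = k) = (1 - q)^{k-1} q$ for $k \ge 1$, where $q$ is the space probability.

```python
import random
from collections import Counter

def random_text(n_symbols, alphabet="abcd", p_space=0.2, seed=0):
    """Draw n_symbols independent symbols: a space with probability p_space,
    otherwise a uniformly chosen letter (uniform letters are an assumption)."""
    rng = random.Random(seed)
    return "".join(
        " " if rng.random() < p_space else rng.choice(alphabet)
        for _ in range(n_symbols)
    )

def word_length_distribution(text):
    """Empirical distribution of word lengths, where a word is a maximal
    run of non-space symbols (str.split() drops empty runs)."""
    words = text.split()
    counts = Counter(len(w) for w in words)
    return {k: c / len(words) for k, c in sorted(counts.items())}

P_SPACE = 0.2  # hypothetical value chosen for illustration
text = random_text(200_000, p_space=P_SPACE)
empirical = word_length_distribution(text)

# Compare the simulated frequencies with the geometric prediction.
for k in range(1, 6):
    predicted = (1 - P_SPACE) ** (k - 1) * P_SPACE
    print(k, round(empirical.get(k, 0.0), 4), round(predicted, 4))
```

Changing the alphabet size leaves the word-length distribution unchanged, which matches the statement that it depends solely on the space probability; the alphabet size matters only for the vocabulary and rank-frequency results.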

[8] Computational frame analysis revisited: On LLMs for studying news coverage

Sharaj Kunjar,Alyssa Hasegawa Smith,Tyler R Mckenzie,Rushali Mohbe,Samuel V Scarpino,Brooke Foucault Welles

Main category: cs.CL

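TL;DR: This study systematically evaluates generative LLMs (such as GPT and Claude) for media frame analysis against earlier computational methods and manual coding, finds that they still fall short, stresses the need for human validation, and proposes a methodologically pluralistic roadmap for computational frame analysis.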

Details Motivation: The paper examines how well generative LLMs actually identify media frames, addressing the lack of evidence on how they compare with earlier methods (bag-of-words models, encoder-only transformers) and with manual coding. Method: The authors build a new gold standard dataset, developed inductively and iteratively, from six months of news coverage of the 2022 US Mpox epidemic, and compare generative LLMs, bag-of-words models, encoder-only transformers, and manual coders on frame analysis tasks. Result: Generative LLMs show potential on some tasks but are consistently outperformed by manual coders and sometimes by smaller language models; the suitability of each method depends on the specific task, and human validation is essential for choosing a model. Conclusion: The authors endorse a methodologically pluralistic strategy that combines the approaches and offer a roadmap for future computational frame analysis. Abstract: Computational approaches have previously shown various promises and pitfalls when it comes to the reliable identification of media frames. Generative LLMs like GPT and Claude are increasingly being used as content analytical tools, but how effective are they for frame analysis? We address this question by systematically evaluating them against their computational predecessors (bag-of-words models and encoder-only transformers) and against traditional manual coding procedures. Our analysis rests on a novel gold standard dataset that we inductively and iteratively developed through the study, investigating six months of news coverage of the US Mpox epidemic of 2022. While we discover some potential applications for generative LLMs, we demonstrate that they were consistently outperformed by manual coders, and in some instances, by smaller language models. Some form of human validation was always necessary to determine appropriate model choice. Additionally, by examining how the suitability of various approaches depended on the nature of different tasks that were part of our frame analytical workflow, we provide insights as to how researchers may leverage the complementarity of these approaches to use them in tandem. We conclude by endorsing a methodologically pluralistic approach and put forth a roadmap for computational frame analysis for researchers going forward.

[9] PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese

Thales Sales Almeida,Rodrigo Nogueira,Hélio Pedrini

Main category: cs.CL

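TL;DR: This paper introduces PoETa v2, the most comprehensive evaluation benchmark for Portuguese LLMs to date, covering more than 40 tasks and more than 20 models; it analyzes how computational investment and language-specific adaptation affect Portuguese performance and compares results with equivalent English tasks.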

Details Motivation: Because LLM performance varies significantly across linguistic and cultural contexts, non-English languages such as Portuguese need systematic evaluation to advance multilingual models. Method: The authors introduce the PoETa v2 benchmark of over 40 Portuguese tasks, evaluate more than 20 models of different training scales and computational resources, and compare performance between Portuguese and English. Result: The study shows that computational investment and language-specific adaptation have a marked effect on Portuguese performance, and that gaps remain relative to English tasks. Conclusion: PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. Abstract: Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark, a comprehensive suite of over 40 tasks in Portuguese, we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.

[10] Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation

Scott Merrill,Shashank Srivastava

Main category: cs.CL

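TL;DR: This study presents a reproducible pipeline that turns public Zoom recordings into transcripts with speaker attribution, persona profiles, and pragmatic action tags, which improves the realism and performance of LLMs that simulate multi-party deliberation.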

Details Motivation: The lack of speaker-attributed data limits how well LLMs can model real multi-party dialogue, since anonymous ASR transcripts cannot capture the consistent behavior of individuals. Method: The authors build an automated pipeline that produces structured transcripts from public Zoom recordings, annotated with speaker identity, persona profiles, and pragmatic action tags (e.g., [propose_motion]), and fine-tune LLMs on this data. Result: Fine-tuning on the "action-aware" data reduces perplexity by 67% and nearly doubles classifier-based metrics for speaker fidelity and realism; Turing-style human evaluations show the simulations are often indistinguishable from real deliberations. Conclusion: The method offers a practical and scalable route to complex, realistic simulation of civic deliberation. Abstract: Large language models offer opportunities to simulate multi-party deliberation, but realistic modeling remains limited by a lack of speaker-attributed data. Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior. This work introduces a reproducible pipeline to transform public Zoom recordings into speaker-attributed transcripts with metadata like persona profiles and pragmatic action tags (e.g., [propose_motion]). We release three local government deliberation datasets: Appellate Court hearings, School Board meetings, and Municipal Council sessions. Fine-tuning LLMs to model specific participants using this "action-aware" data produces a 67% reduction in perplexity and nearly doubles classifier-based performance metrics for speaker fidelity and realism. Turing-style human evaluations show our simulations are often indistinguishable from real deliberations, providing a practical and scalable method for complex realistic civic simulations.

[11] A superpersuasive autonomous policy debating system

Allen Roush,Devin Gonier,John Hines,Judah Goldfeder,Philippe Martin Wyder,Sanjay Basu,Ravid Shwartz Ziv

Main category: cs.CL

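TL;DR: This paper presents DeepDebater, an autonomous system that can take part in and win a full, unmodified two-team policy debate. It uses a hierarchical architecture of multi-agent workflows together with a large evidence corpus and speech and animation generation, supports fully autonomous or human-AI debates, and outperforms human-authored cases in preliminary evaluations.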

Details Motivation: Complex, evidence-based, strategically adaptive persuasion remains very hard for AI, and earlier work such as IBM Project Debater was confined to simplified, shortened debate formats that fall short of real, high-intensity policy debate. Method: DeepDebater uses a hierarchical multi-agent architecture in which specialized agents collaborate on argument generation, retrieval, synthesis, and self-correction; it draws on the large OpenDebateEvidence corpus to produce complete speeches, cross-examinations, and rebuttals; it renders speech and talking-head video with OpenAI TTS and EchoMimic V1; and it supports both AI-vs-AI and hybrid human-AI rounds. Result: In simulated rounds, DeepDebater produces higher-quality argumentative components and consistently wins as adjudicated by an independent autonomous judge; expert debate coaches also prefer its arguments, evidence, and cases. Conclusion: DeepDebater achieves high-level autonomous persuasion in full policy debate, and its code and data are open-sourced to support follow-up research. Abstract: The capacity for highly complex, evidence-based, and strategically adaptive persuasion remains a formidable challenge for artificial intelligence. Previous work, like IBM Project Debater, focused on generating persuasive speeches in simplified and shortened debate formats intended for relatively lay audiences. We introduce DeepDebater, a novel autonomous system capable of participating in and winning a full, unmodified, two-team competitive policy debate. Our system employs a hierarchical architecture of specialized multi-agent workflows, where teams of LLM-powered agents collaborate and critique one another to perform discrete argumentative tasks. Each workflow utilizes iterative retrieval, synthesis, and self-correction using a massive corpus of policy debate evidence (OpenDebateEvidence) and produces complete speech transcripts, cross-examinations, and rebuttals. We introduce a live, interactive end-to-end presentation pipeline that renders debates with AI speech and animation: transcripts are surface-realized and synthesized to audio with OpenAI TTS, and then displayed as talking-head portrait videos with EchoMimic V1. Beyond fully autonomous matches (AI vs AI), DeepDebater supports hybrid human-AI operation: human debaters can intervene at any stage, and humans can optionally serve as opponents against AI in any speech, allowing AI-human and AI-AI rounds. In preliminary evaluations against human-authored cases, DeepDebater produces qualitatively superior argumentative components and consistently wins simulated rounds as adjudicated by an independent autonomous judge. Expert human debate coaches also prefer the arguments, evidence, and cases constructed by DeepDebater. We open source all code, generated speech transcripts, audio and talking head video here: https://github.com/Hellisotherpeople/DeepDebater/tree/main

[12] Principled Context Engineering for RAG: Statistical Guarantees via Conformal Prediction

Debashish Chakraborty,Eugene Yang,Daniel Khashabi,Dawn Lawrie,Kevin Duh

Main category: cs.CL

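TL;DR: This paper proposes a conformal-prediction-based approach to context engineering for Retrieval-Augmented Generation (RAG): a controllable filter that preserves recall of relevant evidence while substantially cutting noisy and redundant context, improving or maintaining downstream factual accuracy.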

Details Motivation: Existing RAG systems degrade on long or noisy contexts and offer no statistical control over which evidence is retained, so a tunable context filter with theoretical guarantees is needed. Method: The authors filter context within a conformal prediction framework, using embedding-based or LLM-based scoring functions, and evaluate filtering under different coverage targets on the NeuCLIR and RAGTIME collections. Result: Conformal filtering reliably meets its target coverage and reduces retained context by 2-3x; on NeuCLIR, strict filtering improves ARGUE F1 and accuracy stays stable at moderate coverage. Conclusion: Conformal prediction gives RAG a model-agnostic, principled, coverage-controlled approach to context engineering that balances context length against information retention. Abstract: Retrieval-Augmented Generation (RAG) enhances factual grounding in large language models (LLMs) by incorporating retrieved evidence, but LLM accuracy declines when long or noisy contexts exceed the model's effective attention span. Existing pre-generation filters rely on heuristics or uncalibrated LLM confidence scores, offering no statistical control over retained evidence. We evaluate and demonstrate context engineering through conformal prediction, a coverage-controlled filtering framework that removes irrelevant content while preserving recall of supporting evidence. Using both embedding- and LLM-based scoring functions, we test this approach on the NeuCLIR and RAGTIME collections. Conformal filtering consistently meets its target coverage, ensuring that a specified fraction of relevant snippets are retained, and reduces retained context by 2-3x relative to unfiltered retrieval. On NeuCLIR, downstream factual accuracy measured by ARGUE F1 improves under strict filtering and remains stable at moderate coverage, indicating that most discarded material is redundant or irrelevant. These results demonstrate that conformal prediction enables reliable, coverage-controlled context reduction in RAG, offering a model-agnostic and principled approach to context engineering.
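
To make "coverage-controlled filtering" concrete, here is a minimal sketch of the generic split-conformal recipe applied to snippet scores. It is an illustration under stated assumptions, not the paper's procedure: the function names, the example scores, and the choice of calibrating only on snippets known to be relevant are hypothetical. Given $n$ calibration scores of relevant snippets and a miscoverage level $\alpha$, the threshold is the $\lfloor (n+1)\alpha \rfloor$-th smallest calibration score; keeping every snippet at or above it retains a new relevant snippet with probability at least $1 - \alpha$, marginally and assuming the calibration and test scores are exchangeable.

```python
import math

def conformal_threshold(calib_relevant_scores, alpha):
    """Split-conformal threshold on relevance scores (higher = more relevant).

    calib_relevant_scores: scores of held-out snippets known to be relevant.
    alpha: tolerated fraction of relevant snippets that may be dropped.
    """
    n = len(calib_relevant_scores)
    k = math.floor((n + 1) * alpha)
    if k < 1:
        # Too few calibration points for this alpha: keep everything.
        return float("-inf")
    return sorted(calib_relevant_scores)[k - 1]

def filter_snippets(snippets, scores, threshold):
    """Keep the snippets whose score is at or above the threshold."""
    return [s for s, sc in zip(snippets, scores) if sc >= threshold]

# Hypothetical calibration scores from an embedding- or LLM-based scorer.
calib = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8]
threshold = conformal_threshold(calib, alpha=0.25)

retrieved = ["s1", "s2", "s3", "s4"]
retrieved_scores = [0.95, 0.15, 0.40, 0.05]
kept = filter_snippets(retrieved, retrieved_scores, threshold)
print(threshold, kept)
```

Lowering `alpha` moves the threshold down and keeps more context. That is the trade-off between target coverage and context length that the abstract describes.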

[13] L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

Yuliang Zhan,Xinyu Tang,Han Wan,Jian Li,Ji-Rong Wen,Hao Sun

Main category: cs.CL

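TL;DR: This paper proposes L2V-CoT, a training-free method for cross-modal transfer of chain-of-thought (CoT) reasoning. Using Linear Artificial Tomography (LAT), it finds that LLMs and VLMs share similar low-frequency latent representations, and it strengthens VLM multi-step reasoning by extracting and injecting low-frequency features in the frequency domain.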

Details Motivation: Vision-Language Models (VLMs) do poorly on multi-step reasoning, mainly because multimodal reasoning data is scarce. LLMs have gained substantially from CoT, but transferring that ability to VLMs currently requires either costly training or architectural alignment, so an efficient, training-free transfer method is needed. Method: LAT analysis shows that LLMs and VLMs have similar low-frequency latent CoT representations. L2V-CoT extracts low-frequency CoT representations from the LLM in the frequency domain, resamples them to match dimensions, and injects them into the VLM's latent space at inference time, with no additional training. Result: L2V-CoT consistently outperforms existing training-free baselines on multiple benchmarks and even surpasses some supervised methods. Conclusion: Transferable low-frequency latent CoT representations exist between LLMs and VLMs, and L2V-CoT provides an efficient, training-free way to transfer reasoning ability to VLMs. Abstract: Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision-Language Models (VLMs) still struggle with multi-step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either incur high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V-CoT extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.

[14] Towards Efficient LLM-aware Heterogeneous Graph Learning

Wenda Li,Tongya Zheng,Shunyu Liu,Yu Wang,Kaixuan Chen,Hanyang Yuan,Bingde Hu,Zujie Ren,Mingli Song,Gang Chen

Main category: cs.CL

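TL;DR: Proposes ELLA, an Efficient LLM-Aware framework for modeling complex relation semantics in heterogeneous graphs. It combines an LLM-aware relation tokenizer, a hop-level relation graph transformer, and fine-grained task-aware chain-of-thought prompts to improve both performance and efficiency.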

Details Motivation: Existing methods are limited by predefined semantic dependencies and scarce supervision, the use of LLMs on heterogeneous graphs is constrained by computational complexity, and there is a semantic gap between pre-training and fine-tuning tasks. Method: The ELLA framework consists of (1) an LLM-aware Relation Tokenizer that uses an LLM to encode multi-hop, multi-type relations; (2) a Hop-level Relation Graph Transformer that reduces the complexity of relation reasoning from exponential to linear; and (3) fine-grained task-aware Chain-of-Thought prompts that bridge the semantic gap between tasks. Result: On four heterogeneous graph datasets, ELLA outperforms state-of-the-art methods in both performance and efficiency, scales to 13B-parameter LLMs, and is up to 4x faster than existing LLM-based methods. Conclusion: ELLA addresses relation-semantics modeling, high computational complexity, and the task semantic gap in heterogeneous graphs, offering a workable way to apply large LLMs to them efficiently. Abstract: Heterogeneous graphs are widely present in real-world complex networks, where the diversity of node and relation types leads to complex and rich semantics. Efforts for modeling complex relation semantics in heterogeneous graphs are restricted by the limitations of predefined semantic dependencies and the scarcity of supervised signals. The advanced pre-training and fine-tuning paradigm leverages graph structure to provide rich self-supervised signals, but introduces semantic gaps between tasks. Large Language Models (LLMs) offer significant potential to address the semantic issues of relations and tasks in heterogeneous graphs through their strong reasoning capabilities in textual modality, but their incorporation into heterogeneous graphs is largely limited by computational complexity. Therefore, in this paper, we propose an Efficient LLM-Aware (ELLA) framework for heterogeneous graphs, addressing the above issues. To capture complex relation semantics, we propose an LLM-aware Relation Tokenizer that leverages an LLM to encode multi-hop, multi-type relations. To reduce computational complexity, we further employ a Hop-level Relation Graph Transformer, which helps reduce the complexity of LLM-aware relation reasoning from exponential to linear. To bridge semantic gaps between pre-training and fine-tuning tasks, we introduce fine-grained task-aware textual Chain-of-Thought (CoT) prompts. Extensive experiments on four heterogeneous graphs show that our proposed ELLA outperforms state-of-the-art methods in both performance and efficiency. In particular, ELLA scales up to 13B-parameter LLMs and achieves up to a 4x speedup compared with existing LLM-based methods. Our code is publicly available at https://github.com/l-wd/ELLA.

[15] SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

Jianghao Wu,Yasmeen George,Jin Ye,Yicheng Wu,Daniel F. Schmidt,Jianfei Cai

Main category: cs.CL

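TL;DR: Proposes SPINE, a token-selective test-time reinforcement learning framework built around branch points. By updating only high-entropy "forking tokens" and adding an entropy-band regularizer, it avoids the response collapse seen in existing methods without labels or reward models, and improves reasoning performance across many task types.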

Details Motivation: Existing test-time reinforcement learning methods apply uniform sequence updates, which leads to shorter responses, lower Pass@1, and unstable training, and they pay no attention to the key branch points in a reasoning path. Method: SPINE identifies high-entropy forking tokens from forward-pass statistics, applies policy updates only at those positions, and uses an entropy-band regularizer to sustain exploration and suppress noise. It plugs into GRPO-style objectives (optionally with a KL anchor) for unsupervised test-time adaptation. Result: On ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1, avoids response-length collapse, and trains more stably. Conclusion: Focusing test-time updates on the branch points of the reasoning chain is a simple, label-free, and effective way to stabilize and strengthen test-time adaptation of reasoning models. Abstract: Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
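
The two ingredients named above, selecting high-entropy tokens and penalizing entropy outside a band, can be illustrated with a few lines of plain Python. This is a rough sketch of the general idea only: the function names, the top-fraction selection rule, the hinge-style penalty, and the band limits are assumptions made for illustration, and the abstract does not say how SPINE actually defines them (the linked repository is the authoritative source).

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_forking_tokens(entropies, top_fraction=0.2):
    """Indices of the highest-entropy positions (hypothetical selection rule)."""
    k = max(1, int(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

def band_penalty(entropy, low, high):
    """Zero inside [low, high]; grows linearly when entropy leaves the band
    (an assumed hinge form, used here only to show the shape of such a term)."""
    return max(0.0, low - entropy) + max(0.0, entropy - high)

# Toy per-position logits over a three-word vocabulary.
sequence_logits = [[8.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 0.0, 0.0], [2.0, 1.9, 0.0]]
entropies = [token_entropy(l) for l in sequence_logits]
forking = select_forking_tokens(entropies, top_fraction=0.5)
penalties = [band_penalty(entropies[i], low=0.3, high=1.0) for i in forking]
print(forking, penalties)
```

In a training loop, a mask built from `forking` would restrict the policy-gradient loss to those positions, which is the sense in which the updates are token-selective.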

[16] Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models

Shuo Zhang,Fabrizio Gotti,Fengran Mo,Jian-Yun Nie

Main category: cs.CL

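TL;DR: This paper investigates whether pretraining data coverage can serve as a signal for hallucination detection in LLMs and proposes combining lexical coverage features with log-probabilities to improve detection.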

Details Motivation: Prior work shows that LLMs do worse on long-tail knowledge, but whether data coverage itself can be used to detect hallucination has not been studied in depth. The paper tests whether lexical training-data coverage of the question and the generated answer adds signal for hallucination detection. Method: The authors build scalable suffix arrays over the 1.3-trillion-token RedPajama corpus, extract n-gram statistics for prompts and model generations, and evaluate their usefulness for hallucination detection on three QA benchmarks. Result: Occurrence-based features are weak predictors on their own but give modest gains when combined with log-probabilities, especially on datasets where model uncertainty is higher. Conclusion: Lexical coverage features provide a useful complementary signal for hallucination detection. Abstract: Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops for the ground-truth entities that are rare in pretraining. However, whether data coverage itself can serve as a detection signal has been overlooked. We propose a complementary question: Does lexical training-data coverage of the question and/or generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at https://github.com/WWWonderer/ostd.
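
The core lookup behind such coverage features, counting how often an n-gram occurs in a corpus via a suffix array, can be sketched on a toy scale. The example below is a naive in-memory illustration with hypothetical function names and a made-up eight-token corpus; it is not the authors' infrastructure, which has to handle a corpus that is many orders of magnitude larger. The idea is that once all suffixes are sorted, every occurrence of an n-gram sits in one contiguous range, which binary search can locate.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically.
    Naive construction: fine for a toy corpus, far too slow for a real one."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, suffix_array, ngram):
    """Number of occurrences of ngram, found by binary search over suffix prefixes.
    The prefix list is materialized for clarity; a real system would compare lazily."""
    n = len(ngram)
    prefixes = [tokens[i:i + n] for i in suffix_array]  # sorted, since the suffixes are
    target = list(ngram)
    return bisect_right(prefixes, target) - bisect_left(prefixes, target)

corpus = "the cat sat on the mat the cat".split()
sa = build_suffix_array(corpus)

bigram_count = count_ngram(corpus, sa, ["the", "cat"])
unseen_count = count_ngram(corpus, sa, ["the", "dog"])
print(bigram_count, unseen_count)
```

Counts like these, computed for the n-grams of a question or a generated answer, are the kind of occurrence-based feature that could then be fed to a detector alongside log-probabilities.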

[17] MTikGuard System: A Transformer-Based Multimodal System for Child-Safe Content Moderation on TikTok

Dat Thanh Nguyen,Nguyen Hung Lam,Anh Hoang-Thi Nguyen,Trong-Hop Do

Main category: cs.CL

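TL;DR: This paper presents MTikGuard, a real-time multimodal harmful content detection system for TikTok. By extending the dataset, fusing multimodal features, and building a scalable streaming architecture, it detects harmful content efficiently in realistic settings.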

Details Motivation: TikTok is widely popular among young people, and harmful content can negatively affect their perception and behavior, while traditional moderation cannot keep up with the volume and real-time nature of uploads, so a more effective detection system is needed. Method: MTikGuard has three parts: the TikHarm dataset extended to 4,723 labeled videos; a multimodal classification framework that fuses visual, audio, and textual features; and a scalable streaming architecture built on Apache Kafka and Apache Spark for real-time deployment. Result: The system reaches 89.37% accuracy and 89.45% F1-score, outperforming existing methods, and supports large-scale real-time deployment. Conclusion: Combining dataset expansion, advanced multimodal fusion, and a robust deployment architecture improves harmful content detection on social media platforms in practice. Abstract: With the rapid rise of short-form videos, TikTok has become one of the most influential platforms among children and teenagers, but also a source of harmful content that can affect their perception and behavior. Such content, often subtle or deceptive, challenges traditional moderation methods due to the massive volume and real-time nature of uploads. This paper presents MTikGuard, a real-time multimodal harmful content detection system for TikTok, with three key contributions: (1) an extended TikHarm dataset expanded to 4,723 labeled videos by adding diverse real-world samples, (2) a multimodal classification framework integrating visual, audio, and textual features to achieve state-of-the-art performance with 89.37% accuracy and 89.45% F1-score, and (3) a scalable streaming architecture built on Apache Kafka and Apache Spark for real-time deployment. The results demonstrate the effectiveness of combining dataset expansion, advanced multimodal fusion, and robust deployment for practical large-scale social media content moderation. The dataset is available at https://github.com/ntdat-8324/MTikGuard-System.git.

[18] Blu-WERP (Web Extraction and Refinement Pipeline): A Scalable Pipeline for Preprocessing Large Language Model Datasets

Gowtham,Sai Rupesh,Sanjay Kumar,Saravanan,Venkata Chaithanya

Main category: cs.CL

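TL;DR: This paper proposes Blu-WERP, a new preprocessing pipeline for improving the quality of large language model training data, which significantly outperforms existing methods (such as DCLM and Fineweb) across multiple model scales and benchmarks.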

Details Motivation: High-quality training data is essential for large language models, but existing preprocessing pipelines for web-scale corpora struggle to remove noise and unstructured content, so a more effective preprocessing approach is needed. Method: Proposes the Blu-WERP preprocessing pipeline, which applies advanced filtering and quality assessment to Common Crawl WARC files, and evaluates it on multiple benchmarks with models from 150M to 1B parameters. Result: Blu-WERP outperforms the baselines at every model scale; at 1B parameters it achieves 4.0% and 9.5% aggregate improvements over DCLM and Fineweb respectively, with clear gains on knowledge, language understanding, and commonsense reasoning tasks, and improved quality-per-token efficiency. Conclusion: Blu-WERP is an advanced data preprocessing pipeline that significantly improves LLM training data quality and downstream performance while reducing computational cost, advancing data-centric AI research. Abstract: High-quality training data is fundamental to large language model (LLM) performance, yet existing preprocessing pipelines often struggle to effectively remove noise and unstructured content from web-scale corpora. This paper presents Blu-WERP, a novel data preprocessing pipeline designed to optimize the quality of Common Crawl WARC files for LLM training. We demonstrate that Blu-WERP significantly outperforms established baselines including DCLM across multiple model scales and evaluation benchmarks. Our pipeline processes CC WARC dumps, implementing advanced filtering and quality assessment mechanisms. We conducted comprehensive evaluations using models with 150M, 400M, 530M, 750M, and 1B parameters, testing against nine standard benchmarks categorized as World Knowledge & Reasoning, Language Understanding, and Commonsense Reasoning. Results show Blu-WERP consistently achieved superior performance across all model scales. At the 1B parameter scale, Blu-WERP demonstrates relative aggregate improvements of 4.0% and 9.5% over DCLM and Fineweb, respectively, while achieving a quality-per-token efficiency gain. Categorical analysis reveals a 2.4% improvement in World Knowledge & Reasoning, a 6.2% improvement in Language Understanding, and a 4.2% improvement in Commonsense Reasoning. These results establish Blu-WERP as a state-of-the-art preprocessing pipeline that substantially improves LLM training data quality and downstream model performance with reduced computational cost. 
Our findings contribute to the growing body of research on data-centric AI, demonstrating that preprocessing pipeline design significantly impacts LLM capabilities. The Blu-WERP pipeline represents a practical advancement in data quality optimization, offering researchers and practitioners an effective solution for improving LLM training efficiency and model performance.

[19] GeeSanBhava: Sentiment Tagged Sinhala Music Video Comment Data Set

Yomal De Mel,Nisansa de Silva

Main category: cs.CL

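TL;DR: This study presents GeeSanBhava, a high-quality dataset of Sinhala YouTube song comments manually annotated by three independent annotators using Russell's Valence-Arousal model, with high inter-annotator agreement (Fleiss kappa = 84.96%). After pre-training several machine learning and deep learning models, the best three-layer MLP achieves an ROC-AUC of 0.887. The work provides a valuable resource and insights for Sinhala NLP and music emotion recognition.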

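Details Motivation: The lack of high-quality, manually annotated Sinhala social media sentiment datasets, especially for music comments, limits progress on emotion recognition for the local language. Method: Sinhala song comments are extracted from YouTube and manually annotated by three independent annotators following Russell's Valence-Arousal model; after computing inter-annotator agreement, pre-trained machine learning and deep learning models are applied in a zero-shot transfer setting, and a multi-layer perceptron (MLP) is tuned over hyperparameters for best performance. Result: A high-quality annotated dataset named GeeSanBhava is built, with high inter-annotator agreement (Fleiss kappa = 84.96%); different songs show markedly different emotional profiles; the best MLP (256-128-64 architecture) reaches an ROC-AUC of 0.887; the study also surfaces biases in user-generated content and the challenges of comparing comment-based and song-based emotions. Conclusion: The GeeSanBhava dataset is a reliable resource for Sinhala sentiment analysis, shows that comments can support fine-grained emotion mapping, and the proposed model performs well, supporting future research on music emotion recognition in local languages. Abstract: This study introduces GeeSanBhava, a high-quality data set of Sinhala song comments extracted from YouTube and manually tagged using Russell's Valence-Arousal model by three independent human annotators. The human annotators achieve a substantial inter-annotator agreement (Fleiss kappa = 84.96%). The analysis revealed distinct emotional profiles for different songs, highlighting the importance of comment-based emotion mapping. The study also addressed the challenges of comparing comment-based and song-based emotions, mitigating biases inherent in user-generated content. A number of machine learning and deep learning models were pre-trained on a related large data set of Sinhala News comments in order to report the zero-shot result of our Sinhala YouTube comment data set. An optimized Multi-Layer Perceptron model, after extensive hyperparameter tuning, achieved an ROC-AUC score of 0.887. The model is a three-layer MLP with a configuration of 256, 128, and 64 neurons. This research contributes a valuable annotated dataset and provides insights for future work in Sinhala Natural Language Processing and music emotion recognition.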
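Since the agreement figure above is a Fleiss kappa over three annotators, a short sketch of the standard Fleiss kappa computation may help. The four-category toy counts are hypothetical and do not come from the dataset; the paper's actual label set is not stated here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j.

    Assumes every item was rated by the same number of raters (at least 2)
    and that ratings are not all in a single category (else division by zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Expected agreement from the marginal category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical: 4 comments, 3 annotators, 4 emotion categories.
toy_counts = [[3, 0, 0, 0], [0, 3, 0, 0], [2, 1, 0, 0], [0, 0, 3, 0]]
print(round(fleiss_kappa(toy_counts), 4))
```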

[20] Vector Arithmetic in Concept and Token Subspaces

Sheridan Feucht,Byron Wallace,David Bau

Main category: cs.CL

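TL;DR: This paper studies how concept induction heads and token induction heads in large language models separate semantic from surface-level information, and achieves more accurate analogy arithmetic by transforming hidden states.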

Details Motivation: To understand how large language models separate and represent the semantic and surface-level information of the current word when predicting the next token. Method: Transform hidden states using the attention weights of concept induction heads and token induction heads, and evaluate analogy arithmetic and nearest-neighbor accuracy in Llama-2-7b. Result: Hidden states transformed with concept heads reach 80% accuracy on analogy arithmetic (well above the 47% of raw hidden states) and support semantic operations such as 'Athens' - 'Greece' + 'China' = 'Beijing'; token heads support morphological analogies such as 'coding' - 'code' + 'dance' = 'dancing'. Conclusion: Concept and token induction heads reveal disentangled subspaces for semantic and surface-level information, showing that this information can be explicitly extracted and manipulated through the attention mechanism. Abstract: In order to predict the next token, LLMs must represent semantic and surface-level information about the current word. Previous work identified two types of attention heads that disentangle this information: (i) Concept induction heads, which copy word meanings, and (ii) Token induction heads, which copy literal token representations (Feucht et al., 2025). We show that these heads can be used to identify subspaces of model activations that exhibit coherent semantic structure in Llama-2-7b. Specifically, when we transform hidden states using the attention weights of concept heads, we are able to more accurately perform parallelogram arithmetic (Mikolov et al., 2013) on the resulting hidden states, e.g., showing that "Athens" - "Greece" + "China" = "Beijing". This transformation allows for much higher nearest-neighbor accuracy (80%) than direct use of raw hidden states (47%). Analogously, we show that token heads allow for transformations that reveal surface-level word information in hidden states, allowing for operations like "coding" - "code" + "dance" = "dancing".
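The parallelogram arithmetic and nearest-neighbor check mentioned above can be sketched as follows. This covers only the arithmetic step, not the concept-head transformation; the three-dimensional toy vectors are made up, and excluding the three query words from the candidates is an assumption rather than a detail confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hypothetical toy embeddings: (is-capital, Greece-ness, China-ness).
toy_vectors = {
    "Athens": [1.0, 1.0, 0.0],
    "Greece": [0.0, 1.0, 0.0],
    "China": [0.0, 0.0, 1.0],
    "Beijing": [1.0, 0.0, 1.0],
    "Paris": [1.0, 0.0, 0.0],
}
print(analogy(toy_vectors, "Athens", "Greece", "China"))
```

Nearest-neighbor accuracy is then the fraction of analogy quadruples for which the returned word matches the expected one.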

[21] Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models

Elias Lumer,Matt Melich,Olivia Zino,Elena Kim,Sara Dieter,Pradeep Honaganahalli Basavaraju,Vamse Kumar Subbiah,James A. Burke,Roberto Hernandez

Main category: cs.CL

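TL;DR: This paper presents the first systematic comparison of vector-based agentic RAG and hierarchical node-based architectures for financial document question answering, and evaluates two enhancement techniques, cross-encoder reranking and small-to-big chunk retrieval; the results show that these techniques significantly improve retrieval accuracy and answer quality while keeping latency low.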

Details Motivation: Existing work lacks a systematic comparison of vector-based and non-vector RAG architectures in the financial domain, and the impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remains unclear. Method: Compares vector-based agentic RAG (with hybrid search and metadata filtering) against a hierarchical node-based system that follows document structure; applies two enhancements to the vector architecture, cross-encoder reranking and small-to-big chunk retrieval; and measures MRR, Recall@5, LLM-judged answer quality, latency, and preprocessing cost on a benchmark of 150 questions over 1,200 SEC filings. Result: Vector-based agentic RAG beats the hierarchical node-based system on answer quality with a 68% win rate at comparable latency (5.2 vs. 5.98 seconds); cross-encoder reranking improves MRR@5 by up to 59%; small-to-big retrieval achieves a 65% win rate with only 0.2 seconds of added latency. Conclusion: Applying advanced RAG techniques to financial question answering significantly improves retrieval accuracy and answer quality; despite cost-performance tradeoffs, the approach has potential for production deployment. Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remains unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture, i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 compared to 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. 
Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy and answer quality, with cost-performance tradeoffs to be considered in production.
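For readers unfamiliar with the retrieval metrics reported above, the following is a minimal sketch of MRR@k and Recall@k using their standard definitions. The document IDs and relevance sets are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=5):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def recall_at_k(ranked_lists, relevant_sets, k=5):
    """Mean fraction of relevant documents that appear within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        total += len(set(ranked[:k]) & relevant) / len(relevant)
    return total / len(ranked_lists)


# Hypothetical retrieval results for two questions.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"z", "q"}]
print(mrr_at_k(ranked, relevant, k=5), recall_at_k(ranked, relevant, k=5))
```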

[22] Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems

Faheem Nizar,Elias Lumer,Anmol Gulati,Pradeep Honaganahalli Basavaraju,Vamse Kumar Subbiah

Main category: cs.CL

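TL;DR: This paper proposes Agent-as-a-Graph retrieval, which represents tools and agents as nodes in a knowledge graph to improve fine-grained tool retrieval in multi-agent systems, and significantly outperforms existing methods on the LiveMCPBenchmark.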

Details Motivation: Existing agent, MCP, and retrieval methods typically match queries against a single agent description, ignoring the fine-grained capabilities of the tools inside each agent and leading to suboptimal agent selection. Method: Proposes Agent-as-a-Graph retrieval, which models tools and their parent agents as nodes and edges in a knowledge graph; relevant nodes are first retrieved by vector search, tools and agents are reranked with a type-specific weighted reciprocal rank fusion (wRRF), and parent agents are traversed in the graph to determine the final set of agents. Result: On the LiveMCPBenchmark, Recall@5 and nDCG@5 improve by 14.9% and 14.6% respectively, and the wRRF optimizations add a 2.4% improvement. Conclusion: Agent-as-a-Graph uses a knowledge graph to enhance retrieval, exposing the fine-grained tool capabilities inside agents and significantly improving the accuracy and efficiency of tool retrieval in multi-agent systems. Abstract: Recent advances in Large Language Model Multi-Agent Systems enable scalable orchestration and retrieval of specialized, parallelized subagents, each equipped with hundreds or thousands of Model Context Protocol (MCP) servers and tools. However, existing agent, MCP, and retrieval methods typically match queries against a single agent description, obscuring fine-grained tool capabilities of each agent, resulting in suboptimal agent selection. We introduce Agent-as-a-Graph retrieval, a knowledge graph retrieval augmented generation approach that represents both tools and their parent agents as nodes and edges in a knowledge graph. During retrieval, i) relevant agents and tool nodes are first retrieved through vector search, ii) we apply a type-specific weighted reciprocal rank fusion (wRRF) for reranking tools and agents, and iii) parent agents are traversed in the knowledge graph for the final set of agents. We evaluate Agent-as-a-Graph on the LiveMCPBenchmark, achieving 14.9% and 14.6% improvements in Recall@5 and nDCG@5 over prior state-of-the-art retrievers, and 2.4% improvements in wRRF optimizations.
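Steps ii) and iii) above can be sketched roughly as follows. This is one plausible reading, not the authors' implementation: the scoring form `weight / (k + rank)`, the constant `k = 60`, the example weights, and summing the scores of nodes that map to the same agent are all assumptions, and every node name is hypothetical.

```python
def wrrf_rank_agents(ranked_nodes, node_type, parent_agent, weights, k=60, top_n=5):
    """Rerank retrieved nodes with type-weighted reciprocal rank scores,
    then map each tool node to its parent agent and return the top agents."""
    scores = {}
    for rank, node in enumerate(ranked_nodes, start=1):
        # Type-specific weight applied to the reciprocal rank term.
        score = weights[node_type[node]] / (k + rank)
        # Follow the tool -> parent-agent edge; agent nodes map to themselves.
        agent = parent_agent.get(node, node)
        scores[agent] = scores.get(agent, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical vector-search output, already ordered by similarity.
ranked_nodes = ["tool:search", "agent:B", "tool:fetch"]
node_type = {"tool:search": "tool", "agent:B": "agent", "tool:fetch": "tool"}
parent_agent = {"tool:search": "agent:A", "tool:fetch": "agent:A"}
weights = {"tool": 1.5, "agent": 1.0}

print(wrrf_rank_agents(ranked_nodes, node_type, parent_agent, weights))
```

Here the two tool hits lift their shared parent agent above the agent that matched only through its own description, which is the effect the abstract attributes to fine-grained tool nodes.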

[23] From Archives to Decisions: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation

Xiaochen Zheng,Alvaro Serra,Ilya Schneider Chernov,Maddalena Marchesi,Eunice Musvasva,Tatyana Y. Doktorova

Main category: cs.CL

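TL;DR: This paper presents DiscoVerse, a multi-agent co-scientist system that supports pharmaceutical R&D by providing semantic retrieval, cross-document linking, and auditable knowledge synthesis over large-scale historical data, and validates its usefulness for reverse translation through expert evaluation.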

Details Motivation: Pharmaceutical R&D has accumulated large amounts of heterogeneous data, much of it from discontinued programs; reusing this data is valuable for reverse translation, but in practice reuse is often limited because the data is scattered and hard to integrate. Method: Designs DiscoVerse, a role-specialized multi-agent system that combines semantic retrieval, cross-document linking, and human-in-the-loop support to extract and synthesize knowledge from more than four decades of real Roche R&D archives, and evaluates output quality with blinded expert review. Result: Across seven benchmark queries covering 180 molecules, DiscoVerse achieves near-perfect recall (≥0.99) with moderate precision (0.71-0.91); expert evaluation shows it faithfully recovers discontinuation rationale and organ-specific toxicity, with all results traceable to their sources. Conclusion: DiscoVerse is the first agentic framework systematically validated on real, confidential, end-to-end drug-development archives, and offers a feasible technical path for knowledge reuse and reverse translation in the pharmaceutical domain. Abstract: Pharmaceutical research and development has accumulated vast, heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development. The system implements semantic retrieval, cross-document linking, and auditable synthesis on a large historical corpus from Roche. To validate our approach at real-world scale, we selected a subset of 180 molecules from the Roche research repositories, covering over 0.87 billion BPE tokens and more than four decades of research. Given that automated evaluation metrics are poorly aligned with scientific utility, we evaluate the performance of DiscoVerse using blinded expert evaluation of source-linked outputs. To our knowledge, this is the first agentic framework systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential, end-to-end drug-development archives. Our contributions include role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising answer accuracy and decision-making insights. 
In brief, across seven benchmark queries covering 180 molecules, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$), while qualitative assessments of discontinuation rationale and organ-specific toxicity showed faithful, source-linked synthesis across preclinical and clinical evidence.

[24] "AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa

Harsh Rathva,Pruthwik Mishra,Shrikant Malviya

Main category: cs.CL

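TL;DR: This paper proposes a data-centric approach to detecting hallucinations in multilingual scientific text; by unifying and balancing five existing datasets it substantially increases the size and quality of the training data, and it achieves strong results in the SHROOM-CAP 2025 shared task.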

Details Motivation: Multilingual scientific text generated by large language models contains hallucinations, and training data is scarce and imbalanced, so reliable detection methods are needed. Method: Adopts a data-centric strategy that unifies and balances five existing datasets into a combined training corpus of 124,821 samples, and fine-tunes XLM-RoBERTa-Large on it. Result: Performs well in the SHROOM-CAP 2025 task, ranking 2nd in Gujarati (a zero-shot language) and between 4th and 6th in the remaining eight languages. Conclusion: Systematic data curation can significantly improve hallucination detection, especially for low-resource and zero-shot languages, and can outperform architectural innovation alone. Abstract: The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach, which fine-tunes XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset, achieves competitive performance across all languages, including 2nd place in Gujarati (zero-shot language) with Factuality F1 of 0.5107, and rankings between 4th and 6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.

[25] Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning

Mohammad Aqib,Mohd Hamza,Ying Hei Chui,Qipei Mei

Main category: cs.CL

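TL;DR: This paper studies two methods for extracting information from tabular data in building codes, comparing direct image input to vision language models (VLMs) with indirect input via LaTeX conversion, and uses LoRA for domain-specific fine-tuning to improve performance. Experiments show that the direct input method performs better, and that fine-tuning substantially improves accuracy, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains of more than 100%.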

Details Motivation: Tables in building codes have complex structure and semantic relationships that traditional NLP techniques and VLMs handle poorly, so effective information extraction methods are needed to support automated question answering systems. Method: Compares two methods: feeding the page image directly to a pre-trained VLM for question answering, and first converting the table-bearing image into LaTeX code and then answering questions from the LaTeX input. The VLMs are further fine-tuned with LoRA, a parameter-efficient method, on a domain-specific tabular dataset. Result: The direct input method is generally more accurate than the indirect method; after LoRA fine-tuning, every VLM improves substantially, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains of more than 100%. Conclusion: Parameter-efficient fine-tuning methods such as LoRA can effectively improve VLM understanding of complex structured data in specialized domains, with potential for broad use in settings such as building code interpretation. Abstract: Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into LaTeX code and then answering inquiries based on the LaTeX-based input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. 
The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.
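As background for the two quantities mentioned above, the standard LoRA update and the usual definition of relative accuracy gain can be written as below. These are textbook formulations, not values or notation taken from the paper; the rank $r$, scaling factor $\alpha$, and the symbols $\mathrm{acc}_{\text{base}}$ and $\mathrm{acc}_{\text{ft}}$ are assumptions for illustration.

```latex
% LoRA: the pre-trained weight W_0 is frozen; only the low-rank factors B and A are trained.
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

% Relative accuracy gain; a gain above 100% means accuracy more than doubled.
\text{relative gain} = \frac{\mathrm{acc}_{\text{ft}} - \mathrm{acc}_{\text{base}}}{\mathrm{acc}_{\text{base}}} > 1
\iff \mathrm{acc}_{\text{ft}} > 2\,\mathrm{acc}_{\text{base}}
```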

Joseph Oladokun

Main category: cs.CL

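TL;DR: This paper proposes Path-Constrained Retrieval (PCR), which combines structural graph constraints with semantic search to improve the structural consistency and coherence of LLM agent reasoning over knowledge graphs.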

Details Motivation: Information retrieved by existing methods often lacks structural consistency with the model's current reasoning state, leading to incoherent reasoning chains. Method: Introduces Path-Constrained Retrieval (PCR), which restricts the search space to the subgraph reachable from an anchor node and combines structural graph constraints with semantic search to preserve logical relationships among retrieved items. Result: On the PathRAG-6 benchmark, PCR achieves 100% structural consistency (versus 24-32% for baselines), reaches 100% relevance at rank 10 in the technology domain, and reduces average graph distance by 78%. Conclusion: PCR effectively improves the reliability and coherence of LLM agent reasoning systems and is a strong approach to improving retrieval quality. Abstract: Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
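The core idea, restricting candidates to nodes reachable from an anchor before ranking by semantic similarity, can be sketched as follows. This is an illustrative reading of the abstract, not the author's code: the graph, the precomputed similarity scores (standing in for embedding similarity), and excluding the anchor itself from the results are assumptions.

```python
from collections import deque


def reachable(graph, anchor):
    """Breadth-first search: all nodes reachable from `anchor` along directed edges."""
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def path_constrained_retrieve(graph, anchor, scores, top_k=3):
    """Rank only the nodes reachable from the anchor by their similarity score."""
    candidates = reachable(graph, anchor) - {anchor}
    return sorted(candidates, key=lambda n: scores.get(n, 0.0), reverse=True)[:top_k]


# Hypothetical knowledge graph and query-similarity scores.
graph = {"a": ["b", "c"], "b": ["d"], "x": ["y"]}
scores = {"b": 0.2, "c": 0.9, "d": 0.5, "y": 0.99}

print(path_constrained_retrieve(graph, "a", scores, top_k=2))
```

Node `y` has the highest similarity but is never returned, because it is not reachable from the anchor `a`; plain vector search would have ranked it first.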

[27] Gradient Masters at BLP-2025 Task 1: Advancing Low-Resource NLP for Bengali using Ensemble-Based Adversarial Training for Hate Speech Detection

Syed Mohaiminul Hoque,Naimur Rahman,Md Sakhawat Hossain

Main category: cs.CL

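TL;DR: This paper presents the "Gradient Masters" approach to BLP-2025 Task 1, Bangla multitask hate speech identification, using an ensemble-based fine-tuning strategy that achieves strong results on both subtasks.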

Details Motivation: In low-resource languages such as Bengali, hate speech detection suffers from data scarcity and poor model generalization; this paper aims to improve classification performance in such settings. Method: Proposes a hybrid approach on a Bangla language model with an ensemble-based fine-tuning strategy, and compares several language model variants experimentally to improve robustness and coverage. Result: Ranks 6th in subtask 1A with a micro F1 score of 73.23% and 3rd in subtask 1B with 73.28%, supporting the effectiveness of the approach. Conclusion: The approach performs strongly on low-resource Bengali hate speech identification, and its misclassification analysis offers useful insights for future research. Abstract: This paper introduces the approach of "Gradient Masters" for BLP-2025 Task 1: "Bangla Multitask Hate Speech Identification Shared Task". We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.
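The micro F1 metric used to rank the submissions can be computed as in the sketch below, which pools true positives, false positives, and false negatives over all classes. The label names are hypothetical and not the task's actual classes.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all labels, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this label, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for four comments.
gold = ["abusive", "political", "abusive", "none"]
pred = ["abusive", "abusive", "abusive", "none"]
print(micro_f1(gold, pred))
```

For single-label multi-class classification, as in this toy case, micro F1 coincides with plain accuracy, since every misclassification counts as exactly one false positive and one false negative.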

[28] OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas

James Y. Huang,Wenxuan Zhou,Nan Xu,Fei Wang,Qin Liu,Sheng Zhang,Hoifung Poon,Muhao Chen

Main category: cs.CL

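TL;DR: This paper introduces OmniStruct, a comprehensive benchmark for evaluating large language models on diverse text-to-structure tasks, and generates high-quality training data through synthetic task generation, showing that much smaller models fine-tuned without any supervised data for the benchmark tasks can rival GPT-4o.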

Details Motivation: Modern large language models excel at generating unstructured natural language, but their performance on text-to-structure tasks is unclear, and a unified evaluation benchmark and effective training methods are lacking. Method: Builds the OmniStruct benchmark by collecting multiple existing datasets and adapting them to a unified text-to-structure problem setting; generates high-quality training data through synthetic task generation and uses it to fine-tune smaller models. Result: Without any supervised data for OmniStruct tasks, smaller models fine-tuned on synthetic data can rival GPT-4o on multiple text-to-structure tasks. Conclusion: Fine-tuning on synthetic data lets smaller models acquire general structured generation ability, offering a feasible path to efficient, low-cost structured output. Abstract: The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.

[29] Tu crois que c'est vrai ? Diversite des regimes d'enonciation face aux fake news et mecanismes d'autoregulation conversationnelle

Manon Berriche

Main category: cs.CL

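TL;DR: This thesis examines two paradoxes: why fake news makes up only a small share of what circulates on social media despite the absence of editorial control, and why political polarization has intensified even though users are not especially receptive to fake news. Mixed-methods studies on Twitter and Facebook find that fake news sharing is concentrated among a small group of highly active, politicized users who are critical of institutions but not cognitively disadvantaged, and who can influence the agenda of their political camp; when confronted with fake news, users do show discursive caution or make corrections, but this rarely leads to genuine public debate and often produces "dialogues of the deaf" among a small active minority.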

Details Motivation: To explain the apparent contradiction between the limited spread of fake news and rising political polarization, and to challenge existing research that reduces users to passive recipients. Method: A mixed-methods study on Twitter and Facebook combining quantitative analysis of digital traces with online observation and interviews, examining user practices across different interactional situations and recording socio-demographic traits. Result: 1. Fake news sharing is concentrated among a small group of highly active, politicized users who are critical of institutions; 2. users express critical distance from uncertain information through discursive caution or interventions; 3. these forms of critical distance rarely lead to genuine deliberation or pluralistic debate, and mostly produce "dialogues of the deaf" within small groups. Conclusion: The impact of fake news lies less in wide diffusion than in agenda-setting driven by specific active groups and their closed patterns of interaction, which helps explain polarization dynamics without assuming mass susceptibility. Abstract: This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence énonciative) or interventions ('points d'arrêt') that express disagreement or corrections. 
Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.

[30] Towards Robust and Fair Next Visit Diagnosis Prediction under Noisy Clinical Notes with Large Language Models

Heejoon Koo

Main category: cs.CL

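TL;DR: This study systematically evaluates the robustness and fairness of large language models under degraded clinical text, and proposes a clinically grounded label-reduction scheme and a hierarchical chain-of-thought strategy to improve stability and reliability under noisy inputs.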

Details Motivation: Clinical texts are often degraded by human error or failures in automated pipelines, which affects the reliability and fairness of AI-assisted decision-making, yet the impact of such degradation on large language models has not been studied in depth. Method: Evaluates state-of-the-art large language models under diverse text corruption scenarios and, to handle the large diagnostic label space, introduces a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) reasoning strategy. Result: The proposed approach improves robustness under degraded inputs and reduces prediction instability across demographic subgroups in next-visit diagnosis prediction. Conclusion: Combining clinically guided label reduction with hierarchical reasoning can strengthen the reliability and fairness of large language models in noisy settings, supporting their safe use in clinical decision support systems. Abstract: A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.

[31] Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

Dana Arad,Yonatan Belinkov,Hanjie Chen,Najoung Kim,Hosein Mohebbi,Aaron Mueller,Gabriele Sarti,Martin Tutek

Main category: cs.CL

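TL;DR: Building on the Mechanistic Interpretability Benchmark (MIB) framework, this paper describes the BlackboxNLP 2025 Shared Task, which aims to advance standardized evaluation of mechanistic interpretability (MI) techniques.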

Details Motivation: Because progress in mechanistic interpretability (MI) research is hard to measure, a standardized, reproducible evaluation framework is needed to compare the effectiveness of different MI methods. Method: The shared task has two tracks, circuit localization and causal variable localization; participants used ensemble and regularization strategies to discover circuits, and low-dimensional and non-linear projections to featurize activation vectors. Result: In circuit localization, three teams with eight methods in total achieved notable gains using ensemble and regularization strategies; in causal variable localization, one team with two methods achieved significant gains using non-linear projections. Conclusion: MIB provides an open, ongoing evaluation platform and encourages continued community participation to advance mechanistic interpretability research. Abstract: Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.

[32] SmolKalam: Ensemble Quality-Filtered Translation at Scale for High Quality Arabic Post-Training Data

Sultan Alrashed,Chadi Helwe,Francesco Orabona

Main category: cs.CL

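TL;DR: This paper introduces SmolKalam, a high-quality multi-turn Arabic dataset with reasoning and tool calling, translated from Smoltalk2 using a multi-model ensemble translation pipeline with quality filtering.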

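Details Motivation: Large-scale, multi-turn, high-quality Arabic datasets that include reasoning and tool calling are lacking, and post-training places especially high demands on data quality. Method: Translates Smoltalk2 with a multi-model ensemble translation pipeline, applies quality filtering, and uses ablations to assess which translation techniques are effective for traditional decoder-only models. Result: Builds SmolKalam, a high-quality multi-turn Arabic conversation dataset, and examines the effectiveness of the proposed translation approach for improving data quality and model performance. Conclusion: Strict translation and filtering methods are essential for building high-quality Arabic post-training datasets, and SmolKalam is an important resource for advancing Arabic AI. Abstract: Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.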

[33] Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations

Yu Xia,Sungchul Kim,Tong Yu,Ryan A. Rossi,Julian McAuely

Main category: cs.CL

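TL;DR: Proposes the Multi-Agent Collaborative Filtering (MACF) framework, which uses LLM agents to represent users and items and improves recommendation quality through dynamic collaboration among them.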

Details Motivation: Existing LLM-based recommender systems make poor use of the collaborative signals in user-item interaction history, leading to unsatisfying recommendations. Method: Instantiates similar users and relevant items as LLM agents with unique profiles, and introduces a central orchestrator agent that dynamically manages collaboration among the agents to produce personalized recommendations. Result: Experiments on datasets from three different domains show that the MACF framework outperforms strong existing baselines. Conclusion: By drawing an analogy between traditional collaborative filtering and LLM-based multi-agent collaboration, MACF effectively improves recommender system performance. Abstract: Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.

[34] General Agentic Memory Via Deep Research

B. Y. Yan,Chaofan Li,Hongjin Qian,Shuqi Lu,Zheng Liu

Main category: cs.CL

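TL;DR: Proposes GAM, a general agentic memory framework based on the idea of "just-in-time compilation", which separates memory storage from context construction to improve AI agent performance on tasks.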

Details Motivation: Existing static memory methods prepare memory in advance and therefore suffer severe information loss, failing to support the dynamic needs of AI agents. Method: GAM uses a two-component design: the Memorizer extracts key historical information into a lightweight memory during the offline stage while storing the complete history in a universal page-store; the Researcher retrieves and integrates information from the page-store at runtime for each request, optimizing the context just in time. The system is also optimized end to end with reinforcement learning. Result: Experiments show that GAM substantially outperforms existing memory systems across a range of memory-grounded task completion scenarios. Conclusion: Through the "just-in-time" principle and the cooperation of its two modules, GAM improves how effectively AI agents use memory and how well they perform tasks, and it scales well at test time. Abstract: Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of "just-in-time (JIT) compilation" where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
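The Memorizer/Researcher split described above can be illustrated with a deliberately simplified toy. This is not the GAM implementation: in the paper both roles are played by LLMs, whereas here a truncated-text memo and keyword overlap stand in for them, and all class names, function names, and example texts are hypothetical.

```python
class PageStore:
    """Keeps the complete history; pages are addressed by integer id."""

    def __init__(self):
        self.pages = []

    def add(self, text):
        self.pages.append(text)
        return len(self.pages) - 1

    def get(self, page_id):
        return self.pages[page_id]


def memorize(store, memory, text, memo_words=5):
    """Offline step: store the full page and keep only a short memo pointing to it."""
    page_id = store.add(text)
    memo = " ".join(text.split()[:memo_words])  # stand-in for an LLM-written memo
    memory.append((page_id, memo))


def research(store, memory, request, top_k=1):
    """Runtime step: use the memos to pick pages, then return the full pages."""
    request_words = set(request.lower().split())

    def overlap(entry):
        return len(request_words & set(entry[1].lower().split()))

    best = sorted(memory, key=overlap, reverse=True)[:top_k]
    return [store.get(page_id) for page_id, _ in best]


store, memory = PageStore(), []
memorize(store, memory, "Alice booked a flight to Paris on Monday for a conference")
memorize(store, memory, "Bob fixed the login bug in the payment service yesterday")
print(research(store, memory, "who fixed the login bug"))
```

The point of the design is visible even in the toy: the memo is lossy, but the context handed to the client at runtime is the complete page, so detail dropped from the memo ("in the payment service yesterday") is still recoverable.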

[35] MindEval: Benchmarking Language Models on Multi-turn Mental Health Support

José Pombal,Maya D'Eon,Nuno M. Guerreiro,Pedro Henrique Martins,António Farinhas,Ricardo Rei

Main category: cs.CL

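TL;DR: This paper presents MindEval, a framework for automatically evaluating large language models in multi-turn mental health therapy conversations, addressing the limitations of existing benchmarks.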

Details Motivation: Current AI mental health chatbots have several problems, such as sycophancy and overvalidation, and there is a lack of benchmarks that capture the complexity of real therapeutic interactions. Method: MindEval was developed in collaboration with Ph.D.-level clinical psychologists. It relies on patient simulation and automatic evaluation with LLMs; the authors validate the realism of the simulated patients and show strong correlations between automatic scores and human expert judgments. Result: Across 12 state-of-the-art LLMs, every model scored below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication. Reasoning capabilities and model scale do not guarantee better performance, and models do worse in longer interactions or with patients who have severe symptoms. Conclusion: MindEval provides a reliable, reproducible, automated framework for evaluating AI in mental health support, exposes the shortcomings of current models, and points toward future improvements. Abstract: Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.

[36] For Those Who May Find Themselves on the Red Team

Tyler Shoemaker

Main category: cs.CL

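TL;DR: This paper argues that literary scholars must engage with large language model (LLM) interpretability research. Despite the ideological struggle and even the risk of complicity, the instrumentality that dominates current interpretability methods should not be the only standard for measuring interpretation with LLMs, and the author suggests the red team as one site for this scholarly engagement.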

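Details Motivation: Current LLM interpretability research is largely technology-driven and lacks a humanistic perspective; the absence of literary scholars risks a one-sided understanding of language, meaning, and interpretation. Method: Through theoretical argument and critical analysis, the paper proposes that literary scholars actively engage with LLM interpretability research and identifies the red team as a possible site of interdisciplinary practice. Result: The paper stresses that the humanities are necessary within technical interpretive frameworks and argues that instrumental standards alone are insufficient for fully understanding how LLMs generate and interpret text. Conclusion: Literary scholars should take an active part in LLM interpretability research and, amid ideological tension, push for more critical and pluralistic paradigms of interpretation. Abstract: This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not outright complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.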

[37] Dealing with the Hard Facts of Low-Resource African NLP

Yacouba Diarra,Nouhoum Souleymane Coulibaly,Panga Azazia Kamaté,Madani Amadou Tall,Emmanuel Élisé Koné,Aymane Dembélé,Michael Leventhal

Main category: cs.CL

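TL;DR: This paper reports on collecting a 612-hour spontaneous speech dataset in Bambara, a low-resource language, including semi-automated transcription, the construction of compact monolingual models, and automatic and human evaluation. The dataset, models, and code are publicly released.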

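Details Motivation: Building speech datasets, models, and evaluation frameworks for low-resource languages is challenging because there is little relevant experience to draw on, so workable methods and practices need to be explored. Method: The authors collected 612 hours of spontaneous Bambara speech in the field, annotated it with transcriptions in a semi-automated way, trained several ultra-compact and small monolingual speech models, and evaluated them with both automatic metrics and human evaluation. Result: They built and released a large-scale Bambara speech dataset, multiple evaluation subsets, speech models, and related code; the experiments show that human evaluation is essential for model improvement. Conclusion: The study offers reusable practical experience for speech data collection, annotation, and modeling in low-resource languages, highlights the importance of human evaluation, and supports further research through open release. Abstract: Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.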

[38] Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

H. M. Shadman Tabib,Jaber Ahmed Deedar

Main category: cs.CL

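TL;DR: This study compares GPT-4o with a feature-based LightGBM model at predicting the difficulty of LeetCode programming problems and finds that LightGBM reaches 86% accuracy, far above GPT-4o's 37.75%.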

Details Motivation: To examine how reliable large language models are on structured tasks such as predicting programming problem difficulty, given how widely they are now used as automatic judges. Method: GPT-4o, used purely as a natural-language assessor, is systematically compared with an interpretable LightGBM model trained on explicit numeric and textual features, with key features analyzed through confusion matrices and SHAP values; synthetic Hard problems are then used to test GPT-4o's consistency. Result: LightGBM reaches 86% accuracy, well above GPT-4o's 37.75%. Numeric constraints such as input size limits and acceptance rates are the key factors in identifying Hard problems, whereas GPT-4o often ignores these cues and leans toward easier categories. In the synthetic experiment, GPT-4o labels most of its own generated Hard problems as Medium, an inconsistent tendency to downgrade. Conclusion: Current LLMs have clear deficiencies as automatic judges on structured tasks, especially in capturing key numeric features and judging consistently, and need improvement before they can be trusted in settings such as education or programming assessment. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.

[39] A Benchmark for Zero-Shot Belief Inference in Large Language Models

Joseph Malone,Rachith Aiyappa,Byunghwee Lee,Haewoon Kwak,Jisun An,Yong-Yeol Ahn

Main category: cs.CL

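TL;DR: This paper introduces a systematic benchmark for evaluating how well large language models predict individuals' stances on a range of topics in a zero-shot setting. Providing more background information improves predictive accuracy, but performance differs substantially across belief domains.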

Details Motivation: Existing computational studies of belief are mostly confined to specific sociopolitical contexts and rely on fine-tuning, and it is unclear how well large language models generalize across diverse belief domains, so a reproducible cross-domain evaluation framework is needed. Method: Using data from an online debate platform, the authors build a zero-shot benchmark with multiple informational conditions, evaluate the stance-prediction ability of small- to medium-sized LLMs across belief domains, and analyze how demographic context and known prior beliefs affect prediction. Result: Providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains, indicating that current LLMs show both promise and limits in emulating human reasoning. Conclusion: The study reveals the capacity and limitations of current LLMs in cross-domain belief modeling and provides a scalable framework that advances the study of machine behavior and extends belief analysis beyond the traditional sociopolitical sphere. Abstract: Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.

[40] A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News

Mirza Raquib,Munazer Montasir Akash,Tawhid Ahmed,Saydul Akbar Murad,Farida Siddiqi Prity,Mohammad Amzad Hossain,Asif Pervez Polok,Nick Rahimi

Main category: cs.CL

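TL;DR: This study proposes a hybrid transfer learning model based on BERT-CNN-BiLSTM for Bangla news headline classification and sentiment analysis, achieving state-of-the-art performance on the BAN-ABSA dataset.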

Details Motivation: Handling large volumes of news from many sources and understanding its sentiment matters for public discussion, and efficient text classification methods are lacking for low-resource languages such as Bangla. Method: Using natural language processing techniques, the authors build a hybrid BERT-CNN-BiLSTM model and run experiments on the BAN-ABSA dataset under two sampling strategies, with undersampling and oversampling applied either before or after splitting. Result: On the imbalanced dataset, technique-1 with oversampling reaches 78.57% for headline classification and 73.43% for sentiment classification, while technique-2 achieves its best results of 81.37% and 64.46% when trained on the original data; the proposed model significantly outperforms the baseline models. Conclusion: The BERT-CNN-BiLSTM model effectively improves Bangla news headline classification and sentiment analysis, confirms the value of combining headline and sentiment information, and provides a strong baseline for text classification in low-resource languages. Abstract: In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines, marking the first time it has been used for simultaneous headline and sentiment categorization in Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting. In technique-1, oversampling provided the strongest performance for both headline and sentiment classification, at 78.57% and 73.43% respectively, while technique-2 delivered the highest result when trained directly on the original imbalanced dataset, at 81.37% and 64.46% respectively. The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource settings.

[41] Prompt Optimization as a State-Space Search Problem

Maanas Taneja

Main category: cs.CL

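TL;DR: This paper treats prompt optimization as a classical state-space search problem, modeling the prompt space as a graph and exploring it systematically with beam search and random walk algorithms; the approach is validated on several NLP tasks.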

Details Motivation: Language models are extremely sensitive to small changes in the input prompt, and conventional approaches are prone to performance collapse. Inspired by demonstration-based prompt optimization libraries such as DSPy, the paper explores a more stable and systematic method. Method: The prompt space is modeled as a graph whose nodes are prompt states and whose edges are transformations of the prompt (such as shortening, adding examples, or reordering). Beam search and random walk algorithms explore this space, evaluating candidate prompts on development sets and pruning unpromising branches. Result: Experiments on five NLP tasks show that even shallow search (beam width=2, depth=2) improves development-set performance; for example, development accuracy on reasoning tasks rises from 0.40 to 0.80, while the test-set gain is smaller (0.20 to 0.50), indicating overfitting. Transformations that make prompts more concise are selected most often, and operators that add verbosity are never selected. Conclusion: Prompt optimization can be modeled effectively as a search problem. The current method already improves development-set results, and more compute and better evaluation metrics could yield more robust prompts that generalize. Abstract: Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSPy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or reordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reasoning, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the development heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at https://github.com/MaanasTaneja/PromptOptimiser.
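The search formulation described in this entry can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the three operators (`shorten`, `add_example`, `reorder`) are hypothetical stand-ins for the transformations the abstract names, and `toy_dev_score` replaces a real development-set evaluation with a simple "shorter is better" heuristic.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical edge operators; the abstract names the transformation types
# but does not specify how they are implemented.
def shorten(prompt: str) -> str:
    """Drop the last sentence if more than one remains."""
    parts = [p for p in prompt.split(". ") if p]
    return ". ".join(parts[:-1]) if len(parts) > 1 else prompt


def add_example(prompt: str) -> str:
    """Append a fixed demonstration to the prompt."""
    return prompt + " Example: 'great film' -> positive."


def reorder(prompt: str) -> str:
    """Rotate the first sentence to the end."""
    parts = [p for p in prompt.split(". ") if p]
    return ". ".join(parts[1:] + parts[:1])


OPERATORS: List[Callable[[str], str]] = [shorten, add_example, reorder]


def toy_dev_score(prompt: str) -> float:
    """Stand-in for development-set accuracy (assumption): favors short prompts."""
    return 1.0 / (1 + len(prompt.split()))


def beam_search(seed: str, score: Callable[[str], float],
                width: int = 2, depth: int = 2) -> Tuple[float, str]:
    """Expand every prompt in the beam, keep the top `width`, repeat `depth` times."""
    beam = [(score(seed), seed)]
    best = beam[0]
    for _ in range(depth):
        candidates: Dict[str, float] = {}
        for _, prompt in beam:
            for op in OPERATORS:
                child = op(prompt)
                if child not in candidates:
                    candidates[child] = score(child)
        ranked = sorted(((s, p) for p, s in candidates.items()),
                        key=lambda item: item[0], reverse=True)
        beam = ranked[:width]  # prune unpromising branches
        if beam and beam[0][0] > best[0]:
            best = beam[0]
    return best


seed_prompt = "Classify the sentiment. Be careful. Answer with one word"
best_score, best_prompt = beam_search(seed_prompt, toy_dev_score)
```

Because the score here is computed on the same data that guides the search, a sketch like this also makes the overfitting risk reported in the abstract easy to see: the best prompt is only "best" with respect to the development heuristic.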

[42] OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Michael J. Bommarito

Main category: cs.CL

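TL;DR: OpenGloss is a synthetic English encyclopedic dictionary and semantic knowledge graph with rich lexical definitions, semantic relations, and encyclopedic content, built through a low-cost, fast multi-agent generation pipeline and suited to educational and natural language processing applications.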

Details Motivation: To address the fragmented content and limited definitions of existing lexical resources for teaching and natural language processing by providing an integrated, large-scale, cost-effective lexical knowledge base. Method: An LLM-based multi-agent procedural generation pipeline, combined with schema validation and automated quality assurance, automatically generates and integrates definitions, semantic relations, usage examples, collocations, and encyclopedic content. Result: The resource contains 150K lexemes, 537K senses, 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content, with more than four times as many sense definitions as WordNet; it was generated for under $1,000 in under one week. Conclusion: OpenGloss demonstrates the feasibility and advantages of synthetic generation for building high-quality language resources, supporting rapid iteration and broad application, while reflecting both the capabilities and limitations of current foundation models. Abstract: We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.

[43] No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Shireen Chand,Faith Baca,Emilio Ferrara

Main category: cs.CL

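TL;DR: The study examines the cross-category effects of bias mitigation techniques for large language models and finds that, although they may reduce bias along the targeted dimension, they often have negative consequences along other dimensions, such as increased bias or reduced coherence.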

Details Motivation: Large language models (LLMs) inherit societal biases from their training data, which can lead to harmful or unfair outputs. Existing methods usually evaluate mitigation only along the targeted bias dimension and lack a comprehensive analysis of cross-category effects. Method: Four bias mitigation techniques are applied to ten models from seven model families, covering racial, religious, profession-related, and gender-related biases, and the StereoSet benchmark is used to measure the impact on model coherence and stereotypical preference. Result: Targeted mitigation sometimes reduces bias along the intended dimension but frequently causes unintended negative consequences along others, such as increasing untargeted bias or lowering overall coherence. Conclusion: More robust, multi-dimensional evaluation tools are needed to develop and assess bias mitigation strategies, so that bias is not inadvertently shifted or worsened along untargeted dimensions. Abstract: Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.

[44] Evaluating Large Language Models on the 2026 Korean CSAT Mathematics Exam: Measuring Mathematical Ability in a Zero-Data-Leakage Setting

Goun Pyeon,Inbum Heo,Jeesu Jung,Taewook Hwang,Hyuk Namgoong,Hyein Seo,Yerim Han,Eunbin Kim,Hyeonseok Kang,Sangkeun Jung

Main category: cs.CL

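TL;DR: This study uses the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section for a contamination-free evaluation of the mathematical reasoning of 24 large language models. GPT-5 Codex achieved a perfect score with text input and Korean prompts, geometry was a common weak spot, and the study discusses the trade-off between reasoning intensity and cost-effectiveness.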

Details Motivation: Existing benchmarks suffer from training data leakage, which inflates evaluation results, so a completely unexposed, real-exam-based clean environment is needed to assess LLMs' mathematical reasoning objectively. Method: All 46 questions were digitized within two hours of the exam's release, and 24 state-of-the-art LLMs were tested across input modalities (text, image, text+figure) and prompt languages (Korean, English), with additional reasoning enhancement experiments on the GPT-5 series. Result: GPT-5 Codex scored 100 points with text input and Korean prompts; Grok 4, GPT-5, and Deepseek R1 scored above 95; gpt-oss-20B reached 95.7 points despite its smaller size. Geometry problems averaged only 77.7% correct, and performance dropped markedly on high-difficulty problems. Text input outperformed image input, the effect of prompt language varied by model, and increased reasoning improved scores but raised cost substantially. Conclusion: The study builds a contamination-free, real-exam evaluation framework, shows that geometric reasoning remains a challenge and that excessive reasoning enhancement reduces efficiency, and proposes a practical evaluation perspective that weighs performance, cost, and time together. Abstract: This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English). GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale. In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).

[45] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He,Richard He Bai,Sinead Williamson,Jeff Z. Pan,Navdeep Jaitly,Yizhe Zhang

Main category: cs.CL

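TL;DR: This paper proposes CLaRa (Continuous Latent Reasoning), a retrieval-augmented generation framework that performs embedding-based compression and joint optimization in a shared continuous space, improving long-context handling and the alignment between retrieval and generation.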

Details Motivation: Existing retrieval-augmented generation (RAG) methods struggle with long contexts and optimize retrieval and generation separately, which limits both efficiency and effectiveness. Method: The CLaRa framework (1) uses SCP, supervised with QA and paraphrase data, for semantics-preserving vector compression; (2) jointly trains the reranker and generator in a shared continuous space, with a differentiable top-k estimator enabling end-to-end optimization; and (3) unifies the objective under a single language modeling loss. Result: On multiple QA benchmarks, CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines. Conclusion: Through a unified continuous-space representation and end-to-end joint optimization, CLaRa addresses the disconnect between retrieval and generation in RAG and enables more efficient and accurate knowledge-augmented generation. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.

[46] Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models

Wangjiaxuan Xin

Main category: cs.CL

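TL;DR: This paper proposes Empathetic Cascading Networks (ECN), a multi-stage prompting framework intended to improve the empathy and inclusivity of large language models. Experiments show that ECN performs well on several metrics and has potential for empathetic conversational AI.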

Details Motivation: To improve the empathy and inclusivity of large language models in conversation and to address existing models' shortcomings in emotional understanding and contextual awareness. Method: The four-stage ECN prompting framework (Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis) guides the model step by step toward more empathetic and context-aware responses. Result: ECN achieves the highest Empathy Quotient (EQ) scores on both GPT-3.5-turbo and GPT-4 while maintaining competitive Regard and Perplexity metrics. Conclusion: ECN effectively enhances the empathy and inclusivity of large language models and has potential for broad use in empathy-oriented dialogue systems. Abstract: This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.

[47] RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context

Yu Lei,Shuzheng Si,Wei Wang,Yifei Wu,Gang Chen,Fanchao Qi,Maosong Sun

Main category: cs.CL

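TL;DR: RhinoInsight is an enhanced deep research framework that uses a verifiable checklist and an evidence audit module to improve the robustness, traceability, and output quality of large language models on complex tasks.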

Details Motivation: Existing research systems with linear pipelines suffer from error accumulation and context rot, and they lack effective control over model behavior and context. Method: The RhinoInsight framework adds two control mechanisms: (1) a Verifiable Checklist module that turns user requirements into traceable sub-goals and produces a hierarchical outline; and (2) an Evidence Audit module that structures search content, iteratively updates the outline, and prunes noise, with a critic that binds high-quality evidence to the draft. Result: Experiments show that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks. Conclusion: By introducing explicit control mechanisms, RhinoInsight improves the stability and trustworthiness of language model agents in long-horizon reasoning and decision-making, producing high-quality research output without parameter updates. Abstract: Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline from planning to searching to writing a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.

Matthew R. DeVerna,Kai-Cheng Yang,Harry Yaojun Yan,Filippo Menczer

Main category: cs.CL

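TL;DR: The study evaluates 15 large language models on fact-checking and finds that standard models perform poorly, reasoning offers limited gains, and web search helps somewhat but remains unsatisfactory; by contrast, a retrieval-augmented generation (RAG) system using PolitiFact summaries improves performance substantially.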

Details Motivation: Large language models (LLMs) promise end-to-end automated fact-checking, but prior results are mixed; as mainstream chatbots integrate reasoning and web search and are widely used for verification, rigorous evaluation is urgently needed. Method: Fifteen recent LLMs from OpenAI, Google, Meta, and DeepSeek are evaluated on more than 6,000 claims fact-checked by PolitiFact, comparing standard models, reasoning models, and models with web search, alongside a RAG system based on PolitiFact summaries. Result: Standard models perform poorly, reasoning brings negligible improvement, and web search yields only moderate gains even though the relevant fact-checks are available on the web; the RAG system using PolitiFact summaries raises macro F1 by 233% on average across model variants. Conclusion: Giving models curated, high-quality context (for example through RAG) is a promising path for advancing automated fact-checking. Abstract: Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
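To make the headline metric concrete, the following sketch computes macro F1 (the unweighted mean of per-class F1 scores) and a relative gain in percent. The labels and the before/after scores are hypothetical values chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, using F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


# Hypothetical verdict labels for four claims.
truth = ["true", "false", "true", "false"]
preds = ["true", "true", "true", "false"]
score = macro_f1(truth, preds)

# Hypothetical scores: a 233% relative gain multiplies the baseline by 3.33.
gain = relative_gain_pct(0.15, 0.4995)
```

A relative gain of 233% therefore says that the score more than tripled, which can still correspond to a modest absolute macro F1 if the baseline was low.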

[49] Robust Multimodal Sentiment Analysis with Distribution-Based Feature Recovery and Fusion

Daiqing Wu,Dongbao Yang,Can Ma,Yu Zhou

Main category: cs.CL

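TL;DR: Proposes a Distribution-based feature Recovery and Fusion (DRF) method for robust image-text sentiment analysis that handles low-quality and missing modalities in a unified way and outperforms existing methods across a range of realistic scenarios.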

Details Motivation: Existing methods for image-text multimodal sentiment analysis overlook low-quality and missing modalities, which are common in real applications and degrade model performance, so more robust methods are needed. Method: A feature queue is maintained for each modality to estimate its feature distribution; the distributions are used to quantify modality quality and reduce the influence of low-quality modalities, and distribution-supervised inter-modal mappings recover missing modalities. Result: Experiments on three public datasets under two disruption strategies show that DRF outperforms current state-of-the-art methods across scenarios, demonstrating robustness and generality. Conclusion: DRF effectively handles low-quality and missing modalities in multimodal sentiment analysis, improving model stability and performance in complex real-world settings. Abstract: As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive accomplishments in simultaneously harnessing image and text information, they overlook possible low-quality and missing modalities. In real-world applications, these issues might frequently occur, leading to urgent needs for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both strategies, validating its effectiveness in robust multimodal sentiment analysis.

[50] Context-Aware Whisper for Arabic ASR Under Linguistic Varieties

Bashar Talafha,Amin Abu Alhassan,Muhammad Abdul-Mageed

Main category: cs.CL

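TL;DR: Proposes context-aware prompting strategies that adapt OpenAI's Whisper to Arabic speech recognition without retraining, reducing word error rate across nine Arabic linguistic conditions.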

Details Motivation: Low-resource ASR remains challenging for languages such as Arabic, which have wide dialectal variation and limited labeled data. Method: The approach uses decoder prompting (with first-pass transcriptions or retrieved utterances) and encoder prefixing with speech synthesized in the target speaker's voice, and introduces prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval. Result: WER drops by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, with hallucinations and speaker mismatch significantly reduced. Conclusion: The proposed methods improve Arabic speech recognition in real-world zero-shot settings without retraining the model. Abstract: Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.

[51] HyperbolicRAG: Enhancing Retrieval-Augmented Generation with Hyperbolic Representations

Cao Linxiao,Wang Ruitao,Li Jindong,Zhou Zhipeng,Yang Menglin

Main category: cs.CL

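TL;DR: Proposes HyperbolicRAG, a graph-based retrieval-augmented generation framework that incorporates hyperbolic geometry; with depth-aware representation learning, unsupervised contrastive regularization, and a mutual-ranking fusion mechanism, it outperforms existing methods on multiple QA benchmarks.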

Details Motivation: Existing graph-based RAG methods built on Euclidean embeddings have no geometric notion of hierarchical depth, which makes it hard to represent abstraction relationships in complex knowledge graphs. Method: The framework introduces hyperbolic geometry: a depth-aware representation learner jointly models semantic similarity and hierarchical containment in Poincaré space, an unsupervised contrastive regularization keeps the geometry consistent across abstraction levels, and a mutual-ranking fusion mechanism combines retrieval signals from Euclidean and hyperbolic spaces. Result: Experiments on multiple QA benchmarks show that HyperbolicRAG outperforms standard RAG and graph-augmented baselines, supporting its effectiveness at capturing both fine-grained semantics and global hierarchy. Conclusion: By integrating hyperbolic geometry, HyperbolicRAG improves the ability of graph-based retrieval systems to model complex knowledge hierarchies, which strengthens retrieval accuracy and reasoning performance. Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincaré manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
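Two ingredients mentioned in this entry can be sketched briefly. The first function is the standard distance on the Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), under which points near the boundary are far apart, so depth in a hierarchy can be encoded by norm. The second function is reciprocal rank fusion, used here only as a generic stand-in for combining two rankings; the abstract does not specify how the paper's mutual-ranking fusion is computed, so that part is an assumption.

```python
import math
from typing import Dict, List, Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm = lambda x: sum(a * a for a in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Stand-in fusion (assumption): sum 1 / (k + rank) over all rankings."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


# Distance from the origin to a point at radius 0.5 equals ln(3).
d = poincare_distance([0.0, 0.0], [0.5, 0.0])

# Hypothetical rankings from a Euclidean and a hyperbolic retriever.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

Documents ranked well in both spaces rise to the top of the fused list, which is one simple way to reward the cross-space agreement the abstract emphasizes.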

[52] Concept than Document: Context Compression via AMR-based Conceptual Entropy

Kaize Shi,Xueyao Sun,Xiaohui Tao,Lin Li,Qika Lin,Guandong Xu

Main category: cs.CL

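TL;DR: Proposes an unsupervised context compression framework based on AMR graphs that uses node-level entropy to retain semantically essential information, reducing redundancy and improving reasoning accuracy on long contexts.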

Details Motivation: Large language models face information overload when handling long contexts, especially the many documents used in retrieval-augmented generation, which lowers reasoning accuracy and raises computational cost. Method: The context is represented as Abstract Meaning Representation (AMR) graphs, the conceptual entropy of each node is computed to estimate its importance, and key nodes are selected to form a semantically focused compressed context. Result: The method outperforms baselines on the PopQA and EntityQuestions datasets, improving QA accuracy while substantially shortening the context. Conclusion: This is the first work to introduce AMR-based conceptual entropy for context compression, and it demonstrates the potential of stable linguistic features in context engineering. Abstract: Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and screen significant informative nodes to form a context that is more condensed and semantically focused than the raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.

[53] A Reproducible Framework for Neural Topic Modeling in Focus Group Analysis

Heger Arfaoui,Mohammed Iheb Hergli,Beya Benzina,Slimane BenMiled

Main category: cs.CL

TL;DR: Presents a BERTopic-based neural topic modeling framework for analyzing focus group transcripts, with a systematic evaluation of hyperparameter sensitivity, model stability, and interpretability.

Details Motivation: Focus group analysis traditionally relies on manual coding, which scales poorly and is hard to reproduce, so a scalable and reproducible automated method is needed. Method: BERTopic is applied to ten Tunisian focus groups on HPV vaccine perceptions (1,076 utterances). The study tests 27 hyperparameter configurations, assesses stability through bootstrap resampling with 30 replicates per configuration, uses a hierarchical merging strategy to balance stability against coherence, and validates interpretability through human evaluation by three domain experts. Result: Topic model output is highly sensitive to hyperparameter choices, and the stability metric should match the analytical goal. Hierarchical merging raises topic coherence (0.558 vs 0.539) while keeping good model stability. Human evaluation indicates high topic quality with good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Conclusion: The framework gives qualitative researchers reproducible, systematic guidance for topic modeling that they can apply and extend in their own settings, supporting efficient and transparent analysis of focus group data. Abstract: Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. 
All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
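
The bootstrap stability assessment described in this abstract can be sketched roughly as follows. This is an illustrative outline, not the authors' released code: `fit_topics` is a hypothetical callable standing in for a topic model such as BERTopic and is assumed to return one list of top words per topic, and best-match Jaccard overlap is only one of several possible stability metrics.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of top words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def bootstrap_stability(docs, fit_topics, n_replicates=30, seed=0):
    """Mean best-match Jaccard overlap between reference topics and topics
    refit on bootstrap resamples of the documents."""
    rng = random.Random(seed)
    reference = fit_topics(docs)
    if not reference:
        raise ValueError("fit_topics returned no topics for the full corpus")
    scores = []
    for _ in range(n_replicates):
        # Resample the corpus with replacement, keeping its original size.
        sample = [rng.choice(docs) for _ in docs]
        replicate = fit_topics(sample)
        if not replicate:
            scores.append(0.0)
            continue
        best = [max(jaccard(ref, rep) for rep in replicate) for ref in reference]
        scores.append(sum(best) / len(best))
    return sum(scores) / len(scores)


# Toy check with a hypothetical model that always returns the same two topics.
toy_docs = ["vaccine safety concern", "school vaccination access"]
constant_model = lambda docs: [["vaccine", "safety"], ["school", "access"]]
stability = bootstrap_stability(toy_docs, constant_model)
```

A model that returns identical topics on every resample scores 1.0. Real topic models score lower, and the abstract's point is that the choice of overlap metric should match the analytical goal.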

[54] Large Language Models for the Summarization of Czech Documents: From History to the Present

Václav Tran,Jakub Šmíd,Ladislav Lenc,Jean-Pierre Salmon,Pavel Král

Main category: cs.CL

TL;DR: Uses LLMs such as Mistral and mT5 to improve Czech text summarization, advancing results on both modern and historical Czech texts, and releases a new dataset, Posel od Čerchova, to support research on historical documents in low-resource languages.

Details Motivation: Research on summarizing Czech text, and historical Czech in particular, is held back by the language's complexity and the lack of high-quality annotated data, so effective methods and resources are needed. Method: Multilingual LLMs (Mistral and mT5) are applied directly to Czech summarization, alongside a translation-based approach: translate the Czech text into English, summarize it with an English-language model, then translate the summary back into Czech. Result: The models reach new state-of-the-art results on the SumeCzech dataset. The authors also build Posel od Čerchova, a summarization dataset for 19th-century historical Czech, and report initial baselines with modern LLMs. Conclusion: Multilingual LLMs handle summarization well for medium-resource, morphologically rich languages such as Czech, and together with the new dataset they lay groundwork for future work on Czech historical document processing and low-resource summarization. Abstract: Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets. In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od Čerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area. 
By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.

[55] Cognitive Alpha Mining via LLM-Driven Code-Based Evolution

Fengyuan Liu,Huang Yi,Sichun Luo,Yuqi Wang,Yazheng Yang,Xinye Li,Zefa Hu,Junlan Feng,Qi Liu

Main category: cs.CL

TL;DR: Proposes the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level representation, LLM reasoning, and evolutionary search to find effective predictive signals in high-dimensional financial data with a low signal-to-noise ratio.

Details Motivation: Existing methods explore only a limited part of the vast alpha search space. Neural models lack interpretability, symbolic methods tend to produce redundant expressions, and most methods cannot carry out broad, structured, human-like exploration. Method: CogAlpha treats LLMs as adaptive cognitive agents that iteratively refine, mutate, and recombine alpha candidates through multi-stage prompts and financial feedback, combining code-level representation with LLM-driven reasoning and evolutionary search. Result: Experiments on A-share equities show that the alphas CogAlpha discovers outperform existing methods in predictive accuracy, robustness, and generalization. Conclusion: Combining evolutionary optimization with LLM-based reasoning is a promising route to automated and explainable alpha discovery. Abstract: Discovering effective predictive signals, or "alphas," from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)-based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.

[56] FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models

Masoomali Fatehkia,Enes Altinisik,Husrev Taha Sencar

Main category: cs.CL

TL;DR: Introduces FanarGuard, a bilingual moderation filter that evaluates safety and cultural alignment in Arabic and English, and builds the first benchmark targeting Arabic cultural contexts.

Details Motivation: Most existing moderation filters ignore cultural context and focus only on general safety, which makes them a poor fit for alignment in multilingual, multicultural settings. Method: The authors build a dataset of over 468K prompt-response pairs scored by LLM judges and use it to train two filter variants. They also develop the first benchmark sensitive to Arabic cultural norms, with over 1,000 prompts annotated by human raters. Result: On cultural alignment, FanarGuard agrees with human annotations more strongly than the annotators agree with each other, and on safety benchmarks it matches state-of-the-art filters. Conclusion: Integrating cultural awareness into content moderation is important, and FanarGuard offers a practical path toward more context-sensitive safeguards. Abstract: Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.

[57] Generating Reading Comprehension Exercises with Large Language Models for Educational Applications

Xingyu Huang,Fei Jiang,Jianli Xiao

Main category: cs.CL

TL;DR: Proposes Reading Comprehension Exercise Generation (RCEG), an LLM framework that automatically generates high-quality, personalized English reading comprehension exercises, and shows that pairing fine-tuned models with a discriminator significantly improves relevance and cognitive appropriateness.

Details Motivation: As LLMs advance rapidly, education increasingly needs automated, intelligent generation of learning content, yet automatically generated personalized English reading comprehension exercises still fall short on quality and adaptability. Method: RCEG first uses fine-tuned LLMs to generate content candidates and then uses a discriminator to select the best one. It is evaluated on a purpose-built dataset with multi-dimensional metrics: content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Result: RCEG significantly outperforms baselines on content relevance and cognitive appropriateness, and the generated exercises are of higher quality and better suited to teaching. Conclusion: RCEG improves the quality of automatically generated English reading comprehension exercises and shows good potential for personalized educational use. Abstract: With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework named Reading Comprehension Exercise Generation (RCEG), which automatically generates high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate, which greatly improves the quality of the generated content. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed to perform the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.

[58] Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models

Yang Xiang,Yixin Ji,Juntao Li,Min Zhang

Main category: cs.CL

TL;DR: Presents the first study of pruning large reasoning models (LRMs) and proposes Selective Self-Generated Reasoning (SSGR) data as calibration data, which substantially improves the reasoning ability of pruned models.

Details Motivation: LRMs perform well on complex reasoning tasks, but their long chain-of-thought reasoning is computationally expensive. Existing pruning methods mainly target LLMs and have not been applied effectively to LRMs. Method: The authors experimentally evaluate how the difficulty and length of self-generated reasoning data affect pruning, then propose the SSGR data construction strategy to improve the calibration stage of pruning. Result: On the DeepSeek-R1-Distill model series, SSGR improves the reasoning ability of pruned models by 10%-13% compared with general pruning methods. Conclusion: Challenging, moderately long self-generated reasoning data makes effective calibration data for pruning, and SSGR offers a workable path to compressing LRMs. Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.

[59] CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Jingqian Zhao,Bingbing Wang,Geng Tu,Yice Zhang,Qianlong Wang,Bin Liang,Jing Li,Ruifeng Xu

Main category: cs.CL

TL;DR: Proposes CoreEval, a contamination-resilient evaluation strategy that automatically updates data with current real-world knowledge to mitigate data contamination in LLM evaluation.

Details Motivation: Existing methods cannot fully eliminate a model's prior exposure to test data and struggle to preserve the semantic complexity of the original data, which biases evaluation results. Method: Extract entity relationships from the original data, retrieve up-to-date real-world knowledge from the GDELT database, recontextualize it and integrate it with the original data, and verify and refine labels through an iterative data reflection mechanism. Result: Experiments on several updated datasets show that CoreEval reduces the performance overestimation caused by data contamination and makes evaluation more robust. Conclusion: CoreEval offers a workable approach to fair LLM evaluation that strengthens resistance to contamination while preserving semantic complexity. Abstract: Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

[60] Reproducibility Study of Large Language Model Bayesian Optimization

Adam Rychert,Gasper Spagnolo,Evgenii Posashkov

Main category: cs.CL

TL;DR: Reproduces the LLAMBO framework with Llama 3.1 70B in place of GPT-3.5 to test whether LLMs remain effective as discriminative surrogates and samplers in Bayesian optimization. The architecture proves robust to swapping the language model and performs well when given textual context.

Details Motivation: To test whether LLAMBO stays effective and robust when switched to an open-weight model (Llama 3.1 70B), and to explore how much it depends on the language model backbone. Method: Reproduce the Bayesmark and HPOBench experiments with GPT-3.5 replaced by Llama 3.1 70B, run ablations on the role of textual context, and test smaller models. Result: LLAMBO with Llama 3.1 70B still substantially improves early regret and reduces variance. The discriminative surrogate is weaker than GP or SMAC but benefits from cross-task semantic priors. Removing textual context clearly degrades performance. The candidate sampler beats TPE and random sampling. Smaller models (such as Gemma 27B and Llama 3.1 8B) behave unstably. Conclusion: The LLAMBO architecture is robust to changes in the language model backbone, remains effective with Llama 3.1 70B, and relies on textual context to supply the semantic priors that improve performance. Abstract: In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components. Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO's discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour. Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.

[61] Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs

Sahil Kale

Main category: cs.CL

TL;DR: Introduces a benchmark that evaluates how well commercial LLMs are calibrated on when and how to use web search. Built-in search improves factual accuracy, but models remain overconfident, retrieve inconsistently, and formulate weak queries.

Details Motivation: Modern LLMs integrate web search to provide real-time answers, but it is unclear whether they invoke search efficiently when it is actually needed, so their search decisions need to be evaluated across different situations. Method: Build a static split of 783 questions answerable from pre-cutoff knowledge and a dynamic split of 288 post-cutoff questions, then evaluate whether models invoke search when internal confidence is low and how they perform when search is required to obtain updated information. Result: For GPT-5-mini and Claude Haiku 4.5, search substantially improves accuracy on static questions but worsens confidence calibration. On dynamic questions both models invoke search frequently yet stay below 70% accuracy, mainly because of weak query formulation, and they improve little once the initial retrieval fails. Conclusion: Built-in web search can serve as a low-latency verification layer that improves accuracy and can be invoked selectively, but models remain overconfident, skip retrieval when it is essential, and falter after failed initial queries, so overall reliability still needs work. Abstract: Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a low-latency verification layer than as a reliable analytical tool, with clear room for improvement.

[62] Skeletons Matter: Dynamic Data Augmentation for Text-to-Query

Yuchen Ji,Bo Xu,Jie Shi,Jiaqing Liang,Deqing Yang,Yu Mao,Hai Chen,Yanghua Xiao

Main category: cs.CL

TL;DR: Proposes a unified Text-to-Query task paradigm with query skeletons as a shared optimization target, together with a general dynamic data augmentation framework that reaches state-of-the-art performance on four benchmarks using only a small amount of synthesized data.

Details Motivation: Existing studies usually focus on a single query language, so their methods generalize poorly across languages. A framework that handles multiple query languages in a unified way is needed. Method: Formally define the Text-to-Query task paradigm, take query skeletons as the shared optimization target, and propose a dynamic data augmentation framework that diagnoses a model's weaknesses on specific skeletons in order to generate targeted training data. Result: Experiments on four Text-to-Query benchmarks show that the method reaches state-of-the-art performance with only a small amount of synthesized data, supporting its efficiency and generality. Conclusion: The framework generalizes well and is efficient, and it lays a foundation for unified research on Text-to-Query tasks. Abstract: The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.

[63] Knowledge-based Graphical Method for Safety Signal Detection in Clinical Trials

Francois Vandenhende,Anna Georgiou,Michalis Georgiou,Theodoros Psaras,Ellie Karekla,Elena Hadjicosta

Main category: cs.CL

TL;DR: Proposes a graphical, knowledge-based method for reviewing adverse events in clinical trials that augments MedDRA with a Safeterm semantic layer to enable automatic clustering and signal detection.

Details Motivation: To improve the clarity, efficiency, and accuracy of analyzing treatment-emergent adverse events in clinical trials, and to overcome MedDRA's limits in capturing semantic associations. Method: Build a 2-D Safeterm map that encodes semantic relationships, automatically cluster adverse event terms, and combine the map with ClinicalTrials.gov data to compute shrinkage incidence ratios and EBGM values for signal detection. Result: The method recovers all expected safety signals in three legacy trials, and the visual outputs help identify abnormal signals quickly. Conclusion: Adding a medical knowledge layer on top of MedDRA noticeably improves the interpretation of adverse events in clinical trials. Abstract: We present a graphical, knowledge-based method for reviewing treatment-emergent adverse events (AEs) in clinical trials. The approach enhances MedDRA by adding a hidden medical knowledge layer (Safeterm) that captures semantic relationships between terms in a 2-D map. Using this layer, AE Preferred Terms can be regrouped automatically into similarity clusters, and their association to the trial disease may be quantified. The Safeterm map is available online and connected to aggregated AE incidence tables from ClinicalTrials.gov. For signal detection, we compute treatment-specific disproportionality metrics using shrinkage incidence ratios. Cluster-level EBGM values are then derived through precision-weighted aggregation. Two visual outputs support interpretation: a semantic map showing AE incidence and an expectedness-versus-disproportionality plot for rapid signal detection. Applied to three legacy trials, the automated method clearly recovers all expected safety signals. Overall, augmenting MedDRA with a medical knowledge layer improves clarity, efficiency, and accuracy in AE interpretation for clinical trials.
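
The abstract does not give formulas for its shrinkage incidence ratios or for the EBGM aggregation, so the following is only a generic sketch of the two ideas it names: shrinking an incidence ratio toward 1 and combining term-level estimates with precision (inverse-variance) weights. The pseudo-count `prior`, the variance approximation, and both function names are assumptions for illustration; this is not an EBGM implementation.

```python
import math


def shrunk_log_incidence_ratio(events_trt, n_trt, events_ctl, n_ctl, prior=0.5):
    """Log incidence ratio with pseudo-counts (assumed shrinkage), plus an
    approximate variance of that log ratio."""
    rate_trt = (events_trt + prior) / (n_trt + 2 * prior)
    rate_ctl = (events_ctl + prior) / (n_ctl + 2 * prior)
    log_ratio = math.log(rate_trt / rate_ctl)
    # Delta-method style variance of a log relative risk, using smoothed counts.
    variance = (1 / (events_trt + prior) - 1 / (n_trt + 2 * prior)
                + 1 / (events_ctl + prior) - 1 / (n_ctl + 2 * prior))
    return log_ratio, variance


def precision_weighted_ratio(term_estimates):
    """Combine (log_ratio, variance) pairs for the terms in one cluster."""
    weights = [1.0 / var for _, var in term_estimates]
    pooled = sum(w * lr for w, (lr, _) in zip(weights, term_estimates)) / sum(weights)
    return math.exp(pooled)


# Hypothetical counts for two adverse event terms in the same cluster.
balanced = shrunk_log_incidence_ratio(5, 100, 5, 100)
elevated = shrunk_log_incidence_ratio(12, 100, 4, 100)
cluster_ratio = precision_weighted_ratio([balanced, elevated])
```

Terms with more events get smaller variances and therefore more weight in the pooled cluster ratio, which is the general effect precision weighting is meant to have.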

[64] Logic of Montage

Hayami Takahashi,Kensuke Takahashi

Main category: cs.CL

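TL;DR: Proposes a dynamic form of emotional expression called "Effect of Contradictory Structure", overlaps such effects through a "montage" operation to produce "Effect of Structure", and introduces the notion of "intensity" to build a theoretical framework that explains emotional dynamics in processes such as advancing to the next level of education.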

Details Motivation: To compensate for the limits of natural language in expressing emotion by exploring a non-static form of expression that reflects emotional states more accurately. Method: Define "Effect of Contradictory Structure" and its overlapping mechanism, "montage", adopt Deleuze's notion of "intensity", justify importing the term across systems by appeal to Austin's notion of "force", and build a general theoretical framework. Result: The paper proposes a broader "Structure" model that includes "Effect of Contradictory Structure" and "Effect of Structure", and demonstrates "Effect of Structure" in the setting of advancing to the next level of education. Conclusion: "Effect of Contradictory Structure" and the "montage" mechanism offer a new dynamic perspective on emotional expression, and the framework built around "intensity" helps in understanding complex emotional processes. Abstract: In expressing emotions, as an expression form separate from natural language, we propose an alternative form that complements natural language, acting as a proxy or window for emotional states. First, we set up an expression form "Effect of Contradictory Structure." "Effect of Contradictory Structure" is not static but dynamic. Effect in "Effect of Contradictory Structure" is unpleasant or pleasant, and the orientation to avoid that unpleasantness is considered pseudo-expression of will. Second, "Effect of Contradictory Structure" can be overlapped with each other. This overlapping operation is called "montage." A broader "Structure" that includes related "Effect of Contradictory Structure" and "Effect of Structure" are set up. Montage produces "Effect of Structure". In montage, it is necessary to set something like "strength," so we adopted Deleuze and Deleuze/Guattari's word "intensity" and set it as an element of our model. We set up a general theoretical framework - Word Import Between Systems (Models) and justified the import of "intensity" through Austin's use of the word "force." "Effect of Structure" process is demonstrated using the example of proceeding to the next level of education.

[65] GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

Yutong Li,Yitian Zhou,Xudong Wang,GuoChen,Caiyan Qin

Main category: cs.CL

TL;DR: Proposes GraphMind, a dynamic graph framework that combines graph neural networks with LLMs for multi-step reasoning, using a structured representation of intermediate reasoning states to improve theorem selection and conclusion generation.

Details Motivation: Existing LLMs lack an explicit, dynamic, structured representation of intermediate reasoning states in multi-step reasoning, which limits context-aware theorem selection and iterative conclusion generation. Method: Model the reasoning process as a heterogeneous evolving graph in which nodes represent conditions, theorems, and conclusions and edges represent logical dependencies. A GNN encodes the reasoning state, semantic matching drives context-aware theorem selection, and reasoning proceeds in an interpretable, structured closed loop. Result: Experiments on several QA datasets show that GraphMind significantly outperforms existing baselines on multi-step reasoning, with consistent gains. Conclusion: GraphMind improves LLM performance on multi-step reasoning with good generality and interpretability, supporting the value of structured dynamic graph modeling for complex reasoning. Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

[66] A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis

Wenxuan Mu,Jinzhong Ning,Di Zhao,Yijia Zhang

Main category: cs.CL

TL;DR: Proposes KDR-Agent, a multi-agent framework that improves multi-domain named entity recognition in low-resource settings through knowledge retrieval, disambiguation, and reflective analysis.

Details Motivation: Existing in-context-learning NER methods perform poorly when annotated data is scarce, generalize badly to new domains, and lack both external knowledge and entity disambiguation. Method: KDR-Agent is a multi-agent framework with knowledge retrieval, disambiguation, and reflection modules. It reasons with Wikipedia knowledge, natural-language type definitions, and contrastive demonstrations, and a central planner coordinates the agents. Result: Experiments on ten datasets across five domains show that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines. Conclusion: KDR-Agent reduces the dependence of low-resource NER on annotated data, strengthens domain generalization and entity disambiguation, and offers a new approach to multi-domain ICL-based NER. Abstract: In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.

[67] DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

Ziyuan Gao,Di Liang,Xianjie Wu,Philippe Morel,Minlong Peng

Main category: cs.CL

TL;DR: Proposes DeCoRL, a framework that decouples reasoning chains through coordinated reinforcement learning to enable parallel, modular reasoning, with large gains in inference speed, interpretability, and energy efficiency.

Details Motivation: Existing reinforcement learning methods for chain-of-thought reasoning have two key problems. They act as monolithic black boxes whose reward signals do not distinguish between steps, which makes errors hard to diagnose, and sequential decoding has O(n) time complexity, which hinders real-time deployment. Method: DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning) trains lightweight specialized models to generate reasoning sub-steps in parallel, designs modular reward functions that score each step independently, and uses cascaded DRPO optimization to coordinate the rewards while preserving dependencies between steps. Result: The framework reaches state-of-the-art performance on RM-Bench, RMB, and RewardBench, with 3.8 times faster inference, a 22.7% improvement in interpretability, a 72.4% reduction in energy consumption, and a 68% increase in throughput. Conclusion: Through modular, parallel reasoning, DeCoRL addresses the efficiency, interpretability, and deployment bottlenecks of earlier methods and moves complex reasoning systems toward real-time use. Abstract: Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions and hindering error diagnosis. Second, sequential decoding has O(n) time complexity. This makes real-time deployment impractical for complex reasoning tasks. We present DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning), a novel framework that transforms reasoning from sequential processing into collaborative modular orchestration. DeCoRL trains lightweight specialized models to generate reasoning sub-steps concurrently, eliminating sequential bottlenecks through parallel processing. To enable precise error attribution, the framework designs modular reward functions that score each sub-step independently. Cascaded DRPO optimization then coordinates these rewards while preserving inter-step dependencies. Comprehensive evaluation demonstrates state-of-the-art results across RM-Bench, RMB, and RewardBench, outperforming existing methods including large-scale models. DeCoRL delivers 3.8 times faster inference while maintaining superior solution quality and offers a 22.7% improvement in interpretability through explicit reward attribution. These advancements, combined with a 72.4% reduction in energy consumption and a 68% increase in throughput, make real-time deployment of complex reasoning systems a reality.

[68] A symbolic Perl algorithm for the unification of Nahuatl word spellings

Juan-José Guzmán-Landa,Jesús Vázquez-Osorio,Juan-Manuel Torres-Moreno,Ligia Quintana Torres,Miguel Figueroa-Saavedra,Martha-Lorena Avendaño-Garrido,Graham Ranger,Patricia Velázquez-Morales,Gerardo Eugenio Sierra Martínez

Main category: cs.CL

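TL;DR: Proposes a symbolic, regular-expression-based model for automatic orthographic unification of Nahuatl text that combines linguistic rules with the π-yalli corpus and is evaluated on a sentence semantic task with encouraging results.

Details Motivation: To address the text-processing difficulties caused by the coexistence of several Nahuatl orthographies and to standardize and unify text across dialects. Method: Building on an earlier Nahuatl sentence analysis algorithm and the multi-orthography π-yalli corpus, the authors implement linguistic rules as symbolic regular expressions for automatic orthographic unification and introduce a manual evaluation protocol. Result: The algorithm produces standardized sentences, and in the manual evaluation on a semantic task most of the desired features receive positive feedback. Conclusion: The symbolic model performs well on Nahuatl orthographic unification, has practical potential, and supports downstream NLP tasks. Abstract: In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called π-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules in symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.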

[69] On the Optimality of Discrete Object Naming: a Kinship Case Study

Phong Le,Mees Lindeman,Raquel G. Alhama

Main category: cs.CL

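TL;DR: Introduces an information-theoretic framework for discrete object naming systems, proves that the optimal trade-off is achieved if and only if the listener's decoder is equivalent to the speaker's Bayesian decoder, and shows through kinship experiments that this optimality emerges in learned communication systems.

Details Motivation: Prior work relies on two simplifying assumptions, optimal listeners and a universal communicative need across languages, and so overlooks the complexity of real naming systems. A more realistic theoretical framework is needed to analyze the trade-off between informativeness and complexity. Method: Introduce an information-theoretic framework for discrete object naming systems, adopt a referential game setup to formalize the decoding agreement between listener and speaker, and validate the theory experimentally in the semantic domain of kinship. Result: The paper proves that the optimal trade-off is achievable if and only if the listener's decoder is equivalent to the speaker's Bayesian decoder, and experiments show that this optimality emerges in trained communication systems. Conclusion: The optimal informativeness-complexity trade-off in naming systems depends on the listener decoding in a Bayesian way. This mechanism is theoretically grounded and also arises in simulated communication systems, which offers a new perspective on the structure of naming in natural languages. Abstract: The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.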
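
As a small illustration of what a "Bayesian decoder of the speaker" can mean, the sketch below inverts a speaker's naming distribution with Bayes' rule. The abstract does not spell out its exact definition, so reading the decoder as the posterior P(object | word) is an assumption here, and the kinship terms, prior, and function name are invented for the example.

```python
def bayesian_listener(speaker, prior):
    """speaker[o][w] = P(w | o) and prior[o] = P(o).
    Returns listener[w][o] = P(o | w), obtained by Bayes' rule."""
    words = {w for dist in speaker.values() for w in dist}
    listener = {}
    for w in words:
        joint = {o: prior[o] * speaker[o].get(w, 0.0) for o in speaker}
        total = sum(joint.values())
        if total > 0:  # skip words the speaker never produces
            listener[w] = {o: p / total for o, p in joint.items()}
    return listener


# Hypothetical naming system: one word covers both parents, another covers aunts.
speaker = {
    "mother": {"parent": 1.0},
    "father": {"parent": 1.0},
    "aunt": {"aunt": 1.0},
}
prior = {"mother": 0.5, "father": 0.25, "aunt": 0.25}
listener = bayesian_listener(speaker, prior)
```

On hearing "parent", this listener assigns probability 2/3 to the mother and 1/3 to the father, because the prior makes mothers twice as likely to be mentioned. A listener that decoded differently would, under this reading of the paper's result, fall short of the optimal trade-off.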

[70] Emotion-Enhanced Multi-Task Learning with LLMs for Aspect Category Sentiment Analysis

Yaping Chai,Haoran Xie,Joe S. Qin

Main category: cs.CL

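TL;DR: Proposes an emotion-enhanced multi-task framework for aspect category sentiment analysis (ACSA) that combines the generative capabilities of LLMs with the VAD dimensional model to improve emotion consistency, significantly improving ACSA performance.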

Details Motivation: Existing ACSA methods focus on sentiment polarity and overlook the basic emotional dimensions that shape sentiment expression, which makes fine-grained affective signals hard to capture. Method: Builds a multi-task framework that jointly learns sentiment polarity and category-specific emotions based on Ekman's six basic emotions; an LLM generates emotional descriptions, and a VAD-space projection combined with an LLM-based emotion refinement mechanism improves the consistency and accuracy of the emotion annotations. Result: Significantly outperforms strong baselines on multiple benchmark datasets, supporting the value of emotional dimensions for ACSA. Conclusion: Integrating basic emotions and the VAD dimensional model into ACSA enriches sentiment representations and improves performance, offering a new direction for fine-grained sentiment analysis. Abstract: Aspect category sentiment analysis (ACSA) has achieved remarkable progress with large language models (LLMs), yet existing approaches primarily emphasize sentiment polarity while overlooking the underlying emotional dimensions that shape sentiment expressions. This limitation hinders the model's ability to capture fine-grained affective signals toward specific aspect categories. To address this limitation, we introduce a novel emotion-enhanced multi-task ACSA framework that jointly learns sentiment polarity and category-specific emotions grounded in Ekman's six basic emotions. Leveraging the generative capabilities of LLMs, our approach enables the model to produce emotional descriptions for each aspect category, thereby enriching sentiment representations with affective expressions. Furthermore, to ensure the accuracy and consistency of the generated emotions, we introduce an emotion refinement mechanism based on the Valence-Arousal-Dominance (VAD) dimensional framework. Specifically, emotions predicted by the LLM are projected onto a VAD space, and those inconsistent with their corresponding VAD coordinates are re-annotated using a structured LLM-based refinement strategy. Experimental results demonstrate that our approach significantly outperforms strong baselines on all benchmark datasets. This underlines the effectiveness of integrating affective dimensions into ACSA.

[71] Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization

Zijian Wang,Yanxiang Ma,Chang Xu

Main category: cs.CL

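TL;DR: Proposes a hidden-state manipulation method grounded in probabilistic conditional generation; by reformulating the problem as a regularized optimization, it improves chain-of-thought reasoning in base LLMs.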

Details Motivation: Base LLMs lack specialized training and struggle with complex multi-step reasoning; existing hidden-state manipulation methods are rigid and unconstrained, which tends to cause distribution shift and degraded text quality. Method: Reformulates the elicitation of chain-of-thought reasoning as an optimization problem with balanced likelihood and prior regularization, using probabilistic conditional generation to guide hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Result: Extensive experiments on mathematical, commonsense, and logical reasoning benchmarks show consistent gains over existing steering methods. Conclusion: The method offers a theoretically principled and effective way to enhance reasoning in base LLMs. Abstract: Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle complex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reasoning potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for eliciting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach consistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.

[72] Representational Stability of Truth in Large Language Models

Samantha Dies,Courtney Maynard,Germans Savcisens,Tina Eliassi-Rad

Main category: cs.CL

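TL;DR: Introduces "representational stability", a measure of how robust an LLM's internal veracity representations are across true, false, and neither-true-nor-false content, and finds that stability stems more from epistemic familiarity than from linguistic form.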

Details Motivation: To examine how stably LLMs distinguish true, false, and neither-true-nor-false content, in particular the stability of internal representations under different kinds of semantic uncertainty. Method: Trains linear probes to separate true from not-true statements in model activations and measures how the decision boundary shifts under label changes; the evaluation uses sixteen open-source models and three factual domains and compares two types of "neither" statements: assertions about entities believed to be absent from training data (unfamiliar) and nonfactual claims from well-known fictional contexts (familiar). Result: Unfamiliar neither statements cause the largest boundary shifts (up to 40% flipped truth judgements in fragile domains such as word definitions), whereas familiar fictional statements remain more tightly clustered and change less (≤8.2%). Conclusion: Representational stability stems mainly from epistemic familiarity rather than linguistic form; the approach can serve as a diagnostic for auditing and training LLMs to keep truth assignments coherent under semantic uncertainty. Abstract: Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form.
More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
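
To make "flipped truth judgements" concrete, the following is a minimal sketch, assuming a probe is a weight vector plus a bias and that a flip means two probes (trained under different labelings) disagree on the same activation. The names `predict` and `flip_rate` and the toy activations are hypothetical; the abstract does not specify the exact metric or the training procedure.

```python
from typing import List, Sequence, Tuple

Probe = Tuple[Sequence[float], float]  # (weights, bias), an assumed representation


def predict(probe: Probe, activation: Sequence[float]) -> bool:
    """Linear probe decision: True if the activation falls on the 'true' side."""
    weights, bias = probe
    return sum(w * x for w, x in zip(weights, activation)) + bias > 0.0


def flip_rate(probe_a: Probe, probe_b: Probe,
              activations: List[Sequence[float]]) -> float:
    """Fraction of statements whose predicted label differs between two probes."""
    flips = sum(predict(probe_a, x) != predict(probe_b, x) for x in activations)
    return flips / len(activations)


# Toy 2-D activations and two probes; values are illustrative only.
activations = [[1.0, 0.2], [0.4, 0.9], [-0.5, 0.1], [-0.2, -0.8]]
probe_before = ([1.0, 0.0], 0.0)   # boundary at x0 = 0
probe_after = ([1.0, 0.0], -0.5)   # boundary shifted to x0 = 0.5
rate = flip_rate(probe_before, probe_after, activations)
print(rate)  # 0.25: only the second statement changes sides
```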

[73] In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations

Christos-Nikolaos Zacharopoulos,Revekka Kyriakoglou

Main category: cs.CL

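TL;DR: By analyzing phi-2's hidden states on sentences with plausible and implausible endings, the study finds that the ability to detect semantic anomalies rises sharply in the middle layers, accompanied by a representational subspace that first widens and then collapses, consistent with psycholinguistic findings that semantic anomalies are detected after syntactic resolution in human reading.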

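Details Motivation: To explore how and where a transformer recognizes that a sentence has become semantically anomalous, in particular the dynamics of anomaly detection and the accompanying representational changes. Method: Uses the causal language model phi-2 and a carefully curated corpus of sentences with plausible and implausible endings; analyzes the hidden states at every layer with a per-layer linear probe for semantic violations and an analysis of their effective dimensionality. Result: The linear probe struggles to distinguish plausible from implausible endings in the lower layers, but its accuracy rises sharply in the middle layers and peaks near the top; semantic violations initially widen the representational subspace, which then collapses after a mid-stack bottleneck, suggesting a shift from an exploratory phase to rapid consolidation. Conclusion: The model begins to detect semantic anomalies effectively in its middle layers, and its representational dynamics are consistent with classical psycholinguistic findings in human reading, supporting the view that semantic anomaly detection follows syntactic resolution. Abstract: How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results suggest a possible alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.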

[74] MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset

Md. Tanzim Ferdous,Naeem Ahsan Chowdhury,Prithwiraj Bhattacharjee

Main category: cs.CL

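TL;DR: Presents a new dataset of over 54,000 Bangla articles and summaries spanning multiple sources and writing styles, aimed at improving abstractive summarization for a low-resource language.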

Details Motivation: Existing Bangla summarization research concentrates on news articles with a fixed writing style and adapts poorly to varied real-world text; as Bangla digital content grows rapidly, summarization systems that handle multi-domain text are needed to reduce information overload. Method: Collects over 54,000 Bangla texts from multiple sources, including blogs (such as Cinegolpo) and newspapers (such as Samakal and The Business Standard), to build a multi-domain, multi-style dataset, and trains and evaluates deep learning and transfer learning models including LSTM, BanglaT5-small, and MTS-small. Result: The dataset markedly improves model summarization performance on multi-domain Bangla text and provides a strong benchmark and foundational resource for Bangla NLP research. Conclusion: The dataset is adaptable and practically useful, can serve as an important benchmark for future Bangla abstractive summarization research, and helps advance natural language processing for low-resource languages. Abstract: This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.

[75] Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces

Shaltiel Shmidman,Asher Fredman,Oleg Sudakov,Meriem Bendris

Main category: cs.CL

TL;DR: Compares medium-sized LLMs post-trained on reasoning traces generated by DeepSeek-R1 versus gpt-oss, evaluating accuracy and inference efficiency on math problems.

Details Motivation: With the rise of test-time scaling, using reasoning traces as supervised data to improve the reasoning of small and medium-sized models has become common, but how traces from different frontier models affect downstream models is unclear, so a systematic comparison is needed. Method: Post-trains medium-sized LLMs on reasoning traces generated by DeepSeek-R1 and by gpt-oss, and evaluates accuracy and inference efficiency on math problems. Result: The experiments compare how the two kinds of reasoning traces affect model performance and provide empirical results on accuracy and inference efficiency, though which source is better is left to further analysis. Conclusion: Reasoning traces from different frontier models may yield different performance when used for post-training; the study offers a reference for choosing high-quality reasoning data. Abstract: Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering. Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.

[76] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao,Akari Asai,Shannon Zejiang Shen,Hamish Ivison,Varsha Kishore,Jingming Zhuo,Xinran Zhao,Molly Park,Samuel G. Finlayson,David Sontag,Tyler Murray,Sewon Min,Pradeep Dasigi,Luca Soldaini,Faeze Brahman,Wen-tau Yih,Tongshuang Wu,Luke Zettlemoyer,Yoon Kim,Hannaneh Hajishirzi,Pang Wei Koh

Main category: cs.CL

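TL;DR: Proposes Reinforcement Learning with Evolving Rubrics (RLER) and uses it to train DR Tulu-8B, the first open model trained directly for open-ended, long-form deep research; it outperforms existing open models on multiple benchmarks and rivals proprietary systems.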

Details Motivation: Existing open deep research models are mostly trained on short-form QA tasks and do not extend to realistic long-form, multi-step research tasks, which lack an effective verifiable reward. Method: Proposes Reinforcement Learning with Evolving Rubrics (RLER), which constructs and maintains rubrics that co-evolve with the policy model during training to provide more discriminative, on-policy feedback, combined with an MCP-based agent infrastructure. Result: The resulting Deep Research Tulu-8B model substantially outperforms existing open models on four long-form deep research benchmarks in science, healthcare, and general domains, and matches or exceeds proprietary systems while being smaller and cheaper per query. Conclusion: RLER is an effective framework for training long-form deep research models, and DR Tulu-8B is the first high-performing open model trained directly for such tasks, advancing open deep research systems. Abstract: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

[77] Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

James Y. Huang,Sheng Zhang,Qianchu Liu,Guanghui Qin,Tinghui Zhu,Tristan Naumann,Muhao Chen,Hoifung Poon

Main category: cs.CL

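TL;DR: Proposes BeMyEyes, a modular multi-agent framework that extends LLMs to multimodal reasoning by having an efficient small vision language model (VLM) act as perceiver and a powerful LLM act as reasoner.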

Details Motivation: 现有的大规模视觉语言模型(VLMs)训练成本高,而小型VLMs虽高效但缺乏LLMs的知识和推理能力。因此需要一种既能保留LLM强大推理能力又能有效融合多模态感知的方法。 Method: 提出BeMyEyes框架,采用多智能体架构,通过对话机制协调感知者(小型VLM)和推理者(LLM)。设计数据合成与监督微调流程,训练感知者更好地与推理者协作。 Result: 实验证明,结合纯文本DeepSeek-R1与Qwen2.5-VL-7B感知者的轻量级开源系统,在多项知识密集型多模态任务上超越了GPT-4o等大型闭源VLM。 Conclusion: BeMyEyes展示了无需训练大规模多模态模型即可赋予LLM多模态推理能力的有效性、模块化和可扩展性,为未来多模态推理系统提供了新路径。 Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.

cs.CV [Back]

[78] Multimodal AI for Body Fat Estimation: Computer Vision and Anthropometry with DEXA Benchmarks

Rayan Aldajani

Main category: cs.CV

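TL;DR: Explores the feasibility of low-cost body fat percentage estimation with AI models using frontal body images and basic anthropometric data, presenting ResNet-based image models and regression models; the image-based model achieves an RMSE of 4.44% and an R^2 of 0.807.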

Details Motivation: Gold-standard methods such as DEXA are expensive and hard to access, and no public datasets exist for computer-vision-based body fat estimation, so low-cost, accessible alternatives are needed. Method: Uses a dataset of 535 samples: 253 cases with anthropometric measurements and 282 body images scraped from Reddit with self-reported body fat percentages; builds ResNet-based image models and regression models on anthropometric data, and outlines a multimodal fusion framework for future extension. Result: The image-based model achieves a root mean square error (RMSE) of 4.44% and a coefficient of determination (R^2) of 0.807. Conclusion: AI-assisted models can serve as convenient, low-cost body fat estimation tools with potential for adoption in health and fitness. Abstract: Tracking body fat percentage is essential for effective weight management, yet gold-standard methods such as DEXA scans remain expensive and inaccessible for most people. This study evaluates the feasibility of artificial intelligence (AI) models as low-cost alternatives using frontal body images and basic anthropometric data. The dataset consists of 535 samples: 253 cases with recorded anthropometric measurements (weight, height, neck, ankle, and wrist) and 282 images obtained via web scraping from Reddit posts with self-reported body fat percentages, including some reported as DEXA-derived by the original posters. Because no public datasets exist for computer-vision-based body fat estimation, this dataset was compiled specifically for this study. Two approaches were developed: (1) ResNet-based image models and (2) regression models using anthropometric measurements. A multimodal fusion framework is also outlined for future expansion once paired datasets become available. The image-based model achieved a Root Mean Square Error (RMSE) of 4.44% and a Coefficient of Determination (R^2) of 0.807. These findings demonstrate that AI-assisted models can offer accessible and low-cost body fat estimates, supporting future consumer applications in health and fitness.
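
For readers unfamiliar with the two reported metrics, here is a minimal sketch of their standard definitions. The sample values are made up for illustration and are unrelated to the study's data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical body fat percentages (reference vs. predicted).
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 20.0, 28.0]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Because RMSE is expressed in the same units as the target, an RMSE of 4.44% here means roughly 4.44 percentage points of body fat.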

[79] Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding

Yassir Benhammou,Suman Kalyan,Sujay Kumar

Main category: cs.CV

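TL;DR: Proposes a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data for end-to-end automation of metadata extraction and semantic clustering of broadcast content. Trained on the recently introduced LUMA dataset, the model learns cross-modal semantic structure through joint reconstruction losses and significantly outperforms linear baselines on clustering and alignment metrics.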

Details Motivation: Existing AI systems typically handle a single modality and struggle to capture the complex cross-modal relationships in broadcast content, which limits automated metadata generation; a unified model that fuses multimodal information is needed. Method: Proposes a Multimodal Autoencoder (MMAE) trained on the LUMA dataset that learns modality-invariant semantic structures by minimizing joint reconstruction losses across modalities, without relying on large paired or contrastive datasets. Result: Significantly outperforms linear baselines in clustering (Silhouette, ARI, NMI) and cross-modal alignment, supporting the effectiveness of reconstruction-based multimodal embeddings. Conclusion: Reconstruction-based multimodal learning can improve metadata generation, cross-modal retrieval, and content management efficiency in broadcast archives, with potential for wide use in modern broadcast workflows. Abstract: Broadcast and media organizations increasingly rely on artificial intelligence to automate the labor-intensive processes of content indexing, tagging, and metadata generation. However, existing AI systems typically operate on a single modality (such as video, audio, or text), limiting their understanding of complex, cross-modal relationships in broadcast material. In this work, we propose a Multimodal Autoencoder (MMAE) that learns unified representations across text, audio, and visual data, enabling end-to-end automation of metadata extraction and semantic clustering. The model is trained on the recently introduced LUMA dataset, a fully aligned benchmark of multimodal triplets representative of real-world media content. By minimizing joint reconstruction losses across modalities, the MMAE discovers modality-invariant semantic structures without relying on large paired or contrastive datasets. We demonstrate significant improvements in clustering and alignment metrics (Silhouette, ARI, NMI) compared to linear baselines, indicating that reconstruction-based multimodal embeddings can serve as a foundation for scalable metadata generation and cross-modal retrieval in broadcast archives. These results highlight the potential of reconstruction-driven multimodal learning to enhance automation, searchability, and content management efficiency in modern broadcast workflows.

[80] BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction

Zhengsen Xu,Sibo Cheng,Hongjie He,Lanying Wang,Wentao Sun,Jonathan Li,Lincoln Linlin Xu

Main category: cs.CV

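TL;DR: Presents a large wildfire risk prediction benchmark dataset with 25 years of daily data covering British Columbia and surrounding regions, comprising 38 wildfire-related covariates, and evaluates a range of time-series forecasting models on it.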

Details Motivation: Publicly available datasets fall short in supporting long-term temporal modeling, large-scale spatial coverage, and multimodal drivers, which limits data-driven approaches to wildfire risk prediction. Method: Builds a wildfire dataset covering 2.4 million square kilometers at daily resolution over 25 years, with 38 covariates spanning fire detections, weather, fuel, terrain, and anthropogenic factors, and evaluates time-series models including CNN-based, linear, Transformer-based, and Mamba-based architectures. Result: Provides a large-scale, multimodal, long-duration wildfire dataset; the experiments assess time-series models of different architectures on wildfire prediction and analyze the effectiveness of position embedding and the importance of each driving factor. Conclusion: The dataset fills the gap in high-quality benchmark data for wildfire risk prediction, supports multiple modeling paradigms, and helps advance data-driven wildfire modeling research. Abstract: Wildfire risk prediction remains a critical yet challenging task due to the complex interactions among fuel conditions, meteorology, topography, and human activity. Despite growing interest in data-driven approaches, publicly available benchmark datasets that support long-term temporal modeling, large-scale spatial coverage, and multimodal drivers remain scarce. To address this gap, we present a 25-year, daily-resolution wildfire dataset covering 240 million hectares across British Columbia and surrounding regions. The dataset includes 38 covariates, encompassing active fire detections, weather variables, fuel conditions, terrain features, and anthropogenic factors. Using this benchmark, we evaluate a diverse set of time-series forecasting models, including CNN-based, linear-based, Transformer-based, and Mamba-based architectures. We also investigate the effectiveness of position embedding and the relative importance of different fire-driving factors. The dataset and the corresponding code can be found at https://github.com/SynUW/mmFire

[81] Robustness of Structured Data Extraction from Perspectively Distorted Documents

Hyakka Nakada,Yoshiyasu Tanaka

Main category: cs.CV

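TL;DR: Investigates how perspective distortion and rotation affect document data extraction by multimodal LLMs (such as Gemini-1.5-pro), proposes an isosceles-trapezoidal transformation to simplify distortion modeling, and finds that structure-recognition accuracy degrades significantly under distortion but can be improved with a simple rotational correction.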

Details Motivation: Real-world document images are often perspectively distorted and rotated, which affects the OCR performance of multimodal LLMs; a systematic study of the impact and of possible remedies is needed. Method: Based on the observation that typical document distortions approximately follow an isosceles-trapezoidal transformation, reduces the eight-parameter problem to two parameters (rotation angle and distortion ratio), generates synthetic sample documents, and tests Gemini-1.5-pro's character-recognition and structure-recognition accuracy across parameter values. Result: Character-recognition accuracy remains relatively stable, but structure-recognition accuracy drops significantly under distortion; a simple rotational correction improves structure recognition. Conclusion: Perspective distortion and rotation significantly affect the structure-recognition ability of multimodal LLMs; a reduced-parameter distortion model helps evaluate performance, and rotational correction helps in practical OCR applications. Abstract: Optical Character Recognition (OCR) for data extraction from documents is essential to intelligent informatics, such as digitizing medical records and recognizing road signs. Multi-modal Large Language Models (LLMs) can solve this task and have shown remarkable performance. Recently, it has been noticed that the accuracy of data extraction by multi-modal LLMs can be affected when in-plane rotations are present in the documents. However, real-world document images are usually not only in-plane rotated but also perspectively distorted. This study investigates the impacts of such perturbations on the data extraction accuracy for the state-of-the-art model, Gemini-1.5-pro. Because perspective distortions have a high degree of freedom, designing experiments in the same manner as single-parametric rotations is difficult. We observed typical distortions of document images and showed that most of them approximately follow an isosceles-trapezoidal transformation, which allows us to evaluate distortions with a small number of parameters. We were able to reduce the number of independent parameters from eight to two, i.e. rotation angle and distortion ratio. Then, specific entities were extracted from synthetically generated sample documents while varying these parameters. To measure LLM performance, we evaluated not only character-recognition accuracy but also structure-recognition accuracy. Whereas the former represents the classical indicators for optical character recognition, the latter is related to the correctness of reading order.
In particular, the structure-recognition accuracy was found to be significantly degraded by document distortion. In addition, we found that this accuracy can be improved by a simple rotational correction. This insight will contribute to the practical use of multi-modal LLMs for OCR tasks.
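
The two-parameter idea can be illustrated with a minimal sketch, assuming one possible parameterization: the distortion ratio is the top-edge width divided by the bottom-edge width of an isosceles trapezoid, and the rotation is applied in-plane about the page centre. The function `trapezoid_corners` and this parameterization are hypothetical; the abstract does not give the paper's exact definition.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def trapezoid_corners(width: float, height: float,
                      angle_deg: float, ratio: float) -> List[Point]:
    """Corners (top-left, top-right, bottom-right, bottom-left) of a page of the
    given size after an isosceles-trapezoidal distortion and in-plane rotation.

    ratio is the assumed top-width / bottom-width; ratio = 1 means no distortion.
    Image coordinates are used: y grows downward, origin at the page centre.
    """
    half_w, half_h = width / 2.0, height / 2.0
    top_half = half_w * ratio
    corners = [(-top_half, -half_h), (top_half, -half_h),
               (half_w, half_h), (-half_w, half_h)]
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Standard 2-D rotation applied to every corner.
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in corners]


# A 4 x 2 page whose top edge is shrunk to half the bottom edge, no rotation.
print(trapezoid_corners(4.0, 2.0, 0.0, 0.5))
```

Mapping the original rectangle's corners to these four points determines a perspective transform, which is how a sketch like this could feed a synthetic document generator.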

[82] 3D Ground Truth Reconstruction from Multi-Camera Annotations Using UKF

Linh Van Ma,Unse Fatima,Tepy Sokun Chriv,Haroon Imran,Moongu Jeon

Main category: cs.CV

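TL;DR: Proposes a UKF-based multi-camera 3D ground truth estimation method that fuses 2D annotations into accurate 3D positions and shapes; it is accurate, robust to occlusion, fully automatic, and scalable.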

Details Motivation: Existing methods are mostly limited to ground-plane information, lack full 3D shape output, and struggle with occlusion and multi-view fusion; a more accurate, robust, and automated 3D ground truth estimation scheme is needed. Method: Uses a multi-camera single-object tracking algorithm that combines homography-based projection with an Unscented Kalman Filter (UKF) to fuse 2D bounding box or keypoint annotations from multiple calibrated cameras, converting 2D image coordinates into 3D world coordinates. Result: Validated on the CMC, Wildtrack, and Panoptic datasets, the method achieves high 3D localization accuracy, handles occlusion effectively, and produces full 3D shapes, going beyond existing methods that provide only ground-plane information. Conclusion: The method enables fully automatic, scalable 3D ground truth estimation from 2D annotations alone, outputting both object position and full 3D shape, and is suited to autonomous driving, surveillance, and robotics. Abstract: Accurate 3D ground truth estimation is critical for applications such as autonomous navigation, surveillance, and robotics. This paper introduces a novel method that uses an Unscented Kalman Filter (UKF) to fuse 2D bounding box or pose keypoint ground truth annotations from multiple calibrated cameras into accurate 3D ground truth. By leveraging human-annotated 2D ground truth, our proposed method, a multi-camera single-object tracking algorithm, transforms 2D image coordinates into robust 3D world coordinates through homography-based projection and UKF-based fusion. Our proposed algorithm processes multi-view data to estimate object positions and shapes while effectively handling challenges such as occlusion. We evaluate our method on the CMC, Wildtrack, and Panoptic datasets, demonstrating high accuracy in 3D localization compared to the available 3D ground truth. Unlike existing approaches that provide only ground-plane information, our method also outputs the full 3D shape of each object. Additionally, the algorithm offers a scalable and fully automatic solution for multi-camera systems using only 2D image annotations.
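
The homography-based projection step mentioned in the abstract can be sketched as follows, assuming a known 3x3 homography per camera that maps image pixels to ground-plane coordinates. The function `apply_homography` and the matrix values are hypothetical, and the UKF fusion stage is not shown.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def apply_homography(h: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Map an image point (u, v) to plane coordinates using homogeneous coordinates."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0.0:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w  # divide out the homogeneous scale


# Illustrative homography (scale by 2, then translate); not from any real camera.
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, -1.0],
     [0.0, 0.0, 1.0]]
print(apply_homography(H, 3.0, 4.0))  # (7.0, 7.0)
```

In a multi-camera setting, each camera's annotation (for example the bottom centre of a bounding box) would be projected this way, giving several noisy ground-plane observations of the same object for a filter to fuse.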

[83] Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression

Siddiqua Namrah

Main category: cs.CV

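TL;DR: Proposes an unsupervised multi-stage deep learning framework for low-light traffic image enhancement that decomposes images into illumination and reflectance components and refines them with three specialized modules, achieving superior visual and quantitative performance without paired labels.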

Details Motivation: Nighttime and low-light traffic scenes suffer from poor visibility, noise, motion blur, non-uniform lighting, and glare, which seriously hinder perception tasks in autonomous driving and intelligent transportation systems. Method: Decomposes images into illumination and reflectance components and applies three stages: Illumination Adaptation (global and local brightness correction), Reflectance Restoration (denoising and detail recovery with spatial-channel attention), and Over-Exposure Compensation (reconstructing saturated regions and balancing luminance); training is end-to-end with self-supervised reconstruction, reflectance smoothness, perceptual consistency, and domain-aware regularization losses. Result: Outperforms existing methods on both general and traffic-specific datasets, with clear gains in PSNR, SSIM, LPIPS, and NIQE and better visual quality, improving visibility and structure preservation in real low-light traffic scenes. Conclusion: The framework delivers high-quality low-light traffic image enhancement without supervision, improves the reliability of downstream perception tasks, and is suited to practical intelligent transportation and autonomous driving applications. Abstract: Enhancing low-light traffic images is crucial for reliable perception in autonomous driving, intelligent transportation, and urban surveillance systems. Nighttime and dimly lit traffic scenes often suffer from poor visibility due to low illumination, noise, motion blur, non-uniform lighting, and glare from vehicle headlights or street lamps, which hinder tasks such as object detection and scene understanding. To address these challenges, we propose a fully unsupervised multi-stage deep learning framework for low-light traffic image enhancement. The model decomposes images into illumination and reflectance components, progressively refined by three specialized modules: (1) Illumination Adaptation, for global and local brightness correction; (2) Reflectance Restoration, for noise suppression and structural detail recovery using spatial-channel attention; and (3) Over-Exposure Compensation, for reconstructing saturated regions and balancing scene luminance. The network is trained using self-supervised reconstruction, reflectance smoothness, perceptual consistency, and domain-aware regularization losses, eliminating the need for paired ground-truth images. Experiments on general and traffic-specific datasets demonstrate superior performance over state-of-the-art methods in both quantitative metrics (PSNR, SSIM, LPIPS, NIQE) and qualitative visual quality. Our approach enhances visibility, preserves structure, and improves downstream perception reliability in real-world low-light traffic scenarios.

[84] HSMix: Hard and Soft Mixing Data Augmentation for Medical Image Segmentation

Danyang Sun,Fadi Dornaika,Nagore Barrena

Main category: cs.CV

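TL;DR: Proposes HSMix, a local image editing data augmentation method that combines hard and soft mixing for medical image segmentation and effectively mitigates data scarcity.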

Details Motivation: Medical image segmentation is often limited by data scarcity and overfitting caused by high annotation cost or rare diseases; existing self-supervised and semi-supervised methods are complex and rely on hand-crafted design, so a simpler and effective data augmentation method is needed. Method: Proposes HSMix: homogeneous regions (superpixels) from two images are hard-mixed to create a new image, and soft mixing adjusts brightness based on pixel-wise saliency coefficients; the segmentation labels are mixed in the same way to stay consistent, exploiting contour and saliency priors. Result: HSMix performs well across multiple medical imaging modalities and segmentation tasks, clearly improving model performance as a plug-and-play, model-agnostic augmentation scheme. Conclusion: By combining hard and soft local mixing, HSMix increases data diversity while preserving semantic information, and is broadly applicable and effective in mitigating medical image data scarcity. Abstract: Due to the high cost of annotation or the rarity of some diseases, medical image segmentation is often limited by data scarcity and the resulting overfitting problem. Self-supervised learning and semi-supervised learning can mitigate the data scarcity challenge to some extent. However, both of these paradigms are complex and require either hand-crafted pretexts or well-defined pseudo-labels. In contrast, data augmentation represents a relatively simple and straightforward approach to addressing data scarcity issues. It has led to significant improvements in image recognition tasks. However, the effectiveness of local image editing augmentation techniques in the context of segmentation has been less explored. We propose HSMix, a novel approach to local image editing data augmentation involving hard and soft mixing for medical semantic segmentation. In our approach, a hard-augmented image is created by combining homogeneous regions (superpixels) from two source images. A soft mixing method further adjusts the brightness of these composed regions with brightness mixing based on locally aggregated pixel-wise saliency coefficients. The ground-truth segmentation masks of the two source images undergo the same mixing operations to generate the associated masks for the augmented images. Our method fully exploits both the prior contour and saliency information, thus preserving local semantic information in the augmented images while enriching the augmentation space with more diversity. Our method is a plug-and-play solution that is model agnostic and applicable to a range of medical imaging modalities.
Extensive experimental evidence has demonstrated its effectiveness in a variety of medical segmentation tasks. The source code is available at https://github.com/DanielaPlusPlus/HSMix.
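
The hard-mixing step described above can be sketched on toy 2-D grids, assuming a precomputed superpixel label map and a chosen set of superpixel ids to copy from the second image. The function `hard_mix` and its arguments are hypothetical and do not reflect the released code; the saliency-based soft brightness mixing is not shown.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]


def hard_mix(img_a: Grid, img_b: Grid, mask_a: Grid, mask_b: Grid,
             superpixels: Grid, chosen: Set[int]) -> Tuple[Grid, Grid]:
    """Copy the chosen superpixel regions from image B into image A.

    The segmentation masks receive the same region swap so that the mixed
    image and the mixed mask stay aligned pixel by pixel.
    """
    mixed_img: Grid = []
    mixed_mask: Grid = []
    for r, label_row in enumerate(superpixels):
        img_row, mask_row = [], []
        for c, region in enumerate(label_row):
            take_b = region in chosen
            img_row.append(img_b[r][c] if take_b else img_a[r][c])
            mask_row.append(mask_b[r][c] if take_b else mask_a[r][c])
        mixed_img.append(img_row)
        mixed_mask.append(mask_row)
    return mixed_img, mixed_mask


# Toy 2 x 2 example: superpixel 1 (bottom row) is taken from image B.
img_a, img_b = [[1, 1], [1, 1]], [[9, 9], [9, 9]]
mask_a, mask_b = [[0, 0], [0, 0]], [[1, 1], [1, 1]]
superpixels = [[0, 0], [1, 1]]
mixed_img, mixed_mask = hard_mix(img_a, img_b, mask_a, mask_b, superpixels, {1})
print(mixed_img, mixed_mask)  # [[1, 1], [9, 9]] [[0, 0], [1, 1]]
```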

[85] Plug-and-Play Multi-Concept Adaptive Blending for High-Fidelity Text-to-Image Synthesis

Young-Beom Woo

Main category: cs.CV

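TL;DR: Proposes PnP-MIX, a tuning-free method for multi-concept personalized image generation that uses guided appearance attention, mask-guided noise mixing, and background dilution++ to synthesize complex scenes with high fidelity and semantic consistency.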

Details Motivation: When generating complex scenes with multiple personalized objects, existing methods often cause unintended changes in both personalized and non-personalized regions, breaking the prompt structure and the semantic consistency between regions. Method: Proposes PnP-MIX: (1) guided appearance attention to faithfully reproduce the appearance of each personalized concept; (2) a mask-guided noise mixing strategy to protect non-personalized regions; (3) background dilution++ to reduce concept leakage and improve feature localization. Result: Outperforms existing methods on both single- and multi-concept personalization, preserving background integrity and reducing concept leakage, with higher fidelity and compositional consistency. Conclusion: PnP-MIX is a general tuning-free framework that blends multiple personalized concepts into a single image with high quality, substantially improving the fidelity and robustness of multi-object personalized generation. Abstract: Integrating multiple personalized concepts into a single image has recently become a significant area of focus within Text-to-Image (T2I) generation. However, existing methods often underperform on complex multi-object scenes due to unintended alterations in both personalized and non-personalized regions. This not only fails to preserve the intended prompt structure but also disrupts interactions among regions, leading to semantic inconsistencies. To address this limitation, we introduce plug-and-play multi-concept adaptive blending for high-fidelity text-to-image synthesis (PnP-MIX), an innovative, tuning-free approach designed to seamlessly embed multiple personalized concepts into a single generated image. Our method leverages guided appearance attention to faithfully reflect the intended appearance of each personalized concept. To further enhance compositional fidelity, we present a mask-guided noise mixing strategy that preserves the integrity of non-personalized regions such as the background or unrelated objects while enabling the precise integration of personalized objects. Finally, to mitigate concept leakage, i.e., the inadvertent leakage of personalized concept features into other regions, we propose background dilution++, a novel strategy that effectively reduces such leakage and promotes accurate localization of features within personalized regions. Extensive experimental results demonstrate that PnP-MIX consistently surpasses existing methodologies in both single- and multi-concept personalization scenarios, underscoring its robustness and superior performance without additional model tuning.

[86] Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach

Ju-Young Oh

Main category: cs.CV

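TL;DR: Proposes FIQ, an embedding-integrated framework for foundational question generation in video question answering. FIQ generates scene-level Q&A pairs from descriptive information extracted from videos to strengthen the model's foundational understanding of video content, and uses a VQ-CAlign module to align questions with visual features. It improves generalization and reasoning and achieves state-of-the-art performance on the SUTD-TrafficQA dataset.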

Details Motivation: Existing video question answering (VQA) methods rely on event-centric Q&A pairs and lack foundational information such as object categories, spatial layout, and visual attributes, which limits the model's reasoning and generalization. Method: The FIQ framework automatically generates Q&A pairs covering foundational scene attributes from descriptive information extracted from videos, enriching the training data. A VQ-CAlign module aligns task-specific question embeddings with visual features, preserving contextual cues and improving adaptability to downstream tasks. Result: Experiments on the SUTD-TrafficQA dataset show that FIQ surpasses existing baselines and achieves state-of-the-art performance. Conclusion: By strengthening the model's foundational understanding of video content, FIQ improves the reasoning and generalization of VQA models and suggests a new direction for building VQA datasets. Abstract: Conventional VQA approaches primarily rely on question-answer (Q&A) pairs to learn the spatio-temporal dynamics of video content. However, most existing annotations are event-centric, which restricts the model's ability to capture the comprehensive context of a scene. The lack of fundamental information such as object categories, spatial configurations, and descriptive visual attributes prevents the model from forming a complete understanding of the environment, ultimately limiting its generalization and reasoning capability. In this paper, we introduce Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach (FIQ), a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. FIQ generates Q&A pairs from descriptive information extracted directly from videos, thereby enriching the dataset with core scene-level attributes. These generated pairs help the model develop a more holistic understanding of the video, leading to improved generalizability and reasoning performance. In addition, we propose a VQ-CAlign module that aligns task-specific question embeddings with corresponding visual features, preserving essential contextual cues and enhancing adaptability to downstream tasks. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.

[87] Rethinking the Encoding and Annotating of 3D Bounding Box: Corner-Aware 3D Object Detection from Point Clouds

Qinghao Meng,Junbo Yin,Jianbing Shen,Yunde Jia

Main category: cs.CV

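TL;DR: Proposes a LiDAR 3D object detection method based on corner-aligned regression. Shifting the prediction target from unstable object centers to geometrically informative corners improves detection accuracy and enables a weakly supervised paradigm that needs only BEV corner annotations.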

Details Motivation: Center-aligned regression is unstable on LiDAR point clouds because object centers often fall in sparse or empty regions, which hurts the accuracy of 3D bounding box prediction. Method: Replaces center-aligned regression with corner-aligned regression, exploiting the fact that corners lie in dense, observable regions. Geometric constraints among corners, combined with 2D image boxes, recover partial 3D box parameters from corner annotations and enable weakly supervised learning. A plug-and-play corner-aware detection head is also designed. Result: On KITTI, the method improves AP by 3.5% over a center-based baseline and reaches 83% of fully supervised accuracy using only BEV corner clicks. Conclusion: Corner-aligned regression overcomes the instability of center alignment, improves detection performance, and supports weakly supervised training at low annotation cost, showing potential for practical use. Abstract: Center-aligned regression remains dominant in LiDAR-based 3D object detection, yet it suffers from fundamental instability: object centers often fall in sparse or empty regions of the bird's-eye-view (BEV) due to the front-surface-biased nature of LiDAR point clouds, leading to noisy and inaccurate bounding box predictions. To circumvent this limitation, we revisit bounding box representation and propose corner-aligned regression, which shifts the prediction target from unstable centers to geometrically informative corners that reside in dense, observable regions. Leveraging the inherent geometric constraints among corners and image 2D boxes, partial parameters of 3D bounding boxes can be recovered from corner annotations, enabling a weakly supervised paradigm without requiring complete 3D labels. We design a simple yet effective corner-aware detection head that can be plugged into existing detectors. Experiments on KITTI show our method improves performance by 3.5% AP over a center-based baseline, and achieves 83% of fully supervised accuracy using only BEV corner clicks, demonstrating the effectiveness of our corner-aware regression strategy.
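
The relationship between the two box encodings can be sketched as follows. This is only the standard BEV geometry of a rotated rectangle, not the paper's detection head; the function names `bev_corners` and `center_from_corners` and the (cx, cy, length, width, yaw) parameterization are assumptions made for illustration.

```python
import math


def bev_corners(cx, cy, length, width, yaw):
    """Return the four BEV corners of a box given its center, size, and yaw (radians)."""
    half_l, half_w = length / 2.0, width / 2.0
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Corner offsets in the box's local frame, listed in order around the box.
    local = [(half_l, half_w), (half_l, -half_w), (-half_l, -half_w), (-half_l, half_w)]
    # Rotate each offset by yaw, then translate by the center.
    return [(cx + x * cos_y - y * sin_y, cy + x * sin_y + y * cos_y) for x, y in local]


def center_from_corners(corners):
    """The center is the mean of the corners, so a corner encoding still determines it."""
    n = len(corners)
    return (sum(x for x, _ in corners) / n, sum(y for _, y in corners) / n)


corners = bev_corners(10.0, 5.0, 4.0, 2.0, 0.0)
center = center_from_corners(corners)
```

Because the center can be recovered from the corners, regressing corners loses no box information; the abstract's argument is that corners are simply easier targets because they sit where LiDAR points are dense.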

[88] BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks?

DoYoung Kim,Jin-Seop Lee,Noo-ri Kim,SungJoon Lee,Jee-Hyong Lee

Main category: cs.CV

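TL;DR: Proposes a 1.58-bit convolution and a pre-BN residual connection that together enable the first successful binarization of depth-wise convolutions in binary neural networks, with clear gains on multiple datasets.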

Details Motivation: Extreme quantization limits the representational capacity of binary neural networks and destabilizes training, especially in lightweight architectures with depth-wise convolutions. Method: Proposes a 1.58-bit convolution to enhance expressiveness and a pre-BN residual connection that stabilizes optimization by improving the Hessian condition number. Result: With MobileNet V1 on ImageNet, the method runs at 33M OPs and outperforms prior methods with comparable OPs; across datasets including CIFAR-10 and CIFAR-100, accuracy improves by up to 9.3 percentage points. Conclusion: The method achieves the first successful binarization of depth-wise convolutions and sets a new state of the art for BNNs, with broad applicability and clear performance gains. Abstract: Recent advances in model compression have highlighted the potential of low-bit precision techniques, with Binary Neural Networks (BNNs) attracting attention for their extreme efficiency. However, extreme quantization in BNNs limits representational capacity and destabilizes training, posing significant challenges for lightweight architectures with depth-wise convolutions. To address this, we propose a 1.58-bit convolution to enhance expressiveness and a pre-BN residual connection to stabilize optimization by improving the Hessian condition number. These innovations enable, to the best of our knowledge, the first successful binarization of depth-wise convolutions in BNNs. Our method achieves 33M OPs on ImageNet with MobileNet V1, establishing a new state-of-the-art in BNNs by outperforming prior methods with comparable OPs. Moreover, it consistently outperforms existing methods across various datasets, including CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet, and Oxford Flowers 102, with accuracy improvements of up to 9.3 percentage points.
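
The abstract does not spell out what "1.58-bit" means. A common reading, assumed here, is three-valued weights in {-1, 0, +1}, since log2(3) ≈ 1.585 bits per weight. The sketch below shows one generic way to ternarize a weight vector; the threshold heuristic (0.7 times the mean absolute weight, as used in earlier ternary-weight work) and the name `ternarize` are illustrative assumptions, not the BD-Net procedure.

```python
import math


def ternarize(weights, threshold_ratio=0.7):
    """Map real-valued weights to {-1, 0, +1}.

    Assumption: weights with magnitude at or below
    threshold_ratio * mean(|w|) become 0; the rest keep only their sign.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]


# Information content of a three-valued symbol, in bits.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, 0.4, -0.8, 0.02]
ternary = ternarize(weights)
```

Compared with a strictly binary weight in {-1, +1}, the extra zero state lets a filter ignore an input entirely, which is one plausible source of the added expressiveness the abstract mentions.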

[89] Efficient Score Pre-computation for Diffusion Models via Cross-Matrix Krylov Projection

Kaikwan Lau,Andrew S. Na,Justin W. L. Wan

Main category: cs.CV

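TL;DR: Proposes a cross-matrix Krylov projection method that accelerates the solution of the large linear systems arising in score-based diffusion models, substantially improving computational efficiency.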

Details Motivation: Converting the standard stable diffusion model into the Fokker-Planck formulation requires solving a large linear system for each image, which becomes computationally expensive when training involves many images. Method: Converts the diffusion model into the Fokker-Planck formulation and proposes a cross-matrix Krylov projection method that builds a shared subspace from "seed" matrices to speed up the solves for subsequent "target" matrices. Result: Computation time drops by 15.8% to 43.7% compared with standard sparse solvers; in denoising tasks the method is up to 115× faster than DDPM; under a fixed computational budget it produces high-quality images while DDPM fails to generate recognizable content. Conclusion: Sharing a subspace across matrices reduces solver cost, making the method a practical option for efficient image generation in resource-limited settings. Abstract: This paper presents a novel framework to accelerate score-based diffusion models. It first converts the standard stable diffusion model into the Fokker-Planck formulation which results in solving large linear systems for each image. For training involving many images, it can lead to a high computational cost. The core innovation is a cross-matrix Krylov projection method that exploits mathematical similarities between matrices, using a shared subspace built from "seed" matrices to rapidly solve for subsequent "target" matrices. Our experiments show that this technique achieves a 15.8% to 43.7% time reduction over standard sparse solvers. Additionally, we compare our method against DDPM baselines in denoising tasks, showing a speedup of up to 115×. Furthermore, under a fixed computational budget, our model is able to produce high-quality images while DDPM fails to generate recognizable content, illustrating our approach is a practical method for efficient generation in resource-limited settings.

[90] Upstream Probabilistic Meta-Imputation for Multimodal Pediatric Pancreatitis Classification

Max A. Nelson,Elif Keles,Eminenur Sen Tasci,Merve Yazol,Halil Ertugrul Aktas,Ziliang Hong,Andrea Mia Bejar,Gorkem Durak,Oznur Leman Boyunaga,Ulas Bagci

Main category: cs.CV

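TL;DR: Proposes UPMI, a lightweight augmentation strategy for pediatric pancreatitis diagnosis, where samples are scarce and multimodal imaging is complex. UPMI augments data in a low-dimensional meta-feature space, combining modality-specific logistic regressions with Gaussian mixture models to generate synthetic features that improve a Random Forest classifier.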

Details Motivation: Pediatric pancreatitis diagnosis poses machine learning challenges because samples are scarce and multimodal imaging is complex, and existing methods struggle to handle both. Method: Proposes Upstream Probabilistic Meta-Imputation (UPMI), which augments data in a low-dimensional meta-feature space upstream of the meta-learner. Modality-specific logistic regressions on T1W and T2W MRI radiomics produce probability outputs that are transformed into a 7-dimensional meta-feature vector. Within each cross-validation fold, class-conditional Gaussian mixture models (GMMs) sample synthetic meta-features, which are combined with the real ones to train a Random Forest meta-classifier. Result: On 67 pediatric subjects, UPMI achieves a mean AUC of 0.908 ± 0.072, a relative gain of about 5% over a real-only baseline (AUC 0.864 ± 0.061). Conclusion: Probabilistic meta-imputation in a low-dimensional meta-feature space mitigates the small-sample and multimodal-complexity problems and improves pediatric pancreatitis classification, showing potential for clinical use. Abstract: Pediatric pancreatitis is a progressive and debilitating inflammatory condition, including acute pancreatitis and chronic pancreatitis, that presents significant clinical diagnostic challenges. Machine learning-based methods also face diagnostic challenges due to limited sample availability and multimodal imaging complexity. To address these challenges, this paper introduces Upstream Probabilistic Meta-Imputation (UPMI), a light-weight augmentation strategy that operates upstream of a meta-learner in a low-dimensional meta-feature space rather than in image space. Modality-specific logistic regressions (T1W and T2W MRI radiomics) produce probability outputs that are transformed into a 7-dimensional meta-feature vector. Class-conditional Gaussian mixture models (GMMs) are then fit within each cross-validation fold to sample synthetic meta-features that, combined with real meta-features, train a Random Forest (RF) meta-classifier. On 67 pediatric subjects with paired T1W/T2W MRIs, UPMI achieves a mean AUC of 0.908 ± 0.072, a ~5% relative gain over a real-only baseline (AUC 0.864 ± 0.061).
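
The reported relative gain can be checked directly from the two mean AUC values in the abstract. The arithmetic below uses only those two numbers and does not account for the reported standard deviations.

```latex
\[
\text{relative gain}
  = \frac{\mathrm{AUC}_{\text{UPMI}} - \mathrm{AUC}_{\text{baseline}}}{\mathrm{AUC}_{\text{baseline}}}
  = \frac{0.908 - 0.864}{0.864}
  = \frac{0.044}{0.864}
  \approx 0.051 \quad (\text{about } 5\%).
\]
```

The absolute difference is 0.044 AUC points, which is smaller than either reported standard deviation (0.072 and 0.061).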

[91] TSRE: Channel-Aware Typical Set Refinement for Out-of-Distribution Detection

Weijun Gao,Rundong He,Jinyang Dong,Yongshun Gong

Main category: cs.CV

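TL;DR: Proposes a typical set refinement method based on discriminability and activity, combined with a skewness-based correction, to improve activation rectification for out-of-distribution detection and better separate ID from OOD data.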

Details Motivation: Existing activation rectification methods overlook channel characteristics and distributional skewness, which leads to inaccurate typical set estimation and the improper inclusion of anomalous activations. Method: Designs a channel-aware typical set refinement strategy that rectifies activations using discriminability and activity, adds a skewness-based refinement to reduce distributional bias, and computes the energy score from the rectified activations for OOD detection. Result: Achieves state-of-the-art OOD detection performance on ImageNet-1K and CIFAR-100 and generalizes well across backbones and score functions. Conclusion: Channel-aware, skewness-corrected typical set refinement improves the accuracy and robustness of activation-based OOD detection. Abstract: Out-of-Distribution (OOD) detection is a critical capability for ensuring the safe deployment of machine learning models in open-world environments, where unexpected or anomalous inputs can compromise model reliability and performance. Activation-based methods play a fundamental role in OOD detection by mitigating anomalous activations and enhancing the separation between in-distribution (ID) and OOD data. However, existing methods apply activation rectification while often overlooking channels' intrinsic characteristics and distributional skewness, which results in inaccurate typical set estimation. This discrepancy can lead to the improper inclusion of anomalous activations across channels. To address this limitation, we propose a typical set refinement method based on discriminability and activity, which rectifies activations into a channel-aware typical set. Furthermore, we introduce a skewness-based refinement to mitigate distributional bias in typical set estimation. Finally, we leverage the rectified activations to compute the energy score for OOD detection. Experiments on the ImageNet-1K and CIFAR-100 benchmarks demonstrate that our method achieves state-of-the-art performance and generalizes effectively across backbones and score functions.
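
The last two steps of the pipeline (rectify activations, then score with energy) can be sketched generically. The energy score used here is the standard temperature-scaled log-sum-exp of the logits. The rectification is simplified to clamping each channel into a per-channel interval; how TSRE actually chooses those bounds (discriminability, activity, skewness) is not reproduced, and the names `clamp_activations` and `energy_score` are illustrative.

```python
import math


def clamp_activations(activations, lower, upper):
    """Clamp each channel's activation into its own [lower, upper] interval.

    Assumption: the per-channel bounds stand in for a 'typical set'; the
    paper's refinement of those bounds is not modeled here.
    """
    return [min(max(a, lo), hi) for a, lo, hi in zip(activations, lower, upper)]


def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T); higher suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


rectified = clamp_activations([5.0, -1.0, 0.3], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
score = energy_score([0.0, 0.0])
```

In a full detector the rectified activations would pass through the classifier head to produce logits, and an input would be flagged as OOD when its score falls below a threshold chosen on validation data.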

[92] SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Jieru Lin,Zhiwei Yu,Börje F. Karlsson

Main category: cs.CV

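TL;DR: SWITCH is a new embodied, task-driven benchmark that evaluates how well agents interact with physical control interfaces in real environments, covering five abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification.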

Details Motivation: Existing benchmarks rarely test grounding, partial observability (e.g., video input), or post-hoc verification in situated settings, all of which matter for safety-critical autonomous systems. Method: Introduces SWITCH-Basic, which uses egocentric RGB video input to evaluate five key abilities across 351 tasks spanning 98 real devices, and analyzes the shortcomings of current large multimodal models. Result: Commercial and open large multimodal models perform inconsistently even on single-step interactions, often over-relying on textual cues while neglecting visual or video evidence, and high aggregate scores can mask serious failures. Conclusion: SWITCH provides a reproducible benchmark for evaluating agent interaction with real-world interfaces and supports more challenging future iterations and the creation of training datasets. Abstract: Autonomous intelligence requires not only perception and reasoning, but critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand commonsense and physics reasoning, but also causal prediction and outcome verification in time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control and Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities: task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification, under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual or video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.

[93] Explainable Deep Learning for Brain Tumor Classification: Comprehensive Benchmarking with Dual Interpretability and Lightweight Deployment

Md. Mohaiminul Islam,Md. Mofazzal Hossen,Maher Ali Rusho,Nahiyan Nazah Ridita,Zarin Tasnia Shanta,Md. Simanto Haider,Ahmed Faizul Haque Dhrubo,Md. Khurshid Jahan,Mohammad Abdul Qayum

Main category: cs.CV

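TL;DR: Presents an end-to-end deep learning system for automated brain tumor classification from MRI images. It benchmarks six network architectures and delivers high accuracy, interpretability, and lightweight deployment.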

Details Motivation: To advance automated brain tumor classification from MRI images and address the shortcomings of existing methods in standardized evaluation, model interpretability, and practical deployment. Method: Uses five ImageNet-pre-trained models (VGG-16, Inception V3, ResNet-50, Inception-ResNet V2, Xception) and a custom compact CNN (1.31M parameters) under identical preprocessing, training protocol (AdamW optimizer, CosineAnnealingLR learning-rate schedule, early stopping), and evaluation metrics, with Grad-CAM and GradientShap for visual explanation. Result: Inception-ResNet V2 reaches 99.53% test accuracy with precision, recall, and F1-score of at least 99.50%. The custom compact CNN reaches 96.49% accuracy at 1/100 the size of Inception-ResNet V2 and supports real-time inference (375ms) on edge devices. Performance is further validated with IoU, Hausdorff distance, and precision-recall curves. Conclusion: The system combines high accuracy with better interpretability and deployability, suits clinical screening and triage in resource-constrained settings, and offers a complete framework for trustworthy AI in both advanced and low-resource healthcare systems. Abstract: Our study provides a full deep learning system for automated classification of brain tumors from MRI images, including six benchmarked architectures: five ImageNet-pre-trained models (VGG-16, Inception V3, ResNet-50, Inception-ResNet V2, Xception) and a custom-built, compact CNN (1.31M params). The study moves the needle forward in a number of ways: (1) full standardization of assessment, with preprocessing, training sets/protocols (AdamW optimizer, CosineAnnealingLR, early-stopping patience = 7), and performance metrics identical across all models; (2) a high level of confidence in the localizations based on prior studies, as both Grad-CAM and GradientShap explanations were used to establish anatomically important and meaningful attention regions and address the black-box issue; (3) a compact 1.31 million parameter CNN that achieved 96.49% testing accuracy and was 100 times smaller than Inception-ResNet V2 while permitting real-time inference (375ms) on edge devices; (4) full evaluation beyond accuracy reporting, based on intersection over union, Hausdorff distance, precision-recall curves, and confusion matrices across all splits. Inception-ResNet V2 reached state-of-the-art performance, achieving 99.53% accuracy on testing and a precision, recall, and F1-score of at least 99.50%, dominant performance relative to the metrics of recent studies.
We demonstrated a lightweight model that is suitable to deploy on devices that do not have multi-GPU infrastructure in under-resourced settings. This end-to-end solution considers accuracy, interpretability, and deployability of trustworthy AI to create the framework necessary for performance assessment and deployment within advanced and low-resource healthcare systems, to an extent that enables participation at the clinical screening and triage level.
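
Two parts of the shared training protocol are easy to make concrete: the cosine learning-rate schedule and early stopping with patience 7. The sketch below is framework-free and rests on assumptions: the schedule is the usual closed-form cosine annealing without restarts, and the stopping rule monitors validation loss (the abstract does not name the monitored quantity). The names `cosine_annealing_lr` and `EarlyStopping` are illustrative, not the authors' code.

```python
import math


def cosine_annealing_lr(step, t_max, lr_max, lr_min=0.0):
    """Closed-form cosine annealing from lr_max (step 0) down to lr_min (step t_max)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / t_max))


class EarlyStopping:
    """Stop once the monitored loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=7):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# One improving epoch followed by seven non-improving epochs triggers the stop.
stopper = EarlyStopping(patience=7)
decisions = [stopper.step(loss) for loss in [1.0] + [1.1] * 7]
```

In a PyTorch training loop the same roles would typically be played by `torch.optim.AdamW` and `torch.optim.lr_scheduler.CosineAnnealingLR`, with the early-stopping check run once per validation epoch.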

[94] MedPEFT-CL: Dual-Phase Parameter-Efficient Continual Learning with Medical Semantic Adapter and Bidirectional Memory Consolidation

Ziyuan Gao

Main category: cs.CV

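TL;DR: Proposes MedPEFT-CL, a parameter-efficient continual learning framework for medical vision-language tasks. Semantic-driven adapter allocation and bidirectional Fisher-memory coordination mitigate catastrophic forgetting, so the model learns new tasks while retaining earlier knowledge.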

Details Motivation: Medical vision-language segmentation models are prone to catastrophic forgetting when adapting to new anatomical structures, and continual learning methods tailored to this domain remain underexplored, which limits clinical deployment. Method: Builds a dual-phase architecture on CLIPSeg. The adaptive learning phase uses semantic similarity for adapter allocation and parameter-efficient fine-tuning. The knowledge consolidation phase uses bidirectional Fisher-memory coordination, which together with replay of challenging samples forms a reinforcing cycle. Result: Experiments on multiple medical datasets show that the method substantially reduces forgetting and maintains high performance with minimal parameter overhead, outperforming existing continual learning methods. Conclusion: MedPEFT-CL offers an efficient, scalable continual learning solution for medical vision-language tasks and supports their practical use in dynamic clinical environments. Abstract: Medical vision-language segmentation models suffer from catastrophic forgetting when adapting to new anatomical structures, requiring complete retraining that limits their clinical deployment. Although continual learning approaches have been studied for various applications, approaches specifically designed for medical vision-language tasks remain underexplored. We propose MedPEFT-CL, a parameter-efficient continual learning framework that addresses both efficient learning of new tasks and preservation of previous knowledge through a dual-phase architecture based on CLIPSeg. Our dual-phase architecture features an adaptive learning phase that employs semantic similarity-based adapter allocation and parameter-efficient fine-tuning for medical tasks through prompt similarity analysis, and a knowledge consolidation phase employing bi-directional Fisher-memory coordination. This creates a reinforcing cycle: consolidation directs replay priorities while new tasks provide challenging samples that improve retention strategies. Our key contributions are: (1) a semantic-driven adapter allocation mechanism that enables efficient learning of new medical tasks, (2) a bi-modal LoRA adaptation that significantly reduces trainable parameters while maintaining cross-modal learning, and (3) bidirectional Fisher-memory coordination that prevents catastrophic forgetting from previous medical tasks.
Extensive experiments across diverse medical datasets demonstrate superior forgetting mitigation and performance retention with minimal parameter overhead, making the framework effective for continual learning in medical vision-language scenarios.

[95] Person Recognition in Aerial Surveillance: A Decade Survey

Kien Nguyen,Feng Liu,Clinton Fookes,Sridha Sridharan,Xiaoming Liu,Arun Ross

Main category: cs.CV

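TL;DR: Surveys the last 10 years of human-centric aerial surveillance from drones and other airborne platforms. From a computer vision and machine learning perspective, it systematically analyzes the state of research, challenges, datasets, and methods for human detection, identification, and re-identification, and points out future research directions.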

Details Motivation: Aerial surveillance is growing quickly thanks to the advantages of airborne platforms in scale, mobility, deployment, and covert observation, but human surveillance from the air faces unique challenges that call for a systematic summary and technical analysis. Method: Analyzes more than 150 papers from the last 10 years, identifies the challenges specific to human detection, identification, and re-identification in aerial settings, compiles publicly available aerial datasets, and examines how existing methods address these challenges and how they can be improved. Result: Organizes the technical progress in human-centric aerial surveillance, summarizes how mainstream methods respond to aerial challenges, lists the public datasets available for each task, and highlights limitations of current work in areas such as robustness, cross-view matching, and small-target handling. Conclusion: The survey gives a comprehensive technical overview of the field, identifies gaps and open questions, and points to directions for algorithm design and practical deployment of future aerial surveillance systems. Abstract: The rapid emergence of airborne platforms and imaging sensors is enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment, and covert observation capabilities. This paper provides a comprehensive overview of 150+ papers over the last 10 years of human-centric aerial surveillance tasks from a computer vision and machine learning perspective. It aims to provide readers with an in-depth systematic review and technical analysis of the current state of aerial surveillance tasks using drones, UAVs, and other airborne platforms. The object of interest is humans, where human subjects are to be detected, identified, and re-identified. More specifically, for each of these tasks, we first identify unique challenges in performing these tasks in an aerial setting compared to the popular ground-based setting and subsequently compile and analyze aerial datasets publicly available for each task. Most importantly, we delve deep into the approaches in the aerial surveillance literature with a focus on investigating how they presently address aerial challenges and techniques for improvement. We conclude the paper by discussing the gaps and open research questions to inform future research avenues.

[96] Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models

Weiyi Lv,Ning Zhang,Hanyang Sun,Haoran Jiang,Kai Zhao,Jing Xiao,Dan Zeng

Main category: cs.CV

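TL;DR: Proposes VMRMOT, a new RMOT framework that introduces a motion modality and multi-modal large language models (MLLMs) to strengthen the alignment among vision, motion, and language references, improving referring multi-object tracking performance.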

Details Motivation: Existing RMOT methods rely only on static language descriptions and cannot capture dynamic motion changes such as shifts in velocity and direction. This creates a temporal mismatch between the language and vision modalities and limits multi-modal tracking performance. Method: The VMRMOT framework extracts motion features from object dynamics as a motion modality and uses the temporal reasoning of MLLMs to produce motion-aware descriptions. A Vision-Motion-Reference Alignment (VMRA) module performs hierarchical cross-modal alignment, and a Motion-Guided Prediction Head (MGPH) improves the prediction head. Result: Experiments on multiple RMOT benchmarks show that VMRMOT outperforms existing state-of-the-art methods. Conclusion: VMRMOT is the first method to use MLLMs for vision-reference alignment in RMOT; integrating a motion modality yields more accurate multi-modal tracking and offers a new approach to the task. Abstract: Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. RMOT benchmarks only describe the object's appearance, relative positions, and initial motion states. This so-called static regulation fails to capture dynamic changes of the object motion, including velocity changes and motion direction shifts. This limitation not only causes a temporal discrepancy between static references and dynamic vision modality but also constrains multi-modal tracking performance. To address this limitation, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from object dynamic behaviors and, leveraging the powerful temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to explore motion modality to enhance the performance of the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.

[97] Understanding Counting Mechanisms in Large Language and Vision-Language Models

Hosein Hasani,Amirmohammad Izadi,Fatemeh Askari,Mobin Bagherian,Sadegh Mohammadian,Mohammad Izadi,Mahdieh Soleymani Baghshah

Main category: cs.CV

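TL;DR: Studies how large language models (LLMs) and large vision-language models (LVLMs) represent and process numerical information in counting tasks. Using an interpretability tool called CountScope, the paper finds a layerwise, progressive numerical representation and a transferable internal counter.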

Details Motivation: To understand how large models process numerical information and reveal their internal mechanisms in counting tasks, filling a gap in the understanding of numerical reasoning. Method: Runs controlled experiments with repeated textual and visual items, applies causal mediation analysis and activation patching, and develops a dedicated tool, CountScope, for mechanistic interpretability. Result: Tokens or visual features encode latent positional count information; numerical representations emerge progressively from lower layers (small counts) to higher layers (larger counts); a transferable internal counter is stored mainly in the final token or region; in LVLMs, numerical information shifts between background and foreground depending on spatial composition; and models rely on structural cues such as separators as counting shortcuts. Conclusion: Counting in LLMs and LVLMs is a layerwise, structured, emergent process shaped by text structure and the properties of the vision encoder, with internal numerical representations that can be extracted and transferred. Abstract: This paper examines how large language models (LLMs) and large vision-language models (LVLMs) represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze model behavior through causal mediation and activation patching. To this end, we design a specialized tool, CountScope, for mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region and transferable between contexts. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. Models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.

[98] Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Saurav Sengupta,Nazanin Moradinasab,Jiebei Liu,Donald E. Brown

Main category: cs.CV

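TL;DR: Studies how the counting performance of vision-language models (VLMs) changes with image and prompt properties. The paper introduces a synthetic benchmark dataset and evaluation framework, and explores attention-based interventions to improve counting.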

Details Motivation: When answering questions about visual properties of images, VLMs tend to rely on biases learned during training, and these biases are amplified in counting tasks that require attention to specific regions. The factors that affect counting performance therefore need systematic study. Method: Builds a synthetic benchmark dataset and evaluation framework, uses open-source VLMs to analyze how attention allocation changes with the number of objects, color, texture, background, and prompt specificity, and applies attention-based interventions to modulate focus on visual tokens at different layers. Result: VLM counting remains challenging under high visual or linguistic complexity, but certain attention interventions yield modest gains across a range of visual conditions. Conclusion: VLM counting ability depends strongly on input features and prompt design, and well-chosen attention modulation can partly mitigate bias and improve performance under specific conditions. Abstract: Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while VLM counting performance remains challenging, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.

[99] AngioDG: Interpretable Channel-informed Feature-modulated Single-source Domain Generalization for Coronary Vessel Segmentation in X-ray Angiography

Mohammad Atwany,Mojtaba Lashgari,Robin P. Choudhury,Vicente Grau,Abhirup Banerjee

Main category: cs.CV

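TL;DR: Proposes AngioDG, a channel regularization approach that improves the domain generalization of vessel segmentation models for X-ray coronary angiography, achieving the best out-of-distribution performance across multiple datasets.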

Details Motivation: Domain shifts caused by differences in imaging protocols and patient demographics, together with the lack of annotated datasets, make it hard to build generalizable XCA vessel segmentation models, and existing single-source domain generalization methods are limited by overfitting. Method: AngioDG uses a channel regularization strategy: it identifies how early feature channels contribute to task-specific metrics, then reweights channels to calibrate and amplify domain-invariant features while attenuating domain-specific ones. Result: Evaluated on 6 X-ray angiography datasets, AngioDG achieves better out-of-distribution performance than the compared methods while keeping in-domain test performance consistent. Conclusion: AngioDG improves coronary vessel segmentation under single-source domain generalization, generalizes well, is interpretable, and suits multi-center clinical data. Abstract: Cardiovascular diseases are the leading cause of death globally, with X-ray Coronary Angiography (XCA) as the gold standard during real-time cardiac interventions. Segmentation of coronary vessels from XCA can facilitate downstream quantitative assessments, such as measurement of the stenosis severity and enhancing clinical decision-making. However, developing generalizable vessel segmentation models for XCA is challenging due to variations in imaging protocols and patient demographics that cause domain shifts. These limitations are exacerbated by the lack of annotated datasets, making Single-source Domain Generalization (SDG) a necessary solution for achieving generalization. Existing SDG methods are largely augmentation-based, which may not guarantee the mitigation of overfitting to augmented or synthetic domains. We propose a novel approach, "AngioDG", to bridge this gap with a channel regularization strategy to promote generalization. Our method identifies the contributions of early feature channels to task-specific metrics for DG, facilitating interpretability, and then reweights channels to calibrate and amplify domain-invariant features while attenuating domain-specific ones. We evaluate AngioDG on 6 X-ray angiography datasets for coronary vessel segmentation, achieving the best out-of-distribution performance among the compared methods, while maintaining consistent in-domain test performance.

[100] The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation

Victor Li,Naveenraj Kamalakannan,Avinash Parnandi,Heidi Schambra,Carlos Fernandez-Granda

Main category: cs.CV

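TL;DR: Explores vision-language models (VLMs) for stroke rehabilitation, specifically for automatically quantifying rehabilitation dose and impairment. Current VLMs lack fine-grained motion understanding, but show some promise with optimized prompting and post-processing.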

Details Motivation: To explore the potential of vision-language models in digital health, particularly stroke rehabilitation, and to address automatic quantification of rehabilitation dose and impairment. Method: Formulates dose and impairment assessment as motion-identification tasks and applies VLMs to video data from 29 healthy controls and 51 stroke survivors, without task-specific training or fine-tuning. Result: Current VLMs are limited in precise quantification: dose estimates are comparable to a baseline without visual information, and impairment scores are unreliable. With optimized prompting and post-processing, however, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants. Conclusion: Although existing VLMs fall short in fine-grained motion understanding, they show early promise without fine-tuning, and better prompt engineering and post-processing may advance their use in clinical video analysis. Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across a wide range of computer-vision tasks, sparking interest in their potential for digital health applications. Here, we apply VLMs to two fundamental challenges in data-driven stroke rehabilitation: automatic quantification of rehabilitation dose and impairment from videos. We formulate these problems as motion-identification tasks, which can be addressed using VLMs. We evaluate our proposed framework on a cohort of 29 healthy controls and 51 stroke survivors. Our results show that current VLMs lack the fine-grained motion understanding required for precise quantification: dose estimates are comparable to a baseline that excludes visual information, and impairment scores cannot be reliably predicted. Nevertheless, several findings suggest future promise. With optimized prompting and post-processing, VLMs can classify high-level activities from a few frames, detect motion and grasp with moderate accuracy, and approximate dose counts within 25% of ground truth for mildly impaired and healthy participants, all without task-specific training or finetuning. These results highlight both the current limitations and emerging opportunities of VLMs for data-driven stroke rehabilitation and broader clinical video analysis.

[101] VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Lingxiao Li,Yifan Wang,Xinyan Gao,Chen Tang,Xiangyu Yue,Chenyu You

Main category: cs.CV

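TL;DR: Introduces VisReason, a large-scale dataset for visual chain-of-thought reasoning with 489K annotated examples across four domains, plus VisReason-Pro, a 165K subset with expert-level annotations and 3D spatial grounding. Fine-tuning Qwen2.5-VL on these datasets substantially improves step-by-step visual reasoning, interpretability, and cross-benchmark generalization.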

Details Motivation: Existing visual chain-of-thought (CoT) datasets are small, domain-specific, or lack human-like stepwise structure, which limits multimodal large language models on complex visual reasoning. A large-scale, diverse, spatially grounded dataset with human-like rationales is needed. Method: Builds VisReason, a dataset of 489K examples across four domains, each with multi-round, human-like rationales. A higher-quality subset, VisReason-Pro (165K examples), is produced with a stronger expert-level GPT annotator and adds depth-informed 3D spatial annotations. Qwen2.5-VL is fine-tuned on both to validate the data. Result: Qwen2.5-VL fine-tuned on VisReason and VisReason-Pro shows substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization, indicating that the dataset strengthens systematic and generalizable reasoning in MLLMs. Conclusion: VisReason provides a data foundation for complex, human-like visual reasoning in multimodal large models and may become a key resource for the next generation of multimodal intelligence. Abstract: Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.

[102] Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders

Samuel Stevens,Jacob Beattie,Tanya Berger-Wolf,Yu Su

Main category: cs.CV

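TL;DR: This paper explores the potential of sparse autoencoders (SAEs) for open-ended feature discovery in scientific foundation models, and validates on cases such as ecological imagery that they can reveal unknown patterns without labels.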

Details Motivation: Most existing methods extract structure only for pre-specified targets and do not support open-ended scientific discovery, while the volume of scientific data is enormous, creating an urgent need for tools that can discover unknown patterns from foundation model representations. Method: Sparse autoencoders (SAEs) are used to sparsely decompose foundation model representations; controlled rediscovery studies evaluate how well the learned features align with semantic concepts, with comparisons against strong unsupervised baselines. Result: SAEs show good concept alignment on a standard segmentation benchmark and, on ecological imagery, surface fine-grained anatomical structure without any segmentation or part labels, yielding a scientific case study with ground-truth validation. Conclusion: Sparse decomposition is a viable tool for exploring what scientific foundation models have learned, helping shift research from confirming the known toward discovering the unknown. Abstract: Scientific archives now contain hundreds of petabytes of data across genomics, ecology, climate, and molecular biology that could reveal undiscovered patterns if systematically analyzed at scale. Large-scale, weakly-supervised datasets in language and vision have driven the development of foundation models whose internal representations encode structure (patterns, co-occurrences and statistical regularities) beyond their training objectives. Most existing methods extract structure only for pre-specified targets; they excel at confirmation but do not support open-ended discovery of unknown patterns. We ask whether sparse autoencoders (SAEs) can enable open-ended feature discovery from foundation model representations. We evaluate this question in controlled rediscovery studies, where the learned SAE features are tested for alignment with semantic concepts on a standard segmentation benchmark and compared against strong label-free alternatives on concept-alignment metrics. Applied to ecological imagery, the same procedure surfaces fine-grained anatomical structure without access to segmentation or part labels, providing a scientific case study with ground-truth validation. While our experiments focus on vision with an ecology case study, the method is domain-agnostic and applicable to models in other sciences (e.g., proteins, genomics, weather). Our results indicate that sparse decomposition provides a practical instrument for exploring what scientific foundation models have learned, an important prerequisite for moving from confirmation to genuine discovery.
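To make the sparse decomposition in the entry above concrete, here is a minimal sketch of a generic sparse autoencoder forward pass: a ReLU encoder into an overcomplete dictionary, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The dimensions, the random weights, and the `l1_coeff` value are assumptions chosen for illustration; the abstract does not specify the paper's architecture or training objective.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=1e-3):
    """Return (codes, reconstruction, loss) for one activation vector x."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    codes = [max(0.0, p) for p in pre]              # ReLU keeps codes non-negative
    recon = [a + b for a, b in zip(matvec(w_dec, codes), b_dec)]
    mse = sum((r - t) ** 2 for r, t in zip(recon, x)) / len(x)
    sparsity = sum(abs(c) for c in codes)           # L1 penalty encourages few active codes
    return codes, recon, mse + l1_coeff * sparsity

random.seed(0)
d_model, d_dict = 4, 16                             # hypothetical sizes (overcomplete dictionary)
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_dict)]
w_dec = [[random.gauss(0, 0.5) for _ in range(d_dict)] for _ in range(d_model)]
b_enc, b_dec = [0.0] * d_dict, [0.0] * d_model

x = [0.2, -1.0, 0.5, 0.3]                           # stand-in for a foundation-model activation
codes, recon, loss = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print(sum(1 for c in codes if c > 0), "active codes; loss =", round(loss, 4))
```

In a discovery setting, each dictionary entry would then be inspected by looking at the inputs that activate it most strongly.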

[103] AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations

Dawid Wolkiewicz,Anastasiya Pechko,Przemysław Spurek,Piotr Syga

Main category: cs.CV

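TL;DR: This paper presents AEGIS, the first privacy-preserving identity masking framework for 3D Gaussian Splatting avatars, which achieves viewpoint-consistent de-identification while preserving visual realism.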

Details Motivation: The widespread use of photorealistic 3D facial avatars, especially those based on 3D Gaussian Splatting, creates new risks of online identity theft through biometric authentication systems, and existing 2D adversarial masking methods cannot meet the multi-view identity protection needs of dynamic 3D avatars. Method: AEGIS applies adversarial perturbations to the Gaussian color coefficients, guided by a pre-trained face verification network, achieving viewpoint-consistent identity protection without modifying the avatar's geometry. Result: AEGIS reduces face retrieval and verification accuracy to 0% while reaching an SSIM of 0.9555 and a PSNR of 35.52 dB, preserving perceptual quality and key facial attributes such as age, race, gender, and emotion. Conclusion: AEGIS provides efficient, retraining-free privacy protection for 3D Gaussian avatars that combines strong security with high visual fidelity, filling a gap in adversarial identity masking for dynamic 3D avatars. Abstract: The growing adoption of photorealistic 3D facial avatars, particularly those utilizing efficient 3D Gaussian Splatting representations, introduces new risks of online identity theft, especially in systems that rely on biometric authentication. While effective adversarial masking methods have been developed for 2D images, a significant gap remains in achieving robust, viewpoint-consistent identity protection for dynamic 3D avatars. To address this, we present AEGIS, the first privacy-preserving identity masking framework for 3D Gaussian Avatars that maintains the subject's perceived characteristics. Our method aims to conceal identity-related facial features while preserving the avatar's perceptual realism and functional integrity. AEGIS applies adversarial perturbations to the Gaussian color coefficients, guided by a pre-trained face verification network, ensuring consistent protection across multiple viewpoints without retraining or modifying the avatar's geometry. AEGIS achieves complete de-identification, reducing face retrieval and verification accuracy to 0%, while maintaining high perceptual quality (SSIM = 0.9555, PSNR = 35.52 dB). It also preserves key facial attributes such as age, race, gender, and emotion, demonstrating strong privacy protection with minimal visual distortion.

[104] SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

Zhimin Shao,Abhay Yadav,Rama Chellappa,Cheng Peng

Main category: cs.CV

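TL;DR: This paper proposes SPIDER, a universal image matching framework that combines 2D and 3D feature matching. It significantly outperforms existing methods under large-baseline, cross-scene conditions, and the paper also introduces a new evaluation benchmark.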

Details Motivation: Existing image matching methods perform poorly across domains and under large viewpoint changes, and 3D foundation models are insensitive to geometric detail, making fine-grained correspondences hard to capture. Method: Linear probe experiments evaluate the matching performance of different vision foundation models. The proposed SPIDER framework shares a backbone and uses two specialized heads for 2D and 3D matching, estimating correspondences from coarse to fine. Result: On a newly constructed large-baseline image matching benchmark, SPIDER significantly outperforms state-of-the-art methods and shows strong general-purpose matching ability. Conclusion: By combining the strengths of 2D and 3D matching, SPIDER achieves robust, fine-grained cross-domain image matching and offers an effective solution for vision-based spatial perception. Abstract: Reliable image correspondences form the foundation of vision-based spatial perception, enabling recovery of 3D structure and camera poses. However, unconstrained feature matching across domains such as aerial, indoor, and outdoor scenes remains challenging due to large variations in appearance, scale and viewpoint. Feature matching has been conventionally formulated as a 2D-to-2D problem; however, recent 3D foundation models provide spatial feature matching properties based on two-view geometry. While powerful, we observe that these spatially coherent matches often concentrate on dominant planar regions, e.g., walls or ground surfaces, while being less sensitive to fine-grained geometric details, particularly under large viewpoint changes. To better understand these trade-offs, we first perform linear probe experiments to evaluate the performance of various vision foundation models for image matching. Building on these insights, we introduce SPIDER, a universal feature matching framework that integrates a shared feature extraction backbone with two specialized network heads for estimating both 2D-based and 3D-based correspondences from coarse to fine. Finally, we introduce an image-matching evaluation benchmark that focuses on unconstrained scenarios with large baselines. SPIDER significantly outperforms SoTA methods, demonstrating its strong ability as a universal image-matching method.

[105] CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation

Prantik Howlader,Hoang Nguyen-Canh,Srijan Das,Jingyi Xu,Hieu Le,Dimitris Samaras

Main category: cs.CV

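TL;DR: CORA is a semi-supervised reasoning segmentation framework that uses a small amount of labeled data and a large pool of unlabeled images to achieve state-of-the-art performance on urban scene and histopathology datasets.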

Details Motivation: Instruction-following segmentation methods based on multimodal language models generalize poorly. The main bottleneck is the high cost of building datasets that pair high-quality pixel-level annotations with rich linguistic supervision, which leads to brittle performance under distribution shift. Method: CORA has three key components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM's outputs across semantically equivalent queries; and 3) token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. Result: With only 100 labeled images, CORA surpasses the baseline by +2.3% on Cityscapes; with 180 labeled images, it improves by +2.4% on PanNuke. Conclusion: CORA performs robust reasoning segmentation under very low supervision, significantly outperforms existing methods, and shows potential for reducing reliance on annotation. Abstract: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions, requiring context-dependent reasoning over the scene. Recent multimodal language models have advanced instruction-following segmentation, yet generalization remains limited. The key bottleneck is the high cost of curating diverse, high-quality pixel annotations paired with rich linguistic supervision, leading to brittle performance under distribution shift. Therefore, we present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA introduces three main components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM's outputs across semantically equivalent queries; and 3) a token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. These components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, CORA improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.
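One plausible reading of the consistency-based pseudo-label filter described above is sketched below: masks predicted for several paraphrases of the same query are compared by pairwise IoU, and the pseudo-label is kept only when the average agreement is high. The mean-IoU criterion, the `threshold` value, and the flattened binary-mask representation are assumptions for illustration, not details taken from the paper.

```python
from itertools import combinations

def iou(mask_a, mask_b):
    """Intersection-over-union of two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

def keep_pseudo_label(masks, threshold=0.8):
    """Keep a pseudo-label only if masks from paraphrased queries agree on average."""
    if len(masks) < 2:
        return False  # agreement cannot be measured from a single prediction
    scores = [iou(a, b) for a, b in combinations(masks, 2)]
    return sum(scores) / len(scores) >= threshold

# Masks predicted for three semantically equivalent queries (toy 4-pixel images).
consistent = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
inconsistent = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
print(keep_pseudo_label(consistent), keep_pseudo_label(inconsistent))
```

The threshold trades pseudo-label quantity against noise: a higher value admits fewer but cleaner unlabeled samples into training.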

[106] Latent Dirichlet Transformer VAE for Hyperspectral Unmixing with Bundled Endmembers

Giancarlo Giannetti,Faisal Z. Qureshi

Main category: cs.CV

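TL;DR: The paper proposes LDVAE-T, a variational autoencoder with a transformer encoder and a Dirichlet prior for hyperspectral unmixing. It models materials as endmember bundles with a mean and covariance, combines global context with physical constraints, and achieves superior abundance estimation and endmember extraction on several datasets.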

Details Motivation: Spectral mixing in hyperspectral images obscures pure material signatures. Traditional methods rely on fixed endmember spectra and struggle to model spectral variability, so an unmixing method is needed that captures global context while satisfying physical constraints. Method: LDVAE-T combines a transformer encoder with a Dirichlet-prior variational autoencoder. A Dirichlet distribution in the latent space guarantees non-negative, sum-to-one abundances. For each endmember and each patch, the decoder predicts a mean spectrum and a structured covariance, forming endmember bundles; reconstructions are obtained by mixing these bundles with the abundances produced by the transformer encoder. Result: On three benchmark datasets (Samson, Jasper Ridge, and HYDICE Urban), LDVAE-T outperforms state-of-the-art models in abundance estimation (RMSE) and endmember extraction (SAD). Conclusion: By combining the global modeling capability of transformers with a physically motivated Dirichlet prior, LDVAE-T improves the accuracy and interpretability of hyperspectral unmixing, and is more robust when handling spectral variability and complex mixing scenarios. Abstract: Hyperspectral images capture rich spectral information that enables per-pixel material identification; however, spectral mixing often obscures pure material signatures. To address this challenge, we propose the Latent Dirichlet Transformer Variational Autoencoder (LDVAE-T) for hyperspectral unmixing. Our model combines the global context modeling capabilities of transformer architectures with physically meaningful constraints imposed by a Dirichlet prior in the latent space. This prior naturally enforces the sum-to-one and non-negativity conditions essential for abundance estimation, thereby improving the quality of predicted mixing ratios. A key contribution of LDVAE-T is its treatment of materials as bundled endmembers, rather than relying on fixed ground truth spectra. In the proposed method, our decoder predicts, for each endmember and each patch, a mean spectrum together with a structured (segmentwise) covariance that captures correlated spectral variability. Reconstructions are formed by mixing these learned bundles with Dirichlet-distributed abundances garnered from a transformer encoder, allowing the model to represent intrinsic material variability while preserving physical interpretability. We evaluate our approach on three benchmark datasets, Samson, Jasper Ridge, and HYDICE Urban, and show that LDVAE-T consistently outperforms state-of-the-art models in abundance estimation and endmember extraction, as measured by root mean squared error and spectral angle distance, respectively.
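The physical constraints and the two evaluation metrics named above can be illustrated with a small sketch. It draws Dirichlet-distributed abundances (normalized Gamma draws, which are non-negative and sum to one by construction), forms a pixel by linear mixing, and computes RMSE and spectral angle distance (SAD). The toy endmember spectra and the concentration values are invented for illustration, and the fixed spectra here stand in for the learned endmember bundles, which this sketch does not model.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw abundances from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mix(abundances, endmembers):
    """Linear mixing: each band is the abundance-weighted sum of endmember spectra."""
    n_bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers)) for b in range(n_bands)]

def rmse(estimate, reference):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(estimate, reference)) / len(reference))

def spectral_angle(u, v):
    """Spectral angle distance in radians; insensitive to overall brightness scaling."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

rng = random.Random(0)
endmembers = [[0.9, 0.7, 0.2], [0.1, 0.4, 0.8]]    # two toy materials, three bands
abundances = sample_dirichlet([2.0, 2.0], rng)      # hypothetical concentration parameters
pixel = mix(abundances, endmembers)
print(abundances, pixel)
```

Because SAD ignores scale, a spectrum and a brighter copy of it have an angle of approximately zero, which is why it is commonly paired with RMSE on abundances.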

[107] Deepfake Geography: Detecting AI-Generated Satellite Images

Mansur Yerzhanuly

Main category: cs.CV

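TL;DR: This study compares CNNs and ViTs for detecting AI-generated satellite images and finds that ViTs significantly outperform CNNs in accuracy and robustness.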

Details Motivation: Rapid progress in generative models threatens the authenticity of satellite imagery, and applying existing deepfake detection methods to satellite images faces challenges such as terrain inconsistencies and structural artifacts. Method: Using more than 130,000 RGB images from the DM-AER and FSI datasets, the study compares convolutional neural networks (CNNs) and Vision Transformers (ViTs) on detecting generated satellite images, and uses Grad-CAM and Chefer's attention attribution to improve interpretability. Result: The ViT reaches 95.11% detection accuracy, significantly higher than the CNN's 87.02%. It is better at modeling long-range dependencies and global semantic structure, and effectively identifies structural inconsistencies and repetitive texture patterns in synthetic images. Conclusion: ViTs outperform CNNs at detecting AI-generated satellite images. Future work will extend to multispectral and SAR modalities and incorporate frequency-domain analysis to strengthen detection and safeguard satellite imagery integrity in high-stakes applications. Abstract: The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer's attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT's superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.

[108] Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?

Dingrui Wang,Hongyuan Ye,Zhihao Liang,Zhexiao Sun,Zhaowei Lu,Yuchen Zhang,Yuyu Zhao,Yuan Gao,Marvin Seegert,Finn Schäfer,Haotong Qin,Wei Li,Luigi Palmieri,Felix Jahncke,Mattia Piccinini,Johannes Betz

Main category: cs.CV

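TL;DR: This paper introduces Target-Bench, the first benchmark for evaluating world models on mapless path planning toward semantic targets in real-world environments. Experiments show that state-of-the-art models perform poorly, while fine-tuning an open-source 5B-parameter model on only 325 scenarios improves its score by more than 400%.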

Details Motivation: Although recent world models generate highly realistic videos, their ability to perform robot path planning is unclear and has not been quantified. A benchmark dedicated to semantic-target-oriented, mapless path planning is needed to measure their practical potential. Method: The authors build Target-Bench, a dataset of 450 robot-collected video sequences covering 45 semantic categories with SLAM ground-truth trajectories. The evaluation pipeline recovers camera motion from generated videos and uses five complementary metrics to assess target-reaching capability, trajectory accuracy, and directional consistency. State-of-the-art models including Sora 2, Veo 3.1, and the Wan series are evaluated, along with an open-source model fine-tuned on a small amount of data. Result: The best off-the-shelf model (Wan2.2-Flash) scores 0.299 overall. The open-source 5B-parameter model fine-tuned on 325 scenarios scores 0.345, more than 400% above its base version and 15% above the best off-the-shelf model. Conclusion: Current world models remain significantly limited for robot path planning; targeted fine-tuning on a small dataset yields large gains and points to a direction for future improvement. Abstract: While recent world models generate highly realistic videos, their ability to perform robot path planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark specifically designed to evaluate world models on mapless path planning toward semantic targets in real-world environments. Target-Bench provides 450 robot-collected video sequences spanning 45 semantic categories with SLAM-based ground truth trajectories. Our evaluation pipeline recovers camera motion from generated videos and measures planning performance using five complementary metrics that quantify target-reaching capability, trajectory accuracy, and directional consistency. We evaluate state-of-the-art models including Sora 2, Veo 3.1, and the Wan series. The best off-the-shelf model (Wan2.2-Flash) achieves only 0.299 overall score, revealing significant limitations in current world models for robotic planning tasks. We show that fine-tuning an open-source 5B-parameter model on only 325 scenarios from our dataset achieves 0.345 overall score, an improvement of more than 400% over its base version (0.066) and 15% higher than the best off-the-shelf model. We will open-source the code and dataset.
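The two relative improvements quoted above follow directly from the three overall scores given in the abstract (0.066, 0.299, and 0.345). The short check below reproduces the arithmetic; `relative_gain` is a helper name introduced here for illustration.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

base, finetuned, best_off_the_shelf = 0.066, 0.345, 0.299  # overall scores from the abstract

gain_over_base = relative_gain(finetuned, base)                # about 423%
gain_over_best = relative_gain(finetuned, best_off_the_shelf)  # about 15%
print(f"{gain_over_base:.1f}% over base, {gain_over_best:.1f}% over best off-the-shelf")
```

Both figures are relative gains: the absolute difference between the fine-tuned model and the best off-the-shelf model is only 0.046 on the overall score.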

[109] Attention Guided Alignment in Efficient Vision-Language Models

Shweta Mahajan,Hoang Le,Hyojin Park,Farzad Farhadzadeh,Munawar Hayat,Fatih Porikli

Main category: cs.CV

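TL;DR: This paper proposes AGE-VLM, a new efficient vision-language model framework that strengthens visual grounding through interleaved cross-attention layers and spatial knowledge extracted from SAM, effectively reducing object hallucination.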

Details Motivation: Existing concatenation-based vision-language models often fail to distinguish semantically matching from non-matching image-text pairs, which leads to frequent object hallucination and reduces model reliability. Method: The proposed Attention-Guided Efficient VLM (AGE-VLM) uses interleaved cross-modal attention and distills spatial knowledge from the Segment Anything Model (SAM) to strengthen visual grounding. Result: The method is validated on several vision-centric benchmarks, where it is better than or comparable to prior efficient VLMs and significantly reduces object hallucination. Conclusion: Attention guidance and injected spatial knowledge effectively improve the visual understanding of efficient VLMs, offering a new direction for improving multimodal alignment and reducing hallucination. Abstract: Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of attention patterns in efficient VLMs, revealing that concatenation-based architectures frequently fail to distinguish between semantically matching and non-matching image-text pairs. This is a key factor for object hallucination in these models. To address this, we introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers to instill vision capabilities in pretrained small language models. This gives the VLM the ability to "look" at the correct image regions by leveraging spatial knowledge distilled from the Segment Anything Model (SAM), significantly reducing hallucination. We validate our approach across different vision-centric benchmarks where our method is better or comparable to prior work on efficient VLMs. Our findings provide valuable insights for future research aimed at achieving enhanced visual and linguistic understanding in VLMs.

[110] Pillar-0: A New Frontier for Radiology Foundation Models

Kumar Krishna Agrawal,Longchao Liu,Long Lian,Michael Nercessian,Natalia Harguindeguy,Yufu Wu,Peter Mikhael,Gigin Lin,Lecia V. Sequist,Florian Fintelmann,Trevor Darrell,Yutong Bai,Maggie Chung,Adam Yala

Main category: cs.CV

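TL;DR: Pillar-0 is a radiology foundation model trained on large-scale radiology imaging data. Combined with the RATE evaluation framework, it substantially outperforms existing models across a range of clinical tasks, advancing high-accuracy, scalable medical image analysis.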

Details Motivation: Existing medical foundation models are limited when handling volumetric data such as CT and MRI: they reduce 3D data to low-fidelity 2D slices, discard grayscale contrast information, and lack evaluation frameworks that reflect real clinical practice. A more efficient, higher-fidelity, and more rigorously evaluated radiology foundation model is needed. Method: Pillar-0 is pretrained on 42,990 abdomen-pelvis CTs, 86,411 chest CTs, 14,348 head CTs, and 11,543 breast MRIs. The RATE framework uses large language models (LLMs) to automatically extract structured labels for 366 radiologic findings, enabling high-fidelity 3D modeling and precise evaluation. Result: On internal test sets, Pillar-0 reaches mean AUROCs of 86.4, 88.0, 90.1, and 82.9 on abdomen-pelvis CT, chest CT, head CT, and breast MRI, respectively, outperforming MedGemma, MedImageInsight, Lingshu, and Merlin by 7.8-15.8 AUROC points and ranking best on 87.2% (319 of 366) of tasks. It also outperforms Merlin on the external Stanford Abdominal CT cohort (82.2 vs 80.6 AUROC). In addition, it exceeds Sybil by 3.0 C-index points on lung cancer risk prediction and reaches >95 AUROC on brain hemorrhage detection using only 1/20th of the data. Conclusion: Together, Pillar-0 and RATE form an open, clinically rigorous foundation that overcomes computational, data, and evaluation constraints and supports high-performance radiology applications that were previously infeasible, with broad potential for clinical translation. Abstract: Radiology plays an integral role in modern medicine, yet rising imaging volumes have far outpaced workforce growth. Foundation models offer a path toward assisting with the full spectrum of radiology tasks, but existing medical models remain limited: they process volumetric CT and MRI as low-fidelity 2D slices, discard critical grayscale contrast information, and lack evaluation frameworks that reflect real clinical practice. We introduce Pillar-0, a radiology foundation model pretrained on 42,990 abdomen-pelvis CTs, 86,411 chest CTs, 14,348 head CTs, and 11,543 breast MRIs from a large academic center, together with RATE, a scalable framework that extracts structured labels for 366 radiologic findings with near-perfect accuracy using LLMs. Across internal test sets of 14,230 abdomen-pelvis CTs, 10,646 chest CTs, 4,906 head CTs, and 1,585 breast MRIs, Pillar-0 establishes a new performance frontier, achieving mean AUROCs of 86.4, 88.0, 90.1, and 82.9, outperforming MedGemma (Google), MedImageInsight (Microsoft), Lingshu (Alibaba), and Merlin (Stanford) by 7.8-15.8 AUROC points and ranking best in 87.2% (319/366) of tasks. Pillar-0 similarly outperforms all baselines in an external validation on the Stanford Abdominal CT dataset, including Merlin (82.2 vs 80.6 AUROC).
Pillar-0 extends to tasks beyond its pretraining, such as long-horizon lung cancer risk prediction, where it improves upon the state-of-the-art Sybil by 3.0 C-index points on NLST, and generalizes with gains of 5.9 (MGH) and 1.9 (CGMH). In brain hemorrhage detection, Pillar-0 obtained a >95 AUROC when using only 1/20th of the data of the next most sample efficient baseline. Pillar-0 and RATE together provide an open, clinically rigorous foundation for building high-performance radiology systems, enabling applications that were previously infeasible due to computational, data, and evaluation constraints.

[111] A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking

Chengan Che,Chao Wang,Xinyue Chen,Sophia Tsoka,Luis C. Garcia-Peraza-Herrera

Main category: cs.CV

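TL;DR: PL-Stitch is a self-supervised learning framework that uses the inherent temporal order of video frames as a supervisory signal. Two probabilistic objectives based on the Plackett-Luce model strengthen its understanding of the temporal structure of procedural activities, and it significantly outperforms existing methods on several surgical and cooking benchmarks.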

Details Motivation: Existing self-supervised learning methods lack awareness of temporal order when handling procedural activities, so models cannot distinguish forward from time-reversed sequences and fail to capture the temporal structure of a procedure. Method: PL-Stitch introduces two probabilistic objectives based on the Plackett-Luce model. The primary objective trains the model to sort sampled frames chronologically, so that it learns global workflow progression; the secondary objective is a spatio-temporal jigsaw loss that captures fine-grained, cross-frame object correlations. Result: The method is validated on five surgical and cooking datasets, with significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast). Conclusion: By explicitly modeling the temporal structure of procedural activities, PL-Stitch strengthens procedural awareness in video representation learning and offers a new approach for self-supervised procedural video understanding. Abstract: Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates two novel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.
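For readers unfamiliar with the Plackett-Luce model behind the primary objective above, the sketch below computes the standard Plackett-Luce negative log-likelihood of an ordering given per-item scores. It assumes, for illustration only, that a network assigns each sampled frame a scalar score and that earlier frames should receive higher scores; how PL-Stitch actually parameterizes scores and samples frames is not stated in the abstract.

```python
import math

def plackett_luce_nll(scores, order):
    """Negative log-likelihood of `order` (item indices, first to last) under Plackett-Luce.

    At each step the next item is chosen from the remaining ones with probability
    softmax(scores over remaining items).
    """
    remaining = list(order)
    nll = 0.0
    for idx in order[:-1]:                 # the final choice has probability 1
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        nll -= scores[idx] - log_z         # subtract log P(choose idx | remaining)
        remaining.remove(idx)
    return nll

frame_scores = [2.0, 1.0, 0.0]             # hypothetical model outputs for frames 0, 1, 2
chronological = [0, 1, 2]
reversed_order = [2, 1, 0]
print(plackett_luce_nll(frame_scores, chronological),
      plackett_luce_nll(frame_scores, reversed_order))
```

Minimizing this loss for the true chronological order pushes the scores to separate forward from time-reversed sequences, which is the failure mode the abstract's motivating experiment highlights.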

[112] REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion

Ryoma Yataka,Pu Perry Wang,Petros Boufounos,Ryuhei Takahashi

Main category: cs.CV

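TL;DR: REXO improves indoor radar object detection by introducing explicit multi-view feature association in 3D radar space and prior-informed 3D bounding box diffusion.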

Details Motivation: Existing methods rely on implicit cross-view radar feature association, which can produce ambiguous feature matches and degrade detection in complex scenes. Method: REXO extends the 2D bounding box diffusion of DiffusionDet to 3D radar space, uses noisy 3D bounding boxes to guide explicit cross-view feature association, and exploits the prior that a person is in contact with the ground to reduce the number of diffusion parameters. Result: On the two public datasets HIBER and MMVR, REXO improves over state-of-the-art methods by +4.22 AP and +11.02 AP, respectively. Conclusion: Through explicit feature association and prior-guided 3D diffusion, REXO significantly improves the accuracy and robustness of multi-view indoor radar object detection. Abstract: Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on implicit cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO utilizes these noisy 3D BBoxes to guide an explicit cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. By accounting for prior knowledge that the person is in contact with the ground, REXO reduces the number of diffusion parameters by determining them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset.

[113] Importance-Weighted Non-IID Sampling for Flow Matching Models

Xinshuang Liu,Runfa Blark Li,Shaoxiu Wei,Truong Nguyen

Main category: cs.CV

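TL;DR: The paper proposes an importance-weighted non-IID sampling framework that improves expectation estimation for flow-matching models under limited sampling budgets. Joint sampling and score-based regularization improve sample diversity and quality while keeping the estimate unbiased.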

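Details Motivation: Flow-matching models can model complex distributions effectively, but under a limited sampling budget it is hard to estimate expectations of functions accurately, especially when the expectation is dominated by rare but high-impact outcomes, where conventional independent sampling yields high-variance estimates. Method: The proposed non-IID framework jointly generates multiple samples and adds a score-function-based regularizer that pushes samples apart within high-density regions, increasing diversity while preventing off-manifold drift. A residual velocity field is learned to enable importance weighting of the non-IID samples, reproducing their marginal distribution and maintaining unbiased estimation. Result: Experiments show that the method generates diverse, high-quality samples and accurately estimates importance weights and expectations, outperforming existing methods. Conclusion: The method is the first to achieve importance weighting of non-IID flow-matching samples, improving the reliable characterization of flow model outputs and providing an effective approach to expectation estimation under complex distributions. Abstract: Flow-matching models effectively represent complex distributions, yet estimating expectations of functions of their outputs remains challenging under limited sampling budgets. Independent sampling often yields high-variance estimates, especially when rare but high-impact outcomes dominate the expectation. We propose an importance-weighted non-IID sampling framework that jointly draws multiple samples to cover diverse, salient regions of a flow's distribution while maintaining unbiased estimation via estimated importance weights. To balance diversity and quality, we introduce a score-based regularization for the diversity mechanism, which uses the score function, i.e., the gradient of the log probability, to ensure samples are pushed apart within high-density regions of the data manifold, mitigating off-manifold drift. We further develop the first approach for importance weighting of non-IID flow samples by learning a residual velocity field that reproduces the marginal distribution of the non-IID samples. Empirically, our method produces diverse, high-quality samples and accurate estimates of both importance weights and expectations, advancing the reliable characterization of flow-matching model outputs. Our code will be publicly available on GitHub.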
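The role of importance weights in the entry above can be seen in a toy discrete example. The sketch below is the textbook importance-sampling estimator with known densities, not the paper's method: samples come from a proposal `q` that visits a rare outcome more often than the target `p`, and weights `p(x)/q(x)` correct the estimate of an expectation dominated by that rare outcome. The distributions and the function `f` are invented for illustration; the paper instead learns the weights for jointly drawn, non-IID flow samples.

```python
import random

def importance_estimate(samples, f, p, q):
    """Estimate E_p[f(x)] from samples drawn from q, using weights p(x) / q(x)."""
    return sum(p[x] / q[x] * f(x) for x in samples) / len(samples)

p = {0: 0.90, 1: 0.09, 2: 0.01}      # target: outcome 2 is rare
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # proposal: covers the rare outcome more often

def f(x):
    """High-impact payoff on the rare outcome; true E_p[f] = 0.01 * 1000 = 10."""
    return 1000.0 if x == 2 else 0.0

rng = random.Random(0)
samples = rng.choices([0, 1, 2], weights=[q[0], q[1], q[2]], k=30000)
estimate = importance_estimate(samples, f, p, q)
print(round(estimate, 2))
```

Sampling directly from `p` with a small budget would often miss outcome 2 entirely and return zero, which is the high-variance behaviour the abstract describes.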

[114] QAL: A Loss for Recall Precision Balance in 3D Reconstruction

Pranay Meshram,Yash Turkar,Kartikeya Singh,Praveen Raj Masilamani,Charuvahan Adhivarahan,Karthik Dantu

Main category: cs.CV

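TL;DR: The paper proposes Quality-Aware Loss (QAL), a new loss function for volumetric learning in 3D vision that significantly improves coverage and the reconstruction of thin structures.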

Details Motivation: Existing objectives such as Chamfer Distance and Earth Mover's Distance struggle to balance recall and precision, which leads to poor reconstruction of thin structures and sparse regions. Method: QAL is a tunable loss that combines a coverage-weighted nearest-neighbor term with an attraction term for uncovered ground-truth points, explicitly decoupling recall and precision. Result: Across multiple pipelines, QAL improves on CD by 4.3 points on average and on the best alternatives by 2.8 points, recovers overlooked fine structures, and generalizes on PCN and ShapeNet. Conclusion: QAL is a principled, interpretable, and practical training objective for robust 3D vision and safety-critical robotics systems. Abstract: Volumetric learning underpins many 3D vision tasks such as completion, reconstruction, and mesh generation, yet training objectives still rely on Chamfer Distance (CD) or Earth Mover's Distance (EMD), which fail to balance recall and precision. We propose Quality-Aware Loss (QAL), a drop-in replacement for CD/EMD that combines a coverage-weighted nearest-neighbor term with an uncovered-ground-truth attraction term, explicitly decoupling recall and precision into tunable components. Across diverse pipelines, QAL achieves consistent coverage gains, improving by an average of +4.3 pts over CD and +2.8 pts over the best alternatives. Though modest in percentage, these improvements reliably recover thin structures and under-represented regions that CD/EMD overlook. Extensive ablations confirm stable performance across hyperparameters and across output resolutions, while full retraining on PCN and ShapeNet demonstrates generalization across datasets and backbones. Moreover, QAL-trained completions yield higher grasp scores under GraspNet evaluation, showing that improved coverage translates directly into more reliable robotic manipulation. QAL thus offers a principled, interpretable, and practical objective for robust 3D vision and safety-critical robotics pipelines.
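To show what "decoupling recall and precision" refers to, the sketch below splits the Chamfer Distance baseline into its two directed terms: predicted-to-ground-truth distance (a precision-like term) and ground-truth-to-predicted distance (a recall-like term). This is the standard CD decomposition, not QAL itself; the abstract does not give QAL's coverage weighting or attraction term, so they are not reproduced here. The toy point sets are invented for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def directed_term(src, dst):
    """Mean squared distance from each point in src to its nearest neighbor in dst."""
    return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_terms(pred, gt):
    """Return (precision_term, recall_term); symmetric Chamfer Distance is their sum."""
    precision_term = directed_term(pred, gt)   # are predicted points close to the surface?
    recall_term = directed_term(gt, pred)      # is every ground-truth point covered?
    return precision_term, recall_term

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]      # accurate points, but one region is missing
precision_term, recall_term = chamfer_terms(pred, gt)
print(precision_term, recall_term)
```

In this example every predicted point is exact, so the precision-like term is zero, while the missing region shows up only in the recall-like term. A loss that weights the two terms separately can penalize such coverage gaps more heavily than symmetric CD does.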

[115] Toward explainable AI approaches for breast imaging: adapting foundation models to diverse populations

Guilherme J. Cavalcante,José Gabriel A. Moreira,Gabriel A. B. do Nascimento,Vincent Dong,Alex Nguyen,Thaís G. do Rêgo,Yuri Malheiros,Telmo M. Silva Filho,Carla R. Zeballos Torrez,James C. Gee,Anne Marie McCarthy,Andrew D. A. Maidment,Bruno Barufaldi

Main category: cs.CV

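TL;DR: This study uses BiomedCLIP as a foundation model for automated BI-RADS density classification on multi-modality breast imaging, and demonstrates good generalization across modalities and clinical interpretability.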

Details Motivation: To explore the potential of foundation models in breast imaging and address their limited generalization across imaging modalities. Method: Based on BiomedCLIP, the study compares single-modality (s2D) and multi-modality (including digital breast tomosynthesis) training, handles class imbalance with weighted contrastive learning, and uses GradCAM for visual analysis. Result: The single-modality and multi-modality approaches reach similar accuracy (0.73 and 0.74, respectively). The multi-modality model achieves AUC above 0.84 for all BI-RADS categories and generalizes well to the external RSNA and EMBED datasets (AUC 0.80-0.93). Conclusion: Foundation models achieve high accuracy and good generalization in breast density classification, supporting their future extension to more breast diagnostic tasks. Abstract: Foundation models hold promise for specialized medical imaging tasks, though their effectiveness in breast imaging remains underexplored. This study leverages BiomedCLIP as a foundation model to address challenges in model generalization. BiomedCLIP was adapted for automated BI-RADS breast density classification using multi-modality mammographic data (synthesized 2D images, digital mammography, and digital breast tomosynthesis). Using 96,995 images, we compared single-modality (s2D only) and multi-modality training approaches, addressing class imbalance through weighted contrastive learning. Both approaches achieved similar accuracy (multi-modality: 0.74, single-modality: 0.73), with the multi-modality model offering broader applicability across different imaging modalities and higher AUC values consistently above 0.84 across BI-RADS categories. External validation on the RSNA and EMBED datasets showed strong generalization capabilities (AUC range: 0.80-0.93). GradCAM visualizations confirmed consistent and clinically relevant attention patterns, highlighting the model's interpretability and robustness. This research underscores the potential of foundation models for breast imaging applications, paving the way for future extensions for diagnostic tasks.

[116] Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Yujiang Pu,Zhanbo Huang,Vishnu Boddeti,Yu Kong

Main category: cs.CV

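TL;DR: This paper proposes ShowMe, a unified framework that uses video diffusion models for both text-guided image manipulation and video prediction. Structure and motion consistency rewards improve generation quality, and the method outperforms specialized models on several benchmarks.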

Details Motivation: Existing approaches to visual instruction generation usually treat image manipulation and video prediction separately, so they overlook either how actions unfold over time or the intended outcome, and lack unified modeling. Method: ShowMe unifies the two tasks by selectively activating the spatial and temporal components of video diffusion models, and introduces structure and motion consistency rewards to improve structural fidelity and temporal coherence. Result: In experiments on several benchmarks, the method outperforms expert models on both instruction-guided image manipulation and video generation. Conclusion: Video diffusion models can serve as a unified action-object state transformer with both spatial and temporal modeling capability, providing stronger generation and reasoning for interactive world simulators. Abstract: Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.

[117] JigsawComm: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception

Chenyi Wang,Zhaowei Li,Ming F. Li,Wujie Wen

Main category: cs.CV

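TL;DR: This paper proposes JigsawComm, a semantic-aware and communication-efficient multi-agent cooperative perception framework. It extracts semantically essential, non-redundant features and uses utility estimation to derive an optimal transmission policy, significantly improving perception accuracy under limited bandwidth.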

Details Motivation: Existing cooperative perception methods do not adequately account for the semantic relevance and cross-agent redundancy of sensory data under limited communication bandwidth, which makes communication inefficient. Method: The paper formulates a joint semantic feature encoding and transmission optimization problem and designs the end-to-end JigsawComm framework, which includes a regularized encoder that extracts sparse semantic features and a lightweight Feature Utility Estimator that predicts the contribution of each agent's features to the final perception task; agents exchange utility maps to compute the optimal transmission policy. Result: On the OPV2V and DAIR-V2X benchmarks, JigsawComm reduces total data volume by more than 500 times while matching or exceeding the accuracy of state-of-the-art methods. Conclusion: By transmitting semantically relevant, non-redundant features, JigsawComm substantially improves the communication efficiency of multi-agent cooperative perception and achieves a scalable O(1) communication cost, offering an efficient solution for practical deployment. Abstract: Multi-agent cooperative perception (CP) promises to overcome the inherent occlusion and sensing-range limitations of single-agent systems (e.g., autonomous driving). However, its practicality is severely constrained by the limited communication bandwidth. Existing approaches attempt to improve bandwidth efficiency via compression or heuristic message selection, without considering the semantic relevance or cross-agent redundancy of sensory data. We argue that a practical CP system must maximize the contribution of every transmitted bit to the final perception task, by extracting and transmitting semantically essential and non-redundant data. In this paper, we formulate a joint semantic feature encoding and transmission problem, which aims to maximize CP accuracy under limited bandwidth. To solve this problem, we introduce JigsawComm, an end-to-end trained, semantic-aware, and communication-efficient CP framework that learns to "assemble the puzzle" of multi-agent feature transmission. It uses a regularized encoder to extract semantically-relevant and sparse features, and a lightweight Feature Utility Estimator to predict the contribution of each agent's features to the final perception task. The resulting meta utility maps are exchanged among agents and leveraged to compute a provably optimal transmission policy, which selects features from agents with the highest utility score for each location. This policy inherently eliminates redundancy and achieves a scalable $\mathcal{O}(1)$ communication cost as the number of agents increases.
On the OPV2V and DAIR-V2X benchmarks, JigsawComm reduces the total data volume by up to $>500\times$ while achieving matching or superior accuracy compared to state-of-the-art methods.
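
As a rough illustration of the per-location selection rule described in the abstract, the sketch below picks, for each spatial cell, the agent whose utility score is highest, so each cell is transmitted by at most one agent. The flattened grid layout, the `min_utility` threshold, and the function name `select_transmitters` are assumptions made for illustration and are not taken from the paper's implementation.

```python
from typing import List, Optional


def select_transmitters(utility_maps: List[List[float]],
                        min_utility: float = 0.0) -> List[Optional[int]]:
    """For each location, return the index of the agent with the highest
    utility, or None if no agent exceeds min_utility (hypothetical threshold).

    utility_maps[a][p] is agent a's estimated utility at flattened location p.
    """
    num_agents = len(utility_maps)
    num_locations = len(utility_maps[0])
    policy: List[Optional[int]] = []
    for p in range(num_locations):
        # Agent with the largest utility at this location (ties: lowest index).
        best_agent = max(range(num_agents), key=lambda a: utility_maps[a][p])
        if utility_maps[best_agent][p] > min_utility:
            policy.append(best_agent)
        else:
            policy.append(None)  # nobody transmits this cell
    return policy


# Three agents, four locations (toy values).
maps = [
    [0.9, 0.1, 0.0, 0.4],
    [0.2, 0.8, 0.0, 0.4],
    [0.1, 0.3, 0.0, 0.7],
]
policy = select_transmitters(maps)
transmitted_cells = sum(agent is not None for agent in policy)
print(policy, transmitted_cells)
```

Because at most one agent sends each cell, the number of transmitted cells is bounded by the grid size no matter how many agents participate, which is the intuition behind a communication cost that does not grow with the number of agents.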

[118] Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng,Nilesh Kulkarni,David Hyde,Dmitriy Smirnov

Main category: cs.CV

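TL;DR: Proposes a data-efficient fine-tuning strategy that learns camera-parameter controls for video generation from sparse, low-quality synthetic data and outperforms models fine-tuned on real, high-fidelity data.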

Details Motivation: Large-scale, high-fidelity training data is hard to acquire, which limits fine-tuning video diffusion models for fine-grained control. Method: Fine-tunes the model on sparse, low-quality synthetic data using a data-efficient learning strategy. Result: Models fine-tuned on low-quality synthetic data not only achieve the desired generative controls but also outperform models trained on high-fidelity real data. Conclusion: Simple synthetic data suffices to learn complex generative controls, which challenges the reliance on high-quality training data; the paper also provides a framework that explains the phenomenon intuitively and quantitatively. Abstract: Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

[119] MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use

Ahmad Mohammadshirazi,Pinaki Prasad Guha Neogi,Dheeraj Kulshrestha,Rajiv Ramnath

Main category: cs.CV

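TL;DR: MGA-VQA is a multimodal framework for Document Visual Question Answering (DocVQA) that introduces graph-based reasoning, memory-augmented inference, and question-guided compression to improve joint understanding of textual semantics, spatial layout, and visual features while making the model more interpretable.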

Details Motivation: Existing DocVQA methods fall short in explicit spatial relationship modeling, handling of high-resolution documents, multi-hop reasoning, and interpretability, so more efficient and transparent models are needed. Method: Proposes the MGA-VQA framework, which combines token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression, models spatial relationships with a graph structure, and provides interpretable decision pathways through structured memory access. Result: Evaluated on six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO), MGA-VQA outperforms existing methods in both answer prediction and spatial localization, with higher accuracy and efficiency. Conclusion: By introducing interpretable graph-based reasoning and a structured memory mechanism, MGA-VQA achieves better performance and greater reasoning transparency on DocVQA, offering a new solution for multimodal document understanding. Abstract: Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, inefficiency with high-resolution documents, multi-hop reasoning, and limited interpretability. We propose MGA-VQA, a multi-modal framework that integrates token-level encoding, spatial graph reasoning, memory-augmented inference, and question-guided compression. Unlike prior black-box models, MGA-VQA introduces interpretable graph-based decision pathways and structured memory access for enhanced reasoning transparency. Evaluation across six benchmarks (FUNSD, CORD, SROIE, DocVQA, STE-VQA, and RICO) demonstrates superior accuracy and efficiency, with consistent improvements in both answer prediction and spatial localization.

[120] ArticFlow: Generative Simulation of Articulated Mechanisms

Jiong Lin,Jinchen Ruan,Hod Lipson

Main category: cs.CV

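TL;DR: ArticFlow is a two-stage flow matching framework for generating controllable articulated 3D shapes with both high kinematic accuracy and high shape quality.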

Details Motivation: Existing generative models perform well on static 3D shapes, but articulated 3D generation is limited by action-dependent deformations and scarce datasets. Method: Proposes ArticFlow, which comprises a latent flow stage and a point flow stage and learns a velocity field from noise to target point sets under explicit action control. Result: On MuJoCo Menagerie, ArticFlow achieves higher kinematic accuracy and better shape quality than object-specific simulators and an action-conditioned variant of static point-cloud generators. Conclusion: Action-conditioned flow matching is an effective route to controllable, high-quality articulated mechanism generation. Abstract: Recent advances in generative models have produced strong results for static 3D shapes, whereas articulated 3D generation remains challenging due to action-dependent deformations and limited datasets. We introduce ArticFlow, a two-stage flow matching framework that learns a controllable velocity field from noise to target point sets under explicit action control. ArticFlow couples (i) a latent flow that transports noise to a shape-prior code and (ii) a point flow that transports points conditioned on the action and the shape prior, enabling a single model to represent diverse articulated categories and generalize across actions. On MuJoCo Menagerie, ArticFlow functions both as a generative model and as a neural simulator: it predicts action-conditioned kinematics from a compact prior and synthesizes novel morphologies via latent interpolation. Compared with object-specific simulators and an action-conditioned variant of static point-cloud generators, ArticFlow achieves higher kinematic accuracy and better shape quality. Results show that action-conditioned flow matching is a practical route to controllable and high-quality articulated mechanism generation.
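
For readers unfamiliar with flow matching, the sketch below shows the common linear-interpolation form of the training target: a point on the straight path from a noise sample to a target sample, and the constant velocity along that path that a network would be trained to regress. This is the generic formulation, assumed here for illustration only; the abstract does not specify ArticFlow's exact path or how the action and shape prior condition the velocity network, and `flow_matching_pair` is a hypothetical helper name.

```python
from typing import List, Tuple


def flow_matching_pair(noise: List[float],
                       target: List[float],
                       t: float) -> Tuple[List[float], List[float]]:
    """Return (x_t, v) for the straight-line path between noise and target.

    x_t = (1 - t) * noise + t * target   (point on the path at time t)
    v   = target - noise                 (velocity the model should predict)
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    v = [x - n for n, x in zip(noise, target)]
    return x_t, v


# One 2-D point: noise sample, target sample, and the midpoint of the path.
x_t, v = flow_matching_pair(noise=[0.0, 2.0], target=[1.0, 4.0], t=0.5)
print(x_t, v)
```

In a conditional setup, the velocity network would additionally receive the conditioning signals (here, presumably the action and the shape-prior code) and be trained with a squared error between its prediction at `x_t` and `v`.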

[121] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Guoyang Xia,Yifeng Ding,Fengfa Li,Lei Ren,Wei Chen,Fangxiang Feng,Xiaojie Wang

Main category: cs.CV

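TL;DR: Proposes FastMMoE, a training-free acceleration framework for multimodal MoE large models that uses expert activation reduction and routing-aware visual token pruning to cut computation by up to 55% while retaining about 95.5% of the original performance.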

Details Motivation: High-resolution visual inputs cause high inference latency and heavy computation in multimodal large models, so redundant visual tokens must be reduced to support deployment in resource-constrained scenarios. Method: Starting from a routing analysis perspective, proposes two strategies: expert activation reduction for visual tokens to lower computational overhead, and routing-aware token pruning based on the similarity of routing probability distributions to remove highly redundant tokens. Result: On large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5, reduces FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, outperforming baselines such as FastV and SparseVLM. Conclusion: FastMMoE effectively balances efficiency and performance for multimodal MoE models and offers a new approach to efficient inference with high-resolution inputs. Abstract: Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.
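
The idea of pruning tokens whose routing distributions are near-duplicates can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity as the similarity measure, a greedy first-come selection order, and the `threshold` value are all choices made here for clarity, not details confirmed by the abstract, and `prune_by_routing` is a hypothetical function name.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def prune_by_routing(routing_probs: List[Sequence[float]],
                     threshold: float = 0.98) -> List[int]:
    """Greedily keep a token only if its routing distribution is not too
    similar to that of any token already kept. Returns kept token indices."""
    kept: List[int] = []
    for i, probs in enumerate(routing_probs):
        if all(cosine(probs, routing_probs[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Router probabilities over 3 experts for 4 visual tokens (toy values).
routing = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # nearly identical to token 0, so it is pruned
    [0.10, 0.10, 0.80],
    [0.10, 0.80, 0.10],
]
kept = prune_by_routing(routing)
print(kept)
```

The sketch reuses signals the MoE router already computes, so no extra model or training is needed, which is in the spirit of the training-free design the abstract describes.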

[122] When Better Teachers Don't Make Better Students: Revisiting Knowledge Distillation for CLIP Models in VQA

Pume Tuchinda,Parinthapat Pengpun,Romrawin Chumpu,Sarana Nutanong,Peerat Limkonchotiwat

Main category: cs.CV

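TL;DR: This study presents the first systematic investigation of knowledge distillation in CLIP-style vision-language models and finds that stronger teachers do not always produce better students; existing distillation frameworks actually degrade in performance when scaled.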

Details Motivation: Although knowledge distillation works well for language and vision models, its application to large-scale vision-language models such as CLIP remains limited and lacks systematic study. Method: Runs knowledge distillation experiments with CLIP-style teacher models of various scales, from standard baselines to state-of-the-art large models, and evaluates the student models on multiple downstream multimodal tasks. Result: Stronger teachers do not consistently improve student performance, and existing distillation methods degrade when scaled, performing poorly on tasks such as visual question answering in particular. Conclusion: Challenges the prevailing KD assumption that a strong teacher necessarily yields a strong student and points to the need for new, efficient distillation methods designed for multimodal settings. Abstract: Vision-language models (VLMs) have achieved remarkable success across multimodal tasks, yet their substantial computational demands hinder efficient deployment. Knowledge distillation (KD) has emerged as a powerful approach for building lightweight but competitive models, with strong evidence from both language and vision domains. However, its application to VLMs, particularly CLIP-style models, remains limited, often constrained to small-scale teachers and narrow evaluation tasks such as classification or retrieval. In this work, we present the first systematic study of distillation across a range of CLIP-style teacher models, ranging from standard baselines to large-scale state-of-the-art models. Contrary to trends observed in NLP and vision, we find that stronger teachers do not consistently yield better students; in fact, existing distillation frameworks often fail to scale, leading to degraded performance in downstream multimodal tasks such as visual question answering. Our findings challenge prevailing assumptions in KD and point toward new directions for designing parameter-efficient multimodal models.

[123] MINDiff: Mask-Integrated Negative Attention for Controlling Overfitting in Text-to-Image Personalization

Seulgi Jeong,Jaeil Kim

Main category: cs.CV

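TL;DR: Proposes MINDiff, a method that introduces a negative attention mechanism to suppress the subject's influence in irrelevant regions at inference time, effectively mitigating overfitting in text-to-image personalization without re-training.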

Details Motivation: Existing methods such as DreamBooth tend to overfit during personalization and rely on a computationally expensive prior-preservation loss, which limits user control at inference time. Method: Proposes Mask-Integrated Negative Attention Diffusion (MINDiff), which modifies the cross-attention mechanism at inference time to introduce negative attention that suppresses the subject's influence in masked irrelevant regions, and adds an adjustable parameter lambda to balance subject fidelity and text alignment. Result: Experiments on DreamBooth models show that MINDiff mitigates overfitting more effectively than class-specific prior-preservation loss while supporting semantic control and better text alignment. Conclusion: MINDiff is an effective inference-time-only method that requires no re-training and can be applied directly to existing DreamBooth models to improve the quality and controllability of personalized generation. Abstract: In the personalization process of large-scale text-to-image models, overfitting often occurs when learning a specific subject from a limited number of images. Existing methods, such as DreamBooth, mitigate this issue through a class-specific prior-preservation loss, which requires increased computational cost during training and limits user control during inference time. To address these limitations, we propose Mask-Integrated Negative Attention Diffusion (MINDiff). MINDiff introduces a novel concept, negative attention, which suppresses the subject's influence in masked irrelevant regions. We achieve this by modifying the cross-attention mechanism during inference. This enables semantic control and improves text alignment by reducing subject dominance in irrelevant regions. Additionally, at inference time, users can adjust a scale parameter lambda to balance subject fidelity and text alignment. Our qualitative and quantitative experiments on DreamBooth models demonstrate that MINDiff mitigates overfitting more effectively than class-specific prior-preservation loss. As our method operates entirely at inference time and does not alter the model architecture, it can be directly applied to existing DreamBooth models without re-training. Our code is available at https://github.com/seuleepy/MINDiff.

[124] Decoupled Audio-Visual Dataset Distillation

Wenyuan Li,Guang Li,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CV

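TL;DR: This paper proposes DAVDD, a pretraining-based decoupled audio-visual dataset distillation framework that separates common and private representations and introduces a sample-distribution joint alignment strategy, preserving cross-modal structure while retaining modality-specific information and achieving state-of-the-art audio-visual dataset distillation performance.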

Details Motivation: Existing audio-visual dataset distillation methods fall short in cross-modal alignment: initialization leads to inconsistent modality mapping spaces, and cross-modal interaction damages modality-private information. Method: Proposes the DAVDD framework, which uses a pretrained model bank to obtain stable modality features and a lightweight decoupler bank to decompose them into common and private representations; introduces Common Intermodal Matching and a Sample-Distribution Joint Alignment strategy to preserve cross-modal structure, while isolating private representations to protect modality-specific information. Result: Across multiple benchmarks, DAVDD achieves state-of-the-art distillation results under all IPC settings, significantly outperforming existing methods. Conclusion: Decoupled representation learning effectively improves audio-visual dataset distillation quality; by extracting stable features and separating common from private information, DAVDD resolves the tension between cross-modal alignment and information preservation. Abstract: Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance of the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) direct interactions between modalities tend to damage modality-specific (private) information, thereby degrading the quality of the distilled data. To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework. DAVDD leverages a diverse pretrained bank to obtain stable modality features and uses a lightweight decoupler bank to disentangle them into common and private representations. To effectively preserve cross-modal structure, we further introduce Common Intermodal Matching together with a Sample-Distribution Joint Alignment strategy, ensuring that shared representations are aligned both at the sample level and the global distribution level. Meanwhile, private representations are entirely isolated from cross-modal interaction, safeguarding modality-specific cues throughout distillation. Extensive experiments across multiple benchmarks show that DAVDD achieves state-of-the-art results under all IPC settings, demonstrating the effectiveness of decoupled representation learning for high-quality audio-visual dataset distillation. Code will be released.

[125] CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation

Yuhang Ming,Chenxin Fang,Xingyuan Yu,Fan Zhang,Weichen Dai,Wanzeng Kong,Guofeng Zhang

Main category: cs.CV

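TL;DR: Proposes CUS-GS, a compact unified structured Gaussian Splatting representation that combines multimodal semantic features with 3D geometric structure and achieves strong performance with only 6M parameters.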

Details Motivation: Existing Gaussian Splatting methods treat semantic understanding and 3D structure modeling separately and lack a unified, efficient multimodal representation. Method: Designs a voxelized anchor structure as a spatial scaffold and fuses multimodal semantic features from foundation models such as CLIP and DINOv2; proposes a multimodal latent feature allocation mechanism to unify appearance, geometry, and semantics; and adopts a feature-aware significance evaluation strategy to dynamically optimize anchor growing and pruning. Result: Experiments show that CUS-GS matches or exceeds state-of-the-art methods with far fewer parameters (6M), an order of magnitude fewer than the closest competing method (35M parameters). Conclusion: CUS-GS effectively bridges the gap between semantic and structural modeling, yielding an efficient, compact, and semantically consistent 3D scene representation. Abstract: Recent advances in Gaussian Splatting-based 3D scene representation have shown two major trends: semantics-oriented approaches that focus on high-level understanding but lack explicit 3D geometry modeling, and structure-oriented approaches that capture spatial structures yet provide limited semantic abstraction. To bridge this gap, we present CUS-GS, a compact unified structured Gaussian Splatting representation, which connects multimodal semantic features with structured 3D geometry. Specifically, we design a voxelized anchor structure that constructs a spatial scaffold, while extracting multimodal semantic features from a set of foundation models (e.g., CLIP, DINOv2, SEEM). Moreover, we introduce a multimodal latent feature allocation mechanism to unify appearance, geometry, and semantics across heterogeneous feature spaces, ensuring a consistent representation across multiple foundation models. Finally, we propose a feature-aware significance evaluation strategy to dynamically guide anchor growing and pruning, effectively removing redundant or invalid anchors while maintaining semantic integrity. Extensive experiments show that CUS-GS achieves competitive performance compared to state-of-the-art methods using as few as 6M parameters, an order of magnitude smaller than the closest rival at 35M, highlighting the excellent trade-off between performance and model efficiency of the proposed framework.

[126] Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation

Chenyang Jiang,Hang Zhao,Xinyu Zhang,Zhengcen Li,Qiben Shan,Shaocong Wu,Jingyong Su

Main category: cs.CV

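TL;DR: This paper proposes ADSA, an adaptive soft-label alignment module that addresses performance degradation in dataset distillation under long-tailed distributions by calibrating the soft-label biases introduced by the distillation model and the distilled images, significantly improving tail-class accuracy on datasets such as ImageNet-1k-LT.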

Details Motivation: Existing dataset distillation methods mainly target balanced datasets and perform poorly under real-world long-tailed distributions; this paper emphasizes the critical role of soft labels in long-tailed distillation and investigates the underlying mechanisms behind the performance degradation. Method: Derives an imbalance-aware generalization bound for models trained on distilled data, identifies two sources of soft-label bias, originating from the distillation model and the distilled images, by systematically perturbing the data imbalance level, and proposes the ADSA module to calibrate them. Result: On ImageNet-1k-LT (EDC, IPC=50), ADSA improves tail-class accuracy by up to 11.8% and raises overall accuracy to 41.4%, and it shows robustness and generality across distillation methods and under limited label budgets. Conclusion: ADSA is a lightweight, plug-and-play module that effectively mitigates soft-label bias in long-tailed dataset distillation and significantly improves generalization on tail classes, offering an effective solution for efficient learning in practical applications. Abstract: Dataset distillation compresses large-scale datasets into compact, highly informative synthetic data, significantly reducing storage and training costs. However, existing research primarily focuses on balanced datasets and struggles to perform under real-world long-tailed distributions. In this work, we emphasize the critical role of soft labels in long-tailed dataset distillation and uncover the underlying mechanisms contributing to performance degradation. Specifically, we derive an imbalance-aware generalization bound for models trained on a distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images, through systematic perturbation of the data imbalance levels. To address this, we propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases. This lightweight module integrates seamlessly into existing distillation pipelines and consistently improves performance. On ImageNet-1k-LT with EDC and IPC=50, ADSA improves tail-class accuracy by up to 11.8% and raises overall accuracy to 41.4%. Extensive experiments demonstrate that ADSA provides a robust and generalizable solution under limited label budgets and across a range of distillation techniques. Code is available at: https://github.com/j-cyoung/ADSA_DD.git.

[127] Frequency-Adaptive Sharpness Regularization for Improving 3D Gaussian Splatting Generalization

Youngsik Yun,Dongjun Gu,Youngjung Uh

Main category: cs.CV

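TL;DR: Proposes Frequency-Adaptive Sharpness Regularization (FASR) to improve the generalization of 3D Gaussian Splatting to novel viewpoints in few-shot scenarios.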

Details Motivation: 3D Gaussian Splatting (3DGS) tends to overfit under sparse observations, which leads to poor generalization in novel view synthesis; existing methods such as SAM cannot effectively balance high-frequency detail reconstruction against sharpness suppression because of the discrepancy between tasks. Method: Revisits 3DGS optimization from a machine learning perspective, frames novel view synthesis as a generalization problem, and proposes FASR, which regularizes the sharpness of the loss landscape by dynamically adjusting the regularization weight and neighborhood radius according to the local frequency of images. Result: FASR consistently improves a wide range of baselines across datasets and configurations, suppressing floater artifacts while preserving high-frequency details and avoiding the oversmoothing seen with SAM. Conclusion: With its frequency-aware sharpness regularization strategy, FASR improves the generalization of 3DGS in few-shot novel view synthesis and offers a new optimization approach for this direction. Abstract: Despite 3D Gaussian Splatting (3DGS) excelling in most configurations, it lacks generalization across novel viewpoints in a few-shot scenario because it overfits to the sparse observations. We revisit 3DGS optimization from a machine learning perspective, framing novel view synthesis as a generalization problem to unseen viewpoints, an underexplored direction. We propose Frequency-Adaptive Sharpness Regularization (FASR), which reformulates the 3DGS training objective, thereby guiding 3DGS to converge toward a better generalization solution. Although Sharpness-Aware Minimization (SAM) similarly reduces the sharpness of the loss landscape to improve generalization of classification models, directly applying it to 3DGS is suboptimal due to the discrepancy between the tasks. Specifically, it hinders reconstructing high-frequency details due to excessive regularization, while reducing its strength leads to under-penalizing sharpness. To address this, we reflect the local frequency of images to set the regularization weight and the neighborhood radius when estimating the local sharpness. It prevents floater artifacts in novel viewpoints and reconstructs fine details that SAM tends to oversmooth. Across datasets with various configurations, our method consistently improves a wide range of baselines. Code will be available at https://bbangsik13.github.io/FASR.

[128] PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

Yingjie Ma,Xun Lin,Yong Xu,Weicheng Xie,Zitong Yu

Main category: cs.CV

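TL;DR: Proposes PA-FAS, which constructs high-quality extended reasoning sequences and uses an answer-shuffling mechanism to enrich reasoning paths and deepen reasoning in multimodal face anti-spoofing, improving cross-domain generalization and interpretability.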

Details Motivation: Existing SFT+RL methods for multimodal FAS are limited by narrow reasoning paths and a mismatch between the supervised task and the reasoning paths, which results in weak multimodal reasoning and shortcut learning. Method: Proposes PA-FAS, which builds extended reasoning sequences from limited annotations to enrich paths and relax exploration constraints, and introduces an answer-shuffling mechanism during SFT to force the model to perform multimodal analysis rather than rely on superficial cues. Result: PA-FAS significantly improves multimodal reasoning accuracy and cross-domain generalization, and performs better in terms of fusion, generalization, and interpretability. Conclusion: PA-FAS effectively addresses limited reasoning paths and reasoning confusion in multimodal FAS, providing a unified and robust solution for trustworthy face anti-spoofing. Abstract: Face anti-spoofing (FAS) has recently advanced in multimodal fusion, cross-domain generalization, and interpretability. With large language models and reinforcement learning (RL), strategy-based training offers new opportunities to jointly model these aspects. However, multimodal reasoning is more complex than unimodal reasoning, requiring accurate feature representation and cross-modal verification while facing a scarcity of high-quality annotations, which makes direct application of RL sub-optimal. We identify two key limitations of supervised fine-tuning plus RL (SFT+RL) for multimodal FAS: (1) limited multimodal reasoning paths restrict the use of complementary modalities and shrink the exploration space after SFT, weakening the effect of RL; and (2) mismatched single-task supervision versus diverse reasoning paths causes reasoning confusion, where models may exploit shortcuts by mapping images directly to answers and ignoring the intended reasoning. To address this, we propose PA-FAS, which enhances reasoning paths by constructing high-quality extended reasoning sequences from limited annotations, enriching paths and relaxing exploration constraints. We further introduce an answer-shuffling mechanism during SFT to force comprehensive multimodal analysis instead of using superficial cues, thereby encouraging deeper reasoning and mitigating shortcut learning. PA-FAS significantly improves multimodal reasoning accuracy and cross-domain generalization, and better unifies multimodal fusion, generalization, and interpretability for trustworthy FAS.

[129] MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection

Hui Lu,Yi Yu,Shijian Lu,Deepu Rajan,Boon Poh Ng,Alex C. Kot,Xudong Jiang

Main category: cs.CV

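TL;DR: This paper proposes MambaTAD, an end-to-end one-stage temporal action detection method based on state-space models that introduces a diagonal-masked bidirectional state-space module and a global feature fusion head to address temporal context decay and self-element conflict in long-sequence modeling, significantly improving detection of long-span actions.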

Details Motivation: Existing methods handle long-span action instances poorly because they lack global awareness and use inefficient detection heads, and structured state-space models suffer from temporal context decay and self-element conflict during global modeling. Method: Proposes MambaTAD with two core designs: (1) a Diagonal-Masked Bidirectional State-Space (DMBSS) module that strengthens global feature fusion and temporal modeling; and (2) a global feature fusion detection head that progressively refines detections using multi-granularity features and global awareness; a linear-complexity state-space temporal adapter (SSTA) enables efficient one-stage end-to-end detection. Result: In extensive experiments on multiple public datasets, MambaTAD consistently outperforms existing methods on temporal action detection, particularly for long-span actions. Conclusion: By improving the structural design of state-space models, MambaTAD effectively alleviates key problems in long-sequence modeling and achieves efficient, accurate end-to-end temporal action detection, providing a new and effective framework for TAD. Abstract: Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe when handling long-span action instances. Additionally, traditional methods for TAD struggle with detecting long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel, complementary designs that together yield superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. 
Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.

[130] UniRSCD: A Unified Novel Architectural Paradigm for Remote Sensing Change Detection

Yuan Qu,Zhipeng Zhang,Chaojun Xu,Qiao Wan,Mengying Xie,Yuzeng Chen,Zhenqi Liu,Yanfei Zhong

Main category: cs.CV

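TL;DR: This paper proposes UniRSCD, a unified remote sensing change detection framework built on a state space model and a frequency change prompt generator that delivers high-performance change detection compatible with multiple tasks.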

Details Motivation: Existing change detection methods require task-specific decoders, depend on expert knowledge, and generalize poorly, making it hard to cope with the uncertainty of model selection in abrupt change scenarios. Method: Proposes the UniRSCD framework: it uses a state space model as the backbone and a frequency change prompt generator as a unified encoder that fuses high- and low-frequency information; a unified decoder with hierarchical feature interaction and task-adaptive output mapping provides a shared representation space across tasks. Result: The method is validated on five datasets, including LEVIR-CD, SECOND, and xBD, where it adapts to multiple change detection tasks and achieves leading performance. Conclusion: UniRSCD provides a general change detection architecture that needs no specialized decoders, improving adaptability and performance across tasks with different output granularities. Abstract: In recent years, remote sensing change detection has garnered significant attention due to its critical role in resource monitoring and disaster assessment. Change detection tasks exist with different output granularities, such as binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). However, existing methods require substantial expert knowledge to design specialized decoders that compensate for information loss during encoding across different tasks. This not only introduces uncertainty into the process of selecting optimal models for abrupt change scenarios (such as disaster outbreaks) but also limits the universality of these architectures. To address these challenges, this paper proposes a unified, general change detection framework named UniRSCD. Building upon a state space model backbone, we introduce a frequency change prompt generator as a unified encoder. The encoder dynamically scans bitemporal global context information while integrating high-frequency details with low-frequency holistic information, thereby eliminating the need for specialized decoders for feature compensation. Subsequently, the unified decoder and prediction head establish a shared representation space through hierarchical feature interaction and task-adaptive output mapping. This integrates various tasks such as binary change detection and semantic change detection into a unified architecture, thereby accommodating the differing output granularity requirements of distinct change detection tasks. 
Experimental results demonstrate that the proposed architecture can adapt to multiple change detection tasks and achieves leading performance on five datasets, including the binary change dataset LEVIR-CD, the semantic change dataset SECOND, and the building damage assessment dataset xBD.

[131] Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

Yan Xu,Yixing Wang,Stella X. Yu

Main category: cs.CV

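TL;DR: Proposes a zero-shot, generation-guided framework that uses pretrained video diffusion models to achieve high-quality novel view synthesis and 3D scene reconstruction from extremely sparse inputs.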

Details Motivation: To address novel view synthesis from sparse inputs, not only filling spatial gaps but also producing a natural, coherent video sequence. Method: Recasts the task as test-time natural video completion, uses a pretrained video diffusion model to generate in-between views, and maintains spatial coherence through an uncertainty-aware mechanism; the generated pseudo views densify supervision for 3D Gaussian Splatting, and an iterative feedback loop refines both geometry and view synthesis. Result: Significantly outperforms strong 3D-GS baselines on LLFF, DTU, DL3DV, and MipNeRF-360, especially under extreme sparsity. Conclusion: The method achieves coherent, high-fidelity rendering without scene-specific training or fine-tuning, advancing sparse-input novel view synthesis. Abstract: Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That's the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity.

[132] V2X-RECT: An Efficient V2X Trajectory Prediction Framework via Redundant Interaction Filtering and Tracking Error Correction

Xiangyan Kong,Xuecheng Wu,Xiongwei Zhao,Xiaodong Li,Yunyun Shi,Gang Wang,Dingkang Yang,Yang Liu,Hong Chen,Yulong Gao

Main category: cs.CV

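TL;DR: Proposes V2X-RECT, a V2X trajectory prediction framework for high-density traffic environments that uses identity matching and correction, traffic signal-guided interaction, and local spatiotemporal coordinate encoding to improve data association consistency, reduce redundant interactions, and reuse historical information for more efficient and accurate prediction.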

Details Motivation: In dense traffic scenarios, frequent identity switching of targets makes cross-view association difficult, multi-source information interaction tends to produce redundancy, and traditional vehicle-centric encoding repeatedly computes historical trajectory features, hurting real-time inference performance. Method: Designs a multi-source identity matching and correction module that uses multi-view spatiotemporal relationships for stable association; introduces a traffic signal-guided interaction module that encodes signal change trends as features to filter key interacting vehicles; and adopts local spatiotemporal coordinate encoding to support reuse of historical trajectory and map features and parallel decoding. Result: Experiments on the V2X-Seq and V2X-Traj datasets show that V2X-RECT significantly improves prediction accuracy, robustness, and inference efficiency over existing methods, performing especially well across different traffic densities. Conclusion: V2X-RECT effectively addresses identity switching, redundant interactions, and inefficient encoding in high-density traffic, providing an efficient and robust solution for V2X trajectory prediction. Abstract: V2X prediction can alleviate perception incompleteness caused by limited line of sight through fusing trajectory data from infrastructure and vehicles, which is crucial to traffic safety and efficiency. However, in dense traffic scenarios, frequent identity switching of targets hinders cross-view association and fusion. Meanwhile, multi-source information tends to generate redundant interactions during the encoding stage, and traditional vehicle-centric encoding leads to large amounts of repetitive historical trajectory feature encoding, degrading real-time inference performance. To address these challenges, we propose V2X-RECT, a trajectory prediction framework designed for high-density environments. It enhances data association consistency, reduces redundant interactions, and reuses historical information to enable more efficient and accurate prediction. Specifically, we design a multi-source identity matching and correction module that leverages multi-view spatiotemporal relationships to achieve stable and consistent target association, mitigating the adverse effects of mismatches on trajectory encoding and cross-view feature fusion. Then we introduce a traffic signal-guided interaction module, encoding the trend of traffic light changes as features and exploiting their role in constraining spatiotemporal passage rights to accurately filter key interacting vehicles, while capturing the dynamic impact of signal changes on interaction patterns. 
Furthermore, a local spatiotemporal coordinate encoding enables reusable features of historical trajectories and map, supporting parallel decoding and significantly improving inference efficiency. Extensive experimental results across V2X-Seq and V2X-Traj datasets demonstrate that our V2X-RECT achieves significant improvements compared to SOTA methods, while also enhancing robustness and inference efficiency across diverse traffic densities.

[133] SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

Zhiyu Xu,Weilong Yan,Yufei Shi,Xin Meng,Tao He,Huiping Zhuang,Ming Li,Hehe Fan

Main category: cs.CV

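TL;DR: Proposes SciEducator, the first iterative self-evolving multi-agent system for scientific video understanding and education, which builds on the Deming Cycle to interpret scientific activities at a fine-grained level, and constructs the SciVBench benchmark for evaluation.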

Details Motivation: Existing video understanding models fall short in science education scenarios that require integrating professional knowledge and step-by-step reasoning, and no system is dedicated to scientific video understanding and teaching. Method: Designs a self-evolving multi-agent architecture based on the Deming Cycle (Plan-Do-Study-Act) that understands scientific video content through multi-round reasoning and feedback and generates multimodal educational content including text, images, audio, and interactive references. Result: On the newly constructed SciVBench benchmark (500 expert-verified science QA pairs), SciEducator substantially outperforms leading closed-source MLLMs such as Gemini and GPT-4o as well as state-of-the-art video agents. Conclusion: SciEducator establishes a new paradigm for scientific video understanding and education, demonstrates the potential of management science ideas in AI education systems, and advances intelligent video understanding in specialized domains. Abstract: Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.

[134] Test-Time Temporal Sampling for Efficient MLLM Video Understanding

Kaibin Wang,Mingbao Lin

Main category: cs.CV

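TL;DR: This paper proposes Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference framework that generates multiple short, diverse video subsequences at inference time and aggregates their predictions, reducing the computational cost of processing long videos with multimodal large models while improving accuracy and inference speed.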

Details Motivation: When processing long videos, the self-attention cost of multimodal large language models grows quadratically with the number of video tokens, making computation expensive and inference slow; existing methods lose accuracy, need additional training, or reduce speed. Method: Proposes T3S, which exploits spatiotemporal redundancy by generating multiple short, diverse video subsequences at inference time, processing them in a single forward pass, and aggregating the predictions, reducing self-attention cost from $O(L^2)$ to $O(\sum_{i=1}^m \alpha_i^2 L^2)$, where $\sum_{i=1}^m \alpha_i^2 < 1$. Result: Experiments on multiple long-video understanding benchmarks show that T3S improves accuracy by up to 3.1% and reduces first-token delay by $2.04\times$, with no model modification or fine-tuning and simple integration. Conclusion: T3S runs entirely at inference time, requires no training, and is compatible with a wide range of pretrained multimodal large models; it turns video redundancy into a computational advantage and offers a scalable solution for long-video understanding. Abstract: Processing long videos with multimodal large language models (MLLMs) poses a significant computational challenge, as the model's self-attention mechanism scales quadratically with the number of video tokens, resulting in high computational demand and slow inference speed. Current solutions, such as rule-based sub-sampling, learned frame selectors, or memory-based summarization, often introduce their own trade-offs: they compromise accuracy, necessitate additional training, or decrease inference speed. In this paper, we propose Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that enables MLLMs to process long videos both efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short and diverse subsequences of video tokens at inference time, packing them within a single forward pass, and aggregating their predictions. This multi-subsequence formulation broadens visual coverage while reducing the computational cost of self-attention from $O(L^2)$ to $O(\sum_{i=1}^m \alpha_i^2 L^2)$, where $\sum_{i=1}^m \alpha_i^2 < 1$. Extensive experiments on long video understanding benchmarks demonstrate that T3S improves accuracy by up to 3.1% and reduces first token delay by $2.04\times$, all with minimal integration effort. Our approach operates entirely at inference time, requires no model modifications or fine-tuning, and is compatible with a wide range of pretrained MLLMs. T3S turns video redundancy into a computational advantage, offering a scalable solution for long-video understanding. 
The code is available at https://github.com/kaibinwang3/T3S.
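
The complexity claim in the T3S abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the T3S repository; it assumes each subsequence attends only to itself, so the cost of subsequence $i$ is $(\alpha_i L)^2$ and the common factor $L^2$ cancels. The function name `attention_cost_ratio` is hypothetical.

```python
def attention_cost_ratio(alphas):
    """Return sum(alpha_i^2): packed-subsequence attention cost relative to L^2.

    Each alpha_i is one subsequence's length as a fraction of the full token
    count L, so the common factor L^2 cancels out of the ratio.
    Assumption: subsequences do not attend to one another.
    """
    if not alphas or any(a <= 0 or a > 1 for a in alphas):
        raise ValueError("each alpha must lie in (0, 1]")
    return sum(a * a for a in alphas)


# Four subsequences, each a quarter of the original length.
ratio = attention_cost_ratio([0.25, 0.25, 0.25, 0.25])
print(ratio)  # 0.25
```

Under this assumption, four quarter-length subsequences cover as many tokens as the full sequence while costing a quarter of the attention compute, which matches the condition $\sum_{i=1}^m \alpha_i^2 < 1$ stated in the abstract.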

[135] Multi-speaker Attention Alignment for Multimodal Social Interaction

Liangyang Ouyang,Yifei Huang,Mingfang Zhang,Caixin Kang,Ryosuke Furuta,Yoichi Sato

Main category: cs.CV

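TL;DR: This paper proposes a multimodal multi-speaker attention alignment method that improves how multimodal large language models (MLLMs) understand social interaction in video. Using dynamic cross-modal attention head selection and an adaptive social-aware attention bias, it strengthens the alignment between a speaker's visual representation and their utterances without adding trainable parameters or changing the architecture. It is validated on several MLLMs and benchmarks and achieves state-of-the-art performance.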

Details Motivation: Existing MLLMs lack speaker-consistent alignment between the visual and textual modalities in multi-speaker scenes, which makes their performance on social tasks unstable. The cause is that cross-modal attention fails to associate speakers with their utterances in multi-party interactions. Method: A training-free multimodal attention alignment method: (1) dynamically select the attention heads most critical for vision-language grounding; (2) compute an adaptive social-aware attention bias from existing attention patterns and speaker locations, and inject it into the attention mechanism to strengthen the link between each speaker, their visual region, and their utterances. Result: Integrated into three mainstream MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3), the method improves performance on four social tasks across three benchmarks (TVQA+, MMSI, and OnlineMMSI) and reaches state-of-the-art results. Attention visualizations show that the model focuses more accurately on regions relevant to the current speaker. Conclusion: The method addresses weak cross-modal alignment in multi-speaker video, substantially improves social-interaction understanding, and is general and plug-and-play across existing MLLMs. Abstract: Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images. To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker's visual representation and their utterances without introducing trainable parameters or architectural changes. We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results. 
Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning. Our implementation and model will be available at https://github.com/ut-vision/SocialInteraction.
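
The abstract does not say how the social-aware bias is computed, so the following is only a generic illustration of what injecting an additive bias into attention means: the bias is added to the raw logits before the softmax, which shifts probability mass toward the biased tokens without any trainable parameter. The bias values and the names `biased_attention` and `speaker_bias` are assumptions made for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, bias):
    """Add a per-token bias to raw attention logits, then normalize."""
    return softmax([l + b for l, b in zip(logits, bias)])


# One text query attending to four visual tokens with equal raw scores.
logits = [1.0, 1.0, 1.0, 1.0]
# Hypothetical bias: token 1 overlaps the current speaker's region.
speaker_bias = [0.0, 2.0, 0.0, 0.0]

plain = softmax(logits)
biased = biased_attention(logits, speaker_bias)
print(plain[1] < biased[1])  # True
```

Because the bias acts on logits and not on weights, it can be applied at inference time to selected heads only, which is consistent with the training-free setting the abstract describes.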

[136] HEAL: Learning-Free Source Free Unsupervised Domain Adaptation for Cross-Modality Medical Image Segmentation

Yulong Shi,Jiapeng Li,Lin Qi

Main category: cs.CV

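TL;DR: Proposes HEAL, a new source-free unsupervised domain adaptation (SFUDA) framework that combines hierarchical denoising, edge-guided selection, size-aware fusion, and a learning-free design, achieving state-of-the-art performance on cross-modality tasks.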

Details Motivation: Clinical data privacy and storage constraints mean that neither source-domain data nor target-domain labels are available, and existing SFUDA methods struggle with domain shift under these conditions, so a more effective source-free unsupervised adaptation method is needed. Method: HEAL combines hierarchical denoising to reduce the effect of noise, edge-guided selection to emphasize key regions, size-aware fusion to integrate multi-scale information, and a learning-free feature extraction mechanism to improve adaptability. Result: In large-scale cross-modality experiments, HEAL outperforms existing SFUDA methods and reaches state-of-the-art performance. Conclusion: HEAL addresses the lack of source data and label supervision in SFUDA and offers an efficient, practical adaptation approach for privacy-sensitive settings such as medical image analysis. Abstract: Growing demands for clinical data privacy and storage constraints have spurred advances in Source Free Unsupervised Domain Adaptation (SFUDA). SFUDA addresses the domain shift by adapting models from the source domain to the unseen target domain without accessing source data, even when target-domain labels are unavailable. However, SFUDA faces significant challenges: the absence of source domain data and label supervision in the target domain due to source free and unsupervised settings. To address these issues, we propose HEAL, a novel SFUDA framework that integrates Hierarchical denoising, Edge-guided selection, size-Aware fusion, and Learning-free characteristic. Large-scale cross-modality experiments demonstrate that our method outperforms existing SFUDA approaches, achieving state-of-the-art (SOTA) performance. The source code is publicly available at: https://github.com/derekshiii/HEAL.

[137] VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

Ziheng Jia,Linhan Cao,Jinliang Han,Zicheng Zhang,Jiaying Qian,Jiarui Wang,Zijian Chen,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

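TL;DR: This paper proposes VITAL-Series, a vision-encoder-centered generative pre-training framework that addresses the limited multi-task generalization and transferability of existing large models for visual quality assessment.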

Details Motivation: Existing large models for visual quality assessment are usually limited to a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or tasks and limits their generality and transferability. Method: A machine-executed annotation-scrutiny paradigm is used to build more than 4.5M vision-language pairs, and a multi-task training workflow improves both quantitative scoring precision and quality interpretation across image and video modalities; an efficient model zoo extension is built on a shared vision encoder. Result: The resulting dataset is the largest VQualA training set to date, the models perform strongly in zero-shot settings, and each decoder needs only a quick warm-up on less than 1/1000 of the pre-training data to match the fully trained counterpart. Conclusion: The work lays a foundation for foundation LMMs for visual quality assessment and improves model versatility, capability, and transferability. Abstract: Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.

[138] X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification

Chenyang Yu,Xuehu Liu,Pingping Zhang,Huchuan Lu

Main category: cs.CV

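TL;DR: Proposes X-ReID, a cross-modality feature learning framework for video-based visible-infrared person re-identification (VVI-ReID) that narrows the modality gap and strengthens temporal modeling through cross-modality prototype collaboration and multi-granularity information interaction, with superior results on two large-scale benchmarks.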

Details Motivation: To explore the potential of large-scale vision-language models for VVI-ReID and to address the modality gap and the under-use of spatiotemporal information in video sequences. Method: The X-ReID framework comprises Cross-modality Prototype Collaboration (CPC), which aligns features from different modalities, and Multi-granularity Information Interaction (MII), which fuses short-term, long-term, and cross-modality information. Result: Experiments on two large-scale VVI-ReID benchmarks, HITSZ-VCM and BUPTCampus, show that the method outperforms existing state-of-the-art methods. Conclusion: X-ReID narrows the modality gap, improves temporal modeling, and significantly improves VVI-ReID performance. Abstract: Large-scale vision-language models (e.g., CLIP) have recently achieved remarkable performance in retrieval tasks, yet their potential for Video-based Visible-Infrared Person Re-Identification (VVI-ReID) remains largely unexplored. The primary challenges are narrowing the modality gap and leveraging spatiotemporal information in video sequences. To address the above issues, in this paper, we propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC) to align and integrate features from different modalities, guiding the network to reduce the modality discrepancy. Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment to enhance temporal modeling and further reduce modality gaps. Finally, by integrating multi-granularity information, a robust sequence-level representation is achieved. Extensive experiments on two large-scale VVI-ReID benchmarks (i.e., HITSZ-VCM and BUPTCampus) demonstrate the superiority of our method over state-of-the-art methods. The source code is released at https://github.com/AsuradaYuci/X-ReID.

[139] Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification

Yangyang Liu,Yuhao Wang,Pingping Zhang

Main category: cs.CV

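TL;DR: This paper proposes Signal, a selective interaction and global-local alignment framework for multi-modal object ReID that selects important patch tokens and aligns global and local features, improving retrieval performance.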

Details Motivation: Existing methods focus mainly on multi-modal feature fusion but neglect background interference and fall short on multi-modal consistency alignment. Method: A Selective Interaction Module (SIM) selects important patch tokens using intra-modal and inter-modal information, and a Global Alignment Module (GAM) and a Local Alignment Module (LAM) align multi-modal features consistently. Result: Extensive experiments on three multi-modal object ReID benchmarks (RGBNT201, RGBNT100, MSVR310) validate the effectiveness of the method. Conclusion: The Signal framework extracts more discriminative features and significantly improves multi-modal object ReID performance. Abstract: Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods mainly concentrate on the fusion of multi-modal features, yet neglect background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but struggle with multi-modal consistency alignment. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens with intra-modal and inter-modal information. These important patch tokens engage in the interaction with class tokens, thereby yielding more discriminative features. Then, we propose a Global Alignment Module (GAM) to simultaneously align multi-modal features by minimizing the volume of 3D polyhedra in the gramian space. Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our proposed framework could extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at https://github.com/010129/Signal.

[140] CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT Tracking

Hao Li,Yuhao Wang,Xiantao Hu,Wenning Hao,Pingping Zhang,Dong Wang,Huchuan Lu

Main category: cs.CV

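TL;DR: This paper proposes CADTrack, an RGBT tracking framework that uses Mamba-based feature interaction, contextual aggregation, and deformable alignment modules to resolve modality discrepancies and achieve robust, accurate tracking in complex scenes.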

Details Motivation: Existing RGBT trackers struggle with the discrepancy between the visible and thermal infrared modalities, which makes cross-modal information fusion difficult and reduces tracking accuracy. Method: The CADTrack framework has three core modules: Mamba-based Feature Interaction (MFI) for efficient, low-complexity feature interaction; a Contextual Aggregation Module (CAM) that fuses cross-layer contextual information through MoE sparse gating; and a Deformable Alignment Module (DAM) that combines deformable sampling with temporal propagation to mitigate spatial misalignment. Result: Extensive experiments on five RGBT benchmarks validate the method, which achieves better tracking accuracy and robustness than existing methods. Conclusion: Through efficient cross-modal feature interaction and alignment, CADTrack significantly improves RGBT tracking in complex environments and shows potential for practical use. Abstract: RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which poses great challenges for robust feature representation. This limitation hinders effective cross-modal information propagation and fusion, which significantly reduces the tracking accuracy. To address this limitation, we propose a novel Contextual Aggregation with Deformable Alignment framework called CADTrack for RGBT Tracking. To be specific, we first deploy the Mamba-based Feature Interaction (MFI) that establishes efficient feature interaction via state space models. This interaction module can operate with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM) that dynamically activates backbone layers through sparse gating based on the Mixture-of-Experts (MoE). This module can encode complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM) to integrate deformable sampling and temporal propagation, mitigating spatial misalignment and localization drift. With the above components, our CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of our proposed method. The source code is released at https://github.com/IdolLab/CADTrack.

[141] Adversarial Pseudo-replay for Exemplar-free Class-incremental Learning

Hiroto Honda

Main category: cs.CV

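TL;DR: This paper proposes adversarial pseudo-replay (APR) for exemplar-free class-incremental learning (EFCIL), which generates pseudo-replay samples online to mitigate catastrophic forgetting while preserving the ability to learn new tasks.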

Details Motivation: In EFCIL, images from earlier tasks cannot be stored, so balancing the learning of new tasks against the retention of old knowledge (the stability-plasticity dilemma) is a key challenge. Method: Adversarial pseudo-replay (APR) perturbs current-task images with an adversarial attack that targets augmented old-class mean prototypes to produce pseudo-replay samples; these samples are used for knowledge distillation to prevent semantic drift, and a learned transfer matrix calibrates the covariance matrices to compensate for semantic change after each task. Result: The method achieves state-of-the-art performance in cold-start settings of the standard EFCIL benchmarks. Conclusion: APR balances stability and plasticity and mitigates catastrophic forgetting without storing historical data, advancing EFCIL. Abstract: Exemplar-free class-incremental learning (EFCIL) aims to retain old knowledge acquired in the previous task while learning new classes, without storing the previous images due to storage constraints or privacy concerns. In EFCIL, the plasticity-stability dilemma, learning new tasks versus catastrophic forgetting, is a significant challenge, primarily due to the unavailability of images from earlier tasks. In this paper, we introduce adversarial pseudo-replay (APR), a method that perturbs the images of the new task with an adversarial attack, to synthesize the pseudo-replay images online without storing any replay samples. During the new task training, the adversarial attack is conducted on the new task images with augmented old class mean prototypes as targets, and the resulting images are used for knowledge distillation to prevent semantic drift. Moreover, we calibrate the covariance matrices to compensate for the semantic drift after each task, by learning a transfer matrix on the pseudo-replay samples. Our method reconciles stability and plasticity, achieving state-of-the-art on challenging cold-start settings of the standard EFCIL benchmarks.

[142] FeRA: Frequency-Energy Constrained Routing for Effective Diffusion Adaptation Fine-Tuning

Bo Yin,Xiaobin Hu,Xingyu Zhou,Peng-Tao Jiang,Yue Liao,Junwei Zhu,Jiangning Zhang,Ying Tai,Chengjie Wang,Shuicheng Yan

Main category: cs.CV

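TL;DR: Proposes FeRA, a fine-tuning framework for diffusion models built on a frequency-energy mechanism, which achieves efficient adaptation through frequency-driven parameter updates.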

Details Motivation: Effectively adapting pretrained diffusion models to new tasks remains challenging and calls for a closer look at their reconstruction behavior and frequency characteristics during denoising. Method: The FeRA framework has three components: a compact frequency-energy indicator, a soft frequency router, and a frequency-energy consistency regularization; it uses the bandwise energy distribution to dynamically fuse multiple adapter experts. Result: FeRA generalizes well across diffusion backbones and resolutions, improves the stability and effectiveness of fine-tuning, and routes dynamically at inference time. Conclusion: By aligning adaptation with the intrinsic frequency-energy progression of diffusion, FeRA provides a simple, stable, and compatible paradigm for diffusion model adaptation. Abstract: Diffusion models have achieved remarkable success in generative modeling, yet how to effectively adapt large pretrained models to new tasks remains challenging. We revisit the reconstruction behavior of diffusion models during denoising to unveil the underlying frequency energy mechanism governing this process. Building upon this observation, we propose FeRA, a frequency-driven fine-tuning framework that aligns parameter updates with the intrinsic frequency energy progression of diffusion. FeRA establishes a comprehensive frequency energy framework for effective diffusion adaptation fine-tuning, comprising three synergistic components: (i) a compact frequency energy indicator that characterizes the latent bandwise energy distribution, (ii) a soft frequency router that adaptively fuses multiple frequency-specific adapter experts, and (iii) a frequency energy consistency regularization that stabilizes diffusion optimization and ensures coherent adaptation across bands. Routing operates in both training and inference, with inference-time routing dynamically determined by the latent frequency energy. It integrates seamlessly with adapter-based tuning schemes and generalizes well across diffusion backbones and resolutions. By aligning adaptation with the frequency energy mechanism, FeRA provides a simple, stable, and compatible paradigm for effective and robust diffusion model adaptation.

[143] Plan-X: Instruct Video Generation via Semantic Planning

Lun Huang,You Xie,Hongyi Xu,Tianpei Gu,Chenxu Zhang,Guoxian Song,Zenan Li,Xiaochen Zhao,Linjie Luo,Guillermo Sapiro

Main category: cs.CV

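TL;DR: Proposes Plan-X, a framework in which a semantic planner generates spatio-temporal semantic tokens that guide a video diffusion model to produce high-quality, instruction-aligned videos with fewer hallucinations.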

Details Motivation: Existing Diffusion Transformers lack high-level semantic reasoning and long-horizon planning in visual synthesis, which leads to hallucinations and inconsistency with instructions in generated videos. Method: A learnable multimodal semantic planner autoregressively generates text-grounded spatio-temporal semantic tokens from text prompts and visual context, and these tokens serve as structured semantic sketches that guide the video diffusion model. Result: Experiments show that the method substantially reduces visual hallucinations and improves consistency and fine-grained alignment in complex scenes, human-object interactions, and multi-stage actions. Conclusion: Plan-X combines the reasoning and planning ability of language models with the high-quality generation of diffusion models to produce videos that better match user intent. Abstract: Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, which has its strength at synthesizing high-fidelity visual details. Plan-X effectively integrates the strength of language models in multimodal in-context reasoning and planning, together with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.

[144] HyM-UNet: Synergizing Local Texture and Global Context via Hybrid CNN-Mamba Architecture for Medical Image Segmentation

Haodong Chen,Xianfei Han,Qwen

Main category: cs.CV

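TL;DR: This paper proposes HyM-UNet, a hybrid network architecture that combines the local feature extraction of CNNs with the efficient global modeling of Mamba for organ and lesion segmentation in medical images.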

Details Motivation: CNNs have limited receptive fields and struggle to capture complex global anatomical structures, which hurts medical image segmentation accuracy. Method: A hierarchical encoder uses convolutional modules in the shallow stages to preserve high-frequency texture details and Visual Mamba modules in the deep stages to capture long-range semantic dependencies with linear complexity; a Mamba-Guided Fusion Skip Connection (MGF-Skip) uses deep semantic features as gating signals to suppress background noise in shallow features. Result: Experiments on the public ISIC 2018 dataset show that HyM-UNet significantly outperforms existing state-of-the-art methods on Dice coefficient and IoU, with fewer parameters and lower inference latency. Conclusion: HyM-UNet handles complex shapes and scale variations in medical image segmentation more effectively and shows good robustness and application prospects. Abstract: Accurate organ and lesion segmentation is a critical prerequisite for computer-aided diagnosis. Convolutional Neural Networks (CNNs), constrained by their local receptive fields, often struggle to capture complex global anatomical structures. To tackle this challenge, this paper proposes a novel hybrid architecture, HyM-UNet, designed to synergize the local feature extraction capabilities of CNNs with the efficient global modeling capabilities of Mamba. Specifically, we design a Hierarchical Encoder that utilizes convolutional modules in the shallow stages to preserve high-frequency texture details, while introducing Visual Mamba modules in the deep stages to capture long-range semantic dependencies with linear complexity. To bridge the semantic gap between the encoder and the decoder, we propose a Mamba-Guided Fusion Skip Connection (MGF-Skip). This module leverages deep semantic features as gating signals to dynamically suppress background noise within shallow features, thereby enhancing the perception of ambiguous boundaries. We conduct extensive experiments on the public benchmark dataset ISIC 2018. The results demonstrate that HyM-UNet significantly outperforms existing state-of-the-art methods in terms of Dice coefficient and IoU, while maintaining lower parameter counts and inference latency. This validates the effectiveness and robustness of the proposed method in handling medical segmentation tasks characterized by complex shapes and scale variations.
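
For readers unfamiliar with the two metrics named in the HyM-UNet abstract, here is a minimal sketch of the Dice coefficient and IoU on binary masks. It uses the standard definitions and flat 0/1 lists for brevity; it is not the paper's evaluation code, and the convention of returning 1.0 for two empty masks is an assumption of this sketch.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two flat binary masks given as lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treat as a perfect match (assumed convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice)  # 0.5
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so they always rank methods the same way on a single image, although averages over a dataset can differ.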

[145] SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining

Jiayu Wang,Haoyu Bian,Haoran Sun,Shaoning Zeng

Main category: cs.CV

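TL;DR: Proposes SD-PSFNet, a multi-stage image deraining network built on a Point Spread Function (PSF) mechanism that combines dynamic physical modeling with cross-stage feature fusion and achieves state-of-the-art performance in complex scenes and dense rain.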

Details Motivation: Existing methods struggle with the complex multi-scale physics of rain and its coupling with the scene, which leads to poor deraining, especially with complex backgrounds and dense rain streaks. Method: SD-PSFNet uses three cascaded stages, learnable PSF modules that dynamically simulate rain-streak optics, and adaptive gated fusion for cross-stage feature integration, progressing from coarse rain removal to fine detail restoration. Result: The model achieves state-of-the-art PSNR/SSIM on Rain100H, RealRain-1k-L, and RealRain-1k-H, for example 42.28dB/0.9872 on RealRain-1k-L. Conclusion: By combining physical modeling with a deep network, SD-PSFNet improves deraining, particularly in complex scenes and dense rain, and offers a new physics-aware approach to image restoration. Abstract: Image deraining is crucial for vision applications but is challenged by the complex multi-scale physics of rain and its coupling with scenes. To address this challenge, a novel approach inspired by multi-stage image restoration is proposed, incorporating Point Spread Function (PSF) mechanisms to reveal the image degradation process while combining dynamic physical modeling with sequential feature fusion transfer, named SD-PSFNet. Specifically, SD-PSFNet employs a sequential restoration architecture with three cascaded stages, allowing multiple dynamic evaluations and refinements of the degradation process estimation. The network utilizes components with learned PSF mechanisms to dynamically simulate rain streak optics, enabling effective rain-background separation while progressively enhancing outputs through novel PSF components at each stage. Additionally, SD-PSFNet incorporates adaptive gated fusion for optimal cross-stage feature integration, enabling sequential refinement from coarse rain removal to fine detail restoration. Our model achieves state-of-the-art PSNR/SSIM metrics on Rain100H (33.12dB/0.9371), RealRain-1k-L (42.28dB/0.9872), and RealRain-1k-H (41.08dB/0.9838). In summary, SD-PSFNet demonstrates excellent capability in complex scenes and dense rainfall conditions, providing a new physics-aware approach to image deraining.
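
The PSNR figures quoted above are in decibels, so a small sketch of how PSNR is computed may help in reading them. This is the standard definition, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, applied to flat lists of pixel values; the 8-bit peak of 255 is an assumption, and the paper's exact evaluation protocol (color space, cropping) is not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Two pixels, each off by 10 gray levels: MSE = 100.
value = psnr([100, 100], [110, 90])
print(round(value, 2))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.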

[146] RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale

Shengyuan Wang,Zhiheng Zheng,Yu Shang,Lixuan He,Yangcheng Yu,Fan Hangyu,Jie Feng,Qingmin Liao,Yong Li

Main category: cs.CV

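TL;DR: RAISECity is a reality-aligned intelligent synthesis engine for city-scale 3D generation that uses an agentic framework to integrate multimodal foundation tools, producing high-quality, high-fidelity, and scalable 3D scenes.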

Details Motivation: Existing city-scale 3D generation methods face challenges in quality, fidelity, and scalability and fall short of what embodied intelligence and world models require. Method: RAISECity uses an agentic framework that draws on multimodal foundation tools to acquire real-world knowledge, maintains intermediate representations, and builds complex 3D scenes through dynamic data processing and iterative self-reflection and refinement. Result: Experiments show that RAISECity outperforms existing methods in real-world alignment, shape precision, texture fidelity, and aesthetics, with a win rate above 90% on overall perceptual quality. Conclusion: RAISECity performs well in 3D quality, reality alignment, scalability, and compatibility with graphics pipelines, and is a promising foundation for immersive media, embodied intelligence, and world models. Abstract: City-scale 3D generation is of great importance for the development of embodied intelligence and world models. Existing methods, however, face significant challenges regarding quality, fidelity, and scalability in 3D world generation. Thus, we propose RAISECity, a Reality-Aligned Intelligent Synthesis Engine that creates detailed, City-scale 3D worlds. We introduce an agentic framework that leverages diverse multimodal foundation tools to acquire real-world knowledge, maintain robust intermediate representations, and construct complex 3D scenes. This agentic design, featuring dynamic data processing, iterative self-reflection and refinement, and the invocation of advanced multimodal tools, minimizes cumulative errors and enhances overall performance. Extensive quantitative experiments and qualitative analyses validate the superior performance of RAISECity in real-world alignment, shape precision, texture fidelity, and aesthetics level, achieving over a 90% win-rate against existing baselines for overall perceptual quality. This combination of 3D quality, reality alignment, scalability, and seamless compatibility with computer graphics pipelines makes RAISECity a promising foundation for applications in immersive media, embodied intelligence, and world models.

[147] Is Complete Labeling Necessary? Understanding Active Learning in Longitudinal Medical Imaging

Siteng Ma,Honghui Du,Prateek Mathur,Brendan S. Kelly,Ronan P. Killeen,Aonghus Lawlor,Ruihai Dong

Main category: cs.CV

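TL;DR: Proposes LMI-AL, a deep active learning framework for change detection in longitudinal medical imaging that pairs and differences 2D slices from baseline and follow-up 3D images and iteratively selects the most informative samples for labeling, matching fully supervised performance with less than 8% of the data labeled.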

Details Motivation: Change detection in longitudinal medical imaging needs large amounts of accurately labeled data, but labeling is costly and slow, and existing deep active learning methods target static tasks and do not transfer directly to detecting subtle changes across time points. Method: The LMI-AL framework pairs and differences the 2D slices of baseline and follow-up 3D medical images, uses a deep active learning strategy to iteratively select the most informative image pairs for labeling, and trains a deep learning model on the small labeled set. Result: With less than 8% of the data labeled, LMI-AL reaches performance comparable to models trained on fully labeled data. Conclusion: LMI-AL substantially reduces labeling cost for longitudinal change detection and provides a method and analysis to guide future research. Abstract: Detecting changes in longitudinal medical imaging using deep learning requires a substantial amount of accurately labeled data. However, labeling these images is notably more costly and time-consuming than labeling other image types, as it requires labeling across various time points, where new lesions can be minor, and subtle changes are easily missed. Deep Active Learning (DAL) has shown promise in minimizing labeling costs by selectively querying the most informative samples, but existing studies have primarily focused on static tasks like classification and segmentation. Consequently, the conventional DAL approach cannot be directly applied to change detection tasks, which involve identifying subtle differences across multiple images. In this study, we propose a novel DAL framework, named Longitudinal Medical Imaging Active Learning (LMI-AL), tailored specifically for longitudinal medical imaging. By pairing and differencing all 2D slices from baseline and follow-up 3D images, LMI-AL iteratively selects the most informative pairs for labeling using DAL, training a deep learning model with minimal manual annotation. Experimental results demonstrate that, with less than 8% of the data labeled, LMI-AL can achieve performance comparable to models trained on fully labeled datasets. We also provide a detailed analysis of the method's performance, as guidance for future research. The code is publicly available at https://github.com/HelenMa9998/Longitudinal_AL.
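
The pair, difference, and select loop described for LMI-AL can be sketched in a few lines. The abstract does not name the acquisition function, so this sketch assumes predictive entropy of a per-pair "change" probability, a common uncertainty-sampling choice; the function names, the dictionary of probabilities, and the pair identifiers are all hypothetical.

```python
import math


def difference_slice(baseline, followup):
    """Pixel-wise follow-up minus baseline for two equally sized 2D slices."""
    return [[f - b for b, f in zip(row_b, row_f)]
            for row_b, row_f in zip(baseline, followup)]


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_pairs(change_probs, budget):
    """Return the `budget` pair ids whose predictions are most uncertain."""
    ranked = sorted(change_probs,
                    key=lambda pair_id: binary_entropy(change_probs[pair_id]),
                    reverse=True)
    return ranked[:budget]


diff = difference_slice([[10, 10], [10, 10]], [[10, 12], [10, 10]])
# Hypothetical model outputs for four unlabeled slice pairs.
probs = {"pair_a": 0.95, "pair_b": 0.50, "pair_c": 0.60, "pair_d": 0.05}
chosen = select_pairs(probs, budget=2)
print(chosen)  # ['pair_b', 'pair_c']
```

In a full loop, the chosen pairs would be sent to an annotator, added to the labeled set, and the model retrained before the next selection round.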

[148] RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

Jun Zhang,Jie Feng,Long Chen,Junhui Wang,Zhicheng Liu,Depeng Jin,Yong Li

Main category: cs.CV

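TL;DR: This paper proposes RoadBench, a systematic benchmark for evaluating the fine-grained spatial understanding and reasoning of multimodal large language models (MLLMs) in complex urban scenes, with a focus on road markings. Testing 14 mainstream MLLMs with BEV and FPV image inputs reveals significant shortcomings in their urban spatial understanding.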

Details Motivation: The fine-grained spatial understanding and reasoning of MLLMs in complex urban scenes has received little attention, and key spatial elements such as road markings lack systematic evaluation. Method: The RoadBench benchmark comprises six tasks with 9,121 manually verified test cases and uses bird's-eye-view (BEV) and first-person-view (FPV) image inputs to evaluate MLLMs on local-to-global spatial understanding, image-text fusion, and integration of domain knowledge. Result: An evaluation of 14 mainstream MLLMs shows that existing models perform poorly on several tasks, in some cases below rule-based or random-selection baselines, exposing serious deficiencies in fine-grained urban spatial understanding. Conclusion: RoadBench is a challenging benchmark that exposes the weaknesses of MLLMs in fine-grained spatial understanding and reasoning in urban scenes and provides direction and data for future model improvements. Abstract: Multimodal large language models (MLLMs) have demonstrated powerful capabilities in general spatial understanding and reasoning. However, their fine-grained spatial understanding and reasoning capabilities in complex urban scenarios have not received significant attention in the fields of both research and industry. To fill this gap, we focus primarily on road markings as a typical example of fine-grained spatial elements under urban scenarios, given the essential role of the integrated road traffic network they form within cities. Around road markings and urban traffic systems, we propose RoadBench, a systematic benchmark that comprehensively evaluates MLLMs' fine-grained spatial understanding and reasoning capabilities using BEV and FPV image inputs. This benchmark comprises six tasks consisting of 9,121 strictly manually verified test cases. These tasks form a systematic evaluation framework that bridges understanding at local spatial scopes to global reasoning. They not only test MLLMs' capabilities in recognition, joint understanding, and reasoning but also assess their ability to integrate image information with domain knowledge. After evaluating 14 mainstream MLLMs, we confirm that RoadBench is a challenging benchmark for MLLMs while revealing significant shortcomings in existing MLLMs' fine-grained spatial understanding and reasoning capabilities within urban scenarios. In certain tasks, their performance even falls short of simple rule-based or random selection baselines. 
These findings, along with RoadBench itself, will contribute to the comprehensive advancement of spatial understanding capabilities for MLLMs. The benchmark code, example datasets, and raw evaluation results are available in the supplementary material.

[149] State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection

Jiaying Zhou,Qingchao Chen

Main category: cs.CV

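TL;DR: Proposes a new method for weakly supervised open-vocabulary object detection that uses state-enhanced and scene-augmented prototypes to improve semantic richness and visual-textual alignment.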

Details Motivation: Existing open-vocabulary object detection methods struggle to capture intra-class variation and suffer from a semantic mismatch between visual regions and text embeddings. Method: The paper proposes State-Enhanced Semantic Prototypes (SESP), which generate state-aware textual descriptions to capture appearance diversity, and Scene-Augmented Pseudo Prototypes (SAPP), which add contextual semantics and use a soft alignment mechanism to improve visual-textual consistency. Result: The method achieves notable improvements in open-vocabulary detection performance. Conclusion: Together, SESP and SAPP strengthen the expressiveness of semantic prototypes and visual-textual alignment, offering a new approach to weakly supervised OVOD. Abstract: Open-Vocabulary Object Detection (OVOD) aims to generalize object recognition to novel categories, while Weakly Supervised OVOD (WS-OVOD) extends this by combining box-level annotations with image-level labels. Despite recent progress, two critical challenges persist in this setting. First, existing semantic prototypes, even when enriched by LLMs, are static and limited, failing to capture the rich intra-class visual variations induced by different object states (e.g., a cat's pose). Second, the standard pseudo-box generation introduces a semantic mismatch between visual region proposals (which contain context) and object-centric text embeddings. To tackle these issues, we introduce two complementary prototype enhancement strategies. To capture intra-class variations in appearance and state, we propose the State-Enhanced Semantic Prototypes (SESP), which generates state-aware textual descriptions (e.g., "a sleeping cat") to capture diverse object appearances, yielding more discriminative prototypes. Building on this, we further introduce Scene-Augmented Pseudo Prototypes (SAPP) to address the semantic mismatch. SAPP incorporates contextual semantics (e.g., "cat lying on sofa") and utilizes a soft alignment mechanism to promote contextually consistent visual-textual representations. By integrating SESP and SAPP, our method effectively enhances both the richness of semantic prototypes and the visual-textual alignment, achieving notable improvements.

[150] Modeling Retinal Ganglion Cells with Neural Differential Equations

Kacper Dobek,Daniel Jankowski,Krzysztof Krawiec

Main category: cs.CV

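TL;DR: LTCs and CfCs outperform CNN and LSTM baselines at modeling retinal ganglion cell activity, with lower MAE, faster convergence, and smaller model sizes, which makes them suitable for edge deployments with limited data and frequent retraining.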

Details Motivation: To find efficient neural network architectures for settings with limited data and frequent retraining, such as edge deployments in vision prosthetics. Method: Liquid Time-Constant Networks (LTCs) and Closed-form Continuous-time Networks (CfCs) are used to model retinal ganglion cell activity in tiger salamanders and are compared with CNN and LSTM baselines. Result: On all three datasets, LTCs and CfCs achieve lower MAE, faster convergence, smaller model sizes, and favorable query times, though with slightly lower Pearson correlation. Conclusion: Their efficiency and adaptability make LTCs and CfCs well suited to tasks with small datasets and rapid iteration, particularly real-time edge applications such as vision prosthetics. Abstract: This work explores Liquid Time-Constant Networks (LTCs) and Closed-form Continuous-time Networks (CfCs) for modeling retinal ganglion cell activity in tiger salamanders across three datasets. Compared to a convolutional baseline and an LSTM, both architectures achieved lower MAE, faster convergence, smaller model sizes, and favorable query times, though with slightly lower Pearson correlation. Their efficiency and adaptability make them well suited for scenarios with limited data and frequent retraining, such as edge deployments in vision prosthetics.
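
The abstract reports lower MAE together with slightly lower Pearson correlation, which can look contradictory until one recalls that the two metrics measure different things. The sketch below uses the standard definitions on made-up numbers (not data from the paper) to show a prediction with perfect correlation but nonzero MAE.

```python
import math


def mean_absolute_error(truth, pred):
    """Average absolute deviation between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def pearson(truth, pred):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_t * var_p)


# Illustrative firing rates: the prediction has the right shape
# but a constant offset of +1.
truth = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
print(mean_absolute_error(truth, shifted))  # 1.0
print(pearson(truth, shifted))  # 1.0
```

MAE penalizes errors in absolute level, while Pearson correlation only rewards matching the shape of the response, so a model can improve on one and regress slightly on the other.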

[151] MambaX: Image Super-Resolution with State Predictive Control

Chenyu Li,Danfeng Hong,Bing Zhang,Zhaojie Pan,Naoto Yokoya,Jocelyn Chanussot

Main category: cs.CV

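TL;DR: This paper proposes MambaX, a nonlinear state predictive control model for image super-resolution (SR) that dynamically learns nonlinear parameters in the state space to improve control over intermediate stages, improving both single-image SR and multimodal fusion-based SR.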

Details Motivation: Existing super-resolution methods neglect effective control over error propagation and accumulation in intermediate stages, and Mamba's fixed linear mapper has a narrow receptive field and limited flexibility, which limits its performance on fine-grained images. Method: MambaX maps consecutive spectral bands into a latent state space, uses dynamic state predictive control learning to approximate the nonlinear differential coefficients of state-space models, introduces a state cross-control paradigm for multimodal SR fusion, and applies progressive transitional learning to mitigate the heterogeneity caused by domain and modality shifts. Result: MambaX performs better on both single-image SR and multimodal fusion-based SR and shows potential for spectrally generalized modeling across arbitrary dimensions and modalities. Conclusion: Through dynamic nonlinear state modeling, MambaX improves control over intermediate states during super-resolution and offers a new approach to multimodal, high-precision image reconstruction. Abstract: Image super-resolution (SR) is a critical technology for overcoming the inherent hardware limitations of sensors. However, existing approaches mainly focus on directly enhancing the final resolution, often neglecting effective control over error propagation and accumulation during intermediate stages. Recently, Mamba has emerged as a promising approach that can represent the entire reconstruction process as a state sequence with multiple nodes, allowing for intermediate intervention. Nonetheless, its fixed linear mapper is limited by a narrow receptive field and restricted flexibility, which hampers its effectiveness in fine-grained images. To address this, we created a nonlinear state predictive control model, MambaX, that maps consecutive spectral bands into a latent state space and generalizes the SR task by dynamically learning the nonlinear state parameters of control equations. Compared to existing sequence models, MambaX 1) employs dynamic state predictive control learning to approximate the nonlinear differential coefficients of state-space models; 2) introduces a novel state cross-control paradigm for multimodal SR fusion; and 3) utilizes progressive transitional learning to mitigate heterogeneity caused by domain and modality shifts. Our evaluation demonstrates the superior performance of the dynamic spectrum-state representation model in both single-image SR and multimodal fusion-based SR tasks, highlighting its substantial potential to advance spectrally generalized modeling across arbitrary dimensions and modalities.

[152] Hybrid Event Frame Sensors: Modeling, Calibration, and Simulation

Yunfan Lu,Nico Messikommer,Xiaogang Xu,Liming Chen,Yuhan Chen,Nikola Zubic,Davide Scaramuzza,Hui Xiong

Main category: cs.CV

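TL;DR: Proposes the first unified, statistics-based imaging noise model that jointly describes the noise behavior of APS and EVS pixels in hybrid event-frame sensors, together with a calibration pipeline and the simulator HESIM; the model is validated on multiple real-world tasks.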

Details Motivation: Hybrid event-frame sensors have clear advantages, but the noise patterns introduced by their complex circuitry are not well understood or modeled, and the lack of a unified noise model limits performance optimization and application. Method: Builds a unified statistical noise model covering photon shot noise, dark current noise, fixed-pattern noise, and quantization noise, and links EVS noise to illumination and dark current; designs a calibration pipeline that estimates noise parameters from measured data; and develops the HESIM simulator, which generates RAW frames and event streams with realistic noise. Result: Experiments on two hybrid sensors show that the model accurately characterizes APS and EVS noise behavior, and that HESIM-generated data transfers well from simulation to real data on tasks such as video frame interpolation and deblurring. Conclusion: The unified noise model and calibration method give hybrid event-frame sensors an interpretable, quantifiable noise analysis framework, and HESIM supplies high-fidelity simulated data for downstream tasks, supporting practical use of these sensors. Abstract: Event frame hybrid sensors integrate an Active Pixel Sensor (APS) and an Event Vision Sensor (EVS) within a single chip, combining the high dynamic range and low latency of the EVS with the rich spatial intensity information from the APS. While this tight integration offers compact, temporally precise imaging, the complex circuit architecture introduces non-trivial noise patterns that remain poorly understood and unmodeled. In this work, we present the first unified, statistics-based imaging noise model that jointly describes the noise behavior of APS and EVS pixels. Our formulation explicitly incorporates photon shot noise, dark current noise, fixed-pattern noise, and quantization noise, and links EVS noise to illumination level and dark current. Based on this formulation, we further develop a calibration pipeline to estimate noise parameters from real data and offer a detailed analysis of both APS and EVS noise behaviors. Finally, we propose HESIM, a statistically grounded simulator that generates RAW frames and events under realistic, jointly calibrated noise statistics. Experiments on two hybrid sensors validate our model across multiple imaging tasks (e.g., video frame interpolation and deblurring), demonstrating strong transfer from simulation to real data.
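The four APS noise sources named in the abstract can be illustrated with a toy single-pixel simulation. This is only a sketch under stated assumptions, not the paper's formulation or HESIM itself: the function names, the per-pixel `gain`/`offset` stand-in for fixed-pattern noise, the `full_well` capacity, and the linear quantizer are all hypothetical, and the EVS side of the model is not covered.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson sample with Knuth's method (adequate for small lam)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_aps_pixel(mean_photons, dark_electrons, gain, offset,
                       bits, full_well, rng):
    """Toy APS pixel read-out (hypothetical parameterization).

    - photon shot noise and dark current noise: independent Poisson draws
    - fixed-pattern noise: a per-pixel gain and offset supplied by the caller
    - quantization noise: rounding to an integer code of `bits` bits
    """
    electrons = poisson(mean_photons, rng) + poisson(dark_electrons, rng)
    electrons = min(electrons, full_well)          # saturate at full well
    signal = gain * electrons + offset             # per-pixel FPN
    max_code = 2 ** bits - 1
    code = round(signal / full_well * max_code)    # quantize
    return max(0, min(max_code, code))
```

Calling `simulate_aps_pixel` repeatedly with different `gain` and `offset` values per pixel would give a noisy RAW frame; a calibration step would then estimate those parameters from real captures.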

[153] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye,Song Fei,Lei Zhu

Main category: cs.CV

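TL;DR: This paper proposes UltraFlux, a Flux-based diffusion transformer designed for native 4K image generation across multiple aspect ratios. Through data-model co-design, it systematically improves positional encoding, VAE compression, and the optimization objective, substantially improving the quality and stability of 4K generation.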

Details Motivation: Existing diffusion transformers perform well around 1K resolution, but extending them to native 4K across multiple aspect ratios exposes a coupled failure involving positional encoding, VAE compression, and optimization, which calls for a systematic solution. Method: UltraFlux (i) uses positional encoding that combines Resonance 2D RoPE with YaRN; (ii) introduces a non-adversarial VAE post-training scheme to improve reconstruction quality; (iii) uses an SNR-Aware Huber Wavelet loss to rebalance gradients; and (iv) adopts a Stage-wise Aesthetic Curriculum Learning strategy; the MultiAspect-4K-1M dataset is built to support training. Result: On the Aesthetic-Eval at 4096 benchmark and in multi-aspect-ratio 4K settings, UltraFlux outperforms strong open-source baselines on fidelity, aesthetic quality, and text alignment, and with an LLM prompt refiner it matches or surpasses Seedream 4.0. Conclusion: Through data-model co-design, UltraFlux achieves stable, detail-preserving native 4K generation across aspect ratios and offers an effective recipe for high-resolution diffusion models. Abstract: Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.

[154] IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment

Bowen Qu,Shangkun Sun,Xiaoyu Liang,Wei Gao

Main category: cs.CV

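TL;DR: This paper introduces IE-Bench, a benchmark suite for evaluating text-driven image editing that includes diverse data and human ratings, and IE-Critic-R1, a reinforcement-learning-based evaluation model that substantially improves agreement with human perception.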

Details Motivation: Existing evaluation methods do not adequately combine text-image alignment with human perception and fall short when evaluating text-driven image editing. Method: Builds the IE-Bench database of source images, editing prompts, and edited results, and proposes the IE-Critic-R1 model, trained with Reinforcement Learning from Verifiable Rewards (RLVR) to better align with subjective human ratings. Result: On nearly 4,000 samples, IE-Critic-R1 agrees with subjective human judgments better than existing metrics, and experiments confirm its advantage for evaluating text-driven image editing. Conclusion: IE-Bench and IE-Critic-R1 improve the evaluation of text-driven image editing and advance automated assessment that is closer to human perception. Abstract: Recent advances in text-driven image editing have been significant, yet the task of accurately evaluating these edited images continues to pose a considerable challenge. Different from the assessment of text-driven image generation, text-driven image editing is characterized by simultaneously conditioning on both text and a source image. The edited images often retain an intrinsic connection to the original image, which dynamically changes with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts and the corresponding edited results from different editing methods, and nearly 4,000 samples with corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior subjective alignment on the text-driven image editing task compared with previous metrics. Related data and codes are available to the public.

[155] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models

Dachuan Zhao,Weiyue Li,Zhenda Shen,Yushu Qiu,Bowen Xu,Haoyu Chen,Yongchao Chen

Main category: cs.CV

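TL;DR: Proposes SPD, a debiasing method based on subspace projection that addresses gender and other demographic biases in vision-language models and substantially improves fairness metrics across multiple tasks.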

Details Motivation: Existing debiasing methods suffer from feature entanglement, poor cross-dataset generalization, and incomplete bias removal, so a more fundamental geometric approach is needed to handle the distributed nature of bias. Method: Proposes the Subspace Projection Debiasing (SPD) framework, which identifies and removes the entire subspace from which bias is linearly decodable and reinserts a neutral mean component to preserve semantic fidelity. Result: SPD is validated on zero-shot classification, text-to-image retrieval, and image generation, with an average improvement of 18.5% across four fairness metrics and minimal loss in task performance compared to the best debiasing baseline. Conclusion: SPD is a geometrically principled and effective debiasing framework that removes bias from vision-language models more thoroughly and is broadly applicable. Abstract: Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose Subspace Projection Debiasing (SPD), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of 18.5% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
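One plausible reading of "remove the bias subspace and reinsert a neutral mean component" is sketched below. It is an illustration, not the authors' implementation: the bias directions are assumed to be given (for example, weight vectors of linear probes), `neutral_mean` is assumed to be the mean embedding of attribute-neutral inputs, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, eps=1e-12):
    """Orthonormalize the (assumed given) bias directions."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:                      # drop linearly dependent directions
            basis.append([wi / norm for wi in w])
    return basis


def debias(x, basis, neutral_mean):
    """Replace x's component inside the bias subspace with the neutral mean's.

    Everything orthogonal to the subspace is left untouched.
    """
    out = list(x)
    for b in basis:
        coeff = dot(x, b) - dot(neutral_mean, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out
```

Unlike the coordinate-wise replacement criticized in the abstract, this projection acts on directions that may mix many coordinates, so a bias that is spread across the embedding is removed in one step.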

[156] Hierarchical Semi-Supervised Active Learning for Remote Sensing

Wei Huang,Zhitong Xiong,Chenying Liu,Xiao Xiang Zhu

Main category: cs.CV

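TL;DR: Proposes HSSAL, a hierarchical semi-supervised active learning framework that combines semi-supervised learning with a new hierarchical active learning strategy, achieving efficient sample selection and strong label efficiency in remote sensing scene classification.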

Details Motivation: Annotating remote sensing imagery is costly and time-consuming, large amounts of unlabeled data go underused, and existing methods struggle to make sample selection both efficient and representative. Method: Proposes the HSSAL framework, which iteratively uses semi-supervised learning (SSL) to improve the model and a clustering-based hierarchical active learning (HAL) strategy to select the most informative samples that jointly satisfy scalability, diversity, and uncertainty. Result: Experiments on three remote sensing scene classification datasets (UCM, AID, and NWPU-RESISC45) show that HSSAL outperforms SSL-only and AL-only methods; with only 8%, 4%, and 2% labeled data, respectively, it reaches over 95% of fully supervised accuracy. Conclusion: HSSAL makes effective use of unlabeled data and substantially improves label efficiency, showing strong performance and practical potential for remote sensing image classification. Abstract: The performance of deep learning models in remote sensing (RS) strongly depends on the availability of high-quality labeled data. However, collecting large-scale annotations is costly and time-consuming, while vast amounts of unlabeled imagery remain underutilized. To address this challenge, we propose a Hierarchical Semi-Supervised Active Learning (HSSAL) framework that integrates semi-supervised learning (SSL) and a novel hierarchical active learning (HAL) in a closed iterative loop. In each iteration, SSL refines the model using both labeled data through supervised learning and unlabeled data via weak-to-strong self-training, improving feature representation and uncertainty estimation. Guided by the refined representations and uncertainty cues of unlabeled samples, HAL then conducts sample querying through a progressive clustering strategy, selecting the most informative instances that jointly satisfy the criteria of scalability, diversity, and uncertainty. This hierarchical process ensures both efficiency and representativeness in sample selection. Extensive experiments on three benchmark RS scene classification datasets, including UCM, AID, and NWPU-RESISC45, demonstrate that HSSAL consistently outperforms SSL- or AL-only baselines. Remarkably, with only 8%, 4%, and 2% labeled training data on UCM, AID, and NWPU-RESISC45, respectively, HSSAL achieves over 95% of fully-supervised accuracy, highlighting its superior label efficiency through informativeness exploitation of unlabeled data. Our code will be released at https://github.com/zhu-xlab/RS-SSAL.
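The idea of querying samples that are both diverse and uncertain can be sketched as follows. This is a simplified stand-in for the paper's progressive clustering strategy, not a reproduction of it: cluster assignments are assumed to be precomputed, predictive entropy is assumed as the uncertainty measure, and `select_queries` is a hypothetical name.

```python
import math


def entropy(probs):
    """Predictive entropy of one softmax output (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(cluster_ids, probs, budget):
    """Pick up to `budget` unlabeled samples for annotation.

    Diversity: visit clusters round-robin so every cluster is represented.
    Uncertainty: within a cluster, take the highest-entropy samples first.
    """
    by_cluster = {}
    for idx, cid in enumerate(cluster_ids):
        by_cluster.setdefault(cid, []).append(idx)
    for members in by_cluster.values():
        members.sort(key=lambda i: entropy(probs[i]), reverse=True)

    selected, rank = [], 0
    while len(selected) < budget:
        added = False
        for cid in sorted(by_cluster):
            members = by_cluster[cid]
            if rank < len(members) and len(selected) < budget:
                selected.append(members[rank])
                added = True
        if not added:          # every cluster is exhausted
            break
        rank += 1
    return selected
```

In a closed loop, the selected indices would be labeled, moved into the labeled set, and the model retrained with SSL before the next round of clustering and selection.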

[157] Assessing the alignment between infants' visual and linguistic experience using multimodal language models

Alvin Wei Ming Tan,Jane Yang,Tarun Sepuri,Khai Loong Aw,Robert Z. Sparks,Zi Yin,Virginia A. Marchman,Michael C. Frank,Bria Long

Main category: cs.CV

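TL;DR: This study uses contrastive language-image pretraining (CLIP) models to automatically analyze vision-language alignment in egocentric home videos recorded from the infant's perspective. It finds that idealized aligned moments for word learning (for example, a ball being in the infant's view when someone mentions "ball") are relatively rare in everyday environments and vary across children. The result suggests that infrequent alignment is a constraint that models of early word learning need to account for, and it offers a new method for studying children's multimodal environments.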

Details Motivation: To examine how closely children's visual and linguistic experiences are aligned in time during everyday language learning, and to overcome the reliance of earlier work on manually annotated vision-language co-occurrence data. Method: Uses CLIP models to automatically assess vision-language alignment in egocentric home videos recorded from the infant's perspective, validates the CLIP scores against human alignment judgments, and then applies the metric to a large corpus of infant-perspective videos. Result: Idealized aligned moments are rare in children's everyday experience compared with modern machine learning datasets, and alignment varies both within and across children. Conclusion: Infrequent alignment between visual and linguistic input is an important constraint on early word learning that existing learning models need to account for; CLIP also provides an efficient, scalable way to study children's real multimodal learning environments. Abstract: Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
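A CLIP-style alignment score is commonly computed as the cosine similarity between an image embedding and a text embedding. The sketch below assumes that the frame and utterance embeddings have already been produced by some encoder and are paired in time; the threshold and the function names are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def aligned_fraction(frame_embs, text_embs, threshold):
    """Share of (frame, utterance) pairs whose similarity reaches `threshold`."""
    scores = [cosine(f, t) for f, t in zip(frame_embs, text_embs)]
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)
```

Computing this fraction per child, or per recording session, is one simple way to expose the within-child and across-child variability the abstract describes.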

[158] A Lightweight, Interpretable Deep Learning System for Automated Detection of Cervical Adenocarcinoma In Situ (AIS)

Gabriela Fernandes

Main category: cs.CV

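TL;DR: This study develops a deep-learning-based virtual pathology assistant, training an EfficientNet-B3 model on the 2240 H&E-stained images of the CAISHI dataset to distinguish cervical adenocarcinoma in situ (AIS) from normal glandular tissue. Stain normalization and focal loss are used to improve performance; the model reaches 73.23% accuracy, offers interpretability through Grad-CAM, and is deployed as a Gradio application.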

Details Motivation: The histopathological diagnosis of cervical adenocarcinoma in situ (AIS) is challenging, early and accurate detection is essential to prevent progression to invasive adenocarcinoma, and assistive tools are needed to improve diagnostic consistency and efficiency. Method: Applies Macenko stain normalization and patch-based preprocessing to enhance morphological feature representation, and trains an EfficientNet-B3 convolutional neural network with class-balanced sampling and focal loss to handle data imbalance and emphasize hard-to-classify samples. Result: The model reaches an overall accuracy of 0.7323, with an F1-score of 0.75 for the Abnormal class and 0.71 for the Normal class; Grad-CAM heatmaps show interpretable activation patterns consistent with typical AIS morphology, such as nuclear atypia and glandular crowding. Conclusion: The study shows that lightweight, interpretable AI systems are feasible for cervical gland pathology and could support screening, education, and diagnosis in low-resource settings. Abstract: Cervical adenocarcinoma in situ (AIS) is a critical premalignant lesion whose accurate histopathological diagnosis is challenging. Early detection is essential to prevent progression to invasive cervical adenocarcinoma. In this study, we developed a deep learning-based virtual pathology assistant capable of distinguishing AIS from normal cervical gland histology using the CAISHI dataset, which contains 2240 expert-labeled H&E images (1010 normal and 1230 AIS). All images underwent Macenko stain normalization and patch-based preprocessing to enhance morphological feature representation. An EfficientNet-B3 convolutional neural network was trained using class-balanced sampling and focal loss to address dataset imbalance and emphasize difficult examples. The final model achieved an overall accuracy of 0.7323, with an F1-score of 0.75 for the Abnormal class and 0.71 for the Normal class. Grad-CAM heatmaps demonstrated biologically interpretable activation patterns, highlighting nuclear atypia and glandular crowding consistent with AIS morphology. The trained model was deployed in a Gradio-based virtual diagnostic assistant. These findings demonstrate the feasibility of lightweight, interpretable AI systems for cervical gland pathology, with potential applications in screening workflows, education, and low-resource settings.
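Focal loss down-weights easy examples so that training concentrates on hard ones. In its standard binary form it is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below shows that formula for a single example; the abstract does not report the values of `gamma` and `alpha` used, so the defaults here are the commonly cited ones and should be read as assumptions.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one example.

    p: predicted probability of the positive class
    y: ground-truth label, 0 or 1
    gamma, alpha: assumed defaults, not values reported by the study
    """
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which makes the effect of `gamma` easy to check in isolation.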

[159] From Pixels to Posts: Retrieval-Augmented Fashion Captioning and Hashtag Generation

Moazzam Umer Gondal,Hamad Ul Qudous,Daniya Siddiqui,Asma Ahmad Farhan

Main category: cs.CV

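TL;DR: This paper proposes a retrieval-augmented framework for fashion image captioning and hashtag generation that combines multi-garment detection, attribute reasoning, and large language model prompting to improve the accuracy and stylistic diversity of the generated text.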

Details Motivation: Existing end-to-end fashion captioning models fall short on attribute fidelity and domain generalization; this paper aims to improve the factuality and interpretability of generated text by bringing in external knowledge. Method: Uses YOLO for multi-garment localization, k-means to extract dominant colors, and a CLIP-FAISS retrieval module to infer fabric and gender attributes; these are combined with retrieved style examples into a factual evidence pack that guides a large language model to generate captions and hashtags. A fine-tuned BLIP model serves as the baseline. Result: The YOLO detector reaches 0.71 mAP@0.5 over nine garment categories; the RAG-LLM pipeline achieves a mean attribute coverage of 0.80 for captions and full coverage at the 50% threshold for hashtags, whereas BLIP attains higher lexical overlap but generalizes less well. Conclusion: Retrieval-augmented generation is an effective and interpretable paradigm for automated fashion content generation, with stronger factual grounding, less hallucination, and good potential for cross-domain extension. Abstract: This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains.
These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
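The dominant-color step of the pipeline can be illustrated with a small k-means over the RGB pixels of a garment crop. This is a minimal sketch, not the authors' code: the deterministic initialization, the fixed iteration count, and the choice of the largest cluster's centroid as "the" dominant color are assumptions, and `dominant_color` is a hypothetical name.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dominant_color(pixels, k=2, iters=10):
    """Return the rounded RGB centroid of the largest k-means cluster.

    pixels: list of (r, g, b) tuples, e.g. the pixels inside a detected box.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    k = min(k, len(pixels))
    step = max(1, len(pixels) // k)
    # Deterministic init (assumption): k evenly spaced pixels.
    centers = [tuple(float(c) for c in pixels[i * step]) for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for px in pixels:
            nearest = min(range(k), key=lambda j: sq_dist(px, centers[j]))
            groups[nearest].append(px)
        for j, members in enumerate(groups):
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(ch) / len(members) for ch in zip(*members))
    largest = max(range(k), key=lambda j: len(groups[j]))
    return tuple(round(c) for c in centers[largest])
```

The resulting RGB triple would still need to be mapped to a color name before it can go into a caption or hashtag; that mapping is outside this sketch.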

[160] VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection

Jianhang Yao,Yongbin Zheng,Siqi Lu,Wanying Xu,Peng Sun

Main category: cs.CV

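TL;DR: Proposes VK-Det, a visual-knowledge-guided open-vocabulary object detection framework that needs no extra supervision. By exploiting the vision encoder's region perception and a prototype-aware pseudo-labeling strategy, it achieves state-of-the-art performance on the DIOR and DOTA datasets.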

Details Motivation: Existing methods rely on text supervision to generate pseudo-labels, which induces semantic bias and restricts open-vocabulary expansion to text-specified concepts; a method is needed that reduces text dependence and generalizes better to novel categories. Method: 1) Uses the vision encoder's inherent region perception for fine-grained localization and adaptive distillation; 2) proposes a prototype-aware pseudo-labeling strategy that models inter-class decision boundaries through feature clustering and maps detected regions to latent categories via prototype matching. Result: Reaches 30.1 mAP^N on DIOR and 23.3 mAP^N on DOTA, outperforming even methods that use extra supervision. Conclusion: VK-Det reduces reliance on text supervision and, guided by visual knowledge, achieves better open-vocabulary detection and stronger detection of novel objects. Abstract: To identify objects beyond predefined categories, open-vocabulary aerial object detection (OVAD) leverages the zero-shot capabilities of visual-language models (VLMs) to generalize from base to novel categories. Existing approaches typically utilize self-learning mechanisms with weak text supervision to generate region-level pseudo-labels to align detectors with VLMs' semantic spaces. However, text dependence induces semantic bias, restricting open-vocabulary expansion to text-specified concepts. We propose VK-Det, a Visual Knowledge-guided open-vocabulary object Detection framework without extra supervision. First, we discover and leverage the vision encoder's inherent informative region perception to attain fine-grained localization and adaptive distillation. Second, we introduce a novel prototype-aware pseudo-labeling strategy. It models inter-class decision boundaries through feature clustering and maps detection regions to latent categories via prototype matching. This enhances attention to novel objects while compensating for missing supervision. Extensive experiments show state-of-the-art performance, achieving 30.1 $\mathrm{mAP}^{N}$ on DIOR and 23.3 $\mathrm{mAP}^{N}$ on DOTA, outperforming even methods that use extra supervision.
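Prototype matching, in its simplest form, assigns a region to the latent category whose prototype is most similar to the region feature. The sketch below assumes that clusters of region features are already available, takes each prototype to be the cluster mean, and rejects matches below a similarity threshold; these choices and all names are illustrative assumptions rather than details from the paper.

```python
import math


def mean_vector(vectors):
    """Prototype of a cluster, taken here to be the element-wise mean."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_pseudo_label(region_feat, prototypes, min_sim=0.5):
    """Return (label, similarity); label is None if no prototype is close enough."""
    best_label, best_sim = None, -1.0
    for label, proto in prototypes.items():
        sim = cosine(region_feat, proto)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < min_sim:
        return None, best_sim
    return best_label, best_sim
```

Regions that receive `None` would simply contribute no pseudo-label, which keeps low-confidence matches out of the training signal.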

[161] ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

Wencheng Ye,Tianshi Wang,Lei Zhu,Fengling Li,Guoli Yang

Main category: cs.CV

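TL;DR: This paper proposes ActDistill, an action-guided self-distillation framework that transfers the capabilities of large vision-language-action (VLA) models to lightweight models. Using graph-structured encapsulation and a dynamic routing mechanism, it substantially reduces computation and inference latency while maintaining high performance.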

Details Motivation: Existing VLA models generalize well, but their use in robotic manipulation is limited by heavy computation and inference latency, so a model compression method oriented toward action efficiency is needed. Method: Uses a pre-trained VLA model as the teacher, proposes a graph-structured encapsulation strategy to model the hierarchical evolution of action prediction, and designs a dynamic router that adaptively selects computation paths according to action demands, with knowledge transferred through hierarchical graph-informed supervision; at inference, the auxiliary components are removed and only the dynamically routed layers are kept. Result: On embodied intelligence benchmarks, ActDistill matches or exceeds full-scale VLA models while cutting computation by over 50%, with up to a 1.67x speedup. Conclusion: ActDistill offers a general and effective model compression paradigm for building efficient, low-latency embodied intelligence systems. Abstract: Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.

[162] Less Is More: An Explainable AI Framework for Lightweight Malaria Classification

Md Abdullah Al Kafi,Raka Moni,Sumit Kumar Banshal

Main category: cs.CV

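TL;DR: This study proposes EMFE, a transparent, reproducible, low-compute machine learning pipeline for binary classification of malaria cell images. Using only morphological features and lightweight models, it approaches deep learning performance on a CPU.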

Details Motivation: To examine whether complex deep learning models are necessary for the simple task of malaria classification, with the aim of developing an efficient, interpretable diagnostic method for compute-constrained settings. Method: Extracts two morphological features from the NIH Malaria Cell Images dataset (the number of non-background pixels and the number of holes within the cell), applies classical models such as Logistic Regression and Random Forest, and compares them with deep learning models such as ResNet18 and DenseNet121; an ensemble of Logistic Regression and Random Forest is built to raise accuracy further. Result: The single-variable Logistic Regression model reaches 94.80% test accuracy with a model size of only 1.2 kB and an inference time of 2.3 ms; the two-stage ensemble raises accuracy to 97.15%; the deep learning models need 13.6 MB to 44.7 MB of storage and take 68 ms for inference. Conclusion: The EMFE pipeline, built on simple interpretable features and lightweight models, offers gains over deep learning models in transparency, reproducibility, speed, and deployment feasibility, providing a practical diagnostic solution for resource-constrained settings. Abstract: Background and Objective: Deep learning models have high computational needs and lack interpretability but are often the first choice for medical image classification tasks. This study addresses whether complex neural networks are essential for the simple binary classification task of malaria. We introduce the Extracted Morphological Feature Engineered (EMFE) pipeline, a transparent, reproducible, and low-compute machine learning approach tailored explicitly for simple cell morphology, designed to achieve deep learning performance levels on a simple CPU-only setup with the practical aim of real-world deployment. Methods: The study used the NIH Malaria Cell Images dataset, with two features extracted from each cell image: the number of non-background pixels and the number of holes within the cell. Logistic Regression and Random Forest were compared against ResNet18, DenseNet121, MobileNetV2, and EfficientNet across accuracy, model size, and CPU inference time. An ensemble model was created by combining Logistic Regression and Random Forests to achieve higher accuracy while retaining efficiency. Results: The single-variable Logistic Regression model achieved a test accuracy of 94.80 percent with a file size of 1.2 kB and negligible inference latency (2.3 ms). The two-stage ensemble improved accuracy to 97.15 percent. In contrast, the deep learning methods require 13.6 MB to 44.7 MB of storage and show significantly higher inference times (68 ms).
Conclusion: This study shows that a compact feature engineering approach can produce clinically meaningful classification performance while offering gains in transparency, reproducibility, speed, and deployment feasibility. The proposed pipeline demonstrates that simple interpretable features paired with lightweight models can serve as a practical diagnostic solution for environments with limited computational resources.
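The two morphological features described in the Methods can be computed from a binary cell mask with a few lines of code. The sketch below assumes the image has already been thresholded into a rectangular 0/1 mask and treats a "hole" as a 4-connected background region that does not touch the image border; the paper's exact preprocessing and hole definition are not given in the abstract, so both are assumptions.

```python
from collections import deque


def count_foreground(mask):
    """Number of non-background pixels in a 0/1 mask (list of rows)."""
    return sum(1 for row in mask for v in row if v)


def count_holes(mask):
    """Number of background regions fully enclosed by foreground."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    holes = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one background region.
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border:
                holes += 1
    return holes
```

The resulting pair `(count_foreground(mask), count_holes(mask))` is the kind of two-number feature vector that a Logistic Regression or Random Forest classifier could consume directly.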

[163] Together, Then Apart: Revisiting Multimodal Survival Analysis via a Min-Max Perspective

Wenjing Liu,Qin Ren,Wen Zhang,Yuewei Lin,Chenyu You

Main category: cs.CV

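TL;DR: This paper proposes Together-Then-Apart (TTA), a unified framework for multimodal survival analysis that jointly optimizes alignment and distinctiveness to improve model performance and interpretability.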

Details Motivation: Existing methods overemphasize cross-modal alignment, which causes loss of modality-specific information and representation collapse, and they overlook the importance of preserving modality-specific structure. Method: Proposes the TTA framework, consisting of a Together stage (aligning embeddings via shared prototypes and unbalanced optimal transport) and an Apart stage (preserving representational diversity via modality anchors and a contrastive regularizer), and uses a min-max optimization strategy to jointly learn shared and modality-specific representations. Result: Experiments on five TCGA datasets show that TTA consistently outperforms state-of-the-art methods, with stronger robustness and biological interpretability. Conclusion: Alignment and distinctiveness should be modeled together; TTA provides a better theoretical perspective and practical framework for multimodal survival analysis. Abstract: Integrating heterogeneous modalities such as histopathology and genomics is central to advancing survival analysis, yet most existing methods prioritize cross-modal alignment through attention-based fusion mechanisms, often at the expense of modality-specific characteristics. This overemphasis on alignment leads to representation collapse and reduced diversity. In this work, we revisit multi-modal survival analysis via the dual lens of alignment and distinctiveness, positing that preserving modality-specific structure is as vital as achieving semantic coherence. We introduce Together-Then-Apart (TTA), a unified min-max optimization framework that simultaneously models shared and modality-specific representations. The Together stage minimizes semantic discrepancies by aligning embeddings via shared prototypes, guided by an unbalanced optimal transport objective that adaptively highlights informative tokens. The Apart stage maximizes representational diversity through modality anchors and a contrastive regularizer that preserve unique modality information and prevent feature collapse. Extensive experiments on five TCGA benchmarks show that TTA consistently outperforms state-of-the-art methods. Beyond empirical gains, our formulation provides a new theoretical perspective on how alignment and distinctiveness can be jointly achieved for robust, interpretable, and biologically meaningful multi-modal survival analysis.

[164] Versatile Recompression-Aware Perceptual Image Super-Resolution

Mingwei He,Tongda Xu,Xingtong Ge,Ming Sun,Chao Zhou,Yan Wang

Main category: cs.CV

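TL;DR: This paper proposes VRPSR, a versatile recompression-aware perceptual super-resolution method. It uses a pre-trained diffusion model to simulate multiple compression formats, enabling joint optimization of super-resolution with real compression settings, which substantially lowers bitrate and improves reconstruction quality.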

Details Motivation: Existing perceptual super-resolution methods ignore the recompression that follows image restoration, so compression introduces extra artifacts and hurts practical results; in addition, codecs are non-differentiable and vary in configuration, which makes joint optimization difficult. Method: Formulates compression as conditional text-to-image generation and uses a pre-trained diffusion model to build a generalizable codec simulator; proposes training techniques tailored to perceptual super-resolution, including optimizing the simulator with perceptual targets and using slightly compressed images as the training target. Result: Built on Real-ESRGAN and S3Diff, VRPSR saves more than 10% bitrate under H.264/H.265/H.266 compression and supports joint optimization of the super-resolution model and a post-processing model applied after recompression. Conclusion: VRPSR enables perceptual super-resolution to be optimized together with a variety of real compression settings, improving the quality and efficiency of restored images in real transmission and storage scenarios. Abstract: Perceptual image super-resolution (SR) methods restore degraded images and produce sharp outputs. In practice, those outputs are usually recompressed for storage and transmission. Ignoring recompression is suboptimal as the downstream codec may add artifacts to restored images. However, jointly optimizing SR and recompression is challenging, as the codecs are not differentiable and vary in configuration. In this paper, we present Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR), which makes existing perceptual SR aware of versatile compression. First, we formulate compression as conditional text-to-image generation and utilize a pre-trained diffusion model to build a generalizable codec simulator. Next, we propose a set of training techniques tailored for perceptual SR, including optimizing the simulator using perceptual targets and adopting slightly compressed images as the training target. Empirically, our VRPSR saves more than 10% bitrate based on Real-ESRGAN and S3Diff under H.264/H.265/H.266 compression. Besides, our VRPSR facilitates joint optimization of the SR and post-processing model after recompression.

[165] Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

Aditya Chinchure,Sahithya Ravi,Pushkar Shukla,Vered Shwartz,Leonid Sigal

Main category: cs.CV

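TL;DR: This paper introduces the Spotlight task, which aims to localize and explain fine-grained errors made by text-to-video generation models. Analyzing more than 1600 errors in 600 videos, it finds that current vision-language models lag significantly behind humans at identifying and localizing video errors, and it proposes inference-time strategies that improve performance by nearly 2x.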

Details Motivation: Current text-to-video models can generate high-quality videos but still make local, subtle errors, and current evaluation methods do not localize or describe these errors, so a finer-grained form of evaluation is needed to improve the models. Method: Generates 600 videos from 200 diverse text prompts using three state-of-the-art video generators (Veo 3, Seedance, and LTX-2) and annotates fine-grained errors across six types; builds the Spotlight task on this data, evaluates existing vision-language models on error identification and localization, and proposes inference-time strategies to improve performance. Result: Prompt-adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors appear in shorter segments; current vision-language models are far behind humans at error identification and localization, but the proposed inference-time strategies improve performance by nearly 2x. Conclusion: The Spotlight task opens a path toward finer-grained evaluation tools and more sophisticated reward models for video generation. Abstract: Current text-to-video models (T2V) can generate high-quality, temporally coherent, and visually realistic videos. Nonetheless, errors still often occur, and are more nuanced and local compared to the previous generation of T2V models. While current evaluation paradigms assess video models across diverse dimensions, they typically evaluate videos holistically without identifying when specific errors occur or describing their nature. We address this gap by introducing Spotlight, a novel task aimed at localizing and explaining video-generation errors. We generate 600 videos using 200 diverse textual prompts and three state-of-the-art video generators (Veo 3, Seedance, and LTX-2), and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. We observe that adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors manifest in shorter segments. We then evaluate current VLMs on Spotlight and find that VLMs lag significantly behind humans in error identification and localization in videos. We propose inference-time strategies to probe the limits of current VLMs on our task, improving performance by nearly 2x. Our task paves a way forward to building fine-grained evaluation tools and more sophisticated reward models for video generators.

[166] Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu,Xiufeng Song,Huayu Zheng,Lei Bai,Xiaoming Liu,Guangtao Zhai

Main category: cs.CV

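TL;DR: This paper proposes MM-Det++, a multimodal algorithm for detecting videos generated by diffusion models. It consists of a spatio-temporal branch and a multimodal branch, adds a Unified Multimodal Learning module to improve generalization, and is accompanied by the large-scale DVF dataset.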

Details Motivation: Existing methods focus mainly on image-level forgery detection and leave generic video-level forgery detection underexplored; as diffusion-generated videos proliferate, reliable detection of synthetic media is urgently needed. Method: Proposes MM-Det++, comprising a Spatio-Temporal (ST) branch that uses a Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information, a Multimodal (MM) branch that uses multimodal large language models to obtain a semantic forgery representation, and a Unified Multimodal Learning (UML) module that fuses the two representations. Result: Experiments on the self-built large-scale DVF dataset show that MM-Det++ outperforms existing methods at detecting diffusion-generated videos and confirm the effectiveness of unified multimodal learning. Conclusion: By combining spatio-temporal modeling with multimodal semantic reasoning, MM-Det++ substantially improves detection of diffusion-generated videos and advances video forensics. Abstract: The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens enable the capture of holistic forgery traces from each video frame. In parallel, the Multimodal (MM) branch adopts a learnable reasoning paradigm to acquire Multimodal Forgery Representation (MFR) by harnessing the powerful comprehension and reasoning capabilities of Multimodal Large Language Models (MLLMs), which discerns the forgery traces from a flexible semantic perspective. To integrate multimodal representations into a coherent space, a UML module is introduced to consolidate the generalization ability of MM-Det++. In addition, we establish a large-scale and comprehensive Diffusion Video Forensics (DVF) dataset to advance research in video forgery detection. Extensive experiments demonstrate the superiority of MM-Det++ and highlight the effectiveness of unified multimodal forgery learning in detecting diffusion-generated videos.

[167] AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens

Purvish Jajal,Nick John Eliopoulos,Benjamin Shiue-Hal Chou,George K. Thiruvathukal,Yung-Hsiang Lu,James C. Davis

Main category: cs.CV

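TL;DR: AdaPerceiver is the first Transformer architecture with unified adaptivity across depth, width, and token count. A joint training strategy lets it adjust compute flexibly across tasks, substantially raising throughput and lowering FLOPs while maintaining strong performance.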

Details Motivation: Existing Transformers allocate computation rigidly at inference time and struggle to adapt to different hardware and latency requirements; most dynamic-computation methods target a single axis (such as token count) and lack unified multi-axis adaptivity. Method: Proposes the AdaPerceiver architecture, which supports adaptivity along depth, width, and tokens; designs an efficient joint training scheme so that the model performs well across its configurations; and pairs the model with a policy for dynamic adjustment at run time. Result: On image classification, AdaPerceiver reaches 85.4% accuracy with 36% higher throughput than FlexiViT-L; on dense prediction, it matches ViT-H/14 on semantic segmentation and depth estimation with about 26x fewer encoder FLOPs; with a policy, it cuts FLOPs by 24-33% while keeping ImageNet1K accuracy within ±0.1 percentage points. Conclusion: AdaPerceiver achieves unified multi-axis adaptive computation, offering an effective route to efficient Transformer deployment in real-world settings and a better accuracy-efficiency tradeoff. Abstract: Modern transformer architectures achieve remarkable performance across tasks and domains but remain rigid in how they allocate computation at inference time. Real-world deployment often requires models to adapt to diverse hardware and latency constraints, yet most approaches to dynamic computation focus on a single axis, such as reducing the number of tokens. We present a novel capability: AdaPerceiver, the first transformer architecture with unified adaptivity across depth, width, and tokens within a single model. We propose an architecture that supports adaptivity along these axes. We couple this with an efficient joint training regime that ensures the model maintains performance across its various configurations. We evaluate AdaPerceiver on image classification, semantic segmentation, and depth estimation tasks. On image classification, AdaPerceiver expands the accuracy-throughput Pareto front. It achieves 85.4% accuracy while yielding 36% higher throughput than FlexiViT-L. On dense prediction, AdaPerceiver matches ViT-H/14 while having $\sim$26x fewer encoder FLOPs (floating-point operations) on semantic segmentation and depth estimation. Finally, we show how AdaPerceiver equipped with a policy can maintain ImageNet1K accuracy ($\pm0.1$ percentage points) while reducing FLOPs by $24$-$33$%.

[168] Muskie: Multi-view Masked Image Modeling for 3D Vision Pre-training

Wenyu Li,Sidun Liu,Peng Qiao,Yong Dou,Tongrui Hu

Main category: cs.CV

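TL;DR: Proposes Muskie, a native multi-view vision backbone for 3D vision tasks. It learns multi-view consistency and geometric understanding by reconstructing heavily masked view content during pre-training, needs no 3D supervision, and performs better on downstream tasks.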

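Details Motivation: Existing models mostly process frames individually and offer limited multi-view consistency, so they cannot fully exploit geometric relations across views, which limits performance on 3D vision tasks. Method: Designs the Muskie network, which processes multiple views simultaneously and reconstructs heavily masked content in a single view by finding and using geometric correspondences from other views; this serves as the pre-training pretext task and is combined with an aggressive masking strategy. Result: Muskie achieves higher multi-view correspondence accuracy than state-of-the-art frame-wise backbones such as DINO and delivers stronger performance on downstream 3D tasks such as camera pose estimation and pointmap reconstruction. Conclusion: By introducing multi-view consistency into pre-training, Muskie implicitly learns view-invariant features and strengthens geometric understanding, making it an effective multi-view vision backbone that requires no 3D supervision. Abstract: We present Muskie, a native multi-view vision backbone designed for 3D vision tasks. Unlike existing models, which are frame-wise and exhibit limited multi-view consistency, Muskie is designed to process multiple views simultaneously and introduce multi-view consistency in the pre-training stage. Muskie is trained to reconstruct heavily masked content in one view by finding and utilizing geometric correspondences from other views. Through this pretext task and our proposed aggressive masking strategy, the model implicitly learns view-invariant features and develops strong geometric understanding without any 3D supervision. Compared with state-of-the-art frame-wise backbones such as DINO, Muskie achieves higher multi-view correspondence accuracy. Furthermore, we demonstrate that using Muskie as a backbone consistently enhances performance on downstream 3D tasks, including camera pose estimation and pointmap reconstruction. Code is publicly available at https://leo-frank.github.io/Muskie/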

[169] PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures

Yuheng Shao,Lizhang Wang,Changhao Li,Peixian Chen,Qinyuan Liu

Main category: cs.CV

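TL;DR: Proposes PromptMoE, which dynamically combines expert prompts through a visually-guided Mixture-of-Experts mechanism to improve the generalization of zero-shot anomaly detection.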

Details Motivation: Existing CLIP-based zero-shot anomaly detection methods are limited by fixed or learnable prompt strategies, which create a representational bottleneck, overfit easily, and cope poorly with the diversity of unseen anomalies. Method: Designs a prompt pool containing multiple expert prompts and uses an image-gated sparse MoE mechanism (VGMoP) to combine them dynamically, generating semantically rich textual representations. Result: The method is validated on 15 industrial and medical datasets and achieves state-of-the-art performance. Conclusion: Through compositional prompt learning, PromptMoE markedly improves the generalization and performance of zero-shot anomaly detection. Abstract: Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance. Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.
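
The idea of an image-gated sparse mixture over a prompt pool can be illustrated with a minimal sketch. This is not the authors' VGMoP implementation: the function name `sparse_prompt_mixture`, the top-k selection rule, and the softmax over the selected experts are assumptions chosen for illustration, and the gate logits are assumed to come from some image-conditioned gating network that is not shown.

```python
import math

def sparse_prompt_mixture(gate_logits, prompt_pool, k=2):
    """Mix the k expert prompts with the largest gate logits (hypothetical helper).

    gate_logits: one score per expert, assumed to be produced from image features.
    prompt_pool: list of expert prompt embeddings (lists of floats, equal length).
    """
    # Sparse routing: keep only the k highest-scoring experts.
    top_idx = sorted(range(len(gate_logits)),
                     key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted for numerical stability).
    peak = max(gate_logits[i] for i in top_idx)
    exps = [math.exp(gate_logits[i] - peak) for i in top_idx]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the selected prompt embeddings.
    dim = len(prompt_pool[0])
    mixed = [sum(w * prompt_pool[i][d] for w, i in zip(weights, top_idx))
             for d in range(dim)]
    return mixed, top_idx, weights

# Toy example: 4 experts, 3-dimensional prompt embeddings.
gate_logits = [0.1, 2.0, -1.0, 1.0]
prompt_pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
mixed, top_idx, weights = sparse_prompt_mixture(gate_logits, prompt_pool, k=2)
```

Because only k experts receive non-zero weight, each image yields a different sparse combination of the same shared prompts, which is the compositional behavior the abstract describes.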

[170] MVS-TTA: Test-Time Adaptation for Multi-View Stereo via Meta-Auxiliary Learning

Hannuo Zhang,Zhixiang Chi,Yang Wang,Xinxin Zuo

Main category: cs.CV

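TL;DR: This paper proposes MVS-TTA, an efficient test-time adaptation framework that improves the generalization of learning-based multi-view stereo methods across scenes by introducing a self-supervised cross-view consistency loss and a meta-auxiliary learning strategy.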

Details Motivation: Existing learning-based MVS methods generalize poorly because their training data distributions are limited, while optimization-based methods allow scene-specific adaptation but lack scalability and are computationally expensive. Method: Proposes the MVS-TTA framework, which uses a self-supervised cross-view consistency loss as an auxiliary task and a meta-auxiliary learning strategy that trains the model to benefit from auxiliary-task parameter updates at inference time; the framework is model-agnostic and applies to a range of MVS methods. Result: Experiments on standard datasets (DTU, BlendedMVS) and a challenging cross-dataset setting show that MVS-TTA consistently improves existing MVS models, including state-of-the-art ones. Conclusion: MVS-TTA is the first to bring meta-learning-based test-time adaptation to learning-based MVS, combining the strengths of learning-based and optimization-based methods and markedly improving adaptability and generalization while remaining efficient. Abstract: Recent learning-based multi-view stereo (MVS) methods are data-driven and have achieved remarkable progress due to large-scale training data and advanced architectures. However, their generalization remains sub-optimal due to fixed model parameters trained on limited training data distributions. In contrast, optimization-based methods enable scene-specific adaptation but lack scalability and require costly per-scene optimization. In this paper, we propose MVS-TTA, an efficient test-time adaptation (TTA) framework that enhances the adaptability of learning-based MVS methods by bridging these two paradigms. Specifically, MVS-TTA employs a self-supervised, cross-view consistency loss as an auxiliary task to guide inference-time adaptation. We introduce a meta-auxiliary learning strategy to train the model to benefit from auxiliary-task-based updates explicitly. Our framework is model-agnostic and can be applied to a wide range of MVS methods with minimal architectural changes. Extensive experiments on standard datasets (DTU, BlendedMVS) and a challenging cross-dataset generalization setting demonstrate that MVS-TTA consistently improves performance, even when applied to state-of-the-art MVS models. To our knowledge, this is the first attempt to integrate optimization-based test-time adaptation into learning-based MVS using meta-learning. The code will be available at https://github.com/mart87987-svg/MVS-TTA.

[171] VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

Ming Zhong,Yuanlei Wang,Liuzhou Zhang,Arctanx An,Renrui Zhang,Hao Liang,Ming Lu,Ying Shen,Wentao Zhang

Main category: cs.CV

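TL;DR: Proposes the VCU-Bridge framework and the HVCU-Bench benchmark for multi-level visual connotation understanding, reveals that MLLM performance declines at higher reasoning levels, and strengthens low-level capabilities through MCTS-guided data generation to improve overall performance.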

Details Motivation: Existing MLLM evaluation protocols separate low-level perception from high-level reasoning and ignore their semantic and causal dependencies, which yields non-diagnostic results; an evaluation scheme closer to the hierarchy of human visual understanding is needed. Method: Proposes the VCU-Bridge framework, which performs multi-level reasoning from perception through semantic bridging to abstract connotation, and builds the HVCU-Bench benchmark for level-wise diagnosis; a Monte Carlo Tree Search (MCTS)-guided pipeline generates instruction-tuning data to strengthen low-level capabilities. Result: Experiments show that MLLM performance declines consistently as reasoning moves to higher levels; strengthening low-level capabilities through MCTS improves results on HVCU-Bench and also on general benchmarks, by 2.53% on average (MMStar +7.26%). Conclusion: A hierarchical reasoning pattern is essential for improving the visual understanding of MLLMs; strengthening low-level perception effectively supports high-level reasoning, validating the effectiveness of human-like hierarchical thinking. Abstract: While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.
The project page is at https://vcu-bridge.github.io .

[172] SFHand: A Streaming Framework for Language-guided 3D Hand Forecasting and Embodied Manipulation

Ruicong Liu,Yifei Huang,Liangyang Ouyang,Caixin Kang,Yoichi Sato

Main category: cs.CV

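TL;DR: This paper proposes SFHand, the first streaming framework for language-guided 3D hand forecasting, which autoregressively predicts future hand states from a video stream and language instructions. It also releases EgoHaFL, the first large-scale dataset with synchronized 3D hand poses and language instructions, and reports substantial gains in both forecasting performance and transfer to downstream tasks.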

Details Motivation: Existing 3D hand forecasting methods depend on offline video sequences and cannot use language instructions that convey task intent, so they do not meet the needs of real-time interactive settings such as AR and assistive robotics; a streaming forecasting framework that handles continuous input and incorporates language guidance is required. Method: Proposes the SFHand framework, which combines a streaming autoregressive architecture with an ROI-enhanced memory layer and, given a continuous stream of video and language instructions, autoregressively predicts future hand type, 2D bounding box, 3D pose, and trajectory; also builds the large-scale EgoHaFL dataset to support research on language-guided hand forecasting. Result: SFHand reaches state-of-the-art results in 3D hand forecasting, outperforming prior methods by up to 35.8%; its learned representations transfer to embodied manipulation tasks, raising task success rates by up to 13.4% on multiple benchmarks. Conclusion: SFHand is the first to achieve language-guided streaming 3D hand forecasting; by combining a streaming architecture with an attention mechanism, it captures temporal context and salient hand regions, and together with the new EgoHaFL dataset it advances real-time hand understanding for human-computer interaction. Abstract: Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.

[173] Video4Edit: Viewing Image Editing as a Degenerate Temporal Process

Xiaofan Li,Yanpeng Sun,Chenming Wu,Fan Duan,YuAn Wang,Weihao Bo,Yumeng Zhang,Dingkang Liang

Main category: cs.CV

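TL;DR: Proposes an image editing method grounded in a temporal-modeling view. It uses single-frame evolution priors from video pre-training to enable data-efficient fine-tuning and matches the performance of mainstream models with only about 1% of their supervision data.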

Details Motivation: Existing image editing models depend on large volumes of high-quality triplets (instruction, source image, edited image), which are costly to curate and demand precise instruction wording; a more data-efficient approach is needed. Method: Treats image editing as a degenerate temporal process and transfers single-frame evolution priors from video pre-training to the image editing task, enabling fine-tuning with little data. Result: With very little supervision (about 1% of what mainstream methods use), performance is on par with leading open-source baselines. Conclusion: A temporal-modeling perspective can greatly reduce the dependence of multimodal image editing models on annotated data and move them toward more efficient, practical use. Abstract: We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.

[174] SCALER: SAM-Enhanced Collaborative Learning for Label-Deficient Concealed Object Segmentation

Chunming He,Rihan Zhang,Longxiang Tang,Ziyun Yang,Kai Li,Deng-Ping Fan,Sina Farsiu

Main category: cs.CV

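TL;DR: Proposes the SCALER framework, which alternately optimizes a mean-teacher segmenter and a learnable SAM to improve performance on label-deficient concealed object segmentation.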

Details Motivation: Existing methods for label-deficient concealed object segmentation are limited by the concealment of targets and the scarcity of annotations and fail to make full use of complementary information. Method: Designs a two-phase alternating training framework: in the first phase, a fixed SAM generates weighted pseudo-labels to optimize the segmenter; in the second phase, SAM is updated with augmentation invariance and noise resistance losses, giving bidirectional supervision and collaborative optimization. Result: Consistent performance gains on eight semi- and weakly-supervised COS tasks confirm the effectiveness and generality of the framework. Conclusion: SCALER effectively combines consistency constraints with SAM supervision, improves both models through reciprocal learning, and can serve as a general training paradigm under label scarcity. Abstract: Existing methods for label-deficient concealed object segmentation (LDCOS) either rely on consistency constraints or Segment Anything Model (SAM)-based pseudo-labeling. However, their performance remains limited due to the intrinsic concealment of targets and the scarcity of annotations. This study investigates two key questions: (1) Can consistency constraints and SAM-based supervision be jointly integrated to better exploit complementary information and enhance the segmenter? and (2) beyond that, can the segmenter in turn guide SAM through reciprocal supervision, enabling mutual improvement? To answer these questions, we present SCALER, a unified collaborative framework toward LDCOS that jointly optimizes a mean-teacher segmenter and a learnable SAM. SCALER operates in two alternating phases. In Phase I, the segmenter is optimized under fixed SAM supervision using entropy-based image-level and uncertainty-based pixel-level weighting to select reliable pseudo-label regions and emphasize harder examples. In Phase II, SAM is updated via augmentation invariance and noise resistance losses, leveraging its inherent robustness to perturbations. Experiments demonstrate that SCALER yields consistent performance gains across eight semi- and weakly-supervised COS tasks. The results further suggest that SCALER can serve as a general training paradigm to enhance both lightweight segmenters and large foundation models under label-scarce conditions. Code will be released.

[175] Compact neural networks for astronomy with optimal transport bias correction

Shuhuan Wang,Yuzhen Xie,Jiayi Li

Main category: cs.CV

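TL;DR: WaveletMamba is a theory-driven framework that combines wavelet decomposition with state-space modeling. It improves the efficiency and accuracy of astronomical image classification and redshift prediction, delivers high-resolution performance from low-resolution inputs, and adds a multi-level bias correction mechanism.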

Details Motivation: Astronomical imaging faces a tradeoff between efficiency and resolution that limits large-scale morphological classification and redshift prediction, so an efficient and accurate model is needed to overcome this bottleneck. Method: Proposes the WaveletMamba framework, which integrates wavelet decomposition, state-space modeling, mathematical regularization, and multi-level bias correction; it uses HK distance for distribution-level correction and Color-Aware Weighting for sample-level correction, improving stability and accuracy across resolutions. Result: At 64x64 resolution the model reaches 81.72% ± 0.53% classification accuracy with only 3.54M parameters; at 244x244 it reaches 80.93% ± 0.27%, with a 9.7x gain in computational efficiency; Log-MSE improves by 22.96% and outliers drop by 26.10%. Conclusion: Through mathematical rigor, WaveletMamba achieves efficient, accurate, and stable astronomical image analysis, showing the potential of theory-driven methods in scientific AI and offering a new approach for work at the intersection of computer vision and astrophysics. Abstract: Astronomical imaging confronts an efficiency-resolution tradeoff that limits large-scale morphological classification and redshift prediction. We introduce WaveletMamba, a theory-driven framework integrating wavelet decomposition with state-space modeling, mathematical regularization, and multi-level bias correction. WaveletMamba achieves 81.72% +/- 0.53% classification accuracy at 64x64 resolution with only 3.54M parameters, delivering high-resolution performance (80.93% +/- 0.27% at 244x244) at low-resolution inputs with 9.7x computational efficiency gains. The framework exhibits Resolution Multistability, where models trained on low-resolution data achieve consistent accuracy across different input scales despite divergent internal representations. The framework's multi-level bias correction synergizes HK distance (distribution-level optimal transport) with Color-Aware Weighting (sample-level fine-tuning), achieving 22.96% Log-MSE improvement and 26.10% outlier reduction without explicit selection function modeling. Here, we show that mathematical rigor enables unprecedented efficiency and comprehensive bias correction in scientific AI, bridging computer vision and astrophysics to revolutionize interdisciplinary scientific discovery.

[176] UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors

Chunming He,Rihan Zhang,Zheng Chen,Bowen Yang,Chengyu Fang,Yunlong Lin,Fengyang Xiao,Sina Farsiu

Main category: cs.CV

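TL;DR: This paper proposes UnfoldLDM, which combines deep unfolding networks (DUNs) with a latent diffusion model (LDM) for blind image restoration (BIR). A multi-granularity degradation-aware module and a degradation-resistant LDM overcome the degradation-specific dependency and over-smoothing bias of conventional DUNs.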

Details Motivation: Existing DUNs for blind image restoration are limited by degradation-specific dependency and over-smoothing bias, so they handle unknown degradations poorly and lose texture detail. Method: Proposes the UnfoldLDM framework: 1) a multi-granularity degradation-aware (MGDA) module estimates the unknown degradation; 2) a degradation-resistant LDM (DR-LDM) extracts degradation-invariant priors; 3) an over-smoothing correction transformer (OCFormer) recovers high-frequency details. Result: Experiments show that UnfoldLDM achieves leading performance on various BIR tasks, removes degradations effectively while enhancing textures, and works as a plug-and-play module compatible with existing DUN methods. Conclusion: UnfoldLDM combines the strengths of DUNs and LDMs, resolves key bottlenecks of DUNs in blind image restoration, and offers strong robustness, detail preservation, and broad applicability. Abstract: Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

[177] Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design

Pasquale De Marinis,Uzay Kaymak,Rogier Brussee,Gennaro Vessio,Giovanna Castellano

Main category: cs.CV

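TL;DR: This paper proposes Affinity Explainer, the first explanation method for matching-based few-shot semantic segmentation (FSS) models. It produces attribution maps by extracting the support-image pixels that contribute most to the query prediction and designs interpretability metrics suited to FSS. Experiments show that it outperforms existing attribution methods and that its explanations are structurally coherent and useful for diagnosis.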

Details Motivation: Few-shot semantic segmentation models perform well in data-scarce settings, but their decision process lacks transparency and explainable AI remains largely unexplored for FSS. Explanation methods designed specifically for FSS are needed to improve model trustworthiness, support set selection, and model diagnosis. Method: Proposes Affinity Explainer, which uses the matching scores between support and query features at multiple feature levels in matching-based FSS models to derive pixel-level attribution maps; also extends existing interpretability metrics and designs new ones suited to FSS. Result: Across several FSS benchmark datasets and different models, the method significantly outperforms standard attribution methods adapted directly to FSS; visualizations show attribution maps that are clearly structured, consistent with the model's mechanism, and usable for error analysis. Conclusion: The work lays a foundation for interpretable few-shot semantic segmentation, improving model transparency and diagnosability and supporting more reliable practical systems. Abstract: Few-Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision-making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data-scarce scenarios. This paper introduces the first dedicated method for interpreting matching-based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few-shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostics for more reliable few-shot segmentation systems.
The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.

[178] Nested Unfolding Network for Real-World Concealed Object Segmentation

Chunming He,Rihan Zhang,Dingming Zhang,Fengyang Xiao,Deng-Ping Fan,Sina Farsiu

Main category: cs.CV

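TL;DR: Proposes the nested unfolding network (NUN), whose DUN-in-DUN structure decouples image restoration from segmentation and, with the help of a vision-language model, performs prior-free concealed object segmentation in real-world scenes.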

Details Motivation: Existing deep-unfolding methods have conflicting objectives between background estimation and image restoration and depend on pre-defined degradation types, so they cope poorly with complex real-world degradations. Method: Designs the NUN framework, in which the inner DeRUN uses a vision-language model to infer degradation semantics dynamically and restore images without explicit priors, while the outer SODUN performs reversible foreground-background separation; image-quality assessment across stages selects the best outputs, and a self-consistency loss is introduced. Result: NUN achieves leading performance on both clean and degraded datasets and markedly improves the robustness of concealed object segmentation in real-world scenes. Conclusion: NUN effectively decouples restoration from segmentation; through joint optimization of the inner and outer loops and VLM-guided degradation understanding, it delivers more practical and robust real-world concealed object segmentation. Abstract: Deep unfolding networks (DUNs) have recently advanced concealed object segmentation (COS) by modeling segmentation as iterative foreground-background separation. However, existing DUN-based methods (RUN) inherently couple background estimation with image restoration, leading to conflicting objectives and requiring pre-defined degradation types, which are unrealistic in real-world scenarios. To address this, we propose the nested unfolding network (NUN), a unified framework for real-world COS. NUN adopts a DUN-in-DUN design, embedding a degradation-resistant unfolding network (DeRUN) within each stage of a segmentation-oriented unfolding network (SODUN). This design decouples restoration from segmentation while allowing mutual refinement. Guided by a vision-language model (VLM), DeRUN dynamically infers degradation semantics and restores high-quality images without explicit priors, whereas SODUN performs reversible estimation to refine foreground and background. Leveraging the multi-stage nature of unfolding, NUN employs image-quality assessment to select the best DeRUN outputs for subsequent stages, naturally introducing a self-consistency loss that enhances robustness. Extensive experiments show that NUN achieves a leading place on both clean and degraded benchmarks. Code will be released.

[179] EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses

Enrico Pallotta,Sina Mokhtarzadeh Azar,Lars Doorenbos,Serdar Ozsoy,Umar Iqbal,Juergen Gall

Main category: cs.CV

TL;DR: Proposes EgoControl, an egocentric video diffusion model controlled by 3D poses that generates future frames with fine-grained motion control.

Details Motivation: Embodied AI agents that simulate, predict, and plan actions need a way to control egocentric video generation precisely through body motion. Method: Trains a video prediction model that conditions future-frame generation on a target pose sequence, using a novel 3D pose representation (combining global camera dynamics with articulated body movements) integrated through a dedicated control mechanism within the diffusion process. Result: Experiments show that EgoControl generates temporally coherent, visually realistic, and pose-consistent high-quality egocentric videos. Conclusion: EgoControl offers an effective route toward controllable embodied video simulation and understanding. Abstract: Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.

[180] Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

Mukai Yu,Mosam Dabhi,Liuyue Xie,Sebastian Scherer,László A. Jeni

Main category: cs.CV

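TL;DR: Proposes the Unified Spherical Frontend (USF), which converts images from any wide-angle camera into a unit-sphere representation and performs spherical convolution in the spatial domain without spherical harmonic transforms. It is efficient, modular, and rotation-equivariant, and shows strong robustness and zero-shot generalization across several vision tasks.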

Details Motivation: Existing planar-CNN approaches to wide-angle imagery suffer from image-space neighborhoods that do not match physical adjacency and are sensitive to global rotations; frequency-domain spherical CNNs are computationally costly and limited in resolution. Method: Maps images from any calibrated camera onto the unit sphere via ray-direction correspondences and performs spherical resampling, convolution, and pooling directly in the spatial domain; uses distance-only spherical kernels that provide configurable rotation-equivariance and avoid spherical harmonic transforms entirely; the framework is modular, decoupling projection, sampling, interpolation, and resolution control. Result: Classification, detection, and segmentation are evaluated on Spherical MNIST, PANDORA, and Stanford 2D-3D-S; USF runs efficiently at high resolution, loses less than 1% performance under random test-time rotations, generalizes zero-shot across lenses without rotational augmentation, and remains robust under extreme distortion and varying field of view. Conclusion: USF offers a general and efficient approach to spherical vision that addresses the geometric mismatch and rotation sensitivity of conventional methods on wide-angle images and markedly improves robustness and cross-device generalization. Abstract: Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that transforms images from any calibrated camera into a unit-sphere representation via ray-direction correspondences, and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, interpolation, and resolution control are fully decoupled. Its distance-only spherical kernels offer configurable rotation-equivariance (mirroring translation-equivariance in planar CNNs) while avoiding harmonic transforms entirely. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D-3D-S), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations.
USF processes high-resolution spherical imagery efficiently and maintains less than 1% performance drop under random test-time rotations, even without rotational augmentation. It also enables zero-shot generalization from one lens type to unseen wide-FoV lenses with minimal performance degradation.
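
The two geometric ingredients named in the abstract, ray-direction correspondences and distance-only kernels, can be sketched for the simplest case. This is an illustrative sketch, not the USF code: it assumes an ideal pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, and the Gaussian profile in `distance_only_weight` is an assumed kernel shape. Real fisheye or panoramic lenses would need their own calibrated projection model.

```python
import math

def pinhole_pixel_to_ray(u, v, fx, fy, cx, cy):
    """Map a pixel to a unit ray direction on the sphere (ideal pinhole assumed)."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def geodesic_distance(a, b):
    """Great-circle angle (radians) between two unit vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding error

def distance_only_weight(a, b, sigma=0.1):
    """Kernel weight that depends only on geodesic distance (assumed Gaussian profile)."""
    d = geodesic_distance(a, b)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis; used to check rotation invariance."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

center = pinhole_pixel_to_ray(320, 240, 500.0, 500.0, 320.0, 240.0)
corner = pinhole_pixel_to_ray(0, 0, 500.0, 500.0, 320.0, 240.0)
w_before = distance_only_weight(center, corner, sigma=0.5)
w_after = distance_only_weight(rotate_z(center, 1.0), rotate_z(corner, 1.0), sigma=0.5)
```

Since a rotation preserves the angle between two rays, `w_before` and `w_after` agree up to rounding, which is why a kernel built only from geodesic distance behaves the same wherever it is placed on the sphere.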

[181] Early Lung Cancer Diagnosis from Virtual Follow-up LDCT Generation via Correlational Autoencoder and Latent Flow Matching

Yutong Wu,Yifan Wang,Qining Zhang,Chuan Zhou,Lei Ying

Main category: cs.CV

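TL;DR: This paper proposes CorrFlowNet, a generative method inspired by diffusion models that produces a virtual one-year follow-up CT scan to enable early detection of lung cancer.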

Details Motivation: Early diagnosis of lung cancer is difficult, particularly when distinguishing subtle signals of malignant lesions from benign ones. Most existing AI methods work from a single CT scan and cannot capture how a lesion progresses. Method: Uses a correlational autoencoder to encode baseline and follow-up CT images into a latent space, models nodule progression with a flow matching algorithm and a neural ordinary differential equation, and adds an auxiliary classifier to improve diagnostic accuracy. Result: On real clinical data, the method significantly improves lung nodule risk assessment, with diagnostic accuracy comparable to real clinical follow-ups. Conclusion: CorrFlowNet generates high-quality virtual follow-up CT images, which can shorten waiting time and improve the efficiency of early lung cancer diagnosis, showing clear potential for clinical use. Abstract: Lung cancer is one of the most commonly diagnosed cancers, and early diagnosis is critical because the survival rate declines sharply once the disease progresses to advanced stages. However, achieving an early diagnosis remains challenging, particularly in distinguishing subtle early signals of malignancy from those of benign conditions. In clinical practice, a patient with a high risk may need to undergo an initial baseline and several annual follow-up examinations (e.g., CT scans) before receiving a definitive diagnosis, which can result in missing the optimal treatment. Recently, Artificial Intelligence (AI) methods have been increasingly used for early diagnosis of lung cancer, but most existing algorithms focus on radiomic feature extraction from single early-stage CT scans. Inspired by recent advances in diffusion models for image generation, this paper proposes a generative method, named CorrFlowNet, which creates a virtual, one-year follow-up CT scan after the initial baseline scan. This virtual follow-up would allow for an early detection of malignant/benign nodules, reducing the need to wait for clinical follow-ups. During training, our approach employs a correlational autoencoder to encode both early baseline and follow-up CT images into a latent space that captures the dynamics of nodule progression as well as the correlations between them, followed by a flow matching algorithm on the latent space with a neural ordinary differential equation. An auxiliary classifier is used to further enhance the diagnostic accuracy.
Evaluations on a real clinical dataset show our method can significantly improve downstream lung nodule risk assessment compared with existing baseline models. Moreover, its diagnostic accuracy is comparable with real clinical CT follow-ups, highlighting its potential to improve cancer diagnosis.
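
To make "flow matching on the latent space with a neural ODE" concrete, here is a minimal sketch on toy latent vectors. It assumes the common straight-line formulation of flow matching (linear interpolation between the baseline latent `z0` and the follow-up latent `z1`, with target velocity `z1 - z0`); the abstract does not state which probability path CorrFlowNet uses, and the "oracle" velocity function below stands in for a trained network.

```python
def interpolate(z0, z1, t):
    """Point on the straight path between baseline and follow-up latents at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Regression target for the velocity field under the linear path."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(pred_velocity, z0, z1):
    """Mean squared error between a predicted velocity and the linear-path target."""
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target)) / len(target)

def euler_integrate(velocity_fn, z0, steps=10):
    """Solve dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(z, i * dt)
        z = [a + dt * b for a, b in zip(z, v)]
    return z

# Toy latents (hypothetical values standing in for encoded CT scans).
z_baseline = [0.0, 1.0, -2.0]
z_followup = [1.0, 3.0, 0.5]

# An oracle velocity field; a trained network would approximate this from data.
oracle = lambda z, t: target_velocity(z_baseline, z_followup)

z_mid = interpolate(z_baseline, z_followup, 0.5)
loss_at_oracle = flow_matching_loss(oracle(z_mid, 0.5), z_baseline, z_followup)
z_virtual = euler_integrate(oracle, z_baseline, steps=10)
```

At inference, only the baseline latent is available: integrating the learned velocity field forward yields a predicted follow-up latent, which a decoder would then map back to a virtual follow-up scan.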

[182] ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Ahmad Mohammadshirazi,Pinaki Prasad Guha Neogi,Dheeraj Kulshrestha,Rajiv Ramnath

Main category: cs.CV

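TL;DR: ARIAL is a modular framework in which an LLM agent coordinates specialized tools to achieve accurate answer extraction and reliable localization in document visual question answering, balancing performance with interpretability and reaching state-of-the-art results on several benchmarks.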

Details Motivation: Existing document VQA systems trade textual accuracy against reliable spatial localization and struggle to deliver both high accuracy and high interpretability, lacking trustworthy localization in high-stakes applications. Method: Decomposes document VQA into subtasks: OCR text extraction with TrOCR, context retrieval based on semantic search, answer generation with a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment, all orchestrated by an LLM agent. Result: ARIAL achieves state-of-the-art results on four benchmarks (DocVQA, FUNSD, CORD, and SROIE); on DocVQA it reaches 88.7 ANLS and 50.1 mAP, surpassing DLaVA by +2.8 ANLS and +3.9 mAP. Conclusion: By orchestrating specialized tools through an agent, ARIAL combines high performance with high interpretability and offers a viable path toward trustworthy, explainable document AI systems. Abstract: Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region alignment. This modular architecture produces transparent reasoning traces, enabling tool-level auditability and independent component optimization. We evaluate ARIAL on four benchmarks (DocVQA, FUNSD, CORD, and SROIE) using both textual accuracy (ANLS) and spatial precision (mAP at IoU 0.50 to 0.95). ARIAL achieves state-of-the-art results across all datasets: 88.7 ANLS and 50.1 mAP on DocVQA, 90.0 ANLS and 50.3 mAP on FUNSD, 85.5 ANLS and 60.2 mAP on CORD, and 93.1 ANLS on SROIE, surpassing the previous best method (DLaVA) by +2.8 ANLS and +3.9 mAP on DocVQA.
Our work demonstrates how agentic orchestration of specialized tools can simultaneously improve performance and interpretability, providing a pathway toward trustworthy, explainable document AI systems.
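
For readers unfamiliar with the ANLS metric cited above, the following is a minimal sketch of how Average Normalized Levenshtein Similarity is commonly computed. The lowercasing, whitespace stripping, and 0.5 threshold follow the usual DocVQA-style convention and are assumptions here; the abstract does not describe ARIAL's exact evaluation script, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-reference similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of acceptable ground-truth answers.
    threshold:   assumed cut-off; similarities with a normalized
                 distance at or above it count as 0.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    print(anls(["Invoice 42", "total"], [["invoice 42"], ["totals", "sum"]]))
```

The threshold means that an answer which is mostly wrong earns no partial credit, while small OCR slips are penalized only in proportion to their edit distance.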

[183] InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

Haoming Wang,Qiyao Xue,Wei Gao

Main category: cs.CV

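TL;DR: This paper presents InfiniBench, a fully automated, customizable, and user-friendly benchmark generator that can produce a theoretically infinite variety of 3D scene videos for evaluating the spatial reasoning abilities of vision-language models.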

Details Motivation: Existing spatial reasoning benchmarks offer limited customizability and scalability of scene complexity, which makes it hard to isolate and analyze specific VLM failure modes under different spatial conditions. Method: InfiniBench rests on three key techniques: an LLM-based agentic framework that iteratively refines scene constraints from natural-language descriptions; a cluster-based layout optimizer that generates dense, complex 3D scenes; and a task-aware camera trajectory optimization method that produces video inputs with full coverage. Result: Experiments show that InfiniBench outperforms state-of-the-art methods in prompt fidelity and physical plausibility, especially in high-complexity scenes, and it was used successfully to generate benchmarks for spatial reasoning tasks such as measurement, perspective-taking, and spatiotemporal tracking. Conclusion: InfiniBench provides a highly customizable and scalable way to evaluate the spatial reasoning abilities of vision-language models, and it markedly improves the generation quality and applicability of complex 3D scenes. Abstract: Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. 
We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks, including measurement, perspective-taking, and spatiotemporal tracking.

[184] Generating Synthetic Human Blastocyst Images for In-Vitro Fertilization Blastocyst Grading

Pavan Narahari,Suraj Rajendran,Lorena Bori,Jonas E. Malmsten,Qiansheng Zhan,Zev Rosenwaks,Nikica Zaninovic,Iman Hajirasouliha

Main category: cs.CV

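TL;DR: The paper proposes DIA, a diffusion-based generative framework that produces high-quality, controllable day 5 blastocyst images to address data scarcity and class imbalance in IVF, and validates its usefulness on clinical classification tasks.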

Details Motivation: Data scarcity, class imbalance, and privacy constraints challenge existing AI embryo assessment models; high-quality, diverse datasets are needed to improve model performance. Method: The authors develop DIA, a framework based on latent diffusion models that generates high-fidelity blastocyst images conditioned on Gardner morphological categories and z-axis focal depth, and evaluate it comprehensively with FID, a Turing test, and downstream classification tasks. Result: The images generated by DIA are highly realistic, and embryologists could not reliably distinguish them from real images; in data augmentation experiments, synthetic data significantly improved classification accuracy and could replace up to 40% of the real data without loss of accuracy. Conclusion: DIA offers a reliable way to mitigate data scarcity and class imbalance in embryo datasets, and its high-quality, controllable images can help improve the performance, fairness, and standardization of AI-assisted embryo assessment. Abstract: The success of in vitro fertilization (IVF) at many clinics relies on the accurate morphological assessment of day 5 blastocysts, a process that is often subjective and inconsistent. While artificial intelligence can help standardize this evaluation, models require large, diverse, and balanced datasets, which are often unavailable due to data scarcity, natural class imbalance, and privacy constraints. Existing generative embryo models can mitigate these issues but face several limitations, such as poor image quality, small training datasets, non-robust evaluation, and lack of clinically relevant image generation for effective data augmentation. Here, we present the Diffusion Based Imaging Model for Artificial Blastocysts (DIA) framework, a set of latent diffusion models trained to generate high-fidelity, novel day 5 blastocyst images. Our models provide granular control by conditioning on Gardner-based morphological categories and z-axis focal depth. We rigorously evaluated the models using FID, a memorization metric, an embryologist Turing test, and three downstream classification tasks. Our results show that DIA models generate realistic images that embryologists could not reliably distinguish from real images. Most importantly, we demonstrated clear clinical value. Augmenting an imbalanced dataset with synthetic images significantly improved classification accuracy (p < 0.05). Also, adding synthetic images to an already large, balanced dataset yielded statistically significant performance gains, and synthetic data could replace up to 40% of real data in some cases without a statistically significant loss in accuracy. 
DIA provides a robust solution for mitigating data scarcity and class imbalance in embryo datasets. By generating novel, high-fidelity, and controllable synthetic images, our models can improve the performance, fairness, and standardization of AI embryo assessment tools.

[185] Large-Scale Pre-training Enables Multimodal AI Differentiation of Radiation Necrosis from Brain Metastasis Progression on Routine MRI

Ahmed Gomaa,Annette Schwarz,Ludwig Singer,Arnd Dörfler,Matthias Stefan May,Pluvio Stephan,Ishita Sheth,Juliane Szkitsak,Katharina Breininger,Yixing Huang,Benjamin Frey,Oliver Schnell,Daniel Delev,Roland Coras,Daniel Höfler,Philipp Schubert,Jenny Stritzelberger,Sabine Semrau,Andreas Maier,Dieter H Heiland,Udo S. Gaipl,Andrea Wittig,Rainer Fietkau,Christoph Bert,Stefanie Corradini,Florian Putz

Main category: cs.CV

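TL;DR: This study proposes a two-phase deep learning approach based on self-supervised learning: a Vision Transformer is pre-trained on large-scale unlabeled brain metastasis MRI data and then fine-tuned on a small labeled set to differentiate radiation necrosis from tumor progression, outperforming conventional supervised learning and radiomics.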

Details Motivation: Biopsy-confirmed data are scarce and biopsy is invasive, so differentiating radiation necrosis from tumor progression is challenging, and conventional supervised deep learning is limited by the shortage of labeled data. Method: A Vision Transformer is pre-trained with self-supervised learning on 10,167 unlabeled T1CE MRI sub-volumes, then fine-tuned on the MOLAB dataset with a two-channel input (T1CE MRI and segmentation masks) and validated on an external cohort. Result: The model reaches an AUC of 0.916 on the same-center test set and 0.764 on the external cohort, significantly outperforming the fully supervised ViT and radiomics; multimodal integration raises performance further to AUC 0.947/0.821, and attention maps show that the model focuses on clinically relevant regions. Conclusion: Self-supervised pre-training on large-scale unlabeled data substantially improves model performance; the two-phase multimodal approach achieves high-accuracy classification from routine MRI alone, is clinically interpretable and applicable, and warrants further validation. Abstract: Background: Differentiating radiation necrosis (RN) from tumor progression after stereotactic radiosurgery (SRS) remains a critical challenge in brain metastases. While histopathology represents the gold standard, its invasiveness limits feasibility. Conventional supervised deep learning approaches are constrained by scarce biopsy-confirmed training data. Self-supervised learning (SSL) overcomes this by leveraging the growing availability of large-scale unlabeled brain metastases imaging datasets. Methods: In a two-phase deep learning strategy inspired by the foundation model paradigm, a Vision Transformer (ViT) was pre-trained via SSL on 10,167 unlabeled multi-source T1CE MRI sub-volumes. The pre-trained ViT was then fine-tuned for RN classification using a two-channel input (T1CE MRI and segmentation masks) on the public MOLAB dataset (n=109), with 20% of the dataset held out as a same-center test set. External validation was performed on a second-center test cohort (n=28). Results: The self-supervised model achieved an AUC of 0.916 on the same-center test set and 0.764 on the second-center test set, surpassing the fully supervised ViT (AUC 0.624/0.496; p=0.001/0.008) and radiomics (AUC 0.807/0.691; p=0.005/0.014). Multimodal integration further improved performance (AUC 0.947/0.821; p=0.073/0.001). Attention map visualizations enabled interpretability, showing that the model focused on clinically relevant lesion subregions. Conclusion: Large-scale pre-training on increasingly available unlabeled brain metastases datasets substantially improves AI model performance. 
A two-phase multimodal deep learning strategy achieved high accuracy in differentiating radiation necrosis from tumor progression using only routine T1CE MRI and standard clinical data, providing an interpretable, clinically accessible solution that warrants further validation.

[186] Using MLIR Transform to Design Sliced Convolution Algorithm

Victor Ferrari,Marcio Pereira,Lucas Alvarenga,Gustavo Leite,Guido Araujo

Main category: cs.CV

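TL;DR: This paper proposes SConvTransform, a Transform dialect extension for optimizing 2D convolutions in MLIR. It lowers Linalg convolutions into tiled and packed generic operations through a declarative transformation pipeline and uses a Convolution Slicing Analysis to drive efficient code generation.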

Details Motivation: Optimizing 2D convolutions more efficiently in MLIR calls for tiling and data layout strategies that are reusable, analyzable, and adaptable to different architectures. Method: The paper proposes the SConvOp operation, which uses a Convolution Slicing Analysis to determine tile sizes and data layouts from the input and filter shapes and the target architecture parameters; packing and tiling operations are derived from parametric affine equations, and edge cases are handled by splitting irregular regions and adjusting affine maps. Result: The generated code reaches up to 60% of peak performance on ARM SME and 67% on Intel AVX512, validating the combination of static shape analysis with structured tiling and packing strategies. Conclusion: SConvTransform achieves efficient convolution optimization through a modular design that supports future extensions, demonstrating the potential of MLIR's extensible compilation infrastructure for optimizing convolution workloads. Abstract: This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR's extensible compilation infrastructure.

[187] Parallel qMRI Reconstruction from 4x Accelerated Acquisitions

Mingi Kang

Main category: cs.CV

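TL;DR: The paper proposes an end-to-end deep learning framework that jointly estimates coil sensitivity maps and reconstructs MRI images from undersampled k-space data, producing smooth reconstructions at 4x acceleration.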

Details Motivation: Conventional parallel MRI reconstruction methods such as SENSE depend on pre-computed coil sensitivity maps and are prone to artifacts at high acceleration factors, which limits reconstruction quality and clinical efficiency. Method: A two-module architecture is used: one module estimates coil sensitivity maps (CSM) and the other performs U-Net-based MRI image reconstruction, with both tasks optimized jointly and directly from undersampled k-space data. Result: Validated on multi-echo brain MRI data from 10 subjects, the method produces smoother images than conventional SENSE; PSNR/SSIM are somewhat lower, but visual quality is comparable. Spatial misalignment between different acceleration factors is identified as a problem. Conclusion: The proposed method can jointly estimate coil sensitivity maps and reconstruct high-quality MRI images while reducing reliance on prior information, showing potential for highly accelerated MRI; registration and quantitative evaluation remain to be addressed in future work. Abstract: Magnetic Resonance Imaging (MRI) acquisitions require extensive scan times, limiting patient throughput and increasing susceptibility to motion artifacts. Accelerated parallel MRI techniques reduce acquisition time by undersampling k-space data, but require robust reconstruction methods to recover high-quality images. Traditional approaches like SENSE require both undersampled k-space data and pre-computed coil sensitivity maps. We propose an end-to-end deep learning framework that jointly estimates coil sensitivity maps and reconstructs images from only undersampled k-space measurements at 4x acceleration. Our two-module architecture consists of a Coil Sensitivity Map (CSM) estimation module and a U-Net-based MRI reconstruction module. We evaluate our method on multi-coil brain MRI data from 10 subjects with 8 echoes each, using 2x SENSE reconstructions as ground truth. Our approach produces visually smoother reconstructions compared to conventional SENSE output, achieving comparable visual quality despite lower PSNR/SSIM metrics. We identify key challenges including spatial misalignment between different acceleration factors and propose future directions for improved reconstruction quality.

[188] EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

Yogesh Kulkarni,Pooyan Fazli

Main category: cs.CV

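TL;DR: This paper proposes EgoVITA, a reinforcement learning framework for multimodal large language models that combines first-person planning with third-person verification to improve reasoning about intentions and actions in egocentric video.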

Details Motivation: Existing MLLMs struggle to reason about intentions and actions in egocentric video because the viewpoint changes continuously, the field of view is limited, and motion is self-referenced; conventional methods do not jointly model visual and logical consistency. Method: The EgoVITA framework, built on Group Relative Policy Optimization (GRPO), alternates between two stages: (1) an egocentric planning phase that predicts a step-by-step plan of future actions, and (2) an exocentric verification phase that checks the visual and logical consistency of that plan from an outside viewpoint. Result: EgoVITA improves over the Qwen2.5-VL-7B baseline by +7.7 on EgoBlind and +4.4 on EgoOrient while maintaining strong generalization on exocentric video tasks. Conclusion: Through structured planning and cross-view verification, EgoVITA markedly improves the causally predictive and visually grounded reasoning of multimodal large language models on egocentric video understanding. Abstract: Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, limited field of view, and self-referenced motion. We introduce EgoVITA, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an egocentric planning phase, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an exocentric verification phase, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by +7.7 on EgoBlind and +4.4 on EgoOrient, while maintaining strong generalization on exocentric video tasks.
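
As background for the GRPO training signal mentioned above, the sketch below shows the group-relative advantage that gives GRPO its name: several responses are sampled for one prompt, and each reward is normalized by the group's mean and standard deviation. The reward values are hypothetical, and the use of the population standard deviation with a small epsilon is an assumption; the abstract does not describe EgoVITA's reward design or normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward within its own group.

    rewards: scalar rewards for the responses sampled from one prompt.
    eps:     assumed small constant that avoids division by zero when
             every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled plans for a single video question.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is the group mean, no separate value network is needed: responses that score above their siblings get a positive advantage and the rest get a negative one.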

[189] UniFlow: Towards Zero-Shot LiDAR Scene Flow for Autonomous Vehicles via Cross-Domain Generalization

Siyi Li,Qingwen Zhang,Ishan Khatri,Kyle Vedder,Deva Ramanan,Neehar Peri

Main category: cs.CV

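TL;DR: This paper proposes UniFlow, a model that generalizes across data from diverse LiDAR sensors and improves point cloud scene flow estimation; cross-dataset training substantially improves results on both seen and unseen datasets.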

Details Motivation: Existing LiDAR scene flow methods are usually trained and evaluated on data from a single sensor and lack cross-sensor generalization; multi-dataset training has worked poorly for other tasks, and this paper investigates its potential for motion estimation. Method: The paper proposes UniFlow, a family of feedforward models that unifies and trains on multiple large-scale, heterogeneous LiDAR scene flow datasets, using cross-dataset motion priors to improve generalization and accuracy. Result: UniFlow improves over prior methods by 5.1% on Waymo and 35.2% on nuScenes, and outperforms dataset-specific models by 30.1% on the unseen TruckScenes dataset. Conclusion: Low-level motion estimation is relatively insensitive to sensor configuration, and cross-dataset training can establish general motion priors; UniFlow provides a strong baseline and a practical solution for LiDAR scene flow. Abstract: LiDAR scene flow is the task of estimating per-point 3D motion between consecutive point clouds. Recent methods achieve centimeter-level accuracy on popular autonomous vehicle (AV) datasets, but are typically only trained and evaluated on a single sensor. In this paper, we aim to learn general motion priors that transfer to diverse and unseen LiDAR sensors. However, prior work in LiDAR semantic segmentation and 3D object detection demonstrates that naively training on multiple datasets yields worse performance than single dataset models. Interestingly, we find that this conventional wisdom does not hold for motion estimation, and that state-of-the-art scene flow methods greatly benefit from cross-dataset training. We posit that low-level tasks such as motion estimation may be less sensitive to sensor configuration; indeed, our analysis shows that models trained on fast-moving objects (e.g., from highway datasets) perform well on fast-moving objects, even across different datasets. Informed by our analysis, we propose UniFlow, a family of feedforward models that unifies and trains on multiple large-scale LiDAR scene flow datasets with diverse sensor placements and point cloud densities. Our frustratingly simple solution establishes a new state-of-the-art on Waymo and nuScenes, improving over prior work by 5.1% and 35.2% respectively. Moreover, UniFlow achieves state-of-the-art accuracy on unseen datasets like TruckScenes, outperforming prior TruckScenes-specific models by 30.1%.

[190] Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization

Sina Mokhtarzadeh Azar,Emad Bahrami,Enrico Pallotta,Gianpiero Francesca,Radu Timofte,Juergen Gall

Main category: cs.CV

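TL;DR: The paper proposes SAVi-DNO, which adapts video prediction to continuous video streams by optimizing the diffusion noise rather than fine-tuning model parameters, and reports improved results on several datasets.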

Details Motivation: For video prediction on continuous video streams, existing methods struggle to adapt efficiently to the new samples that keep arriving, and fine-tuning a large diffusion model is costly. Method: SAVi-DNO keeps the parameters of the pre-trained diffusion model frozen and continuously adapts the model by optimizing the diffusion noise during inference. Result: On long videos from Ego4D and OpenDV-YouTube, as well as videos from UCF-101 and SkyTimelapse, SAVi-DNO improves FVD, SSIM, and PSNR. Conclusion: SAVi-DNO enables effective adaptive prediction on continuous video streams while avoiding expensive parameter fine-tuning, and it shows good potential for application. Abstract: In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models continuously observe new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.

[191] MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

Tao Shen,Xin Wan,Taicai Chen,Rui Zhang,Junwen Pan,Dawei Lu,Fanding Lei,Zhilin Lu,Yunfei Yang,Chen Cheng,Qi She,Chang Liu,Zhenbang Sun

Main category: cs.CV

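TL;DR: This paper presents MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework whose serial design couples autoregressive semantic planning with diffusion-based generation, delivering strong results on text-to-image generation, instruction-based editing, and multimodal understanding.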

Details Motivation: Existing unified multimodal models leave a gap between semantic reasoning and high-quality visual generation and struggle to combine discrete semantic modeling with continuous image synthesis, so a unified framework that handles both understanding and generation efficiently is needed. Method: Mammoth2 uses a serial architecture: an autoregressive path performs global semantic modeling, and a single-stream Diffusion Transformer decoder performs high-fidelity image generation. Multi-layer feature aggregation, unified condition encoding, and in-context conditioning align the features of the AR and diffusion modules. The model is trained end-to-end with joint next-token prediction and flow matching objectives, followed by supervised fine-tuning and reinforcement learning to improve generation and editing. Result: Trained on roughly 60 million supervised generation samples and without relying on pre-trained generators, Mammoth2 scores 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, and it is on par with understanding-only models (e.g., Qwen3-VL-8B) on multimodal understanding tasks. Conclusion: A carefully designed AR-Diffusion coupling can deliver high-quality generation, editing, and strong multimodal understanding within a single parameter- and data-efficient model. Abstract: Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. 
These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.

[192] SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors

Ruijie Fan,Junyan Ye,Huan Chen,Zilong Huang,Xiaolei Wang,Weijia Li

Main category: cs.CV

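TL;DR: The paper proposes SatSAM2, a zero-shot satellite video tracker built on SAM2 that adds a motion-constraint module and a state machine to improve tracking under occlusion and in complex environments, and releases MVOT, a new large-scale synthetic benchmark for evaluation.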

Details Motivation: Existing satellite video tracking methods generalize poorly, depend on scenario-specific training, and tend to lose the target under occlusion, so a more robust solution that needs no retraining is required. Method: SatSAM2 is a zero-shot tracker built on SAM2. It introduces a Kalman Filter-based Constrained Motion Module (KFCMM) that uses temporal motion information to suppress drift, and a Motion-Constrained State Machine (MCSM) that regulates tracking states according to motion dynamics and reliability; the authors also build the synthetic benchmark MVOT for large-scale evaluation. Result: Experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants; on the OOTB dataset, AUC improves by 5.84%. Conclusion: SatSAM2 improves zero-shot adaptability and robustness in satellite video tracking, performs especially well under occlusion and complex conditions, and advances the use of foundation models in remote sensing. Abstract: Existing satellite video tracking methods often struggle with generalization, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss in the presence of occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2, designed to adapt foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states based on motion dynamics and reliability. To support large-scale evaluation, we propose MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.
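
To illustrate the kind of Kalman prior the abstract refers to, here is a minimal constant-velocity Kalman filter for one image axis that keeps predicting through frames where the target is occluded. It is a generic textbook sketch, not the paper's KFCMM: the class name, noise variances, and the `None`-for-occlusion convention are all assumptions made for illustration.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state [position, velocity] for one axis."""

    def __init__(self, process_var=1e-3, measurement_var=1.0, initial_var=100.0):
        self.pos, self.vel = 0.0, 0.0
        # Covariance P = [[p00, p01], [p10, p11]], stored entry by entry.
        self.p00, self.p01 = initial_var, 0.0
        self.p10, self.p11 = 0.0, initial_var
        self.q, self.r = process_var, measurement_var  # assumed noise levels

    def predict(self, dt=1.0):
        """Propagate state and covariance one frame: P <- F P F^T + q I."""
        self.pos += self.vel * dt
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measured_pos):
        """Correct the state with a position-only measurement (H = [1, 0])."""
        innovation = measured_pos - self.pos
        s = self.p00 + self.r                    # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s      # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        p00, p01 = self.p00, self.p01
        self.p00, self.p01 = (1 - k0) * p00, (1 - k0) * p01
        self.p10, self.p11 = self.p10 - k1 * p00, self.p11 - k1 * p01


def track_axis(measurements):
    """Run the filter; None marks a frame with no reliable detection."""
    kf = ConstantVelocityKalman1D()
    estimates = []
    for z in measurements:
        kf.predict()
        if z is not None:
            kf.update(z)
        estimates.append(kf.pos)
    return estimates


if __name__ == "__main__":
    # Target moves 2 px per frame, then is occluded for three frames.
    observed = [2.0 * k for k in range(10)] + [None, None, None]
    print(track_axis(observed)[-1])
```

In a tracker, one such filter per box coordinate could supply a motion-based prediction against which candidate masks are checked when the appearance cue is unreliable.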

[193] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Tianyang Han,Junhao Su,Junjie Hu,Peizhen Yang,Hengyu Shi,Junfeng Luo,Jialin Gao

Main category: cs.CV

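TL;DR: This paper introduces PicWorld, the first comprehensive benchmark for evaluating how well text-to-image (T2I) models handle implicit world knowledge and physical causal reasoning, together with PW-Agent, an evidence-grounded multi-agent evaluator; the evaluation shows that current T2I models are broadly deficient in these areas.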

Details Motivation: Existing evaluation methods for T2I models do not adequately test understanding of implicit world knowledge, multi-physics interactions, and auditable evidence, and they lack fine-grained, systematic assessment. Method: The authors build the PicWorld benchmark of 1,100 prompts across three core categories and propose PW-Agent, an evidence-grounded multi-agent evaluation system that decomposes prompts into verifiable visual evidence to assess the physical realism and logical consistency of images hierarchically. Result: An evaluation of 17 mainstream T2I models shows that all of them exhibit fundamental deficiencies, to varying degrees, in implicit world knowledge and physical causal reasoning. Conclusion: Current T2I models remain significantly limited in commonsense and physical reasoning; future work should build architectures that are more capable of reasoning and that integrate knowledge. Abstract: Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator to hierarchically assess images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

[194] Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation

Richard J. Young

Main category: cs.CV

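TL;DR: This paper presents the first systematic evaluation of vision token masking as a privacy-preserving mechanism for medical document OCR. Current vision masking strategies work for long-form, spatially distributed protected health information (such as names and dates of birth) but cannot prevent leakage of short structured identifiers (such as medical record numbers and social security numbers). The study traces the problem to the language model's contextual inference rather than to insufficient visual masking, and suggests that a hybrid architecture with NLP post-processing can substantially improve privacy protection.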

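Details Motivation: The use of large vision-language models (VLMs) for medical OCR raises concerns about leakage of protected health information (PHI), particularly about how to protect sensitive information during inference. Existing methods have not been evaluated systematically, and effective privacy-preserving mechanisms need to be explored. Method: Seven vision token masking strategies (V3-V9) targeting different architectural layers are tested on DeepSeek-OCR, using 100 synthetic medical billing documents with exact annotations to measure the reduction of each PHI category; an ablation study varies the mask expansion radius, and a hybrid architecture with NLP post-processing is simulated. Result: All masking strategies reach a 42.9% PHI reduction. They fully suppress long-form, spatially distributed identifiers (names, addresses, and so on) but have no effect (0%) on short structured identifiers (such as medical record numbers and social security numbers). Enlarging the mask does not break this ceiling. The simulated hybrid architecture reaches 88.6% total PHI reduction (assuming 80% NLP accuracy). Conclusion: Masking at the vision level alone has a fundamental limit and cannot stop the leakage of structured PHI that results from the language model's contextual inference; future work should move toward decoder-level fine-tuning and hybrid, defense-in-depth architectures to achieve HIPAA-compliant medical document processing. Abstract: Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form spatially-distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to prevent short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying mask expansion radius (r=1,2,3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference, not insufficient visual masking, drives structured identifier leakage. 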
A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
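
The 88.6% figure is consistent with a simple two-stage calculation: vision masking removes 42.9% of PHI, and an NLP post-processor with 80% accuracy removes that fraction of what remains. The sketch below reproduces the arithmetic; that the paper's simulation uses exactly this formula is an assumption inferred from the reported numbers, and the function name is illustrative.

```python
def hybrid_phi_reduction(vision_reduction: float, nlp_accuracy: float) -> float:
    """Total PHI reduction when NLP post-processing acts on what masking misses.

    vision_reduction: fraction of PHI suppressed by vision token masking.
    nlp_accuracy:     assumed fraction of the remaining PHI that the NLP
                      stage catches.
    """
    remaining = 1.0 - vision_reduction
    return vision_reduction + nlp_accuracy * remaining


if __name__ == "__main__":
    total = hybrid_phi_reduction(0.429, 0.80)
    print(f"{100 * total:.1f}%")  # 0.429 + 0.80 * 0.571 = 0.8858
```

Under this reading, almost all of the gain over the 42.9% ceiling comes from the language-level stage, which matches the abstract's conclusion that structured identifiers need language-level redaction.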

[195] Point-to-Point: Sparse Motion Guidance for Controllable Video Editing

Yeji Song,Jaehyun Lee,Mijin Koo,JunHoo Lee,Nojun Kwak

Main category: cs.CV

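TL;DR: This paper proposes anchor tokens, a new motion representation that uses the prior of a video diffusion model to capture key motion patterns and enable more precise video editing. The method preserves motion fidelity while improving editing flexibility and semantic alignment.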

Details Motivation: Existing video editing methods trade edit fidelity against motion fidelity because the motion representations they rely on are either overfitted to the layout or only implicitly defined, which makes it hard to preserve the subject's motion accurately. Method: The paper proposes anchor tokens as a new motion representation that uses a video diffusion model to extract compact, informative point trajectories, and transfers and relocates motion patterns flexibly through the Point-to-Point method. Result: Experiments show that anchor tokens yield more controllable and semantically consistent video edits across diverse scenarios and outperform existing methods in edit quality and motion fidelity. Conclusion: Anchor tokens capture the core motion dynamics of a video, generalize across scenarios, and markedly improve the balance between motion preservation and edit precision. Abstract: Accurately preserving motion while editing a subject remains a core challenge in video editing tasks. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, which capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.

[196] Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Yara Bahram,Melodie Desbos,Mohammadhadi Shateri,Eric Granger

Main category: cs.CV

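TL;DR: This paper proposes Uni-DAD, a single-stage training framework for diffusion models that unifies distillation and domain adaptation. Using dual-domain distribution matching and a multi-head GAN loss, it achieves fast, high-quality, and diverse image generation in few-shot generation and personalization tasks.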

Details Motivation: Diffusion models are costly to sample from when applied to new domains; distilled models are fast but confined to the teacher's domain; and two-stage training (adapt then distill, or the reverse) adds design complexity and degrades quality or diversity. A unified method that simplifies the pipeline while keeping performance high is therefore needed. Method: Uni-DAD combines two training signals: a dual-domain distribution-matching distillation objective that guides the student to learn the distributions of both a source-domain and a target-domain teacher, and a multi-head generative adversarial network (GAN) loss that strengthens target-domain realism across multi-scale features. Distillation and adaptation are completed in a single stage. Result: On few-shot image generation (FSIG) and subject-driven personalization (SDP), Uni-DAD outperforms state-of-the-art methods even with fewer than 4 sampling steps, and it surpasses two-stage training in both quality and diversity. Conclusion: Uni-DAD achieves fast, high-quality generation in new domains; single-stage training merges distillation and adaptation effectively, resolves the complexity and performance bottlenecks of earlier approaches, and is practical and extensible. Abstract: Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher's domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We perform evaluations on a variety of datasets for few-shot image generation (FSIG) and subject-driven personalization (SDP). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with fewer than 4 sampling steps, and outperforms two-stage training pipelines in both quality and diversity.

[197] RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System

Runwei Guan,Rongsheng Hu,Shangshu Chen,Ningyuan Xiao,Xue Xia,Jiayang Liu,Beibei Chen,Ziren Tang,Ningwei Ouyang,Shaofeng Liang,Yuxuan Fan,Wanjie Sun,Yutao Yue

Main category: cs.CV

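TL;DR: This paper introduces RoadSceneVQA, a large-scale visual question answering dataset designed for roadside scenes, and RoadMind, a baseline built on a multimodal large language model that combines the CAF fusion module with the AD-CoT reasoning mechanism to improve perception and reasoning in traffic scenes.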

Details Motivation: Existing roadside perception systems are limited to instance-level perception and cannot readily support natural-language interaction or in-context reasoning about traffic behavior, so a dataset and model framework that support complex semantic understanding and commonsense reasoning are needed. Method: The authors build the RoadSceneVQA dataset with 34,736 QA pairs, propose the CogniAnchor Fusion (CAF) module and the Assisted Decoupled Chain-of-Thought (AD-CoT) reasoning strategy, and construct the RoadMind model on an MLLM. Result: Experiments on the RoadSceneVQA and CODA-LM benchmarks show that the method markedly improves reasoning accuracy and computational efficiency, achieving state-of-the-art performance on structural traffic perception and reasoning tasks. Conclusion: RoadMind, combining CAF and AD-CoT, strengthens the semantic understanding and reasoning of multimodal large models in roadside scenes and moves intelligent transportation systems toward higher-level cognitive functions. Abstract: Current roadside perception systems mainly focus on instance-level perception and fall short in enabling interaction via natural language and reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a vision-language fusion module inspired by human-like scene anchoring mechanisms. Moreover, we propose the Assisted Decoupled Chain-of-Thought (AD-CoT) to enhance reasoning via CoT prompting and multi-task learning. Based on the above, we propose the baseline model RoadMind. Experiments on RoadSceneVQA and CODA-LM benchmark show that the pipeline consistently improves both reasoning accuracy and computational efficiency, allowing the MLLM to achieve state-of-the-art performance in structural traffic perception and reasoning tasks.

[198] SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

Jungho Lee,Minhyeok Lee,Sunghun Yang,Minseok Kang,Sangyoun Lee

Main category: cs.CV

TL;DR: This paper proposes SwiftVGGT, a training-free method for large-scale 3D reconstruction that substantially cuts inference time while preserving reconstruction quality. It introduces loop closure that needs no external VPR model and a point-sampling alignment based on a Sim(3) SVD step, enabling efficient and accurate reconstruction of kilometer-scale scenes.

Details Motivation: 3D reconstruction of large-scale scenes faces a trade-off between accuracy and computational efficiency, and existing methods struggle to deliver both speed and quality; this work aims for a method that keeps reconstruction quality high while greatly speeding up inference. Method: SwiftVGGT is a training-free framework that maintains global consistency over large scenes through internally implemented loop closure detection, and uses a simple point sampling method to align neighboring chunks with a single Sim(3) SVD step, avoiding the conventional IRLS optimization. Result: Across multiple datasets, SwiftVGGT reaches state-of-the-art reconstruction quality while needing only 33% of the inference time of existing VGGT-based methods. Conclusion: SwiftVGGT greatly reduces inference time without sacrificing reconstruction quality, offering an efficient and practical solution for large-scale 3D reconstruction. Abstract: 3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference times. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.
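To make the "single Sim(3)-based SVD step" concrete, here is a minimal sketch of the classical closed-form similarity alignment (the Umeyama solution) between two sets of corresponding 3D points. It is not the authors' code: the function names `umeyama_sim3` and `align_chunk` are hypothetical, and the paper's point sampling strategy is not modeled here; the sketch simply assumes that corresponding points from the overlap of two neighboring chunks are already available.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form (s, R, t) minimising sum ||dst_i - (s * R @ src_i + t)||^2.

    src, dst: (N, 3) arrays of corresponding points (assumed already sampled).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # 3x3 cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Reflection guard: make sure the recovered rotation has det(R) = +1.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Scale from the singular values and the variance of the source points.
    var_src = (src_c ** 2).sum() / n
    s = float((S * np.diag(D)).sum() / var_src)
    t = mu_dst - s * (R @ mu_src)
    return s, R, t


def align_chunk(points, s, R, t):
    """Apply the similarity transform to every point (rows) of a chunk."""
    return s * points @ R.T + t
```

Because the solution is closed-form, one SVD of a 3x3 matrix replaces an iterative robust solver; the price is that outliers are not down-weighted, which is presumably why the choice of sampled points matters.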

[199] DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition

Raja Kumar,Arka Sadhu,Ram Nevatia

Main category: cs.CV

TL;DR: Proposes DiVE-k, a framework that turns a large vision-language model's top-k predictions into multiple-choice questions and trains the model with reinforcement learning to perform fine-grained visual reasoning, significantly improving performance and generalization in fine-grained image recognition.

Details Motivation: Existing RL fine-tuning methods rely on exact-match reward signals, which encourage memorization and generalize poorly to unseen classes, so they fail to elicit the differential reasoning needed to tell visually similar categories apart in fine-grained recognition. Method: DiVE-k builds a multiple-choice question for each training image from the model's own top-k predictions and uses RL to train the model to pick the correct answer, forcing fine-grained differential visual reasoning. Result: On five standard fine-grained datasets, DiVE-k significantly outperforms existing methods; in the base-to-novel generalization setting it improves the Harmonic Mean by 10.04% and 6.16% over QWEN2.5-VL-7B and ViRFT, respectively, with consistent gains in mixed-domain and few-shot scenarios. Conclusion: By using the model's own top-k generations as a training signal, DiVE-k provides a verifiable reward that mitigates memorization and strengthens differential reasoning and generalization in fine-grained recognition. Abstract: Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
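The multiple-choice construction and the verifiable reward described above can be sketched in a few lines. This is an illustrative sketch, not the authors' pipeline: the helper names (`build_mcq`, `reward`, `harmonic_mean`) are hypothetical, and injecting the ground-truth label when it is missing from the top-k list is an assumption made here so that every question has a correct option. The last helper shows the harmonic mean of base and novel accuracy, the usual definition of the metric in base-to-novel evaluation.

```python
import random
import string


def build_mcq(top_k_predictions, ground_truth, rng):
    """Turn a model's top-k guesses into a shuffled multiple-choice question."""
    # De-duplicate while keeping the model's ranking order.
    options = list(dict.fromkeys(top_k_predictions))
    if ground_truth not in options:
        # Assumption: replace the lowest-ranked guess with the true label.
        options = options[:-1] + [ground_truth]
    rng.shuffle(options)
    letters = string.ascii_uppercase[: len(options)]
    answer = letters[options.index(ground_truth)]
    return dict(zip(letters, options)), answer


def reward(model_choice, answer):
    """Verifiable reward: 1.0 if the chosen option letter is correct, else 0.0."""
    return 1.0 if model_choice.strip().upper() == answer else 0.0


def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)
```

Since the distractors come from the model's own plausible guesses, a correct answer requires separating near-identical categories rather than recalling a memorized label string, and the reward can be checked by simple string comparison.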

[200] ScriptViT: Vision Transformer-Based Personalized Handwriting Generation

Sajjan Acharya,Rajendra Baskota

Main category: cs.CV

TL;DR: Proposes a unified Vision Transformer-based framework for generating handwritten text that better matches a specific writer's style, fusing style features through cross-attention and using Salient Stroke Attention Analysis (SSAA) to improve interpretability.

Details Motivation: Existing methods struggle to capture a writer's global style, especially stylistic patterns with long-range dependencies, so generated handwriting falls short in preserving personal traits. Method: A Vision Transformer style encoder learns global style features from multiple reference images, and cross-attention combines this style information with the target text; SSAA is used to analyze the stroke-level regions the model attends to during style transfer. Result: The generated handwriting outperforms existing methods in style consistency, structural coherence, and text accuracy, and SSAA provides a visual explanation of the style transfer process. Conclusion: The framework effectively captures a writer's global style and enables high-quality, interpretable styled handwriting generation. Abstract: Styled handwriting generation aims to synthesize handwritten text that looks both realistic and aligned with a specific writer's style. While recent approaches involving GAN, transformer and diffusion-based models have made progress, they often struggle to capture the full spectrum of writer-specific attributes, particularly global stylistic patterns that span long-range spatial dependencies. As a result, capturing subtle writer-specific traits such as consistent slant, curvature or stroke pressure, while keeping the generated text accurate, is still an open problem. In this work, we present a unified framework designed to address these limitations. We introduce a Vision Transformer-based style encoder that learns global stylistic patterns from multiple reference images, allowing the model to better represent long-range structural characteristics of handwriting. We then integrate these style cues with the target text using a cross-attention mechanism, enabling the system to produce handwritten images that more faithfully reflect the intended style. To make the process more interpretable, we utilize Salient Stroke Attention Analysis (SSAA), which reveals the stroke-level features the model focuses on during style transfer. Together, these components lead to handwriting synthesis that is not only more stylistically coherent, but also easier to understand and analyze.

[201] Stro-VIGRU: Defining the Vision Recurrent-Based Baseline Model for Brain Stroke Classification

Subhajeet Das,Pritam Paul,Rohit Bahadur,Sohan Das

Main category: cs.CV

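TL;DR: Proposes a transfer learning framework based on a pre-trained Vision Transformer and a Bi-GRU for early identification of brain stroke; it freezes some ViT encoder blocks, handles class imbalance with data augmentation, and reaches 94.06% classification accuracy on the Stroke Dataset.

Details Motivation: Stroke is a leading cause of death and disability worldwide, and early recognition is critical to successful treatment; CT scanning is widely used, but manual analysis is slow and error-prone, so an efficient automated diagnostic method is needed. Method: A pre-trained Vision Transformer is adapted by transfer learning, with some encoder blocks frozen and the rest fine-tuned to learn stroke-specific features; the extracted features are fed to a single-layer Bi-GRU for classification, and class imbalance is addressed with data augmentation. Result: The model achieves 94.06% classification accuracy on the Stroke Dataset. Conclusion: The method improves the accuracy of automatic stroke identification and shows potential as a clinical decision-support tool. Abstract: Stroke is a major cause of death and disability worldwide, and early recognition is one of the key elements of successful treatment. Strokes are commonly diagnosed using CT scanning, which is fast and readily available; however, manual analysis may take time and may result in mistakes. In this work, a pre-trained Vision Transformer-based transfer learning framework is proposed for the early identification of brain stroke. A few of the encoder blocks of the ViT model are frozen, and the rest are fine-tuned in order to learn brain stroke-specific features. The extracted features are given as input to a single-layer Bi-GRU to perform classification. Class imbalance is handled by data augmentation. The model achieves 94.06% accuracy in classifying brain stroke on the Stroke Dataset.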

[202] Optimal Pose Guidance for Stereo Calibration in 3D Deformation Measurement

Dongcai Tan,Shunkun Liang,Bin Li,Banglei Guan,Ang Su,Yuan Lin,Dapeng Zhang,Minggang Wan,Zibin Liu,Chenglong Wang,Jiajian Zhu,Zhang Li,Yang Shang,Qifeng Yu

Main category: cs.CV

TL;DR: Proposes a pose optimization method for high-accuracy stereo calibration in 3D deformation measurement; it jointly optimizes relative and absolute extrinsic parameters and minimizes the trace of the covariance matrix to recommend the next optimal calibration pose automatically, with a user-friendly graphical interface to improve calibration efficiency and accuracy.

Details Motivation: Existing stereo calibration methods lack intuitive guidance toward optimal poses, which makes calibration inefficient and insufficiently accurate for high-precision 3D deformation measurement. Method: The proposed pose optimization method jointly optimizes relative and absolute extrinsic parameters, uses minimization of the covariance matrix trace as the loss function to solve for the next optimal pose, and is integrated with a user-friendly graphical interface that guides image capture. Result: Compared with random poses, the method achieves higher accuracy and robustness with fewer calibration images; thermal deformation results agree closely with finite element simulation, confirming its effectiveness. Conclusion: The method markedly improves the automation, efficiency, and accuracy of stereo calibration and has potential for broad use in 3D deformation measurement. Abstract: Stereo optical measurement techniques, such as digital image correlation (DIC), are widely used in 3D deformation measurement as non-contact, full-field measurement methods, in which stereo calibration is a crucial step. However, current stereo calibration methods lack intuitive optimal pose guidance, leading to inefficiency and suboptimal accuracy in deformation measurements. The aim of this study is to develop an interactive calibration framework that automatically generates the next optimal pose, enabling high-accuracy stereo calibration for 3D deformation measurement. We propose a pose optimization method that introduces joint optimization of relative and absolute extrinsic parameters, with the minimization of the covariance matrix trace adopted as the loss function to solve for the next optimal pose. Integrated with this method is a user-friendly graphical interface, which guides even non-expert users to capture qualified calibration images. Our proposed method demonstrates superior efficiency (requiring fewer images) and accuracy (demonstrating lower measurement errors) compared to random poses, while maintaining robustness across varying FOVs. In the thermal deformation measurement tests on an S-shaped specimen, the results exhibit high agreement with finite element analysis (FEA) simulations in both deformation magnitude and evolutionary trends. We present a pose guidance method for high-precision stereo calibration in 3D deformation measurement. The simulation experiments, real-world experiments, and thermal deformation measurement applications all demonstrate the significant application potential of our proposed method in the field of 3D deformation measurement.
Keywords: Stereo calibration, Optimal pose guidance, 3D deformation measurement, Digital image correlation
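The idea of choosing the next pose by minimizing the covariance trace can be illustrated with a generic next-best-measurement sketch. This is not the authors' formulation: it assumes a Gauss-Newton approximation in which the parameter covariance is the inverse of the accumulated J^T J (unit measurement noise), and the function name `next_best_pose` and the per-candidate Jacobians are hypothetical inputs that a real calibration pipeline would have to supply.

```python
import numpy as np


def next_best_pose(info, candidate_jacobians):
    """Pick the candidate pose that minimises trace(covariance) after adding it.

    info: (P, P) information matrix accumulated from images captured so far.
    candidate_jacobians: list of (M, P) Jacobians, one per candidate pose
        (hypothetical inputs; unit measurement noise is assumed).
    """
    best_idx, best_trace = -1, np.inf
    for idx, J in enumerate(candidate_jacobians):
        # Adding a view adds its J^T J to the information matrix;
        # the covariance is approximated by the inverse.
        cov = np.linalg.inv(info + J.T @ J)
        tr = float(np.trace(cov))
        if tr < best_trace:
            best_idx, best_trace = idx, tr
    return best_idx, best_trace
```

For example, if the current information matrix is `np.diag([10.0, 0.1])`, the second parameter is poorly constrained, so a candidate whose Jacobian is `[[0, 1]]` is preferred over one with `[[1, 0]]`: the trace criterion naturally steers the next image toward whatever the calibration knows least about.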

[203] General vs Domain-Specific CNNs: Understanding Pretraining Effects on Brain MRI Tumor Classification

Helia Abedini,Saba Rahimi,Reza Vaziri

Main category: cs.CV

TL;DR: This study compares three pretrained CNNs on a small brain tumor MRI dataset. RadImageNet DenseNet121 performs poorly despite its medical-domain pretraining, while modern CNNs pretrained on large general-purpose data (ConvNeXt-Tiny and EfficientNetV2S) transfer better, suggesting that general-purpose models can beat domain-specific pretraining in small-data settings.

Details Motivation: For small medical datasets, it is unclear whether medical-domain pretraining or large-scale general-purpose pretraining yields better brain tumor classification, so the two pretraining strategies need a systematic comparison. Method: Three pretrained CNN architectures (RadImageNet DenseNet121, EfficientNetV2S, and ConvNeXt-Tiny) are trained and fine-tuned under identical conditions on a limited-size brain MRI dataset for a fair comparison of classification performance. Result: ConvNeXt-Tiny achieves the highest accuracy, followed by EfficientNetV2S, while RadImageNet DenseNet121 performs worst, with lower accuracy, higher loss, and weak generalization. Conclusion: Under small-data conditions, domain-specific pretraining does not necessarily help; modern, deeper general-purpose CNNs may achieve better transfer learning performance on medical image classification. Abstract: Brain tumor detection from MRI scans plays a crucial role in early diagnosis and treatment planning. Deep convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, particularly when pretrained on large datasets. However, it remains unclear which type of pretrained model performs better when only a small dataset is available: those trained on domain-specific medical data or those pretrained on large general datasets. In this study, we systematically evaluate three pretrained CNN architectures for brain tumor classification: RadImageNet DenseNet121, which has medical-domain pretraining, and EfficientNetV2S and ConvNeXt-Tiny, which are modern general-purpose CNNs. All models were trained and fine-tuned under identical conditions using a limited-size brain MRI dataset to ensure a fair comparison. Our results reveal that ConvNeXt-Tiny achieved the highest accuracy, followed by EfficientNetV2S, while RadImageNet DenseNet121, despite being pretrained on domain-specific medical data, exhibited poor generalization with lower accuracy and higher loss. These findings suggest that domain-specific pretraining may not generalize well under small-data conditions. In contrast, modern, deeper general-purpose CNNs pretrained on large-scale datasets can offer superior transfer learning performance in specialized medical imaging tasks.

[204] SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters

Shohei Tanaka,Atsushi Hashimoto,Yoshitaka Ushiku

Main category: cs.CV

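TL;DR: This paper presents SciPostLayoutTree, a dataset of about 8,000 scientific posters annotated with reading order and parent-child relations, and Layout Tree Decoder, a model that combines visual and bounding box features to predict spatially challenging relations more accurately and advance poster structure analysis.

Details Motivation: Scientific posters are central to academic communication, yet structural analysis research has focused mainly on papers and lags behind for posters. To fill this gap, the paper builds a structured dataset dedicated to posters and a model for recognizing their reading order and hierarchy. Method: The authors build SciPostLayoutTree, an annotated dataset of about 8,000 scientific posters labeled with reading order and parent-child relations, and propose Layout Tree Decoder, which fuses visual features with the position and category information of bounding boxes and uses beam search to model sequence-level plausibility, so that complex spatial relations are predicted more reliably. Result: Experiments show that the model significantly improves prediction accuracy on spatially challenging relations such as upward, horizontal, and long-distance relations, establishing a solid baseline for poster structure analysis. The dataset and code are publicly released. Conclusion: By building a large annotated dataset and a purpose-designed model, the paper advances structural analysis of scientific posters, in particular the recognition of complex spatial layout relations, and lays groundwork for structure-aware academic communication interfaces. Abstract: Scientific posters play a vital role in academic communication by presenting ideas through visual summaries. Analyzing reading order and parent-child relations of posters is essential for building structure-aware interfaces that facilitate clear and accurate understanding of research content. Despite their prevalence in academic communication, posters remain underexplored in structural analysis research, which has primarily focused on papers. To address this gap, we constructed SciPostLayoutTree, a dataset of approximately 8,000 posters annotated with reading order and parent-child relations. Compared to an existing structural analysis dataset, SciPostLayoutTree contains more instances of spatially challenging relations, including upward, horizontal, and long-distance relations. As a solution to these challenges, we develop Layout Tree Decoder, which incorporates visual features as well as bounding box features including position and category information. The model also uses beam search to predict relations while capturing sequence-level plausibility. Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations and establishes a solid baseline for poster structure analysis. The dataset is publicly available at https://huggingface.co/datasets/omron-sinicx/scipostlayouttree. The code is also publicly available at https://github.com/omron-sinicx/scipostlayouttree.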

[205] ConsistCompose: Unified Multimodal Layout Control for Image Composition

Xuanke Shi,Boxuan Li,Xiaoyang Han,Zhongang Cai,Lei Yang,Dahua Lin,Quan Wang

Main category: cs.CV

TL;DR: This paper proposes ConsistCompose, a unified multimodal framework that embeds layout coordinates into language prompts to enable layout-controllable multi-instance image generation from interleaved image-text input, and builds the large-scale ConsistCompose3M dataset, significantly improving spatial accuracy and identity fidelity.

Details Motivation: Existing multimodal models focus mainly on visual grounding, while layout-controllable multi-instance generation (LELG) remains underexplored, leaving them without precise compositional control. Method: ConsistCompose embeds layout coordinates directly into language prompts and combines instance-coordinate binding prompts with coordinate-aware classifier-free guidance to provide layout control within a single generative interface; the authors also build ConsistCompose3M, a dataset of 3.4 million samples, for training and validation. Result: Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially outperforms existing layout-controlled baselines in spatial accuracy while maintaining good identity fidelity and multimodal understanding. Conclusion: ConsistCompose establishes a unified paradigm for layout-controllable multimodal image generation and advances linguistic-embedded layout-grounded generation. Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text input within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
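As a rough illustration of what "embedding layout coordinates directly into language prompts" could look like, the sketch below binds each instance name to a normalized bounding box written as plain text. The `<box>…</box>` syntax, the two-decimal normalization, and the function name `bind_instances` are all hypothetical choices made for this example; the abstract does not specify the actual prompt format used by ConsistCompose.

```python
def bind_instances(caption, instances, width, height):
    """Append 'name <box>x1,y1,x2,y2</box>' bindings to a caption.

    instances: list of (name, (x1, y1, x2, y2)) in pixel coordinates.
    Coordinates are normalised to [0, 1] so the prompt is resolution-independent.
    The <box> tag format is a hypothetical convention, not the paper's.
    """
    parts = [caption.strip()]
    for name, (x1, y1, x2, y2) in instances:
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            raise ValueError(f"box for {name!r} is outside the canvas or empty")
        norm = (x1 / width, y1 / height, x2 / width, y2 / height)
        coords = ",".join(f"{v:.2f}" for v in norm)
        parts.append(f"{name} <box>{coords}</box>")
    return " ".join(parts)


prompt = bind_instances(
    "A dog next to a cat.",
    [("dog", (0, 256, 256, 512)), ("cat", (256, 256, 512, 512))],
    width=512,
    height=512,
)
```

Keeping the layout inside the text stream means the same tokenizer and generation interface handle both content and position, which is the property the abstract highlights ("without task-specific branches").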

[206] A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles

Tianyang Xu,Jinjie Gu,Xuefeng Zhu,XiaoJun Wu,Josef Kittler

Main category: cs.CV

TL;DR: This paper presents MM-UAV, the first large-scale multi-modal UAV tracking benchmark, covering RGB, infrared, and event modalities across more than 30 challenging scenarios, and proposes a new multi-modal tracking framework whose adaptive alignment, dynamic fusion, and event-enhanced association significantly improve tracking performance.

Details Motivation: Single-modality visual tracking performs poorly in challenging scenarios such as low illumination, cluttered backgrounds, and rapid motion, while multi-modal tracking has been held back by the lack of dedicated public datasets, so a dedicated multi-modal UAV tracking dataset and a matching framework are needed. Method: The proposed framework for multi-modal UAV tracking has three key components: an offset-guided adaptive alignment module that resolves spatial misalignment between sensors, an adaptive dynamic fusion module that balances complementary information across modalities, and an event-enhanced association mechanism that uses motion cues from the event modality to improve identity maintenance. Result: Experiments on MM-UAV show that the framework outperforms state-of-the-art methods across a variety of challenging scenarios; the released dataset contains 1,321 synchronized sequences and more than 2.8 million annotated frames. Conclusion: MM-UAV provides an important benchmark for multi-modal UAV tracking, and the proposed framework offers an effective baseline for follow-up research on robust UAV tracking in complex environments. Abstract: With the proliferation of low altitude unmanned aerial vehicles (UAVs), visual multi-object tracking is becoming a critical security technology, demanding significant robustness even in complex environmental conditions. However, tracking UAVs using a single visual modality often fails in challenging scenarios, such as low illumination, cluttered backgrounds, and rapid motion. Although multi-modal multi-object UAV tracking is more resilient, the development of effective solutions has been hindered by the absence of dedicated public datasets. To bridge this gap, we release MM-UAV, the first large-scale benchmark for Multi-Modal UAV Tracking, integrating three key sensing modalities, namely RGB, infrared (IR), and event signals. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences, and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework, designed specifically for UAV tracking applications and serving as a baseline for future research. Our framework incorporates two key technical innovations, namely an offset-guided adaptive alignment module to resolve spatial mismatches across sensors, and an adaptive dynamic fusion module to balance complementary information conveyed by different modalities. Furthermore, to overcome the limitations of conventional appearance modelling in multi-object tracking, we introduce an event-enhanced association mechanism that leverages motion cues from the event modality for more reliable identity maintenance. Comprehensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art methods. To foster further research in multi-modal UAV tracking, both the dataset and source code will be made publicly available at https://xuefeng-zhu5.github.io/MM-UAV/.

[207] FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement

Wenshuo Gao,Junyi Fan,Jiangyue Zeng,Shuai Yang

Main category: cs.CV

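TL;DR: FlowPortal is a training-free, flow-based video relighting framework that uses a Residual-Corrected Flow mechanism, a Decoupled Condition Design, and High-Frequency Transfer to deliver strong temporal consistency, structural fidelity, and natural illumination.

Details Motivation: Existing video relighting methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness, and often depend on complex training, which limits their effectiveness and efficiency in practice. Method: FlowPortal uses a Residual-Corrected Flow mechanism to turn a standard flow-based model into an editing model, adds a Decoupled Condition Design for precise lighting control and High-Frequency Transfer to preserve detail, and applies a masking strategy to separate foreground relighting from background generation. Result: Experiments show that FlowPortal outperforms existing methods in temporal coherence, structural preservation, and lighting realism while remaining efficient at inference, making it suitable for high-quality video editing. Conclusion: FlowPortal offers an efficient, training-free solution for video relighting and background replacement, with strong structural consistency and visual quality. Abstract: Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from the pure background generation process. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency. Project Page: https://gaowenshuo.github.io/FlowPortalProject/.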

[208] MagicWand: A Universal Agent for Generation and Evaluation Aligned with User Preference

Zitong Xu,Dake Shen,Yaosong Du,Kexiang Hao,Jinghan Huang,Xiande Huang

Main category: cs.CV

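TL;DR: This paper presents MagicWand, a universal generation and evaluation agent built on the large-scale user preference dataset UniPrefer-100K; it improves the alignment of AIGC content with user preferences through prompt enhancement, high-quality generation, and preference-aligned evaluation, and releases UniPreferBench, the first large-scale preference alignment benchmark.

Details Motivation: Users of AIGC models find it hard to obtain content matching their preferences by hand-writing detailed prompts, and existing methods lack effective mechanisms for modeling and retaining user preferences. Method: The authors build UniPrefer-100K, a large-scale user preference dataset of images, videos, and associated style descriptions; on top of it they propose MagicWand, which supports preference-based prompt enhancement, high-quality content generation, and preference-aligned evaluation and refinement; they also establish UniPreferBench as a large-scale evaluation benchmark. Result: Experiments on UniPreferBench show that MagicWand significantly improves the alignment between generated content and user preferences across diverse scenarios, outperforming existing methods. Conclusion: MagicWand effectively understands and applies user preferences for more personalized AIGC generation and evaluation, and UniPrefer-100K and UniPreferBench provide important resources for future research. Abstract: Recent advances in AIGC (Artificial Intelligence Generated Content) models have enabled significant progress in image and video generation. However, users still struggle to obtain content that aligns with their preferences due to the difficulty of crafting detailed prompts and the lack of mechanisms to retain their preferences. To address these challenges, we construct UniPrefer-100K, a large-scale dataset comprising images, videos, and associated text that describes the styles users tend to prefer. Based on UniPrefer-100K, we propose MagicWand, a universal generation and evaluation agent that enhances prompts based on user preferences, leverages advanced generation models for high-quality content, and applies preference-aligned evaluation and refinement. In addition, we introduce UniPreferBench, the first large-scale benchmark with over 120K annotations for assessing user preference alignment across diverse AIGC tasks. Experiments on UniPreferBench demonstrate that MagicWand consistently generates content and evaluations that are well aligned with user preferences across a wide range of scenarios.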

[209] TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Alexandros Stergiou

Main category: cs.CV

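TL;DR: This paper proposes TRANSPORTER, a model-independent method that introduces a logits-to-video (L2V) task to generate videos reflecting the rules behind the predictions of vision language models (VLMs). Exploiting the high visual fidelity of text-to-video generators, it learns an optimal transport coupling to the VLM's high-level semantic embedding space, giving a visual explanation of the model's decision process.

Details Motivation: Current vision language models can handle complex scenes, but their internal reasoning remains hard to understand and control, and effective interpretability tools are lacking. Method: The paper proposes the L2V task and the TRANSPORTER framework, which leverages the high visual fidelity of T2V generative models and learns an optimal transport coupling between VLM output logits and videos, using logit scores to define directions for conditional video generation in the high-level semantic embedding space. Result: TRANSPORTER generates high-quality videos reflecting changes in object attributes, action adverbs, and scene context; quantitative and qualitative evaluations show that it reveals the basis of model decisions across multiple VLMs. Conclusion: L2V offers a novel, high-fidelity direction for VLM interpretability, and TRANSPORTER, as a general framework, helps explain how VLMs derive answers from complex video content. Abstract: How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high visual fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLMs' high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.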

[210] Alias-free 4D Gaussian Splatting

Zilong Chen,Huan-ang Gao,Delin Qu,Haohan Chi,Hao Tang,Kai Zhang,Hao Zhao

Main category: cs.CV

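TL;DR: Proposes an anti-aliasing method for 4D Gaussian Splatting that uses an adaptive filter and a scale loss to remove high-frequency artifacts and reduce redundant Gaussians.

Details Motivation: Existing Gaussian Splatting-based dynamic scene reconstruction produces strong artifacts when the focal length or camera distance changes, because of the frequency constraints of 4D Gaussians and the scale mismatch caused by the 2D dilated filter. Method: The authors derive a maximum sampling frequency formulation for 4D Gaussian Splatting and introduce a 4D scale-adaptive filter and a scale loss that flexibly regulate the sampling frequency. Result: Monocular and multi-view video reconstruction experiments validate the method, which eliminates high-frequency artifacts at increased rendering frequencies and reduces the number of redundant Gaussians. Conclusion: The method achieves high-quality, alias-free 4D Gaussian Splatting rendering and improves the stability and efficiency of dynamic scene reconstruction. Abstract: Existing dynamic scene reconstruction methods based on Gaussian Splatting enable real-time rendering and generate realistic images. However, adjusting the camera's focal length or the distance between Gaussian primitives and the camera to modify rendering resolution often introduces strong artifacts, stemming from the frequency constraints of 4D Gaussians and the Gaussian scale mismatch induced by the 2D dilated filter. To address this, we derive a maximum sampling frequency formulation for 4D Gaussian Splatting and introduce a 4D scale-adaptive filter and scale loss, which flexibly regulate the sampling frequency of 4D Gaussian Splatting. Our approach eliminates high-frequency artifacts under increased rendering frequencies while effectively reducing redundant Gaussians in multi-view video reconstruction. We validate the proposed method through monocular and multi-view video reconstruction experiments. Project page: https://4d-alias-free.github.io/4D-Alias-free/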

[211] MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

Zenghao Chai,Chen Tang,Yongkang Wong,Xulei Yang,Mohan Kankanhalli

Main category: cs.CV

TL;DR: Proposes MimiCAT, a model for category-free 3D pose transfer that uses semantic keypoints and soft correspondence matching to overcome structural differences between characters.

Details Motivation: Existing methods are limited to pose transfer between characters with similar structures and generalize poorly across categories (for example, humanoid to quadruped), because structural and transformation diversity leads to mismatched regions. Method: The authors build a million-scale multi-character pose dataset and propose MimiCAT, a cascade-transformer model that learns soft correspondences from semantic keypoint labels, formulates pose transfer as a conditional generation process, and refines the result with shape-conditioned representations. Result: Experiments show that MimiCAT produces plausible, high-quality poses in cross-category transfer, significantly outperforming existing methods restricted to same-category transfer. Conclusion: MimiCAT achieves category-free 3D pose transfer, handling structural diversity through soft correspondence and conditional generation. Abstract: 3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. The pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT transfers plausible poses across different characters, significantly outperforming prior methods that are limited to narrow category transfer (e.g., humanoid-to-humanoid).

[212] MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

Xiyang Wu,Zongxia Li,Jihui Jin,Guangyao Shi,Gouthaman KV,Vishnu Raj,Nilotpal Sinha,Jingxi Chen,Fan Du,Dinesh Manocha

Main category: cs.CV

TL;DR: This paper improves vision language models (VLMs) on physics-driven reasoning by introducing the MASS-Bench benchmark and MASS, a model-agnostic method that injects spatial-temporal signals from the physical world into the VLM language space; combined with reinforcement fine-tuning, it significantly improves physics comprehension and reasoning.

Details Motivation: Existing VLMs perform poorly on physics reasoning involving motion dynamics and spatial interactions, which limits their ability to understand and generate real or AI-generated videos, so a way to translate physical-world cues into representations VLMs can interpret is needed. Method: The authors introduce MASS-Bench, with 4,350 real-world and AI-generated videos and 8,361 question-answer pairs for evaluating physics comprehension, and design MASS, which injects spatial-temporal signals into the VLM language space via depth-based 3D encoding, visual grounding, and motion tracking, with reinforcement fine-tuning to strengthen cross-modal alignment and reasoning. Result: The refined VLMs outperform comparable and larger baselines on physics reasoning by 8.7% and 6.0%, respectively, approaching closed-source state-of-the-art models such as Gemini-2.5-Flash. Conclusion: The method narrows the gap in VLMs' understanding of physical dynamics and supports the value of structured physical signals and reinforcement fine-tuning for multimodal reasoning. Abstract: Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0%, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.

[213] Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

Shijian Wang,Runhao Fu,Siyi Zhao,Qingqin Zhan,Xingjian Wang,Jiarui Jin,Yuan Lu,Hanqian Wu,Cunjian Chen

Main category: cs.CV

TL;DR: This paper proposes CompGen, a compositional curriculum reinforcement learning framework that improves the compositional generation ability of text-to-image models on complex scenes, using scene graphs to define compositional difficulty and adaptive sampling to build a progressive training curriculum.

Details Motivation: Existing text-to-image models are weak on compositional scenes containing multiple objects with complex attributes and spatial and semantic relationships, and struggle with precise object placement and plausible interactions, so their compositional understanding and generation need strengthening. Method: CompGen represents scenes as scene graphs, defines a difficulty criterion for compositional ability, and develops an adaptive Markov Chain Monte Carlo graph sampling algorithm to generate an easy-to-hard training curriculum; the curriculum is integrated into Group Relative Policy Optimization (GRPO), and different curriculum scheduling strategies are explored. Result: CompGen shows distinct scaling curves under different scheduling strategies, with easy-to-hard and Gaussian sampling outperforming random sampling, and it significantly improves compositional generation for both diffusion-based and auto-regressive models. Conclusion: CompGen effectively strengthens the compositional generation ability of text-to-image models and supports difficulty-aware curriculum learning as a way to improve generation quality on complex scenes. Abstract: Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
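The three curriculum scheduling strategies named in the abstract (random, easy-to-hard, and Gaussian sampling) can be contrasted with a small sampler. This is a sketch under stated assumptions, not the paper's implementation: difficulty is assumed to be a precomputed score in [0, 1] per prompt, training progress is assumed to run from 0 to 1, and the function name `sample_batch`, the difficulty cap, and the `sigma` width are hypothetical choices.

```python
import math
import random


def sample_batch(difficulties, progress, strategy, batch_size, rng, sigma=0.15):
    """Sample prompt indices according to a curriculum schedule.

    difficulties: per-prompt difficulty scores in [0, 1] (assumed precomputed).
    progress: training progress in [0, 1].
    """
    n = len(difficulties)
    if strategy == "random":
        # Baseline: ignore difficulty entirely.
        weights = [1.0] * n
    elif strategy == "easy_to_hard":
        # Only prompts at or below a cap that grows with progress are eligible;
        # the max() keeps at least the easiest prompt available at the start.
        cap = max(progress, min(difficulties))
        weights = [1.0 if d <= cap else 0.0 for d in difficulties]
    elif strategy == "gaussian":
        # Soft window centred on the current progress level.
        weights = [
            math.exp(-((d - progress) ** 2) / (2 * sigma ** 2))
            for d in difficulties
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return rng.choices(range(n), weights=weights, k=batch_size)
```

The easy-to-hard schedule keeps revisiting easy prompts as the cap grows, whereas the Gaussian window gradually leaves them behind; which behavior helps more is an empirical question that the paper's scaling curves address.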

[214] RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models

Timing Yang,Guoyizhe Wei,Alan Yuille,Feng Wang

Main category: cs.CV

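TL;DR: This paper systematically studies Mamba's representational properties in vision tasks, relates it to Softmax and linear attention, proposes a new metric for evaluating activation maps, and uses self-supervised pretraining to improve interpretability and performance.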

Details Motivation: Although Mamba performs well on vision tasks, its underlying mechanism in the visual domain remains unclear, so its representational capacity needs systematic analysis. Method: Theoretically analyzes Mamba's relationship to Softmax and linear attention, proposes a binary segmentation metric for evaluating activation maps, and uses DINO for self-supervised pretraining to obtain clearer activation maps. Result: Shows that Mamba can be viewed as a low-rank approximation of Softmax attention; the new metric demonstrates its capacity to model long-range dependencies; self-supervised pretraining improves activation map quality; the model reaches 78.5% linear probing accuracy on ImageNet. Conclusion: Mamba has good representational capacity and interpretability in the visual domain, and the study offers useful insights for future Mamba-based vision architectures. Abstract: Mamba has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba's representational properties and make three primary contributions. First, we theoretically analyze Mamba's relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba's capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba's potential for interpretability. Notably, our model also achieves a 78.5 percent linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.
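
The "RNN as Linear Transformer" framing in the title rests on a well-known identity: causal linear attention can be computed either over the whole prefix at once or as a recurrence with a fixed-size state. The sketch below illustrates that identity for generic kernelized linear attention. It is not Mamba's selective scan and not the paper's derivation; the `elu(x) + 1` feature map and all function names are assumptions chosen for the example.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_parallel(Q, K, V):
    """Causal linear attention computed directly over each prefix."""
    out = []
    for t, q in enumerate(Q):
        fq = phi(q)
        weights = [dot(fq, phi(K[s])) for s in range(t + 1)]
        total = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention_recurrent(Q, K, V):
    """Same computation as an RNN with state S (d_k x d_v) and normalizer z."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]                 # running sum of phi(k)
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]   # running sum of phi(k) v^T
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out


# Tiny hypothetical sequence: three tokens, 2-dimensional keys and values.
Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention_parallel(Q, K, V))
print(linear_attention_recurrent(Q, K, V))
```

Both functions return the same values up to floating-point error; the recurrent form keeps memory constant in sequence length, which is the property state-space models share.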

[215] ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access

Timing Yang,Sucheng Ren,Alan Yuille,Feng Wang

Main category: cs.CV

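TL;DR: This paper presents ViMix-14M, a high-quality, directly downloadable dataset of about 14 million video-text pairs, aimed at the data bottleneck facing open-source text-to-video generation models.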

Details Motivation: Existing public video-text datasets depend on manually crawling platforms such as YouTube and suffer from link rot, access limits, and unclear licensing, which makes them hard to use for large-scale training. Method: Merges multiple open video sources, applies unified de-duplication and quality filtering, and designs a multi-granularity, ground-truth-guided re-captioning pipeline that improves the alignment between captions and video content. Result: On multimodal retrieval, text-to-video generation, and video question answering, ViMix-14M outperforms existing datasets and significantly improves model performance. Conclusion: ViMix-14M provides reliable data support for training and fine-tuning open-source video foundation models and helps advance the construction of high-quality, generalizable video-text datasets. Abstract: Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset on multimodal retrieval, text-to-video generation, and video question answering tasks, observing consistent improvements over counterpart datasets. We hope this work can help remove the key barrier to training and fine-tuning open-source video foundation models, and provide insights into building high-quality and generalizable video-text datasets.

[216] Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection

Chuang Peng,Renshuai Tao,Zhongwei Ren,Xianglong Liu,Yunchao Wei

Main category: cs.CV

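TL;DR: This paper presents DualXrayBench, the first X-ray prohibited item detection benchmark that supports multiple views and modalities, and introduces the GSR model, which treats the second-view image as a "language-like modality" and improves detection through joint geometric-semantic reasoning.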

Details Motivation: Traditional methods rely on single-view visual information and struggle with complex threats, whereas security inspectors actually work with dual-view images. The study asks whether the second view can supply constraints the way a language modality does, and thereby improve detection. Method: Builds the DualXrayBench benchmark, containing 45,613 dual-view image pairs with text descriptions, and the GSXray dataset; proposes the Geometric-Semantic Reasoner (GSR), which jointly learns cross-view geometric relations and cross-modal semantics and reasons with structured Chain-of-Thought sequences. Result: Comprehensive evaluation on DualXrayBench shows that GSR achieves significant gains on all X-ray tasks, supporting the value of treating the second view as a "language-like modality". Conclusion: Treating the second-view image as a language-like modality for joint cross-view and cross-modal reasoning offers a new and effective approach to real-world X-ray security inspection. Abstract: Automatic X-ray prohibited item detection is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a caption corpus consisting of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a "language-like modality". To enable this, we construct the GSXray dataset, with structured Chain-of-Thought sequences. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new perspective for real-world X-ray inspection.

[217] SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation

Peter Siegel,Federico Tombari,Marc Pollefeys,Daniel Barath

Main category: cs.CV

TL;DR: This paper presents SegSplat, a new framework that combines fast feed-forward 3D reconstruction with open-vocabulary semantic understanding.

Details Motivation: Bridge the gap between fast 3D reconstruction and rich semantic understanding to enable real-time semantic awareness of scenes. Method: Builds a compact semantic memory bank from multi-view 2D foundation model features and, in a single forward pass, predicts a discrete semantic index together with geometric and appearance attributes for each 3D Gaussian. Result: Experiments show that SegSplat matches the geometric fidelity of state-of-the-art methods while delivering robust open-set semantic segmentation, without any per-scene optimization for semantic feature integration. Conclusion: SegSplat is an important step toward practical, on-the-fly generation of semantically aware 3D environments, with relevance to robotic interaction, augmented reality, and similar applications. Abstract: We introduce SegSplat, a novel framework designed to bridge the gap between rapid, feed-forward 3D reconstruction and rich, open-vocabulary semantic understanding. By constructing a compact semantic memory bank from multi-view 2D foundation model features and predicting discrete semantic indices alongside geometric and appearance attributes for each 3D Gaussian in a single pass, SegSplat efficiently imbues scenes with queryable semantics. Our experiments demonstrate that SegSplat achieves geometric fidelity comparable to state-of-the-art feed-forward 3D Gaussian Splatting methods while simultaneously enabling robust open-set semantic segmentation, crucially without requiring any per-scene optimization for semantic feature integration. This work represents a significant step towards practical, on-the-fly generation of semantically aware 3D environments, vital for advancing robotic interaction, augmented reality, and other intelligent systems.

[218] Exploring Weak-to-Strong Generalization for CLIP-based Classification

Jinhao Li,Sarah M. Erfani,Lei Feng,James Bailey,Feng Liu

Main category: cs.CV

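TL;DR: This paper studies weak-to-strong generalization, in which a weak model supervises a strong one, for CLIP-based classification, and proposes class prototype learning (CPL) to improve classification performance.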

Details Motivation: As model complexity grows, traditional alignment methods that rely on human supervision become inefficient and impractical, especially once models surpass human knowledge, so a scalable, automated supervision mechanism is needed. Method: Proposes class prototype learning (CPL), which learns more representative prototypes for each category from weak supervision signals to strengthen the CLIP model's classification ability. Result: Experiments show that even with a simple loss function and limited pretraining, CPL yields clear gains in the targeted scenarios, improving on strong baseline methods by 3.67% on average. Conclusion: CPL advances weak-to-strong generalization in vision-language models and offers a practical way to reduce dependence on human supervision. Abstract: Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient. A novel solution proposed recently is using a weaker model to supervise a stronger model. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors. Previous work has shown the effectiveness of weak-to-strong generalization in the context of language-only models. Extending this concept to vision-language models leverages these insights, adapting the proven benefits to a multi-modal context. In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, class prototype learning (CPL), which aims to enhance the classification capabilities of the CLIP model by learning more representative prototypes for each category. Our findings indicate that, despite using a simple loss function under weak supervision, CPL yields robust improvements in targeted scenarios, particularly when pretraining is limited. Extensive experiments demonstrate that our approach is effective under these settings, achieving a 3.67% improvement over strong baseline methods.

[219] ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering

Yuxiang Nie,Han Wang,Yongjie Ye,Haiyang Yu,Weitao Jia,Tao Zeng,Hao Feng,Xiang Fei,Yang Li,Xiaohui Lv,Guozhi Tang,Jingqun Tang,Jinghui Lu,Zehui Dai,Jiacong Wang,Dingkang Yang,An-Lan Wang,Can Huang

Main category: cs.CV

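TL;DR: This paper presents ChineseVideoBench, the first benchmark designed specifically to evaluate multimodal large language models on Chinese video question answering. It comprises 8 main classes and 12 sub-classes and emphasizes understanding of Chinese linguistic and cultural detail. Experiments show that existing models still struggle; Gemini 2.5 Pro performs best (77.9%), and InternVL-38B is the strongest open-source model.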

Details Motivation: Existing evaluation frameworks for multimodal large models give too little attention to Chinese language and culture, and Chinese video understanding calls for more comprehensive, culturally aware benchmarks, so an evaluation suite built specifically for Chinese video content is needed. Method: Builds ChineseVideoBench, a Chinese video question answering benchmark with 8 main classes and 12 sub-classes covering tasks that require deep video understanding together with Chinese linguistic and cultural awareness, designs the accompanying dataset and evaluation metrics, and empirically evaluates several mainstream MLLMs. Result: Current MLLMs show limited performance on ChineseVideoBench, which remains challenging overall; Gemini 2.5 Pro achieves the highest score at 77.9%, and InternVL-38B is the best-performing open-source model. Conclusion: ChineseVideoBench fills the gap in evaluation benchmarks for Chinese video question answering, measures MLLMs' video understanding in a Chinese context, and points to directions for future model improvement. Abstract: This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally-aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.

[220] 4D-VGGT: A General Foundation Model with SpatioTemporal Awareness for Dynamic Scene Geometry Estimation

Haonan Wang,Hanyu Zhou,Haoyue Liu,Luxin Yan

Main category: cs.CV

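TL;DR: This paper proposes 4D-VGGT, a general foundation model for dynamic scene geometry estimation that represents space and time with a divide-and-conquer strategy and improves performance through multi-view, multi-time-step input, multi-level feature fusion, and multi-task prediction.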

Details Motivation: Existing methods model spatial and temporal features in a single shared latent space, but the heterogeneity of the two tends to cause mismatched representations, making it hard to capture the complex structure of dynamic scenes. Method: Proposes the 4D-VGGT model: 1) an adaptive visual grid that supports inputs with arbitrary numbers of views and time steps; 2) cross-view global fusion for spatial representation and cross-time local fusion for temporal representation; 3) multiple task-specific heads for multi-task prediction. Result: Extensive experiments on several dynamic scene geometry benchmarks show that the method improves feature discriminability and application universality across a range of tasks. Conclusion: With separate but cooperating spatial and temporal representations, 4D-VGGT delivers strong performance and broad applicability in dynamic scene geometry estimation. Abstract: We investigate a challenging task of dynamic scene geometry estimation, which requires representing both spatial and temporal features. Typically, existing methods align the two features into a unified latent space to model scene geometry. However, this unified paradigm suffers from potential mismatched representation due to the heterogeneous nature of spatial and temporal features. In this work, we propose 4D-VGGT, a general foundation model with divide-and-conquer spatiotemporal representation for dynamic scene geometry. Our model has three aspects: 1) Multi-setting input. We design an adaptive visual grid that supports input sequences with arbitrary numbers of views and time steps. 2) Multi-level representation. We propose a cross-view global fusion for spatial representation and a cross-time local fusion for temporal representation. 3) Multi-task prediction. We append multiple task-specific heads to spatiotemporal representations, enabling a comprehensive visual geometry estimation for dynamic scenes. Under this unified framework, these components enhance the feature discriminability and application universality of our model for dynamic scenes. In addition, we integrate multiple geometry datasets to train our model and conduct extensive experiments to verify the effectiveness of our method across various tasks on multiple dynamic scene geometry benchmarks.

[221] NeuroVascU-Net: A Unified Multi-Scale and Cross-Domain Adaptive Feature Fusion U-Net for Precise 3D Segmentation of Brain Vessels in Contrast-Enhanced T1 MRI

Mohammad Jafari Vayeghan,Niloufar Delfan,Mehdi Tale Masouleh,Mansour Parvaresh Rizi,Behzad Moshiri

Main category: cs.CV

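TL;DR: This paper proposes NeuroVascU-Net, a new deep learning model built to segment cerebral vessels precisely from clinically standard contrast-enhanced T1-weighted MRI, combining high accuracy with low computational cost.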

Details Motivation: Existing automated cerebrovascular segmentation methods are mostly based on TOF-MRA and often trade accuracy against computational efficiency, which falls short of the clinical needs of neurosurgical planning. An efficient, accurate segmentation model designed for T1CE MRI is therefore needed. Method: Builds on a dilated U-Net and adds two specialized modules: a Multi-Scale Contextual Feature Fusion (MSC²F) module at the bottleneck and a Cross-Domain Adaptive Feature Fusion (CDA²F) module at deeper hierarchical layers, to capture multi-scale information and dynamically integrate domain-specific features. Result: Trained and validated on T1CE MRI data from 137 brain tumor biopsy patients, the model reaches a Dice score of 0.8609 and a precision of 0.8841 with only 12.4M parameters, far fewer than Transformer-based models. Conclusion: NeuroVascU-Net keeps segmentation accuracy high at low computational cost, is suitable for computer-assisted neurosurgical planning, and fills a gap in cerebrovascular segmentation from T1CE MRI. Abstract: Precise 3D segmentation of cerebral vasculature from T1-weighted contrast-enhanced (T1CE) MRI is crucial for safe neurosurgical planning. Manual delineation is time-consuming and prone to inter-observer variability, while current automated methods often trade accuracy for computational cost, limiting clinical use. We present NeuroVascU-Net, the first deep learning architecture specifically designed to segment cerebrovascular structures directly from clinically standard T1CE MRI in neuro-oncology patients, addressing a gap in prior work dominated by TOF-MRA-based approaches. NeuroVascU-Net builds on a dilated U-Net and integrates two specialized modules: a Multi-Scale Contextual Feature Fusion ($MSC^2F$) module at the bottleneck and a Cross-Domain Adaptive Feature Fusion ($CDA^2F$) module at deeper hierarchical layers. $MSC^2F$ captures both local and global information via multi-scale dilated convolutions, while $CDA^2F$ dynamically integrates domain-specific features, enhancing representation while keeping computation low. The model was trained and validated on a curated dataset of T1CE scans from 137 brain tumor biopsy patients, annotated by a board-certified functional neurosurgeon. NeuroVascU-Net achieved a Dice score of 0.8609 and precision of 0.8841, accurately segmenting both major and fine vascular structures. Notably, it requires only 12.4M parameters, significantly fewer than transformer-based models such as Swin U-NetR. This balance of accuracy and efficiency positions NeuroVascU-Net as a practical solution for computer-assisted neurosurgical planning.
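
The two headline metrics in this entry, Dice score and precision, follow the standard definitions for binary masks. The sketch below shows how they are computed on flattened masks; the function name, the toy masks, and the conventions for empty masks are assumptions for illustration, not the authors' evaluation code.

```python
def dice_and_precision(pred, truth):
    """Return (Dice, precision) for two equal-length binary masks.

    Dice      = 2*TP / (2*TP + FP + FN)
    precision = TP / (TP + FP)
    Empty-mask conventions (Dice 1.0, precision 0.0) are assumed here.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return dice, precision


# Toy flattened masks (1 = vessel voxel): 3 true positives, 1 FP, 1 FN.
pred = [1, 1, 1, 0, 0, 1, 0, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
print(dice_and_precision(pred, truth))  # (0.75, 0.75)
```

Dice penalizes both missed vessel voxels and spurious ones, while precision only penalizes spurious ones, which is why the two reported numbers can differ.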

[222] CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images

Avishka Perera,Kumal Hewagamage,Saeedha Nazar,Kavishka Abeywardana,Hasitha Gallella,Ranga Rodrigo,Mohamed Afham

Main category: cs.CV

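TL;DR: Proposes CrossJEPA, a cross-modal joint-embedding predictive architecture for 3D representation learning that distills knowledge from an image foundation model, achieves state-of-the-art linear probing performance on ModelNet40 and ScanObjectNN, and is efficient, lightweight, and fast to train.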

Details Motivation: Existing 3D representation learning methods that leverage 2D data typically yield large, slow-to-train models with high compute cost; JEPA is under-explored in cross-modal settings and is often wrongly assumed to depend on masking. An efficient, lightweight architecture suited to cross-modal learning is therefore needed. Method: Proposes CrossJEPA, which draws on the knowledge of an image foundation model and trains a predictor to infer the embeddings of specific rendered 2D views from the corresponding 3D point clouds; conditions the predictor on cross-domain projection information to purify the supervision signal of semantics exclusive to the target domain; and uses a frozen teacher with a one-time target embedding cache to improve training efficiency. Result: Achieves state-of-the-art linear probing performance on ModelNet40 (94.2%) and ScanObjectNN (88.3%) with only 14.1M pretraining parameters (8.5M in the point encoder) and about 6 hours of pretraining on a single GPU. Conclusion: CrossJEPA is an efficient, memory-friendly, fast-to-train framework for 3D representation learning that demonstrates JEPA's potential in cross-modal settings and achieves high-performance knowledge distillation without relying on masking. Abstract: Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds, thereby introducing a JEPA-style pretraining strategy beyond masking. By conditioning the predictor on cross-domain projection information, CrossJEPA purifies the supervision signal from semantics exclusive to the target domain. We further exploit the frozen teacher design with a one-time target embedding caching mechanism, yielding amortized efficiency.
CrossJEPA achieves a new state-of-the-art in linear probing on the synthetic ModelNet40 (94.2%) and the real-world ScanObjectNN (88.3%) benchmarks, using only 14.1M pretraining parameters (8.5M in the point encoder), and about 6 pretraining hours on a standard single GPU. These results position CrossJEPA as a performant, memory-efficient, and fast-to-train framework for 3D representation learning via knowledge distillation. We analyze CrossJEPA intuitively, theoretically, and empirically, and extensively ablate our design choices. Code will be made available.

[223] LungX: A Hybrid EfficientNet-Vision Transformer Architecture with Multi-Scale Attention for Accurate Pneumonia Detection

Mansur Yerzhanuly

Main category: cs.CV

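TL;DR: Proposes LungX, a hybrid architecture combining EfficientNet, CBAM, and a Vision Transformer to improve pneumonia detection; on 20,000 chest X-rays it reaches 86.5% accuracy and 0.943 AUC, a clear improvement over the baseline.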

Details Motivation: Pneumonia is one of the leading causes of death worldwide and timely diagnosis is critical, but existing methods are limited in feature extraction and global context modeling. Method: Proposes LungX, a hybrid deep learning architecture that fuses EfficientNet's multi-scale features, CBAM attention, and the Vision Transformer's global context modeling. Result: On 20,000 chest X-rays from the RSNA and CheXpert datasets, LungX reaches 86.5% accuracy and 0.943 AUC, a 6.7% AUC improvement over the EfficientNet-B0 baseline, and visualized attention maps show better lesion localization. Conclusion: LungX clearly improves pneumonia detection with good interpretability; future work includes multi-center validation and architectural optimization targeting 88% accuracy, in support of clinical use as an AI diagnostic aid. Abstract: Pneumonia remains a leading global cause of mortality where timely diagnosis is critical. We introduce LungX, a novel hybrid architecture combining EfficientNet's multi-scale features, CBAM attention mechanisms, and Vision Transformer's global context modeling for enhanced pneumonia detection. Evaluated on 20,000 curated chest X-rays from RSNA and CheXpert, LungX achieves state-of-the-art performance (86.5 percent accuracy, 0.943 AUC), representing a 6.7 percent AUC improvement over EfficientNet-B0 baselines. Visual analysis demonstrates superior lesion localization through interpretable attention maps. Future directions include multi-center validation and architectural optimizations targeting 88 percent accuracy for clinical deployment as an AI diagnostic aid.

[224] DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

Yongkun Du,Pinxuan Chen,Xuye Ying,Zhineng Chen

Main category: cs.CV

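TL;DR: This paper presents DocPTBench, a comprehensive benchmark designed for photographed document parsing and translation. It contains more than 1,300 high-resolution real-world photographed documents across multiple domains and eight translation scenarios, with human-verified annotations. Experiments show that existing models degrade markedly on photographed documents, underscoring how hard document understanding is under real capture conditions.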

Details Motivation: Existing document parsing and translation benchmarks are mostly built on clean scanned or digital-born documents and do not reflect the complexity of real capture conditions such as geometric distortion and lighting variation, so a benchmark closer to real use is needed. Method: Builds a new benchmark, DocPTBench, with over 1,300 high-resolution photographed documents from multiple domains, eight translation scenarios, and human-verified parsing and translation annotations; systematically evaluates mainstream multimodal large models and specialized document models. Result: Moving from digital documents to photographed ones, mainstream multimodal large models lose an average of 18% accuracy in end-to-end parsing and 12% in translation, while specialized document parsing models drop by 25% on average. Conclusion: DocPTBench exposes the serious challenge that real photographed documents pose to existing models and their limited robustness in real-world settings, and encourages future work on document processing abilities that matter in practice. Abstract: The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.

[225] When Generative Replay Meets Evolving Deepfakes: Domain-Aware Relative Weighting for Incremental Face Forgery Detection

Hao Shen,Jikang Cheng,Renye Yan,Zhongyuan Wang,Wei Peng,Baojin Huang

Main category: cs.CV

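TL;DR: This paper studies generative replay for incremental forgery detection and proposes a Domain-Aware Relative Weighting (DARW) strategy that separates domain-safe from domain-risky samples and adjusts supervision strength dynamically, improving the model's continual learning performance.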

Details Motivation: Existing sample-replay-based incremental forgery detection methods suffer from low diversity and privacy concerns; generative replay is promising but its applicability is unclear, so its feasibility for forgery detection needs systematic study along with improvements. Method: Proposes a Domain-Aware Relative Weighting (DARW) strategy on top of generative replay: domain-safe samples are supervised directly, while domain-risky samples are handled with a Relative Separation Loss and a dynamically adjusted Domain Confusion Score that balance supervision against confusion. Result: Experiments show that DARW consistently improves incremental learning performance under different generative replay settings and mitigates the negative impact of domain overlap. Conclusion: DARW offers a new approach to applying generative replay safely and effectively in forgery detection, and strengthens model robustness and adaptability under incremental updates. Abstract: The rapid advancement of face generation techniques has led to a growing variety of forgery methods. Incremental forgery detection aims to gradually update existing models with new forgery data, yet current sample replay-based methods are limited by low diversity and privacy concerns. Generative replay offers a potential solution by synthesizing past data, but its feasibility for forgery detection remains unclear. In this work, we systematically investigate generative replay and identify two scenarios: when the replay generator closely resembles the new forgery model, generated real samples blur the domain boundary, creating domain-risky samples; when the replay generator differs significantly, generated samples can be safely supervised, forming domain-safe samples. To exploit generative replay effectively, we propose a novel Domain-Aware Relative Weighting (DARW) strategy. DARW directly supervises domain-safe samples while applying a Relative Separation Loss to balance supervision and potential confusion for domain-risky samples. A Domain Confusion Score dynamically adjusts this tradeoff according to sample reliability. Extensive experiments demonstrate that DARW consistently improves incremental learning performance for forgery detection under different generative replay settings and alleviates the adverse impact of domain overlap.

[226] Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

Chi Zhang,Haibo Qiu,Qiming Zhang,Yufei Xu,Zhixiong Zeng,Siqi Yang,Peng Shi,Lin Ma,Jing Zhang

Main category: cs.CV

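TL;DR: Proposes PEARL, which uses a perception-reasoning synergy mechanism and verifiable visual evidence to strengthen multimodal reasoning, effectively reducing visual hallucinations and reward hacking.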

Details Motivation: Existing RLVR methods verify only the textual output and ignore the underlying visual perception, which leads to visual hallucinations and reward hacking. Method: Designs the dual-branch PEARL framework, which builds a perception checklist of verifiable sub-questions, generates a perceptual reward through auxiliary rollouts, uses that reward as a fidelity gate for reasoning, and combines with RL methods such as GRPO and DAPO. Result: Substantially improves performance on multimodal reasoning benchmarks such as MathVerse, with a +9.7% gain over the baseline and +6.6% over GRPO. Conclusion: By anchoring reasoning to verified visual evidence, PEARL achieves more reliable and faithful multimodal reasoning and addresses reasoning errors caused by faulty perception. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist -- a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
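
The gating behavior described in this entry can be pictured with a small sketch. Everything here is hypothetical: the function names, the exact-match scoring of checklist answers, and the `pass_threshold` parameter are assumptions made for illustration, since the abstract does not specify how the perceptual reward is computed or what counts as passing the check.

```python
def perceptual_reward(model_answers, reference_answers):
    """Fraction of checklist sub-questions answered correctly.

    Exact string match after normalization is an assumed scoring rule.
    """
    correct = sum(
        1 for a, r in zip(model_answers, reference_answers)
        if a.strip().lower() == r.strip().lower()
    )
    return correct / len(reference_answers)


def gated_reasoning_reward(perception, reasoning_correct, pass_threshold=1.0):
    """Return the reasoning reward only if the perception check passes.

    Returns None when the check fails, meaning the reasoning rollout is
    skipped for this instance (hypothetical handling of the 'halt' case).
    """
    if perception < pass_threshold:
        return None
    return 1.0 if reasoning_correct else 0.0


# Hypothetical checklist for one image: two perception sub-questions.
p = perceptual_reward(["3", "Red"], ["3", "blue"])
print(p, gated_reasoning_reward(p, reasoning_correct=True))
```

In this toy case the model gets one of two sub-questions right, so the gate stays closed and no reasoning reward is issued even though the final answer happened to be correct.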

[227] ReCoGS: Real-time ReColoring for Gaussian Splatting scenes

Lorenzo Rutayisire,Nicola Capodieci,Fabio Pellacini

Main category: cs.CV

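TL;DR: This paper presents a user-friendly recoloring pipeline for Gaussian Splatting scenes, paired with an interactive tool for precise region selection and real-time recoloring.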

Details Motivation: Existing 3D editing methods based on 2D diffusion models suffer from view inconsistency, coarse control, and high computational cost; this work targets the precision and real-time requirements of fine-grained recoloring in Gaussian Splatting. Method: Designs an editing pipeline for pre-trained Gaussian Splatting scenes in which the user interactively selects specific regions and the color parameters of the selected Gaussians are adjusted, giving real-time, consistent, fine-grained recoloring. Result: Experiments show that the method achieves precise region selection and natural color edits, preserves multi-view consistency, and supports real-time interaction; the accompanying tool confirms its practical usability. Conclusion: The method provides an efficient, intuitive solution for local recoloring of Gaussian Splatting scenes and broadens their potential in 3D content editing. Abstract: Gaussian Splatting has emerged as a leading method for novel view synthesis, offering superior training efficiency and real-time inference compared to NeRF approaches, while still delivering high-quality reconstructions. Beyond view synthesis, this 3D representation has also been explored for editing tasks. Many existing methods leverage 2D diffusion models to generate multi-view datasets for training, but they often suffer from limitations such as view inconsistencies, lack of fine-grained control, and high computational demand. In this work, we focus specifically on the editing task of recoloring. We introduce a user-friendly pipeline that enables precise selection and recoloring of regions within a pre-trained Gaussian Splatting scene. To demonstrate the real-time performance of our method, we also present an interactive tool that allows users to experiment with the pipeline in practice. Code is available at https://github.com/loryruta/recogs.

[228] SineProject: Machine Unlearning for Stable Vision Language Alignment

Arpit Garg,Hemanth Saratchandran,Simon Lucey

Main category: cs.CV

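TL;DR: SineProject is an unlearning method for multimodal large language models that adds sinusoidally modulated trainable parameters to the frozen projector, improving the spectral conditioning of the Jacobian and stabilizing cross-modal alignment for efficient, low-interference knowledge removal.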

Details Motivation: Existing unlearning methods tend to break vision-language alignment, causing models to refuse benign queries, so a more stable unlearning mechanism is needed. Method: Proposes SineProject, which introduces sinusoidally modulated trainable parameters into the frozen projector network to improve the spectral conditioning of the Jacobian and keep cross-modal alignment stable during unlearning. Result: On LLaVA v1.5 7B and 13B, SineProject substantially reduces refusals of benign queries on safety and privacy unlearning benchmarks while completely forgetting the targeted information, reaching a state-of-the-art forget-retain trade-off with negligible computational overhead. Conclusion: SineProject addresses alignment degradation during unlearning in multimodal models and delivers efficient, stable knowledge removal with practical potential. Abstract: Multimodal Large Language Models (MLLMs) increasingly need to forget specific knowledge such as unsafe or private information without requiring full retraining. However, existing unlearning methods often disrupt vision-language alignment, causing models to reject both harmful and benign queries. We trace this failure to the projector network: during unlearning, its Jacobian becomes severely ill-conditioned, leading to unstable optimization and drift in cross-modal embeddings. We introduce SineProject, a simple method that augments the frozen projector with sinusoidally modulated trainable parameters, improving the Jacobian's spectral conditioning and stabilizing alignment throughout unlearning. Across standard safety and privacy unlearning benchmarks using LLaVA v1.5 7B and 13B, SineProject reduces benign query refusals while achieving complete forgetting of targeted information, yielding state-of-the-art forget-retain trade-offs with negligible computational overhead.

[229] EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs

Shaoyu Liu,Jianing Li,Guanghui Zhao,Yunjian Zhang,Xiangyang Ji

Main category: cs.CV

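TL;DR: This paper presents EventBench, a unified benchmark for evaluating event-based multimodal large language models (MLLMs), with eight task metrics and a large-scale event stream dataset, and shows that existing models still struggle with fine-grained recognition and spatial reasoning.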

Details Motivation: Existing event-based multimodal large language models lack a unified, comprehensive evaluation benchmark, which limits systematic assessment and development of their capabilities. Method: Builds EventBench, comprising an open, diverse, large-scale event stream dataset that integrates 3D spatial reasoning, together with eight task metrics, and evaluates mainstream closed-source, open-source, and event-specific MLLMs. Result: The evaluation shows that current event-based MLLMs perform well on event stream understanding but fall clearly short on fine-grained recognition and 3D spatial reasoning tasks. Conclusion: EventBench provides a comprehensive evaluation platform for event-based MLLMs and supports future research on complex spatiotemporal reasoning. Abstract: Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.

[230] NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

Loick Chambon,Paul Couairon,Eloi Zablocki,Alexandre Boulch,Nicolas Thome,Matthieu Cord

Main category: cs.CV

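TL;DR: Proposes Neighborhood Attention Filtering (NAF), a zero-shot feature upsampling method that works with any vision foundation model (VFM) without retraining and achieves state-of-the-art performance while remaining highly efficient.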

Details Motivation: Existing upsampling methods trade generality against performance: classical filters are general but have fixed forms, while modern learned upsamplers perform better but must be retrained for each VFM. Method: Uses Cross-Scale Neighborhood Attention and RoPE to learn spatially and content-adaptive weights from the high-resolution input image, enabling zero-shot upsampling of features from any VFM. Result: NAF outperforms VFM-specific upsamplers on multiple downstream tasks and reaches state-of-the-art performance, scales to 2K feature maps, reconstructs intermediate-resolution feature maps at 18 FPS, and also performs well on image restoration. Conclusion: NAF is the first upsampling architecture that applies to arbitrary VFMs without fine-tuning, balancing performance, generality, and efficiency, with broad application potential. Abstract: Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility. Code and checkpoints are available at https://github.com/valeoai/NAF.

[231] RegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

Ming-Jhe Lee

Main category: cs.CV

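TL;DR: This study proposes RegDeepLab, a dual-branch multi-task learning framework that combines semantic segmentation with multi-scale regression to automatically assess the degree of embryo fragmentation in IVF, offering both high accuracy and visual explainability.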

Details Motivation: Existing deep learning methods for embryo fragmentation assessment are either not visually explainable or hard to translate directly into clinical grades, and multi-task learning suffers from gradient conflict and negative transfer. Method: RegDeepLab combines the DeepLabV3+ segmentation model with a multi-scale regression head, adds a two-stage decoupled training strategy and a Feature Injection mechanism, and introduces a Range Loss to support semi-supervised learning. Result: Standard end-to-end multi-task training with Feature Injection minimizes grading error (MAE=0.046) but degrades segmentation boundaries, whereas the two-stage decoupled strategy gives robust, high-precision grading while preserving SOTA-level segmentation accuracy (Dice=0.729). Conclusion: RegDeepLab combines high-accuracy automated grading with visual segmentation, providing a reliable and explainable solution for clinical decision support. Abstract: The degree of embryo fragmentation serves as a critical morphological indicator for assessing embryo developmental potential in In Vitro Fertilization (IVF) clinical decision-making. However, current manual grading processes are not only time-consuming but also limited by significant inter-observer variability and efficiency bottlenecks. Although deep learning has demonstrated potential in automated grading in recent years, existing solutions face a significant challenge: pure regression models lack the visual explainability required for clinical practice, while pure segmentation models struggle to directly translate pixel-level masks into precise clinical grades. This study proposes RegDeepLab, a dual-branch Multi-Task Learning (MTL) framework that integrates State-of-the-Art (SOTA) semantic segmentation (DeepLabV3+) with a multi-scale regression head. Addressing the common issues of "Gradient Conflict" and "Negative Transfer" in multi-task training, we propose a "Two-Stage Decoupled Training Strategy." Experimental results demonstrate that while standard end-to-end MTL training can minimize grading error (MAE=0.046) through our designed "Feature Injection" mechanism, it compromises the integrity of segmentation boundaries. In contrast, our decoupled strategy successfully provides robust and high-precision grading predictions while preserving SOTA-level segmentation accuracy (Dice=0.729). Furthermore, we introduce a "Range Loss" to effectively utilize large-scale discrete grading data for semi-supervised learning. This study ultimately presents a dual-module clinical auxiliary solution that combines high accuracy with visual explainability.
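
For readers less familiar with the two reported metrics, the sketch below computes the Dice coefficient and mean absolute error (MAE) on toy data. The third function, `range_loss`, is only one plausible reading of a loss that supervises a continuous prediction with a discrete grade interval; the abstract does not define the paper's Range Loss, so its form and name here are assumptions.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 1.0 if total == 0 else 2.0 * intersection / total

def mean_absolute_error(pred, target):
    """Average absolute difference between predicted and reference grades."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def range_loss(pred, low, high):
    """Hypothetical interval loss: zero inside [low, high], linear outside.

    Assumed form only; it shows how a coarse grade bucket could supervise a
    continuous fragmentation ratio without an exact label.
    """
    return max(0.0, low - pred) + max(0.0, pred - high)

dice = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])      # one overlapping pixel
mae = mean_absolute_error([0.10, 0.30], [0.15, 0.25])    # grades as fractions
inside = range_loss(0.12, low=0.10, high=0.25)           # prediction in bucket
outside = range_loss(0.30, low=0.10, high=0.25)          # prediction above bucket
```

Dice rewards overlap between predicted and reference masks, while MAE measures how far the regressed grade is from the reference, which is why the abstract reports them separately for the segmentation and grading branches.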

[232] Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding

Bowei Pu,Chuanbin Liu,Yifan Ge,Peichen Zhou,Yiwei Sun,Zhiyin Lu,Jiankang Wang,Hongtao Xie

Main category: cs.CV

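TL;DR: This paper proposes Video-PLR, a video reasoning framework that introduces a loop-based Perception Loop Reasoning (PLR) paradigm and a Factual-Aware Evaluator (FAE) to address the insufficient evidence and hallucinations caused by single-step perception in existing models.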

Details Motivation: Existing video reasoning LLMs rely on a single-step perception paradigm, which is prone to perception shortcuts, insufficient evidence, and hallucinations that hurt reasoning accuracy. Method: The authors propose the Perception Loop Reasoning (PLR) paradigm, in which the model describes and analyzes the video segment by segment with timestamps and then decides the next action; they also build a Factual-Aware Evaluator (FAE), trained on the large-scale hallucination judgment dataset AnetHallu-117K, that supplies an anti-hallucination reward to improve description accuracy. Result: Experiments show that Video-PLR reaches SOTA at both the 3B and 7B parameter scales with the best data efficiency, and that FAE performs comparably to GPT-4o. Conclusion: Through looped perception and an anti-hallucination reward, Video-PLR improves the sufficiency of visual perception and the reliability of reasoning on video. Abstract: Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, for the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves the state-of-the-art in both 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released on: https://github.com/BoweiPu/VideoPLR.

[233] Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Heeseung Yun,Joonil Na,Jaeyeon Kim,Calvin Murdock,Gunhee Kim

Main category: cs.CV

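TL;DR: This paper proposes EgoSpanLift, a method for forecasting where a person's visual attention will fall in a 3D environment (the 3D visual span). It lifts the task from 2D image planes to 3D scenes, combines a 3D U-Net with unidirectional transformers for spatio-temporal fusion, and outperforms existing methods on a large-scale benchmark curated by the authors.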

Details Motivation: Although human visual perception plays a key role in guiding actions, forecasting visual perception in 3D environments from an egocentric view remains underexplored, despite its value for AR/VR and assistive technologies. Method: The proposed EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry, extracts volumetric visual span regions, and is combined with a 3D U-Net and unidirectional transformers for spatio-temporal fusion to forecast the future 3D visual span. Result: On a curated benchmark of 364.6K samples, the method outperforms existing 2D gaze anticipation and 3D localization baselines, and achieves comparable results when projected back onto 2D image planes without additional 2D-specific training. Conclusion: EgoSpanLift offers an effective framework for egocentric 3D visual perception forecasting and advances human-centered scene understanding and interaction. Abstract: People continuously perceive and interact with their surroundings based on underlying intentions that drive their exploration and behaviors. While research in egocentric user and scene understanding has focused primarily on motion and contact-based interaction, forecasting human visual perception itself remains less explored despite its fundamental role in guiding human actions and its implications for AR/VR and assistive technologies. We address the challenge of egocentric 3D visual span forecasting, predicting where a person's visual perception will focus next within their three-dimensional environment. To this end, we propose EgoSpanLift, a novel method that transforms egocentric visual span forecasting from 2D image planes to 3D scenes. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry and extracts volumetric visual span regions. We further combine EgoSpanLift with 3D U-Net and unidirectional transformers, enabling spatio-temporal fusion to efficiently predict future visual span in the 3D grid. In addition, we curate a comprehensive benchmark from raw egocentric multisensory data, creating a testbed with 364.6K samples for 3D visual span forecasting. Our approach outperforms competitive baselines for egocentric 2D gaze anticipation and 3D localization while achieving comparable results even when projected back onto 2D image planes without additional 2D-specific training.

[234] Robust Posterior Diffusion-based Sampling via Adaptive Guidance Scale

Liav Hen,Tom Tirer,Raja Giryes,Shady Abu-Hussein

Main category: cs.CV

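TL;DR: The paper proposes Adaptive Posterior diffusion Sampling (AdaPS), a hyperparameter-free method that dynamically adjusts the likelihood step size through an observation-dependent weighting scheme and improves reconstruction quality on a range of imaging inverse problems.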

Details Motivation: When diffusion models are used for inverse problems, balancing the prior against data fidelity is a central challenge: overly aggressive or overly conservative updates both degrade reconstruction. Method: The authors design an observation-dependent weighting scheme that adapts the likelihood step size according to the agreement between two approximations of the intermediate likelihood gradient; the scheme adapts naturally to the diffusion schedule, time re-spacing, and injected stochasticity. Result: On super-resolution, Gaussian deblurring, and motion deblurring with CelebA-HQ and ImageNet-256, AdaPS outperforms existing diffusion-based methods, markedly improving perceptual quality with almost no loss of fidelity. Conclusion: AdaPS is a tuning-free, robust, general framework that improves inverse-problem reconstruction across noise levels, numbers of diffusion steps, and stochasticity settings. Abstract: Diffusion models have recently emerged as powerful generative priors for solving inverse problems, achieving state-of-the-art results across various imaging tasks. A central challenge in this setting lies in balancing the contribution of the prior with the data fidelity term: overly aggressive likelihood updates may introduce artifacts, while conservative updates can slow convergence or yield suboptimal reconstructions. In this work, we propose an adaptive likelihood step-size strategy to guide the diffusion process for inverse-problem formulations. Specifically, we develop an observation-dependent weighting scheme based on the agreement between two different approximations of the intractable intermediate likelihood gradients, that adapts naturally to the diffusion schedule, time re-spacing, and injected stochasticity. The resulting approach, Adaptive Posterior diffusion Sampling (AdaPS), is hyperparameter-free and improves reconstruction quality across diverse imaging tasks - including super-resolution, Gaussian deblurring, and motion deblurring - on CelebA-HQ and ImageNet-256 validation sets. AdaPS consistently surpasses existing diffusion-based baselines in perceptual quality with minimal or no loss in distortion, without any task-specific tuning. Extensive ablation studies further demonstrate its robustness to the number of diffusion steps, observation noise levels, and varying stochasticity.
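
The sketch below illustrates the general idea of scaling a likelihood step by how well two gradient estimates agree. It is not the AdaPS weighting rule, which the abstract does not spell out: using cosine similarity as the agreement measure, clipping it at zero, and averaging the two gradients are all assumptions, as are the function names.

```python
import math

def cosine_agreement(grad_a, grad_b, eps=1e-12):
    """Cosine similarity between two flattened gradient estimates."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    return dot / (norm_a * norm_b + eps)

def adaptive_likelihood_step(x, grad_a, grad_b):
    """Move x along the mean gradient, scaled by the agreement weight.

    The weight lies in [0, 1]: close to 1 when the two approximations point
    the same way, 0 when they are orthogonal or opposed (assumed rule).
    """
    weight = max(0.0, cosine_agreement(grad_a, grad_b))
    mean_grad = [(a + b) / 2.0 for a, b in zip(grad_a, grad_b)]
    x_new = [xi - weight * gi for xi, gi in zip(x, mean_grad)]
    return x_new, weight

# Agreeing estimates give a full step; conflicting estimates give no step.
x_agree, w_agree = adaptive_likelihood_step([1.0, 1.0], [0.5, 0.0], [0.5, 0.0])
x_clash, w_clash = adaptive_likelihood_step([1.0, 1.0], [0.5, 0.0], [-0.5, 0.0])
```

The design intent is that the step shrinks automatically where the likelihood gradient is poorly approximated, so no manually tuned guidance scale is needed in this toy version either.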

[235] Uncertainty Quantification in HSI Reconstruction using Physics-Aware Diffusion Priors and Optics-Encoded Measurements

Juan Romero,Qiang Fu,Matteo Ravasi,Wolfgang Heidrich

Main category: cs.CV

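TL;DR: The paper proposes HSDiff, a framework that uses Bayesian inference and diffusion models for uncertainty-aware hyperspectral image reconstruction, with an enhanced metameric augmentation that increases prior diversity.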

Details Motivation: Existing hyperspectral image datasets lack spectral diversity, so reconstruction methods hallucinate, particularly when evaluated on metamerism. Method: HSI reconstruction is formulated as Bayesian inference using an unconditionally trained, pixel-level diffusion prior with posterior diffusion sampling, and the training data is augmented with region-based metameric black and partition-of-union spectral upsampling. Result: HSDiff generates diverse hyperspectral samples consistent with the measurements of various image formation models and characterizes the posterior distribution; guidance with effective spectral encoding yields calibrated uncertainty estimates. Conclusion: HSDiff is a complete, high-performance method for uncertainty-aware HSI reconstruction and underscores the importance of effective spectral encoding in snapshot hyperspectral imaging. Abstract: Hyperspectral image reconstruction from a compressed measurement is a highly ill-posed inverse problem. Current data-driven methods suffer from hallucination due to the lack of spectral diversity in existing hyperspectral image datasets, particularly when they are evaluated for the metamerism phenomenon. In this work, we formulate hyperspectral image (HSI) reconstruction as a Bayesian inference problem and propose a framework, HSDiff, that utilizes an unconditionally trained, pixel-level diffusion prior and posterior diffusion sampling to generate diverse HSI samples consistent with the measurements of various hyperspectral image formation models. We propose an enhanced metameric augmentation technique using region-based metameric black and partition-of-union spectral upsampling to expand training with physically valid metameric spectra, strengthening the prior diversity and improving uncertainty calibration. We utilize HSDiff to investigate how the studied forward models shape the posterior distribution and demonstrate that guiding with effective spectral encoding provides calibrated informative uncertainty compared to non-encoded models. Through the lens of the Bayesian framework, HSDiff offers a complete, high-performance method for uncertainty-aware HSI reconstruction. Our results also reiterate the significance of effective spectral encoding in snapshot hyperspectral imaging.

[236] Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression

Md Tasnin Tanvir,Soumitra Das,Sk Md Abidar Rahaman,Ali Shiri Sichani

Main category: cs.CV

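TL;DR: This paper proposes two adaptive compression techniques, STTF and ANC, for efficient real-time inference of vision-language models on resource-constrained edge devices.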

Details Motivation: Edge AI for vision-language tasks demands low power, low memory, and real-time performance, which calls for more efficient model compression. Method: Sparse Temporal Token Fusion (STTF) dynamically reuses visual tokens through event-driven change detection, and Adaptive Neural Compression (ANC) conditionally activates encoder branches via a learned router, giving fine-grained adaptation to scene complexity. Result: The 3B-parameter TinyGPT-STTF reaches CIDEr 131.2 on the COCO 2017 test set, 17.6 points above LLaVA-1.5 7B, with 2.3x fewer parameters and 62x fewer on-device FLOPs; on DVS128 Gesture, STTF cuts the token count by 84% while retaining 95.6% accuracy, ANC cuts FLOPs by up to 90% in low-motion scenes, and latency drops by up to 13x. Conclusion: The proposed adaptive compression methods improve the efficiency and performance of vision-language models on edge devices, balancing accuracy against compute cost for practical deployment. Abstract: The demand for edge AI in vision-language tasks requires models that achieve real-time performance on resource-constrained devices with limited power and memory. This paper proposes two adaptive compression techniques -- Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC) -- that integrate algorithmic innovations with hardware-aware optimizations. Unlike previous approaches relying on static pruning or uniform scaling, STTF dynamically reuses visual tokens through event-driven change detection, while ANC conditionally activates encoder branches via a learned router, enabling fine-grained adaptation to scene complexity. Our 3B-parameter TinyGPT-STTF achieves CIDEr 131.2, BLEU-4 0.38, METEOR 0.31, and ROUGE-L 0.56 on the COCO 2017 test set, surpassing LLaVA-1.5 7B by 17.6 CIDEr points while using 2.3x fewer parameters and 62x fewer on-device FLOPs. TinyGPT-ANC reaches CIDEr 128.5. On event-based vision tasks, STTF reduces average token count by 84% (from 196 to 31 tokens) while preserving 95.6% accuracy on the DVS128 Gesture dataset, and ANC cuts FLOPs by up to 90% in low-motion scenes. Compared to strong baselines, our models improve accuracy by up to 4.4% and reduce latency by up to 13x. These results enable efficient deployment of capable vision-language models on real-world edge devices.
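
As a rough illustration of token reuse through change detection, the Python sketch below re-encodes a patch only when it differs enough from the previous frame and otherwise returns the cached token. The mean-absolute-difference test, the `threshold` parameter, and the function names are assumptions and are not taken from the paper. The last helper simply checks the reduction figure quoted in the abstract.

```python
def mean_abs_change(prev_patch, curr_patch):
    """Average absolute pixel difference between two same-sized patches."""
    return sum(abs(c - p) for p, c in zip(prev_patch, curr_patch)) / len(curr_patch)

def fuse_tokens(prev_patches, curr_patches, cached_tokens, encode, threshold):
    """Reuse cached tokens for static patches; re-encode only changed ones.

    `encode` stands in for the visual encoder (hypothetical callable).
    Returns the token list and the number of patches that were re-encoded.
    """
    tokens, recomputed = [], 0
    for prev, curr, cached in zip(prev_patches, curr_patches, cached_tokens):
        if mean_abs_change(prev, curr) > threshold:
            tokens.append(encode(curr))   # scene changed here: pay for encoding
            recomputed += 1
        else:
            tokens.append(cached)         # static region: reuse the old token
    return tokens, recomputed

def token_reduction_percent(before, after):
    """Relative reduction in token count, in percent."""
    return 100.0 * (before - after) / before

tokens, recomputed = fuse_tokens(
    prev_patches=[[0, 0], [1, 1]],
    curr_patches=[[0, 0], [5, 5]],
    cached_tokens=["cached-0", "cached-1"],
    encode=sum,            # toy encoder: the token is the sum of pixel values
    threshold=0.5,
)
reduction = token_reduction_percent(196, 31)   # figures quoted in the abstract
```

The quoted drop from 196 to 31 tokens works out to about 84.2%, which matches the 84% reported, and 7B versus 3B parameters is a ratio of about 2.3.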

[237] Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

Kai Jiang,Siqi Huang,Xiangyu Chen,Jiawei Shao,Hongyuan Zhang,Xuelong Li

Main category: cs.CV

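TL;DR: This paper proposes UNIFIER, a multimodal continual learning method that mitigates catastrophic forgetting when multimodal large language models (MLLMs) move across scenarios. The authors build MSVQA, a multimodal visual understanding dataset covering four scenarios, decouple the visual information of different scenarios, and add a consistency constraint to keep cross-scenario representations stable. Experiments show that the method alleviates forgetting and accumulates knowledge.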

Details Motivation: MLLMs deployed on devices must continually adapt to dynamic scenario changes in downstream tasks (such as background and viewpoint), yet existing methods are prone to catastrophic forgetting during continual learning, especially under real-world scenario shifts, which motivates studying forgetting in cross-scenario continual learning. Method: UNIFIER (1) builds the MSVQA dataset covering four scenarios: high altitude, underwater, low altitude, and indoor; (2) decouples the visual information of different scenarios into separate branches within each vision block and projects them into the same feature space; and (3) imposes a consistency constraint on the branch features to keep representations stable across scenarios. Result: Experiments on MSVQA show that UNIFIER alleviates forgetting on cross-scenario tasks and accumulates knowledge within the same scenario, outperforming existing continual learning methods. Conclusion: Through branch decoupling and a consistency constraint, UNIFIER improves the continual learning ability of MLLMs in dynamic scenarios and offers an effective solution for continual adaptation of multimodal models in real environments. Abstract: Continual learning in visual understanding aims to deal with catastrophic forgetting in Multimodal Large Language Models (MLLMs). MLLMs deployed on devices have to continuously adapt to dynamic scenarios in downstream tasks, such as variations in background and perspective, to effectively perform complex visual tasks. To this end, we construct a multimodal visual understanding dataset (MSVQA) encompassing four different scenarios and perspectives including high altitude, underwater, low altitude and indoor, to investigate the catastrophic forgetting in MLLMs under the dynamics of scenario shifts in real-world data streams. Furthermore, we propose mUltimodal coNtInual learning with MLLMs From multi-scenarIo pERspectives (UNIFIER) to address visual discrepancies while learning different scenarios. Specifically, it decouples the visual information from different scenarios into distinct branches within each vision block and projects them into the same feature space. A consistency constraint is imposed on the features of each branch to maintain the stability of visual representations across scenarios. Extensive experiments on the MSVQA dataset demonstrate that UNIFIER effectively alleviates forgetting of cross-scenario tasks and achieves knowledge accumulation within the same scenario.

[238] LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging

He Huang,Yujun Guo,Wei He

Main category: cs.CV

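TL;DR: The paper proposes LRDUN, a deep unfolding network based on low-rank decomposition. New imaging models for the spectral basis and the subspace image mitigate the ill-posedness of reconstructing 3D data from 2D measurements in spectral compressive imaging, and a Generalized Feature Unfolding Mechanism (GFUM) increases the network's representational capacity. LRDUN achieves state-of-the-art reconstruction quality at lower computational cost on simulated and real data.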

Details Motivation: Existing deep unfolding networks (DUNs) operate directly on high-dimensional hyperspectral data, which causes computational redundancy, and mapping 2D residuals back to 3D space is severely ill-posed. The paper aims to reduce the complexity and uncertainty of reconstruction by introducing low-rank decomposition. Method: Two new imaging models explicitly integrate low-rank decomposition into the sensing model, one for the spectral basis and one for the subspace image; on top of them, the Low-Rank Deep Unfolding Network (LRDUN) jointly solves the two subproblems within an unfolded proximal gradient descent (PGD) framework; a Generalized Feature Unfolding Mechanism (GFUM) decouples the physical rank in the data-fidelity term from the feature dimensionality in the prior module. Result: Extensive experiments on simulated and real datasets show that LRDUN achieves state-of-the-art reconstruction quality at significantly lower computational cost. Conclusion: By combining a low-rank prior with the deep unfolding framework, LRDUN improves both the efficiency and the accuracy of spectral compressive imaging reconstruction and suggests a direction for future efficient SCI systems. Abstract: Deep unfolding networks (DUNs) have achieved remarkable success and become the mainstream paradigm for spectral compressive imaging (SCI) reconstruction. Existing DUNs are derived from full-HSI imaging models, where each stage operates directly on the high-dimensional HSI, refining the entire data cube based on the single 2D coded measurement. However, this paradigm leads to computational redundancy and suffers from the ill-posed nature of mapping 2D residuals back to 3D space of HSI. In this paper, we propose two novel imaging models corresponding to the spectral basis and subspace image by explicitly integrating low-rank (LR) decomposition with the sensing model. Compared to recovering the full HSI, estimating these compact low-dimensional components significantly mitigates the ill-posedness. Building upon these novel models, we develop the Low-Rank Deep Unfolding Network (LRDUN), which jointly solves the two subproblems within an unfolded proximal gradient descent (PGD) framework. Furthermore, we introduce a Generalized Feature Unfolding Mechanism (GFUM) that decouples the physical rank in the data-fidelity term from the feature dimensionality in the prior module, enhancing the representational capacity and flexibility of the network. Extensive experiments on simulated and real datasets demonstrate that the proposed LRDUN achieves state-of-the-art (SOTA) reconstruction quality with significantly reduced computational cost.
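
The low-rank idea can be written down generically as follows. This is a sketch under assumed notation, not the paper's formulation: the symbols $\mathbf{A}$ (subspace image), $\mathbf{E}$ (spectral basis), rank $k$, sensing operator $\Phi$, step size $\rho$, and regularizer $R$ are all introduced here for illustration, and only the update for $\mathbf{A}$ is shown.

```latex
\begin{aligned}
\mathbf{X} &\approx \mathbf{A}\mathbf{E},
  \qquad \mathbf{A}\in\mathbb{R}^{HW\times k},\;
         \mathbf{E}\in\mathbb{R}^{k\times B},\; k \ll B, \\
\mathbf{y} &= \Phi(\mathbf{A}\mathbf{E}) + \mathbf{n}, \\
f(\mathbf{A},\mathbf{E}) &= \tfrac{1}{2}\,\bigl\|\mathbf{y}-\Phi(\mathbf{A}\mathbf{E})\bigr\|_2^2, \\
\mathbf{Z}^{(t)} &= \mathbf{A}^{(t)} - \rho\,\nabla_{\mathbf{A}} f\bigl(\mathbf{A}^{(t)},\mathbf{E}^{(t)}\bigr), \\
\mathbf{A}^{(t+1)} &= \operatorname{prox}_{\lambda R}\bigl(\mathbf{Z}^{(t)}\bigr).
\end{aligned}
```

Under this factorization the unknowns number $HWk + kB$ instead of $HWB$ for the full cube with $B$ bands, which is the sense in which estimating the compact components is less ill-posed; in an unfolded network the proximal operator would be replaced by a learned module at each stage.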

[239] Unified Deep Learning Platform for Dust and Fault Diagnosis in Solar Panels Using Thermal and Visual Imaging

Abishek Karthik,Sreya Mynampati,Pandiyaraju V

Main category: cs.CV

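TL;DR: The paper proposes a multi-task model built on CNN, ResNet, and the self-attention-based KerNet to detect dust and faults on solar panels within a single platform; using image preprocessing and multi-parameter analysis, it reports higher detection efficiency and accuracy across applications of different scales.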

Details Motivation: Solar panel output is strongly affected by dust, faults, and other factors that reduce efficiency, so efficient and accurate detection is needed to support routine maintenance and performance optimization. Method: Images are preprocessed with gamma correction and Gaussian filtering and combined with parameters such as power output, I-V curve, and voltage; CNN, ResNet, and the self-attention-based KerNet model then classify dust (such as shading and dirt) and faults (such as cracks and cell failure). Result: The model shows better efficiency and accuracy than existing models in detecting dust and faults, and applies to settings ranging from small household installations to large solar farms. Conclusion: The centralized multi-application platform improves solar panel maintenance efficiency and has good generalization and practical value. Abstract: Solar energy is one of the most abundant and widely tapped sources of renewable energy, with enormous future potential. Solar panel output can vary widely with factors such as intensity, temperature, dirt, and debris. We have implemented a model for detecting dust and faults on solar panels. These two applications are centralized on a single platform and can be used for routine maintenance and other checks. They are checked against various parameters such as power output, sinusoidal wave (I-V component of solar cell), voltage across each solar cell, and others. First, we filter and preprocess the obtained images using gamma removal and Gaussian filtering alongside predefined steps such as normalization. The first application detects whether a solar cell is dusty based on various predetermined metrics such as shadowing, leaves, droppings, air pollution, and other human activities, down to the level of fine-grained solar modules. The second detects faults on solar panels, such as cracks and cell malfunction, using thermal imaging. This centralized platform can be vital because solar panel efficiency varies across geographies (owing to air and heat effects), and it can be used effectively for anything from small-scale household needs to large-scale solar farm maintenance. It incorporates CNN and ResNet models together with a self-attention-based KerNet model, which are used for classification and result in a fine-tuned system that detects dust or any fault occurring. 
Thus, this multi-application model proves to be efficient and optimized in detecting dust and faults on solar panels. We have performed various comparisons, and the findings demonstrate that our model has better overall efficiency and accuracy than existing models.

[240] Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion

Haidong Kang,Ketong Qian,Yi Lu

Main category: cs.CV

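TL;DR: The paper proposes CD-FSCIL, a training-free few-shot class-incremental learning framework that replaces gradient optimization with a conditional diffusion process and adds multimodal learning with text descriptions generated by large language models, mitigating catastrophic forgetting while substantially reducing computational cost.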

Details Motivation: Existing FSCIL methods rely on gradient optimization, which makes training costly and catastrophic forgetting severe, and they struggle to adapt to new classes under extreme data scarcity; a training-free paradigm is needed. Method: The proposed Conditional Diffusion-driven FSCIL (CD-FSCIL) framework replaces gradient updates with a diffusion process and fuses visual features with text descriptions generated by large language models for multimodal learning, enabling training-free incremental learning. Result: The method achieves state-of-the-art performance on mainstream FSCIL benchmarks while substantially reducing computational and memory overhead. Conclusion: CD-FSCIL achieves class-incremental learning without gradient optimization and offers an efficient, scalable paradigm for continual learning. Abstract: Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the training cost explosion that inevitably arises as the number of novel classes increases, a consequence of relying on gradient learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: Can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the Conditional Diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that substitutes the conventional gradient update process with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representation under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample scarcity issue and improves generalization across novel classes. 
Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.

[241] DE-KAN: A Kolmogorov Arnold Network with Dual Encoder for accurate 2D Teeth Segmentation

Md Mizanur Rahman Mustakim,Jianwu Li,Sumya Bhuiyan,Mohammad Mehedi Hasan,Bing Han

Main category: cs.CV

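TL;DR: The paper proposes DE-KAN, a dual-encoder Kolmogorov Arnold Network that improves teeth segmentation in panoramic X-rays. It combines a ResNet-18 encoder with a customized CNN encoder, fuses their features through KAN bottleneck layers, and clearly outperforms existing methods on two benchmark datasets.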

Details Motivation: Anatomical variation, irregular tooth shapes, and overlapping structures limit conventional deep learning models on teeth segmentation in panoramic radiographs, so stronger feature representation is needed to improve segmentation accuracy. Method: The DE-KAN framework uses a ResNet-18 encoder on augmented inputs to extract global features and a customized CNN encoder on original inputs to capture local features, fuses the two through KAN-based bottleneck layers, and uses learnable nonlinear activation functions derived from the Kolmogorov Arnold theorem to improve expressiveness and interpretability. Result: On two benchmark dental X-ray datasets, DE-KAN outperforms state-of-the-art models in mIoU, Dice coefficient, accuracy, and recall, reaching a Dice coefficient of 97.1%, up to 4.7% higher than existing methods. Conclusion: By fusing multi-scale features through a dual-encoder structure and KAN bottleneck layers, DE-KAN markedly improves teeth segmentation and has strong potential for clinical use. Abstract: Accurate segmentation of individual teeth from panoramic radiographs remains a challenging task due to anatomical variations, irregular tooth shapes, and overlapping structures. These complexities often limit the performance of conventional deep learning models. To address this, we propose DE-KAN, a novel Dual Encoder Kolmogorov Arnold Network, which enhances feature representation and segmentation precision. The framework employs a ResNet-18 encoder for augmented inputs and a customized CNN encoder for original inputs, enabling the complementary extraction of global and local spatial features. These features are fused through KAN-based bottleneck layers, incorporating nonlinear learnable activation functions derived from the Kolmogorov Arnold representation theorem to improve learning capacity and interpretability. Extensive experiments on two benchmark dental X-ray datasets demonstrate that DE-KAN outperforms state-of-the-art segmentation models, achieving mIoU of 94.5%, Dice coefficient of 97.1%, accuracy of 98.91%, and recall of 97.36%, representing up to +4.7% improvement in Dice compared to existing methods.

[242] HiFi-MambaV2: Hierarchical Shared-Routed MoE for High-Fidelity MRI Reconstruction

Pengcheng Fang,Hongli Chen,Guangzhen Yao,Jian Shi,Fangfang Tang,Xiaohao Cai,Shanshan Shan,Feng Liu

Main category: cs.CV

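TL;DR: The paper proposes HiFi-MambaV2, a hierarchical shared-routed MoE Mamba architecture that combines frequency decomposition with content-adaptive computation for high-fidelity MRI reconstruction and outperforms existing methods across multiple datasets and acceleration factors.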

Details Motivation: Reconstructing high-fidelity MR images from undersampled k-space data requires recovering high-frequency details while maintaining anatomical coherence, and existing methods fall short in stability and high-frequency recovery. Method: An SF-Lap pyramid extracts stable low- and high-frequency streams, a hierarchical shared-routed MoE performs per-pixel sparse expert dispatch, and a lightweight global context path strengthens long-range reasoning and data consistency. Result: On fastMRI, CC359, and other datasets, in single- and multi-coil settings and at multiple acceleration factors, the model beats CNN, Transformer, and Mamba baselines in PSNR, SSIM, and NMSE, with clear gains in high-frequency detail and structural fidelity. Conclusion: Through frequency-aware and adaptive computation, HiFi-MambaV2 achieves reliable and robust MRI reconstruction with strong performance and stability. Abstract: Reconstructing high-fidelity MR images from undersampled k-space data requires recovering high-frequency details while maintaining anatomical coherence. We present HiFi-MambaV2, a hierarchical shared-routed Mixture-of-Experts (MoE) Mamba architecture that couples frequency decomposition with content-adaptive computation. The model comprises two core components: (i) a separable frequency-consistent Laplacian pyramid (SF-Lap) that delivers alias-resistant, stable low- and high-frequency streams; and (ii) a hierarchical shared-routed MoE that performs per-pixel top-1 sparse dispatch to shared experts and local routers, enabling effective specialization with stable cross-depth behavior. A lightweight global context path is fused into an unrolled, data-consistency-regularized backbone to reinforce long-range reasoning and preserve anatomical coherence. Evaluated on fastMRI, CC359, ACDC, M4Raw, and Prostate158, HiFi-MambaV2 consistently outperforms CNN-, Transformer-, and prior Mamba-based baselines in PSNR, SSIM, and NMSE across single- and multi-coil settings and multiple acceleration factors, with consistent improvements in high-frequency detail and overall structural fidelity. These results demonstrate that HiFi-MambaV2 enables reliable and robust MRI reconstruction.

[243] Zero-Shot Video Deraining with Video Diffusion Models

Tuomas Varanka,Juan Luis Gonzalez,Hyeongwoo Kim,Pablo Garrido,Xu Yao

Main category: cs.CV

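TL;DR: This paper proposes a zero-shot video deraining method that needs neither synthetic data nor model fine-tuning; it uses a pretrained text-to-video diffusion model with negative prompting and an attention switching mechanism to remove rain from dynamic scenes.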

Details Motivation: Existing video deraining methods depend on paired datasets or static scenes and generalize poorly to real-world and dynamic scenes; fine-tuning diffusion models also weakens their generative prior and limits generalization to unseen cases. Method: The method uses a pretrained text-to-video diffusion model, inverts the input video into the model's latent space, and steers reconstruction away from the concept of rain with negative prompting; an attention switching mechanism preserves dynamic backgrounds and structural consistency. Result: Extensive experiments on real-world rain datasets show substantial improvements over prior methods and strong generalization without supervised training. Conclusion: The proposed method is the first zero-shot video deraining approach for complex dynamic scenes; it requires no synthetic data or fine-tuning and has good prospects for practical use. Abstract: Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works in fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that requires neither synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of diffusion models, its reconstruction process can be intervened and pushed away from the model's concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found is crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.
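
Negative prompting is commonly implemented by substituting the negative-prompt prediction for the unconditional branch of classifier-free guidance. The Python sketch below shows that generic combination rule on plain lists; it is not the paper's pipeline (no inversion, no attention switching), and the function name and toy numbers are assumptions.

```python
def guided_noise(eps_positive, eps_negative, guidance_scale):
    """Classifier-free guidance with a negative prompt.

    eps_positive: noise predicted with the desired prompt
    eps_negative: noise predicted with the negative prompt (e.g. "rain")
    The result starts at the negative prediction and moves toward, then
    past, the positive one as guidance_scale grows beyond 1.
    """
    return [n + guidance_scale * (p - n)
            for p, n in zip(eps_positive, eps_negative)]

# With scale 1 the result equals the positive prediction; with scale > 1 it
# extrapolates further away from the negative ("rain") prediction.
same = guided_noise([1.0, 0.0], [3.0, 0.0], guidance_scale=1.0)
pushed = guided_noise([1.0, 0.0], [3.0, 0.0], guidance_scale=2.0)
```

In this toy case the first component moves from 3.0 (negative) through 1.0 (positive) to -1.0 at scale 2, which is the "pushed away from the concept of rain" behavior in its simplest form.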

[244] C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Kuan Wei Huang,Brandon Li,Bharath Hariharan,Noah Snavely

Main category: cs.CV

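TL;DR: This paper introduces C3, a new dataset for predicting correspondences between ground-level photos and floor plans; it is built by reconstructing scenes in 3D and manually registering them to floor plans, and training on it improves cross-modal geometric reasoning.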

Details Motivation: Existing datasets are limited when inputs come from different viewpoints or modalities (such as aerial vs. ground, or photos vs. abstract drawings) and cannot effectively support cross-modal geometric understanding. Method: Scenes are reconstructed in 3D from Internet photo collections via structure-from-motion and manually registered to collected floor plans, from which correspondences between images and floor plans are derived. Result: C3 contains 90K paired floor plans and photos across 597 scenes, with 153M pixel-level correspondences and 85K camera poses; training on the new data improves on the best performing method by 34% in RMSE. Conclusion: State-of-the-art correspondence models still struggle on this task; C3 helps advance research on cross-modal geometric reasoning and exposes open challenges in the area. Abstract: Geometric models like DUSt3R have shown great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs are from vastly different viewpoints (e.g., aerial vs. ground) or modalities (e.g., photos vs. abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo--floor plan reasoning are limited, either lacking in varying modalities (VIGOR) or lacking in correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondence between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we can improve on the best performing method by 34% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address.

[245] PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation

Samarth Chopra,Jing Liang,Gershom Seneviratne,Dinesh Manocha

Main category: cs.CV

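TL;DR: PhysGS is a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision-language priors and models uncertainty; it clearly outperforms deterministic baselines on mass, hardness, and friction estimation.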

Details Motivation: Existing 3D reconstruction methods focus on geometry and appearance and cannot infer physical properties such as friction, stiffness, hardness, and material composition, which are essential for robots to interact safely and effectively with their environment. Method: Physical property estimation is formulated as Bayesian inference over Gaussian splats; visual cues and vision-language priors are used to iteratively update material and property beliefs, and both aleatoric and epistemic uncertainty are modeled. Result: Across several real-world datasets, PhysGS improves mass estimation accuracy by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Conclusion: PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation. Abstract: Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation. Additional results are available at https://samchopra2003.github.io/physgs.
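
The phrase "beliefs are iteratively refined as new observations arrive" can be illustrated with a plain categorical Bayes update for a single point. The sketch below is a generic textbook update, not the PhysGS formulation: the material names, likelihood values, and rough density figures are illustrative assumptions, and entropy is used here as a simple stand-in for belief uncertainty.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over materials: prior times likelihood, renormalized."""
    unnormalized = {m: prior[m] * likelihood[m] for m in prior}
    evidence = sum(unnormalized.values())
    return {m: v / evidence for m, v in unnormalized.items()}

def expected_property(belief, property_values):
    """Belief-weighted average of a per-material property (e.g. density)."""
    return sum(belief[m] * property_values[m] for m in belief)

def entropy(belief):
    """Shannon entropy in nats; lower means a more confident belief."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

# Hypothetical two-material example with rough density values in kg/m^3.
belief = {"wood": 0.5, "steel": 0.5}
density = {"wood": 700.0, "steel": 7850.0}
observation = {"wood": 0.8, "steel": 0.2}   # assumed likelihood from one view

belief_1 = bayes_update(belief, observation)      # after one observation
belief_2 = bayes_update(belief_1, observation)    # after a second, similar one
density_estimate = expected_property(belief_1, density)
```

Each consistent observation sharpens the belief, so the entropy falls from one update to the next while the expected density moves toward the favored material.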

[246] Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation

Wei Dong,Han Zhou,Junwei Lin,Jun Chen

Main category: cs.CV

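TL;DR: The paper proposes an unsupervised framework for real-world dark image restoration based on visual autoregressive (VAR) modeling guided by vision-language model (VLM) priors; it combines adaptive curve estimation, spatial-frequency-aware rotary positional encodings, and a phase-domain modulation strategy to reach state-of-the-art performance without paired data.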

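Details Motivation: Existing low-light enhancement methods mostly rely on paired data or fail to model dynamic illumination and blur characteristics, which leads to poor generalization, and they handle the joint degradation of complex noise and blur inadequately. Method: The VAR-based generative framework uses visibility and blur scores extracted by a VLM as perceptual priors; it modulates illumination through adaptive curve estimation, introduces spatial-frequency-aware SF-RoPE to better model blur-degraded structures, and applies a recursive phase-domain modulation strategy that iteratively refines the result to suppress blur artifacts. Result: The method achieves state-of-the-art unsupervised dark image restoration on multiple benchmark datasets, markedly improving visual quality and contrast in scenes where low light, noise, and blur coexist. Conclusion: By combining VLM-guided perceptual priors with a VAR generative model, the framework handles the complex joint degradations of real dark images without paired training data and shows good generalization and application potential. Abstract: Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. Existing methods often rely on paired data or fail to model dynamic illumination and blur characteristics, leading to poor generalization. To tackle this, we propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from the vision-language model (VLM). Specifically, to supply informative conditioning cues for VAR models, we deploy an adaptive curve estimation scheme to modulate the diverse illumination based on VLM-derived visibility scores. In addition, we integrate dynamic and spatial-frequency-aware Rotary Positional Encodings (SF-RoPE) into VAR to enhance its ability to model structures degraded by blur. Furthermore, we propose a recursive phase-domain modulation strategy that mitigates blur-induced artifacts in the phase domain via bounded iterative refinement guided by VLM-assessed blur scores. Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.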

[247] Stage-Specific Benchmarking of Deep Learning Models for Glioblastoma Follow-Up MRI

Wenhao Guo,Golrokh Mirzaei

Main category: cs.CV

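TL;DR: This study performs a stage-specific, cross-sectional benchmark of deep learning models on follow-up MRI from 180 glioblastoma patients at different post-radiotherapy follow-up stages, comparing 11 architecture families on distinguishing true tumor progression (TP) from pseudoprogression (PsP). Overall accuracy was about 0.70-0.74, with discrimination improving at the second follow-up. A Mamba+CNN hybrid offered the best accuracy-efficiency trade-off, Transformer models reached higher AUCs at high computational cost, and lightweight CNNs were efficient but less stable. The study stresses the importance of standardized training protocols, notes the intrinsic difficulty of the task and the data imbalance, and provides a benchmark and direction for future work with longitudinal modeling, multi-sequence MRI, and large multi-center data.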

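Details Motivation: In glioblastoma treatment, distinguishing true tumor progression (TP) from treatment-related pseudoprogression (PsP) at an early stage is very difficult, and existing imaging methods are prone to misjudgment that affects clinical decisions. Reliable automated tools are therefore needed, in particular an assessment of how model performance differs across follow-up stages. Method: Using the Burdenko GBM Progression cohort (n = 180) and a unified, quality-control-driven pipeline, 11 representative deep learning families (CNNs, LSTMs, hybrids, Transformers, and selective state-space models) were trained and validated independently. Patient-level cross-validation was used, and MRI scans from different post-radiotherapy time points were analyzed by stage to evaluate classification performance (accuracy, F1 score, AUC, and so on) at each follow-up stage. Result: Overall accuracy was similar across the two stages (~0.70-0.74), but F1 and AUC improved for several models at the second follow-up, indicating better separability later in the disease course. The Mamba+CNN hybrid gave the best balance of accuracy and efficiency; Transformer variants had higher AUCs at substantially higher computational cost; lightweight CNNs were efficient but less reliable. Performance was clearly sensitive to batch size, and overall discrimination remained limited, reflecting the intrinsic difficulty of separating TP from PsP and the class imbalance of the dataset. Conclusion: The study establishes the first follow-up-stage-specific benchmark of deep learning models, shows how performance changes over the disease course, and stresses the importance of standardized training. Although current performance is limited by data scale and task complexity, the results support future work that integrates longitudinal data, multi-sequence MRI, and larger multi-center cohorts. Abstract: Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up. We present the first stage-specific, cross-sectional benchmarking of deep learning models for follow-up MRI using the Burdenko GBM Progression cohort (n = 180). We analyze different post-RT scans independently to test whether architecture performance depends on time-point. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, QC-driven pipeline with patient-level cross-validation. Across both stages, accuracies were comparable (~0.70-0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy-efficiency trade-off, while transformer variants delivered competitive AUCs at substantially higher computational cost and lightweight CNNs were efficient but less reliable. Performance also showed sensitivity to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of TP vs. PsP and the dataset's size imbalance. These results establish a stage-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.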
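Patient-level cross-validation, as mentioned in the abstract, means that all scans from one patient land in the same fold, so no patient contributes to both training and validation. The following is a minimal sketch under that reading only; the function name `patient_level_folds`, the round-robin assignment, and the example IDs are assumptions for illustration and say nothing about the authors' actual pipeline.

```python
from collections import defaultdict

def patient_level_folds(scan_patient_ids, n_folds):
    """Assign scan indices to folds so each patient's scans share one fold."""
    patients = sorted(set(scan_patient_ids))
    # Round-robin assignment of patients (not scans) to folds.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for scan_index, patient in enumerate(scan_patient_ids):
        folds[fold_of_patient[patient]].append(scan_index)
    return [folds[k] for k in range(n_folds)]

# Hypothetical example: six scans from four patients.
scan_patient_ids = ["p1", "p1", "p2", "p3", "p3", "p4"]
folds = patient_level_folds(scan_patient_ids, n_folds=2)

# Collect which patients appear in each fold to check for leakage.
patients_per_fold = [{scan_patient_ids[i] for i in fold} for fold in folds]
```

A real setup would usually also stratify by label so that each fold keeps a similar TP/PsP ratio; that step is omitted here.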

[248] NeAR: Coupled Neural Asset-Renderer Stack

Hong Li,Chongjie Ye,Houyuan Chen,Weiqing Xiao,Ziyang Yan,Lixing Xiao,Zhaoxi Chen,Jianfeng Xiang,Shaocong Xu,Xuhui Liu,Yikai Wang,Baochang Zhang,Xiaoguang Han,Jiaolong Yang,Hao Zhao

Main category: cs.CV

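TL;DR: This paper proposes NeAR, a coupled neural asset-renderer stack that jointly designs the neural asset representation and the renderer to obtain a high-fidelity, consistent, and efficient end-to-end learnable graphics pipeline.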

Details Motivation: Existing neural asset generation and neural rendering methods are developed separately, which limits overall performance; the authors argue that jointly optimizing the asset representation and the renderer improves fidelity, consistency, and efficiency. Method: On the asset side, proposes a lighting-homogenized neural asset (Lighting-Homogenized SLAT) based on Trellis-style structured 3D latents, using a generative backbone to extract geometry and material cues from casually lit images; on the renderer side, designs a lighting-aware neural renderer with HDR environment maps and explicit view embeddings for real-time relightable rendering. Result: On four tasks (G-buffer forward rendering, single-image reconstruction, single-image relighting under unknown lighting, and novel-view relighting), NeAR outperforms state-of-the-art methods in both quantitative metrics and visual quality. Conclusion: The coupled co-design of neural assets and renderers outperforms the traditional separated architecture and may inspire future graphics pipelines that treat the two as cooperating components. Abstract: Neural asset authoring and neural rendering have emerged as fundamentally disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the potential of jointly designing the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with NeAR: a Coupled Neural Asset-Renderer Stack. On the asset side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the renderer side, we design a lighting-aware neural renderer that uses this neural asset, along with explicit view embeddings and HDR environment maps, to achieve real-time, relightable rendering. We validate NeAR on four tasks: (1) G-buffer-based forward rendering, (2) random-lit single-image reconstruction, (3) unknown-lit single-image relighting, and (4) novel-view relighting. Our coupled stack surpasses state-of-the-art baselines in both quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires future graphics stacks that view neural assets and renderers as co-designed components instead of independent entities.

[249] RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data

Wenchao Ma,Dario Kneubuehler,Maurice Chu,Ian Sachs,Haomiao Jiang,Sharon Xiaolei Huang

Main category: cs.CV

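TL;DR: This paper proposes RigAnyFace (RAF), a scalable neural auto-rigging framework that generates expression deformations for facial meshes of diverse topologies, including meshes with multiple disconnected components.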

Details Motivation: Existing facial auto-rigging methods generalize poorly across topologies, especially for facial meshes with disconnected parts such as eyeballs, and depend on expensive manually rigged data. Method: RAF predicts deformations with a triangulation-agnostic surface learning network, combined with an architecture designed for FACS parameter conditioning and for handling disconnected components, to produce FACS-compliant expression blendshapes from a neutral facial mesh. Training uses a small set of meshes rigged by professional artists as 3D supervision, plus a 2D supervision strategy for unlabeled neutral meshes that scales the data and improves generalization. Result: Experiments show that RAF accurately handles diverse topologies on both artist-crafted assets and in-the-wild samples, supports disconnected components such as eyeballs, and clearly outperforms prior methods. Conclusion: RAF achieves highly generalizable facial auto-rigging with support for complex topologies and multi-component structures, advancing more detailed expression animation. Abstract: In this paper, we present RigAnyFace (RAF), a scalable neural auto-rigging framework for facial meshes of diverse topologies, including those with multiple disconnected components. RAF deforms a static neutral facial mesh into industry-standard FACS poses to form an expressive blendshape rig. Deformations are predicted by a triangulation-agnostic surface learning network augmented with our tailored architecture design to condition on FACS parameters and efficiently process disconnected components. For training, we curated a dataset of facial meshes, with a subset meticulously rigged by professional artists to serve as accurate 3D ground truth for deformation supervision. Due to the high cost of manual rigging, this subset is limited in size, constraining the generalization ability of models trained exclusively on it. To address this, we design a 2D supervision strategy for unlabeled neutral meshes without rigs. This strategy increases data diversity and allows for scaled training, thereby enhancing the generalization ability of models trained on this augmented data. Extensive experiments demonstrate that RAF is able to rig meshes of diverse topologies on not only our artist-crafted assets but also in-the-wild samples, outperforming previous works in accuracy and generalizability. Moreover, our method advances beyond prior work by supporting multiple disconnected components, such as eyeballs, for more detailed expression animation. Project page: https://wenchao-m.github.io/RigAnyFace.github.io

[250] Functional Localization Enforced Deep Anomaly Detection Using Fundus Images

Jan Benedikt Ruhland,Thorsten Papenbrock,Jan-Peter Sowa,Ali Canbay,Nicole Eter,Bernd Freisleben,Dominik Heider

Main category: cs.CV

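TL;DR: This study systematically evaluates a Vision Transformer (ViT) classifier under multiple augmentation and enhancement strategies for detecting retinal diseases (such as diabetic retinopathy and age-related macular degeneration) on several heterogeneous public datasets and the in-house AEyeDB dataset. The ViT performs strongly, reaching an AUC of 0.91 on the Papila dataset with geometric augmentation and outperforming earlier convolutional baselines. The authors also develop a GANomaly-based anomaly detector and apply probabilistic calibration, improving interpretability and clinical applicability.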

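Details Motivation: Fundus images vary widely in imaging quality, early lesions are subtle, and there is domain shift across datasets, so existing methods struggle to detect retinal diseases reliably; a robust, well-generalizing architecture and training strategy is needed. Method: Uses a Vision Transformer (ViT) as the base classifier and systematically evaluates the effect of several enhancement strategies (geometric transforms, color augmentation, histogram equalization, Laplacian enhancement), with multi-dataset training on several public datasets and the high-quality in-house AEyeDB dataset; also builds a GANomaly-based anomaly detector to improve generalization to unseen samples and applies GUESS for probabilistic calibration to support threshold-independent clinical decisions. Result: ViT accuracy ranged from 0.789 to 0.843 across datasets, and with geometric augmentation it reached an AUC of 0.91 on Papila, exceeding the previous convolutional ensemble baseline (AUC 0.87). Diabetic retinopathy and age-related macular degeneration were detected well, while glaucoma remained hard to identify. Geometric and color augmentation gave the most stable gains, histogram equalization helped datasets with subtle structure, and Laplacian enhancement generally reduced performance. The GANomaly anomaly detector reached an AUC of 0.76 with reconstruction-based explainability, and GUESS calibration improved the reliability of predictions. Conclusion: A Vision Transformer with suitable augmentation shows strong and stable retinal disease classification on multi-source fundus images, with better generalization under multi-dataset training; adding anomaly detection and probabilistic calibration further improves robustness and clinical usefulness, offering a feasible path toward automated ophthalmic diagnosis systems. Abstract: Reliable detection of retinal diseases from fundus images is challenged by the variability in imaging quality, subtle early-stage manifestations, and domain shift across datasets. In this study, we systematically evaluated a Vision Transformer (ViT) classifier under multiple augmentation and enhancement strategies across several heterogeneous public datasets, as well as the AEyeDB dataset, a high-quality fundus dataset created in-house and made available for the research community. The ViT demonstrated consistently strong performance, with accuracies ranging from 0.789 to 0.843 across datasets and diseases. Diabetic retinopathy and age-related macular degeneration were detected reliably, whereas glaucoma remained the most frequently misclassified disease. Geometric and color augmentations provided the most stable improvements, while histogram equalization benefited datasets dominated by structural subtlety. Laplacian enhancement reduced performance across different settings. On the Papila dataset, the ViT with geometric augmentation achieved an AUC of 0.91, outperforming previously reported convolutional ensemble baselines (AUC of 0.87), underscoring the advantages of transformer architectures and multi-dataset training. To complement the classifier, we developed a GANomaly-based anomaly detector, achieving an AUC of 0.76 while providing inherent reconstruction-based explainability and robust generalization to unseen data. Probabilistic calibration using GUESS enabled threshold-independent decision support for future clinical implementation.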

[251] Health system learning achieves generalist neuroimaging models

Akhil Kondepudi,Akshay Rao,Chenhui Zhao,Yiwei Lyu,Samir Harake,Soumyanil Banerjee,Rushikesh Joshi,Anna-Katharina Meissner,Renly Hou,Cheng Jiang,Asadur Chowdury,Ashok Srinivasan,Brian Athey,Vikas Gulani,Aditya Pandey,Honglak Lee,Todd Hollon

Main category: cs.CV

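TL;DR: This paper presents NeuroVFM, a neuroimaging visual foundation model trained on 5.24 million clinical MRI and CT volumes. Through a "health system learning" paradigm it overcomes the scarcity of neuroimaging in public data, achieves state-of-the-art performance on tasks such as radiologic diagnosis and report generation, and shows interpretable visual grounding and anatomical understanding.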

Details Motivation: Neuroimaging data such as MRI and CT contain identifiable facial features and are hard to share publicly, so current frontier AI models lack the private clinical data needed for training, which limits their performance in clinical medicine. Method: Proposes the "health system learning" paradigm, which learns directly from uncurated clinical data produced during routine care at health systems, and builds a visual foundation model named NeuroVFM that models 3D neuroimaging with a scalable volumetric joint-embedding predictive architecture. Result: NeuroVFM reaches state-of-the-art performance on multiple clinical tasks, including disease diagnosis and report generation; it shows emergent neuroanatomic understanding and visual grounding of diagnostic findings; paired with open-source language models, its reports surpass frontier models such as GPT-5 in accuracy, clinical triage, and expert preference, with fewer hallucinations and critical errors. Conclusion: Health system learning offers an effective route to generalist medical AI; NeuroVFM demonstrates the feasibility and advantages of training foundation models on real clinical data and provides a scalable framework for clinical AI systems. Abstract: Frontier artificial intelligence (AI) models, such as OpenAI's GPT-5 and Meta's DINOv3, have advanced rapidly through training on internet-scale public data, yet such systems lack access to private clinical data. Neuroimaging, in particular, is underrepresented in the public domain due to identifiable facial features within MRI and CT scans, fundamentally restricting model performance in clinical medicine. Here, we show that frontier models underperform on neuroimaging tasks and that learning directly from uncurated data generated during routine clinical care at health systems, a paradigm we call health system learning, yields high-performance, generalist neuroimaging models. We introduce NeuroVFM, a visual foundation model trained on 5.24 million clinical MRI and CT volumes using a scalable volumetric joint-embedding predictive architecture. NeuroVFM learns comprehensive representations of brain anatomy and pathology, achieving state-of-the-art performance across multiple clinical tasks, including radiologic diagnosis and report generation. The model exhibits emergent neuroanatomic understanding and interpretable visual grounding of diagnostic findings. When paired with open-source language models through lightweight visual instruction tuning, NeuroVFM generates radiology reports that surpass frontier models in accuracy, clinical triage, and expert preference. Through clinically grounded visual understanding, NeuroVFM reduces hallucinated findings and critical errors, offering safer clinical decision support. These results establish health system learning as a paradigm for building generalist medical AI and provide a scalable framework for clinical foundation models.

[252] From Healthy Scans to Annotated Tumors: A Tumor Fabrication Framework for 3D Brain MRI Synthesis

Nayu Dong,Townim Chowdhury,Hieu Phan,Mark Jenkinson,Johan Verjans,Zhibin Liao

Main category: cs.CV

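TL;DR: Proposes Tumor Fabrication (TF), a two-stage framework for unpaired 3D brain tumor synthesis that uses only healthy scans and a small amount of real annotated data to generate large volumes of paired synthetic data, improving tumor segmentation in low-data settings.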

Details Motivation: Annotated MRI tumor data are scarce, and existing synthesis methods either rely on manual modeling or need many training samples, which makes them hard to apply when clinical data are limited. Method: Proposes the TF framework, with a coarse tumor synthesis stage followed by a generative-model-based refinement stage, using healthy brain scans and a small amount of real annotated data for fully automated, unpaired 3D tumor synthesis. Result: The synthetic image-label pairs significantly improve downstream tumor segmentation in low-data regimes. Conclusion: TF offers a scalable and reliable solution for medical image data enrichment and addresses data scarcity in clinical AI. Abstract: The scarcity of annotated Magnetic Resonance Imaging (MRI) tumor data presents a major obstacle to accurate and automated tumor segmentation. While existing data synthesis methods offer promising solutions, they often suffer from key limitations: manual modeling is labor-intensive and requires expert knowledge. Deep generative models may be used to augment data and annotation, but they typically demand large amounts of training pairs in the first place, which is impractical in data-limited clinical settings. In this work, we propose Tumor Fabrication (TF), a novel two-stage framework for unpaired 3D brain tumor synthesis. The framework comprises a coarse tumor synthesis process followed by a refinement process powered by a generative model. TF is fully automated and leverages only healthy image scans along with a limited amount of real annotated data to synthesize large volumes of paired synthetic data for enriching downstream supervised segmentation training. We demonstrate that our synthetic image-label pairs used as data enrichment can significantly improve performance on downstream tumor segmentation tasks in low-data regimes, offering a scalable and reliable solution for medical image enrichment and addressing critical challenges in data scarcity for clinical AI applications.

[253] Robust Physical Adversarial Patches Using Dynamically Optimized Clusters

Harrison Bagley,Will Meakin,Simon Lucey,Yee Wei Law,Tat-Jun Chin

Main category: cs.CV

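TL;DR: Proposes a superpixel-based regularization method that uses the SLIC algorithm and the Implicit Function Theorem to make adversarial patches robust across scales, improving physical attack performance.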

Details Motivation: When existing adversarial patches are rescaled, interpolation mixes colors and destroys high-frequency features, weakening the attack; scale changes are common in the physical world, yet scale robustness has received little study. Method: Uses the SLIC algorithm to dynamically cluster pixels in the adversarial patch and the Implicit Function Theorem to backpropagate gradients, optimizing superpixel boundaries and colors so that the patch structure stays stable across scales. Result: The method shows stronger attack performance in both the digital and physical domains, especially under rescaling and real printing conditions, and its practical effectiveness is verified with an evaluation protocol based on screens and cardboard cut-outs. Conclusion: The proposed superpixel regularization improves the robustness of adversarial patches to scale changes, with better realizability and stability in real scenes. Abstract: Physical adversarial attacks on deep learning systems are concerning due to the ease of deploying such attacks, usually by placing an adversarial patch in a scene to manipulate the outcomes of a deep learning model. Training such patches typically requires regularization that improves physical realizability (e.g., printability, smoothness) and/or robustness to real-world variability (e.g., deformations, viewing angle, noise). One type of variability that has received little attention is scale variability. When a patch is rescaled, either digitally through downsampling/upsampling or physically through changing imaging distances, interpolation-induced color mixing occurs. This smooths out pixel values, resulting in a loss of high-frequency patterns and degrading the adversarial signal. To address this, we present a novel superpixel-based regularization method that guides patch optimization to scale-resilient structures. Our approach employs the Simple Linear Iterative Clustering (SLIC) algorithm to dynamically cluster pixels in an adversarial patch during optimization. The Implicit Function Theorem is used to backpropagate gradients through SLIC to update the superpixel boundaries and color. This produces patches that maintain their structure over scale and are less susceptible to interpolation losses. Our method achieves greater performance in the digital domain, and when realized physically, these performance gains are preserved, leading to improved physical performance. Real-world performance was objectively assessed using a novel physical evaluation protocol that utilizes screens and cardboard cut-outs to systematically vary real-world conditions.
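The interpolation-induced color mixing described in the abstract can be seen in a toy example. The sketch below is an illustration only, not the paper's method: it uses a plain 2x2 box-filter downsample (an assumed stand-in for whatever resampling a camera or resize step applies) and compares a per-pixel checkerboard with a pattern made of uniform 2x2 blocks, which loosely plays the role of a superpixel-structured patch. The helper names are hypothetical.

```python
def downsample_2x(image):
    """Average each non-overlapping 2x2 block (a simple box-filter downsample)."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]

def contrast(image):
    """Difference between the brightest and darkest pixel."""
    values = [v for row in image for v in row]
    return max(values) - min(values)

# High-frequency pattern: the value alternates at every pixel.
fine = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
# Block-structured pattern: the value is constant inside each 2x2 block.
blocky = [[float((r // 2 + c // 2) % 2) for c in range(4)] for r in range(4)]

fine_small = downsample_2x(fine)
blocky_small = downsample_2x(blocky)
```

After downsampling, the per-pixel checkerboard collapses to a uniform gray and loses all contrast, while the block-structured pattern keeps its full contrast, which is the intuition behind preferring scale-resilient structures.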

[254] Data Augmentation Strategies for Robust Lane Marking Detection

Flora Lian,Dinh Quang Huynh,Hector Penades,J. Stephany Berrio Perez,Mao Shan,Stewart Worrall

Main category: cs.CV

TL;DR: Proposes a generative-AI-based data augmentation method to improve the generalization of lane detection models for side-mounted cameras.

Details Motivation: Lane detection models trained on public datasets generalize poorly across camera viewpoints, and side-mounted cameras in particular suffer from domain shift. Method: Builds a data augmentation pipeline that combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints. Result: Validated on two state-of-the-art models, SCNN and UFLDv2; the augmented data improves precision, recall, and F1 score under challenging conditions such as shadows. Conclusion: The method narrows the gap between public datasets and real deployment scenarios, providing a scalable and practical solution for lane detection. Abstract: Robust lane detection is essential for advanced driver assistance and autonomous driving, yet models trained on public datasets such as CULane often fail to generalise across different camera viewpoints. This paper addresses the challenge of domain shift for side-mounted cameras used in lane-wheel monitoring by introducing a generative AI-based data enhancement pipeline. The approach combines geometric perspective transformation, AI-driven inpainting, and vehicle body overlays to simulate deployment-specific viewpoints while preserving lane continuity. We evaluated the effectiveness of the proposed augmentation in two state-of-the-art models, SCNN and UFLDv2. When trained with the augmented data, both models show improved robustness to different conditions, including shadows. The experimental results demonstrate gains in precision, recall, and F1 score compared to the pre-trained model. By bridging the gap between widely available datasets and deployment-specific scenarios, our method provides a scalable and practical framework to improve the reliability of lane detection in a pilot deployment scenario.

[255] Sphinx: Efficiently Serving Novel View Synthesis using Regression-Guided Selective Refinement

Yuchen Xia,Souvik Kundu,Mosharaf Chowdhury,Nishil Talati

Main category: cs.CV

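TL;DR: Sphinx is a training-free hybrid inference framework that combines the fast initialization of a regression model with the high-quality generation of a diffusion model. Using selective refinement and adaptive noise scheduling, it keeps visual quality close to that of diffusion models at much lower compute, establishing a new Pareto frontier between quality and latency in novel view synthesis.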

Details Motivation: Existing novel view synthesis methods trade generation quality against inference efficiency: diffusion-based methods give high quality at high compute cost, while regression-based methods are efficient but lower in quality. A solution that keeps quality high while speeding up inference is needed. Method: Proposes the Sphinx framework, which uses a regression model for fast initialization to reduce the diffusion model's denoising workload, and introduces selective refinement with adaptive noise scheduling to allocate more compute to highly uncertain regions and frames, enabling a flexible quality-performance trade-off. Result: Compared with pure diffusion inference, Sphinx is 1.8x faster on average with less than 5% perceptual degradation, achieving a better quality-latency balance. Conclusion: Sphinx provides an efficient, high-quality, training-free inference scheme for novel view synthesis that can adapt to dynamically changing inference requirements, advancing the practicality of the field. Abstract: Novel View Synthesis (NVS) is the task of generating new images of a scene from viewpoints that were not part of the original input. Diffusion-based NVS can generate high-quality, temporally consistent images but remains computationally prohibitive. Conversely, regression-based NVS offers suboptimal generation quality despite requiring significantly lower compute, leaving the design objective of a high-quality, inference-efficient NVS framework an open challenge. To close this critical gap, we present Sphinx, a training-free hybrid inference framework that achieves diffusion-level fidelity at significantly lower compute. Sphinx proposes to use regression-based fast initialization to guide and reduce the denoising workload for the diffusion model. Additionally, it integrates selective refinement with adaptive noise scheduling, allocating more compute to uncertain regions and frames. This enables Sphinx to provide flexible navigation of the performance-quality trade-off, allowing adaptation to latency and fidelity requirements for dynamically changing inference scenarios. Our evaluation shows that Sphinx achieves an average 1.8x speedup over diffusion model inference with negligible perceptual degradation of less than 5%, establishing a new Pareto frontier between quality and latency in NVS serving.

[256] Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

Yiqing Shi,Yiren Song,Mike Zheng Shou

Main category: cs.CV

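TL;DR: This paper proposes Edit2Perceive, a unified diffusion framework that uses image editing diffusion models for dense perception tasks such as depth, normal, and matting. Built on the FLUX.1 Kontext architecture, it applies full-parameter fine-tuning and a pixel-space consistency loss for structure-preserving refinement, supports single-step deterministic inference, trains on small datasets, and runs faster.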

Details Motivation: Most existing dense perception methods rely on text-to-image generators designed for stochastic generation and overlook the consistency advantage of image editing diffusion models in image-to-image translation. This paper explores the potential of image editing diffusion models for dense perception and offers a more suitable foundation. Method: Proposes the Edit2Perceive framework, built on the FLUX.1 Kontext architecture with full-parameter fine-tuning and a pixel-space consistency loss that preserves structural consistency across intermediate denoising states; it models depth, normal, and matting in a unified way with an image editing diffusion model and supports single-step deterministic inference. Result: Achieves state-of-the-art results on depth estimation, normal prediction, and image matting, with faster inference and effective training on relatively small datasets. Conclusion: Editing-oriented diffusion transformers have strong potential for geometry-aware perception, and Edit2Perceive offers an efficient, consistent, and general paradigm for dense perception. Abstract: Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.

[257] MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

Yongcheng Yao,Yongshuo Zong,Raman Dutt,Yongxin Yang,Sotirios A Tsaftaris,Timothy Hospedales

Main category: cs.CV

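TL;DR: This paper introduces MedVision, a large-scale dataset and benchmark for improving vision-language models on quantitative medical image analysis, covering three tasks: anatomical structure detection, tumor size estimation, and angle/distance measurement.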

Details Motivation: Existing medical vision-language models mainly target classification or qualitative tasks and lack the quantitative reasoning needed for clinical decisions, such as measuring tumor size or joint angles. Method: Builds MedVision, a large-scale dataset spanning 22 public datasets with 30.8 million image-annotation pairs, defines three quantitative tasks, and improves existing VLMs through supervised fine-tuning. Result: Off-the-shelf VLMs perform poorly on these quantitative tasks, but after fine-tuning on MedVision, errors in detection, tumor/lesion size estimation, and angle/distance measurement drop significantly and precision improves. Conclusion: MedVision lays a foundation for medical vision-language models with strong quantitative reasoning and advances the use of VLMs in clinical decision support. Abstract: Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

[258] A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification

Yunpeng Gong,Yongjie Hou,Jiangming Shi,Kim Long Diep,Min Jiang

Main category: cs.CV

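TL;DR: This paper proposes KTCAA, a theoretically grounded framework for cross-modal generalization in sketch-based person re-identification. Using alignment augmentation and a knowledge transfer catalyst, it achieves state-of-the-art performance in few-shot conditions.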

Details Motivation: Sketch-based person re-identification is difficult because of the large modality gap and limited annotated data. Starting from generalization theory, this paper aims to reduce target-domain risk and improve performance on unseen data. Method: Proposes two key components. Alignment Augmentation (AA) applies localized sketch-style transformations to simulate the target distribution and promote alignment between source and target domains. Knowledge Transfer Catalyst (KTC) introduces worst-case perturbations and enforces consistency to strengthen invariance to modality shifts. The two are jointly optimized in a meta-learning framework that transfers alignment knowledge from data-rich RGB domains to sketch scenarios. Result: Experiments on multiple benchmarks show that KTCAA clearly outperforms existing methods in few-shot settings, especially when data are scarce. Conclusion: Through theory-guided domain alignment and perturbation invariance, KTCAA improves sketch-to-image cross-modal matching and offers a new solution for few-shot cross-modal tasks. Abstract: Sketch-based person re-identification aims to match hand-drawn sketches with RGB surveillance images, but remains challenging due to significant modality gaps and limited annotated data. To address this, we introduce KTCAA, a theoretically grounded framework for few-shot cross-modal generalization. Motivated by generalization theory, we identify two key factors influencing target domain risk: (1) domain discrepancy, which quantifies the alignment difficulty between source and target distributions; and (2) perturbation invariance, which evaluates the model's robustness to modality shifts. Based on these insights, we propose two components: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target distributions and facilitate progressive alignment; and (2) Knowledge Transfer Catalyst (KTC), which enhances invariance by introducing worst-case perturbations and enforcing consistency. These modules are jointly optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly in data-scarce conditions.

[259] Neural Geometry Image-Based Representations with Optimal Transport (OT)

Xiang Gao,Yuanpeng Liu,Xinmu Wang,Jiazhi Li,Minghao Guo,Yu Guo,Xiyun Song,Heather Yu,Zhiqiang Lao,Xianfeng David Gu

Main category: cs.CV

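TL;DR: Proposes a neural geometry image-based representation for 3D meshes that uses optimal transport to build a low-resolution geometry-image mipmap, enabling efficient storage and restoration of high-quality meshes in a single forward pass.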

Details Motivation: Existing neural representations for 3D meshes depend on multiple decoding passes and are computationally expensive, making it hard to exploit the regular structure of images for efficient processing. Method: Converts 3D meshes into geometry images (regular grids), uses optimal transport to resolve uneven sampling, and obtains continuous levels of detail through mipmapping; a decoder-free neural representation restores high-quality meshes in a single forward pass. Result: Reaches state-of-the-art compression ratio, Chamfer distance, and Hausdorff distance, with clear gains in storage efficiency and reconstruction accuracy. Conclusion: The proposed neural geometry image representation combines efficient storage, fast reconstruction, and high-quality restoration, offering a new paradigm for 3D mesh processing that improves on decoder-dependent methods. Abstract: Neural representations for 3D meshes are emerging as an effective solution for compact storage and efficient processing. Existing methods often rely on neural overfitting, where a coarse mesh is stored and progressively refined through multiple decoder networks. While this can restore high-quality surfaces, it is computationally expensive due to successive decoding passes and the irregular structure of mesh data. In contrast, images have a regular structure that enables powerful super-resolution and restoration frameworks, but applying these advantages to meshes is difficult because their irregular connectivity demands complex encoder-decoder architectures. Our key insight is that a geometry image-based representation transforms irregular meshes into a regular image grid, making efficient image-based neural processing directly applicable. Building on this idea, we introduce our neural geometry image-based representation, which is decoder-free, storage-efficient, and naturally suited for neural processing. It stores a low-resolution geometry-image mipmap of the surface, from which high-quality meshes are restored in a single forward pass. To construct geometry images, we leverage Optimal Transport (OT), which resolves oversampling in flat regions and undersampling in feature-rich regions, and enables continuous levels of detail (LoD) through geometry-image mipmapping. Experimental results demonstrate state-of-the-art storage efficiency and restoration accuracy, measured by compression ratio (CR), Chamfer distance (CD), and Hausdorff distance (HD).
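For readers unfamiliar with the two accuracy metrics named in the abstract, the following is a brute-force sketch of Chamfer and Hausdorff distance between two small point sets. The abstract does not state which variant the paper uses, so the definitions here are assumptions: a symmetric Chamfer distance taken as the sum of the two mean nearest-neighbor distances (unsquared), and the symmetric Hausdorff distance. The example points are made up.

```python
import math

def _nearest(p, points):
    """Euclidean distance from point p to its nearest neighbor in points."""
    return min(math.dist(p, q) for q in points)

def chamfer_distance(a, b):
    """Sum of mean nearest-neighbor distances in both directions (assumed variant)."""
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))

def hausdorff_distance(a, b):
    """Largest nearest-neighbor distance in either direction."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))

# Hypothetical samples from an original and a restored surface.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
restored = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

cd = chamfer_distance(original, restored)
hd = hausdorff_distance(original, restored)
```

Chamfer distance averages the error over all points, while Hausdorff distance reports the worst-case deviation, so the two together describe both typical and extreme restoration error. The quadratic-time search here is only suitable for tiny point sets.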

[260] Hierarchical GraphCut Phase Unwrapping based on Invariance of Diffeomorphisms Framework

Xiang Gao,Xinmu Wang,Zhou Zhao,Junqi Huang,Xianfeng David Gu

Main category: cs.CV

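TL;DR: Proposes a phase unwrapping framework based on GraphCut and the invariance of diffeomorphisms. It uses conformal and optimal transport maps to improve unwrapping accuracy and fuses results from multiple domains by majority voting, achieving a 45.5x speedup with lower error, which suits real-time 4D facial dynamics capture.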

Details Motivation: Existing phase unwrapping methods struggle to balance speed and accuracy: fast methods lack precision and accurate algorithms cannot meet real-time requirements, especially with noise, occlusions, and complex geometry. Method: Reformulates GraphCut-based phase unwrapping as a pixel-labeling problem. Conformal and optimal transport (OT) maps in image space generate multiple diffeomorphic domains; an odd number of transformed domains are precomputed, a hierarchical GraphCut algorithm is applied in each, and the label maps are fused by majority voting to robustly estimate the per-pixel unwrapped phase count k. Result: Experiments show a 45.5x speedup over traditional methods and lower L2 error in both simulations and real experiments, with more accurate recovery of the continuous phase. Conclusion: The framework addresses the speed-accuracy trade-off in phase unwrapping, is robust, has potential for real-time processing, and suits high-accuracy real-time scanning applications such as 4D facial dynamics. Abstract: Recent years have witnessed rapid advancements in 3D scanning technologies, with applications spanning VR/AR, digital human creation, and medical imaging. Structured-light scanning with phase-shifting techniques is preferred for its use of low-intensity visible light and high accuracy, making it well suited for capturing 4D facial dynamics. A key step is phase unwrapping, which recovers continuous phase values from measurements wrapped modulo 2π. The goal is to estimate the unwrapped phase count k in the equation Φ = φ + 2πk, where φ is the wrapped phase and Φ is the true phase. Noise, occlusions, and complex 3D geometry make recovering the true phase challenging because phase unwrapping is ill-posed: measurements only provide modulo 2π values, and estimating k requires assumptions about surface continuity. Existing methods trade speed for accuracy: fast approaches lack precision, while accurate algorithms are too slow for real-time use. To overcome these limitations, this work proposes a phase unwrapping framework that reformulates GraphCut-based unwrapping as a pixel-labeling problem. This framework improves the estimation of the unwrapped phase count k through the invariance property of diffeomorphisms applied in image space via conformal and optimal transport (OT) maps. An odd number of diffeomorphisms are precomputed from the input phase data, and a hierarchical GraphCut algorithm is applied in each domain. The resulting label maps are fused via majority voting to robustly estimate k at each pixel. Experimental results demonstrate a 45.5x speedup and lower L2 error in real experiments and simulations, showing potential for real-time applications.
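The last two steps described in the abstract, fusing per-domain label maps by majority voting and then applying Φ = φ + 2πk, can be sketched as follows. This is only an illustration under strong assumptions: the label maps are taken as already mapped back to a common pixel grid and flattened to 1D lists, the GraphCut labeling itself is not shown, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter

def majority_vote(label_maps):
    """Fuse per-pixel integer label maps by taking the most common label."""
    n_pixels = len(label_maps[0])
    fused = []
    for i in range(n_pixels):
        votes = Counter(labels[i] for labels in label_maps)
        # On a tie, most_common returns the label that was seen first.
        fused.append(votes.most_common(1)[0][0])
    return fused

def unwrap(wrapped, k):
    """Apply Phi = phi + 2*pi*k per pixel."""
    return [phi + 2.0 * math.pi * ki for phi, ki in zip(wrapped, k)]

# Hypothetical wrapped phases and three candidate label maps
# (one per diffeomorphic domain), flattened to four pixels.
wrapped = [0.5, 1.0, 0.2, 0.9]
label_maps = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 2],
]

k = majority_vote(label_maps)
unwrapped = unwrap(wrapped, k)
```

Using an odd number of maps guarantees a strict majority whenever only two labels compete for a pixel; with three or more distinct labels a tie is still possible, which this sketch resolves arbitrarily by first occurrence.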

[261] Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation

Shristi Das Biswas,Arani Roy,Kaushik Roy

Main category: cs.CV

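TL;DR: This paper proposes ICE, a training-free, one-shot weight modification method for cross-modal concept erasure in text-to-image and text-to-video models, which explicitly regularizes against semantic overlap to achieve precise and persistent unlearning.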

Details Motivation: Existing concept removal methods suffer from costly retraining, inference overhead, or vulnerability to adversarial attacks, and they rarely model the semantic overlap between the target concept and surrounding content, which causes collateral damage after erasure; few work for both T2I and T2V models. Method: Proposes Instant Concept Erasure (ICE), which defines erase and preserve subspaces using anisotropic energy-weighted scaling and explicitly regularizes against their intersection with a closed-form overlap projector; it poses a convex, Lipschitz-bounded spectral unlearning objective whose stable analytical solution defines a permanent dissociation operator applied to the text-conditioning layers. Result: On removal of artistic styles, objects, identities, and sensitive content, ICE achieves strong erasure on both T2I and T2V models, with better robustness to red-teaming and minimal impact on original generative abilities. Conclusion: ICE is an efficient, general concept erasure method with no extra inference overhead that achieves precise, persistent unlearning in cross-modal models without retraining. Abstract: Robust concept removal for text-to-image (T2I) and text-to-video (T2V) models is essential for their safe deployment. Existing methods, however, suffer from costly retraining, inference overhead, or vulnerability to adversarial attacks. Crucially, they rarely model the latent semantic overlap between the target erase concept and surrounding content, which causes collateral damage post-erasure, and even fewer methods work reliably across both T2I and T2V domains. We introduce Instant Concept Erasure (ICE), a training-free, modality-agnostic, one-shot weight modification approach that achieves precise, persistent unlearning with zero overhead. ICE defines erase and preserve subspaces using anisotropic energy-weighted scaling, then explicitly regularises against their intersection using a unique, closed-form overlap projector. We pose a convex and Lipschitz-bounded Spectral Unlearning Objective, balancing erasure fidelity and intersection preservation, that admits a stable and unique analytical solution. This solution defines a dissociation operator that is translated to the model's text-conditioning layers, making the edit permanent and runtime-free. Across targeted removals of artistic styles, objects, identities, and explicit content, ICE efficiently achieves strong erasure with improved robustness to red-teaming, all while causing only minimal degradation of original generative abilities in both T2I and T2V models.

[262] Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Dayong Liu,Chao Xu,Weihong Chen,Suyu Zhang,Juncheng Wang,Jiankang Deng,Baigui Sun,Yang Liu

Main category: cs.CV

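TL;DR: This paper introduces CFG-Bench, a new benchmark for evaluating the fine-grained action intelligence of multimodal large language models in embodied agents. It reveals shortcomings of existing MLLMs in physical interaction and higher-order reasoning, and shows that supervised fine-tuning on its data significantly improves model performance.

Details Motivation: Existing benchmarks mostly focus on high-level planning or spatial reasoning and lack a systematic evaluation of the fine-grained action intelligence that embodied agents need for physical interaction, so a new benchmark is needed to fill this gap. Method: CFG-Bench is built from 1,368 videos and 19,562 three-modality question-answer pairs covering four cognitive abilities (physical interaction, temporal-causal relation, intentional understanding, and evaluative judgment). Mainstream MLLMs are evaluated on it, and supervised fine-tuning experiments are conducted. Result: Leading MLLMs show clear deficiencies in producing detailed instructions for physical interaction and in higher-order reasoning about intention and evaluation, but supervised fine-tuning on CFG-Bench data significantly improves their performance on standard embodied tasks. Conclusion: CFG-Bench provides an effective framework for evaluating the action intelligence of embodied agents, exposes the limitations of MLLMs, and shows that fine-grained action training improves their practical usefulness, supporting the development of more capable and grounded embodied agents. Abstract: Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.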

[263] EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification

Kazi Reyazul Hasan,Md Nafiu Rahman,Wasif Jalal,Sadif Ahmed,Shahriar Raj,Mubasshira Musarrat,Muhammad Abdullah Adnan

Main category: cs.CV

TL;DR: This paper proposes EVCC, a novel hybrid vision architecture that combines the Vision Transformer, a lightweight ConvNeXt, and CoAtNet. Using adaptive token pruning, gated bidirectional cross-attention, auxiliary classification heads, and a dynamic router gate, it achieves higher accuracy than existing models on several image-classification datasets while substantially reducing computation.

Details Motivation: Existing hybrid Transformer-CNN models perform well on image classification but are computationally expensive, so an architecture that balances accuracy and efficiency is needed. Method: A multi-branch EVCC architecture fuses the Vision Transformer, ConvNeXt, and CoAtNet, and introduces adaptive token pruning, gated bidirectional cross-attention, auxiliary classification heads, and a dynamic router gate based on context-aware confidence. Result: On the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets, EVCC improves accuracy by up to 2 percentage points over models such as DeiT-Base, MaxViT-Base, and CrossViT-Base while reducing FLOPs by 25% to 35%. Conclusion: By dynamically adjusting its computational budget, EVCC keeps accuracy high while markedly improving efficiency, balancing accuracy against computational cost for real-world use. Abstract: Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but they usually do so at significant computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models like DeiT-Base, MaxViT-Base, and CrossViT-Base by consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points, while reducing FLOPs by 25 to 35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, efficiently balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.

[264] Exploring Surround-View Fisheye Camera 3D Object Detection

Changcai Li,Wenwei Lin,Zuoxun Hou,Gang Chen,Wei Zhang,Huihui Zhou,Weishi Zheng

Main category: cs.CV

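TL;DR: This paper explores the technical feasibility of end-to-end 3D object detection with a surround-view fisheye camera system. It proposes two detectors that exploit fisheye image geometry, FisheyeBEVDet and FisheyePETR, and releases a new open dataset, Fisheye3DOD. Experiments show that the proposed methods significantly improve detection accuracy.

Details Motivation: Classic pinhole-based 3D object detectors lose performance on fisheye imagery, and there is no dedicated benchmark for fisheye 3D detection, so detection methods suited to fisheye camera systems and a corresponding dataset are needed. Method: Two methods incorporate fisheye image geometry: FisheyeBEVDet, based on the bird's-eye-view (BEV) paradigm, and FisheyePETR, based on the query-based paradigm. Both use spherical spatial representations to model fisheye characteristics, and the Fisheye3DOD dataset is built with the CARLA simulator for evaluation. Result: Experiments on the Fisheye3DOD dataset show that the proposed methods improve detection accuracy by up to 6.2% over baseline models. Conclusion: Incorporating the geometry of fisheye images into mainstream detection frameworks effectively improves end-to-end 3D object detection and confirms the feasibility of 3D detection with fisheye camera systems. Abstract: In this work, we explore the technical feasibility of implementing end-to-end 3D object detection (3DOD) with a surround-view fisheye camera system. Specifically, we first investigate the performance drop incurred when transferring classic pinhole-based 3D object detectors to fisheye imagery. To mitigate this, we then develop two methods that incorporate the unique geometry of fisheye images into mainstream detection frameworks: one based on the bird's-eye-view (BEV) paradigm, named FisheyeBEVDet, and the other on the query-based paradigm, named FisheyePETR. Both methods adopt spherical spatial representations to effectively capture fisheye geometry. In light of the lack of dedicated evaluation benchmarks, we release Fisheye3DOD, a new open dataset synthesized using CARLA and featuring both standard pinhole and fisheye camera arrays. Experiments on Fisheye3DOD show that our fisheye-compatible modeling improves accuracy by up to 6.2% over baseline methods.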

[265] Dendritic Convolution for Noise Image Recognition

Jiarui Xue,Dongjian Yang,Ye Sun,Gang Liu

Main category: cs.CV

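TL;DR: This paper proposes an anti-noise convolution (DDC) inspired by the dendritic structure of neurons. By simulating the nonlinear interaction logic of dendrites, it reconstructs the mathematical paradigm of feature extraction and effectively suppresses the influence of noise. Experiments show that the method significantly improves performance under noise across several image-classification and object-detection models.

Details Motivation: Existing anti-noise image-recognition methods mostly adjust network structures or training strategies, and their gains are approaching a bottleneck. Approaches that explore interference resistance from the perspective of neuronal computation are lacking. Method: The proposed anti-noise neuronal convolution (DDC) mimics the dendritic structure of biological neurons, builds the neighborhood-interaction computation of dendrites into the underlying design of the convolution operation, and simulates the XOR preprocessing function of dendrites through nonlinear interactions between input features, thereby reconstructing the mathematical paradigm of feature extraction. Result: In image classification (YOLOv11-cls, VGG16, EfficientNet-B0) and object detection (YOLOv11, YOLOv8, YOLOv5), replacing traditional convolution yields a relative accuracy improvement of 11.23% for EfficientNet-B0 on noisy datasets and a 19.80% increase in mAP for YOLOv8. Conclusion: By simulating the computation of biological dendrites, DDC strengthens the robustness of convolution to noise at a fundamental level, significantly outperforms traditional convolution in complex noisy environments, and offers a new neuromorphic route to anti-noise image recognition. Abstract: In real-world scenarios of image recognition, there exists substantial noise interference. Existing works primarily focus on methods such as adjusting networks or training strategies to address noisy image recognition, and the anti-noise performance has reached a bottleneck. However, little is known about the exploration of anti-interference solutions from a neuronal perspective. This paper proposes an anti-noise neuronal convolution. This convolution mimics the dendritic structure of neurons, integrates the neighborhood interaction computation logic of dendrites into the underlying design of convolutional operations, and simulates the XOR logic preprocessing function of biological dendrites through nonlinear interactions between input features, thereby fundamentally reconstructing the mathematical paradigm of feature extraction. Unlike traditional convolution where noise directly interferes with feature extraction and exerts a significant impact, DDC mitigates the influence of noise by focusing on the interaction of neighborhood information. Experimental results demonstrate that in image classification tasks (using YOLOv11-cls, VGG16, and EfficientNet-B0) and object detection tasks (using YOLOv11, YOLOv8, and YOLOv5), after replacing traditional convolution with the dendritic convolution, the accuracy of the EfficientNet-B0 model on noisy datasets is relatively improved by 11.23%, and the mean Average Precision (mAP) of YOLOv8 is increased by 19.80%. The consistency between the computation method of this convolution and the dendrites of biological neurons enables it to perform significantly better than traditional convolution in complex noisy environments.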

[266] ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction

Mustafa Munir,Harsh Goel,Xiwen Wei,Minkyu Choi,Sahil Shah,Kartikeya Bhardwaj,Paul Whatmough,Sandeep Chinchali,Radu Marculescu

Main category: cs.CV

TL;DR: This paper proposes ObjectAlign, a new framework that combines perceptual metrics with symbolic reasoning to detect, verify, and correct object inconsistencies in video editing.

Details Motivation: Video editing and synthesis often introduce object inconsistencies such as frame flicker and identity drift, which degrade perceptual quality. Existing methods struggle to guarantee both low-level stability and high-level temporal correctness. Method: The framework proposes learnable metric thresholds and a neuro-symbolic verifier that combines an SMT-based formal check with a temporal fidelity check based on a probabilistic model checker. Blocks of inconsistent frames are repaired with adaptive, neural-network-based interpolation. Result: On the DAVIS and Pexels datasets, it improves CLIP Score by up to 1.4 points and warp error by up to 6.1 points over SOTA baselines. Conclusion: ObjectAlign effectively improves object consistency in edited video, balancing low-level perceptual stability with high-level temporal-logic correctness, and clearly outperforms existing methods. Abstract: Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, that degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that seamlessly blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The novel contributions of ObjectAlign are as follows: First, we propose learnable thresholds for metrics characterizing object consistency (i.e. CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video's formal representation against a temporal logic specification. A frame transition is subsequently deemed "consistent" based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose a neural network based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to 1.4 point improvement in CLIP Score and up to 6.1 point improvement in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.
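The "single logical assertion" and the grouping of flagged frames described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the `Thresholds` class, and the function names are hypothetical, and the symbolic check is reduced to a boolean that is assumed to come from an external verifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical fixed values; the paper learns these thresholds.
    clip_sim_min: float = 0.90   # CLIP-based semantic similarity (higher is better)
    lpips_max: float = 0.20      # LPIPS perceptual distance (lower is better)
    hist_corr_min: float = 0.85  # histogram correlation (higher is better)
    mask_iou_min: float = 0.70   # object-mask IoU (higher is better)


def is_consistent(clip_sim, lpips, hist_corr, mask_iou, symbolic_ok, t=Thresholds()):
    """A transition is consistent only if all metric thresholds AND the symbolic check hold."""
    metrics_ok = (
        clip_sim >= t.clip_sim_min
        and lpips <= t.lpips_max
        and hist_corr >= t.hist_corr_min
        and mask_iou >= t.mask_iou_min
    )
    return metrics_ok and symbolic_ok


def flagged_blocks(consistent):
    """Group indices of inconsistent transitions into contiguous (start, end) blocks."""
    blocks, start = [], None
    for i, ok in enumerate(consistent):
        if not ok and start is None:
            start = i
        elif ok and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(consistent) - 1))
    return blocks
```

Each block returned by `flagged_blocks` would then be handed to a repair step that interpolates between the last valid frame before the block and the next valid frame after it.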

[267] CoD: A Diffusion Foundation Model for Image Compression

Zhaoyang Jia,Zihan Zheng,Naifu Xue,Jiahao Li,Bin Li,Zongyu Guo,Xiaoyi Zhang,Houqiang Li,Yan Lu

Main category: cs.CV

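TL;DR: This paper proposes CoD, the first compression-oriented diffusion foundation model. Unlike existing diffusion codecs that depend on text conditioning, CoD is trained from scratch for end-to-end optimization of compression and generation. It significantly improves compression efficiency at ultra-low bitrates, is cheap and reproducible to train, and offers new insights for future diffusion-codec research.

Details Motivation: Existing diffusion codecs build on text-to-image diffusion models such as Stable Diffusion, but text conditioning performs poorly for compression and limits performance, particularly at ultra-low bitrates. A foundation model optimized specifically for compression is therefore needed. Method: CoD is a new compression-oriented diffusion foundation model trained from scratch on open image-only datasets, without text conditioning. It supports a variety of diffusion-based codecs and enables end-to-end optimization. Result: Replacing Stable Diffusion with CoD in downstream codecs such as DiffC achieves SOTA performance, especially at ultra-low bitrates such as 0.0039 bpp. Training is 300 times faster than Stable Diffusion (about 20 vs. about 6,250 A100 GPU days). Experiments show that pixel-space diffusion can reach VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs with fewer parameters. Conclusion: As a general, efficient, and low-cost diffusion foundation model, CoD lays the groundwork for future diffusion codecs and may advance research on image compression at ultra-low bitrates. Abstract: Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs. It offers several advantages: High compression efficiency: replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); Low-cost and reproducible training: 300× faster training than Stable Diffusion (~20 vs. ~6,250 A100 GPU days) on entirely open image-only datasets; Providing new insights: e.g., we find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Codes will be released.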
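To make the two headline numbers concrete, the following sketch checks the training-cost ratio from the abstract's own figures and converts the quoted bitrate into a payload size. The 512×512 image resolution is an assumption chosen for illustration and is not stated in the abstract; the function names are likewise hypothetical.

```python
def speedup(baseline_gpu_days, new_gpu_days):
    """Ratio of baseline training cost to new training cost."""
    return baseline_gpu_days / new_gpu_days


def compressed_bytes(width, height, bpp):
    """Payload size in bytes for an image coded at `bpp` bits per pixel."""
    return width * height * bpp / 8


# Figures from the abstract: ~6,250 vs. ~20 A100 GPU days.
print(speedup(6250, 20))

# Assumed 512x512 image at 0.0039 bpp (roughly 1/256 of a bit per pixel).
print(compressed_bytes(512, 512, 0.0039))
```

The ratio works out to 312.5, consistent with the rounded "300×" claim, and under the assumed resolution the whole image is coded in roughly 128 bytes.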

[268] Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation

Yuyang Wanyan,Xiaoshan Yang,Weiming Dong,Changsheng Xu

Main category: cs.CV

TL;DR: This paper proposes Modality-Collaborative Low-Rank Decomposers (MC-LRD), a new framework that decomposes modality-specific and modality-shared features for few-shot video domain adaptation. Multimodal decomposition routing and orthogonal decorrelation constraints significantly improve cross-domain adaptation.

Details Motivation: Because videos are multimodal and the components within each modality experience different degrees of domain shift, existing methods struggle to perform domain alignment and modality collaboration in the few-shot setting, which limits generalization to the target domain. Method: The MC-LRD framework contains multiple decomposers with progressively shared parameters and Multimodal Decomposition Routers (MDR). The MDR selectively activates decomposers to separate modality-unique from modality-shared features, and orthogonal decorrelation constraints and a cross-domain activation consistency loss improve decomposition efficiency and domain alignment. Result: Experiments on three public benchmarks show that the method significantly outperforms existing techniques on few-shot video domain adaptation. Conclusion: By disentangling features with different levels of domain shift and promoting modality collaboration, MC-LRD addresses key challenges in few-shot video domain adaptation and offers a new direction for multimodal domain adaptation. Abstract: In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which is ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained, because each modality comprises coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) to decompose modality-unique and modality-shared features with different domain shift levels from each modality that are more friendly for domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and subrouters, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences of the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.

[269] DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving

Hongbin Lin,Yiming Yang,Chaoda Zheng,Yifan Zhang,Shuaicheng Niu,Zilu Guo,Yafeng Li,Gui Gui,Shuguang Cui,Zhen Li

Main category: cs.CV

TL;DR: This paper proposes DriveFlow, a Rectified Flow adaptation method built on pre-trained text-to-image flow models for training-data enhancement in autonomous driving. High-frequency foreground preservation and dual-frequency background optimization improve 3D object detection in out-of-distribution scenarios.

Details Motivation: Because of high annotation costs and diverse outdoor scenes, existing training data cannot cover all test scenarios (the out-of-distribution problem), and existing image-editing methods fall short either in preserving 3D geometric accuracy or in editing effectiveness. Method: Based on frequency decomposition, the method uses text-conditioned velocities to derive noise-free editing paths, introduces a high-frequency alignment loss to preserve the 3D geometry of foreground objects, and applies a dual-frequency optimization strategy to balance flexibility and semantic consistency in background editing. Result: Experiments show that DriveFlow significantly improves 3D object detection across all categories and out-of-distribution scenarios, and is both effective and efficient. Conclusion: By improving the Rectified Flow editing path, DriveFlow makes vision-centric 3D detection models more robust to out-of-distribution scenarios without modifying the pre-trained generative model, offering a new approach to data enhancement in autonomous driving. Abstract: In autonomous driving, vision-centric 3D object detection recognizes and localizes 3D objects from RGB images. However, due to high annotation costs and diverse outdoor scenes, training data often fails to cover all possible test scenarios, known as the out-of-distribution (OOD) issue. Training-free image editing offers a promising solution for improving model robustness by training data enhancement without any modifications to pre-trained diffusion models. Nevertheless, inversion-based methods often suffer from limited effectiveness and inherent inaccuracies, while recent rectified-flow-based approaches struggle to preserve objects with accurate 3D geometry. In this paper, we propose DriveFlow, a Rectified Flow Adaptation method for training data enhancement in autonomous driving based on pre-trained Text-to-Image flow models. Based on frequency decomposition, DriveFlow introduces two strategies to adapt noise-free editing paths derived from text-conditioned velocities. 1) High-Frequency Foreground Preservation: DriveFlow incorporates a high-frequency alignment loss for foreground to maintain precise 3D object geometry. 2) Dual-Frequency Background Optimization: DriveFlow also conducts dual-frequency optimization for background, balancing editing flexibility and semantic consistency. Comprehensive experiments validate the effectiveness and efficiency of DriveFlow, demonstrating comprehensive performance improvements on all categories across OOD scenarios. Code is available at https://github.com/Hongbin98/DriveFlow.

[270] Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni,Yuanzhi Liang,Rui Li,Yi Zhou,Haibing Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: The paper proposes ViPO, which uses pixel-level advantage maps to improve the fine-grained alignment of visual generative models.

Details Motivation: Existing GRPO uses only a scalar reward and ignores the spatial and temporal structure of visual content, which makes it hard to correct localized artifacts and capture fine-grained perceptual signals. Method: ViPO uses pretrained vision backbones to build spatially and temporally aware advantage maps, turning scalar feedback into pixel-level advantages and improving optimization precision. Result: ViPO outperforms standard GRPO on both image and video benchmarks, improving alignment with human preferences and cross-domain generalization. Conclusion: ViPO is a lightweight, plug-and-play method compatible with existing GRPO pipelines that provides a richer, more expressive learning signal for visual generation. Abstract: Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.

[271] GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Lin Liu,Caiyan Jia,Guanyi Yu,Ziying Song,JunQiao Li,Feiyang Jia,Peiliang Wu,Xiaoshuai Hao,Yandan Luo

Main category: cs.CV

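TL;DR: This paper proposes GuideFlow, a novel driving-planning framework that uses Constrained Flow Matching to address trajectory mode collapse in existing end-to-end planners and the difficulty of incorporating physical and safety constraints into the generative process. GuideFlow explicitly models the flow matching process to support diverse trajectory generation, enforces explicit constraints during generation, uses an Energy-Based Model (EBM) to better satisfy physical constraints, and exposes driving aggressiveness as a control signal. It achieves state-of-the-art results on several mainstream driving benchmarks.

Details Motivation: Existing imitative end-to-end planners suffer from multimodal trajectory mode collapse, while generative planners struggle to embed safety and physical constraints directly into generation and require an extra optimization step. A planning method that produces diverse trajectories and satisfies constraints intrinsically is therefore needed. Method: The GuideFlow framework adopts constrained flow matching, explicitly models the trajectory-generation flow, and enforces physical and safety constraints directly during generation. It is trained jointly with an energy-based model (EBM) to strengthen the model's ability to optimize toward constraint satisfaction, and it introduces driving aggressiveness as an adjustable control signal for generation. Result: Extensive evaluation on Bench2Drive, NuScenes, NavSim, and ADV-NuScenes shows that GuideFlow mitigates mode collapse and better satisfies physical constraints. On the NavSim test hard split (Navhard), it reaches an EPDMS score of 43.0, the current best result. Conclusion: By incorporating explicit constraints into flow-matching generation and training jointly with an EBM, GuideFlow produces safer, more diverse, and more controllable driving trajectories, with its advantages confirmed on multiple benchmarks, and offers a new approach to end-to-end autonomous-driving planning. Abstract: Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Secondly, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be released.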

[272] Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

Keyang Lu,Sifan Zhou,Hongbin Xu,Gang Xu,Zhifei Yang,Yikai Wang,Zhen Xiao,Jieyi Long,Ming Li

Main category: cs.CV

TL;DR: Yo'City is an agentic framework for 3D city generation that supports user customization and unbounded expansion, using large models for hierarchical planning and continuously evolving city generation.

Details Motivation: Existing 3D city-generation methods rely on a single diffusion model and struggle with personalization and large-scale continuous expansion, which limits their expressiveness and usefulness in applications such as virtual reality and digital twins. Method: The Yo'City framework adopts a hierarchical "City-District-Grid" structure in which a Global Planner and a Local Designer divide the work of city layout design. Grid-level 3D generation combines a "produce-refine-evaluate" isometric image synthesis loop with image-to-3D conversion, and a relationship-aware, scene-graph-based expansion mechanism supports coherent city evolution under user interaction. Result: A diverse benchmark dataset is constructed and six multi-dimensional metrics are designed to assess semantics, geometry, texture, and layout. Experiments show that Yo'City outperforms existing state-of-the-art methods on all metrics. Conclusion: By drawing on the reasoning and compositional abilities of large models, Yo'City achieves customizable, expandable, high-quality 3D city generation that is semantically and spatially consistent, offering a new paradigm for digital twins and virtual reality. Abstract: Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical "City-District-Grid" structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a "produce-refine-evaluate" isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph-based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.

[273] Thinking Ahead: Foresight Intelligence in MLLMs and World Models

Zhantao Gong,Liaoyuan Fan,Qing Guo,Xun Xu,Xulei Yang,Shijie Li

Main category: cs.CV

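TL;DR: This paper introduces the concept of Foresight Intelligence and builds FSU-QA, a new visual question-answering dataset for evaluating a model's ability to predict and understand future events. Experiments show that existing vision-language models perform poorly on such tasks, but fine-tuning on FSU-QA substantially improves the foresight reasoning of small models, which then surpass larger, more advanced models.

Details Motivation: Foresight Intelligence is essential for applications such as autonomous driving, yet it has received little research attention. The authors aim to fill this gap with a benchmark dedicated to evaluating reasoning about the future. Method: A visual question-answering dataset named FSU-QA is built to elicit and evaluate Foresight Intelligence. Existing vision-language models are systematically evaluated on it, and the predictions of world models are fed into VLMs to test whether they improve performance. Result: Existing VLMs perform poorly on foresight tasks. FSU-QA can measure the semantic coherence of world-model predictions, and small VLMs fine-tuned on FSU-QA clearly outperform much larger models that were not fine-tuned, confirming the dataset's value for improving foresight reasoning. Conclusion: FSU-QA provides a principled foundation for developing next-generation AI models that can truly anticipate the future, and advances the study and evaluation of Foresight Intelligence. Abstract: In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.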

[274] ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion

Zhenghan Fang,Jian Zheng,Qiaozi Gao,Xiaofeng Gao,Jeremias Sulam

Main category: cs.CV

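TL;DR: This paper proposes ProxT2I, a text-to-image diffusion model based on backward discretization and conditional proximal operators. Combined with reinforcement learning to optimize the sampler, it improves sampling efficiency and human-preference alignment and reaches state-of-the-art quality at lower computational cost.

Details Motivation: Existing diffusion models rely on forward discretization and explicit score functions, which makes sampling slow and unstable and requires many steps for high-quality samples, limiting efficiency and performance in practice. Method: ProxT2I is built with backward discretization, replacing the conventional score function with learned conditional proximal operators, and uses reinforcement learning and policy optimization to tune the sampler for task-specific rewards. A large-scale open-source dataset, LAION-Face-T2I-15M, with 15 million high-quality human images and fine-grained captions, is also built for training and evaluation. Result: The method outperforms score-based baselines in both sampling efficiency and human-preference alignment, produces high-quality images in fewer sampling steps, and matches current state-of-the-art open-source models with less compute and a smaller model. Conclusion: ProxT2I offers a lightweight, efficient paradigm for text-to-image generation. By combining backward discretization and proximal operators with reinforcement-learning optimization, it balances high performance with low computational cost and is particularly suited to human image generation. Abstract: Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.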

[275] Any4D: Open-Prompt 4D Generation from Natural Language and Images

Hao Li,Qiao Sun

Main category: cs.CV

TL;DR: This paper proposes Primitive Embodied World Models (PEWM), which restricts video generation to short time horizons to achieve fine-grained alignment between linguistic concepts and visual representations of robotic actions. This improves data efficiency and reduces learning complexity and inference latency. Combined with a vision-language model planner and a Start-Goal heatmap Guidance mechanism, it supports closed-loop control and policy generalization on complex tasks.

Details Motivation: Existing video-generation-based embodied world models depend heavily on large-scale interaction data, which is scarce, hard to collect, and high-dimensional. This limits the granularity of language-action alignment, makes long-horizon video generation harder, and keeps the embodied domain from reaching a "GPT moment". Method: The PEWM framework restricts video generation to fixed short horizons and adds a modular vision-language model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG) to achieve fine-grained semantic alignment, closed-loop control, and compositional generalization of policies. Result: PEWM reduces data requirements while improving training and inference efficiency, adapts better to complex tasks, and links fine-grained physical interaction with high-level reasoning. Conclusion: By focusing on primitive action units, PEWM resolves the tension between data efficiency and modeling complexity in embodied intelligence and offers a feasible path toward scalable, interpretable, general-purpose embodied systems. Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restricts video generation to fixed shorter horizons. Our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

[276] From Features to Reference Points: Lightweight and Adaptive Fusion for Cooperative Autonomous Driving

Yongqi Zhu,Morui Zhu,Qi Chen,Deyuan Qu,Song Fu,Qing Yang

Main category: cs.CV

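TL;DR: RefPtsFusion is a lightweight, interpretable fusion framework for cooperative autonomous driving. By sharing compact reference points (such as object position, velocity, and size) instead of raw feature maps, it greatly reduces communication overhead while maintaining good perception performance.

Details Motivation: Traditional feature-level fusion has high communication overhead and is hard to deploy efficiently across heterogeneous vehicles, which limits the scalability and real-time performance of cooperative perception systems. Method: In the RefPtsFusion framework, vehicles exchange only object-level reference-point information, and a selective Top-K query fusion mechanism additionally fuses high-confidence queries to enrich the shared information. Result: Experiments on the M3CAD dataset show that communication overhead drops by five orders of magnitude compared with traditional methods, from hundreds of MB/s to a few KB/s at 5 FPS, while perception performance remains stable and the method shows strong robustness and consistent transmission behavior. Conclusion: RefPtsFusion strikes a good balance between communication efficiency and perception accuracy, is compatible across models, and suits large-scale, real-time cooperative driving systems. Abstract: We present RefPtsFusion, a lightweight and interpretable framework for cooperative autonomous driving. Instead of sharing large feature maps or query embeddings, vehicles exchange compact reference points, e.g., objects' positions, velocities, and size information. This approach shifts the focus from "what is seen" to "where to see", creating a sensor- and model-independent interface that works well across vehicles with heterogeneous perception models while greatly reducing communication bandwidth. To enhance the richness of shared information, we further develop a selective Top-K query fusion that selectively adds high-confidence queries from the sender. It thus achieves a strong balance between accuracy and communication cost. Experiments on the M3CAD dataset show that RefPtsFusion maintains stable perception performance while reducing communication overhead by five orders of magnitude, dropping from hundreds of MB/s to only a few KB/s at 5 FPS (frames per second), compared to traditional feature-level fusion methods. Extensive experiments also demonstrate RefPtsFusion's strong robustness and consistent transmission behavior, highlighting its potential for scalable, real-time cooperative driving systems.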
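The bandwidth claim and the Top-K selection can be illustrated with a back-of-the-envelope sketch. It is not the authors' code: the per-object layout (nine 32-bit floats for 3D position, velocity, and size), the object count of 20, and both function names are assumptions made for illustration.

```python
import heapq


def select_top_k(queries, k):
    """Keep the k highest-confidence queries; each query is (confidence, payload)."""
    return heapq.nlargest(k, queries, key=lambda q: q[0])


def bandwidth_bytes_per_s(num_objects, floats_per_object, fps, bytes_per_float=4):
    """Raw payload rate when each object is sent as a fixed number of floats per frame."""
    return num_objects * floats_per_object * bytes_per_float * fps


# Assumed scene: 20 objects, 9 floats each (position, velocity, size), at 5 FPS.
rate = bandwidth_bytes_per_s(num_objects=20, floats_per_object=9, fps=5)
print(rate)  # 3600 bytes/s, i.e. a few KB/s

# Selecting the two most confident queries from a toy list.
print(select_top_k([(0.2, "a"), (0.9, "b"), (0.5, "c")], k=2))
```

Under these assumptions the payload is about 3.6 KB/s, which is in line with the "few KB/s" figure quoted in the abstract. Against a feature-level stream of hundreds of MB/s, that is a reduction of roughly five orders of magnitude.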

[277] VAOT: Vessel-Aware Optimal Transport for Retinal Fundus Enhancement

Xuanzhao Dong,Wenhui Zhu,Yujian Xiong,Xiwen Chen,Hao Wang,Xin Li,Jiajun Cheng,Zhipeng Wang,Shao Tang,Oana Dumitrascu,Yalin Wang

Main category: cs.CV

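TL;DR: Proposes VAOT, an unpaired fundus image enhancement method based on optimal transport and structure-preserving regularization that reduces noise while preserving vessel structure.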

Details Motivation: Existing GAN-based unpaired fundus enhancement methods tend to distort vessel structure, which affects clinical diagnosis, so an enhancement method that preserves vessel topology is needed. Method: Proposes Vessel-Aware Optimal Transport (VAOT), which combines an optimal-transport objective with two structure-preserving regularizers: a skeleton-based loss that maintains global vascular connectivity and an endpoint-aware loss that stabilizes local termini. Result: Validated on a synthetic degradation benchmark and on downstream tasks (vessel and lesion segmentation), VAOT outperforms several state-of-the-art baselines. Conclusion: In the unpaired setting, VAOT improves fundus image quality while preserving vessel structural integrity, which supports clinical diagnosis and analysis. Abstract: Color fundus photography (CFP) is central to diagnosing and monitoring retinal disease, yet its acquisition variability (e.g., illumination changes) often degrades image quality, which motivates robust enhancement methods. Unpaired enhancement pipelines are typically GAN-based; however, they can distort clinically critical vasculature, altering vessel topology and endpoint integrity. Motivated by these structural alterations, we propose Vessel-Aware Optimal Transport (VAOT), a framework that combines an optimal-transport objective with two structure-preserving regularizers: (i) a skeleton-based loss to maintain global vascular connectivity and (ii) an endpoint-aware loss to stabilize local termini. These constraints guide learning in the unpaired setting, reducing noise while preserving vessel structure. Experimental results on a synthetic degradation benchmark and downstream evaluations in vessel and lesion segmentation demonstrate the superiority of the proposed method over several state-of-the-art baselines. The code is available at https://github.com/Retinal-Research/VAOT

[278] NI-Tex: Non-isometric Image-based Garment Texture Generation

Hui Shan,Ming Li,Haitao Yang,Kai Zheng,Sizhe Zheng,Yanwei Fu,Xiangru Huang

Main category: cs.CV

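TL;DR: Proposes a 3D garment texture generation method for non-isometric images. It uses a 3D garment video dataset and Nano Banana to achieve high-quality cross-pose and cross-topology texture generation, and fuses multi-view predictions through an uncertainty-guided iterative baking method to produce PBR materials suitable for industry-level 3D garment design.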

Details Motivation: Existing image-based 3D garment texture generation methods require the input image and mesh to be consistent in topology or pose, so they struggle with the diverse, non-isometric deformations found in real industrial settings. Method: Builds a physically simulated 3D garment video dataset that provides cross-pose geometry and material supervision; uses Nano Banana for high-quality non-isometric image editing to support cross-topology texture generation; and proposes an iterative baking method, based on uncertainty-guided view selection and reweighting, that fuses multi-view predictions into seamless PBR textures. Result: Experiments show that the feedforward dual-branch architecture generates diverse, spatially aligned PBR materials, outperforms existing methods under non-isometric conditions, and supports industry-level 3D garment design. Conclusion: The method addresses the challenge of 3D garment texture generation from non-isometric images, achieving flexible, high-quality cross-pose and cross-topology texture generation with practical industrial value. Abstract: Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match the image poses, which significantly constrains the texture generation quality and flexibility. To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.

[279] Unsupervised Multi-View Visual Anomaly Detection via Progressive Homography-Guided Alignment

Xintao Chen,Xiaohao Xu,Bozhong Zheng,Yun Liu,Yingna Wu

Main category: cs.CV

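TL;DR: This paper proposes ViewSense-AD (VSAD), a new framework for unsupervised visual anomaly detection on multi-view images. It learns viewpoint-invariant feature representations by explicitly modeling geometric consistency across views, significantly improving detection under complex textures and large viewpoint changes.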

Details Motivation: Existing methods usually treat multi-view images as a set of independent images and ignore appearance differences caused by viewpoint changes, which leads to inconsistent feature representations and high false-positive rates. A method that can distinguish genuine defects from normal viewpoint variation is therefore needed. Method: Proposes the ViewSense-AD (VSAD) framework. Its core is the Multi-View Alignment Module (MVAM), which uses homography to project and align feature regions between neighboring views. MVAM is integrated into a latent-diffusion-based model (VALDM) to perform multi-stage alignment during denoising, and a Fusion Refiner Module (FRM) is added to strengthen global consistency. Anomalies are detected by comparing multi-level features against a memory bank of normal prototypes. Result: Extensive experiments on the RealIAD and MANTA datasets show that VSAD significantly outperforms existing methods at pixel-, view-, and sample-level anomaly detection, especially under large viewpoint changes and complex textures. Conclusion: By explicitly modeling geometric consistency across views, VSAD learns viewpoint-invariant features, markedly reduces false positives, and sets a new state of the art in multi-view unsupervised anomaly detection. Abstract: Unsupervised visual anomaly detection from multi-view images presents a significant challenge: distinguishing genuine defects from benign appearance variations caused by viewpoint changes. Existing methods, often designed for single-view inputs, treat multiple views as a disconnected set of images, leading to inconsistent feature representations and a high false-positive rate. To address this, we introduce ViewSense-AD (VSAD), a novel framework that learns viewpoint-invariant representations by explicitly modeling geometric consistency across views. At its core is our Multi-View Alignment Module (MVAM), which leverages homography to project and align corresponding feature regions between neighboring views. We integrate MVAM into a View-Align Latent Diffusion Model (VALDM), enabling progressive and multi-stage alignment during the denoising process. This allows the model to build a coherent and holistic understanding of the object's surface from coarse to fine scales. Furthermore, a lightweight Fusion Refiner Module (FRM) enhances the global consistency of the aligned features, suppressing noise and improving discriminative power. Anomaly detection is performed by comparing multi-level features from the diffusion model against a learned memory bank of normal prototypes.
Extensive experiments on the challenging RealIAD and MANTA datasets demonstrate that VSAD sets a new state of the art, significantly outperforming existing methods in pixel-, view-, and sample-level visual anomaly detection and proving its robustness to large viewpoint shifts and complex textures.
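
The VSAD entry relies on homography to map feature locations from one view into a neighboring view. As a minimal sketch of that geometric step only (the matrices below are made up for illustration, and this is not the MVAM implementation), a 3x3 homography can be applied to a 2D point in homogeneous coordinates as follows:

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (u / w, v / w)


# Hypothetical homography: a pure shift of (+2, +3) pixels between two views.
H_shift = [[1, 0, 2],
           [0, 1, 3],
           [0, 0, 1]]
print(apply_homography(H_shift, (1, 1)))
```

Because a homography is defined only up to scale, multiplying every entry of the matrix by the same non-zero constant leaves the mapped point unchanged.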

[280] Rethinking Garment Conditioning in Diffusion-based Virtual Try-On

Kihyun Na,Jinyoung Choi,Injung Kim

Main category: cs.CV

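TL;DR: Proposes Re-CatVTON, an efficient single-UNet virtual try-on model that maintains high performance while significantly reducing computation and memory overhead.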

Details Motivation: Existing diffusion-based dual-UNet virtual try-on models perform well but are costly in computation and memory, so a more efficient model is needed. Method: Through visualization and theoretical analysis, derives three hypotheses about learning context features, builds the single-UNet Re-CatVTON model on them, and introduces a modified classifier-free guidance strategy and a mechanism for injecting the ground-truth garment latent. Result: Re-CatVTON outperforms its predecessor CatVTON on FID, KID, and LPIPS and approaches the dual-UNet model Leffa while using less computation and memory, with only a slight drop in SSIM. Conclusion: Re-CatVTON achieves a better efficiency-performance trade-off and sets a new benchmark for single-UNet virtual try-on models. Abstract: Virtual Try-On (VTON) is the task of synthesizing an image of a person wearing a target garment, conditioned on a person image and a garment image. While diffusion-based VTON models featuring a Dual UNet architecture demonstrate superior fidelity compared to single UNet models, they incur substantial computational and memory overhead due to their heavy structure. In this study, through visualization analysis and theoretical analysis, we derived three hypotheses regarding the learning of context features to condition the denoising process. Based on these hypotheses, we developed Re-CatVTON, an efficient single UNet model that achieves high performance. We further enhance the model by introducing a modified classifier-free guidance strategy tailored for VTON's spatial concatenation conditioning, and by directly injecting the ground-truth garment latent derived from the clean garment latent to prevent the accumulation of prediction error. The proposed Re-CatVTON significantly improves performance compared to its predecessor (CatVTON) and requires less computation and memory than the high-performance Dual UNet model, Leffa. Our results demonstrate improved FID, KID, and LPIPS scores, with only a marginal decrease in SSIM, establishing a new efficiency-performance trade-off for single UNet VTON models.
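
For context on the guidance strategy mentioned in the Re-CatVTON entry, the sketch below shows standard classifier-free guidance, which combines an unconditional and a conditional noise prediction. It is the generic baseline formula, not the modified VTON-specific variant the abstract refers to (the abstract does not spell that variant out), and the toy vectors are invented for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: push the prediction away from the unconditional output.

    eps_uncond and eps_cond are flat lists holding the two noise predictions;
    scale = 1 returns the conditional prediction unchanged.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values) with a guidance scale of 2.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 2.0)
print(guided)
```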

[281] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma,Minghong Cai,Yilei Jiang,Jiaming Han,Yi Feng,Yingshui Tan,Xiaoyong Zhu,Bo Zhang,Bo Zheng,Xiangyu Yue

Main category: cs.CV

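TL;DR: This paper proposes ConceptGuard, a unified safeguard framework for safety risks in multimodal video generation that proactively identifies and mitigates potentially unsafe content through contrastive detection and semantic suppression.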

Details Motivation: Existing safety methods are mostly limited to the text modality, depend on known risk categories, or serve only as post-generation auditing tools, so they struggle with the compositional safety risks that arise from multimodal interaction. Method: ConceptGuard has two stages: a contrastive detection module first maps fused image-text inputs into a structured concept space to identify latent risks; a semantic suppression mechanism then intervenes in the multimodal prompt conditioning to steer generation away from unsafe concepts. Result: Experiments on the newly built ConceptRisk dataset and the T2VSafetyBench-TI2V benchmark show that ConceptGuard outperforms existing baselines in both risk detection and safe video generation. Conclusion: ConceptGuard offers an effective proactive safeguard for text-and-image-to-video generation and enables early intervention against harmful content in complex multimodal scenarios. Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

[282] A Novel Dual-Stream Framework for dMRI Tractography Streamline Classification with Joint dMRI and fMRI Data

Haotian Yan,Bocheng Guo,Jianzhong He,Nir A. Sochen,Ofer Pasternak,Lauren J O'Donnell,Fan Zhang

Main category: cs.CV

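TL;DR: Proposes a dual-stream streamline classification framework that combines dMRI and fMRI data to improve the functional coherence of white matter tract parcellation.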

Details Motivation: Existing streamline classification methods based on geometric features struggle to distinguish tracts that follow similar pathways but differ in function, so functional information is needed to improve classification accuracy. Method: Designs a two-network architecture: a backbone network (based on a pretrained model) processes full streamline trajectories, and an auxiliary network analyzes fMRI signals from fiber endpoint regions; the two jointly classify streamlines. Result: The method is validated on parcellating the corticospinal tract (CST) into its four somatotopic subdivisions; ablation studies and comparisons with state-of-the-art methods show superior performance. Conclusion: A dual-stream framework that combines structural and functional information improves the functional coherence of streamline classification and outperforms methods that rely on geometric features alone. Abstract: Streamline classification is essential to identify anatomically meaningful white matter tracts from diffusion MRI (dMRI) tractography. However, current streamline classification methods rely primarily on the geometric features of the streamline trajectory, failing to distinguish between functionally distinct fiber tracts with similar pathways. To address this, we introduce a novel dual-stream streamline classification framework that jointly analyzes dMRI and functional MRI (fMRI) data to enhance the functional coherence of tract parcellation. We design a novel network that performs streamline classification using a pretrained backbone model for full streamline trajectories, while augmenting with an auxiliary network that processes fMRI signals from fiber endpoint regions. We demonstrate our method by parcellating the corticospinal tract (CST) into its four somatotopic subdivisions. Experimental results from ablation studies and comparisons with state-of-the-art methods demonstrate our approach's superior performance.

[283] STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution

Junyang Chen,Jiangxin Dong,Long Sun,Yixin Yang,Jinshan Pan

Main category: cs.CV

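TL;DR: This paper proposes STCDiT, a video super-resolution framework built on a pre-trained video diffusion model. Using motion-aware segment-wise reconstruction and anchor-frame guidance, it restores structurally faithful, temporally stable, high-quality video under complex camera motion.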

Details Motivation: Existing video super-resolution methods struggle to maintain both temporal stability and structural fidelity under complex camera motion; this work aims to solve the problem with a pre-trained diffusion model. Method: Proposes a motion-aware VAE reconstruction method that splits the video into segments by motion characteristics, and introduces the anchor-frame latent (the first-frame latent) as structural guidance to strengthen spatial structural consistency during generation. Result: Experiments show that STCDiT outperforms current state-of-the-art methods in structural fidelity and temporal consistency, particularly in scenes with complex motion. Conclusion: With segment-wise reconstruction and anchor-frame guidance, STCDiT improves diffusion-based video super-resolution, balancing temporal stability with recovery of structural detail. Abstract: We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.

[284] Understanding Task Transfer in Vision-Language Models

Bhuvan Sachdeva,Karan Uppal,Abhinav Java,Vineeth N. Balasubramanian

Main category: cs.CV

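TL;DR: This paper studies how fine-tuning transfers across perception tasks in vision-language models (VLMs). It proposes a new metric, PGF, to quantify positive and negative transfer between tasks, and its experiments reveal the relational structure among tasks, offering guidance for efficient VLM training.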

Details Motivation: VLMs perform well on multimodal benchmarks but still lag behind humans and specialized models on perception tasks such as depth estimation and object counting; fine-tuning on one task can unpredictably affect performance on others, which makes task-specific fine-tuning difficult. Method: Systematically studies task transferability and introduces the Perfection Gap Factor (PGF) as a new metric for how fine-tuning on one perception task affects zero-shot performance on others; builds a task-transfer graph using three open-weight VLMs across 13 perception tasks. Result: Finds patterns of positive and negative transfer among perception tasks, identifies groups of mutually influencing tasks, organizes tasks into "personas" by transfer behavior, and shows that PGF can guide more efficient data selection. Conclusion: There are substantial opportunities for positive transfer and risks of negative interference between tasks; PGF can guide the design of VLM training strategies and advance VLMs on complex perception tasks. Abstract: Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior, and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

[285] StereoDETR: Stereo-based Transformer for 3D Object Detection

Shiyi Mu,Zichong Gu,Zhiqi Ai,Anqi Liu,Yilin Gao,Shugong Xu

Main category: cs.CV

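TL;DR: This paper proposes StereoDETR, an efficient DETR-based stereo 3D object detection framework. With a two-branch monocular-plus-stereo structure and a differentiable depth sampling strategy, it achieves real-time inference and new state-of-the-art results for pedestrian and cyclist detection on KITTI.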

Details Motivation: Stereo 3D detection is more accurate than monocular detection but has higher computational cost and latency; the most accurate existing methods are slow, so a stereo detection framework that is both accurate and efficient is needed. Method: Proposes StereoDETR, which comprises a monocular DETR branch (predicting scale, orientation, and sampling points) and a stereo branch (using low-cost multi-scale disparity features to predict object-level depth maps). The two branches are coupled through a differentiable depth sampling strategy, and a constrained supervision strategy that needs no extra annotations handles occlusion. Result: StereoDETR runs in real time and is the first stereo detection method to surpass monocular methods in speed; it reaches competitive accuracy on the KITTI benchmark and sets new state-of-the-art results on the pedestrian and cyclist subsets. Conclusion: StereoDETR balances accuracy and efficiency in stereo 3D detection, makes a stereo method faster than monocular methods for the first time, and improves detection on key categories, advancing the practical use of stereo detection. Abstract: Compared to monocular 3D object detection, stereo-based 3D methods offer significantly higher accuracy but still suffer from high computational overhead and latency. The state-of-the-art stereo 3D detection method achieves twice the accuracy of monocular approaches, yet its inference speed is only half as fast. In this paper, we propose StereoDETR, an efficient stereo 3D object detection framework based on DETR. StereoDETR consists of two branches: a monocular DETR branch and a stereo branch. The DETR branch is built upon 2D DETR with additional channels for predicting object scale, orientation, and sampling points. The stereo branch leverages low-cost multi-scale disparity features to predict object-level depth maps. These two branches are coupled solely through a differentiable depth sampling strategy. To handle occlusion, we introduce a constrained supervision strategy for sampling points without requiring extra annotations. StereoDETR achieves real-time inference and is the first stereo-based method to surpass monocular approaches in speed. It also achieves competitive accuracy on the public KITTI benchmark, setting new state-of-the-art results on pedestrian and cyclist subsets. The code is available at https://github.com/shiyi-mu/StereoDETR-OPEN.
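
The StereoDETR entry connects disparity features to depth. As background, the textbook relation for a rectified stereo pair is depth = focal length x baseline / disparity. The sketch below illustrates only that relation; it is not necessarily how StereoDETR's stereo branch computes its object-level depth maps, and the camera parameters are hypothetical.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 0.5 m baseline.
for d in (70.0, 35.0, 7.0):
    z = disparity_to_depth(d, focal_length_px=700.0, baseline_m=0.5)
    print(f"disparity {d:5.1f} px -> depth {z:5.1f} m")
```

Since depth is inversely proportional to disparity, a fixed disparity error costs far more depth accuracy for distant objects than for nearby ones.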

[286] Scale What Counts, Mask What Matters: Evaluating Foundation Models for Zero-Shot Cross-Domain Wi-Fi Sensing

Cheng Jiang,Yihe Yan,Yanxiang Wang,Chun Tung Chou,Wen Hu

Main category: cs.CV

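TL;DR: This paper proposes a foundation-model approach to Wi-Fi CSI sensing based on large-scale pretraining. Using masked autoencoding (MAE) pretraining on 14 heterogeneous datasets, it systematically examines how data diversity and model capacity affect cross-domain performance, finds that data scale and diversity are the key bottleneck for generalization, and significantly improves cross-domain accuracy on multiple tasks.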

Details Motivation: Existing Wi-Fi sensing models generalize poorly to new environments, devices, or users because of domain shift, and fragmented public datasets limit robustness; a more general modeling approach is needed for practical deployment. Method: Trains a foundation model with MAE-style pretraining on a large heterogeneous Wi-Fi CSI collection of 1.3 million samples from 14 datasets, 4 devices, and multiple frequency bands and bandwidths, and systematically evaluates the effect of data diversity and model capacity on cross-domain performance. Result: Cross-domain performance improves log-linearly with the amount of pretraining data; at the current data scale, larger models bring only marginal gains; on human activity recognition, gesture recognition, and user identification, large-scale pretraining raises cross-domain accuracy by 2.2% to 15.7% over a supervised learning baseline. Conclusion: Data scale and diversity are the key factors for cross-domain generalization in Wi-Fi sensing; the current bottleneck is data rather than model capacity, and large-scale pretraining offers an effective path toward robust, deployable Wi-Fi sensing systems. Abstract: While Wi-Fi sensing offers a compelling, privacy-preserving alternative to cameras, its practical utility has been fundamentally undermined by a lack of robustness across domains. Models trained in one setup fail to generalize to new environments, hardware, or users, a critical "domain shift" problem exacerbated by modest, fragmented public datasets. We shift from this limited paradigm and apply a foundation model approach, leveraging Masked Autoencoding (MAE) style pretraining on the largest and most heterogeneous Wi-Fi CSI dataset collection assembled to date. Our study pretrains and evaluates models on over 1.3 million samples extracted from 14 datasets, collected using 4 distinct devices across the 2.4/5/6 GHz bands and bandwidths from 20 to 160 MHz. Our large-scale evaluation is the first to systematically disentangle the impacts of data diversity versus model capacity on cross-domain performance. The results establish scaling trends on Wi-Fi CSI sensing. First, our experiments show log-linear improvements in unseen domain performance as the amount of pretraining data increases, suggesting that data scale and diversity are key to domain generalization. Second, based on the current data volume, larger models provide only marginal gains for cross-domain performance, indicating that data, rather than model capacity, is the current bottleneck for Wi-Fi sensing generalization. Finally, we conduct a series of cross-domain evaluations on human activity recognition, human gesture recognition and user identification tasks.
The results show that large-scale pretraining improves cross-domain accuracy by 2.2% to 15.7% compared to the supervised learning baseline. Overall, our findings provide insightful direction for designing future Wi-Fi sensing systems that can eventually be robust enough for real-world deployment.
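
The "log-linear" scaling trend in the Wi-Fi sensing entry means accuracy grows roughly linearly in the logarithm of the pretraining set size. A minimal sketch of fitting such a trend by ordinary least squares follows; the data points are synthetic and chosen only to illustrate the fit, not taken from the paper.

```python
import math


def fit_log_linear(sample_counts, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(samples)."""
    if len(sample_counts) != len(accuracies) or len(sample_counts) < 2:
        raise ValueError("need at least two matching (count, accuracy) pairs")
    xs = [math.log10(n) for n in sample_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sample counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example: each tenfold increase in data adds 10 accuracy points.
intercept, slope = fit_log_linear([1e3, 1e4, 1e5], [50.0, 60.0, 70.0])
print(f"accuracy ~ {intercept:.1f} + {slope:.1f} * log10(samples)")
```

The slope reads directly as "accuracy points gained per tenfold increase in pretraining data", which is a convenient way to compare scaling curves across tasks.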

[287] PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

Yichen Yang,Hong Li,Haodong Zhu,Linin Yang,Guojun Lei,Sheng Xu,Baochang Zhang

Main category: cs.CV

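TL;DR: Proposes PartDiffuser, a semi-autoregressive diffusion framework that generates high-quality 3D meshes using autoregression between parts and parallel diffusion within each part.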

Details Motivation: Existing autoregressive methods struggle to balance global structural consistency with local detail fidelity and are prone to error accumulation. Method: After semantically segmenting the mesh, generation proceeds part by part: autoregression between parts ensures global topology, and a parallel discrete diffusion process within each part reconstructs high-frequency geometric features. The model is based on the DiT architecture and introduces a part-aware cross-attention mechanism that uses point clouds as hierarchical geometric conditioning to dynamically control generation. Result: Experiments show that the method significantly outperforms current state-of-the-art models at generating richly detailed 3D meshes. Conclusion: PartDiffuser decouples global and local generation, recovering local detail precisely while preserving overall structure, and is suitable for practical applications. Abstract: Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.

[288] TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging

Qinglei Cao,Ziyao Tang,Xiaoqin Tang

Main category: cs.CV

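TL;DR: Proposes a target-prior-based 3D CT reconstruction framework that derives a target prior from projection data to strengthen implicit learning, significantly improving reconstruction accuracy and learning efficiency under ultra-sparse views.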

Details Motivation: Existing implicit 3D reconstruction methods ignore anatomical priors, which limits reconstruction accuracy and learning efficiency under ultra-sparse views. Method: Introduces a target prior extracted from projection data, combines positional and structural encoding for voxel-wise implicit reconstruction, and uses a CUDA algorithm to quickly estimate high-quality 3D target priors. Result: Experiments on an abdominal dataset show a tenfold improvement in learning efficiency over NAF and PSNR gains over NeRP of 3.57 dB, 5.42 dB, and 5.70 dB with 10, 20, and 30 projections, respectively. Conclusion: By incorporating a target prior, the method significantly improves the quality and efficiency of sparse-view CT reconstruction and offers a new direction for implicit neural representations. Abstract: X-ray imaging, based on penetration, enables detailed visualization of internal structures. Building on this capability, existing implicit 3D reconstruction methods have adapted the NeRF model and its variants for internal CT reconstruction. However, these approaches often neglect the significance of objects' anatomical priors for implicit learning, limiting both reconstruction precision and learning efficiency, particularly in ultra-sparse view scenarios. To address these challenges, we propose a novel 3D CT reconstruction framework that employs a 'target prior' derived from the object's projection data to enhance implicit learning. Our approach integrates positional and structural encoding to facilitate voxel-wise implicit reconstruction, utilizing the target prior to guide voxel sampling and enrich structural encoding. This dual strategy significantly boosts both learning efficiency and reconstruction quality. Additionally, we introduce a CUDA-based algorithm for rapid estimation of high-quality 3D target priors from sparse-view projections. Experiments utilizing projection data from a complex abdominal dataset demonstrate that the proposed model substantially enhances learning efficiency, outperforming the current leading model, NAF, by a factor of ten. In terms of reconstruction quality, it also exceeds the most accurate model, NeRP, achieving PSNR improvements of 3.57 dB, 5.42 dB, and 5.70 dB with 10, 20, and 30 projections, respectively. The code is available at https://github.com/qlcao171/TPG-INR.
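
The TPG-INR entry reports its quality gains in PSNR. For readers less familiar with the metric, the sketch below computes PSNR from mean squared error and converts a PSNR gain in dB into the corresponding MSE reduction factor. The toy signals are invented, and the conversion assumes both reconstructions are scored against the same reference with the same peak value.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (gain_db / 10)


# Toy example: a constant error of 0.1 on a unit-range signal.
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
print(mse_ratio_from_psnr_gain(3.57))
```

Read this way, the reported 3.57 dB gain with 10 projections corresponds to a mean squared error roughly 2.3 times lower than the baseline's.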

[289] Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache

Yuqiu Jiang,Xiaozhen Qiao,Tianyu Mei,Haojian Huang,Yifan Chen,Ye Zheng,Zhe Sun

Main category: cs.CV

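TL;DR: Proposes the Adaptive Diversity Cache (ADC), a training-free module that mitigates bias against rare classes in human-object interaction detection under long-tailed distributions.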

Details Motivation: Existing vision-language-model-based methods rely on additional training or prompt tuning, which is computationally expensive, and they perform poorly on rare interactions in long-tailed scenarios. Method: Designs a class-specific cache of high-confidence, diverse features together with a frequency-aware cache update strategy that dynamically strengthens the representation of rare classes at inference time, with no fine-tuning. Result: Performance improves significantly on HICO-DET and V-COCO, with gains of up to 8.57% mAP on rare categories and 4.39% mAP overall. Conclusion: ADC is an effective plug-and-play, training-free method that mitigates the long-tail problem in HOI detection and is general and practical. Abstract: Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in vision-language models (VLMs) have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates frequency-aware cache adaptation that favors rare categories and is designed to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, achieving up to +8.57% mAP gain on rare categories and +4.39% on the full dataset, demonstrating its effectiveness in mitigating long-tail bias while preserving overall performance.
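
To make the idea of a class-specific, confidence- and diversity-filtered cache more concrete, here is a loose sketch. Everything in it is an assumption made for illustration: the `DiversityCache` class, the confidence threshold, the cosine-similarity diversity test, and the rule that gives rarer classes more slots are all hypothetical and are not taken from the ADC paper.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class DiversityCache:
    """Hypothetical per-class feature cache filled during inference."""

    def __init__(self, class_counts, base_capacity=2,
                 min_confidence=0.8, max_similarity=0.95):
        self.class_counts = dict(class_counts)  # training-set frequency per class
        self.base_capacity = base_capacity
        self.min_confidence = min_confidence
        self.max_similarity = max_similarity
        self.entries = defaultdict(list)  # label -> [(confidence, feature)]

    def capacity(self, label):
        # Assumed rule: one extra block of slots per order of magnitude of rarity.
        most_common = max(self.class_counts.values())
        count = max(self.class_counts.get(label, 1), 1)
        return self.base_capacity * (1 + round(math.log10(most_common / count)))

    def add(self, label, feature, confidence):
        """Store the feature if it is confident and not a near-duplicate."""
        if confidence < self.min_confidence:
            return False
        bucket = self.entries[label]
        if any(cosine(feature, kept) > self.max_similarity for _, kept in bucket):
            return False
        bucket.append((confidence, feature))
        bucket.sort(key=lambda entry: entry[0], reverse=True)
        del bucket[self.capacity(label):]  # keep only the most confident entries
        return any(kept is feature for _, kept in bucket)
```

In a sketch like this, cached features would later be compared with new detections to adjust their class scores; that calibration step is omitted because the abstract does not describe how it works.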

[290] DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Jiawei Hou,Shenghao Zhang,Can Wang,Zheng Gu,Yonggen Ling,Taiping Zeng,Xiangyang Xue,Jingbo Zhang

Main category: cs.CV

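TL;DR: This paper introduces DA4D, a large-scale 4D detection dataset, and DetAny4D, an open-set end-to-end framework that predicts 3D bounding boxes directly from sequential inputs.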

Details Motivation: Existing open-set 4D object detection methods usually predict frame by frame without modeling temporal consistency, or rely on complex multi-stage pipelines prone to error propagation, and they are constrained by the lack of large-scale datasets with continuous, high-quality 3D bounding box annotations. Method: Building on the DA4D dataset, proposes the DetAny4D framework, which fuses multi-modal features from pre-trained foundation models and uses a geometry-aware spatiotemporal decoder to capture spatial and temporal dynamics; a multi-task learning architecture with a dedicated training strategy maintains global consistency across sequences of varying lengths. Result: Experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, mitigating the long-standing jitter and inconsistency problems in 4D object detection. Conclusion: Through end-to-end learning and modeling of spatiotemporal consistency, DetAny4D delivers more reliable and stable open-set 4D object detection and advances the field. Abstract: Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.

[291] SupLID: Geometrical Guidance for Out-of-Distribution Detection in Semantic Segmentation

Nimeshika Udayangani,Sarah Erfani,Christopher Leckie

Main category: cs.CV

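TL;DR: This paper proposes SupLID, a new framework for pixel-level out-of-distribution (OOD) detection in semantic segmentation that uses Linear Intrinsic Dimensionality (LID) to capture geometric structure and improve existing OOD detection based on classifier confidence.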

Details Motivation: Existing image-level OOD methods suffer from problems such as overconfidence on pixel-level tasks and cope poorly with complex real-world scenes (e.g., autonomous driving), so more robust, fine-grained pixel-level OOD detection is needed. Method: Proposes the SupLID framework, which builds a geometric coreset to model the intrinsic structure of the in-distribution data and computes OOD scores at the superpixel level, using LID to analyze the local structure of high-dimensional data as a signal that complements classifier confidence (e.g., energy or entropy). Result: SupLID significantly improves existing methods on key metrics (AUR, FPR, AUP), reaches state-of-the-art performance, and supports plug-and-play use with any segmentation model and real-time inference. Conclusion: Geometric structure effectively complements classifier confidence and plays an important role in pixel-level OOD detection; as a training-free post-hoc method, SupLID is general and practical. Abstract: Out-of-Distribution (OOD) detection in semantic segmentation aims to localize anomalous regions at the pixel level, advancing beyond traditional image-level OOD techniques to better suit real-world applications such as autonomous driving. Recent literature has successfully explored the adaptation of commonly used image-level OOD methods--primarily based on classifier-derived confidence scores (e.g., energy or entropy)--for this pixel-precise task. However, these methods inherit a set of limitations, including vulnerability to overconfidence. In this work, we introduce SupLID, a novel framework that effectively guides classifier-derived OOD scores by exploiting the geometrical structure of the underlying semantic space, particularly using Linear Intrinsic Dimensionality (LID). While LID effectively characterizes the local structure of high-dimensional data by analyzing distance distributions, its direct application at the pixel level remains challenging. To overcome this, SupLID constructs a geometrical coreset that captures the intrinsic structure of the in-distribution (ID) subspace. It then computes OOD scores at the superpixel level, enabling both efficient real-time inference and improved spatial smoothness. We demonstrate that geometrical cues derived from SupLID serve as a complementary signal to traditional classifier confidence, enhancing the model's ability to detect diverse OOD scenarios. Designed as a post-hoc scoring method, SupLID can be seamlessly integrated with any semantic segmentation classifier at deployment time.
Our results demonstrate that SupLID significantly enhances existing classifier-based OOD scores, achieving state-of-the-art performance across key evaluation metrics, including AUR, FPR, and AUP. Code is available at https://github.com/hdnugit/SupLID.

[292] Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

Siyuan Wei,Chunjie Wang,Xiao Liu,Xiaosheng Yan,Zhishan Zhou,Rui Huang

Main category: cs.CV

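TL;DR: Proposes a fully automated data generation pipeline for 3D multimodal large language models and builds Disc3D, a large-scale, high-quality 3D scene dialogue dataset that significantly improves the performance of 3D MLLMs.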

Details Motivation: The development of 3D multimodal large language models is limited by the lack of large-scale, high-quality 3D scene-dialogue datasets; existing data suffers from viewpoint ambiguity and referring ambiguity and depends on expensive human annotation. Method: Designs a four-stage automated pipeline (meta-annotation collection, scene graph construction with relation correction, discriminative object referring expression generation, and multi-task dialogue data generation) that combines rule-based methods, 2D MLLMs, and LLMs for controllable, scalable data generation. Result: Builds the Disc3D dataset of more than 2 million samples covering 25K hybrid 3D scenes and supporting multiple tasks; experiments show that models trained with Disc3D improve significantly on public benchmarks and the custom Disc3D-QA tasks. Conclusion: The automated pipeline addresses the scarcity and low quality of data for 3D MLLMs, and the resulting Disc3D dataset advances 3D multimodal language models. Abstract: 3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.

[293] DiP: Taming Diffusion Models in Pixel Space

Zhennan Chen,Junwei Zhu,Xu Chen,Jiangning Zhang,Xiaobin Hu,Hanzhen Zhao,Chengjie Wang,Jian Yang,Ying Tai

Main category: cs.CV

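TL;DR: DiP is an efficient pixel-space diffusion framework that decouples generation into a global and a local stage, matching the computational efficiency of LDMs without relying on a VAE and reaching an FID of 1.90 on ImageNet 256×256.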

Details Motivation: Diffusion models trade off generation quality against computational efficiency. Latent diffusion models (LDMs) are efficient but can lose information and are not trained end to end, while pixel-space models avoid the VAE but are too costly for high-resolution synthesis. Method: The DiP framework uses a Diffusion Transformer (DiT) on large patches to build global structure and a lightweight Patch Detailer Head to restore local detail, so that the two stages generate cooperatively. Result: Inference is up to 10× faster than previous methods with only a 0.3% increase in total parameters, and the FID on ImageNet 256×256 is 1.90. Conclusion: Without relying on a VAE, DiP balances generation quality and computational efficiency and makes pixel-space diffusion models considerably more practical. Abstract: Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and reaches a 1.90 FID score on ImageNet 256$\times$256.

[294] VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Fufangchen Zhao,Liao Zhang,Daiqi Shi,Yuanjun Gao,Chen Ye,Yang Cai,Jian Gao,Danfeng Yan

Main category: cs.CV

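TL;DR: VideoPerceiver is a new video multimodal large language model that uses a two-stage training framework to improve understanding of fine-grained actions and rare transient events.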

Details Motivation: Existing video multimodal large language models handle brief actions and rare events in long videos poorly and struggle to capture fine-grained dynamics. Method: Training has two stages. The first constructs "key-information-missing" videos and uses an auxiliary contrastive loss to align intermediate visual representations with keywords. The second uses reinforcement learning with a relative reward so that the model learns to recover temporally precise action details from the complete video. The authors also curate a dataset of 80,000 videos. Result: Experiments show that VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning while maintaining strong performance on standard tasks. Conclusion: By emphasizing task-relevant visual features, VideoPerceiver redefines video-language model training for fine-grained perception. Abstract: We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.
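
The "key-information-missing" construction in the VideoPerceiver abstract (key frames replaced with adjacent frames) can be sketched roughly as follows. This is only an illustration: the function name `make_key_info_missing`, the nearest-neighbour replacement rule, and the assumption that key-frame indices are already known are all hypothetical, since the abstract does not say how the adjacent frame is chosen.

```python
def make_key_info_missing(frames, key_indices):
    """Return a copy of `frames` where each key frame is replaced by the
    nearest frame that is not itself a key frame (assumed replacement rule)."""
    key = set(key_indices)
    degraded = list(frames)
    for i in sorted(key):
        replaced = False
        # Search outward from the key frame, preferring the earlier neighbour.
        for offset in range(1, len(frames)):
            for j in (i - offset, i + offset):
                if 0 <= j < len(frames) and j not in key:
                    degraded[i] = frames[j]
                    replaced = True
                    break
            if replaced:
                break
    return degraded


frames = ["f0", "f1", "f2", "f3", "f4"]
print(make_key_info_missing(frames, key_indices=[2]))
```

The degraded clip keeps its original length, so the original and modified videos can be encoded side by side as the abstract describes.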

[295] Q-Save: Towards Scoring and Attribution for Generated Video Evaluation

Xiele Wu,Zicheng Zhang,Mingtao Chen,Yixian Liu,Yiming Liu,Shushi Wang,Zhichao Hu,Yuhong Liu,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

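TL;DR: This paper presents Q-Save, a new benchmark dataset and model for AI-generated video quality assessment with multi-dimensional annotations and interpretable evaluation.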

Details Motivation: Existing quality assessment methods for AI-generated video lack fine-grained, interpretable, multi-dimensional criteria and cannot fully reflect generation quality. Method: The authors build a dataset of nearly 10,000 videos annotated with MOS scores and fine-grained labels along three dimensions: visual quality, dynamic quality, and text-video alignment. They propose a unified evaluation model based on the SlowFast framework that processes slow frames at high resolution and fast frames at low resolution, trained on Chain-of-Thought (COT) formatted data with a multi-stage strategy (SFT-GRPO-SFT). Result: The model achieves state-of-the-art video quality prediction while providing interpretable, human-aligned justifications for its scores. Conclusion: Q-Save provides an interpretable, multi-dimensional foundation for quality assessment of generated video and contributes to multimodal generation and trustworthy AI. Abstract: We present Q-Save, a new benchmark dataset and model for holistic and explainable evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels along three core dimensions: visual quality, dynamic quality, and text-video alignment. These multi-aspect annotations enable both accurate quality assessment and interpretable reasoning behind the scores. To leverage this data, we propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation. The model adopts the SlowFast framework to distinguish between fast frames and slow frames: slow frames are processed with high resolution while fast frames use low resolution, balancing evaluation accuracy and computational efficiency. For training, we use data formatted in Chain-of-Thought (COT) style and employ a multi-stage strategy: we first conduct Supervised Fine-Tuning (SFT), then further enhance the model with Group Relative Policy Optimization (GRPO), and finally perform SFT again to improve model stability. Experimental results demonstrate that our model achieves state-of-the-art performance in video quality prediction while also providing human-aligned, interpretable justifications. Our dataset and model establish a strong foundation for explainable evaluation in generative video research, contributing to the development of multimodal generation and trustworthy AI. Code and dataset will be released upon publication.

[296] Uncertainty-Aware Dual-Student Knowledge Distillation for Efficient Image Classification

Aakash Gore,Anoushka Dey,Aryan Mishra

Main category: cs.CV

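TL;DR: Proposes a dual-student knowledge distillation framework driven by the teacher's prediction uncertainty, in which two heterogeneous student models learn collaboratively to improve model compression.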

Details Motivation: Traditional knowledge distillation treats all teacher predictions equally and ignores the teacher's prediction uncertainty, which limits what the student can learn. Method: An uncertainty-aware dual-student distillation framework uses the teacher's prediction uncertainty to selectively guide student learning, with ResNet-18 and MobileNetV2 as heterogeneous students that learn collaboratively. Result: On ImageNet-100 the method outperforms traditional distillation: ResNet-18 reaches 83.84% top-1 accuracy and MobileNetV2 reaches 81.46%, improvements of 2.04% and 0.92% respectively. Conclusion: The framework makes effective use of teacher uncertainty and dual-student collaboration to improve knowledge distillation. Abstract: Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84\% top-1 accuracy and MobileNetV2 achieving 81.46\% top-1 accuracy, representing improvements of 2.04\% and 0.92\% respectively over traditional single-student distillation approaches.
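
One way to picture "leveraging teacher prediction uncertainty to selectively guide student learning" is to scale the usual distillation loss by the teacher's confidence. The sketch below is an assumption, not the paper's formulation: the abstract does not specify the uncertainty measure or weighting rule, so the entropy-based weight and every name here (`uncertainty_weighted_kd`, `normalized_entropy`) are hypothetical, and the peer-learning term between the two students is left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the value lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def uncertainty_weighted_kd(teacher_logits, student_logits, temperature=4.0):
    """Standard temperature-scaled KD loss, scaled by an ASSUMED confidence
    weight of 1 - normalized teacher entropy."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    weight = 1.0 - normalized_entropy(t)
    return weight * temperature ** 2 * kl_divergence(t, s)


# A confident teacher contributes more to the loss than an uncertain one.
confident = uncertainty_weighted_kd([10.0, 0.0, 0.0], [0.0, 1.0, 0.0])
uncertain = uncertainty_weighted_kd([0.1, 0.0, 0.0], [0.0, 1.0, 0.0])
print(confident > uncertain)
```

With this weighting, samples on which the teacher is close to uniform contribute almost nothing to the distillation term, which is one simple reading of "selective" guidance.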

[297] Leveraging Metaheuristic Approaches to Improve Deep Learning Systems for Anxiety Disorder Detection

Mohammadreza Amiri,Monireh Hosseini

Main category: cs.CV

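TL;DR: This study proposes a hybrid model that combines deep learning with swarm-intelligence optimization and uses multimodal wearable-sensor data for objective, automated detection of anxiety disorders.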

Details Motivation: Conventional anxiety diagnosis relies on subjective assessment, which is time-consuming, evaluator-dependent, and lacks consistency and efficiency. Method: Deep learning architectures are combined with swarm-intelligence techniques such as genetic algorithms and particle swarm optimization to refine features and tune hyperparameters over physiological, emotional, and behavioral signals, enabling fused analysis of multi-source sequential data. Result: The hybrid model is significantly more accurate than deep networks alone and generalizes better across individuals. Conclusion: Combining deep learning with metaheuristic optimization is a promising route to scalable, objective, and clinically meaningful automated assessment of anxiety disorders. Abstract: Despite being among the most common psychological disorders, anxiety-related conditions are still primarily identified through subjective assessments, such as clinical interviews and self-evaluation questionnaires. These conventional methods often require significant time and may vary depending on the evaluator. However, the emergence of advanced artificial intelligence techniques has created new opportunities for detecting anxiety in a more consistent and automated manner. To address the limitations of traditional approaches, this study introduces a comprehensive model that integrates deep learning architectures with optimization strategies inspired by swarm intelligence. Using multimodal and wearable-sensor datasets, the framework analyzes physiological, emotional, and behavioral signals. Swarm intelligence techniques including genetic algorithms and particle swarm optimization are incorporated to refine the feature space and optimize hyperparameters. Meanwhile, deep learning components are tasked with deriving layered and discriminative representations from sequential, multi-source inputs. Our evaluation shows that the fusion of these two computational paradigms significantly enhances detection performance compared with using deep networks alone. The hybrid model achieves notable improvements in accuracy and demonstrates stronger generalization across various individuals. Overall, the results highlight the potential of combining metaheuristic optimization with deep learning to develop scalable, objective, and clinically meaningful solutions for assessing anxiety disorders.

[298] VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction

Shaobo Wang,Tianle Niu,Runkang Yang,Deshan Liu,Xu He,Zichen Wen,Conghui He,Xuming Hu,Linfeng Zhang

Main category: cs.CV

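TL;DR: This paper proposes VideoCompressa, a framework that uses dynamic latent compression to address frame-level redundancy within each video sample during data synthesis, substantially improving training efficiency and data utilization for video understanding models.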

Details Motivation: Large-scale video datasets incur prohibitive storage and computation costs, and existing data synthesis methods handle temporal redundancy and complex spatiotemporal dynamics in video poorly, so a more efficient compression and synthesis method is needed. Method: VideoCompressa reframes video data synthesis as dynamic latent compression. It jointly optimizes a lightweight ConvNet keyframe selector (with Gumbel-Softmax sampling) and a pretrained, frozen VAE to extract the most informative frames and compress them into compact semantic latent codes, trained end to end by backpropagation. Result: On UCF101 with ConvNets, the method surpasses full-data training by 2.34 percentage points using only 0.13% of the original data, with a speedup of over 5800x compared with traditional synthesis methods. When fine-tuning Qwen2.5-7B-VL on HMDB51, it matches full-data performance with only 0.41% of the data and outperforms the zero-shot baseline by 10.61%. Conclusion: By targeting frame-level redundancy within samples, VideoCompressa achieves very high data efficiency and computational savings, offering a scalable and efficient alternative for data synthesis in video understanding. Abstract: The scalability of video understanding models is increasingly limited by the prohibitive storage and computational costs of large-scale video datasets. While data synthesis has improved data efficiency in the image domain, its extension to video remains challenging due to pervasive temporal redundancy and complex spatiotemporal dynamics. In this work, we uncover a critical insight: the primary source of inefficiency in video datasets is not inter-sample redundancy, but intra-sample frame-level redundancy. To leverage this insight, we introduce VideoCompressa, a novel framework for video data synthesis that reframes the problem as dynamic latent compression. Specifically, VideoCompressa jointly optimizes a differentiable keyframe selector (implemented as a lightweight ConvNet with Gumbel-Softmax sampling) to identify the most informative frames, and a pretrained, frozen Variational Autoencoder (VAE) to compress these frames into compact, semantically rich latent codes. These latent representations are then fed into a compression network, enabling end-to-end backpropagation. Crucially, the keyframe selector and synthetic latent codes are co-optimized to maximize retention of task-relevant information. Experiments show that our method achieves unprecedented data efficiency: on UCF101 with ConvNets, VideoCompressa surpasses full-data training by 2.34\% points using only 0.13\% of the original data, with over 5800x speedup compared to traditional synthesis methods. Moreover, when fine-tuning Qwen2.5-7B-VL on HMDB51, VideoCompressa matches full-data performance using just 0.41\% of the training data, outperforming the zero-shot baseline by 10.61\%.
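
The Gumbel-Softmax sampling mentioned for the keyframe selector relaxes a discrete "pick one frame" choice into a soft weight vector that gradients can flow through. The sketch below shows only the generic sampling step in plain Python. The per-frame scores, the temperature `tau`, and the function name are illustrative assumptions, and in the paper the scores would come from the ConvNet selector inside an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample over `logits` (generic technique)."""
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        noisy.append((x + g) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-frame informativeness scores for an 8-frame clip.
frame_scores = [0.1, 0.3, 2.0, 0.2, 0.1, 1.5, 0.0, 0.4]
weights = gumbel_softmax(frame_scores, tau=0.5, rng=random.Random(0))
selected = max(range(len(weights)), key=weights.__getitem__)
print(selected, round(sum(weights), 6))
```

Lower temperatures push the weights toward a hard one-hot selection, while higher temperatures keep the sample smoother.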

[299] FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories

Lei Ke,Hubery Yin,Gongye Liu,Zhengyao Lv,Jingcai Guo,Chen Li,Wenhan Luo,Yujiu Yang,Jing Lyu

Main category: cs.CV

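TL;DR: This paper proposes FlowSteer, which guides the student model along the teacher's authentic generation trajectories to improve the efficiency and quality of ReFlow-based generation, resolving a distribution mismatch and fixing a flaw in an existing scheduler.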

Details Motivation: Although ReFlow is theoretically consistent with flow matching, it underperforms other distillation methods in practice, so it needs improvement to reach its potential. Method: The paper proposes Online Trajectory Alignment (OTA) to resolve a distribution mismatch during training, introduces an adversarial distillation objective applied directly on the ODE trajectory, and fixes a flaw in FlowMatchEulerDiscreteScheduler. Result: Experiments on SD3 show that the method significantly improves ReFlow's sampling efficiency and generation quality. Conclusion: FlowSteer improves the ReFlow framework so that it performs better in few-step inference, making it practically useful. Abstract: With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among acceleration methods for flow models, ReFlow has been somewhat overlooked despite its theoretical consistency with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method that unlocks the potential of ReFlow-based distillation by guiding the student along the teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during training and propose Online Trajectory Alignment (OTA) to resolve it. Then, we introduce an adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely-used FlowMatchEulerDiscreteScheduler that largely degrades few-step inference quality. Our experimental results on SD3 demonstrate our method's efficacy.

[300] FVAR: Visual Autoregressive Modeling via Next Focus Prediction

Xiaofan Li,Chenming Wu,Yanpeng Sun,Jiaming Zhou,Delin Qu,Yansong Qu,Weihao Bo,Haibao Yu,Dingkang Liang

Main category: cs.CV

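TL;DR: FVAR proposes a new visual autoregressive generation paradigm that replaces "next-scale prediction" with "next-focus prediction". It builds an alias-free multi-scale pyramid with physics-consistent defocus kernels and adds high-frequency residual learning, significantly improving fine-detail generation while keeping deployment simple.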

Details Motivation: Conventional visual autoregressive models build multi-scale pyramids with uniform downsampling, which causes aliasing artifacts that damage fine detail and introduce jaggies and moiré patterns, degrading generation quality. Method: 1) A next-focus prediction paradigm that progressively reduces blur instead of simply downsampling; 2) a progressive refocusing pyramid built with physics-consistent defocus point spread function (PSF) kernels, eliminating aliasing at its source; 3) a high-frequency residual teacher network that learns structure and alias residual information and is distilled into a lightweight deployment network. Result: Experiments on ImageNet show that FVAR significantly reduces aliasing artifacts, improves detail preservation and text readability, and is fully compatible with existing VAR frameworks. Conclusion: By mimicking camera focusing, FVAR reworks multi-scale generation and outperforms conventional methods at removing aliasing and enhancing detail, offering a higher-quality, more physically grounded generation path for visual autoregressive models. Abstract: Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.
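
To make "defocus PSF kernels with decreasing radius" concrete, the sketch below blurs an image with normalized disk ("pillbox") kernels of shrinking radius, giving a blur-to-clarity sequence. The disk shape, the radius schedule `(3, 2, 1, 0)`, the edge handling, and the function names are assumptions for illustration. The abstract does not give the exact PSF or schedule, and any resampling to lower resolutions is not modelled here.

```python
import math


def disk_psf(radius):
    """Normalized disk kernel; radius 0 reduces to the identity kernel."""
    r = math.ceil(radius)
    size = 2 * r + 1
    kernel = [
        [1.0 if (x - r) ** 2 + (y - r) ** 2 <= radius ** 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def blur(image, kernel):
    """Filter a 2D list with edge replication; output keeps the input size."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, row in enumerate(kernel):
                for kx, weight in enumerate(row):
                    yy = min(max(y + ky - r, 0), h - 1)
                    xx = min(max(x + kx - r, 0), w - 1)
                    acc += weight * image[yy][xx]
            out[y][x] = acc
    return out


def refocusing_pyramid(image, radii=(3, 2, 1, 0)):
    """Blur-to-clarity sequence: the largest blur first, the sharp image last."""
    return [blur(image, disk_psf(r)) for r in radii]


# A single bright pixel spreads out under larger defocus radii.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 9.0
levels = refocusing_pyramid(image)
print([round(level[2][2], 3) for level in levels])
```

Because each level is a low-pass filtered view of the same image, successive levels differ only in how much high-frequency detail they contain, which is the property the next-focus paradigm relies on.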

[301] Enhancing Multi-Label Thoracic Disease Diagnosis with Deep Ensemble-Based Uncertainty Quantification

Yasiru Laksara,Uthayasanker Thayasivam

Main category: cs.CV

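TL;DR: Using a Deep Ensemble, this study achieves high-performance diagnosis of 14 common thoracic diseases on the NIH ChestX-ray14 dataset and integrates uncertainty quantification (UQ), markedly improving reliability and calibration so that the model can serve as a trustworthy clinical decision support system.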

Details Motivation: Deep learning models such as CheXNet lack reliable measures of predictive confidence, which limits their use in high-stakes clinical settings. Uncertainty quantification is therefore needed to make them more trustworthy and interpretable. Method: A 9-member Deep Ensemble replaces the unstable Monte Carlo Dropout for uncertainty quantification, with total uncertainty decomposed into aleatoric and epistemic components. Result: The model reaches a state-of-the-art average AUROC (0.8559) and F1 score (0.3857), with much better calibration (mean ECE 0.0728, NLL 0.1916) and a reliable uncertainty decomposition (mean EU 0.0240). Conclusion: The Deep Ensemble improves stability, calibration, and interpretability, turning the model from a pure prediction tool into a reliable clinical decision support system. Abstract: The utility of deep learning models, such as CheXNet, in high-stakes clinical settings is fundamentally constrained by their purely deterministic nature, failing to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high-performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development failed to stabilize performance and calibration using Monte Carlo Dropout (MCD), yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high-diversity, 9-member Deep Ensemble (DE). This resulting DE successfully stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (Mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
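
A common way to split an ensemble's total predictive uncertainty into aleatoric and epistemic parts uses entropies: total is the entropy of the averaged prediction, aleatoric is the average of the members' entropies, and epistemic is the difference between the two. The sketch below applies this to one binary label, as in multi-label chest X-ray classification. It is a standard decomposition offered as an assumption, since the abstract does not state which formula the authors use, and the member probabilities are made up.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) for one label of one image.

    member_probs: each ensemble member's predicted probability for the label.
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    epistemic = total - aleatoric  # disagreement between members
    return total, aleatoric, epistemic


# Hypothetical predictions from a 9-member ensemble.
agree = decompose_uncertainty([0.30] * 9)
disagree = decompose_uncertainty([0.05, 0.95, 0.10, 0.90, 0.05, 0.95, 0.10, 0.90, 0.50])
print(agree[2] < disagree[2])
```

When all members agree, the epistemic term is near zero and any remaining uncertainty is attributed to the data. Disagreement between members raises the epistemic term.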

[302] Personalized Federated Segmentation with Shared Feature Aggregation and Boundary-Focused Calibration

Ishmam Tashdeed,Md. Atiqur Rahman,Sabrina Islam,Md. Azam Hossain

Main category: cs.CV

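TL;DR: This paper proposes FedOAP, a new personalized federated learning method for organ-agnostic tumor segmentation that uses decoupled cross-attention and a boundary-aware loss to improve segmentation on non-IID data.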

Details Motivation: Most existing personalized federated learning methods overlook the potential of features shared across clients, especially when each client holds segmentation data for a different organ. Method: A decoupled cross-attention (DCA) mechanism lets each client retain local queries while attending to globally shared key-value pairs aggregated from all clients, and a perturbed boundary loss (PBL) improves the accuracy of segmentation boundaries. Result: Extensive experiments on tumor segmentation tasks across different organs show that FedOAP consistently outperforms existing state-of-the-art federated and personalized segmentation methods. Conclusion: FedOAP makes effective use of features shared across clients and improves segmentation consistency, performing strongly on tumor segmentation across multiple organs. Abstract: Personalized federated learning (PFL) possesses the unique capability of preserving data confidentiality among clients while tackling the data heterogeneity problem of non-independent and identically distributed (Non-IID) data. Its advantages have led to widespread adoption in domains such as medical image segmentation. However, the existing approaches mostly overlook the potential benefits of leveraging shared features across clients, where each client contains segmentation data of different organs. In this work, we introduce a novel personalized federated approach for organ agnostic tumor segmentation (FedOAP), that utilizes cross-attention to model long-range dependencies among the shared features of different clients and a boundary-aware loss to improve segmentation consistency. FedOAP employs a decoupled cross-attention (DCA), which enables each client to retain local queries while attending to globally shared key-value pairs aggregated from all clients, thereby capturing long-range inter-organ feature dependencies. Additionally, we introduce perturbed boundary loss (PBL) which focuses on the inconsistencies of the predicted mask's boundary for each client, forcing the model to localize the margins more precisely. We evaluate FedOAP on diverse tumor segmentation tasks spanning different organs. Extensive experiments demonstrate that FedOAP consistently outperforms existing state-of-the-art federated and personalized segmentation methods.

[303] Robust Long-term Test-Time Adaptation for 3D Human Pose Estimation through Motion Discretization

Yilin Wen,Kechuan Dong,Yusuke Sugano

Main category: cs.CV

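TL;DR: This paper proposes an online test-time adaptation method based on motion discretization and a soft-reset mechanism that mitigates error accumulation in 3D human pose estimation and improves long-term adaptation.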

Details Motivation: Online adaptation for 3D human pose estimation accumulates errors because it relies on imperfect self-supervised predictions, which degrades long-term performance; this paper aims to solve that problem. Method: Unsupervised clustering in the latent motion representation space yields anchor motions, and a soft-reset mechanism reverts the pose estimator to its exponential moving average, enabling efficient self-supervision and stable adaptation. Result: The method performs well on long-term online adaptation to continuous out-of-domain video streams and outperforms previous online test-time adaptation methods. Conclusion: Motion discretization and soft reset suppress error accumulation and make full use of an individual's personal shape and motion traits, significantly improving the robustness and accuracy of 3D human pose estimation. Abstract: Online test-time adaptation addresses the train-test domain gap by adapting the model on unlabeled streaming test inputs before making the final prediction. However, online adaptation for 3D human pose estimation suffers from error accumulation when relying on self-supervision with imperfect predictions, leading to degraded performance over time. To mitigate this fundamental challenge, we propose a novel solution that highlights the use of motion discretization. Specifically, we employ unsupervised clustering in the latent motion representation space to derive a set of anchor motions, whose regularity aids in supervising the human pose estimator and enables efficient self-replay. Additionally, we introduce an effective and efficient soft-reset mechanism by reverting the pose estimator to its exponential moving average during continuous adaptation. We examine long-term online adaptation by continuously adapting to out-of-domain streaming test videos of the same individual, which allows for the capture of consistent personal shape and motion traits throughout the streaming observation. By mitigating error accumulation, our solution enables robust exploitation of these personal traits for enhanced accuracy. Experiments demonstrate that our solution outperforms previous online test-time adaptation methods and validate our design choices.
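
The soft-reset idea (reverting the pose estimator to its exponential moving average during continuous adaptation) can be illustrated with a toy parameter vector. Everything concrete below is an assumption for illustration: the decay value, the fixed `reset_every` schedule, and the additive "update" standing in for a self-supervised gradient step are hypothetical, and the abstract does not say when resets are triggered.

```python
def ema_update(ema, weights, decay):
    """Exponential moving average of the model weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def adapt_stream(weights, updates, decay=0.9, reset_every=3):
    """Apply online updates, track an EMA, and periodically revert to it."""
    ema = list(weights)
    for step, delta in enumerate(updates, start=1):
        # Stand-in for one self-supervised adaptation step on a test batch.
        weights = [w + d for w, d in zip(weights, delta)]
        ema = ema_update(ema, weights, decay)
        if step % reset_every == 0:
            # Soft reset: fall back to the smoothed weights instead of the
            # original checkpoint, keeping part of what was adapted so far.
            weights = list(ema)
    return weights, ema


final_weights, final_ema = adapt_stream([0.0], [[1.0], [1.0], [1.0]])
print(final_weights, final_ema)
```

Unlike a hard reset to the source model, this keeps a smoothed version of the adaptation history, which limits drift without discarding person-specific information.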

[304] Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos

Sana Alamgeer

Main category: cs.CV

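TL;DR: Proposes a hybrid saliency model that predicts regions of interest (ROIs) in 360° videos to improve streaming efficiency and the viewing experience.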

Details Motivation: Accurately predicting regions of interest in 360° videos helps optimize viewport prediction and video cutting, reducing bandwidth use and improving the user experience. Method: Video frames are preprocessed, a hybrid saliency model is built and trained, and its outputs are post-processed to predict the region of interest for each frame. Result: The method is compared against the subjective annotations of the 360RAT dataset to validate its effectiveness. Conclusion: The proposed hybrid saliency model can identify regions of interest in 360° videos and has potential for efficient video streaming. Abstract: The main goal of the project is to design a new model that predicts regions of interest in 360$^{\circ}$ videos. The region of interest (ROI) plays an important role in 360$^{\circ}$ video streaming. For example, ROIs are used to predict view-ports and to cut videos intelligently for live streaming, so that less bandwidth is used. Detecting view-ports in advance helps reduce head movement while streaming and watching a video on a head-mounted device. Intelligent cuts, in turn, improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report covers the secondary task of identifying ROIs, for which we design, train, and test a hybrid saliency model. In this work, we use salient regions to represent the regions of interest. The method consists of the following steps: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. Then, we compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.

[305] Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling

Xiao Cui,Yulei Qin,Xinyue Li,Wengang Zhou,Hongsheng Li,Houqiang Li

Main category: cs.CV

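TL;DR: This paper proposes a new dataset distillation method for long-tailed distributions that addresses model bias from a statistical alignment perspective. It introduces three key components (enhanced expert models, recalibrated BN statistics, and multi-round initialization of synthetic images) and significantly improves performance on long-tailed benchmarks.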

Details Motivation: Existing dataset distillation methods perform poorly under long-tailed distributions because class imbalance biases model representations and distorts BN statistics estimates. Method: A statistical alignment strategy is adopted, with enhanced observer and teacher models, BN recalibration with dynamically adjusted momentum, and a multi-round selection mechanism that favors high-confidence, diverse augmentations. Result: On CIFAR-100-LT and Tiny-ImageNet-LT with IPC=10 and IF=10, top-1 accuracy improves by 15.6% and 11.8% respectively. Conclusion: The method mitigates model bias under long-tailed distributions, restores fairer supervision, and significantly outperforms existing methods on multiple long-tailed benchmarks. Abstract: Dataset distillation creates a small distilled set that enables efficient training by capturing key information from the full dataset. While existing dataset distillation methods perform well on balanced datasets, they struggle under long-tailed distributions, where imbalanced class frequencies induce biased model representations and corrupt statistical estimates such as Batch Normalization (BN) statistics. In this paper, we rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods, and instead adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. To this end, we introduce three dedicated components that enable unbiased recovery of distilled images and soft relabeling: (1) enhancing expert models (an observer model for recovery and a teacher model for relabeling) to enable reliable statistics estimation and soft-label generation; (2) recalibrating BN statistics via a full forward pass with dynamically adjusted momentum to reduce representation skew; (3) initializing synthetic images by incrementally selecting high-confidence and diverse augmentations via a multi-round mechanism that promotes coverage and diversity. Extensive experiments on four long-tailed benchmarks show consistent improvements over state-of-the-art methods across varying degrees of class imbalance. Notably, our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT under IPC=10 and IF=10.

[306] DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection

Yu Zhang,Haoan Ping,Yuchen Li,Zhenshan Bing,Fuchun Sun,Alois Knoll

Main category: cs.CV

TL;DR: 本文提出DualGazeNet,一种受生物视觉启发的纯Transformer框架,通过简化架构并模拟人类视觉系统的双通路处理机制,在显著性物体检测任务中实现了高效、准确且可解释的性能突破。

Details Motivation: Existing SOD methods have hit performance bottlenecks because their complex structures cause feature redundancy and interference between components, whereas the human visual system identifies salient objects efficiently with far less machinery. This motivates a simpler, biologically grounded model design. Method: DualGazeNet is a pure Transformer architecture that models the magnocellular and parvocellular dual-pathway processing of the human visual system, with the two pathways coordinated through cortical attention modulation, and without complex multi-stage pipelines or dedicated fusion modules. Result: It surpasses 25 state-of-the-art methods on five RGB SOD benchmarks, runs about 60% faster at inference with 53.4% fewer FLOPs on average than comparable Transformer models, and generalizes well to cross-domain tasks such as camouflaged and underwater SOD. Conclusion: By borrowing the dual-pathway mechanism of human vision, a minimal pure Transformer architecture can reach state-of-the-art saliency detection, showing the potential of biologically inspired design for efficiency, performance, and interpretability. Abstract: Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately reaching performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models the dual biological principles of robust representation learning and magnocellular-parvocellular dual-pathway processing with cortical attention modulation in the human visual system. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60\% higher inference speed and 53.4\% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.

[307] HunyuanVideo 1.5 Technical Report

Bing Wu,Chang Zou,Changlin Li,Duojun Huang,Fang Yang,Hao Tan,Jack Peng,Jianbing Wu,Jiangfeng Xiong,Jie Jiang,Linus,Patrol,Peizhen Zhang,Peng Chen,Penghao Zhao,Qi Tian,Songtao Liu,Weijie Kong,Weiyan Wang,Xiao He,Xin Li,Xinchi Deng,Xuefei Zhe,Yang Li,Yanxin Long,Yuanbo Peng,Yue Wu,Yuhong Liu,Zhenyu Wang,Zuozhuo Dai,Bo Peng,Coopers Li,Gu Gong,Guojian Xiao,Jiahe Tian,Jiaxin Lin,Jie Liu,Jihong Zhang,Jiesong Lian,Kaihang Pan,Lei Wang,Lin Niu,Mingtao Chen,Mingyang Chen,Mingzhe Zheng,Miles Yang,Qiangqiang Hu,Qi Yang,Qiuyong Xiao,Runzhou Wu,Ryan Xu,Rui Yuan,Shanshan Sang,Shisheng Huang,Siruis Gong,Shuo Huang,Weiting Guo,Xiang Yuan,Xiaojia Chen,Xiawei Hu,Wenzhi Sun,Xiele Wu,Xianshun Ren,Xiaoyan Yuan,Xiaoyue Mi,Yepeng Zhang,Yifu Sun,Yiting Lu,Yitong Li,You Huang,Yu Tang,Yixuan Li,Yuhang Deng,Yuan Zhou,Zhichao Hu,Zhiguang Liu,Zhihe Yang,Zilin Yang,Zhenzhi Lu,Zixiang Zhou,Zhao Zhong

Main category: cs.CV

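TL;DR: HunyuanVideo 1.5 is a lightweight open-source video generation model with only 8.3 billion parameters that uses an advanced DiT architecture and optimized training strategies to deliver high-quality text-to-video and image-to-video generation on consumer-grade GPUs.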

Details Motivation: To lower the barrier to using video generation technology, support the open-source community, and achieve high-quality, highly coherent video generation at a limited parameter scale. Method: The model combines carefully curated data, an improved DiT architecture with selective and sliding tile attention (SSTA), glyph-aware bilingual text encoding, progressive pre-training and post-training, and an efficient video super-resolution network into a unified framework for multiple durations and resolutions. Result: Experiments show better visual quality and motion coherence than existing open-source models, setting a new reference point among open-source video generation models. Conclusion: HunyuanVideo 1.5 reaches state-of-the-art performance at a small model size, and its released code and weights make video generation technology more accessible for use and research. Abstract: We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

[308] Neural Texture Splatting: Expressive 3D Gaussian Splatting for View Synthesis, Geometry, and Dynamic Reconstruction

Yiming Wang,Shaofei Wang,Marko Mihajlovic,Siyu Tang

Main category: cs.CV

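TL;DR: This paper proposes Neural Texture Splatting (NTS), which adds a global neural field to strengthen the local representational capacity of 3D Gaussian Splatting, significantly improving novel view synthesis, geometry reconstruction, and dynamic reconstruction.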

Details Motivation: Existing 3D Gaussian Splatting methods are limited by their use of 3D Gaussian kernels to model local variations, and existing enhancements are less effective with sparse inputs or in general reconstruction scenarios, so a more expressive and general approach is needed. Method: Neural Texture Splatting (NTS) uses a hybrid global neural field, made up of a tri-plane and a neural decoder, to predict local appearance and geometric fields for each primitive, enabling global information sharing and efficient modeling across primitives. Result: NTS achieves state-of-the-art results on multiple benchmark tasks, including novel view synthesis, geometry reconstruction, and dynamic reconstruction, with consistent gains under both sparse and dense input settings. Conclusion: By sharing a global neural representation, NTS increases the expressiveness of 3DGS while reducing model size, supports view- and time-dependent effects, and generalizes well. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a leading approach for high-quality novel view synthesis, with numerous variants extending its applicability to a broad spectrum of 3D and 4D scene reconstruction tasks. Despite its success, the representational capacity of 3DGS remains limited by the use of 3D Gaussian kernels to model local variations. Recent works have proposed to augment 3DGS with additional per-primitive capacity, such as per-splat textures, to enhance its expressiveness. However, these per-splat texture approaches primarily target dense novel view synthesis with a reduced number of Gaussian primitives, and their effectiveness tends to diminish when applied to more general reconstruction scenarios. In this paper, we aim to achieve concrete performance improvement over state-of-the-art 3DGS variants across a wide range of reconstruction tasks, including novel view synthesis, geometry and dynamic reconstruction, under both sparse and dense input settings. To this end, we introduce Neural Texture Splatting (NTS). At the core of our approach is a global neural field (represented as a hybrid of a tri-plane and a neural decoder) that predicts local appearance and geometric fields for each primitive. By leveraging this shared global representation that models local texture fields across primitives, we significantly reduce model size and facilitate efficient global information exchange, demonstrating strong generalization across tasks. Furthermore, our neural modeling of local texture fields introduces expressive view- and time-dependent effects, a critical aspect that existing methods fail to account for. Extensive experiments show that Neural Texture Splatting consistently improves models and achieves state-of-the-art results across multiple benchmarks.

[309] Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

Wengyi Zhan,Mingbao Lin,Zhihang Lin,Rongrong Ji

Main category: cs.CV

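TL;DR: This paper proposes ParVTS, a training-free visual token scheduling framework that processes visual tokens in parallel and discards the non-subject path mid-inference, substantially reducing the inference latency and computation of multimodal large language models while maintaining strong performance.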

Details Motivation: Multimodal large language models produce large numbers of visual tokens for high-resolution images, causing severe inference latency; naive pruning discards important contextual information and hurts accuracy. Method: Proposes the ParVTS framework, which partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. Result: Experiments on multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens, achieving a 1.77x speedup and a 70% reduction in FLOPs. Conclusion: ParVTS is an efficient, training-free visual token scheduling method that is compatible with diverse architectures and effectively balances inference efficiency and accuracy in multimodal models. Abstract: Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.

[310] Facade Segmentation for Solar Photovoltaic Suitability

Ayca Duran,Christoph Waibel,Bernd Bickel,Iro Armeni,Arno Schlueter

Main category: cs.CV

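TL;DR: This paper presents a machine-learning pipeline that identifies building-integrated photovoltaic (BIPV) installation potential on facades through fine-grained semantic segmentation and uses facade structure to generate practically feasible PV layouts; the results show that installable potential is far lower than theoretical potential, with important implications for urban energy planning.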

Details Motivation: Existing PV planning methods focus mostly on rooftops; automated analysis for building facades remains scarce and oversimplified, and cannot support city-scale BIPV deployment. Method: Fine-tunes SegFormer-B5 on the CMP Facades dataset for facade semantic segmentation; converts the segmentation results into PV suitability masks and panel layouts that account for module sizes and clearances, integrating architectural composition information to estimate solar potential. Result: Validated on a dataset of 373 facades from ten cities; the installable BIPV potential is significantly lower than the theoretical potential, underscoring the importance of fine-grained modeling. Conclusion: The pipeline effectively combines facade imagery with architectural detail to improve the accuracy of BIPV potential assessment, and it can be scaled to cities worldwide in support of urban decarbonization goals. Abstract: Building integrated photovoltaic (BIPV) facades represent a promising pathway towards urban decarbonization, especially where roof areas are insufficient and ground-mounted arrays are infeasible. Although machine learning-based approaches to support photovoltaic (PV) planning on rooftops are well researched, automated approaches for facades still remain scarce and oversimplified. This paper therefore presents a pipeline that integrates detailed information on the architectural composition of the facade to automatically identify suitable surfaces for PV application and estimate the solar energy potential. The pipeline fine-tunes SegFormer-B5 on the CMP Facades dataset and converts semantic predictions into facade-level PV suitability masks and PV panel layouts considering module sizes and clearances. Applied to a dataset of 373 facades with known dimensions from ten cities, the results show that installable BIPV potential is significantly lower than theoretical potential, thus providing valuable insights for reliable urban energy planning. With the growing availability of facade imagery, the proposed pipeline can be scaled to support BIPV planning in cities worldwide.
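
To make the gap between theoretical and installable potential concrete, the following is a minimal sketch of turning a suitability mask into a panel layout. It is not the authors' pipeline: the greedy row-major placement, the function name `place_panels`, and the cell-based units are all assumptions chosen for illustration.

```python
import numpy as np

def place_panels(suitable, panel_h, panel_w, gap):
    """Greedy row-major placement of panel_h x panel_w modules on a boolean mask.

    `suitable` marks facade cells where PV could be mounted; `gap` is the
    clearance (in cells) kept around every placed module. Hypothetical helper.
    """
    free = suitable.copy()
    panels = []
    rows, cols = free.shape
    for r in range(rows - panel_h + 1):
        for c in range(cols - panel_w + 1):
            if free[r:r + panel_h, c:c + panel_w].all():
                panels.append((r, c))
                # Block the module footprint plus its clearance margin.
                r0, c0 = max(r - gap, 0), max(c - gap, 0)
                free[r0:r + panel_h + gap, c0:c + panel_w + gap] = False
    return panels

mask = np.ones((4, 7), dtype=bool)  # toy facade: every cell nominally suitable
panels = place_panels(mask, panel_h=2, panel_w=3, gap=1)
installable_share = len(panels) * 2 * 3 / mask.sum()
```

Even on this fully suitable toy facade only two modules fit, covering 12 of 28 cells, which mirrors the abstract's point that module sizes and clearances push installable area below the raw suitable area.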

[311] MagicWorld: Interactive Geometry-driven Video World Exploration

Guangyuan Li,Siming Zheng,Shuolin Xu,Jinwei Chen,Bo Li,Xiaobin Hu,Lei Zhao,Peng-Tao Jiang

Main category: cs.CV

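TL;DR: MagicWorld is an interactive video world model that combines 3D geometric priors with historical retrieval, improving the structural consistency and stability of scene evolution across multi-step interactions.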

Details Motivation: Existing interactive video world models fail to fully exploit the correspondence between instruction-driven motion and 3D geometry, and they tend to forget historical information during multi-step interaction, leading to structural instability and semantic drift. Method: Proposes MagicWorld, which includes an Action-Guided 3D Geometry Module (AG3D) that builds point clouds to provide geometric constraints for viewpoint transitions, and a History Cache Retrieval (HCR) mechanism that retrieves historical frames and injects them as conditioning signals. Result: Experiments show that MagicWorld notably improves structural stability and generation continuity across multi-step interactions. Conclusion: Introducing 3D geometric priors and a historical retrieval mechanism effectively mitigates structural distortion and error accumulation during interaction, advancing more stable and controllable video world modeling. Abstract: Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.

[312] MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model

Qian Jiang,Qianqian Wang,Xin Jin,Michal Wozniak,Shaowen Yao,Wei Zhou

Main category: cs.CV

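TL;DR: This paper proposes MFmamba, a multi-function model for remote sensing image super-resolution (SR), spectral recovery, and joint SR and spectral recovery, which performs well in both evaluation metrics and visual quality using only a panchromatic (PAN) image as input.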

Details Motivation: Because of single-sensor limitations, one can obtain only PAN images with high spatial but low spectral resolution, or multispectral (MS) images with low spatial but high spectral resolution; existing methods cannot improve spatial and spectral resolution simultaneously, so an integrated approach is needed. Method: Designs MFmamba, a new model built on a UNet++ backbone, introducing a Mamba Upsample Block (MUB), a Dual Pool Attention (DPA) that replaces the skip connections, and a Multi-scale Hybrid Cross Block (MHCB) for initial feature extraction; SR, spectral recovery, and the joint task are realized through three different inputs. Result: Experiments show that MFmamba is competitive across evaluation metrics and visual results, and performs well on all three tasks with only a PAN image as input. Conclusion: MFmamba effectively integrates super-resolution and spectral recovery, producing high-quality, high-resolution color remote sensing images from a single PAN input and offering a new integrated solution for remote sensing image processing. Abstract: Remote sensing images are becoming increasingly widespread in military applications and earth resource exploration. Because of the limitations of a single sensor, we can obtain either high-spatial-resolution grayscale panchromatic (PAN) images or low-spatial-resolution color multispectral (MS) images. Therefore, an important issue is to obtain a color image with high spatial resolution when there is only a PAN image at the input. The existing methods improve spatial resolution using super-resolution (SR) technology and spectral recovery using colorization technology. However, the SR technique cannot improve the spectral resolution, and the colorization technique cannot improve the spatial resolution. Moreover, the pansharpening method needs two registered inputs and cannot achieve SR. As a result, an integrated approach is expected. To solve the above problems, we designed a novel multi-function model (MFmamba) to realize the tasks of SR, spectral recovery, and joint SR and spectral recovery through three different inputs. Firstly, MFmamba utilizes UNet++ as the backbone, and a Mamba Upsample Block (MUB) is combined with UNet++. Secondly, a Dual Pool Attention (DPA) is designed to replace the skip connection in UNet++. Finally, a Multi-scale Hybrid Cross Block (MHCB) is proposed for initial feature extraction. Extensive experiments show that MFmamba is competitive in evaluation metrics and visual results and performs well in the three tasks when only the input PAN image is used.

[313] MetaDCSeg: Robust Medical Image Segmentation via Meta Dynamic Center Weighting

Chenyu Mu,Guihai Chen,Xun Yang,Erkun Yang,Cheng Deng

Main category: cs.CV

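TL;DR: Proposes MetaDCSeg, a new framework that dynamically learns pixel-wise weights to suppress the influence of noisy labels and models boundary uncertainty through a Dynamic Center Distance mechanism, improving medical image segmentation performance.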

Details Motivation: Existing methods perform poorly on medical images with noisy annotations and ambiguous anatomical boundaries, leading to unstable training and marked performance degradation in boundary regions. Method: Proposes the MetaDCSeg framework, which introduces a Dynamic Center Distance (DCD) mechanism that uses weighted feature distances to foreground, background, and boundary centers and dynamically learns an optimal weight for each pixel, strengthening attention on hard-to-segment pixels. Result: Extensive experiments on four benchmark datasets with varying noise levels show that MetaDCSeg outperforms existing state-of-the-art methods under all conditions. Conclusion: MetaDCSeg effectively handles noisy annotations and ambiguous boundaries in medical images and significantly improves segmentation accuracy, showing stronger robustness especially in complex boundary regions. Abstract: Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.

[314] Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation

Ruiying Liu,Yuanzhi Liang,Haibin Huang,Tianshu Yu,Chi Zhang

Main category: cs.CV

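TL;DR: Proposes Bayesian Prior-Guided Optimization (BPGO), which models reward uncertainty to address the weakly discriminative training signals caused by ambiguous text-visual correspondence in visual generative models, improving semantic alignment, perceptual quality, and convergence speed.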

Details Motivation: In the existing GRPO framework, the many-to-many relationship between text and visual content makes reward signals ambiguous and weakly discriminative, so reliable feedback is underused and noise is hard to suppress. Method: Builds BPGO on top of GRPO by introducing a semantic prior anchor: inter-group Bayesian trust allocation and intra-group prior-anchored renormalization adaptively adjust how much the optimization trusts each signal. Result: On image and video generation tasks, BPGO shows stronger semantic alignment, higher perceptual quality, and faster convergence than GRPO and its variants. Conclusion: By explicitly modeling reward uncertainty and incorporating a semantic prior, BPGO effectively alleviates the degradation of training signals caused by many-to-many correspondence, providing a more robust optimization framework for post-training visual generative models. Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of text-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit to noisy feedback. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
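
As background for why ambiguous rewards are a problem, here is a small sketch of the group-relative advantage that GRPO is commonly described as using: rewards within a group are standardized by the group mean and standard deviation. The function name and the toy reward values are assumptions, and BPGO's own trust allocation and renormalization are not shown because the abstract does not specify their formulas.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (hypothetical helper)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group with clearly separated rewards ...
clear = grpo_advantages([0.0, 0.5, 1.0])
# ... and a group whose rewards differ only by what could be noise.
ambiguous = grpo_advantages([0.49, 0.50, 0.51])
```

Because standardization is scale-invariant, both groups yield nearly identical advantages, so a group with barely distinguishable rewards drives updates as strongly as a clearly ranked one. This is one way to read the abstract's claim that GRPO overfits noisy signals.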

[315] EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models

Wenhao Xu,Xin Dong,Yue Li,Haoyuan Shi,Zhiwei Xiong

Main category: cs.CV

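TL;DR: Proposes EventSTU, a training-free, event-guided video understanding framework that uses spatio-temporally adaptive token pruning to substantially cut computational cost while improving performance.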

Details Motivation: Video large language models face high inference costs on long videos because they must process large amounts of redundant spatio-temporal information. Inspired by event cameras, the work aims for efficient and accurate video understanding. Method: Designs a coarse-to-fine keyframe sampling algorithm to remove temporal redundancy, and uses the visual saliency of events as a prior for spatially adaptive token pruning; pruning budgets are allocated dynamically according to question relevance, and a new benchmark, EventBench, containing real event data is constructed. Result: Achieves a 3.01x FLOPs reduction and a 3.10x prefilling speedup over the strongest baseline while still improving performance, validating the method on both real and simulated event inputs. Conclusion: EventSTU offers video large language models an efficient, training-free approach to spatio-temporal understanding, advancing low-power, high-efficiency video understanding. Abstract: Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.

[316] BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li,Yige Li,Hanxun Huang,Yunhao Chen,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

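TL;DR: This paper introduces BackdoorVLM, the first benchmark for systematically evaluating backdoor attacks on vision-language models (VLMs), covering tasks such as image captioning and visual question answering and organizing multimodal backdoor threats into five categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Testing 12 attack methods across text, image, and bimodal triggers shows that VLMs are sensitive to textual instructions, that the text trigger dominates bimodal backdoors, and that poisoning rates as low as 1% yield attack success rates above 90%, exposing serious security vulnerabilities in current VLMs.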

Details Motivation: Although backdoor attacks have been widely studied in unimodal machine learning, their impact on multimodal foundation models, especially vision-language models (VLMs), has not been explored systematically. This paper aims to fill that gap by comprehensively evaluating VLM security across diverse attack scenarios. Method: Proposes the BackdoorVLM benchmark, which analyzes five representative categories of multimodal backdoor threats (targeted refusal, malicious injection, jailbreak, concept substitution, perceptual hijack) in a unified way, using 12 attack methods (spanning text, image, and bimodal triggers) evaluated on 2 open-source VLMs and 3 multimodal datasets. Result: Experiments show that VLMs are highly sensitive to textual instructions and that the text trigger dominates in bimodal backdoors; even at poisoning rates as low as 1%, attack success exceeds 90% on most tasks. Conclusion: Current vision-language models carry serious multimodal backdoor risks and are especially vulnerable to attacks through the text modality; BackdoorVLM provides an important benchmark for analyzing and defending against such threats. Abstract: Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce BackdoorVLM, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1% yielding over 90% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .

[317] One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control

Zhenxing Mi,Yuxin Wang,Dan Xu

Main category: cs.CV

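TL;DR: This paper presents One4D, a unified framework for 4D generation and reconstruction that produces synchronized RGB frames and pointmaps and handles inputs of varying sparsity through a Unified Masked Conditioning mechanism, switching seamlessly between single-image generation, video reconstruction, and mixed tasks.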

Details Motivation: Existing diffusion finetuning strategies tend to degrade when jointly generating RGB frames and pointmaps, making it hard to achieve both high-quality visual content and geometric consistency, so a unified framework is needed that handles multiple input modes while keeping the modalities consistent. Method: Adapts a powerful video generation model and designs a Unified Masked Conditioning (UMC) mechanism to accommodate inputs of varying sparsity; proposes Decoupled LoRA Control (DLC), which uses two modality-specific LoRA adapters to process RGB and pointmaps separately and learns pixel-level consistency through lightweight, zero-initialized control links. Result: Trained on synthetic and real 4D datasets, One4D achieves high-quality RGB frame generation and accurate pointmap reconstruction under modest computational budgets, supporting tasks ranging from single-image generation to full video reconstruction. Conclusion: One4D offers a viable path toward general, high-quality, geometry-aware 4D world modeling based on video diffusion models, advancing dynamic 4D content generation. Abstract: We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D

[318] AttenDence: Maximizing Attention Confidence for Test Time Adaptation

Yash Mali

Main category: cs.CV

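TL;DR: Proposes minimizing the entropy of the attention distribution from the CLS token to image patches as a new test-time adaptation (TTA) objective, improving model robustness under distribution shift.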

Details Motivation: Existing TTA methods rely mainly on output entropy minimization, whereas the attention mechanism of transformers provides an additional unsupervised learning signal that can be exploited. Method: At test time, minimizes the entropy of the attention distribution between the CLS token and image patches so that the model attends more confidently to relevant image regions. Result: The method improves robustness across diverse corruption types and does not hurt performance on clean data when adapting on a single-sample stream of test images. Conclusion: Attention entropy minimization is an effective new TTA strategy that takes full advantage of the transformer attention mechanism. Abstract: Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective. This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types while not hurting performance on clean data on a single sample stream of images at test time.
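
The objective described above can be illustrated with a short sketch that computes the entropy of CLS-to-patch attention. Several details are assumptions not confirmed by the abstract: that token 0 is the CLS token, that the CLS-to-CLS weight is dropped and the rest renormalized, and that entropies are averaged over heads. NumPy only evaluates the quantity here; an actual TTA loop would compute it in an autodiff framework and backpropagate.

```python
import numpy as np

def cls_attention_entropy(attn, eps=1e-12):
    """Mean entropy of CLS-to-patch attention (hypothetical helper).

    attn: array of shape (heads, tokens, tokens) whose rows sum to 1,
          with token 0 assumed to be the CLS token.
    """
    cls_to_patches = attn[:, 0, 1:]  # drop the CLS->CLS weight
    p = cls_to_patches / cls_to_patches.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + eps)).sum(axis=-1)  # one value per head
    return float(entropy.mean())

# Toy attention maps: 2 heads, 1 CLS token + 4 patches.
uniform = np.full((2, 5, 5), 0.2)
peaked = np.tile(np.array([0.0, 0.97, 0.01, 0.01, 0.01]), (2, 5, 1))

h_uniform = cls_attention_entropy(uniform)
h_peaked = cls_attention_entropy(peaked)
```

The uniform map has the maximum possible entropy over four patches (log 4), while the peaked map scores much lower; minimizing this value pushes the model toward the confident, peaked case.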

[319] FineXtrol: Controllable Motion Generation via Fine-Grained Text

Keming Shen,Bizhu Wu,Junliang Chen,Xiaoqin Wang,Linlin Shen

Main category: cs.CV

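TL;DR: Proposes FineXtrol, a new framework for efficient, fine-grained text-driven motion generation that uses temporally aware textual control signals to improve the precision and flexibility of motion control.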

Details Motivation: Existing methods that use large language models to generate text descriptions suffer from misaligned details and a lack of temporal cues, while using global 3D coordinates as control signals is computationally expensive and complex to convert. A more efficient, precise, and user-friendly form of control is therefore needed. Method: Proposes the FineXtrol framework, which introduces fine-grained textual control signals describing how body parts move over time, and designs a hierarchical contrastive learning module that helps the text encoder discriminate between these control signals, improving the controllability of motion generation. Result: Quantitative results show that FineXtrol performs strongly in controllable motion generation, and qualitative analysis shows that it can flexibly direct the movement of specific body parts. Conclusion: Through fine-grained, temporally aware textual control, FineXtrol achieves efficient and precise motion generation, significantly improving the controllability and practicality of text-driven motion generation. Abstract: Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.

Zijian Song,Xiaoxin Lin,Tao Pu,Zhenlong Yuan,Guangrun Wang,Liang Lin

Main category: cs.CV

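TL;DR: This paper formulates the problem of Human-centric Open-future Task Discovery (HOTD), which asks large multimodal models to identify tasks that reduce human effort across multiple plausible futures, builds the HOTD-Bench evaluation benchmark with over 2,000 real-world videos, and proposes the CMAST multi-agent framework, which significantly outperforms existing models in complex reasoning and open-future scenarios.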

Details Motivation: Existing large models struggle to proactively discover tasks that genuinely help humans when human intentions are dynamic and concurrent, and open-future scenarios in particular lack systematic study. Method: Defines the HOTD problem and the HOTD-Bench benchmark, comprising large-scale real-world videos, a semi-automated annotation pipeline, and a simulation-based open-future evaluation protocol; designs the CMAST framework, which decomposes complex reasoning using multi-agent collaboration and a scalable search tree. Result: CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs, and integrates well with existing models to consistently improve performance. Conclusion: CMAST provides an effective framework for human-centric open-future task discovery, advancing the ability of large models to offer proactive assistance in dynamic real-world scenarios. Abstract: Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.

[321] VeCoR - Velocity Contrastive Regularization for Flow Matching

Zong-Wei Hong,Jing-lun Li,Lin-Ze Li,Shen Zhang,Yao Tang

Main category: cs.CV

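TL;DR: Proposes Velocity Contrastive Regularization (VeCoR), a new method that adds two-sided contrastive supervision to improve the stability and generation quality of flow matching models, substantially improving FID on ImageNet and COCO tasks, especially in low-step and lightweight settings.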

Details Motivation: Standard Flow Matching (FM) only encourages the velocity field to follow the target direction, but errors can accumulate along the trajectory and drive samples off the data manifold, degrading perceptual quality, especially in lightweight or low-step configurations. A more stable training mechanism is therefore needed. Method: Proposes VeCoR, which extends the FM framework with a balanced attract-repel mechanism: positive supervision aligns the predicted velocity with a reference direction, while negative supervision pushes it away from inconsistent, off-manifold directions, forming a contrastive regularizer. Result: On ImageNet 256×256, VeCoR achieves relative FID reductions of 22% and 35% on SiT-XL/2 and REPA-SiT-XL/2, respectively; on MS-COCO text-to-image generation it achieves a 32% relative FID improvement, and it shows stronger stability and convergence with few sampling steps and lightweight models. Conclusion: With two-sided contrastive regularization, VeCoR improves on the one-sided attraction of conventional FM, effectively raising generation quality and training stability, and is particularly well suited to resource-constrained settings. Abstract: Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations. To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." Formally, we propose Velocity Contrastive Regularization (VeCoR), a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones. On ImageNet-1K 256×256, VeCoR yields 22% and 35% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/

[322] Leveraging Adversarial Learning for Pathological Fidelity in Virtual Staining

José Teixeira,Pascal Klöckner,Diana Montezuma,Melis Erdal Cesur,João Fraga,Hugo M. Horlings,Jaime S. Cardoso,Sara P. Oliveira

Main category: cs.CV

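TL;DR: This paper proposes CSSP2P GAN, a new virtual staining method whose improved pathological fidelity is validated through blind evaluation by expert pathologists; it also studies the critical role of the adversarial loss in virtual staining quality and points out the limitations of existing evaluation metrics.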

Details Motivation: Immunohistochemistry is costly and time-consuming, and virtual staining is a promising alternative, but current model training and evaluation practices fall short, so more reliable methods are needed. Method: Proposes the CSSP2P GAN model, based on conditional generative adversarial networks and optimized with an adversarial loss, and compares it against existing methods through blind evaluation by expert pathologists. Result: CSSP2P GAN achieves better pathological fidelity; the study shows that the adversarial loss is crucial to staining quality, and that existing metrics such as SSIM and PSNR are insufficient to assess virtual staining accurately. Conclusion: CSSP2P GAN effectively improves the pathological fidelity of virtual staining, highlights the importance of the adversarial loss, and calls for more reliable evaluation practices to advance the field. Abstract: In addition to evaluating tumor morphology using H&E staining, immunohistochemistry is used to assess the presence of specific proteins within the tissue. However, this is a costly and labor-intensive technique, for which virtual staining, as an image-to-image translation task, offers a promising alternative. This is a recent and rapidly emerging field of research, with 64% of published studies appearing in 2024 alone. Most studies use publicly available datasets of H&E-IHC pairs from consecutive tissue sections. Recognizing the training challenges, many authors develop complex virtual staining models based on conditional Generative Adversarial Networks, but ignore the impact of adversarial loss on the quality of virtual staining. Furthermore, overlooking the issues of model evaluation, they claim improved performance based on metrics such as SSIM and PSNR, which are not sufficiently robust to evaluate the quality of virtually stained images. In this paper, we developed CSSP2P GAN, which we demonstrate to achieve heightened pathological fidelity through a blind pathological expert evaluation. Furthermore, while iteratively developing our model, we study the impact of the adversarial loss and demonstrate its crucial role in the quality of virtually stained images. Finally, while comparing our model with reference works in the field, we underscore the limitations of the currently used evaluation metrics and demonstrate the superior performance of CSSP2P GAN.

[323] Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

Jianhao Zeng,Yancheng Bai,Ruidong Chen,Xuanpu Zhang,Lei Sun,Dongyang Jin,Ryan Xu,Nannan Zhang,Dan Song,Xiangxiang Chu

Main category: cs.CV

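TL;DR: This paper introduces a high-resolution dataset for video virtual try-on that contains real try-on videos in both full-shot and close-up views, together with a new metric, VGID, that measures the consistency of garment texture and structure, improving the realism and detail fidelity of virtual try-on.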

Details Motivation: Existing virtual try-on techniques rely on a single garment image as input and generate only full-shot videos, so they struggle to capture realistic texture details and to meet the commercial demand for close-ups, which limits practical adoption. Method: Builds a dataset containing high-fidelity garment images (including close-ups) and textual descriptions, and provides full-shot and close-up try-on videos of real human models; proposes VGID, a new video garment consistency metric that quantifies how well texture and structure are preserved. Result: Experiments show that using the detailed images in the dataset significantly improves the texture detail produced by existing video generation models; a benchmark of recent models reveals the shortcomings of current methods in preserving texture and structure. Conclusion: The proposed dataset and VGID metric advance video virtual try-on in detail fidelity and multi-shot generation, providing an important resource and evaluation standard for future research. Abstract: Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions. Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.

[324] CataractCompDetect: Intraoperative Complication Detection in Cataract Surgery

Bhuvan Sachdeva,Sneha Kumari,Rudransh Agarwal,Shalaka Kumaraswamy,Niharika Singri Prasad,Simon Mueller,Raphael Lechtenboehmer,Maximilian W. M. Wintergerst,Thomas Schultz,Kaushik Murali,Mohit Jain

Main category: cs.CV

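TL;DR: Proposes the CataractCompDetect framework, which combines phase-aware localization, SAM 2-based tracking, risk scoring, and vision-language reasoning to detect complications in cataract surgery, achieving good results on the newly built CataComp dataset.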

Details Motivation: Intraoperative complications in cataract surgery, such as iris prolapse, posterior capsule rupture, and vitreous loss, still lead to adverse outcomes; automated detection could enable early warning and objective training feedback. Method: Proposes the CataractCompDetect framework, which combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for classification, and builds CataComp, the first video dataset annotated for intraoperative complications, for validation. Result: On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with 81.8% for iris prolapse, 60.87% for posterior capsule rupture, and 69.23% for vitreous loss. Conclusion: Combining structured surgical priors with vision-language reasoning helps recognize rare but high-impact intraoperative events, and the approach has potential for clinical use. Abstract: Cataract surgery is one of the most commonly performed surgeries worldwide, yet intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss remain major causes of adverse outcomes. Automated detection of such events could enable early warning systems and objective training feedback. In this work, we propose CataractCompDetect, a complication detection framework that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for final classification. To validate CataractCompDetect, we curate CataComp, the first cataract surgery video dataset annotated for intraoperative complications, comprising 53 surgeries, including 23 with clinical complications. On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with per-complication performance of 81.8% (Iris Prolapse), 60.87% (PCR), and 69.23% (Vitreous Loss). These results highlight the value of combining structured surgical priors with vision-language reasoning for recognizing rare but high-impact intraoperative events. Our dataset and code will be publicly released upon acceptance.
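
As a quick consistency check, the reported average appears to be the unweighted (macro) mean of the three per-complication F1 scores. The snippet below reproduces it from the figures in the abstract; the dictionary keys are illustrative names, and the macro-averaging interpretation is an inference from the numbers rather than something the abstract states.

```python
# Per-complication F1 scores (%) as reported in the abstract.
per_class_f1 = {"iris_prolapse": 81.8, "pcr": 60.87, "vitreous_loss": 69.23}

# Unweighted mean across the three complication types.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))  # 70.63
```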

[325] Peregrine: One-Shot Fine-Tuning for FHE Inference of General Deep CNNs

Huaming Ling,Ying Wang,Si Chen,Junfeng Fan

Main category: cs.CV

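TL;DR: Proposes a single-stage fine-tuning strategy and a generalized interleaved packing scheme that enable efficient, end-to-end fully homomorphic encryption (FHE) inference across diverse CNN architectures, including a first demonstration on YOLO object detection.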

Details Motivation: Addresses two major challenges in fully homomorphic encryption (FHE) inference for deep CNNs: approximating non-linear activation functions and the limits of ciphertext capacity. Method: Proposes a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into low-degree polynomial form, and designs a generalized interleaved packing (GIP) scheme with accompanying homomorphic operators that supports FHE computation on feature maps of arbitrary resolution. Result: Achieves accuracy comparable to ReLU/SiLU baselines on CIFAR-10, ImageNet, and MS COCO, and demonstrates FHE inference for YOLO architectures with low-degree polynomial activations for the first time. Conclusion: The proposed methods allow diverse CNN architectures to run efficient end-to-end inference under FHE, advancing the practical use of privacy-preserving machine learning. Abstract: We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing in FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection leveraging low-degree polynomial activations.
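
To illustrate why replacing ReLU with a low-degree polynomial costs accuracy, the sketch below fits a degree-2 polynomial to ReLU by ordinary least squares on the interval [-1, 1]. This is a swapped-in illustration, not the paper's SFT strategy, which obtains its polynomial networks through fine-tuning; the interval and degree are assumptions.

```python
import numpy as np

# Sample ReLU on a symmetric interval (assumed input range).
x = np.linspace(-1.0, 1.0, 2001)
relu = np.maximum(x, 0.0)

# Least-squares fit of a degree-2 polynomial; coefficients are highest power first.
coeffs = np.polyfit(x, relu, deg=2)
poly = np.polyval(coeffs, x)

# Worst-case gap between the polynomial and ReLU on the sampled interval.
max_err = float(np.abs(poly - relu).max())
```

The fit recovers a linear coefficient of about 0.5 (the odd part of ReLU) plus a quadratic approximation of |x|/2, and leaves a visible residual near the kink at zero. That residual is the kind of approximation error fine-tuning has to absorb.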

[326] Zero-shot segmentation of skin tumors in whole-slide images with vision-language foundation models

Santiago Moreno,Pablo Meseguer,Rocío del Amor,Valery Naranjo

Main category: cs.CV

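TL;DR: Proposes ZEUS, a zero-shot vision-language segmentation framework for automatically segmenting skin tumors in whole-slide images, producing high-resolution tumor masks without any training.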

Details Motivation: Accurately annotating skin tumor biopsy slides is challenging because of the tumors' morphological variability, overlapping histological patterns, and subtle differences between benign and malignant lesions; existing methods fall short on fine-grained segmentation. Method: Splits each whole-slide image into patches, extracts visual features with a frozen vision-language model encoder, and computes cosine similarity against class-specific text prompt ensembles to produce the final segmentation mask. Result: Shows competitive performance on two in-house datasets (primary spindle cell neoplasms and cutaneous metastases) and examines the influence of prompt design, domain shift, and institutional variability. Conclusion: ZEUS markedly reduces annotation burden and provides scalable, explainable tumor delineation for downstream diagnostic workflows. Abstract: Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
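
The patch-versus-prompt scoring step described above can be sketched as follows. This is a minimal illustration, not the ZEUS implementation: the embeddings are toy vectors standing in for frozen VLM encoder outputs, averaging the prompt ensemble into one prototype per class is a common CLIP-style choice that the abstract does not confirm, and the merging of overlapping patches into a slide-level mask is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_patch_labels(patch_emb, prompt_emb_per_class):
    """patch_emb: (N, D) patch embeddings.
    prompt_emb_per_class: list of (P_c, D) arrays, one prompt ensemble per class.
    Returns per-patch class indices and the (N, C) cosine-similarity matrix."""
    patches = l2_normalize(patch_emb)
    # Assumed aggregation: mean of normalized prompt embeddings, renormalized.
    protos = np.stack(
        [l2_normalize(l2_normalize(p).mean(axis=0)) for p in prompt_emb_per_class]
    )
    sims = patches @ protos.T
    return sims.argmax(axis=1), sims

# Toy data: two patches, two classes, two prompts per class (hypothetical values).
patches = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
prompts = [
    np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),  # class 0 prompt ensemble
    np.array([[0.0, 1.0, 0.0], [0.1, 1.0, 0.0]]),  # class 1 prompt ensemble
]
labels, sims = zero_shot_patch_labels(patches, prompts)
print(labels)
```

In a full pipeline the per-patch labels (or similarity scores) would be written back to each patch's location on the slide grid to form the mask.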

[327] UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection

Ching-Yi Lai,Chih-Yu Jian,Pei-Cheng Chuang,Chia-Ming Lee,Chih-Chung Hsu,Chiou-Ting Hsu,Chia-Wen Lin

Main category: cs.CV

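TL;DR: Proposes UMCL, a deepfake detection framework based on unimodal-generated multimodal contrastive learning, which derives several complementary features from a single visual modality and combines affinity-driven semantic alignment with cross-quality similarity learning for robust detection across compression rates.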

Details Motivation: Existing deepfake detection methods suffer from feature degradation or modality inconsistency when facing the varying levels of compression found on social media, making it hard to be both robust and practical. Method: Proposes the UMCL framework, which during training transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models; the features are explicitly aligned with an affinity-driven semantic alignment (ASA) strategy, and cross-quality similarity learning (CQSL) strengthens feature robustness across compression rates. Result: Experiments show the method outperforms existing methods across compression rates and manipulation types, with strong cross-compression-rate detection, and keeps high accuracy even when an individual feature degrades. Conclusion: UMCL offers an efficient, robust, and interpretable paradigm for deepfake detection, particularly suited to the complex compression conditions of real social platforms. Abstract: In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection.
Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.

[328] Rethinking Plant Disease Diagnosis: Bridging the Academic-Practical Gap with Vision Transformers and Zero-Shot Learning

Wassim Benabbas,Mohammed Brahimi,Samir Akhrouf,Bilal Fortas

Main category: cs.CV

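TL;DR: This study examines whether attention-based architectures and zero-shot learning can bridge the gap between laboratory data and real field conditions in plant disease classification, and finds that CLIP models can classify diseases from natural language descriptions without task-specific training, offering good adaptability and interpretability.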

Details Motivation: Models from existing research based on the PlantVillage dataset perform well under ideal conditions but generalize poorly to real field images, which limits practical use, so methods that adapt better to domain shift are needed. Method: Evaluates three model categories, convolutional neural networks (CNNs), Vision Transformers, and CLIP-based zero-shot models, comparing plant disease classification performance under domain shift from controlled settings to complex field scenes. Result: CNNs show limited robustness under domain shift; Vision Transformers generalize better by capturing global contextual features; CLIP models classify effectively without task-specific training, showing strong adaptability and interpretability. Conclusion: Zero-shot learning, and CLIP models in particular, offers a practical and scalable domain adaptation strategy for plant health diagnosis and helps bring deep learning to diverse real agricultural environments. Abstract: Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.

[329] View-Consistent Diffusion Representations for 3D-Consistent Video Generation

Duolikun Danier,Ge Gao,Steven McDonagh,Changjian Li,Hakan Bilen,Oisin Mac Aodha

Main category: cs.CV

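TL;DR: Proposes ViCoDR, a method that improves 3D consistency in video generation by learning multi-view consistent diffusion representations, with significant gains on camera-controlled image-to-video, text-to-video, and multi-view generation tasks.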

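Details Motivation: Current video generation models produce visual artifacts caused by 3D inconsistency, for example objects deforming as the camera pose changes, which harms user experience and simulation fidelity. The authors aim to mitigate this by improving the consistency of multi-view representations in diffusion models. Method: Building on research into representation alignment for diffusion models, proposes ViCoDR, which improves the 3D consistency of generated videos by strengthening the consistency of diffusion features across views, with analysis and validation on several camera-controlled video generation models. Result: Across camera-controlled video generation tasks (image-to-video, text-to-video, multi-view generation), ViCoDR significantly improves the 3D consistency of generated videos; the analysis shows that representation consistency correlates strongly with the 3D consistency of the output. Conclusion: Improving the multi-view consistency of intermediate diffusion representations is an effective route to more 3D-realistic generated video, and ViCoDR points to a new direction for building geometrically consistent video generation models. Abstract: Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera-controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.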

[330] AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

Christos Koutlis,Symeon Papadopoulos

Main category: cs.CV

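TL;DR: Proposes a new method for deepfake temporal localization based on audio-visual speech representation reconstruction (AuViRe), which detects forged segments through cross-modal reconstruction discrepancies and significantly outperforms prior work on several datasets.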

Details Motivation: As synthetic audio-visual content advances rapidly, the risk of maliciously manipulated media grows, and effective methods are urgently needed to ensure the authenticity and integrity of digital media. Method: Uses Audio-Visual Speech Representation Reconstruction (AuViRe): speech representations of one modality (e.g., lip movements) are reconstructed from the other (e.g., audio waveform); because cross-modal reconstruction is harder in forged regions, the discrepancies are amplified and support precise temporal forgery localization. Result: Improves AP@0.95 by 8.9 on LAV-DF, AP@0.5 by 9.6 on AV-Deepfake1M, and AUC by 5.1 in an in-the-wild experiment, significantly outperforming existing methods. Conclusion: AuViRe strengthens deepfake detection through cross-modal reconstruction error, achieves precise temporal forgery localization, and has strong potential for practical use. Abstract: With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
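
The idea of using reconstruction discrepancy as a localization cue can be illustrated with a small sketch. It assumes per-frame feature vectors are already available for the reconstructed and the actual representation, and it replaces whatever learned localization head the authors use with a plain threshold; the function names, the L2 distance, and the threshold value are all hypothetical.

```python
import numpy as np

def frame_discrepancy(reconstructed, target):
    """Per-frame L2 distance between reconstructed and actual features: (T, D) -> (T,)."""
    return np.linalg.norm(reconstructed - target, axis=1)

def segments_above(scores, threshold):
    """Return [start, end) frame-index pairs for contiguous runs with score > threshold."""
    flagged = scores > threshold
    segments, start = [], None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and start is None:
            start = i                      # a suspicious run begins
        elif not is_flagged and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:                  # run extends to the last frame
        segments.append((start, len(flagged)))
    return segments

# Toy example: reconstruction is perfect except on frames 2 and 3.
target = np.zeros((6, 2))
reconstructed = target.copy()
reconstructed[2:4] = 1.0
scores = frame_discrepancy(reconstructed, target)
segments = segments_above(scores, threshold=0.5)
print(segments)
```

Frame indices convert to timestamps by dividing by the feature frame rate.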

[331] A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

Wentao Qu,Guofeng Mei,Yang Wu,Yongshun Gong,Xiaoshui Huang,Liang Xiao

Main category: cs.CV

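TL;DR: Proposes T2LDM, a text-to-LiDAR diffusion model with Self-Conditioned Representation Guidance (SCRG) to improve the quality and controllability of 3D scene generation, and builds the T2nuScenes benchmark with an associated metric.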

Details Motivation: Because paired text-LiDAR data is scarce and text descriptions are of low quality, existing methods generate overly smooth 3D scenes with poor controllability. Method: Proposes the T2LDM model, introducing the SCRG mechanism, which provides soft supervision based on real representations during training and is decoupled at inference; designs a directional position prior to mitigate street distortion; learns a conditional encoder on top of the frozen denoising network to support several conditional generation tasks. Result: T2LDM outperforms existing methods on both unconditional and conditional generation, markedly improving detail, scene fidelity, and controllability; the authors build the T2nuScenes benchmark and a controllability metric and analyze the effect of different text prompts. Conclusion: Through SCRG and structural refinements, T2LDM improves the detail and controllability of text-to-LiDAR generation, supports multi-task conditional generation, and advances customized 3D data generation. Abstract: Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides soft supervision with reconstruction details for the Denoising Network (DN) in training, while being decoupled at inference. In this way, T2LDM can perceive rich geometric structures from the data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts on LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via the frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.

[332] Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Qiyang Yu,Yu Fang,Tianrui Li,Xuemei Cao,Yan Chen,Jianghao Li,Fan Min

Main category: cs.CV

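TL;DR: Proposes the Granularity-driven Vision Transformer (Grc-ViT), which adapts visual granularity to image complexity; a coarse-granularity evaluation module and a fine-grained refinement module dynamically adjust patch and window sizes, improving fine-grained recognition and computational efficiency.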

Details Motivation: Vision Transformers model global dependencies well but struggle to capture fine-grained local detail efficiently; existing multi-scale methods rely on fixed patch sizes, add redundant computation, and cannot adapt to differing image complexity. Method: Designs the Grc-ViT framework with two stages: (1) a coarse granularity evaluation module that assesses image complexity from edge density, entropy, and frequency-domain features and dynamically chooses suitable patch and window sizes; (2) a fine-grained refinement module that refines attention computation for the selected granularity; two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Result: Experiments show that Grc-ViT improves fine-grained classification on multiple benchmarks and achieves a better trade-off between accuracy and computational efficiency, clearly outperforming fixed-granularity or static multi-scale methods. Conclusion: By dynamically adjusting visual granularity, Grc-ViT strengthens ViT modeling of local detail and offers a new approach to efficient, adaptive visual representation learning. Abstract: Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
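
A rough sketch of how a complexity score might drive patch-size selection is given below. It is not the Grc-ViT module: the abstract names edge density, entropy, and frequency-domain cues but gives no formulas, so the gradient threshold, the equal weighting, the candidate sizes, and the omission of the frequency-domain cue are all assumptions made for illustration.

```python
import numpy as np

def edge_density(img, thresh=0.1):
    """Fraction of pixels whose gradient magnitude exceeds an assumed threshold."""
    gy, gx = np.gradient(img.astype(float))
    return float((np.hypot(gx, gy) > thresh).mean())

def histogram_entropy(img, bins=32):
    """Shannon entropy (bits) of the intensity histogram; img values in [0, 1]."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def choose_patch_size(img, sizes=(32, 16, 8), bins=32):
    """Map a complexity score in [0, 1] to a patch size: simple image -> large patches."""
    score = 0.5 * edge_density(img) + 0.5 * histogram_entropy(img, bins) / np.log2(bins)
    idx = min(int(score * len(sizes)), len(sizes) - 1)
    return sizes[idx], score

flat = np.full((64, 64), 0.5)                       # uniform gray image
noisy = np.random.default_rng(0).random((64, 64))   # high-frequency texture
flat_size, flat_score = choose_patch_size(flat)
noisy_size, noisy_score = choose_patch_size(noisy)
print(flat_size, noisy_size)
```

The design intent is that smooth images are tokenized coarsely to save computation, while textured images get smaller patches so local detail is preserved.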

[333] Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling

Long Tang,Guoquan Zhen,Jie Hao,Jianbo Zhang,Huiyu Duan,Liang Yuan,Guangtao Zhai

Main category: cs.CV

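TL;DR: Proposes Life-IQA, a new blind image quality assessment framework that improves quality prediction through GCN-enhanced layer interaction and MoE-based feature decoupling, achieving state-of-the-art results on multiple benchmarks.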

Details Motivation: Existing BIQA methods usually ignore the differing contributions of shallow and deep features to quality prediction and have not explored efficient quality decoding architectures. Method: Proposes a GCN-enhanced layer interaction module that fuses deepest-layer and penultimate-layer features with cross-attention, and a MoE-based feature decoupling module in which specialized experts handle different distortion types or quality dimensions. Result: Experiments show that Life-IQA strikes a better balance between accuracy and computational cost than a vanilla Transformer decoder and reaches state-of-the-art performance on multiple BIQA benchmarks. Conclusion: Through effective feature interaction and decoupling, Life-IQA clearly improves blind image quality assessment and offers new ideas for designing quality decoding architectures. Abstract: Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking their unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes an effective quality feature decoding framework via GCN-enhanced layer interaction and MoE-based feature decoupling, termed Life-IQA. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key and value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations through different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows a more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code is available at: https://github.com/TANGLONG2/Life-IQA/tree/main.
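
The query/key/value arrangement described above is standard scaled dot-product cross-attention, sketched below. This is a generic single-head version under simplifying assumptions: the learned query, key, and value projections and the GCN enhancement are omitted, and the random arrays merely stand in for deepest-layer and penultimate-layer tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """query_feats: (Nq, D) tokens acting as queries (here: deepest layer).
    context_feats: (Nk, D) tokens acting as keys and values (here: penultimate layer).
    Returns the attended features (Nq, D) and the attention weights (Nq, Nk)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context_feats, weights

rng = np.random.default_rng(0)
deepest = rng.normal(size=(4, 8))       # stand-in for deepest-layer tokens
penultimate = rng.normal(size=(6, 8))   # stand-in for penultimate-layer tokens
fused, attn = cross_attention(deepest, penultimate)
print(fused.shape, attn.shape)
```

Each output row is a convex combination of penultimate-layer tokens, weighted by how well they match the corresponding deepest-layer query.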

[334] Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric

Xiangjie Sui,Songyang Li,Hanwei Zhu,Baoliang Chen,Yuming Fang,Xin Sun

Main category: cs.CV

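TL;DR: Proposes the Bench-C benchmark and the Robustness Alignment Score (RAS) metric to assess more accurately the robustness of large vision-language models under visual corruption, exposing the shortcomings of existing approaches and showing the distinct behavior patterns models exhibit under corruption.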

Details Motivation: Existing evaluation paradigms are dominated by low-discriminative samples, and conventional accuracy-based metrics cannot capture degradation of the underlying prediction structure, so more effective tools are needed to measure the true robustness of vision-language models under visual corruption. Method: Proposes the Bench-C benchmark, built with a selection strategy that jointly considers prediction inconsistency under corruption and semantic diversity, and the RAS metric, which measures changes in prediction structure at the logit level by focusing on shifts in prediction uncertainty and calibration alignment. Result: Experiments find that (1) models show distinguishable behavior patterns under corruption, such as erroneous confidence and hesitation; (2) even when slight corruption brings a small accuracy gain, the overall prediction structure still degrades; (3) decomposing robustness into destructive and corrective components reveals different failure and recovery patterns across models. Conclusion: Bench-C and RAS evaluate LVLM robustness under visual corruption more comprehensively and finely, expose key issues that conventional metrics miss, and provide effective tools for future robustness research. Abstract: Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinct patterns under corruptions, such as erroneous confidence and hesitation; 2) although subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
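
A small numeric example shows why accuracy alone can hide degradation of the prediction structure. The sketch compares softmax entropy on clean and corrupted logits whose top-1 answer is identical. It is not the RAS formula, which the abstract does not give; the logit values are made up, and entropy is used only as one plausible proxy for "prediction uncertainty".

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p + eps)).sum())

# Hypothetical logits for one multiple-choice question with three options.
clean_logits = np.array([4.0, 1.0, 0.0])       # confident, correct option is index 0
corrupted_logits = np.array([1.5, 1.2, 1.0])   # same argmax, but nearly uniform

same_answer = clean_logits.argmax() == corrupted_logits.argmax()
h_clean = entropy(softmax(clean_logits))
h_corrupted = entropy(softmax(corrupted_logits))
print(same_answer, h_clean, h_corrupted)
```

Accuracy is unchanged because the predicted option is the same, yet the corrupted distribution is far less certain, which is the "hesitation" pattern an accuracy-only metric cannot see.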

[335] ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay

Gengyuan Zhang,Mingcong Ding,Jingpei Wu,Ruotong Liao,Volker Tresp

Main category: cs.CV

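TL;DR: Proposes ReEXplore, a training-free embodied exploration framework that uses retrospective experience replay and hierarchical frontier selection to improve the exploration efficiency and decision-making of MLLM agents in new environments.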

Details Motivation: Existing MLLM-based embodied agents exploring new environments are limited by static pre-trained knowledge, expensive training, and difficult decisions in a complex visual action space. Method: Uses a training-free retrospective experience replay mechanism that injects abstract experience at inference time, and hierarchical frontier selection that decomposes frontier ranking into coarse-to-fine decisions. Result: Significantly outperforms strong baselines on multiple embodied exploration benchmarks, with up to 3x higher success rate and navigation efficiency under open-source backbones. Conclusion: ReEXplore achieves robust, traceable, and efficient embodied exploration and overcomes the exploration bottleneck of MLLMs on long-horizon, sparse-reward tasks. Abstract: Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space in which MLLMs struggle to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields great improvements over strong MLLM baselines, up to 3x higher performance in both success rate and navigation efficiency under open-source backbones.

[336] CSD: Change Semantic Detection with only Semantic Change Masks for Damage Assessment in Conflict Zones

Kai Zhenga,Zhenkai Wu,Fupeng Wei,Miaolan Zhou,Kai Lie,Haitao Guo,Lei Ding,Wei Zhang,Hang-Cheng Dong

Main category: cs.CV

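TL;DR: Proposes a new change semantic detection (CSD) task and a multi-scale cross-attention difference siamese network (MC-DiSNet) which, using a pre-trained DINOv3 model and the newly released Gaza-Change dataset, identifies damaged areas in remote sensing images of conflict zones efficiently and accurately, clearly outperforming conventional methods.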

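Details Motivation: Damage assessment in conflict zones faces scarce data, difficult annotation, high intra-class similarity, and ambiguous semantic boundaries; conventional semantic change detection (SCD) needs full pixel-level annotation of many bi-temporal images, which is costly and impractical, so a more efficient paradigm focused on the regions that actually changed is needed. Method: Adopts DINOv3 as the backbone to strengthen feature representation, builds the multi-scale cross-attention difference siamese network (MC-DiSNet) to extract change features from bi-temporal remote sensing images, and proposes the new change semantic detection (CSD) task, which annotates pixel-level semantics only for changed regions to reduce annotation burden. Also releases the Gaza-Change dataset, containing high-resolution satellite image pairs of the Gaza Strip from 2023-2024 with annotations of changed regions. Result: MC-DiSNet is validated on the Gaza-Change and SECOND datasets; results show strong performance on the CSD task, accurate identification of small damaged regions with blurred boundaries, better performance than conventional SCD methods, and potential for rapid damage assessment in real conflict zones. Conclusion: The proposed CSD task and MC-DiSNet model offer a new approach to rapid damage assessment in conflict zones; focusing on changed regions rather than full-image semantic annotation sharply lowers annotation cost and moves remote sensing change detection toward more efficient, practical use. Abstract: Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks.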
We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.

[337] MedSAM3: Delving into Segment Anything with Medical Concepts

Anglin Liu,Rundong Xue,Xu R. Cao,Yifan Shen,Yi Lu,Xiang Li,Qianqian Chen,Jintai Chen

Main category: cs.CV

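TL;DR: MedSAM-3 is a text-promptable medical image segmentation model that combines semantic concept labels with multimodal large language models to deliver general, precise segmentation across many medical imaging modalities.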

Details Motivation: Existing medical image segmentation methods generalize poorly and depend on extensive manual annotation, making them hard to adapt to new clinical applications. Method: Fine-tunes the SAM 3 architecture on medical images paired with semantic concept labels, and introduces the MLLM-driven MedSAM-3 Agent for reasoning and iterative refinement. Result: Experiments on X-ray, MRI, ultrasound, CT, and video show that MedSAM-3 significantly outperforms existing specialist and foundation models. Conclusion: MedSAM-3 achieves open-vocabulary, text-promptable medical image segmentation with good generalization and practicality, advancing the automation of medical image analysis. Abstract: Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.

[338] Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation

Ruojun Xu,Yu Kai,Xuhua Ren,Jiaxiang Cheng,Bing Ma,Tianxiang Zheng,Qinhlin Lu

Main category: cs.CV

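TL;DR: Proposes PG-DPO, a new preference optimization method for diffusion models that mitigates likelihood displacement through adaptive rejection scaling and implicit preference regularization, and performs better on video generation tasks.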

Details Motivation: DPO suffers from likelihood displacement in diffusion models, which hurts generation quality, especially in video generation; its failure modes within the diffusion framework need to be analyzed and resolved. Method: Formally analyzes the DPO loss within the diffusion framework, identifies two failure modes, optimization conflict and suboptimal maximization, and proposes PG-DPO, which combines Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to mitigate them. Result: Experiments show that PG-DPO outperforms existing methods on both quantitative metrics and qualitative evaluation, effectively improving preference alignment in diffusion models. Conclusion: By directly addressing likelihood displacement of DPO in diffusion models, PG-DPO markedly improves generation quality and preference alignment in video generation. Abstract: Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of the DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
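
Likelihood displacement can be seen directly from the standard DPO objective, which depends only on the margin between the chosen and rejected log-probability ratios. The sketch below uses the original log-likelihood form of DPO with made-up numbers; diffusion variants replace the log-probabilities with denoising-error terms, and nothing here reproduces PG-DPO, ARS, or IPR.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.
    logp_w / logp_l: policy log-probabilities of the chosen / rejected sample.
    ref_logp_*: the same quantities under the frozen reference model.
    Returns -log(sigmoid(margin)), computed as log(1 + exp(-margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))

# Hypothetical values: the policy starts equal to the reference model.
ref_w, ref_l = -10.0, -10.0
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# After some updates both log-probabilities fell, the rejected one by more.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)

print(loss_before, loss_after)
```

The loss drops even though the chosen sample became less likely (-12 versus -10), because only the gap between the two ratios is rewarded. That is the paradox the abstract calls likelihood displacement.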

[339] LAA3D: A Benchmark of Detecting and Tracking Low-Altitude Aircraft in 3D Space

Hai Wu,Shuai Tang,Jiale Wang,Longkun Zou,Mingyue Guo,Rongqin Liang,Ke Chen,Yaowei Wang

Main category: cs.CV

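TL;DR: Presents LAA3D, a large-scale dataset of real and synthetic images for 3D perception of low-altitude aircraft, establishes a unified evaluation benchmark, and proposes MonoLAA, a monocular 3D detection baseline for zoom cameras that shows good sim-to-real transfer.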

Details Motivation: Datasets dedicated to 3D perception of low-altitude aircraft (LAA) are lacking, which holds back detection and tracking research; high-quality, diverse, richly annotated data is needed to advance the field. Method: Builds LAA3D, a large-scale dataset with 15,000 real images and 600,000 synthetic frames covering urban and suburban scenes and several aerial object categories including eVTOL aircraft, MAVs, and helicopters; every instance is annotated with a 3D bounding box, class label, and identity; establishes the LAA3D Benchmark, which integrates multiple tasks and methods under unified evaluation protocols; proposes MonoLAA as a monocular 3D detection baseline that localizes robustly from zoom cameras with varying focal lengths. Result: LAA3D supports 3D object detection, multi-object tracking, and 6-DoF pose estimation; experiments show that models pretrained on synthetic data transfer effectively to real data after fine-tuning, demonstrating strong sim-to-real generalization; the MonoLAA baseline keeps detection stable under zoom. Conclusion: LAA3D provides a comprehensive data foundation and standardized evaluation platform for low-altitude 3D object perception, confirms the value of synthetic data for real applications, and lays groundwork for future research on low-altitude intelligent systems. Abstract: Perception of Low-Altitude Aircraft (LAA) in 3D space enables precise 3D object localization and behavior understanding. However, datasets tailored for 3D LAA perception remain scarce. To address this gap, we present LAA3D, a large-scale dataset designed to advance 3D detection and tracking of low-altitude aerial vehicles. LAA3D contains 15,000 real images and 600,000 synthetic frames, captured across diverse scenarios, including urban and suburban environments. It covers multiple aerial object categories, including electric Vertical Take-Off and Landing (eVTOL) aircraft, Micro Aerial Vehicles (MAVs), and Helicopters. Each instance is annotated with a 3D bounding box, class label, and instance identity, supporting tasks such as 3D object detection, 3D multi-object tracking (MOT), and 6-DoF pose estimation. Besides, we establish the LAA3D Benchmark, integrating multiple tasks and methods with unified evaluation protocols for comparison. Furthermore, we propose MonoLAA, a monocular 3D detection baseline, achieving robust 3D localization from zoom cameras with varying focal lengths. Models pretrained on synthetic images transfer effectively to real-world data with fine-tuning, demonstrating strong sim-to-real generalization. Our LAA3D provides a comprehensive foundation for future research in low-altitude 3D object perception.

[340] Granular Computing-driven SAM: From Coarse-to-Fine Guidance for Prompt-Free Segmentation

Qiyang Yu,Yu Fang,Tianrui Li,Xuemei Cao,Yan Chen,Jianghao Li,Fan Min,Yi Zhang

Main category: cs.CV

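TL;DR: Proposes Grc-SAM, a granular computing-driven SAM framework that uses a coarse-to-fine, multi-granularity approach to improve localization and scalability in prompt-free image segmentation.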

Details Motivation: Existing pre-trained segmentation models such as SAM generate prompts at a single granularity, lack a mechanism for autonomous region localization, and struggle to model fine-grained detail at high resolution. Method: Designs a two-stage framework: the coarse stage adaptively extracts high-response regions for precise foreground localization; the fine stage applies finer patch partitioning and sparse local Swin-style attention to strengthen detail modeling; the refined masks are encoded as latent prompt embeddings that replace handcrafted prompts. Result: Experiments show that Grc-SAM outperforms baseline methods in accuracy and scalability, supports high-resolution segmentation, and reduces reliance on external prompts. Conclusion: Grc-SAM successfully combines granular computing with vision transformers and offers a new multi-granularity computational perspective on prompt-free image segmentation. Abstract: Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably the Segment Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.

[341] Understanding, Accelerating, and Improving MeanFlow Training

Jin-Young Kim,Hyojun Go,Lea Bogensperger,Julius Erbach,Nikolai Kalischek,Federico Tombari,Konrad Schindler,Dominik Narnhofer

Main category: cs.CV

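TL;DR: Studies the interaction between the instantaneous and average velocity fields in MeanFlow and proposes an improved training strategy that significantly raises few-step generation quality and training efficiency.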

Details Motivation: MeanFlow performs well in few-step generation, but the training dynamics of its two velocity fields are unclear and need closer analysis to improve the model. Method: Analyzes the interaction between instantaneous and average velocity and designs a staged training strategy that first accelerates learning of the instantaneous velocity, then gradually shifts toward learning long-interval average velocity. Result: With the same DiT-XL backbone, the improved MeanFlow training lowers FID on 1-NFE ImageNet 256x256 from 3.43 to 2.87; alternatively, it matches baseline performance with 2.5x shorter training or with a smaller DiT-L backbone. Conclusion: An effective training schedule should first establish accurate instantaneous and short-interval average velocities, which in turn supports learning large-interval average velocities and yields more efficient, higher-quality few-step generation. Abstract: MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
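
The relationship between instantaneous and average velocity can be checked numerically on a toy one-dimensional path. The sketch assumes the usual MeanFlow definitions, in which the average velocity over [r, t] is the displacement divided by t - r and satisfies u = v - (t - r) du/dt. The path z(t) = t² is a made-up example whose velocity depends only on time, whereas the learned fields in MeanFlow also depend on the state z_t.

```python
def z(t):
    """Toy trajectory (assumption): z(t) = t^2."""
    return t * t

def v(t):
    """Instantaneous velocity dz/dt of the toy trajectory."""
    return 2.0 * t

def u(r, t):
    """Average velocity over [r, t]: displacement divided by elapsed time."""
    return (z(t) - z(r)) / (t - r)

r, t, h = 0.2, 0.9, 1e-6

# Central finite difference for du/dt at fixed r.
du_dt = (u(r, t + h) - u(r, t - h)) / (2.0 * h)

# MeanFlow identity: u(r, t) = v(t) - (t - r) * du/dt.
identity_residual = abs(u(r, t) - (v(t) - (t - r) * du_dt))

# One step with the average velocity jumps from time t back to time r exactly.
one_step = z(t) - (t - r) * u(r, t)

# For a small gap the average velocity approaches the instantaneous one.
small_gap_error = abs(u(t - 1e-4, t) - v(t))

print(u(r, t), identity_residual, one_step, small_gap_error)
```

The last quantity illustrates finding (ii) in spirit: for small gaps the two velocities nearly coincide, so learning one informs the other, while for large gaps they differ substantially.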

[342] DynaMix: Generalizable Person Re-identification via Dynamic Relabeling and Mixed Data Sampling

Timur Mamedov,Anton Konushin,Vadim Konushin

Main category: cs.CV

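TL;DR: DynaMix is a new method for generalizable person re-identification that combines a small amount of labeled multi-camera data with large-scale pseudo-labeled single-camera data and dynamically adapts to the structure and noise of the training data, enabling efficient and scalable training.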

Details Motivation: Existing methods rely on limited labeled multi-camera data and generalize poorly to unseen environments; large-scale single-camera data is needed to improve generalization. Method: DynaMix has three core modules: a Relabeling Module that refines pseudo-labels dynamically, an Efficient Centroids Module that efficiently maintains representations for a large number of identities, and a Data Sampling Module that optimizes the composition of mixed-data mini-batches. Result: The method trains effectively on large-scale image collections with hundreds of thousands of identities, and experiments show that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID. Conclusion: By dynamically integrating multi-source data and optimizing the training process, DynaMix substantially improves the generalization and scalability of person re-identification. Abstract: Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.

[343] DEAP-3DSAM: Decoder Enhanced and Auto Prompt SAM for 3D Medical Image Segmentation

Fangda Chen,Jintao Tang,Pancheng Wang,Ting Wang,Shasha Li,Ting Deng

Main category: cs.CV

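TL;DR: This paper proposes DEAP-3DSAM, a decoder-enhanced, auto-prompt method for 3D medical image segmentation that fuses spatial information and uses a dual-attention mechanism to outperform existing methods.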

Details Motivation: When applied to 3D medical image segmentation, existing SAM models lose spatial features and depend on manual prompts, which limits their practical use. Method: A Feature Enhanced Decoder fuses the original image features with detailed spatial information, and a Dual Attention Prompter (Spatial Attention and Channel Attention) generates prompt information automatically. Result: Experiments on four public abdominal tumor segmentation datasets show that DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing methods that rely on manual prompts. Quantitative and qualitative ablation studies confirm the effectiveness of the proposed modules. Conclusion: DEAP-3DSAM mitigates SAM's spatial feature loss in 3D medical image segmentation and enables automatic prompting, improving segmentation performance and showing good prospects for practical use. Abstract: The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation. Although SAM is primarily trained on 2D images, attempts have been made to apply it to 3D medical image segmentation. However, the pseudo 3D processing used to adapt SAM results in spatial feature loss, limiting its performance. Additionally, most SAM-based methods still rely on manual prompts, which are challenging to implement in real-world scenarios and require extensive external expert knowledge. To address these limitations, we introduce the Decoder Enhanced and Auto Prompt SAM (DEAP-3DSAM). Specifically, we propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features. We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention. We conduct comprehensive experiments on four public abdominal tumor segmentation datasets. The results indicate that our DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing manual prompt methods. Furthermore, both quantitative and qualitative ablation studies confirm the effectiveness of our proposed modules.

[344] Graph-based 3D Human Pose Estimation using WiFi Signals

Jichao Chen,YangYang Qu,Ruibo Tang,Dirk Slock

Main category: cs.CV

TL;DR: Proposes GraphPose-Fi, a graph-based framework for WiFi human pose estimation that explicitly models skeletal topology and improves 3D HPE performance.

Details Motivation: Existing WiFi-based HPE methods ignore the topological relationships among human joints, which limits estimation accuracy. Method: The framework comprises a CNN encoder, a lightweight attention module, and a graph-based regression head that uses GCN layers and self-attention to capture local topology and global dependencies. Result: The method significantly outperforms existing methods on the MM-Fi dataset, validating the model's effectiveness. Conclusion: By explicitly modeling skeletal structure, GraphPose-Fi improves 3D human pose estimation from WiFi signals. Abstract: WiFi-based human pose estimation (HPE) has attracted increasing attention due to its resilience to occlusion and its privacy-preserving nature compared to camera-based methods. However, existing WiFi-based HPE approaches often employ regression networks that directly map WiFi channel state information (CSI) to 3D joint coordinates, ignoring the inherent topological relationships among human joints. In this paper, we present GraphPose-Fi, a graph-based framework that explicitly models skeletal topology for WiFi-based 3D HPE. Our framework comprises a CNN encoder shared across antennas for subcarrier-time feature extraction, a lightweight attention module that adaptively reweights features over time and across antennas, and a graph-based regression head that combines GCN layers with self-attention to capture local topology and global dependencies. Our proposed method significantly outperforms existing methods on the MM-Fi dataset in various settings. The source code is available at: https://github.com/Cirrick/GraphPose-Fi.
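
To make "explicitly models skeletal topology" concrete, the sketch below builds the symmetrically normalized adjacency matrix that standard GCN layers use, D^{-1/2}(A + I)D^{-1/2}, from a list of bones, and applies one aggregation step. It is a generic illustration and not the GraphPose-Fi regression head: the three-joint chain, the function names, and the omission of learned weights and self-attention are simplifications assumed here.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Return D^{-1/2} (A + I) D^{-1/2} as a list of lists."""
    # Start from the identity so every joint keeps its own feature.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in bones:  # undirected bone between joints i and j
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(num_joints)]
            for i in range(num_joints)]


def propagate(a_hat, features):
    """One aggregation step A_hat @ X (no learned weights, no activation)."""
    dim = len(features[0])
    n = len(features)
    return [[sum(a_hat[i][k] * features[k][d] for k in range(n))
             for d in range(dim)]
            for i in range(n)]


# Hypothetical toy skeleton: a chain of three joints, 0 - 1 - 2.
A_HAT = normalized_adjacency(3, [(0, 1), (1, 2)])
```

Joints that are not connected by a bone (here 0 and 2) exchange no information in a single step, which is the local-topology bias a direct CSI-to-coordinates regressor lacks.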

[345] HABIT: Human Action Benchmark for Interactive Traffic in CARLA

Mohan Ramesh,Mark Azer,Fabian B. Flohr

Main category: cs.CV

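TL;DR: HABIT is a high-fidelity autonomous driving simulation benchmark that integrates real human motion data from mocap and video to address the inadequate representation of human behavior in existing simulations, and it reveals critical weaknesses of current state-of-the-art autonomous driving systems in complex pedestrian interactions.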

Details Motivation: Existing autonomous driving simulators cannot adequately reproduce realistic and diverse human behavior, so assessments of system safety and reliability are inaccurate, with especially severe limitations in scenarios involving pedestrian interaction. Method: The HABIT benchmark uses a modular, extensible, and physically consistent motion retargeting pipeline to integrate real-world human motion (from mocap and video) into the CARLA simulator, and curates 4,730 traffic-compatible pedestrian motions in SMPL format, supporting automated scenario generation and agent evaluation. Result: When three state-of-the-art autonomous driving systems (InterFuser, TransFuser, and BEVDriver) are tested on HABIT, they show up to 7.43 collisions/km, a 12.94% AIS 3+ injury risk, and a false positive braking rate of up to 33%, exposing planner weaknesses that traditional scripted simulations do not reveal. Conclusion: HABIT evaluates autonomous driving systems in complex pedestrian interactions more realistically, reveals safety risks in existing systems, and supports the development of safer and more reliable autonomous driving; all components are publicly released to support reproducible research. Abstract: Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. 
Despite achieving zero or near-zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
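
The two rate-style figures quoted above (collisions/km and the share of unnecessary braking) can be illustrated with the small sketch below. The abstract does not give the exact definitions used by HABIT, so both functions are assumptions: collisions are normalized by driven distance, and the false positive braking rate is taken as the fraction of evaluated cases flagged as unnecessary braking.

```python
def collisions_per_km(num_collisions, distance_m):
    """Collision count normalized by driven distance in kilometers."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return num_collisions / (distance_m / 1000.0)


def false_positive_braking_rate(braked_unnecessarily):
    """Fraction of cases in which the agent braked without need.

    braked_unnecessarily: list of booleans, one per evaluated case
    (an assumed input format for this illustration).
    """
    if not braked_unnecessarily:
        return 0.0
    return sum(braked_unnecessarily) / len(braked_unnecessarily)
```

For instance, three collisions over 1,500 m of driving give 2.0 collisions/km under this definition.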

[346] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Hai Ci,Ziheng Peng,Pei Yang,Yingxin Xuan,Mike Zheng Shou

Main category: cs.CV

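TL;DR: This paper presents DiffSeg30k, a public dataset of 30k diffusion-edited images with pixel-level annotations, intended to advance fine-grained detection of AI-generated content (AIGC). It shifts detection from whole-image classification to semantic segmentation and shows both the potential and the challenges of segmentation models in localizing edited regions and identifying the editing model.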

Details Motivation: Existing AIGC detection benchmarks focus on whole-image classification and overlook the localization of local diffusion-based edits, so they cannot address the detection of local tampering in real-world settings. Method: The DiffSeg30k dataset of 30k images is built by applying up to three sequential edits to real images from COCO with eight SOTA diffusion models, using a vision-language model to generate context-aware editing prompts automatically; pixel-level annotations support precise localization of edited regions and identification of the editing model. Result: Experiments show that existing semantic segmentation methods still struggle on this task, particularly in robustness to image distortions; however, segmentation models trained for pixel-level localization unexpectedly outperform traditional forgery classifiers at whole-image classification and generalize well across generators. Conclusion: DiffSeg30k moves AIGC detection toward fine-grained, localizable detection, demonstrates the promise and limitations of segmentation-based methods for detecting diffusion edits, and provides an important benchmark for future research. Abstract: Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images--we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models--local edits using eight SOTA diffusion models; 3) Multi-turn editing--each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios--a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. 
We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
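
One simple way a segmentation output can double as a whole-image classifier, as the abstract reports, is to aggregate the predicted mask. The sketch below is an assumed aggregation rule, not the authors' evaluation protocol: label 0 is taken to mean "unedited", any positive label is a hypothetical editing-model id, and `min_edited_fraction` is an illustrative threshold.

```python
from collections import Counter


def image_level_prediction(mask, min_edited_fraction=0.01):
    """Turn a per-pixel label mask into (is_edited, dominant_editor).

    mask: 2D list of ints; 0 = unedited pixel, k > 0 = hypothetical id
          of the diffusion model predicted to have edited that pixel.
    """
    pixels = [p for row in mask for p in row]
    edited = [p for p in pixels if p != 0]
    if not pixels or len(edited) / len(pixels) < min_edited_fraction:
        return False, None
    # The most frequent non-zero label is reported as the editor.
    dominant_editor = Counter(edited).most_common(1)[0][0]
    return True, dominant_editor
```

The same mask therefore yields both the localization (which pixels are non-zero) and an image-level decision.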

[347] 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

Minchong Chen,Xiaoyun Yuan,Junzhe Wan,Jianing Zhang,Jun Zhang

Main category: cs.CV

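TL;DR: Proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework for mobile thermal imaging super-resolution. A cross-modal self-attention module adaptively aligns thermal infrared and RGB features during denoising, substantially improving image quality and downstream task performance.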

Details Motivation: Existing thermal super-resolution methods either have too little information in a single modality or depend on accurate and laborious cross-camera calibration, which makes high accuracy and robustness hard to achieve on mobile platforms. Method: The 3M-TI framework introduces a cross-modal self-attention module (CSM) into the diffusion UNet in place of the original self-attention layers, aligning thermal and RGB features without calibration and using the generative prior to improve the spatial resolution, structural fidelity, and texture detail of thermal images. Result: Evaluations on real mobile thermal cameras and public datasets show state-of-the-art results in both visual quality and quantitative metrics, along with substantial gains in downstream tasks such as object detection and semantic segmentation. Conclusion: 3M-TI achieves calibration-free cross-modal thermal image super-resolution with good practicality and robustness, offering an efficient solution for thermal perception systems on mobile platforms. Abstract: The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.

[348] MonoSR: Open-Vocabulary Spatial Reasoning from Monocular Images

Qirui Wang,Jingyi He,Yining Pan,Si Yong Yeo,Xulei Yang,Shijie Li

Main category: cs.CV

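TL;DR: Proposes MonoSR, a large-scale monocular spatial reasoning dataset covering indoor, outdoor, and object-centric scenes with multiple question types, and evaluates the limitations of vision-language models on this task.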

Details Motivation: Existing research focuses mainly on indoor environments and relies on multi-view observations, which limits generalization to outdoor scenes and to monocular images. Method: The authors build the MonoSR dataset with diverse scenes and question types, evaluate advanced vision-language models on it, and analyze how important auxiliary information is for monocular spatial reasoning. Result: The study reveals the shortcomings of existing vision-language models on monocular spatial reasoning and offers practical guidance for designing future models. Conclusion: MonoSR lays a foundation for monocular spatial reasoning in real-world, open-world environments. Abstract: Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits their generalizability to outdoor scenarios and constrains their applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.

[349] When Semantics Regulate: Rethinking Patch Shuffle and Internal Bias for Generated Image Detection with CLIP

Beilin Chu,Weike You,Mengtao Li,Tingting Zheng,Kehan Zhao,Xuan Xu,Zhigao Lu,Jia Song,Moxuan Xu,Linna Zhou

Main category: cs.CV

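TL;DR: This paper proposes SemAnti, a semantic-antagonistic fine-tuning paradigm that shuffles image patches to disrupt semantics and freezes the semantic subspace, improving CLIP's cross-domain generalization in AI-generated image detection.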

Details Motivation: Existing CLIP-based detectors rely on semantic cues rather than generator artifacts and are brittle under distribution shift; this paper aims to address that semantic bias. Method: Patch Shuffle is used to disrupt global semantic continuity while preserving local artifact cues, and SemAnti freezes the semantic subspace and fine-tunes only the artifact-sensitive layers. Result: The method achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, confirming that suppressing semantic bias improves detection robustness. Conclusion: Regulating semantics is key to unlocking CLIP's potential for AI-generated image detection, and SemAnti improves stability and generalization through semantic-antagonistic training. Abstract: The rapid progress of GANs and Diffusion Models poses new challenges for detecting AI-generated images. Although CLIP-based detectors exhibit promising generalization, they often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. In this work, we revisit the nature of semantic bias and uncover that Patch Shuffle provides an unusually strong benefit for CLIP: it disrupts global semantic continuity while preserving local artifact cues, which reduces semantic entropy and homogenizes feature distributions between natural and synthetic images. Through a detailed layer-wise analysis, we further show that CLIP's deep semantic structure functions as a regulator that stabilizes cross-domain representations once semantic bias is suppressed. Guided by these findings, we propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers under shuffled semantics. Despite its simplicity, SemAnti achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, demonstrating that regulating semantics is key to unlocking CLIP's full potential for robust AI-generated image detection.
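
Patch Shuffle itself is easy to state in code: cut the image into non-overlapping patches, permute the patches, and reassemble. The sketch below works on a single-channel image stored as a 2D list and is only a minimal illustration; the patch size, the uniform random permutation, and the function name are assumptions, and the paper's actual preprocessing may differ.

```python
import random


def patch_shuffle(image, patch, rng):
    """Randomly permute non-overlapping patch x patch blocks of a 2D image."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    # Cut the image into patches, listed in row-major order.
    patches = [
        [image[r * patch + i][c * patch:(c + 1) * patch] for i in range(patch)]
        for r in range(rows)
        for c in range(cols)
    ]
    rng.shuffle(patches)
    # Reassemble the shuffled patches into an image of the same size.
    out = []
    for r in range(rows):
        for i in range(patch):
            line = []
            for c in range(cols):
                line.extend(patches[r * cols + c][i])
            out.append(line)
    return out
```

Pixels inside each patch keep their relative positions, so local artifact statistics survive, while the global arrangement that carries object-level semantics is scrambled.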

[350] MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery

Shuyu Cao,Minxin Chen,Yucheng Song,Zhaozhong Chen,Xinyou Zhang

Main category: cs.CV

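TL;DR: This paper proposes MambaRefine-YOLO, a new method for small object detection in UAV imagery that fuses RGB and infrared (IR) data using a Dual-Gated Complementary Mamba fusion module (DGC-MFM) and a Hierarchical Feature Aggregation Neck (HFAN). It achieves a strong balance between accuracy and speed, reaching 83.2% mAP on the multi-modal DroneVehicle dataset, 7.9% above the baseline; HFAN alone also performs well on the single-modality VisDrone dataset, making the method suitable for practical UAV applications.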

Details Motivation: Small object detection in UAV imagery remains difficult because of low resolution and background clutter. Existing RGB-IR cross-modal fusion methods face a trade-off between effective interaction and computational efficiency, so a method is needed that strengthens multi-modal feature fusion while keeping inference efficient. Method: MambaRefine-YOLO has two core components: 1) the Dual-Gated Complementary Mamba Fusion Module (DGC-MFM), which adaptively balances RGB and IR information through illumination-aware and difference-aware gating; 2) the Hierarchical Feature Aggregation Neck (HFAN), which uses a "refine-then-fuse" strategy to enhance multi-scale features. Result: On the dual-modality DroneVehicle dataset, the full model reaches 83.2% mAP, 7.9% above the baseline; on the single-modality VisDrone dataset, a variant using only HFAN also shows significant gains, confirming its general applicability. The model remains efficient and suitable for real-time use. Conclusion: With the DGC-MFM and HFAN modules, MambaRefine-YOLO achieves a good balance of accuracy and speed for small object detection in both cross-modal and single-modal UAV imagery and is practical to deploy on real UAV systems. Abstract: Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.

[351] FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation

Zhifeng Xie,Keyi Zhang,Yiye Yan,Yuling Guo,Fan Yang,Jiting Zhou,Mengtian Li

Main category: cs.CV

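TL;DR: Proposes FilmSceneDesigner, a system that generates film scenes automatically from natural language descriptions, combining an agent-based chaining framework with a procedural generation pipeline to build complete scenes from structural layout to asset placement.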

Details Motivation: Traditional film set design depends on manual modeling by experts, which is time-consuming and labor-intensive and lacks automation support. Method: An agent-based chaining framework uses prompt strategies to generate structured parameters that follow the film set design workflow, and a procedural pipeline then produces the floorplan, materials, doors and windows, and object layout; the authors also build SetDepot-Pro, a dataset of 6,862 3D assets and 733 materials. Result: The system generates structurally sound scenes with strong cinematic realism and supports downstream tasks such as virtual previs, construction drawings, and mood boards. Conclusion: FilmSceneDesigner automates the generation of film-grade scenes and improves the efficiency and realism of set design. Abstract: Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates the professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with the film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. In addition, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.

[352] ABM-LoRA: Activation Boundary Matching for Fast Convergence in Low-Rank Adaptation

Dongha Lee,Jinhee Park,Minjun Kim,Junseok Kwon

Main category: cs.CV

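TL;DR: Proposes ABM-LoRA, an initialization strategy for low-rank adapters that substantially accelerates convergence by matching activation boundaries.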

Details Motivation: Random initialization in LoRA leaves gradient updates confined to a mismatched subspace, causing information loss and slow convergence. Method: Before downstream training, the adapter's activation boundaries are aligned with those of the pretrained model, maximizing the projection of full-parameter gradients onto the adapter subspace. Result: The method is validated across architectures and tasks, including language understanding, dialogue generation, and vision recognition; on VTAB-1K it achieves the highest accuracy, with particular gains on structured reasoning tasks that require geometric understanding. Conclusion: Through better initialization, ABM-LoRA reduces information loss, substantially speeds up convergence, and improves performance. Abstract: We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.

[353] Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation

Huisoo Lee,Jisu Han,Hyunsouk Cho,Wonjun Hwang

Main category: cs.CV

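TL;DR: This paper proposes CoMA, a new source-free domain adaptation (SFDA) framework that jointly uses two foundation models with complementary properties (e.g., CLIP and BLIP), preserving their semantic distinctiveness while adapting the target model stably, and clearly outperforms existing methods.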

Details Motivation: A single foundation model has limited semantic coverage in SFDA and struggles with the diverse contextual cues that arise under domain shift, so multiple foundation models need to be combined to improve adaptation. Method: The CoMA framework uses a bidirectional adaptation mechanism to align the different foundation models with the target model, and Decomposed Mutual Information (DMI) to suppress false dependencies caused by incomplete class coverage, which makes mini-batch training stable. Result: On four benchmarks (Office-31, Office-Home, DomainNet-126, and VisDA), CoMA outperforms existing state-of-the-art SFDA methods in the closed-set, partial-set, and open-set settings. Conclusion: Collaborative adaptation with multiple foundation models effectively improves SFDA performance through complementary semantic modeling and stable knowledge transfer, and suggests a new direction for future SFDA research. Abstract: Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.

[354] Test-Time Preference Optimization for Image Restoration

Bingchen Li,Xin Li,Jiaqi Xu,Jiaming Guo,Wenbo Li,Renjing Pei,Zhibo Chen

Main category: cs.CV

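TL;DR: This paper proposes the first Test-Time Preference Optimization (TTPO) paradigm for image restoration: a training-free, three-stage pipeline that generates preference data online and improves the perceptual quality of restored images, and that is compatible with any image restoration backbone.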

Details Motivation: Existing image restoration methods often fail to align with human preferences when handling unknown degradations, producing unsatisfactory results; retraining models or collecting large amounts of preference data is costly, so a flexible method that needs no retraining is required to improve perceptual quality. Method: A training-free, three-stage TTPO framework: (i) generate candidate preference images from the initially restored image using diffusion inversion and denoising; (ii) select preferred and dispreferred images using automated metrics or human feedback; (iii) use the selected images as reward signals to guide diffusion denoising and optimize the final output. Result: Extensive experiments across image restoration tasks and models show that the method effectively improves perceptual quality and is general and flexible. Conclusion: TTPO is a general, flexible, retraining-free optimization paradigm for image restoration that improves alignment between restored images and human preferences at test time and has broad application potential. Abstract: Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.

[355] MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes

Kehua Chen,Tianlu Mao,Zhuxin Ma,Hao Jiang,Zehao Li,Zihan Liu,Shuqi Gao,Honglong Zhao,Feng Dai,Yucheng Zhang,Zhaoqi Wang

Main category: cs.CV

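TL;DR: This paper proposes MetroGS, a new Gaussian Splatting framework for efficient and robust large-scale scene reconstruction in complex urban environments. It achieves high geometric fidelity and rendering quality through a distributed 2D Gaussian representation, structured dense enhancement, progressive geometric optimization, and depth-guided appearance modeling.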

Details Motivation: Although 3D Gaussian Splatting has advanced large-scale scene reconstruction, achieving high geometric fidelity efficiently and stably in complex urban environments remains challenging. Method: The method uses distributed 2D Gaussian Splatting as its base representation, combines SfM priors with a pointmap model for dense initialization, and adds a sparsity compensation mechanism; it designs a progressive hybrid geometric optimization strategy that integrates monocular and multi-view optimization; and it proposes depth-guided appearance modeling to improve 3D consistency. Result: Experiments on large-scale urban datasets show that MetroGS outperforms existing methods in geometric accuracy and rendering quality. Conclusion: MetroGS provides a unified and efficient solution for high-fidelity large-scale scene reconstruction in complex urban environments. Abstract: Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy and rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.

[356] Evaluating Deep Learning and Traditional Approaches Used in Source Camera Identification

Mansur Ozaman

Main category: cs.CV

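TL;DR: This paper compares three techniques for source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates their device classification accuracy and discusses the scientific developments needed to apply these methods in real-world scenarios.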

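Details Motivation: Accurately identifying the device that captured an image is essential for subsequent image analysis; existing methods still face challenges in practice and call for systematic comparison and improvement. Method: A comparative analysis of three source camera identification techniques (PRNU, JPEG compression artifact analysis, and CNNs), focusing on their device classification accuracy. Result: The paper compares the classification accuracy of the three methods, but the abstract gives no specific numbers. Conclusion: Each of the three methods has strengths and weaknesses, and further research is needed to bring them into real-world use. Abstract: One of the most important tasks in computer vision is identifying the device with which an image was taken, which facilitates further comprehensive analysis of the image. This paper presents a comparative analysis of three techniques used in source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates each method in terms of device classification accuracy. Furthermore, the research discusses the possible scientific development needed for the implementation of the methods in real-life scenarios.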

[357] nnActive: A Framework for Evaluation of Active Learning in 3D Biomedical Segmentation

Carsten T. Lüth,Jeremias Traub,Kim-Celine Kahl,Till J. Bungert,Lukas Klein,Lars Krämer,Paul F. Jaeger,Fabian Isensee,Klaus Maier-Hein

Main category: cs.CV

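TL;DR: This paper presents nnActive, an open-source active learning framework that addresses four major pitfalls in how active learning is currently evaluated for 3D biomedical image segmentation. A large-scale study finds that current active learning methods do not clearly outperform an improved foreground-aware random sampling baseline.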

Details Motivation: Annotating 3D biomedical images is expensive and requires expert knowledge. Active learning (AL) promises to reduce annotation effort, but the field lacks consistent and sound evaluation standards, which makes its real performance advantage hard to judge. Method: The nnActive framework supports a large-scale study covering four datasets and three label regimes; it extends nnU-Net to train on partial annotations with 3D patch-based query selection, designs foreground-aware random sampling strategies, and introduces a foreground efficiency metric that measures annotation cost more sensibly. Result: (A) All AL methods outperform standard random sampling, but none reliably beats foreground-aware random sampling; (B) the benefit of AL depends on task-specific parameters; (C) Predictive Entropy performs best overall but may require more annotation effort; (D) more compute-intensive design choices can improve AL performance. Conclusion: Active learning has not yet shown a decisive advantage in 3D biomedical image segmentation, and improved random sampling is a strong baseline; nnActive provides a unified, open evaluation platform for future research. Abstract: Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment. These are (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) Random baseline not being adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that overcomes these pitfalls by (1) means of a large scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU-Net by using partial annotations for training with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies tackling the foreground-background class imbalance of medical images and (4) proposing the foreground efficiency metric, which captures the low annotation cost of background-regions. 
We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) benefits of AL depend on task specific parameters; (C) Predictive Entropy is overall the best performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute intensive design choices. As a holistic, open-source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC-DKFZ/nnActive
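
Predictive Entropy, named in finding (C), scores a prediction by the entropy of its class-probability vector, H(p) = -sum_c p_c log p_c, and queries the regions with the highest score. The sketch below ranks patches by mean per-voxel entropy and returns the top `budget` patch ids. It is a generic illustration: the dictionary input format, the mean aggregation, and the function names are assumptions and are not taken from the nnActive code base.

```python
import math


def predictive_entropy(probs):
    """Entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(patch_probs, budget):
    """Pick the `budget` patches with the highest mean voxel entropy.

    patch_probs: dict mapping a patch id to a list of per-voxel
                 class-probability vectors (assumed input format).
    """
    scores = {
        pid: sum(predictive_entropy(v) for v in voxels) / len(voxels)
        for pid, voxels in patch_probs.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]
```

A random baseline would replace the entropy ranking with a random draw of patch ids; the foreground-aware variants described in the abstract additionally bias that draw toward foreground regions.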

[358] SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection

Nithira Jayarathne,Naveen Basnayake,Keshawa Jayasundara,Pasindu Dodampegama,Praveen Wijesinghe,Hirushika Pelagewatta,Kavishka Abeywardana,Sandushan Ranaweera,Chamira Edussooriya

Main category: cs.CV

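TL;DR: Proposes a lightweight, generalizable binary classification model based on EfficientNet-B6 for detecting deepfake images, with high accuracy and good generalization.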

Details Motivation: To counter the misinformation caused by deepfake images, a detection method is needed that non-experts can use effectively. Method: An EfficientNet-B6 model is fine-tuned with data augmentation, oversampling, and optimization strategies, and Fourier-transform phase and amplitude features are also tried. Result: The model handles severe class imbalance well and achieves high accuracy, stability, and generalization, but adding Fourier features has little effect. Conclusion: The proposed framework identifies deepfake images effectively without requiring expert knowledge, advancing accessible and reliable detection. Abstract: Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
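
For readers unfamiliar with the "phase and amplitude features" mentioned above, the sketch below computes the 2D discrete Fourier transform of a small grayscale image and splits it into an amplitude map and a phase map. It uses a naive transform written with the standard library purely for illustration; a real pipeline would use an FFT routine, and how the authors feed these maps to the network is not described in the abstract.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a 2D list of numbers."""
    h, w = len(image), len(image[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * cmath.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            row.append(total)
        out.append(row)
    return out


def amplitude_and_phase(image):
    """Return (amplitude, phase) maps of the image spectrum."""
    spectrum = dft2(image)
    amplitude = [[abs(z) for z in row] for row in spectrum]
    phase = [[cmath.phase(z) for z in row] for row in spectrum]
    return amplitude, phase
```

The amplitude map records how much energy each spatial frequency carries, and the phase map records where that structure sits in the image.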

[359] Three-Dimensional Anatomical Data Generation Based on Artificial Neural Networks

Ann-Sophia Müller,Moonkwang Jeong,Meng Zhang,Jiyuan Tian,Arkadiusz Miernik,Stefanie Speidel,Tian Qiu

Main category: cs.CV

TL;DR: Proposes an automated 3D anatomical data generation workflow based on physical organ models and a 3D generative adversarial network, to overcome the bottleneck of obtaining real patient data for surgical planning and training.

Details Motivation: Legal, ethical, and technical challenges make it very difficult to obtain high-quality 3D anatomical data from real patients, especially for soft-tissue organs such as the prostate, which limits the use of machine learning in surgical planning and training. Method: Surgery is simulated on an artificial prostate model made of biomimetic hydrogels, and a customized ultrasound scanner records images before and after the procedure; a neural network segments the images, 3D mesh models are reconstructed from the segmentations, and a 3D GAN generates additional anatomical variants. Result: The neural network outperforms conventional methods on ultrasound image segmentation as measured by IoU; 3D mesh models are reconstructed and performance feedback is provided; automated generation and expansion of 3D anatomical data is achieved. Conclusion: The workflow can generate 3D anatomical data usable for machine learning training, offering a feasible solution for surgical planning and training where real clinical data are lacking. Abstract: Surgical planning and training based on machine learning requires a large number of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
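The segmentation comparison above is reported in terms of intersection over union (IoU). For readers unfamiliar with the metric, a minimal sketch for two binary masks follows. The nested-list mask format and the handling of two empty masks are assumptions of this example, not details from the paper.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks.

    Masks are nested lists of 0/1 values (an illustrative format).
    """
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 overlapping pixels out of 4 in the union
```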

[360] CLASH: A Benchmark for Cross-Modal Contradiction Detection

Teodora Popordanoska,Jiameng Li,Matthew B. Blaschko

Main category: cs.CV

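TL;DR: This paper introduces CLASH, a new benchmark for multimodal contradiction detection that pairs COCO images with incorrect captions containing object-level or attribute-level contradictions, and designs several question formats to evaluate how well models identify cross-modal conflicts. Experiments show that current state-of-the-art models perform poorly on this task, with modality biases and category-specific weaknesses, while targeted fine-tuning on CLASH substantially improves performance.

Details Motivation: Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks mostly assume consistent inputs and do not evaluate cross-modal contradiction detection, leaving models prone to hallucination and insufficiently reliable. A dedicated benchmark is therefore needed to measure and improve this ability. Method: Builds the CLASH benchmark by generating incorrect captions with controlled object-level or attribute-level contradictions for COCO images, and evaluates with multiple-choice and open-ended questions; provides a large fine-tuning set filtered through automated quality checks and a small human-verified diagnostic set; evaluates state-of-the-art models on the benchmark and runs targeted fine-tuning experiments. Result: Current state-of-the-art multimodal models perform poorly on CLASH and struggle to recognize cross-modal contradictions, exposing systematic modality preferences (toward vision or toward text) and category-specific weaknesses; targeted fine-tuning on CLASH substantially improves contradiction detection. Conclusion: CLASH fills a gap in the evaluation of multimodal contradiction detection, reveals fundamental shortcomings of current models on inconsistent inputs, and underlines the importance of this capability for model robustness and trustworthiness. Abstract: Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.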

[361] Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?

Itay Cohen,Ethan Fetaya,Amir Rosenfeld

Main category: cs.CV

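TL;DR: The paper studies whether vision-language models such as CLIP can distinguish real objects from "lookalikes" (e.g., toys, statues, drawings), and introduces the RoLA dataset to evaluate this ability. Learning a "real vs. lookalike" direction in CLIP's embedding space improves cross-modal retrieval and image captioning.

Details Motivation: Computer vision models perform well on recognition tasks but still fall short of subtle human perceptual abilities, such as judging whether something merely looks like a member of a category; the paper therefore examines whether vision-language models have this discriminative ability. Method: Builds a dataset named RoLA of real and lookalike exemplars, evaluates a baseline with paired "real"/"lookalike" prompts, estimates a direction in CLIP's embedding space that separates real from lookalike samples, and uses this direction to improve cross-modal retrieval and caption generation. Result: The learned direction improves discrimination between real and lookalike samples in cross-modal retrieval on Conceptual12M and improves the text produced by a CLIP prefix captioner. Conclusion: Vision-language models such as CLIP have some capacity for "real vs. lookalike" discrimination, and modeling a specific direction in their embedding space improves semantic understanding and generation, bringing them closer to fine-grained human perception. Abstract: Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.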
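The abstract does not say how the real/lookalike direction is estimated. One simple possibility, shown purely as an assumption, is the difference between the mean embedding of real exemplars and the mean embedding of lookalike exemplars. The toy 2-D vectors and the helper names below are hypothetical stand-ins for CLIP embeddings.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def real_lookalike_direction(real_embs, lookalike_embs):
    """Difference of class means, pointing from 'lookalike' toward 'real'.

    This estimator is an assumption of the sketch, not the paper's method.
    """
    mu_real = mean_vector(real_embs)
    mu_look = mean_vector(lookalike_embs)
    return [r - l for r, l in zip(mu_real, mu_look)]


def shift(embedding, direction, alpha):
    """Move an embedding along the direction by a step size alpha."""
    return [e + alpha * d for e, d in zip(embedding, direction)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy 2-D "embeddings" standing in for CLIP features.
real = [[1.0, 0.0], [3.0, 0.0]]
lookalike = [[0.0, 1.0], [0.0, 3.0]]
direction = real_lookalike_direction(real, lookalike)
shifted = shift([0.0, 0.0], direction, alpha=0.5)
```

Projecting an embedding onto `direction` with `dot` then gives a scalar score that is positive for the toy real samples and negative for the toy lookalikes.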

[362] NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting

Brent Zoomers,Florian Hahlbohm,Joni Vanherck,Lode Jorissen,Marcus Magnor,Nick Michiels

Main category: cs.CV

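TL;DR: Proposes learning viewpoint-dependent visibility with a small shared MLP and combining it with an instanced software rasterizer and Tensor Cores, enabling efficient occlusion culling in 3D Gaussian Splatting rendering and improving VRAM usage and image quality.

Details Motivation: The semi-transparent nature of Gaussians makes occlusion culling hard to apply to 3D Gaussian Splatting, which limits rendering efficiency for large scenes. Method: Designs a small shared MLP that learns the viewpoint-dependent visibility function of each Gaussian primitive and is queried for Gaussians inside the viewing frustum before rasterization; uses Tensor Cores to accelerate the computation and integrates the neural queries into a novel instanced software rasterizer for efficient culling. Result: The method outperforms the current state of the art on composed scenes in VRAM usage and image quality, and is complementary to existing level-of-detail (LoD) techniques. Conclusion: A learnable visibility predictor combined with a dedicated rasterizer enables effective occlusion culling for 3D Gaussian Splatting and offers a new approach to efficient rendering of complex scenes. Abstract: 3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging Tensor Cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.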

[363] ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng,Xiaofeng Tan,Junbo Wang,Guo-Sen Xie,Pan Zhou,Hongsong Wang

Main category: cs.CV

TL;DR: Proposes ReAlign, a reward-guided sampling alignment method that improves text-motion alignment in diffusion models for text-to-motion generation.

Details Motivation: Existing diffusion models for text-to-motion generation suffer from a misalignment between text and motion distributions, which leads to semantically inconsistent or low-quality motions. Method: Designs a step-aware reward model that combines a text-aligned module and a motion-aligned module to assess and optimize text-motion alignment quality during denoising. Result: Significantly outperforms existing state-of-the-art methods on motion generation and retrieval tasks, improving the semantic consistency and overall quality of generated motions. Conclusion: ReAlign mitigates the mismatch between text and motion distributions and produces higher-quality, semantically aligned 3D human motion. Abstract: Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.

[364] Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

Federico Felizzi,Olivia Riccomi,Michele Ferramola,Francesco Andrea Causio,Manuel Del Medico,Vittorio De Vita,Lorenzo De Mori,Alessandra Piscitelli Pietro Eric Risuleo,Bianca Destro Castaniti,Antonio Cristiano Alessia Longo,Luigi De Angelis,Mariapia Vassalli,Marcello Di Pumpo

Main category: cs.CV

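TL;DR: The study evaluates how much frontier large vision language models (VLMs) depend on visual input when answering Italian medical questions. GPT-4o shows the strongest visual grounding, whereas GPT-5-mini, Gemini, and Claude rely more on textual cues, indicating that visual integration must be rigorously evaluated before clinical deployment.

Details Motivation: It is unclear whether current large vision language models truly rely on visual information in medical visual question answering, so their visual grounding needs to be verified to ensure clinical reliability. Method: Takes 60 questions from the EuropeMedQA Italian dataset that require image interpretation, replaces the correct medical images with blank placeholders, and measures the change in accuracy for four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Result: GPT-4o's accuracy drops by 27.9 percentage points (from 83.2% to 55.3%), showing the strongest visual dependency; GPT-5-mini, Gemini, and Claude drop by only 8.5, 2.4, and 5.6 percentage points respectively, suggesting greater reliance on textual reasoning than on genuine visual analysis. All models produce plausible-sounding explanations for fabricated visual interpretations. Conclusion: The degree of visual grounding in medical visual question answering differs markedly across models, some models rely on textual shortcuts instead of genuine image understanding, and more rigorous evaluation of visual integration is needed before clinical use. Abstract: Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.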
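The headline figures are accuracy drops in percentage points (pp), i.e. accuracy with the real image minus accuracy with a blank placeholder. The sketch below reproduces that arithmetic for the GPT-4o numbers quoted in the abstract. The `accuracy` helper and its toy answer lists are hypothetical and only illustrate how each accuracy would be obtained.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def drop_pp(acc_with_image, acc_blank_image):
    """Accuracy drop in percentage points when the image is blanked."""
    return acc_with_image - acc_blank_image


# Figures quoted in the abstract for GPT-4o.
gpt4o_drop = round(drop_pp(83.2, 55.3), 1)

# Hypothetical toy run showing how one accuracy value is computed.
toy_acc = accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"])
```

A larger drop indicates that the model's answers depended more on the image content.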

[365] Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

Jianhua Han,Meng Tian,Jiangtong Zhu,Fan He,Huixin Zhang,Sitong Guo,Dechang Zhu,Hao Tang,Pei Xu,Yuze Guo,Minzhe Niu,Haojie Zhu,Qichao Dong,Xuechao Yan,Siyuan Dong,Lu Hou,Qingqiu Huang,Xiaosong Jia,Hang Xu

Main category: cs.CV

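TL;DR: Percept-WAM is a new perception-enhanced World-Awareness-Action Model and the first to implicitly integrate 2D/3D scene understanding within a single vision-language model. By introducing World-PV and World-BEV tokens and a grid-conditioned prediction mechanism, it achieves better performance on spatial perception and localization tasks for autonomous driving.

Details Motivation: Existing vision-language models are weak at spatial grounding and understanding, so vision-language-action (VLA) systems built on them have limited perception and localization ability in complex and long-tail scenarios and cannot meet autonomous driving's need for accurate, robust spatial perception. Method: Proposes Percept-WAM, which unifies 2D/3D perception tasks into World-PV and World-BEV tokens and introduces a grid-conditioned prediction mechanism with IoU-aware scoring and parallel autoregressive decoding to stabilize perception in long-tail, far-range, and small-object scenarios; it retains the general intelligence of the pretrained VLM and can directly output perception results and trajectory control. Result: Achieves 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection respectively, matching or surpassing classical detectors; improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM; qualitative results show its strengths in open-vocabulary and long-tail generalization. Conclusion: Through unified, implicit 2D/3D perception modeling, Percept-WAM substantially improves spatial perception and planning in complex driving scenes, generalizes well, is practical, and offers a new design paradigm for VLA systems. Abstract: Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection.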
When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.

[366] Learning Plug-and-play Memory for Guiding Video Diffusion Models

Selena Song,Ziming Xu,Zijun Zhang,Kun Zhou,Jiaxian Guo,Lianhui Qin,Biwei Huang

Main category: cs.CV

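TL;DR: This paper proposes DiT-Mem, a learnable memory encoder that encodes reference videos into memory tokens and injects them into a DiT model, strengthening adherence to physical laws and semantic information during video generation and providing an efficient, plug-and-play form of memory augmentation.

Details Motivation: Existing DiT video generation models achieve good visual quality and temporal coherence but lack an understanding of basic physical laws and commonsense dynamics, so external world knowledge is needed to make their outputs more plausible. Method: Inspired by in-context memory in LLMs, the authors find that low-pass and high-pass filtering in the hidden-state space separates appearance from high-level semantic features. Based on this, they design DiT-Mem, a learnable memory encoder built from 3D CNNs, filter modules, and self-attention layers, which encodes reference videos into compact memory tokens that are combined with the DiT self-attention layers; during training only the memory encoder is optimized and the diffusion backbone stays frozen. Result: Experiments show that the method improves DiT models' physical rule following and video fidelity, trains efficiently with only 150M parameters and 10K samples, and supports plug-and-play deployment at inference time. Conclusion: DiT-Mem is an effective memory-augmentation scheme that lets video diffusion models use world knowledge from reference videos, improving the physical plausibility and semantic consistency of generated results while remaining efficient to train and broadly applicable. Abstract: Diffusion Transformer (DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.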
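The abstract mentions simple low-pass and high-pass filters but does not specify their design. To make the idea concrete, the sketch below uses a moving average as the low-pass filter and the residual as the high-pass filter on a 1-D sequence. The filter type, the window radius, and the 1-D toy signal are all assumptions of this example and should not be read as the DiT-Mem implementation.

```python
def low_pass(signal, radius=1):
    """Moving average; the window is clipped at the sequence boundaries."""
    out = []
    for i in range(len(signal)):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def high_pass(signal, radius=1):
    """Residual after removing the low-pass (smooth) component."""
    smooth = low_pass(signal, radius)
    return [s - m for s, m in zip(signal, smooth)]


# Toy 1-D feature sequence: slow trend plus fast oscillation.
signal = [1.0, 5.0, 1.0, 5.0, 1.0]
slow = low_pass(signal)
fast = high_pass(signal)
```

By construction the two components sum back to the input, so the split separates smooth structure from rapid variation without discarding information.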

[367] IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes

Carl Lindström,Mahan Rafidashti,Maryam Fatemi,Lars Hammarstrand,Martin R. Oswald,Lennart Svensson

Main category: cs.CV

TL;DR: This paper proposes IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without human annotations.

Details Motivation: Existing methods either rely on costly human annotations or lack explicit object-level decomposition, which entangles static and dynamic elements and makes scenes hard to separate. Method: Models dynamic objects as coherent instances undergoing rigid transformations; performs instance decomposition with zero-shot, language-grounded video tracking combined with lidar, estimates consistent poses via feature correspondences, and introduces a coordinated-turn smoothing scheme before jointly optimizing motion trajectories and Gaussian parameters. Result: Experiments on the Waymo Open Dataset show competitive reconstruction quality while maintaining instance-level decomposition, with generalization across sequences and view densities without retraining. Conclusion: IDSplat achieves high-quality, decomposable reconstruction of dynamic scenes and is suited to large-scale autonomous driving applications. Abstract: Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.

[368] Adversarial Patch Attacks on Vision-Based Cargo Occupancy Estimation via Differentiable 3D Simulation

Mohamed Rissal Hedna,Sesugh Samuel Nder

Main category: cs.CV

TL;DR: This is the first study of adversarial patch attacks on cargo-occupancy estimation in fully simulated 3D scenes. Patch textures are optimized with differentiable rendering, and the attack success rate reaches 84.94% in the denial-of-service scenario.

Details Motivation: Examines how vulnerable computer-vision-based cargo-occupancy estimation systems in modern logistics are to physical adversarial attacks, in particular adversarial patches in 3D environments. Method: Uses Mitsuba 3 for differentiable rendering to optimize adversarial patch textures in simulated 3D environments, compares them with a 2D compositing baseline, and evaluates attack effectiveness under variations in geometry, lighting, and viewpoint. Result: 3D-optimized patches reach an 84.94% success rate in the denial-of-service attack (empty to full) and 30.32% in the concealment attack (full to empty), clearly outperforming the 2D method. Conclusion: Current vision-based logistics systems are vulnerable to physical adversarial patch attacks, and physical robustness must be strengthened to secure automated logistics. Abstract: Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.

[369] LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

Shuai Wang,Daoan Zhang,Tianyi Bai,Shitong Shao,Jiebo Luo,Jiaheng Wei

Main category: cs.CV

TL;DR: This paper proposes LAST (LeArn to Think in Space and Time), which aims to improve the 3D spatial and long-video understanding of general vision-language models (VLMs). It builds visual thinking trajectories along the spatial and temporal dimensions and needs only 2D images as input to understand 3D and temporal information.

Details Motivation: Existing VLMs are strong on typical vision-language tasks but still struggle to understand and reason about 3D spatial structure and long videos, and current methods usually rely on specialized architectures that handle 3D and video tasks separately, without a unified, effective solution. Method: Proposes the LAST framework, which makes VLMs explicitly "think in space and time" during reasoning by building visual thinking trajectories in 3D space and along the temporal dimension; it supports two modes, zero-shot prompting of proprietary models and fine-tuning open VLMs on data that include thinking trajectories. Result: LAST brings substantial gains on 3 spatial understanding, 4 video understanding, and 3 image understanding tasks, including a 15.8% gain for GPT-4o on EgoSchema in the zero-shot setting and a gain of 8.3 over Qwen2.5-VL-7B on VSI-Bench. Conclusion: LAST is a general and effective way to help existing VLMs understand and reason about 3D space and long-horizon dynamic content without specialized architecture design, and it is scalable and practical. Abstract: Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains in various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3 gains on VSI-Bench compared with Qwen2.5-VL-7B.

[370] BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

Dewei Zhou,Mingwei Li,Zongxin Yang,Yu Lu,Yunqiu Xu,Zhizhong Wang,Zeyi Huang,Yi Yang

Main category: cs.CV

TL;DR: This paper proposes a bidirectionally decoupled DPO framework (BideDPO) to resolve conflicts between the conditioning input and the text prompt in conditional image generation. Decoupled preference pairs and an adaptive loss balancing strategy substantially improve text success rates and condition consistency.

Details Motivation: Existing conditional image generation methods struggle with conflicts between input conditions and text prompts, including input-level conflicts and model-bias conflicts, and neither standard supervised fine-tuning nor existing preference optimization methods resolve them effectively. Method: Proposes the BideDPO framework, which builds two decoupled preference pairs (one for the condition and one for the text), manages their influence with an adaptive loss balancing strategy, and designs an automated data pipeline that generates conflict-aware data, embedded in an iterative optimization strategy that jointly improves the model and the data. Result: Experiments show that BideDPO significantly improves text success rates (e.g., +35%) and condition adherence, validated on the self-built DualAlign benchmark and on the COCO dataset. Conclusion: Through bidirectional decoupling and adaptive optimization, BideDPO mitigates conflicts in multi-condition generation and offers a new solution for image generation under complex constraints. Abstract: Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs-one for the condition and one for the text-to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.
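To illustrate what "two disentangled preference pairs" could look like in loss form, the sketch below evaluates the standard DPO objective once for a text pair and once for a condition pair and combines them with fixed weights. This is only a sketch under stated assumptions: the abstract does not give the paper's exact objective, the Adaptive Loss Balancing rule is replaced here by constant placeholder weights, and the log-probability inputs are toy numbers.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are log-probabilities of the preferred (w) and rejected (l)
    samples under the policy and under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


def decoupled_loss(text_pair, cond_pair, w_text=0.5, w_cond=0.5, beta=0.1):
    """Weighted sum of one text-pair loss and one condition-pair loss.

    The fixed weights are placeholders (an assumption); the paper
    describes an adaptive balancing strategy instead.
    """
    return (w_text * dpo_loss(*text_pair, beta=beta)
            + w_cond * dpo_loss(*cond_pair, beta=beta))


# Toy log-probabilities: (logp_w, logp_l, ref_logp_w, ref_logp_l).
neutral = (-1.0, -1.0, -1.0, -1.0)   # policy equals reference
improved = (-0.5, -2.0, -1.0, -1.0)  # policy favours the preferred sample
loss_neutral = decoupled_loss(neutral, neutral)
loss_improved = decoupled_loss(improved, improved)
```

Keeping the two pairs as separate terms means each term's gradient comes from a pair that differs along only one axis (text or condition), which is the intuition behind reducing gradient entanglement.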

[371] Diffusion Reconstruction-based Data Likelihood Estimation for Core-Set Selection

Mingyang Chen,Jiawei Du,Bo Huang,Yi Wang,Xiaobo Zhang,Wei Wang

Main category: cs.CV

TL;DR: Proposes a new method for core-set selection that estimates data likelihood from the reconstruction deviation of a diffusion model. A theoretically supported relationship between reconstruction error and data likelihood yields better data subsets.

Details Motivation: Existing core-set selection methods rely on heuristic scoring signals such as training dynamics or model uncertainty and lack explicit modeling of data likelihood, so they may miss critical distributional structures that affect model training. Method: Estimates data likelihood from the reconstruction deviation induced by partial reverse denoising in a diffusion model, establishes a theoretical link between reconstruction error and data likelihood based on the ELBO, and introduces an information-theoretic method to determine the optimal reconstruction timestep. Result: Experiments on ImageNet show that the method outperforms existing baselines as a scoring criterion, approaches full-data training performance with only 50% of the data, and reveals the relationship between data distribution characteristics and model learning preferences. Conclusion: The method provides a principled, distribution-aware scoring criterion for data selection and confirms the value of explicitly modeling data likelihood in core-set construction. Abstract: Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.

[372] ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Qianying Liu,Xiao Liang,Zhiqiang Zhang,Yibo Chen,Xu Tang,Zhongfei Qing,Fengfan Zhou,Yao Hu,Paul Henderson

Main category: cs.CV

TL;DR: ReMatch is a multimodal retrieval framework that exploits the generative ability of MLLMs, improving retrieval performance through end-to-end training and a generative matching mechanism.

Details Motivation: Existing methods treat the MLLM purely as an encoder and ignore its generative ability and compositional reasoning, leaving the model's potential underused. Method: Proposes the ReMatch framework with a chat-style autoregressive generative matching stage that uses multi-view inputs (raw data and their embeddings) for instance-wise discrimination supervision, and introduces multiple learnable tokens to produce fine-grained, orthogonal multimodal embeddings. Result: Sets a new state of the art on the MMEB benchmark and shows strong zero-shot generalization on five datasets. Conclusion: ReMatch makes effective use of the generative nature of MLLMs, achieving stronger performance and better transferability in multimodal retrieval. Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.

[373] DensifyBeforehand: LiDAR-assisted Content-aware Densification for Efficient and Quality 3D Gaussian Splatting

Phurtivilai Patt,Leyang Huang,Yinqiang Zhang,Yang Lei

Main category: cs.CV

TL;DR: Proposes a "densify beforehand" approach that combines LiDAR with monocular depth estimation to improve the initial distribution of 3D Gaussians, raising visual quality and computational efficiency.

Details Motivation: Existing 3D Gaussian Splatting methods rely on adaptive density control, which tends to produce floating artifacts and uses resources inefficiently. Method: Combines sparse LiDAR data with depth estimated from monocular RGB images and proposes an ROI-aware sampling strategy that densely initializes key regions before optimization. Result: Validated on several newly collected datasets; compared with state-of-the-art methods it significantly reduces resource consumption and training time while keeping competitive visual quality. Conclusion: By avoiding redundant Gaussians, the method improves fidelity and efficiency for regions of interest in complex scenes. Abstract: This paper addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, particularly their reliance on adaptive density control, which can lead to floating artifacts and inefficient resource usage. We propose a novel densify beforehand approach that enhances the initialization of 3D scenes by combining sparse LiDAR data with monocular depth estimation from corresponding RGB images. Our ROI-aware sampling scheme prioritizes semantically and geometrically important regions, yielding a dense point cloud that improves visual fidelity and computational efficiency. This densify beforehand approach bypasses the adaptive density control that may introduce redundant Gaussians in the original pipeline, allowing the optimization to focus on the other attributes of 3D Gaussian primitives, reducing overlap while enhancing visual quality. Our method achieves comparable results to state-of-the-art techniques while significantly lowering resource consumption and training time. We validate our approach through extensive comparisons and ablation studies on four newly collected datasets, showcasing its effectiveness in preserving regions of interest in complex scenes.

[374] IDEAL-M3D: Instance Diversity-Enriched Active Learning for Monocular 3D Detection

Johannes Meier,Florian Günther,Riccardo Marin,Oussema Dhaouadi,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

TL;DR: This paper proposes IDEAL-M3D, the first instance-level active learning framework for monocular 3D detection. A diversified ensemble strategy makes sample selection more efficient, matching or exceeding full-data training with only 60% of the annotations.

Details Motivation: Existing active learning methods for monocular 3D detection have two problems: they select whole images, so non-informative instances are also labeled, which is inefficient; and they rely on uncertainty sampling, which favors distant objects and overlooks nearby ones. A more efficient, unbiased instance-level sampling method is therefore needed. Method: Proposes IDEAL-M3D, the first instance-level active learning pipeline, which uses heterogeneous backbones, task-agnostic features, loss weight perturbation, and time-dependent bagging to build a diverse, fast-to-train ensemble for diversity-driven instance-level sample selection. Result: On the KITTI validation and test sets, IDEAL-M3D reaches similar or better AP3D with only 60% of the annotations than training on the full dataset, saving substantial annotation resources. Conclusion: Through instance-level selection and an explicitly diverse ensemble, IDEAL-M3D overcomes the inefficiency and depth bias of conventional active learning in monocular 3D detection and offers an efficient way to control annotation cost in practical deployments. Abstract: Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for 3D monocular object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Secondly, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked. To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on KITTI validation and test set results compared to training the same detector on the whole dataset.

[375] Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection

Zixuan Wang,Haoran Sun,Jiaming Lu,Wenxuan Wang,Zhongling Huang,Dingwen Zhang,Xuelin Qian,Junwei Han

Main category: cs.CV

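TL;DR: Proposes DGSPNet, an end-to-end language-prompt-driven framework for infrared small target detection that combines coarse- and fine-grained semantic prompts with text-guided channel and spatial attention, significantly improving detection accuracy.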

Details Motivation: Existing methods depend on manual annotations and use inaccurate text descriptions, which limits infrared small target detection performance. Method: Designs dual-granularity semantic prompts (coarse-grained priors plus fine-grained descriptions generated by visual-to-textual mapping within the image space) and proposes text-guided channel attention (TGCA) and text-guided spatial attention (TGSA) mechanisms, enabling end-to-end detection without annotation requirements. Result: Achieves state-of-the-art detection performance on three benchmark datasets, with significantly higher detection accuracy. Conclusion: DGSPNet uses language prompts effectively to enhance feature representation, overcomes background interference and annotation dependence, and offers a new approach to infrared small target detection. Abstract: Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but also can inherently leverage language prompts during inference without relying on any annotation requirements. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guide channel attention (TGCA) mechanism and text-guide spatial attention (TGSA) mechanism that enhance the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.

[376] Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang,Hanchi Sun,Wenjun Sun,Kejun Xue,Wangqiu Zhou,Jianbo Zhang,Wei Sun,Dandan Zhu,Xiongkuo Min,Jun Jia,Zhijun Fang

Main category: cs.CV

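TL;DR: This paper proposes a comprehensive evaluation framework for dataset watermarking of diffusion models, reveals the fragility of existing methods under realistic threat scenarios, and presents a practical method that fully removes the watermarks.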

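Details Motivation: Current fine-tuning techniques for diffusion models carry copyright and security risks. Dataset watermarking is used to improve traceability, but there is no unified evaluation framework for measuring its effectiveness. Method: Establishes a general threat model, proposes a comprehensive evaluation framework covering universality, transmissibility, and robustness, and designs a practical watermark removal method to test the robustness of existing watermarking schemes. Result: Experiments show that existing watermarking methods perform well in universality and transmissibility and are somewhat robust to common image processing, but remain insufficient under real-world threat scenarios. The proposed removal method fully eliminates the watermarks without affecting fine-tuning. Conclusion: Current dataset watermarking methods still have significant vulnerabilities against practical attacks, and more robust methods are urgently needed to meet real-world security challenges. Abstract: Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.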

[377] SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

Lingwei Dang,Zonghan Li,Juntong Li,Hongwen Zhang,Liang An,Yebin Liu,Qingyao Wu

Main category: cs.CV

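TL;DR: This paper proposes SyncMV4D, the first model that jointly generates synchronized multi-view hand-object interaction videos and 4D motions. It unifies visual priors, motion dynamics, and multi-view geometry to address the limits of existing methods in 3D perception and real-world generalization.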

Details Motivation: Existing hand-object interaction generation methods either suffer from weak 3D geometry perception because of single-view input, or depend on high-quality 3D data captured in laboratories and generalize poorly to real scenes. A new method that combines multi-view information with 4D motion modeling is therefore needed. Method: Proposes the SyncMV4D framework with two core modules. The Multi-view Joint Diffusion (MJD) model co-generates HOI videos and intermediate motions. The Diffusion Points Aligner (DPA) refines the coarse intermediate motion into globally aligned 4D metric point tracks. A closed-loop feedback mechanism between video generation and 4D motion refinement lets the two enhance each other. Result: Experiments show that the method outperforms current state-of-the-art methods in visual realism, motion plausibility, and multi-view consistency. Conclusion: SyncMV4D is the first to jointly generate multi-view HOI videos and 4D motions. By tightly coupling 2D appearance with 4D dynamics, it markedly improves generation quality and cross-view consistency and advances hand-object interaction generation in real-world scenes. Abstract: Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.

[378] SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Jiaming Zhang,Shengming Cao,Rui Li,Xiaotong Zhao,Yutao Cui,Xinglin Hou,Gangshan Wu,Haolan Chen,Yu Xu,Limin Wang,Kai Ma

Main category: cs.CV

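TL;DR: SteadyDancer is a new image-to-video animation framework that achieves high-fidelity first-frame identity preservation and precise motion control through a condition-reconciliation mechanism, synergistic pose modulation modules, and a staged decoupled-objective training pipeline.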

Details Motivation: The existing Reference-to-Video (R2V) paradigm handles the spatio-temporal misalignments of real-world scenes poorly, which leads to identity drift and visual artifacts and makes it hard to guarantee both identity preservation and motion precision. Method: Proposes the SteadyDancer framework: (1) a condition-reconciliation mechanism that harmonizes the conflicting conditions; (2) synergistic pose modulation modules that generate a pose representation compatible with the reference image; (3) a staged decoupled-objective training pipeline that hierarchically optimizes motion fidelity, visual quality, and temporal coherence. Result: Experiments show that SteadyDancer reaches state-of-the-art appearance fidelity and motion control while requiring significantly fewer training resources than comparable methods. Conclusion: SteadyDancer is the first to achieve robust first-frame identity preservation in the I2V paradigm while also delivering precise motion control and high-quality animation, providing a more efficient and reliable solution for human image animation. Abstract: Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.

[379] MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation

Farnoosh Koleini,Hongfei Xue,Ahmed Helmy,Pu Wang

Main category: cs.CV

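TL;DR: This paper proposes MonoMSK, a hybrid framework that combines data-driven learning with physics-based simulation to estimate biomechanically realistic 3D human motion from monocular video, and for the first time achieves precise monocular kinetics (force and torque) estimation.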

Details Motivation: Existing monocular methods use simplified, anatomically inaccurate models and ignore physics, which limits their biomechanical fidelity. A method is needed that recovers both kinematics and kinetics and remains biomechanically realistic. Method: MonoMSK combines transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers, builds a physics-constrained inverse-forward loop through ODE-based simulation, and introduces a forward-inverse consistency loss to align motion reconstruction with kinetic reasoning. Result: Experiments on the BML-MoVi, BEDLAM, and OpenCap datasets show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy and for the first time achieves precise monocular kinetics estimation. Conclusion: By fusing learning with physical simulation, MonoMSK jointly estimates high-fidelity 3D human motion and forces from monocular video, providing a new practical tool for biomechanical analysis. Abstract: Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.

[380] POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse

Anjie Le,Can Peng,Yuyuan Liu,J. Alison Noble

Main category: cs.CV

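TL;DR: This paper proposes POUR, a new method for machine unlearning at the representation level. It uses geometric projection to forget visual concepts effectively while preserving the model's other knowledge, and outperforms existing methods on several datasets.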

Details Motivation: Existing machine unlearning methods usually modify only the classifier and leave internal representations unchanged, so forgetting is incomplete. A method that forgets effectively at the representation level is therefore needed. Method: Building on Neural Collapse theory and the orthogonal projection property of the simplex Equiangular Tight Frame (ETF), the paper derives a provably optimal forgetting operator and designs the Representation Unlearning Score (RUS) to quantify forgetting. It then proposes POUR, with a closed-form variant (POUR-P) and a distillation-based feature-level variant (POUR-D). Result: Experiments on CIFAR-10/100 and PathMNIST show that POUR outperforms current state-of-the-art methods on both classification-level and representation-level forgetting while retaining a high level of preserved knowledge. Conclusion: POUR achieves provably optimal unlearning at the representation level, balances forgetting efficacy against knowledge retention, and provides new theoretical and practical tools for machine unlearning. Abstract: In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
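
The geometric claim in the abstract (a projected simplex ETF is still an ETF) can be checked numerically on a toy case. The sketch below is not the POUR operator itself. It assumes, purely for illustration, that forgetting a class means dropping its class mean and projecting the remaining class means onto the orthogonal complement of that mean. The helper names `simplex_etf`, `project_out`, and `pairwise_cosines` are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex ETF (pairwise cosine -1/(k-1))."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(vectors, forget_idx):
    """Drop one vector and project the rest onto its orthogonal complement.

    Assumption: this stands in for a forgetting step; it is not taken from the paper.
    """
    f = vectors[forget_idx]
    f_norm_sq = dot(f, f)
    kept = []
    for i, v in enumerate(vectors):
        if i == forget_idx:
            continue
        coef = dot(v, f) / f_norm_sq
        kept.append([a - coef * b for a, b in zip(v, f)])
    return kept


def pairwise_cosines(vectors):
    """Cosine of the angle between every pair of vectors."""
    cosines = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            norm = math.sqrt(dot(vectors[i], vectors[i]) * dot(vectors[j], vectors[j]))
            cosines.append(dot(vectors[i], vectors[j]) / norm)
    return cosines


k = 5
remaining = project_out(simplex_etf(k), forget_idx=0)
cosines = pairwise_cosines(remaining)
print(cosines)
```

Under this assumption, the k - 1 remaining vectors keep equal norms and a common pairwise cosine of -1/(k - 2), which is the simplex ETF angle for k - 1 classes.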

[381] Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

Qihan Huang,Haofei Zhang,Rong Wei,Yi Wang,Rui Tang,Mingli Song,Jie Song

Main category: cs.CV

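TL;DR: This paper proposes Syn-GRPO, a reinforcement learning method that uses an online data generator to synthesize high-quality, diverse training data, significantly improving multimodal large language models (MLLMs) on visual perception tasks.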

Details Motivation: Existing reinforcement learning methods for training MLLM perception suffer from low data quality and insufficient response diversity, which restricts the model's exploration, and existing remedies do not address the problem at its root. Method: Proposes Syn-GRPO, which consists of a data server and a GRPO workflow. The data server uses an image generation model to synthesize new samples from existing ones under a decoupled, asynchronous scheme. The GRPO workflow uses a diversity reward to supervise the MLLM in producing varied image descriptions, which increases response diversity. Result: Experiments on three visual perception tasks show that Syn-GRPO greatly improves data quality, significantly outperforms existing MLLM perception methods, and shows potential for long-term self-evolving reinforcement learning. Conclusion: By synthesizing diverse, high-quality data online, Syn-GRPO addresses the lack of data diversity in MLLM reinforcement learning and offers a feasible path toward continual self-evolution of MLLM perception. Abstract: RL (reinforcement learning) methods (e.g., GRPO) for MLLM (Multimodal LLM) perception ability have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experimental results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly superior performance to existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.

[382] CellFMCount: A Fluorescence Microscopy Dataset, Benchmark, and Methods for Cell Counting

Abdurahman Ali Mohammed,Catherine Fonder,Ying Wei,Wallapak Tavanapong,Donald S Sakaguchi,Qi Li,Surya K. Mallapragada

Main category: cs.CV

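TL;DR: This paper introduces a large-scale annotated cell counting dataset with 3,023 images and more than 430,000 manually annotated cell locations to advance research on automated cell counting. The dataset is challenging (high density, overlapping cells, diverse morphology). The paper systematically benchmarks existing methods and proposes SAM-Counter, a SAM-based adaptation that achieves a lower MAE than existing methods.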

Details Motivation: Manual cell counting is time-consuming and error-prone, so deep learning has become the route to automation, but it depends on large amounts of high-quality annotated data. Existing cell counting datasets are small and costly to annotate, which holds back model development, so a larger and more challenging public dataset is needed. Method: Builds a large-scale immunocytochemistry dataset of 3,023 images with over 430,000 annotated cells, covering high-density, overlapping, and morphologically diverse cells as well as staining variation. On this dataset, the paper benchmarks regression-based, crowd-counting, and cell-counting-specific methods, explores adapting the Segment Anything Model (SAM) to microscopy cell counting with dot annotations only, and proposes the density-map-based SAM-Counter. Result: Three categories of existing methods are evaluated on the test set. The proposed SAM-Counter reaches a mean absolute error (MAE) of 22.12, better than the second-best 27.46, which supports the feasibility of density-map-guided cell counting with SAM from dot annotations alone. Conclusion: The large, challenging dataset provides an important benchmark for cell counting algorithms, the systematic evaluation exposes the performance limits of current methods, and the success of SAM-Counter indicates the potential of general-purpose vision models in medical image analysis. Abstract: Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than 500 images. In this work, we introduce a large-scale annotated dataset comprising 3,023 images from immunocytochemistry experiments related to cellular differentiation, containing over 430,000 manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods: regression-based, crowd-counting, and cell-counting techniques on a test set with cell counts ranging from 10 to 2,126 cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of 22.12, which outperforms existing approaches (second-best MAE of 27.46). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.
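
For readers unfamiliar with density-map counting and the MAE metric quoted above, here is a minimal sketch. It assumes the usual convention that a predicted density map integrates (sums) to the cell count. The toy maps, the ground-truth counts, and the function names are made up for illustration and have nothing to do with the dataset's actual values.

```python
def count_from_density(density_map):
    """Predicted count = sum of the density map over all pixels."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    """MAE over images: average of |predicted count - true count|."""
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


# Two toy 2x3 density maps (hypothetical model outputs).
density_maps = [
    [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]],    # sums to 2.0
    [[0.25, 0.25, 0.5], [0.0, 1.0, 1.5]],  # sums to 3.5
]
true_counts = [2, 5]  # hypothetical dot-annotation counts

predicted_counts = [count_from_density(m) for m in density_maps]
mae = mean_absolute_error(predicted_counts, true_counts)
print(mae)  # 0.75
```

Because MAE is measured in cells per image, the 22.12 reported above should be read against the test-set range of 10 to 2,126 cells per image.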

[383] Growing with the Generator: Self-paced GRPO for Video Generation

Rui Li,Yuanzhi Liang,Ziqi Ni,Haibing Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

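TL;DR: Proposes Self-Paced GRPO, a competence-aware reinforcement learning framework whose dynamically evolving reward mechanism lets post-training of video generation models progress from visual fidelity to temporal coherence and fine-grained semantic alignment.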

Details Motivation: Existing GRPO pipelines rely on static reward models, which introduce distributional bias and reward saturation and limit training stability and effectiveness. Method: Designs a progressive reward mechanism that shifts the reward emphasis according to the generator's competence, forming a self-paced curriculum that alleviates reward-policy mismatch and reward exploitation. Result: Experiments on VBench with multiple video generation models show consistent gains in visual quality and semantic alignment over GRPO baselines with static rewards. Conclusion: By letting the reward co-evolve with the generator, Self-Paced GRPO improves the stability and effectiveness of reinforcement-learning-based alignment and generalizes well. Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
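
To make the idea of a competence-dependent reward concrete, the sketch below pairs the standard GRPO group-relative advantage (each reward standardized within its sampled group) with a made-up weighting schedule. The schedule in `reward_weights`, the competence scalar, and the three score names are assumptions for illustration only. The abstract does not specify how the paper's progressive reward is computed.

```python
import statistics


def reward_weights(competence):
    """Hypothetical schedule: as competence goes from 0 to 1, move weight from
    visual fidelity toward temporal coherence and semantic alignment.
    The three weights always sum to 1."""
    c = min(max(competence, 0.0), 1.0)
    return {"visual": 1.0 - 0.8 * c, "temporal": 0.4 * c, "semantic": 0.4 * c}


def combined_reward(scores, competence):
    """Weighted sum of per-aspect scores under the current weights."""
    weights = reward_weights(competence)
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps).
    Population std is used here; implementations differ on this choice."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, three sampled videos with hypothetical per-aspect scores.
group_scores = [
    {"visual": 0.9, "temporal": 0.2, "semantic": 0.3},
    {"visual": 0.6, "temporal": 0.7, "semantic": 0.6},
    {"visual": 0.5, "temporal": 0.9, "semantic": 0.8},
]

for competence in (0.0, 1.0):
    rewards = [combined_reward(s, competence) for s in group_scores]
    print(competence, group_relative_advantages(rewards))
```

In this toy group the first sample has the highest advantage at competence 0 (visual fidelity dominates) and the lowest at competence 1, which is the kind of shift in emphasis that the abstract describes.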

[384] DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Zehong Ma,Longhui Wei,Shuai Wang,Shiliang Zhang,Qi Tian

Main category: cs.CV

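TL;DR: This paper proposes DeCo, a new pixel diffusion framework that decouples the generation of high-frequency details from low-frequency semantics to improve efficiency and performance, with strong results on ImageNet and text-to-image generation.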

Details Motivation: Existing pixel diffusion models train and infer slowly because they model both high-frequency signals and low-frequency semantics in a single DiT, so a more efficient pixel diffusion paradigm is needed. Method: Proposes the frequency-decoupled pixel diffusion framework (DeCo), which uses a lightweight pixel decoder to generate high-frequency details under semantic guidance from the DiT, and designs a frequency-aware flow-matching loss that emphasizes visually salient frequencies and suppresses insignificant ones. Result: Reaches FID scores of 1.62 (256x256) and 2.22 (512x512) on ImageNet, better than existing pixel diffusion models. The pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Conclusion: Through frequency decoupling and a dedicated loss, DeCo achieves efficient, high-performing pixel diffusion, narrows the gap to latent diffusion methods, and is a promising new paradigm for pixel-level generation. Abstract: Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.

[385] An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification

Saniah Kayenat Chowdhury,Rusab Sarmun,Muhammad E. H. Chowdhury,Sohaib Bassam Zoghoul,Israa Al-Hashimi,Adam Mushtak,Amith Khandakar

Main category: cs.CV

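TL;DR: Proposes a hybrid deep learning framework that incorporates medical prior knowledge. It precisely segments the lungs and surrounding anatomy, quantifies tumor size and distance to neighboring organs, and applies rule-based lung cancer staging according to clinical guidelines, achieving high accuracy (91.36%) and interpretability and outperforming conventional end-to-end models on the Lung-PET-CT-Dx dataset.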

Details Motivation: Existing end-to-end deep learning methods for lung cancer staging overlook key spatial and anatomical information, cannot meet the TNM staging system's sensitivity to quantitative criteria such as tumor size and distance to adjacent structures, and lack interpretability. Method: Uses specialized encoder-decoder networks to finely segment the lobes, tumor, mediastinum, diaphragm, and other structures; extracts the largest tumor dimension and the distances to surrounding anatomical structures from the segmentation results; and makes the classification decision with a rule engine designed from the medical guidelines. Result: Achieves an overall classification accuracy of 91.36% on the Lung-PET-CT-Dx dataset, with per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), outperforming conventional deep learning models while providing a transparent decision process. Conclusion: The method is the first to embed explicit clinical context into tumor staging, combines high performance with interpretability, and offers reliable support for clinical decision-making. Abstract: Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that is central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
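
The final rule-based step can be pictured with a small sketch. This is not the authors' rule engine. The size cut-offs below follow my reading of the 8th-edition TNM T descriptors (3, 5, and 7 cm) and should be checked against whichever guideline the paper uses. Treating zero mask-to-mask distance to the mediastinum or diaphragm as invasion, the `contact_tol_mm` parameter, and the function name `t_stage` are all simplifying assumptions.

```python
def t_stage(largest_dim_cm, dist_to_mediastinum_mm, dist_to_diaphragm_mm,
            contact_tol_mm=0.0):
    """Assign a coarse T stage from measurements taken on segmentation masks.

    Assumptions (not from the paper):
      - contact with the mediastinum or diaphragm mask is treated as invasion -> T4
      - otherwise the stage is decided by the largest tumor dimension alone
    """
    touches_mediastinum = dist_to_mediastinum_mm <= contact_tol_mm
    touches_diaphragm = dist_to_diaphragm_mm <= contact_tol_mm
    if touches_mediastinum or touches_diaphragm:
        return "T4"
    if largest_dim_cm <= 3.0:
        return "T1"
    if largest_dim_cm <= 5.0:
        return "T2"
    if largest_dim_cm <= 7.0:
        return "T3"
    return "T4"


# Hypothetical measurements: (largest dimension in cm, distances in mm).
cases = [(2.5, 20.0, 30.0), (4.0, 12.0, 25.0), (6.5, 8.0, 15.0), (2.0, 0.0, 30.0)]
for case in cases:
    print(case, t_stage(*case))
```

A rule of this form is what makes the decision auditable: each predicted stage can be traced to one measurement crossing one threshold, which also shows why small segmentation errors near a cut-off can change the stage.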

[386] UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval

Maroun Ayli,Youssef Bakouny,Tushar Sharma,Nader Jalloul,Hani Seifeddine,Rima Kilany

Main category: cs.CV

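TL;DR: Proposes a graph-based representation that converts UI screenshots into attributed graphs and learns multimodal embeddings with a contrastive graph autoencoder, significantly improving the accuracy and efficiency of interface search.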

Details Motivation: For UI design consistency, pattern discovery, and compliance checking, existing methods lack explicit modeling of structural properties and cannot meet the needs of enterprise-scale interface management. Method: Converts UI screenshots into attributed graphs that encode hierarchical relationships and spatial layout, uses a contrastive graph autoencoder to learn embeddings that preserve multi-level visual, structural, and semantic similarity, and builds UISearch, a multimodal search framework that combines structural embeddings with semantic search. Result: On a dataset of 20,396 financial software interfaces, UISearch reaches 0.92 Top-5 accuracy with a median latency of 47.5ms (P95: 124ms) and supports complex queries and fine-grained distinction. Conclusion: The graph-based structural representation outperforms existing vision encoders in expressiveness and retrieval performance and offers a transferable paradigm for UI management and similar structured visual domains. Abstract: Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checking. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. The comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art vision encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency (P95: 124ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.

[387] BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation

Rachit Saluja,Asli Cihangir,Ruining Deng,Johannes C. Paetzold,Fengbei Liu,Mert R. Sabuncu

Main category: cs.CV

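TL;DR: This paper proposes BackSplit, a new paradigm that subdivides the background class to improve small-lesion segmentation, with significant gains and no added inference cost.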

Details Motivation: Conventional lesion segmentation assigns all non-lesion pixels to a single background class, ignoring complex anatomical structure and making small lesions hard to segment. Method: Introduces fine-grained background labels (BackSplit) that divide the background into distinct anatomical structures, and proves from an information-theoretic standpoint that this increases the Fisher information and improves training stability. Result: Experiments across multiple datasets and architectures show that BackSplit consistently improves small-lesion segmentation, even when the auxiliary labels are generated automatically by pretrained models. Conclusion: BackSplit is a simple, robust, and widely applicable method that significantly improves small-lesion segmentation by modeling the background better. Abstract: Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by designing better architectures, loss functions, or data augmentation schemes, or by collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous, composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models. In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information-theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating its robustness, simplicity, and broad applicability.
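
One way to see why sub-dividing the background need not change how lesions are evaluated: a model trained on fine-grained labels can have its prediction collapsed back to lesion versus everything else before scoring. The sketch below assumes a hypothetical label scheme (0 = other background, 1 and 2 = two organs, 3 = lesion) and scores the collapsed mask with the Dice coefficient. Neither the class ids nor the function names come from the paper.

```python
LESION_ID = 3  # hypothetical class id for the lesion in the fine-grained scheme


def collapse_to_binary(label_map, lesion_id=LESION_ID):
    """Map a fine-grained label map to a binary lesion mask (1 = lesion)."""
    return [[1 if c == lesion_id else 0 for c in row] for row in label_map]


def dice(pred_mask, target_mask):
    """Dice coefficient between two binary masks of the same shape."""
    intersection = sum(p * t for pr, tr in zip(pred_mask, target_mask)
                       for p, t in zip(pr, tr))
    total = sum(map(sum, pred_mask)) + sum(map(sum, target_mask))
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical per-pixel class prediction from a model trained with split background.
fine_pred = [
    [0, 1, 1],
    [2, 3, 3],
    [2, 3, 0],
]
# Ground-truth binary lesion mask.
target = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]

binary_pred = collapse_to_binary(fine_pred)
print(binary_pred)
print(dice(binary_pred, target))  # 0.8
```

The extra background classes only add output channels during training. The lesion mask used for scoring is read off in the same way as in the binary setting.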

[388] In-Video Instructions: Visual Signals as Generative Control

Gongfan Fang,Xinyin Ma,Xinchao Wang

Main category: cs.CV

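TL;DR: This paper proposes In-Video Instruction, a new paradigm for controllable image-to-video generation that embeds visual instructions (such as text, arrows, or trajectories) in the video frames. Compared with text-prompt-based control, it is spatially aware and gives explicit correspondences.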

Details Motivation: To explore whether large-scale video generation models can use visual signals embedded in frames as instructions for more precise, controllable image-to-video generation, especially in complex multi-object scenes where conventional text prompts are global and ambiguous. Method: Encodes user guidance directly in the visual domain, for example by overlaying text, arrows, or trajectories on the input image as instructions, and relies on existing state-of-the-art video generation models (Veo 3.1, Kling 2.5, and Wan 2.2) to interpret and execute them. Result: Experiments on three state-of-the-art video generation models show that the models reliably interpret and execute these visually embedded instructions, with stronger control and accuracy in complex scenarios such as multi-object interaction. Conclusion: In-Video Instruction offers an effective new route to controllable video generation and shows that placing control signals in the visual input itself works better than conventional global text prompts. Abstract: Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

[389] Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Yiming Qin,Bomin Wei,Jiaxin Ge,Konstantinos Kallidromitis,Stephanie Fu,Trevor Darrell,Xudong Wang

Main category: cs.CV

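TL;DR: This paper proposes the Chain-of-Visual-Thought (COVT) framework, which strengthens the dense perception of vision-language models by introducing continuous visual tokens and significantly improves performance on multiple perception benchmarks.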

Details Motivation: Existing vision-language models perform poorly on tasks that require dense visual perception, such as spatial reasoning and geometric understanding, because they lack effective mechanisms for capturing dense visual information across spatial dimensions. Method: COVT uses roughly 20 compact continuous visual tokens to distill knowledge from lightweight vision experts. During training, the model autoregressively predicts these tokens to reconstruct dense supervision signals (such as depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space. Result: Evaluation on more than ten perception benchmarks (including CV-Bench, MMVP, and RealWorldQA) shows that integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%. Conclusion: Compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence. Abstract: Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

[390] SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation

Tianrun Chen,Runlong Cao,Xinda Yu,Lanyun Zhu,Chaotao Ding,Deyi Ji,Cheng Chen,Qi Zhu,Chunyan Xu,Papa Mao,Ying Zang

Main category: cs.CV

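TL;DR: This paper proposes SAM3-Adapter, the first adapter framework designed for the new-generation Segment Anything 3 (SAM3). It significantly improves performance on fine-grained, low-level segmentation tasks (such as medical image segmentation, camouflaged object detection, and shadow detection), with higher accuracy, stronger generalization, and lower computational overhead.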

Details Motivation: Although SAM and its successors perform well on general image segmentation, they fall short on fine-grained low-level vision tasks and struggle with segmenting camouflaged objects, cell structures, shadows, and similar difficult cases. Method: Following the earlier SAM-Adapter idea, the paper designs SAM3-Adapter, a dedicated adapter framework for the new SAM3 architecture. It uses a modular, composable design to improve adaptability to downstream tasks and an optimized integration strategy to improve segmentation accuracy and efficiency. Result: SAM3-Adapter surpasses existing SAM-based and SAM2-based methods on multiple challenging downstream tasks, sets new state-of-the-art results, reduces computational overhead, and shows stronger robustness and generalization. Conclusion: SAM3-Adapter unlocks the potential of SAM3 for fine-grained segmentation tasks and provides an efficient, flexible foundation for future research and practical applications. Abstract: The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations-including SAM and its successor-still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3)-a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline-we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.

[391] Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

Yun Zhou,Yaoting Wang,Guangquan Jie,Jinyu Liu,Henghui Ding

Main category: cs.CV

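TL;DR: Proposes Ref-SAM3D, an extension of SAM3D that incorporates textual descriptions, enabling zero-shot 3D reconstruction from a single RGB image and natural language.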

Details Motivation: SAM3D cannot reconstruct specific objects referred to by textual descriptions, which limits its use in practical applications such as 3D editing and game development, so a text-guidance mechanism is needed. Method: Builds on SAM3D by introducing textual descriptions as a high-level prior, yielding the Ref-SAM3D framework for single-view 3D reconstruction that fuses textual and visual information. Result: Experiments show that Ref-SAM3D, relying only on natural language and a single 2D view, achieves competitive, high-fidelity zero-shot reconstruction and effectively connects 2D visual cues with 3D geometric understanding. Conclusion: Ref-SAM3D offers a simple and effective solution for text-guided 3D reconstruction, improving the flexibility and accessibility of reference-guided 3D reconstruction. Abstract: SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.

[392] Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Dingkang Liang,Cheng Zhang,Xiaopeng Xu,Jianzhong Ju,Zhenbo Luo,Xiang Bai

Main category: cs.CV

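TL;DR: Proposes the ORS3D task and the ORS3D-60K dataset, which combine natural language understanding, 3D spatial grounding, and operations research optimization to improve the task scheduling efficiency of embodied agents in real-world scenes.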

Details Motivation: Existing task scheduling datasets ignore operations research knowledge and 3D spatial grounding, so task planning is oversimplified and fails to reflect the complexity of parallel operations and spatial constraints in the real world. Method: Constructs ORS3D-60K, a large-scale dataset of 60K composite tasks, and proposes GRANT, a multi-modal large language model equipped with a scheduling token mechanism for efficient scheduling and action generation. Result: Experiments on ORS3D-60K show that GRANT performs well on language understanding, 3D grounding, and scheduling efficiency, and can exploit parallelizable subtasks to reduce total completion time. Conclusion: By combining operations research knowledge with 3D spatial awareness, GRANT improves the task scheduling ability of embodied agents in complex environments and points to a new direction for efficient embodied interaction. Abstract: Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
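
The scheduling objective in this entry (minimizing total completion time by overlapping subtasks, e.g., cleaning the sink while the microwave operates) can be illustrated with a toy model. The sketch below is not GRANT or its scheduling token mechanism; it assumes a hypothetical representation in which each subtask has an `active` duration (the agent is busy) and a `wait` duration (an appliance runs unattended), and it uses the classic heuristic of starting the subtasks with the longest unattended wait first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    # Hypothetical fields, introduced only for this illustration.
    name: str
    active: int  # minutes the agent must actively spend
    wait: int    # minutes the subtask then runs on its own (0 if none)


def sequential_time(tasks: List[Subtask]) -> int:
    """Total time if the agent idles through every wait period."""
    return sum(t.active + t.wait for t in tasks)


def parallel_time(tasks: List[Subtask]) -> int:
    """Total time if the agent works on other subtasks during waits.

    Subtasks are started in order of decreasing wait time, so long
    unattended phases overlap with the remaining active work.
    """
    clock = 0      # time at which the agent becomes free again
    makespan = 0   # time at which the last subtask fully completes
    for t in sorted(tasks, key=lambda s: s.wait, reverse=True):
        clock += t.active
        makespan = max(makespan, clock + t.wait)
    return makespan


tasks = [
    Subtask("clean the sink", active=4, wait=0),
    Subtask("heat food in the microwave", active=1, wait=5),
    Subtask("wipe the table", active=2, wait=0),
]
print(sequential_time(tasks), parallel_time(tasks))
```

In this toy instance the microwave is started first, and the sink and table are handled while it runs, so the overlapped schedule finishes well before the sequential one.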

[393] Cloud4D

Jacob Lin,Edward Gryspeerdt,Ronald Clark

Main category: cs.CV

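TL;DR: Cloud4D is a learning-based framework that reconstructs a physically consistent four-dimensional cloud state from ground-based cameras, improving space-time resolution by an order of magnitude over existing satellite data while keeping relative error below 10%.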

Details Motivation: Existing global weather models have limited resolution and struggle to model individual clouds and extreme weather phenomena accurately, so higher-resolution observations are needed. Method: Proposes the Cloud4D framework, which uses a homography-guided 2D-to-3D transformer to infer the 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution from synchronized ground-based camera images, and estimates horizontal wind vectors by tracking how that distribution changes over time. Result: In a two-month deployment with six cameras, the system achieved an order-of-magnitude improvement in space-time resolution over existing satellite measurements while keeping relative error below 10% against collocated radar measurements. Conclusion: Cloud4D offers a feasible route to high-resolution weather modeling and shows the potential of reconstructing four-dimensional cloud fields from ground-based visual data. Abstract: There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.

[394] Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Yasin Esfandiari,Stefan Bauer,Sebastian U. Stich,Andrea Dittadi

Main category: cs.CV

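TL;DR: Proposes a plug-and-play sampling method that needs no retraining: it switches between two pretrained diffusion models (an image-quality expert and a likelihood expert) during denoising to improve both the quality and the likelihood of generated images.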

Details Motivation: Diffusion models for image generation exhibit a trade-off between perceptual quality and data likelihood, and existing training objectives struggle to serve both the high-noise and low-noise stages. Method: Switches between two pretrained experts along the denoising trajectory: the image-quality expert is used at high noise levels to build global structure, and the likelihood expert takes over at low noise levels to refine pixel statistics. Only an intermediate switching step has to be chosen. Result: On CIFAR-10 and ImageNet32, the method matches or outperforms each individual expert on both likelihood and sample quality. Conclusion: Switching experts across noise levels is an effective way to break the likelihood-quality trade-off in diffusion models. Abstract: Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
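
The switching rule described in this entry can be sketched as a short sampling loop. This is a minimal illustration under stated assumptions, not the authors' code: the `Expert` callable interface, the function name `sample_with_expert_switch`, and the scalar toy experts are all hypothetical stand-ins for real pretrained denoisers, and timesteps are assumed to run from `num_steps - 1` (most noise) down to `0` (least noise).

```python
from typing import Callable, List, Tuple

# Hypothetical interface: an expert maps (current sample, timestep) to the
# sample at the next, less noisy timestep.
Expert = Callable[[float, int], float]


def sample_with_expert_switch(
    x_init: float,
    num_steps: int,
    quality_expert: Expert,
    likelihood_expert: Expert,
    switch_step: int,
) -> Tuple[float, List[str]]:
    """Denoise from t = num_steps - 1 down to t = 0, switching experts once.

    Timesteps t >= switch_step (high noise) use the image-quality expert;
    timesteps t < switch_step (low noise) use the likelihood expert.
    """
    x = x_init
    used: List[str] = []  # records which expert handled each step
    for t in range(num_steps - 1, -1, -1):
        if t >= switch_step:
            x = quality_expert(x, t)
            used.append("quality")
        else:
            x = likelihood_expert(x, t)
            used.append("likelihood")
    return x, used


# Toy stand-ins for pretrained models, for illustration only.
def toy_quality(x: float, t: int) -> float:
    return x * 0.5


def toy_likelihood(x: float, t: int) -> float:
    return x - 1.0


final, schedule = sample_with_expert_switch(8.0, 4, toy_quality, toy_likelihood, 2)
print(final, schedule)
```

The only quantity to choose is `switch_step`, which mirrors the abstract's point that no retraining or fine-tuning is involved; setting it to `0` or to `num_steps` recovers sampling from a single expert.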

[395] Are Image-to-Video Models Good Zero-Shot Image Editors?

Zechuan Zhang,Zhenyuan Chen,Zongxin Yang,Yi Yang

Main category: cs.CV

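TL;DR: IF-Edit is a tuning-free framework that uses pretrained image-to-video diffusion models for instruction-driven image editing, improving results through prompt enhancement, temporal latent dropout, and self-consistent post-refinement.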

Details Motivation: To explore the potential of large-scale video diffusion models for zero-shot image editing and to address the problems that arise when they are applied to it directly: prompt misalignment, redundant temporal latents, and blurry late-stage frames. Method: Proposes the IF-Edit framework, which has three parts: (1) a chain-of-thought prompt enhancement module that turns static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point to accelerate denoising; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Result: Experiments on four public benchmarks show that IF-Edit performs well on non-rigid editing, physical and temporal reasoning, and general instruction editing, with particularly strong results on reasoning-centric tasks while remaining competitive on general-purpose edits. Conclusion: Video diffusion models can be used effectively for zero-shot image editing, and IF-Edit provides a simple, unified recipe for video-image generative reasoning. Abstract: Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.

[396] VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Qiang Wang,Xinyuan Gao,SongLin Dong,Jizhou Han,Jiangyang Li,Yuhang He,Yihong Gong

Main category: cs.CV

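TL;DR: Proposes VDC-Agent, a self-evolving framework for detailed video captioning that needs neither human annotations nor a large teacher model. Through a closed loop of generation, scoring, and prompt refinement on unlabeled videos, it achieves state-of-the-art performance.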

Details Motivation: To improve the quality and automation of detailed video captioning without human annotations or large teacher models. Method: Uses a self-evolving framework with a closed loop of caption generation, principle-guided scoring, and prompt refinement, plus a self-reflection mechanism that corrects regressions. The generated trajectories are used to build the preference dataset VDC-Agent-19K, on which the MLLM is fine-tuned with easy-to-hard curriculum direct preference optimization. Result: Reaches 49.08% average accuracy and a score of 2.50 on the VDC benchmark, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost. Conclusion: VDC-Agent enables efficient self-evolution of video captioning without human annotations or teacher models, substantially improves the base model, and has potential for practical use. Abstract: We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
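
The abstract says that trajectories of (caption, score) pairs are converted into preference tuples and then used in an easy-to-hard curriculum, but it does not spell out the conversion rule. The sketch below shows one plausible reading, with every detail an assumption: the highest-scoring caption in a trajectory is taken as `chosen`, the lowest-scoring as `rejected`, and "easy" is interpreted as a large score gap. The function name and dictionary keys are hypothetical.

```python
from typing import Dict, List, Tuple

# A trajectory is the list of (caption, score) pairs produced for one video.
Trajectory = List[Tuple[str, int]]


def build_preference_pairs(trajectories: Dict[str, Trajectory]) -> List[dict]:
    """Turn per-video trajectories into preference pairs, easiest first.

    Assumed rule (not taken from the paper): chosen = best-scoring caption,
    rejected = worst-scoring caption; trajectories with fewer than two
    captions or with no score difference are skipped.
    """
    pairs = []
    for video_id, steps in trajectories.items():
        if len(steps) < 2:
            continue
        best = max(steps, key=lambda s: s[1])
        worst = min(steps, key=lambda s: s[1])
        if best[1] == worst[1]:
            continue  # no preference signal
        pairs.append({
            "video": video_id,
            "chosen": best[0],
            "rejected": worst[0],
            "gap": best[1] - worst[1],
        })
    # Assumed curriculum order: large gaps (clear preferences) come first.
    pairs.sort(key=lambda p: p["gap"], reverse=True)
    return pairs


demo = {
    "v1": [("a blurry street", 40), ("a rainy street at night", 70), ("a street", 55)],
    "v2": [("a dog runs", 60), ("a dog runs on grass", 65)],
    "v3": [("a single caption", 50)],
}
pairs = build_preference_pairs(demo)
print([(p["video"], p["gap"]) for p in pairs])
```

Pairs in this form match the chosen/rejected structure that direct preference optimization trains on; the ordering step is only one way to realize an easy-to-hard schedule.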

[397] LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context

Jingzhi Bao,Hongze Chen,Lingting Zhu,Chenyu Liu,Runze Zhang,Keyang Luo,Zeyu Hu,Weikai Chen,Yingda Yin,Xin Wang,Zehong Lin,Jun Zhang,Xiaoguang Han

Main category: cs.CV

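TL;DR: LumiTex is an end-to-end framework for generating high-quality, illumination-aware PBR material textures, addressing material decomposition and view-consistent texture completion.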

Details Motivation: Existing methods struggle to decompose materials from image prompts under limited illumination cues and cannot guarantee seamless, view-consistent texture completion. Method: Proposes the LumiTex framework, which comprises a multi-branch generation scheme, a lighting-aware material attention mechanism, and a geometry-guided inpainting module based on a large view synthesis model. Result: Experiments show that LumiTex surpasses existing open-source and commercial methods in texture quality, achieving state-of-the-art performance. Conclusion: LumiTex improves both material decomposition and view consistency in PBR texture generation and is suited to high-realism rendering applications. Abstract: Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.