cs.CL

Manil Shrestha, Edward Kim

Main category: cs.CL

TL;DR: Proposes two hybrid algorithms that combine symbolic structure with learned representations, significantly improving the efficiency and accuracy of multi-hop knowledge graph question answering while keeping answers verifiable, without requiring massive language models at inference time.

Details

Motivation: Multi-hop question answering over knowledge graphs is computationally hard because the number of candidate reasoning paths grows combinatorially. Existing methods rely on expensive LLM inference for entity linking and path ranking, and generated answers lack verifiable grounding.

Method: Two complementary hybrid algorithms are proposed: (1) LLM-Guided Planning, which predicts a relation sequence with a single LLM call and executes it via breadth-first search; (2) Embedding-Guided Neural Search, which eliminates LLM calls entirely and uses a lightweight edge scorer that fuses text and graph embeddings. Planning capability is also compressed into a small model through knowledge distillation.

Result: On MetaQA, LLM-Guided Planning reaches micro-F1 > 0.90, Embedding-Guided Neural Search achieves a speedup of more than 100 times with comparable accuracy, and structured planning transfers better than direct answer generation.

Conclusion: Verifiable multi-hop reasoning does not depend on large models at inference time; what matters is the right architectural inductive bias combining symbolic structure with learned representations.

Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
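To make the "relation sequence executed via breadth-first search" idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the triple format, the function names `build_index` and `execute_plan`, and the toy movie triples are all assumptions made for illustration, and the relation plan is hard-coded where the paper would obtain it from a single LLM call.

```python
from collections import defaultdict

def build_index(triples):
    """Index (head, relation, tail) triples as head -> relation -> set of tails."""
    index = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        index[head][relation].add(tail)
    return index

def execute_plan(index, start_entity, relation_plan):
    """Expand the frontier one hop per planned relation (level-by-level BFS)."""
    frontier = {start_entity}
    for relation in relation_plan:
        next_frontier = set()
        for entity in frontier:
            # .get avoids creating empty entries for entities without outgoing edges
            next_frontier |= index.get(entity, {}).get(relation, set())
        frontier = next_frontier
        if not frontier:
            break  # the plan cannot be satisfied in this graph
    return frontier

# Hypothetical toy graph; entity and relation names are illustrative only.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Inception"),
    ("Christopher Nolan", "directed", "Interstellar"),
]
INDEX = build_index(TRIPLES)
ANSWERS = execute_plan(INDEX, "Inception", ["directed_by", "directed"])
```

Because every returned entity is reached by following stored edges, each answer can be traced back to triples in the graph, which is the grounding property the abstract emphasizes.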

[2] Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian

Mobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi, Behnam Bahrak

Main category: cs.CL

TL;DR: Evaluates the faithfulness of the explanations that large language models (LLMs) generate for emotion classification in Persian. Although classification performance is strong, the explanations do not align with human judgments, pointing to the limits of current explanation methods in low-resource languages.

Details

Motivation: The faithfulness of LLM self-explanations in low-resource languages is an open concern. For Persian emotion classification in particular, it is unclear whether existing explanations reflect the model's actual reasoning.

Method: Model confidence scores derived from token-level log-probabilities are compared with the influential words identified by human annotators, under two prompting strategies (Predict-then-Explain and Explain-then-Predict).

Result: LLMs classify well, but their explanations agree poorly with human judgments. Explanations from different models resemble each other more than they resemble human annotations, indicating a lack of faithfulness.

Conclusion: Explanations generated by current LLMs may be unreliable in low-resource settings, which calls for more robust methods and metrics to make model explanations trustworthy in multilingual contexts.

Abstract: Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
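The abstract does not say how token-level log-probabilities are aggregated or how agreement with annotators is scored, so the following Python sketch fills both gaps with common choices that are assumptions here: a geometric-mean token probability as the confidence score, and Jaccard overlap between word sets as the agreement measure. The function names are hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability, i.e. exp(mean log-probability)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def word_overlap(model_words, human_words):
    """Jaccard overlap between model-identified and human-identified words."""
    model_set, human_set = set(model_words), set(human_words)
    union = model_set | human_set
    if not union:
        return 0.0  # no words on either side: treat as no measurable agreement
    return len(model_set & human_set) / len(union)
```

For example, two tokens that each have probability 0.5 give a confidence of 0.5, and the word sets {"a", "b"} and {"b", "c"} overlap with a score of 1/3.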

[3] Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

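Richard J. Young, Alice M. Matthews

Main category: cs.CL

TL;DR: Evaluates ten transformer-based embedding models in the cardiology domain and finds that encoder architectures, BioLinkBERT in particular, perform better and use fewer resources after LoRA fine-tuning. This challenges the "bigger is better" assumption. All resources are released to support reproducible research.

Details

Motivation: Domain-specific text embeddings are critical for clinical natural language processing, but systematic comparisons across model architectures remain limited.

Method: Ten transformer-based embedding models are fine-tuned with Low-Rank Adaptation (LoRA) on 106,535 cardiology text pairs drawn from authoritative medical textbooks and then evaluated systematically.

Result: Encoder-only architectures, particularly BioLinkBERT, achieve the best domain-specific performance (separation score: 0.510), outperforming larger decoder-based models while requiring significantly fewer computational resources.

Conclusion: Larger language models do not necessarily produce better domain-specific embeddings. Encoder architectures are more efficient and practical for clinical NLP, which gives concrete guidance for system development.

Abstract: Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.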

[4] What does it mean to understand language?

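Colton Casto, Anna Ivanova, Evelina Fedorenko, Nancy Kanwisher

Main category: cs.CL

TL;DR: Argues that because processing within the brain's core language system is limited, deeply understanding language requires passing information from the language system to other brain regions responsible for perception, motor representations, mental model construction, and the storage of world knowledge and memories.

Details

Motivation: To examine the neural basis of language understanding beyond surface-level meaning and to identify the cross-region cooperation that deep understanding requires.

Method: Reviews existing evidence from cognitive neuroscience and, drawing on recent concepts and methods, proposes a new strategy for testing the hypothesis directly.

Result: The evidence supports the view that language understanding depends on exporting information from the language system to other brain regions (such as those involved in perception, motor function, and memory) and processing it jointly.

Conclusion: Deep language understanding is not only a process internal to the language system but an integrative cognitive process that depends on cooperation among multiple brain regions.

Abstract: Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain's core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.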

[5] Gender Bias in Emotion Recognition by Large Language Models

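Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam

Main category: cs.CL

TL;DR: Studies whether large language models show gender bias on emotional theory of mind tasks and compares debiasing strategies, finding that training-based interventions are more effective than prompt engineering alone.

Details

Motivation: As large language models (LLMs) become part of daily life, evaluating and ensuring their fairness matters more. This work focuses on possible gender bias in emotional theory of mind tasks.

Method: LLMs are given a description of a person and their environment and asked "How does this person feel?". Gender bias in the outputs is measured systematically, and several debiasing strategies (training-time interventions and inference-time prompt engineering) are compared.

Result: LLMs exhibit significant gender bias in emotion inference. Inference-time prompt engineering alone does little to mitigate it, whereas training-based interventions reduce bias more substantially.

Conclusion: Effectively reducing gender bias in LLMs on emotional theory of mind tasks requires training-stage interventions rather than reliance on inference-time prompt design alone.

Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.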

[6] Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions

Saif M. Mohammad

Main category: cs.CL

TL;DR: Introduces the NRC VAD Lexicon v2, which adds emotion association ratings for 10,000 multiword expressions (MWEs) and their constituent words and improves coverage of words that have become more common in recent years. The ratings are shown to be highly reliable, and the emotional characteristics and emotional compositionality of MWEs are examined. The lexicon supports research in NLP, psychology, public health, digital humanities, and the social sciences.

Details

Motivation: To fill gaps in existing lexicons, which lack emotion ratings for multiword expressions and recently common words, and thereby make emotion analysis more complete and accurate.

Method: Human ratings of valence, arousal, and dominance are collected for 10,000 English multiword expressions and their constituent words, and unigram coverage is extended, especially for words that have become more common since 2018.

Result: The new NRC VAD Lexicon v2 contains ratings for 10,000 MWEs and 25,000 words. The ratings are highly reliable and can be used to analyze the emotional intensity and emotional compositionality of MWEs.

Conclusion: The NRC VAD Lexicon v2 substantially increases the scale and usefulness of emotion lexicon resources for affective computing and language research across disciplines.

Abstract: Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has human ratings of valence, arousal, and dominance for 10k English Multiword Expressions (MWEs) and their constituent words. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new NRC VAD Lexicon v2 now has entries for 10k MWEs and 25k words, in addition to the entries in v1. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html

[7] Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana

Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife, Turgay Celik

Main category: cs.CL

TL;DR: Proposes a language-independent automatic sentiment labelling method that uses emojis and sentiment words to cut the cost of annotating sentiment data for low-resource African languages, reaching labelling accuracy above 60% on English, Sepedi, and Setswana tweets.

Details

Motivation: Many African languages are considered low-resource because they lack text labelled with sentiment classes. Manual labelling is slow and expensive, so efficient automatic labelling methods are needed.

Method: An automatic, language-independent sentiment labelling method based on sentiment-bearing emojis and words is proposed and evaluated on tweets from SAfriSenti, a multilingual sentiment corpus.

Result: Labelling accuracy is 66% for English tweets, 69% for Sepedi, and 63% for Setswana, so on average only 34% of the automatically generated labels need correction.

Conclusion: The proposed automatic sentiment labelling method effectively reduces manual annotation effort and is suitable for sentiment analysis in low-resource languages.

Abstract: Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason for that is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible, making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
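A rough Python sketch of distant supervision from emojis and sentiment words follows. The cue lists, the majority-vote rule, and the function names are assumptions for illustration; the paper's actual lexicons and decision rule are not given in the abstract. The last helper only reproduces the arithmetic behind the reported 34% figure.

```python
# Hypothetical cue lists; a real system would use curated emoji and word lexicons.
POSITIVE_EMOJIS = {"😊", "😍", "👍"}
NEGATIVE_EMOJIS = {"😡", "😢", "👎"}
POSITIVE_WORDS = {"good", "love", "great"}
NEGATIVE_WORDS = {"bad", "hate", "awful"}

def count_cues(text, emojis, words):
    """Count emoji occurrences anywhere in the text plus whole-word lexicon hits."""
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    emoji_hits = sum(text.count(emoji) for emoji in emojis)
    word_hits = sum(token in words for token in tokens)
    return emoji_hits + word_hits

def distant_label(text):
    """Assign the class with more cues; ties and cue-free texts become neutral."""
    positive = count_cues(text, POSITIVE_EMOJIS, POSITIVE_WORDS)
    negative = count_cues(text, NEGATIVE_EMOJIS, NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"

def share_left_to_correct(accuracies_percent):
    """Percentage of automatic labels still wrong, given per-language accuracies."""
    mean_accuracy = sum(accuracies_percent) / len(accuracies_percent)
    return 100 - mean_accuracy
```

With the accuracies reported above, `share_left_to_correct([66, 69, 63])` averages to 66% correct and therefore leaves 34% of labels to fix, matching the abstract.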

[8] Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs

Shi-Wei Dai, Yan-Wei Shie, Tsung-Huan Yang, Lun-Wei Ku, Yung-Hui Li

Main category: cs.CL

TL;DR: Proposes the PersonaPulse framework, which dynamically optimizes role-play prompts and scores them with a situational response benchmark to make personality expression in large language models more realistic and context-relevant. Experiments show that it outperforms prompts designed from psychological studies, and the work also explores how model size relates to personality modeling and how pausing optimization affects specific personality traits.

Details

Motivation: Prior work has not optimized prompts to maximize personality expression and lacks a realistic, situational evaluation mechanism.

Method: The PersonaPulse framework uses LLMs' inherent knowledge of personality traits to iteratively enhance role-play prompts, with a situational response benchmark serving as the scoring tool that guides optimization.

Result: Quantitative evaluation shows that prompts generated by PersonaPulse outperform earlier prompts designed from psychological descriptions. Model size affects personality modeling, and for some traits the degree of expression can be controlled by pausing the optimization process.

Conclusion: Prompt optimization is essential for shaping personality expression in LLMs, and PersonaPulse offers an effective path toward more realistic, adaptive AI interactions.

Abstract: Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs' inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.

[9] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

Main category: cs.CL

TL;DR: Evaluates zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for medical error processing, and finds that RDP beats the other methods by raising recall, lowering the false-positive rate, and producing more accurate corrections.

Details

Motivation: Clinical documentation contains factual, diagnostic, and management errors that can endanger patient safety. How well large language models (LLMs) detect and correct such errors, and how their behavior differs across prompting strategies, remains unclear.

Method: Using the MEDEC dataset, nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models) are evaluated on three subtasks: error flag detection, error sentence detection, and error correction. Performance is measured with accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore. Model outputs are also analyzed for failure modes and for differences from clinician reasoning.

Result: Zero-shot prompting has low recall in the detection tasks and often misses abbreviation-heavy or atypical errors. SPR improves recall but raises FPR. Across all nine LLMs, RDP reduces FPR by about 15 percent, improves recall by 5 to 10 percent in error sentence detection, and generates more contextually accurate corrections.

Conclusion: Across diverse LLMs, retrieval-augmented dynamic prompting outperforms zero-shot prompting and static prompting with random exemplars. Retrieved exemplars improve detection accuracy, reduce false positives, and make medical error correction more reliable.

Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
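Since the results are reported in terms of recall and false-positive rate, a small Python sketch of how the two are computed for the binary error-flag task may help. The standard definitions are used (recall = TP / (TP + FN), FPR = FP / (FP + TN)); the function name and the boolean label encoding are assumptions, not the MEDEC evaluation code.

```python
def recall_and_fpr(gold_flags, predicted_flags):
    """Return (recall, false-positive rate) for binary error-flag predictions.

    True means "this note contains an error"; both lists must be aligned.
    """
    if len(gold_flags) != len(predicted_flags):
        raise ValueError("gold and predicted lists must have the same length")
    pairs = list(zip(gold_flags, predicted_flags))
    true_pos = sum(1 for gold, pred in pairs if gold and pred)
    false_neg = sum(1 for gold, pred in pairs if gold and not pred)
    false_pos = sum(1 for gold, pred in pairs if not gold and pred)
    true_neg = sum(1 for gold, pred in pairs if not gold and not pred)
    # Guard against empty classes so the function never divides by zero.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    fpr = false_pos / (false_pos + true_neg) if (false_pos + true_neg) else 0.0
    return recall, fpr
```

The two numbers pull in opposite directions when a prompt makes a model flag more notes, which is consistent with the reported pattern of SPR raising recall and FPR together.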

[10] AppSelectBench: Application-Level Tool Selection Benchmark

Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida

Main category: cs.CL

TL;DR: Introduces AppSelectBench, a new benchmark for evaluating the application selection ability of Computer Using Agents (CUAs), filling a gap in the evaluation of cross-application reasoning.

Details

Motivation: Existing benchmarks focus on fine-grained API selection and say little about whether models can reason across and choose between different applications, so an evaluation framework dedicated to application selection is needed.

Method: AppSelectBench is built as a large-scale benchmark covering one hundred widely used desktop applications, with a pipeline that generates realistic, diverse, and semantically rich user tasks and unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings.

Result: Experiments on closed-source and open-source large language models show that even state-of-the-art models still choose applications inconsistently, exposing systematic weaknesses.

Conclusion: AppSelectBench provides a foundation for studying and improving application-level reasoning in intelligent computer-using agents.

Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.

[11] $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers

Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian, Lei Ding, Jijun Chi, Hailin He, Tung Sum Thomas Kwok, Yufei Cui, Sicheng Lyu, Muzhi Li, Mingze Li, Xinyue Yu, Ling Zhou, Peng Lu

Main category: cs.CL

TL;DR: Proposes R2R, a domain-aware reranking framework that combines dynamic expert routing with a two-stage training strategy. Entity abstraction improves generalization and helps avoid overfitting and catastrophic forgetting, giving strong performance across legal, medical, and financial domains.

Details

Motivation: Generalist decoder-only rerankers miss domain-specific details in high-stakes fields such as finance and law, and direct fine-tuning tends to cause surface-form overfitting and catastrophic forgetting.

Method: The R2R framework has two core components. Entity Abstraction for Generalization (EAG) masks the most predictive surface cues so the model cannot rely on specific entities. A lightweight Latent Semantic Router uses internal representations from the frozen backbone decoder to dynamically select the best LoRA expert.

Result: Experiments across several reranker backbones and domains (legal, medical, financial) show that R2R consistently outperforms generalist models and single-domain fine-tuned baselines, with good cross-domain robustness.

Conclusion: R2R is a model-agnostic, modular approach to domain specialization that improves decoder-only rerankers in specialized domains while preserving generalization.

Abstract: Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.

[12] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

Mihir Sahasrabudhe

Main category: cs.CL

TL;DR: Using a fully synthetic, entropy-controlled benchmark, this paper shows that the causal Transformer architecture itself has a directional optimization deficit: the "reversal curse" persists even after linguistic priors and corpus statistics are removed.

Details

Motivation: Prior studies report a "reversal curse" in natural language processing, but it is unclear whether such directional failures come from the statistics of language or from the model architecture. This work uses controlled experiments to test whether Transformers have an intrinsic directional preference.

Method: A clean synthetic benchmark is built from random string mappings with a tunable branching factor K, yielding forward tasks with zero conditional entropy and inverse tasks with a theoretical entropy floor. GPT-2 and an MLP are then evaluated on directional learning with no linguistic priors.

Result: GPT-2 models trained from scratch show a strong, reproducible directional optimization gap (for example, 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initialization changes optimization behavior but does not eliminate the gap, and LoRA hits a capacity bottleneck on high-entropy inverse tasks.

Conclusion: Causal Transformer training carries an intrinsic, language-independent directional friction, which suggests that the difficulty of learning inverse mappings is architectural. The result motivates deeper study of why Transformers struggle with inverse tasks.

Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training, one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
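The notion of an entropy floor can be illustrated with a short Python sketch. It assumes, as a simplification that the abstract does not confirm, that each output string has exactly K equally likely preimages, so the best achievable loss on the inverse task is ln K nats per example, while the forward task has a floor of zero. The function names and the per-example normalization are hypothetical.

```python
import math

def inverse_entropy_floor(branching_factor):
    """Entropy floor in nats, assuming K equally likely preimages per output."""
    if branching_factor < 1:
        raise ValueError("branching factor must be at least 1")
    return math.log(branching_factor)

def excess_loss(measured_loss_nats, branching_factor):
    """Loss above the floor; a perfectly trained model would approach zero."""
    return measured_loss_nats - inverse_entropy_floor(branching_factor)
```

Under this assumption the floor at K=5 is ln 5, roughly 1.609 nats, and K=1 makes the inverse task deterministic with a floor of zero. Subtracting the floor is what lets forward and inverse runs be compared on the same scale.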

[13] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

Main category: cs.CL

TL;DR: Proposes a multiclass classification framework for detecting ten mental health and cyberbullying categories in social media, combining the domain-adapted MentalBERT model with the SHAPLLM explainability tool to achieve strong performance and support human-in-the-loop screening.

Details

Motivation: Mental health problems and cyberbullying are increasingly prevalent in digital spaces, which calls for scalable and interpretable detection systems that support online safety and mental health interventions.

Method: Datasets are curated from Twitter and Reddit and processed with a "split-then-balance" strategy for training and evaluation. Traditional lexical models, hybrid approaches, and end-to-end fine-tuned transformers are compared. MentalBERT is introduced, together with a SHAPLLM-based explainability framework and a prototype screening dashboard.

Result: MentalBERT reaches an accuracy of 0.92 and a Macro F1 score of 0.76, outperforming its generic counterpart and a zero-shot LLM baseline. The explainability design supports a human review workflow.

Conclusion: End-to-end fine-tuning, especially of domain-adapted pre-trained models, is critical for multiclass detection. The system can serve as a human-in-the-loop screening aid, and future work needs multi-label, clinically validated datasets to advance computational mental health research.

Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.

[14] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning

Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang, Haoyu Chen, Xiaoyong Zhu, Bo Zheng, Xuejiao Zhao

Main category: cs.CL

TL;DR: Proposes the Online-PVLM framework, which uses hyperbolic representations for online concept learning in personalized vision-language models, supporting real-time, training-free adaptation at test time, and builds the large-scale OP-Eval benchmark for evaluation.

Details

Motivation: Existing personalized vision-language models must learn a separate embedding for each new concept, cannot adapt in real time at test time, and are inefficient at large scale.

Method: The Online-PVLM framework uses hyperbolic representations to generate concept embeddings at test time without any training, and the large-scale OP-Eval benchmark is built for evaluation.

Result: Experiments show that Online-PVLM achieves state-of-the-art performance on online concept learning while being efficient and scalable.

Conclusion: Online-PVLM enables efficient online learning for personalized vision-language models and addresses the challenges of real-time adaptation and large-scale deployment.

Abstract: Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in interactions aligned with user-specific concepts (e.g., identifying a user's bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To alleviate this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach provides a training-free paradigm for generating concept embeddings at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.

[15] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, Wenlin Zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao

Main category: cs.CL

TL;DR: Proposes MTA, a Merge-then-Adapt framework for personalized large language models (PLLMs). By building a shared Meta-LoRA Bank, dynamically fusing the relevant anchor modules, and stacking a lightweight LoRA on top, it addresses the storage overhead and weak few-shot performance of conventional methods and outperforms existing SOTA methods on the LaMP benchmark.

Details

Motivation: Existing PLLMs fine-tune a separate module for each user, so storage cost grows linearly with the number of users, and personalization is poor for users with sparse data. The approach lacks scalability and flexibility.

Method: MTA has three stages. First, anchor users are selected to build a shared Meta-LoRA Bank. Second, Adaptive LoRA Fusion dynamically merges the most relevant meta-modules to produce a user-specific LoRA. Third, an ultra-low-rank, lightweight LoRA module is stacked on the merged LoRA and fine-tuned for few-shot personalization.

Result: Experiments on the LaMP benchmark show that MTA outperforms existing state-of-the-art methods on multiple tasks, with better scalability and few-shot personalization.

Conclusion: The merge-then-adapt strategy reduces the storage overhead of personalized models and improves adaptation for users with sparse data, yielding efficient, flexible, and scalable personalized large language models.

Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

[16] More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering

Duc Anh Vu, Thong Nguyen, Cong-Duy Nguyen, Viet Anh Nguyen, Anh Tuan Luu

Main category: cs.CL

TL;DR: Proposes BiasPrompting, a new inference framework that improves large language model performance on multiple-choice questions by generating and evaluating reasoning for every answer option.

Details

Motivation: Existing methods present answer options without contextual support or explanation, which limits the model's reasoning on multiple-choice questions.

Method: BiasPrompting has two stages. In the reasoning generation stage, the model produces supportive reasoning for each answer option. In the reasoning-guided agreement stage, the generated reasonings are synthesized to select the most plausible answer.

Result: On five widely used multiple-choice benchmarks, BiasPrompting significantly improves model performance, especially on complex and challenging questions.

Conclusion: BiasPrompting effectively strengthens the reasoning ability of large language models and provides a solid basis for tackling complex questions.

Abstract: With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models' reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
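The two-stage flow described above can be sketched in Python as a small orchestration function. Everything concrete here is an assumption: the prompt wording, the `bias_prompting` name, and the convention that the model is any callable taking a prompt string and returning text. `stub_llm` is a stand-in used only so the sketch runs without a real model; it is not part of the paper.

```python
def bias_prompting(question, options, llm):
    """Stage 1: one supportive rationale per option. Stage 2: pick an answer."""
    rationales = {}
    for option in options:
        # Hypothetical prompt wording; the paper's actual prompts are not shown here.
        prompt = f"Question: {question}\nArgue that the answer is: {option}"
        rationales[option] = llm(prompt)

    summary = "\n".join(f"{option}: {why}" for option, why in rationales.items())
    final_prompt = (
        f"Question: {question}\n"
        f"Arguments for each option:\n{summary}\n"
        "Reply with exactly one option."
    )
    answer = llm(final_prompt).strip()
    # Reject replies that are not one of the offered options.
    return (answer if answer in options else None), rationales

def stub_llm(prompt):
    """Stand-in for a real model call, used only to exercise the control flow."""
    if "Reply with exactly one option." in prompt:
        return "Paris"
    return "a supporting argument"
```

A call such as `bias_prompting("Capital of France?", ["Paris", "Lyon"], stub_llm)` issues one generation call per option plus one final selection call, so the cost grows linearly with the number of options.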

[17] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Zhenyi Shen,Junru Lu,Lin Gui,Jiazheng Li,Yulan He,Di Yin,Xing Sun

Main category: cs.CL

TL;DR: 本文提出了一种名为SSA的统一训练框架,通过在每一层强制稀疏注意力与全注意力之间的双向对齐,解决了原生稀疏注意力方法中存在的梯度更新不足问题,从而实现了更强的稀疏性和更好的性能表现。

Details Motivation: 现有的训练自由稀疏注意力方法常导致性能严重下降,而原生稀疏注意力方法虽有所改善但存在一个矛盾:尽管旨在逼近全注意力,其产生的注意力稀疏性反而低于全注意力模型,限制了有效性。本文动机在于解决这一矛盾及其背后的梯度更新缺陷。 Method: 提出SSA(Sparse Sparse Attention)框架,同时考虑稀疏和全注意力,并在每一层实施双向对齐,以保持所有token的梯度流动,同时促使稀疏注意力输出与全注意力对应结果一致,增强稀疏性。 Result: SSA在多个常识推理基准上实现了稀疏和全注意力推断下的最先进性能;支持灵活的计算-性能权衡,随可参与注意力的token增加性能持续提升;并展现出最强的长上下文外推能力。 Conclusion: SSA有效解决了稀疏注意力中的梯度缺失问题,提升了模型的稀疏程度与整体性能,同时具备良好的适应性和外推能力,为高效处理长上下文提供了新思路。 Abstract: The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. 
Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
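The relationship between sparse and full attention described in this abstract can be illustrated with a toy, single-query sketch. It assumes top-k key selection as the sparsity rule and a mean squared difference as the alignment measure; both are illustrative choices, and the function names are hypothetical rather than taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    # Weighted sum of value vectors.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def full_and_sparse_outputs(query, keys, values, k):
    # Scaled dot-product scores of one query against every key.
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    full_out = attend(softmax(scores), values)
    # Sparse attention (assumed rule): keep only the k highest-scoring keys.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    sparse_out = attend(softmax([scores[i] for i in top]), [values[i] for i in top])
    return full_out, sparse_out

def alignment_gap(full_out, sparse_out):
    # Assumed alignment measure: mean squared difference of the two outputs.
    return sum((f - s) ** 2 for f, s in zip(full_out, sparse_out)) / len(full_out)

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full_out, sparse_out = full_and_sparse_outputs(query, keys, values, k=1)
print(alignment_gap(full_out, sparse_out))
```

When `k` equals the number of keys the gap vanishes, and it generally grows as fewer keys are kept. A training objective that penalizes such a gap would push the sparse output toward the full one, which is the intuition the abstract describes.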

[18] EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

Xingfeng Li,Xiaohan Shi,Junjie Li,Yongwei Li,Masashi Unoki,Tomoki Toda,Masato Akagi

Main category: cs.CL

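TL;DR: EM2LDL is a new multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. It covers English, Mandarin, and Cantonese and includes fine-grained emotion distribution annotations across 32 categories.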

Details Motivation: Existing emotion corpora are mostly monolingual and single-label, which restricts linguistic diversity, prevents modeling of mixed emotions, and lacks ecological validity. Method: Builds a multilingual speech corpus in English, Mandarin, and Cantonese from spontaneous emotional expressions collected on online platforms, annotates it with fine-grained emotion distributions via label distribution learning, and runs baseline experiments with self-supervised learning models (e.g., HuBERT-large-EN). Result: Shows robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving the best results. Conclusion: By combining linguistic diversity with ecological validity, EM2LDL supports research on complex emotional dynamics in multilingual settings and suits affective computing applications such as mental health monitoring and cross-cultural communication. Abstract: This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at https://github.com/xingfengli/EM2LDL.

[19] Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Huu Tuong Tu,Ha Viet Khanh,Tran Tien Dat,Vu Huan,Thien Van Luong,Nguyen Tien Cuong,Nguyen Thi Thu Trang

Main category: cs.CL

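TL;DR: Proposes a training-free, retrieval-based method that uses a pretrained speech recognition model for mispronunciation detection and diagnosis, avoiding complex model training.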

Details Motivation: Conventional methods require scoring models or phoneme-level modeling; training is complex and task-specific, which limits their applicability. Method: Combines a pretrained automatic speech recognition model with retrieval techniques to detect and diagnose mispronunciations without additional training. Result: Achieves an F1 score of 69.60% on the L2-ARCTIC dataset, outperforming existing methods. Conclusion: The method detects and diagnoses pronunciation errors effectively without task-specific training, simplifying the pipeline while performing well. Abstract: Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

[20] "When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings

Somsubhra De,Harsh Kumar,Arun Prakash A

Main category: cs.CL

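TL;DR: This paper explores prompting-based use of large language models for grammatical error correction (GEC) in low-resource Indic languages, showing that zero-shot and few-shot approaches perform strongly in multilingual GEC.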

Details Motivation: Progress on grammatical error correction for Indic languages has been slow because of limited resources, linguistic diversity, and complex morphology. This work examines how well large language models combined with prompting strategies adapt to low-resource settings. Method: Uses large language models such as GPT-4.1, Gemini-2.5, and LLaMA-4 with zero-shot and few-shot prompting, relying on carefully designed prompts and lightweight adaptation for grammatical error correction. Result: Achieves leading results across several Indic languages: 1st in Tamil (GLEU: 91.57), 1st in Hindi (85.69), 2nd in Telugu (85.22), 4th in Bangla (92.86), and 5th in Malayalam (92.97), substantially outperforming fine-tuned Indic-language models such as Sarvam-22B. Conclusion: Prompt-driven methods combined with modern large language models show strong potential for multilingual grammatical error correction and can help bridge the resource gap for low-resource languages. Abstract: Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with a few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task--ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
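The zero-shot and few-shot prompting strategies mentioned in this abstract can be pictured with a small prompt builder. This is only a sketch: the instruction wording, the `Input:`/`Corrected:` labels, and the function name `build_gec_prompt` are hypothetical and are not the prompts used by the authors.

```python
def build_gec_prompt(language, examples, sentence):
    """Assemble a GEC prompt; an empty `examples` list gives a zero-shot prompt."""
    lines = [
        f"Correct the grammatical errors in the following {language} sentence.",
        "Return only the corrected sentence.",
        "",
    ]
    # Each few-shot example is an (erroneous, corrected) sentence pair.
    for source, target in examples:
        lines += [f"Input: {source}", f"Corrected: {target}", ""]
    # The sentence to correct comes last, with the answer slot left open.
    lines += [f"Input: {sentence}", "Corrected:"]
    return "\n".join(lines)

# Placeholder strings stand in for real sentence pairs.
prompt = build_gec_prompt(
    "Tamil",
    [("<erroneous sentence 1>", "<corrected sentence 1>")],
    "<sentence to correct>",
)
print(prompt)
```

The resulting string would be sent to whichever LLM is being evaluated. Keeping the example pairs as a parameter makes it easy to compare zero-shot and few-shot settings with the same instruction text.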

[21] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models

Wen-Fang Su,Hsiao-Wei Chou,Wen-Yang Lin

Main category: cs.CL

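TL;DR: This paper proposes a grid-tagging model that incorporates image data augmentation techniques to improve named entity recognition for discontinuous entities, particularly in cross-sentence settings.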

Details Motivation: Traditional methods missegment or miss text when handling discontinuous entities, which hurts recognition accuracy; this paper aims to address that problem. Method: Builds on the grid-tagging framework and introduces image data augmentation techniques (such as cropping, scaling, and padding) to strengthen recognition of discontinuous entities and mitigate segmentation issues. Result: On the CADEC, ShARe13, and ShARe14 datasets, overall F1 improves by 1-2.5%, and F1 on discontinuous entities improves by 3.7-8.4%. Conclusion: The proposed augmentation strategy improves grid models' recognition of discontinuous entities and demonstrates its potential for complex NER tasks. Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
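To make the image-style operations named in this abstract concrete, the sketch below treats a grid-tagging target as a square matrix of word-pair labels and applies cropping and padding to it. The label encoding, the pad label `0`, and the helper names are assumptions for illustration; the paper's actual augmentation procedure may differ.

```python
def crop_grid(grid, start, end):
    # Keep only the word-pair cells whose row and column both fall in [start, end).
    return [row[start:end] for row in grid[start:end]]

def pad_grid(grid, size, pad_label=0):
    # Extend every row to `size` columns, then append all-padding rows.
    padded = [row + [pad_label] * (size - len(row)) for row in grid]
    padded += [[pad_label] * size for _ in range(size - len(grid))]
    return padded

# A 3x3 grid of hypothetical relation labels between word pairs.
grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
window = crop_grid(grid, 1, 3)   # sub-grid for words 1 and 2
restored = pad_grid(window, 3)   # padded back to a fixed 3x3 size
print(window)
print(restored)
```

Cropping around an entity and padding back to a fixed size yields a new training view of the same annotation, which is the general idea behind borrowing augmentation from images.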

[22] KyrgyzBERT: A Compact, Efficient Language Model for Kyrgyz NLP

Adilet Metinov,Gulida M. Kudakeeva,Gulnara D. Kabaeva

Main category: cs.CL

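TL;DR: This paper introduces KyrgyzBERT, the first publicly available monolingual BERT model for Kyrgyz, and creates the kyrgyz-sst2 benchmark for sentiment analysis. Experiments show it is competitive with a much larger multilingual BERT model.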

Details Motivation: Kyrgyz is a low-resource language that lacks foundational NLP tools; this work aims to fill that gap. Method: Proposes KyrgyzBERT, a model with 35.9M parameters and a custom tokenizer designed for Kyrgyz morphology, and builds the kyrgyz-sst2 sentiment analysis benchmark. Result: KyrgyzBERT fine-tuned on kyrgyz-sst2 achieves an F1-score of 0.8280, comparable to an mBERT model five times larger. Conclusion: KyrgyzBERT is the first public monolingual language model for Kyrgyz and performs well; all models, data, and code are open-sourced to support Kyrgyz NLP research. Abstract: Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.

[23] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance

Chuyi Kong,Gao Wei,Jing Ma,Hongzhan Lin,Zhiyuan Fan

Main category: cs.CL

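TL;DR: This paper proposes REFLEX, a new fact-checking paradigm that self-refines using the model's internal knowledge to improve verdict accuracy and explanation quality, enabling real-time, efficient, and explainable misinformation detection without relying on external knowledge sources.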

Details Motivation: Existing LLM-based fact-checking methods rely heavily on external knowledge, which leads to high latency and hallucinations that undermine reliability and responsiveness; a more efficient, explainable, low-latency approach is needed. Method: Reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation; extracts contrastive activation pairs between the backbone model and its fine-tuned variant to build steering vectors that disentangle "style" from "substance", guiding inference at the activation level and suppressing noisy explanations. Result: On real-world datasets, REFLEX reaches state-of-the-art performance with only 465 self-refined samples, outperforms methods that steer toward a single truth direction, and can transfer explanation signals to models trained without explanatory objectives, yielding up to a 7.57% improvement. Conclusion: By exploiting internal explanation signals, REFLEX delivers efficient, reliable, and explainable fact-checking, demonstrates the potential of activation-level control for more faithful reasoning, and offers a new approach to low-resource, real-time misinformation detection. Abstract: The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose the REason-guided Fact-checking with Latent EXplanations (REFLEX) paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in the backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, REFLEX achieves state-of-the-art performance. 
Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.

[24] Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Luohe Shi,Zuchao Li,Lefei Zhang,Baoyuan Qi,Guoming Liu,Hai Zhao

Main category: cs.CL

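TL;DR: This paper proposes SpecFormer, a new architecture combining unidirectional and bidirectional attention that accelerates LLM inference efficiently with low verification resources and low scheduling costs, removing conventional speculative decoding's dependence on large amounts of compute.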

Details Motivation: Existing speculative decoding methods depend on substantial compute to build complex draft trees, but batching in mainstream inference systems compresses the idle compute they rely on, making them hard to apply effectively. An efficient speculative decoding method suited to low-resource conditions and capable of parallel generation is therefore needed. Method: Proposes the SpecFormer architecture, which fuses unidirectional and bidirectional attention to combine the autoregressive model's ability to extract information from the full input sequence with the parallel generation of non-autoregressive models, achieving consistent acceleration without large prefix trees. Result: Lossless speculative decoding experiments across models of various scales show that SpecFormer delivers superior inference acceleration with lower training demands and reduced computational cost, and remains stable in large-batch scenarios. Conclusion: SpecFormer offers a new paradigm for accelerating large language model inference, staying efficient while reducing resource consumption, with broad application prospects. Abstract: Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
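The "lossless" property mentioned in this abstract comes from the verification step of speculative decoding, which can be sketched for the greedy case as follows. This shows generic draft verification, not SpecFormer itself; `target_next` is a hypothetical stand-in for the target model's greedy next-token function, and a real system would verify all draft positions in one batched forward pass instead of a loop.

```python
def verify_draft(prefix, draft, target_next):
    """Return the tokens to append after checking a draft against the target model."""
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Whole draft accepted: the target contributes one extra token for free.
    accepted.append(target_next(prefix + accepted))
    return accepted

def toy_target(tokens):
    # Toy "model": the next token is the last token plus one, modulo 10.
    return (tokens[-1] + 1) % 10

print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # draft diverges at the third token
print(verify_draft([1, 2], [3, 4], toy_target))        # draft fully accepted
```

Because every emitted token is either confirmed or produced by the target model, the output matches what greedy decoding with the target alone would give. The speedup depends on how many draft tokens are accepted per verification step.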

[25] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

Taewhoo Lee,Minju Song,Chanwoong Yoon,Jungwoo Park,Jaewoo Kang

Main category: cs.CL

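TL;DR: This study examines how large language models (LLMs) encode and apply high-level relational concepts in analogical reasoning. LLMs capture the underlying relations between analogous entities effectively but, unlike humans, struggle to transfer and apply them to new situations; successful reasoning depends on strong structural alignment between situations, while failures often stem from degraded or misplaced alignment.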

Details Motivation: To investigate whether large language models possess human-like analogical reasoning, that is, whether they can capture high-level relational concepts and apply them to new situations, and to reveal how they resemble and differ from human cognition. Method: Uses proportional and story analogy tasks, analyzes how hidden representations propagate through intermediate LLM layers, and applies representation patching experiments to study how relational information is encoded, transferred, and structurally aligned. Result: 1) In correct cases LLMs encode analogical relations, with attributive and relational information propagating through mid-upper layers, whereas failure cases lack this information; 2) LLMs perform poorly when a relation must be applied to new entities, though patching representations at critical positions partially helps; 3) successful analogical reasoning correlates with strong structural alignment, while failures correspond to degraded or misplaced alignment. Conclusion: LLMs show emerging but limited high-level relational reasoning; they display human-like signs in structural alignment and information encoding but still fall well short in flexibly transferring and applying relations, revealing cognitive limits of current models. Abstract: Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.

[26] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Abdullah Al Sefat

Main category: cs.CL

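TL;DR: This paper presents BengaliFig, a culturally rich riddle challenge set in Bengali for evaluating large language models on low-resource, culturally grounded reasoning. Experiments show clear weaknesses in metaphorical and culturally specific reasoning in existing models.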

Details Motivation: Large language models excel on multilingual benchmarks but have not been systematically evaluated on figurative and culturally grounded reasoning, especially in low-resource languages. A dedicated dataset is needed to measure reasoning in these culturally sensitive, resource-limited settings. Method: Builds BengaliFig, a dataset of 435 unique riddles from Bengali oral and literary traditions; each item is annotated along five orthogonal dimensions and automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. Eight frontier LLMs are evaluated under zero-shot and few-shot chain-of-thought prompting. Result: Current LLMs perform poorly on metaphorical and culturally specific reasoning, exposing their fragility in low-resource cultural contexts. The multi-dimensional annotations support fine-grained analysis of model failures. Conclusion: BengaliFig provides a diagnostic tool for evaluating LLM robustness in low-resource, culturally rich contexts and advances more inclusive, heritage-aware NLP evaluation. Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

[27] A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines

Md Abdullah Al Kafi,Raka Moni,Sumit Kumar Banshal

Main category: cs.CL

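TL;DR: Proposes a new task-oriented evaluation framework for stemming methods that jointly considers utility, downstream task impact, and semantic similarity. It finds that high stemming efficiency can come with harmful over-stemming, whereas balanced stemming serves practical applications better.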

Details Motivation: Existing stemming evaluation methods fail to capture the potential harm of over-stemming and do not jointly consider meaning preservation and downstream task impact. Method: Proposes an evaluation framework with three metrics: Stemming Effectiveness Score (SES), Model Performance Delta (MPD), and Average Normalized Levenshtein Distance (ANLD), and validates it on Bangla and English stemmers. Result: The Bangla stemmer has a higher SES (1.67) but also a higher ANLD (0.26), indicating harmful over-stemming that lowers downstream performance; the English Snowball stemmer has a moderate SES (1.31) and a lower ANLD (0.14) and improves downstream task performance. Conclusion: Relying on stemming effectiveness alone can mislead evaluation; it must be combined with a semantic fidelity metric such as ANLD to balance efficiency against meaning preservation. Abstract: Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study proposes a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. 
While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance. In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
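The ANLD metric named in this abstract can be sketched directly from its name: a Levenshtein edit distance between each word and its stem, normalized and averaged. The abstract does not give the exact normalization, so dividing by the length of the longer string is an assumption here, and the word-stem pairs are made-up examples rather than data from the study.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def average_normalized_levenshtein(pairs):
    # `pairs` is a non-empty list of (word, stem) tuples.
    total = 0.0
    for word, stem in pairs:
        longest = max(len(word), len(stem))
        # Assumed normalization: divide by the longer string's length.
        total += levenshtein(word, stem) / longest if longest else 0.0
    return total / len(pairs)

pairs = [("running", "run"), ("cats", "cat")]
print(average_normalized_levenshtein(pairs))
```

Under this normalization the score lies between 0 and 1, and a larger value means the stems have drifted further from the original words, which is the over-stemming signal the abstract describes.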

[28] Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts

Mosab Rezaei,Mina Rajaei Moghadam,Abdul Rahman Shaikh,Hamed Alhoori,Reva Freedman

Main category: cs.CL

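TL;DR: This paper presents an LLM-based framework for generating and evaluating the writing styles of 19th-century novelists. Models are fine-tuned with single-token prompts to produce author-specific text, which is then automatically evaluated and analyzed with a transformer-based detector and explainable AI methods.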

Details Motivation: To address two challenges: training generative models when no paired data exist, and evaluating stylistic text without relying on human judgment. Method: Fine-tunes large language models with minimal single-token prompts to generate text in a specific author's style; trains a transformer-based detector for classification and stylistic explanation, complemented by syntactic comparisons and explainable AI methods such as attention-based and gradient-based analyses to examine linguistic features. Result: The generated text reflects the distinctive language patterns of authors such as Dickens, Austen, and Twain, and AI-driven evaluation proves a reliable alternative to human judgment for style recognition. Conclusion: The framework enables stylized text generation without paired data and supports AI-based evaluation in stylometry, offering an automated, explainable tool for studying literary style. Abstract: Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors' distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.

[29] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Jakub Hoscilowicz,Artur Janicki

Main category: cs.CL

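TL;DR: Introduces the Adversarial Confusion Attack, a new threat that disrupts multimodal large language models by maximizing next-token entropy so that they produce incoherent or confidently incorrect outputs.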

Details Motivation: To study a threat whose goal is systematic disruption of multimodal large language models rather than jailbreaks or targeted misclassification, for example adversarial images that keep MLLM-powered agents from operating reliably. Method: Uses a small ensemble of open-source multimodal large language models and a basic adversarial technique (PGD) to generate adversarial images that maximize next-token entropy. Result: A single adversarial image can disrupt all models in the ensemble, and the perturbations transfer to unseen open-source and proprietary models. Conclusion: The Adversarial Confusion Attack is effective and could be used by embedding adversarial images in websites to stop MLLM agents from operating reliably. Abstract: We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
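The quantity this attack maximizes, next-token entropy, is the Shannon entropy of the model's predicted distribution over the vocabulary. The sketch below computes it from raw logits; it illustrates only the objective, not the PGD image optimization, and the tiny four-token vocabulary is a made-up example.

```python
import math

def next_token_entropy(logits):
    # Convert logits to probabilities with a numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = [10.0, 0.0, 0.0, 0.0]  # model strongly prefers one token
confused = [0.0, 0.0, 0.0, 0.0]    # model is indifferent between tokens
print(next_token_entropy(confident))
print(next_token_entropy(confused))
```

Entropy is highest when all tokens are equally likely, so pushing it up drives the model toward near-random continuations. Averaging this quantity over an ensemble of models would be one natural way to form a joint objective, though the abstract does not spell out the exact aggregation.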

[30] The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

Nathan Roll,Jill Kries,Flora Jin,Catherine Wang,Ann Marie Finley,Meghan Sumner,Cory Shain,Laura Gwilliams

Main category: cs.CL

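TL;DR: This paper introduces the Text Aphasia Battery (TAB), a clinically grounded, text-only benchmark for assessing aphasia-like deficits in large language models, and validates the reliability of its automated scoring protocol.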

Details Motivation: Traditional clinical assessments are ill-suited to large language models because they presuppose human-specific pragmatic pressures and cognitive processes that artificial architectures lack. An assessment tool for language deficits designed for LLMs is therefore needed. Method: Designs the text-only TAB based on the Quick Aphasia Battery (QAB), with four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition, and validates automated scoring with Gemini 2.5 Flash. Result: The automated evaluation protocol achieves reliability comparable to expert human raters (model-consensus kappa = 0.255 vs. human-human kappa = 0.286). Conclusion: TAB is a clinically grounded, scalable framework for large-scale analysis of language deficits in artificial systems. Abstract: Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model--consensus agreement vs. 0.286 for human--human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
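The agreement figures in this abstract are Cohen's kappa values, which correct raw agreement between two raters for the agreement expected by chance. The sketch below implements the plain, unweighted form; the paper reports a prevalence-weighted variant whose exact definition is not given in the abstract, so this is a simplified illustration with made-up ratings.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Both inputs are equal-length, non-empty lists of categorical labels.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

human = [1, 1, 0, 0]
model = [1, 0, 0, 0]
print(cohens_kappa(human, model))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Comparing model-versus-consensus kappa against human-versus-human kappa, as the abstract does, puts the automated scorer on the same scale as human inter-rater reliability.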

[31] Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Wesley Bian,Xiaofeng Lin,Guang Cheng

Main category: cs.CL

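TL;DR: Proposes a new speech data augmentation technique to narrow the performance gap for low-resource languages in automatic speech recognition.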

Details Motivation: Because training data are scarce, low-resource languages perform worse in modern audio machine learning models, creating an unfair performance gap. Method: Introduces a novel data augmentation technique for speech corpora. Result: Experiments show that the method significantly improves automatic speech recognition on low-resource languages and outperforms existing augmentation strategies. Conclusion: The method offers a practical way to improve speech technology for underrepresented language communities. Abstract: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.

[32] From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

Farjana Sultana Mim,Shuchin Aeron,Eric Miller,Kristen Wendell

Main category: cs.CL

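TL;DR: This study introduces an annotated educational dialogue dataset for automatically identifying knowledge construction and task production features in student conversations, and establishes prediction baselines with GPT-3.5 and Llama-3.1. Results show that current large models perform poorly on the task, leaving room for further research.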

Details Motivation: Manually analyzing discourse features in student conversations is time-consuming and labor-intensive, and existing NLP research rarely addresses dialogue in educational settings, so a dedicated dataset and automated identification methods are needed. Method: Builds an annotated educational dialogue dataset covering two types of discourse features, knowledge construction and task production, and uses the pre-trained large language models GPT-3.5 and Llama-3.1 to classify each turn of talk, establishing baselines for automatic prediction. Result: Even these advanced large language models perform suboptimally on the task, indicating that current models are limited in recognizing discourse features in educational dialogue. Conclusion: The study fills a gap in NLP for educational dialogue analysis by providing a dataset and baseline models; current performance is limited, and future work should improve models for automatic identification. Abstract: Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.

[33] On Evaluating LLM Alignment by Evaluating LLMs as Judges

Yixin Liu,Pengfei Liu,Arman Cohan

Main category: cs.CL

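TL;DR: This paper studies the consistency between large language models' (LLMs') generation and evaluation abilities in aligning with human preferences, finds a strong correlation between the two, and proposes AlignEval, a new benchmark that requires no direct evaluation of generated content and matches or surpasses mainstream automatic benchmarks in capturing human preferences.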

Details Motivation: To evaluate LLM alignment in helpfulness, honesty, safety, and instruction following more effectively, by exploring an evaluation approach that does not depend on directly assessing model outputs. Method: Analyzes the generation-evaluation consistency (GE-consistency) of multiple LLMs, using a strong LLM as a preference oracle, and proposes AlignEval, a benchmark that assesses models in their role as evaluators. Result: When ranking LLMs, AlignEval matches or surpasses existing automatic benchmarks such as AlpacaEval and Arena-Hard in capturing human preferences, confirming the strong link between generation and evaluation abilities. Conclusion: LLMs' generation and evaluation abilities are highly correlated; AlignEval builds on this to offer an effective paradigm that measures alignment with human preferences without directly evaluating generated outputs. Abstract: Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
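The generation-evaluation consistency described in this abstract amounts to asking whether models that rank high as generators also rank high as evaluators. One common way to quantify such agreement is Spearman's rank correlation, sketched below. The abstract does not state which correlation measure the authors use, so this choice is an assumption, the scores are made-up numbers, and the simple formula assumes there are no tied scores.

```python
def ranks(scores):
    # Rank 1 goes to the lowest score; assumes all scores are distinct.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    # Spearman's rho via the rank-difference formula (valid without ties, n >= 2).
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical per-model scores: generation quality and accuracy as a judge.
generation_scores = [0.9, 0.5, 0.7]
evaluation_scores = [0.8, 0.4, 0.6]
print(spearman(generation_scores, evaluation_scores))
```

A value near 1 would mean the two rankings agree, which is the condition under which evaluating models as judges can stand in for evaluating their outputs directly.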

[34] Latent Collaboration in Multi-Agent Systems

Jiaru Zou,Xiyuan Yang,Ruizhong Qiu,Gaotang Li,Katherine Tieu,Pan Lu,Ke Shen,Hanghang Tong,Yejin Choi,Jingrui He,James Zou,Mengdi Wang,Ling Yang

Main category: cs.CL

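TL;DR: This paper proposes LatentMAS, a training-free, end-to-end multi-agent framework that lets LLM agents collaborate directly in continuous latent space, improving system-level reasoning quality and efficiency.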

Details Motivation: Existing text-based LLM agents suffer information loss and inefficiency in reasoning and communication, so a more efficient and more expressive mode of collaboration is needed. Method: In LatentMAS, each agent auto-regressively generates last-layer hidden embeddings as latent thoughts, and a shared latent working memory preserves and transfers each agent's internal representations, enabling lossless information exchange. Result: Across 9 benchmarks spanning math and science reasoning, commonsense understanding, and code generation, LatentMAS improves accuracy by up to 14.6% over single-model and text-based multi-agent baselines, reduces output token usage by 70.8%-83.7%, and runs end-to-end inference 4x-4.3x faster. Conclusion: Through purely latent-space collaboration, LatentMAS substantially improves the reasoning quality and efficiency of multi-agent systems without added training cost, pointing to a new direction for system-level intelligence. Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

cs.CV [Back]

[35] PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern

Juri Zach,Peer Stelldinger

Main category: cs.CV

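TL;DR: PuzzlePole is a new cylindrical fiducial marker based on the PuzzleBoard pattern. It supports reliable recognition and pose estimation from a 360° range of views, offers high accuracy and strong robustness to occlusion, and suits autonomous-system scenarios such as robot navigation and SLAM.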

Details Motivation: To make environment perception in autonomous systems more reliable, especially the robustness of visual markers under varying viewpoints and occlusion, a more flexible and accurate marker design is needed. Method: Builds on the combinatorial structure of the PuzzleBoard calibration pattern to design the cylindrical PuzzlePole marker, which can be recognized across 360° and supports accurate pose estimation. Result: PuzzlePole achieves high-accuracy localization and orientation estimation, is robust to occlusion, and can be deployed flexibly across scenarios. Conclusion: PuzzlePole is an efficient, reliable visual marker solution for a wide range of applications, including robot navigation, SLAM, and tangible interfaces. Abstract: Reliable perception of the environment is a key enabler for autonomous systems, where calibration and localization tasks often rely on robust visual markers. We introduce the PuzzlePole, a new type of fiducial marker derived from the recently proposed PuzzleBoard calibration pattern. The PuzzlePole is a cylindrical marker, enabling reliable recognition and pose estimation from a full 360° range of viewing directions. By leveraging the unique combinatorial structure of the PuzzleBoard pattern, PuzzlePoles provide high accuracy in localization and orientation while being robust to occlusions. The design offers flexibility for deployment in diverse autonomous systems scenarios, ranging from robot navigation and SLAM to tangible interfaces.

[36] Personalized Reward Modeling for Text-to-Image Generation

Jeongeun Lee,Ryang Heo,Dongha Lee

Main category: cs.CV

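TL;DR: This paper proposes PIGReward, a reasoning-based personalized reward model for evaluating and optimizing how well text-to-image generation aligns with user preferences, and introduces the PIGBench benchmark to measure individual preferences.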

Details Motivation: Existing text-to-image models lack an effective way to evaluate individual user preferences, and conventional metrics fail to capture the diversity and complexity of personal visual tastes. Method: The authors propose PIGReward, which uses a self-bootstrapping strategy to build user contexts from limited reference data and generates personalized evaluation dimensions and feedback through chain-of-thought (CoT) reasoning; they also build PIGBench, a personalized evaluation benchmark. Result: Experiments show that PIGReward outperforms existing methods in accuracy and interpretability and effectively drives personalized prompt optimization, improving consistency between generated images and user intent. Conclusion: PIGReward provides a scalable, reasoning-based framework for evaluating and optimizing personalized text-to-image generation, an important step toward individually aligned generation. Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopts a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust step toward individually aligned T2I generation.

[37] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data

Penghao Rao,Runmin Jiang,Min Xu

Main category: cs.CV

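TL;DR: This paper proposes the Stability-Guided Online Influence Framework (SG-OIF), the first framework to treat algorithmic stability as a real-time controller, for efficiently and accurately estimating the influence of training samples on test predictions in deep-learning vision models. It substantially improves noisy-label and out-of-distribution detection.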

Details Motivation: When applied to deep-learning vision models, existing influence-function methods are computationally expensive, rely on static approximations that do not track training dynamics, and lack confidence calibration, which makes rankings fragile. Method: SG-OIF maintains lightweight anchor IHVPs using stochastic Richardson and preconditioned Neumann methods, and provides modular curvature backends that modulate per-example influence scores using stability-guided residual thresholds, anomaly gating, and confidence. Result: SG-OIF achieves state-of-the-art performance on noisy-label and out-of-distribution detection across multiple datasets, reaching 91.1% accuracy in the top 1% of samples on CIFAR-10 (20% asymmetric noise) and a 99.8% AUPR score on MNIST. Conclusion: SG-OIF is a practical and efficient controller for online influence estimation that adapts to training dynamics in real time and makes the identification of critical samples more robust. Abstract: Approximating training-point influence on test predictions is critical for deploying deep-learning vision models, essential for locating noisy data. Though the influence function was proposed for attributing how infinitesimal up-weighting or removal of individual training examples affects model outputs, its implementation is still challenging in deep-learning vision models: inverse-curvature computations are expensive, and training non-stationarity invalidates static approximations. Prior works use iterative solvers and low-rank surrogates to reduce cost, but offline computation lags behind training dynamics, and missing confidence calibration yields fragile rankings that misidentify critical examples. To address these challenges, we introduce a Stability-Guided Online Influence Framework (SG-OIF), the first framework that treats algorithmic stability as a real-time controller, which (i) maintains lightweight anchor IHVPs via stochastic Richardson and preconditioned Neumann; (ii) proposes modular curvature backends to modulate per-example influence scores using stability-guided residual thresholds, anomaly gating, and confidence. Experimental results show that SG-OIF achieves SOTA (State-Of-The-Art) on noise-label and out-of-distribution detection tasks across multiple datasets with various corruptions. Notably, our approach achieves 91.1% accuracy in the top 1% prediction samples on CIFAR-10 (20% asym), and attains a 99.8% AUPR score on MNIST, effectively demonstrating that this framework is a practical controller for online influence estimation.
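The abstract above mentions maintaining inverse-Hessian-vector products (IHVPs) via Richardson and Neumann iterations. As a rough illustration of that general idea only, the sketch below runs a plain (deterministic, unpreconditioned) Richardson iteration on a tiny hand-written matrix. The function names, the step size, and the 2x2 matrix are assumptions made for this example; they are not taken from SG-OIF, which the abstract describes as stochastic and preconditioned.

```python
def matvec(H, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h * xi for h, xi in zip(row, x)) for row in H]


def richardson_ihvp(H, v, step=0.1, iters=500):
    """Approximate x = H^{-1} v without forming the inverse.

    Iterates x <- x + step * (v - H x). Starting from x = 0, the k-th
    iterate equals the truncated Neumann series
    step * sum_{j<k} (I - step*H)^j v, which converges for a symmetric
    positive-definite H when 0 < step < 2 / lambda_max(H).
    """
    x = [0.0] * len(v)
    for _ in range(iters):
        Hx = matvec(H, x)
        x = [xi + step * (vi - hxi) for xi, vi, hxi in zip(x, v, Hx)]
    return x


# Hypothetical stand-in for a (damped) Hessian and a test-loss gradient.
H = [[2.0, 0.5], [0.5, 1.0]]
v = [1.0, 1.0]

x = richardson_ihvp(H, v)
residual = [vi - hxi for vi, hxi in zip(v, matvec(H, x))]
print(x, residual)
```

In a training loop, `H` would typically be replaced by Hessian-vector products computed on minibatches, so only `matvec` changes; the residual `v - H x` is the kind of quantity a residual threshold could monitor.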

[38] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li,Hongyi Cai,Mingkang Dong,Muxin Pu,Shan You,Fei Wang,Tao Huang

Main category: cs.CV

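TL;DR: This paper presents Pistachio, a new video anomaly detection and understanding (VAD/VAU) benchmark built through a generative pipeline. It uses video generation models to control scenes, anomaly types, and temporal narratives precisely, addresses the biases of conventional datasets, and supports evaluation on large-scale, diverse, and complex anomalous events.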

Details Motivation: Existing video anomaly detection (VAD) benchmarks lack scene diversity, balanced anomaly coverage, and sufficient temporal complexity, which makes real-world performance hard to assess; video anomaly understanding (VAU), meanwhile, is hard to evaluate because it requires deep semantic and causal reasoning and depends on heavy manual annotation. Method: The authors propose a controllable, generation-based construction pipeline that combines scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form video synthesis to automatically produce semantically coherent 41-second videos, enabling large-scale dataset construction with little human intervention. Result: Experiments confirm Pistachio's advantages in scale, diversity, and temporal complexity and expose the weaknesses of existing methods on dynamic and multi-event anomaly understanding. Conclusion: Pistachio offers a high-quality, scalable new benchmark for video anomaly detection and understanding and advances research on modeling complex temporal anomalies. Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[39] Tracking and Segmenting Anything in Any Modality

Tianlu Zhang,Qiang Zhang,Guiguang Ding,Jungong Han

Main category: cs.CV

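TL;DR: This paper proposes SATA, a universal video tracking and segmentation framework that uses a decoupled mixture-of-experts mechanism and a task-aware multi-object tracking pipeline to handle multiple input modalities and subtasks in a unified way, improving cross-modal and cross-task knowledge sharing and model generalization.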

Details Motivation: Existing methods usually handle tracking and segmentation with specialized architectures or modality-specific parameters and overlook the distributional gap across modalities and the feature representation gap across tasks, which limits generalization and scalability. Method: The SATA framework includes a Decoupled Mixture-of-Expert (DeMoE) mechanism that decouples the learning of cross-modal shared knowledge from modality-specific information, and a Task-aware Multi-object Tracking (TaMOT) pipeline that unifies task outputs as instances with their IDs, integrating multiple tasks and modalities. Result: SATA shows superior performance on 18 challenging tracking and segmentation benchmarks, with clear gains in cross-modal and cross-task settings. Conclusion: SATA offers a new route toward truly generalist video understanding models, promoting knowledge sharing across tasks and modalities and improving model flexibility and generalization. Abstract: Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training.
SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

[40] The Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks

Andrew J. Hanson,Sonya M. Hanson

Main category: cs.CV

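TL;DR: This paper proposes a new approach based on the determinant ratio matrix (DRaM) for solving the 3D-2D orthographic projection pose estimation (OnP) problem and the 3D-3D full pose estimation (EnP) problem. It performs well on noisy data, unifies existing methods within a DRaM family framework, and shows that the approach generalizes to N-dimensional Euclidean space.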

Details Motivation: The OnP problem lacks a closed-form solution comparable to the SVD or quaternion methods, and the connections among different solution methods have not been fully recognized, so a unified and efficient framework is needed to analyze and improve pose estimation under orthographic projection systematically. Method: The determinant ratio matrix (DRaM) approach is used to solve the least-squares systems of the error-free EnP and OnP problems, combined with a rotation correction scheme for noisy data; the DRaM family of methods is constructed and its behavior analyzed through comparison with SVD, optimal quaternion, and related methods. Result: DRaM provides new closed-form solutions to the EnP and OnP problems and is especially useful for OnP; accuracy under noise can be improved with a simple correction; the work also shows that this class of methods extends to any N-dimensional Euclidean pose estimation problem. Conclusion: DRaM not only solves the 3D and 2D orthographic pose estimation problems but also gives related algorithms a unified theoretical framework with broad applicability and historical roots, and it extends to pose estimation in higher dimensions. Abstract: Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, or from a projection of the rotated object to a 2D planar image. This projection can be a perspective projection (the PnP problem) or an orthographic projection (the OnP problem). We restrict our attention here to the OnP problem and the full 3D pose estimation task (the EnP problem). Here we solve the least squares systems for both the error-free EnP and OnP problems in terms of the determinant ratio matrix (DRaM) approach. The noisy-data case can be addressed with a straightforward rotation correction scheme. While the SVD and optimal quaternion eigensystem methods solve the noisy EnP 3D-3D alignment exactly, the noisy 3D-2D orthographic (OnP) task has no known comparable closed form, and can be solved by DRaM-class methods. We note that while previous similar work has been presented in the literature exploiting both the QR decomposition and the Moore-Penrose pseudoinverse transformations, here we place these methods in a larger context that has not previously been fully recognized in the absence of the corresponding DRaM solution. We term this class of solutions the DRaM family, and conduct comparisons of the behavior of the families of solutions for the EnP and OnP rotation estimation problems. Overall, this work presents both a new solution to the 3D and 2D orthographic pose estimation problems and provides valuable insight into these classes of problems.
With hindsight, we are able to show that our DRaM solutions to the exact EnP and OnP problems possess derivations that could have been discovered in the time of Gauss, and in fact generalize to all analogous N-dimensional Euclidean pose estimation problems.

[41] Single Image to High-Quality 3D Object via Latent Features

Huanning Dong,Yinuo Huang,Fan Li,Ping Kuang

Main category: cs.CV

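TL;DR: This paper proposes LatentDreamer, a new framework for generating high-quality 3D objects from a single image. It uses a pre-trained variational autoencoder to map 3D geometry into a latent feature space, which simplifies generation, and produces coarse geometry, refined geometry, and textures in sequence in about 70 seconds.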

Details Motivation: Existing image-to-3D methods struggle to deliver fast, detailed, and high-fidelity 3D generation at the same time, so a more efficient and more accurate solution is needed. Method: The LatentDreamer framework uses a pre-trained variational autoencoder to encode 3D geometry into a latent space, where it generates coarse geometry, refined geometry, and realistic textures in sequence, producing high-quality 3D objects in a short time. Result: LatentDreamer completes 3D object generation in about 70 seconds, its outputs show high fidelity to the input image, and it is competitive with contemporary methods after only a small amount of training. Conclusion: By introducing a latent-space representation, LatentDreamer reduces the difficulty of 3D generation and achieves fast, high-fidelity 3D object generation, offering an efficient and practical new option for image-to-3D tasks. Abstract: 3D assets are essential in the digital age. While automatic 3D generation, such as image-to-3D, has made significant strides in recent years, it often struggles to achieve fast, detailed, and high-fidelity generation simultaneously. In this work, we introduce LatentDreamer, a novel framework for generating 3D objects from single images. The key to our approach is a pre-trained variational autoencoder that maps 3D geometries to latent features, which greatly reduces the difficulty of 3D generation. Starting from latent features, the pipeline of LatentDreamer generates coarse geometries, refined geometries, and realistic textures sequentially. The 3D objects generated by LatentDreamer exhibit high fidelity to the input images, and the entire generation process can be completed within a short time (typically in 70 seconds). Extensive experiments show that with only a small amount of training, LatentDreamer demonstrates competitive performance compared to contemporary approaches.

[42] Fewer Tokens, Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning

Shawn Young,Xingyu Zeng,Lijian Xu

Main category: cs.CV

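TL;DR: This paper studies the fundamental relationship between model capacity and the minimal number of visual tokens needed to preserve image semantics, proposes the Orthogonal Filtering method, and finds that larger models need fewer tokens to represent the visual semantic space.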

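Details Motivation: To explore how model capacity affects the number of visual tokens required, so that image semantics can be compressed and preserved more efficiently. Method: Following the Minimum Description Length principle, image tokens are treated as vectors in a visual semantic space, and an Orthogonal Filtering module adaptively clusters redundant tokens into a compact set of orthogonal bases. Result: Experiments on a range of ViT models show that larger models cover the visual semantic space with fewer tokens, revealing a consistent token-model scaling law; the authors also contribute a visual long-context dataset. Conclusion: The larger the model capacity, the smaller the minimal number of visual tokens required, indicating a quantifiable scaling law for efficient visual semantic representation. Abstract: This paper investigates the fundamental relationship between model capacity and the minimal number of visual tokens required to preserve image semantics. Inspired by the Minimum Description Length principle, we reinterpret image tokens as vectors in a visual semantic space and define the intrinsic semantic complexity of an image as the smallest set of basis vectors needed to span this space. Building on this perspective, we propose Orthogonal Filtering, a lightweight module that adaptively clusters redundant tokens into a compact set of orthogonal bases. Through extensive experiments across a range of ViT models, we reveal a consistent token-model scaling law: larger models require significantly fewer tokens to span visual semantic space. We also contribute a visual long-context dataset.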

[43] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo,Guangyao Chen,Xiawu Zheng,Yongxing Dai,Yixiong Zou,Yonghong Tian

Main category: cs.CV

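TL;DR: This paper proposes GroundingAgent, a new visual grounding framework that needs no task-specific fine-tuning. By combining pretrained object detectors, multimodal large language models, and large language models, it achieves effective zero-shot visual grounding with good interpretability.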

Details Motivation: Existing visual grounding methods depend on large amounts of task-specific annotation and fine-tuning, generalize poorly, and struggle with new scenes or out-of-distribution data, so a framework that generalizes well without fine-tuning is needed. Method: The GroundingAgent framework uses a structured, iterative reasoning mechanism that combines open-vocabulary object detectors, multimodal large language models, and large language models to refine candidate regions step by step through semantic and spatial analysis. Result: It reaches an average zero-shot grounding accuracy of 65.1% on RefCOCO, RefCOCO+, and RefCOCOg; when the original query text is used at the selection stage, accuracy at that stage reaches about 90%, close to supervised methods. Conclusion: GroundingAgent achieves effective visual grounding without fine-tuning, shows strong zero-shot ability and interpretability, and highlights the key role of large language models in reasoning. Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[44] Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Zhaoqi Xu,Yingying Zhang,Jian Li,Jianwei Guo,Qiannan Zhu,Hua Huang

Main category: cs.CV

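TL;DR: This paper proposes InfoPrune, a compression framework for vision-language models based on the Information Bottleneck principle. It quantifies the importance of attention heads with an entropy-based effective rank and the KS distance, jointly optimizes structural sparsity and informational efficiency, and compresses models substantially while preserving performance.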

Details Motivation: Existing VLM compression methods rely on heuristic rules, lack theoretical guarantees, and have difficulty preserving task-relevant semantic information during compression. Method: Based on the Information Bottleneck principle, the authors use eRank and the KS distance as measures of information preservation and design two compression schemes: training-based attention head pruning and training-free low-rank approximation of FFN layers. Result: On VQAv2, TextVQA, and GQA, InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with minimal performance loss. Conclusion: InfoPrune provides a theoretically grounded and practically efficient structured compression method for multimodal large models. Abstract: Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov-Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
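The two measures named in the abstract above have well-known generic forms, sketched here for orientation. The effective rank below is the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), and the KS distance is the usual two-sample statistic (the largest gap between two empirical CDFs). Whether InfoPrune uses exactly these forms, and what it feeds into them, is not stated in the abstract, so the function names and inputs are assumptions for illustration.

```python
import math
from bisect import bisect_right


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)), p_i = s_i / sum(s).

    Equals n when all n singular values are equal and 1 when a single
    singular value carries all the mass.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|.

    Both empirical CDFs are right-continuous step functions, so the
    maximum gap is attained at one of the sample points.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


# Hypothetical singular-value spectra of two attention heads.
flat_head = [1.0, 1.0, 1.0, 1.0]      # information spread over 4 directions
peaked_head = [1.0, 0.0, 0.0, 0.0]    # information in a single direction
print(effective_rank(flat_head), effective_rank(peaked_head))

# Hypothetical activation samples before and after compression.
print(ks_distance([0.0, 1.0], [2.0, 3.0]))
```

Under this reading, a head with a low effective rank is a more natural pruning candidate, and a small KS distance between original and compressed outputs suggests the compression changed little.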

[45] Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation

Mathis Wolter,Julie Stephany Berrio Perez,Mao Shan

Main category: cs.CV

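TL;DR: This paper proposes the Eyelid Angle (ELA), a new metric based on 3D facial landmarks for stable, robust detection of eye openness. It supports blink detection and synthetic data generation and improves driver drowsiness monitoring.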

Details Motivation: Existing measures such as EAR are sensitive to viewpoint changes, and real drowsiness data are hard to obtain, which limits the reliability and generalization of driver drowsiness detection systems. Method: The Eyelid Angle (ELA) is defined from 3D facial landmarks, a blink detection framework that extracts temporal characteristics is built on it, and ELA signals drive rigged avatars in Blender to generate synthetic datasets with controllable variables. Result: ELA shows lower variance than EAR under viewpoint changes and yields accurate blink detection; synthetic data increase the diversity and robustness of model training. Conclusion: ELA is an eye metric that is robust to viewpoint changes, combines biometric reliability with data generation capability, and suits driver state monitoring systems. Abstract: Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results on public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
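To make the viewpoint argument in the abstract above concrete, the sketch below compares the commonly used six-landmark Eye Aspect Ratio, computed on 2D projections, with an angle measured directly in 3D. The EAR formula follows its usual definition; `eyelid_angle_3d` is a hypothetical stand-in (the angle at the eye corner between an upper-lid point and a lower-lid point) and is not the paper's ELA definition, which the abstract does not give. The landmark coordinates are made up for illustration.

```python
import math


def ear_2d(landmarks):
    """Eye Aspect Ratio from six 2D landmarks p1..p6.

    p1 and p4 are the eye corners; (p2, p6) and (p3, p5) are
    upper/lower lid pairs.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    return (math.dist(p2, p6) + math.dist(p3, p5)) / (2 * math.dist(p1, p4))


def eyelid_angle_3d(corner, upper, lower):
    """Hypothetical 3D eyelid angle in degrees, measured at the eye corner."""
    u = [a - b for a, b in zip(upper, corner)]
    v = [a - b for a, b in zip(lower, corner)]
    cos = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def yaw(point, degrees):
    """Rotate a 3D point about the vertical (y) axis."""
    t = math.radians(degrees)
    x, y, z = point
    return (x * math.cos(t) + z * math.sin(t), y,
            -x * math.sin(t) + z * math.cos(t))


def project(point):
    """Orthographic projection onto the image plane (drop z)."""
    return point[:2]


# Made-up frontal eye: corners at x=0 and x=3, lids 0.5 above/below.
eye = [(0, 0, 0), (1, 0.5, 0), (2, 0.5, 0), (3, 0, 0), (2, -0.5, 0), (1, -0.5, 0)]
upper_mid, lower_mid = (1.5, 0.5, 0), (1.5, -0.5, 0)

ear_front = ear_2d([project(p) for p in eye])
ear_turned = ear_2d([project(yaw(p, 60)) for p in eye])
angle_front = eyelid_angle_3d(eye[0], upper_mid, lower_mid)
angle_turned = eyelid_angle_3d(yaw(eye[0], 60), yaw(upper_mid, 60), yaw(lower_mid, 60))
print(ear_front, ear_turned, angle_front, angle_turned)
```

With the eye unchanged and only the head turned by 60°, the projected EAR doubles because the horizontal eye width is foreshortened, while the 3D angle stays the same. In practice the 3D landmarks are themselves estimated, so real measurements would carry noise that this toy example ignores.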

[46] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Boyu Chen,Zikang Wang,Zhengrong Yue,Kainan Yan,Chenyun Yu,Yi Huang,Zijun Liu,Yafei Wen,Xiaoxin Chen,Yang Liu,Peng Li,Yali Wang

Main category: cs.CV

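TL;DR: This paper proposes VideoChat-M1, a video understanding framework based on multi-agent Collaborative Policy Planning (CPP). With a dynamic, learnable tool invocation mechanism and multi-agent reinforcement learning, it achieves SOTA performance on multiple benchmarks.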

Details Motivation: Existing multi-agent video understanding systems built on multimodal large models usually adopt static, non-learnable tool invocation mechanisms, which makes it hard to uncover the diverse clues in spatially or temporally complex videos and limits perception and reasoning. Method: VideoChat-M1 introduces the Collaborative Policy Planning (CPP) paradigm with three processes: policy generation (each agent generates its own tool invocation policy for the query), policy execution (agents invoke tools to explore the video content), and policy communication (agents interact during execution and update their policies dynamically). Combined with multi-agent reinforcement learning (MARL), the agent team is jointly optimized using the final answer reward and intermediate collaboration feedback. Result: The method reaches SOTA performance on eight benchmarks spanning four tasks. On LongVideoBench it outperforms Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%. Conclusion: Through dynamically collaborating multi-agent policy planning and reinforcement learning, VideoChat-M1 substantially improves complex video understanding and confirms the value of learnable, dynamic tool invocation. Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback.
Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

[47] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

Jonathan Lee,Xingrui Wang,Jiawei Peng,Luoxin Ye,Zehan Zheng,Tiezheng Zhang,Tao Wang,Wufei Ma,Siyi Chen,Yu-Cheng Chou,Prakhar Kaushik,Alan Yuille

Main category: cs.CV

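TL;DR: This paper presents Perceptual Taxonomy, a structured visual reasoning benchmark for physical scene understanding. By annotating object properties and building several question types, it evaluates the limits of current vision-language models on deeper reasoning.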

Details Motivation: Existing vision-language benchmarks focus mainly on surface-level recognition or image-text alignment and lack a comprehensive evaluation of structured scene understanding grounded in physical properties. Method: Perceptual Taxonomy annotates 3173 objects with 84 fine-grained attributes and builds a multiple-choice benchmark with 28033 template-based questions and 50 expert-crafted questions, covering synthetic and real scenes. Result: Leading vision-language models degrade by 10 to 20 percent on property-driven questions, especially on tasks that need multi-step reasoning; in-context reasoning examples from simulated scenes improve performance on real-world questions. Conclusion: Current models have a clear weakness in structured visual understanding, since reliance on pattern matching does not support goal-directed reasoning, while prompting guided by the perceptual taxonomy helps improve reasoning. Abstract: We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching.
We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.

[48] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training

Hongyu Lyu,Thomas Monninger,Julie Stephany Berrio Perez,Mao Shan,Zhenxing Ming,Stewart Worrall

Main category: cs.CV

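TL;DR: This paper proposes MapRF, a weakly supervised framework that uses 2D image labels to generate 3D map pseudo labels and, through self-training and a Map-to-Ray Matching strategy, builds high-quality online HD maps.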

Details Motivation: Existing 3D map construction methods depend on costly 3D annotations, which limits generalization and scalability across diverse environments. Method: A NeRF-based module conditioned on map predictions generates view-consistent 3D geometric and semantic pseudo labels, and the map network is refined through self-training with a Map-to-Ray Matching strategy. Result: On the Argoverse 2 and nuScenes datasets, performance reaches about 75% of fully supervised methods and exceeds other approaches that use only 2D labels. Conclusion: MapRF reduces dependence on 3D annotations and supports low-cost, scalable online HD map construction. Abstract: Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local maps from on-board sensors. However, existing methods typically rely on costly 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. In this work, we propose MapRF, a weakly supervised framework that learns to construct 3D maps using only 2D image labels. To generate high-quality pseudo labels, we introduce a novel Neural Radiance Fields (NeRF) module conditioned on map predictions, which reconstructs view-consistent 3D geometry and semantics. These pseudo labels are then iteratively used to refine the map network in a self-training manner, enabling progressive improvement without additional supervision. Furthermore, to mitigate error accumulation during self-training, we propose a Map-to-Ray Matching strategy that aligns map predictions with camera rays derived from 2D labels. Extensive experiments on the Argoverse 2 and nuScenes datasets demonstrate that MapRF achieves performance comparable to fully supervised methods, attaining around 75% of the baseline while surpassing several approaches using only 2D labels. This highlights the potential of MapRF to enable scalable and cost-effective online HD map construction for autonomous driving.

[49] Vidi2: Large Multimodal Models for Video Understanding and Creation

Vidi Team,Celong Liu,Chia-Wen Kuo,Chuang Huang,Dawei Du,Fan Chen,Guang Chen,Haoji Zhang,Haojun Zhao,Lingxi Zhang,Lu Guo,Lusha Li,Longyin Wen,Qihang Fan,Qingyu Chen,Rachel Deng,Sijie Zhu,Stuart Siew,Tong Jin,Weiyan Tao,Wen Zhong,Xiaohui Shen,Xin Gu,Zhenfang Chen,Zuhua Lin

Main category: cs.CV

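TL;DR: Vidi2 is an advanced video understanding model that supports fine-grained spatio-temporal grounding and video question answering. It outperforms leading proprietary models (such as Gemini 3 Pro and GPT-5) on the new VUE-STG benchmark and the upgraded VUE-TR-V2, and enables practical applications such as complex video editing.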

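Details Motivation: As video becomes the primary medium on the Internet, demand for high-quality, scalable video creation and understanding has surged; existing models remain limited in long-horizon reasoning, precise spatio-temporal grounding, and real-world applications, so stronger multimodal video understanding is needed. Method: Vidi2 uses an end-to-end architecture for fine-grained spatio-temporal grounding (STG): given a text query, it outputs the matching time ranges and the bounding boxes of the target objects, and it also supports video question answering (Video QA). The authors also build the new VUE-STG benchmark, with longer videos, high-quality manual annotations, an improved query format, and refined evaluation metrics (vIoU/tIoU/vIoU-Intersection), and upgrade VUE-TR to VUE-TR-V2 for a more balanced data distribution. Result: Vidi2 substantially outperforms leading proprietary models such as Gemini 3 Pro (Preview) and GPT-5 on both VUE-STG and VUE-TR-V2, and is competitive with open-source models of similar scale on video QA; VUE-STG covers videos from roughly 10 seconds to 30 minutes, with more accurate spatio-temporal annotations and more user-like queries. Conclusion: By strengthening spatio-temporal grounding and multimodal reasoning, Vidi2 makes video understanding finer-grained and more practical, and together with the new VUE-STG and VUE-TR-V2 benchmarks it supports future video analysis and intelligent editing systems. Abstract: Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping.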
To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
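The abstract names a vIoU/tIoU/vIoU-Intersection metric scheme without defining it. For orientation, the sketch below uses definitions that are common in spatio-temporal grounding evaluation: tIoU is the overlap of predicted and ground-truth frame sets, vIoU sums per-frame box IoU over the shared frames and divides by the size of the frame union, and the intersection variant divides by the number of shared frames instead. These definitions, the function names, and the toy tubes are assumptions; the benchmark's refined scheme may differ.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred, gt):
    """Temporal IoU between two tubes given as {frame_index: box}."""
    union = set(pred) | set(gt)
    return len(set(pred) & set(gt)) / len(union) if union else 0.0


def v_iou(pred, gt, over_intersection=False):
    """Spatio-temporal IoU: per-frame box IoU summed over shared frames.

    Normalized by the frame union (vIoU) or, if over_intersection is
    True, by the shared frames only (assumed vIoU-Intersection).
    """
    shared = set(pred) & set(gt)
    frames = shared if over_intersection else set(pred) | set(gt)
    if not frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(frames)


# Toy tubes: ground truth on frames 0-3, prediction on frames 2-5.
gt_tube = {t: (0, 0, 10, 10) for t in range(0, 4)}
pred_tube = {t: (0, 0, 10, 5) for t in range(2, 6)}

print(t_iou(pred_tube, gt_tube))                           # shared 2 of 6 frames
print(v_iou(pred_tube, gt_tube))                           # (0.5 + 0.5) / 6
print(v_iou(pred_tube, gt_tube, over_intersection=True))   # (0.5 + 0.5) / 2
```

Under these definitions vIoU penalizes both temporal and spatial errors, while the intersection variant isolates box quality on the frames where prediction and ground truth overlap in time.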

[50] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Muhao Guo,Yang Weng

Main category: cs.CV

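TL;DR: This study examines multimodal large language models for cross-regional photovoltaic system detection. Using structured prompts and fine-tuning, it unifies detection, localization, and quantification, and shows better cross-domain generalization than conventional models.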

Details Motivation: Distributed photovoltaic systems are growing quickly but many are undocumented, which complicates grid management, while conventional computer vision models need large amounts of labeled data and generalize poorly across regions. Method: A multimodal large language model, combined with structured prompts and fine-tuning, handles detection, localization, and quantification of PV systems in a unified way and is evaluated on cross-regional data. Result: In cross-regional evaluation measured with the ΔF1 metric, the model shows the smallest performance degradation and outperforms conventional CV and Transformer baselines. Conclusion: Multimodal large language models are more robust under domain shift and have potential for scalable, transferable, and interpretable global PV mapping. Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the ΔF1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
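The abstract does not define ΔF1. One natural reading, assumed here, is the drop in F1 score when moving from the training region to an unseen region. The sketch below computes that quantity from confusion counts; the counts are invented for illustration and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def delta_f1(source_counts, target_counts) -> float:
    """Assumed definition: F1 on the source region minus F1 on the target."""
    return f1_score(*source_counts) - f1_score(*target_counts)


# Invented (tp, fp, fn) counts for a training region and an unseen region.
in_domain = (90, 10, 10)   # F1 = 180 / 200 = 0.90
unseen = (70, 20, 30)      # F1 = 140 / 190, about 0.737

drop = delta_f1(in_domain, unseen)
print(round(drop, 3))
```

Under this reading, a smaller ΔF1 means the model loses less accuracy under domain shift, which is the sense in which the abstract reports the "smallest performance degradation".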

[51] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration

Remi Petitpierre

Main category: cs.CV

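TL;DR: This thesis studies cartographic heritage through a large-scale digital map dataset and a cultural perspective, analyzing the semantic-symbolic systems of maps and their evolution against historical, political, and epistemic backgrounds.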

Details Motivation: Existing automated map analysis methods engage little with the history of cartography or with maps as cultural symbolic systems; this work aims to bridge the gap between technology and the humanities. Method: The work aggregates 771,561 map records and 99,715 images from 38 digital catalogs into a dataset spanning six centuries; it uses semantic segmentation, object detection models, and training on synthetic images to recognize land classes and cartographic signs, and encodes 63 million signs into a latent visual space to analyze figurative evolution. Result: The study links global spatio-temporal patterns of map publication to colonial expansion, Atlantic trade, and military conflicts; finds that map composition shows semantic symmetry and centering; identifies the local consistency of cartographic sign systems and their change over time, such as the replacement of hachures by contour lines; and shows through analyses of collaboration and diffusion that authoritative institutions and major cities play a key role in spreading sign conventions. Conclusion: Maps are not only tools of geographic expression but also symbolic systems that reflect political, epistemic, and cultural expectations, and their form and diffusion are shaped by social structures and historical dynamics. Abstract: This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries.
The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.

[52] Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation

Jaeyeong Kim,Seungwoo Yoo,Minhyuk Sung

Main category: cs.CV

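TL;DR: Proposes SpLap, a proxy-free deformation method for Gaussian splats that computes a Laplacian operator from a newly constructed surface-aware splat graph, defines neighborhoods by the intersections between splats, and adds a Gaussian kernel adaptation technique, producing high-quality deformations that preserve details and topology.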

Details Motivation: Existing Gaussian splat deformation methods depend on proxy structures such as cages or meshes, so they are limited by proxy quality and incur extra computation; treating splats directly as unstructured points fails to capture surface information. A proxy-free method that preserves surface detail and topology is therefore needed. Method: Builds a surface-aware splat graph in which adjacency is based not only on the distance between splat centers but on the spatial intersections between splats; defines a Laplacian operator on this graph for deformation, and introduces a Gaussian kernel adaptation technique to preserve surface structure and rendering quality during deformation. Result: Validated on 50 challenging objects from ShapeNet, Objaverse, Sketchfab, and NeRF-Synthetic; SpLap outperforms existing proxy-based and proxy-free baselines in both visual quality and quantitative metrics. Conclusion: SpLap is an effective proxy-free deformation method for Gaussian splats; its surface-aware graph structure and improved Gaussian kernel mechanism deliver high-quality, detail-preserving deformation and offer a new approach to GS editing and animation. Abstract: We introduce SpLap, a proxy-free deformation method for Gaussian splats (GS) based on a Laplacian operator computed from our novel surface-aware splat graph. Existing approaches to GS deformation typically rely on deformation proxies such as cages or meshes, but they suffer from dependency on proxy quality and additional computational overhead. An alternative is to directly apply Laplacian-based deformation techniques by treating splats as point clouds. However, this often fails to properly capture surface information due to lack of explicit structure. To address this, we propose a novel method that constructs a surface-aware splat graph, enabling the Laplacian operator derived from it to support more plausible deformations that preserve details and topology. Our key idea is to leverage the spatial arrangement encoded in splats, defining neighboring splats not merely by the distance between their centers, but by their intersections. Furthermore, we introduce a Gaussian kernel adaptation technique that preserves surface structure under deformation, thereby improving rendering quality after deformation. In our experiments, we demonstrate the superior performance of our method compared to both proxy-based and proxy-free baselines, evaluated on 50 challenging objects from the ShapeNet, Objaverse, and Sketchfab datasets, as well as the NeRF-Synthetic dataset. Code is available at https://github.com/kjae0/SpLap.
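
The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each splat is approximated by a sphere (a center plus a hypothetical radius), treats two splats as neighbors when their spheres overlap, and builds the uniform graph Laplacian L = D - A. The actual intersection test and edge weights used by SpLap are not stated in the abstract.

```python
import math


def splat_graph_laplacian(centers, radii):
    """Uniform graph Laplacian L = D - A over an overlap-based splat graph.

    Assumption (not from the paper): splats are approximated as spheres and
    two splats are neighbors when center distance < sum of their radii.
    """
    n = len(centers)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j]:
                adj[i][j] = adj[j][i] = 1.0
    # Off-diagonal entries are -A; the diagonal holds each node's degree.
    lap = [[-a for a in row] for row in adj]
    for i in range(n):
        lap[i][i] = sum(adj[i])
    return lap


# Two overlapping splats and one isolated splat (toy data).
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
radii = [0.6, 0.6, 0.6]
laplacian = splat_graph_laplacian(centers, radii)
```

A distance-only k-nearest-neighbor graph would connect the third splat to the others regardless of whether they touch; the overlap test leaves it isolated, which is the intuition behind intersection-based neighborhoods.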

[53] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Ehsan Karimi,Nhut Le,Maryam Rahnemoonfar

Main category: cs.CV

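TL;DR: Proposes ThiFAN-VQA, a two-stage reasoning-based visual question answering framework for damage assessment from UAV imagery in disaster scenarios; it combines domain-specific prompting with reasoning-guided answer selection to achieve high accuracy and interpretability under limited supervision.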

Details Motivation: Existing disaster damage assessment methods rely on fixed classification structures and need large amounts of annotated data, while generative models tend to hallucinate or give generic answers, lacking flexibility and domain relevance. Method: A two-stage reasoning framework: the first stage uses chain-of-thought (CoT) prompting and in-context learning to generate structured reasoning traces; the second stage uses an answer selection module to pick the most coherent and accurate answer, and integrates custom information retrieval with domain-specific prompting. Result: Experiments on FloodNet and RescueNet-VQA show that ThiFAN-VQA strikes a balance between zero-shot and supervised methods, improving accuracy, interpretability, and adaptability. Conclusion: ThiFAN-VQA resolves the tension between model flexibility and consistency in disaster assessment with few samples and few labels, providing a reliable, interpretable VQA solution for practical use. Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improving model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

[54] HunyuanOCR Technical Report

Hunyuan Vision Team,Pengyuan Lyu,Xingyu Wan,Gengluo Li,Shangpin Peng,Weinong Wang,Liang Wu,Huawen Shen,Yu Zhou,Canhui Tang,Qi Yang,Qiming Peng,Bin Luo,Hower Yang,Houwen Peng,Hongming Yang,Senhao Xie,Binghong Wu,Mana Yang,Sergey Wang,Raccoon Liu,Dick Zhu,Jie Jiang,Linus,Han Hu,Chengquan Zhang

Main category: cs.CV

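TL;DR: HunyuanOCR is a lightweight, open-source 1B-parameter vision-language model built for OCR; it performs strongly across perception and semantic tasks, uses a unified end-to-end architecture, and improves performance through high-quality data and reinforcement learning strategies.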

Details Motivation: Existing OCR systems are limited either by the poor generalization of specialized models or by the inefficiency of general VLMs, and traditional pipelines suffer from error propagation; an OCR solution that is efficient, general, and robust is needed. Method: A pure end-to-end architecture that connects a native ViT to a lightweight LLM through an MLP adapter, with no dependence on pre-processing modules such as layout analysis; trained on high-quality data, and, according to the authors, the first in the industry to apply reinforcement learning strategies to improve OCR performance. Result: First place in the Small Model Track of the ICDAR 2025 DIMT Challenge; SOTA on OCRBench among VLMs with fewer than 3B parameters; outperforms commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Conclusion: HunyuanOCR unifies versatility with efficiency, validates end-to-end architectures and reinforcement learning for OCR, and provides a high-performance, easy-to-deploy open-source foundation model for academic research and industrial applications. Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

[55] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach

Maria Thoma,Michalis A. Savelonas,Dimitris K. Iakovidis

Main category: cs.CV

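TL;DR: Proposes a segmentation method based on a semi-supervised generative adversarial network (GAN) to accurately delineate early ischemic stroke regions in non-contrast CT scans, improving diagnosis of hyperacute stroke.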

Details Motivation: Non-contrast CT (NCCT) often fails to reveal hyperacute stroke when early ischemic changes are subtle, which can delay treatment, so a more sensitive automated detection method is needed. Method: A semi-supervised generative adversarial network (GAN) trained with Dice loss, cross-entropy loss, a feature matching loss, and a self-training loss on a small set of annotated and a large set of unlabeled NCCT scans, to improve recognition of small or faint infarct regions. Result: Experiments on the public Acute Ischemic Stroke Dataset (AISD) show that the method can segment early ischemic regions, reduce the manual annotation burden, and improve diagnostic performance. Conclusion: The proposed semi-supervised GAN framework identifies early stroke lesions using limited annotated data, has potential to support rapid clinical decision-making, and may help improve patient outcomes. Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.
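
As a rough illustration of the supervised part of such an objective, the sketch below combines a soft Dice loss with binary cross-entropy on flattened per-pixel probabilities. It is an assumption-laden toy, not the paper's loss: the weights `w_dice` and `w_ce` are hypothetical, and the feature matching and self-training terms mentioned in the abstract are omitted.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flattened probabilities; 0 means perfect overlap."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clamped for stability."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def supervised_loss(pred, target, w_dice=1.0, w_ce=1.0):
    """Weighted sum of Dice and cross-entropy (weights are hypothetical)."""
    return w_dice * dice_loss(pred, target) + w_ce * bce_loss(pred, target)


mask = [1.0, 0.0, 1.0, 0.0]
good = supervised_loss([1.0, 0.0, 1.0, 0.0], mask)
bad = supervised_loss([0.0, 1.0, 0.0, 1.0], mask)
```

Dice is commonly paired with cross-entropy for small lesions because it is driven by overlap rather than by the (mostly background) pixel count.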

[56] Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

Dimitrios E. Diamantis,Dimitris K. Iakovidis

Main category: cs.CV

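TL;DR: Proposes a new synthetic data generation method based on a Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE) for Wireless Capsule Endoscopy (WCE) images; it can introduce multiple types of abnormal findings and improve the training of clinical decision support systems.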

Details Motivation: Medical image data are scarce because of privacy constraints and annotation costs, which limits deep learning-based clinical decision support systems; effective synthetic data generation is needed to ease this shortage. Method: Proposes a Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE) that uses a multiscale structure and conditional generation to seamlessly introduce abnormalities (such as polyps, vascular lesions, and inflammatory lesions) into normal WCE images. Result: The generated abnormal WCE images perform comparably to real data in classification: a classifier trained with synthetic data approaches the performance of one trained with real data. Conclusion: MSVQ-VAE generates diverse and realistic abnormal WCE images, helps relieve medical data scarcity, and has potential for broad use in medical multimedia. Abstract: Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse medical training datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders, have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work include a) multiscale extension of the Vector Quantized VAE model, named Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, enabling the introduction of different types of abnormalities into the normal WCE images; d) it performs experiments with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yields results comparable to those of a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.

[57] SkillSight: Efficient First-Person Skill Assessment with Gaze

Chi Hsuan Wu,Kumar Ashutosh,Kristen Grauman

Main category: cs.CV

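TL;DR: Introduces SkillSight, a power-efficient skill assessment method for first-person data that models skill level from gaze and video jointly and distills a lightweight gaze-only model, sharply reducing power consumption to support efficient skill learning in real-world settings.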

Details Motivation: Egocentric perception on smart glasses could change how we learn new skills in the physical world, but automatic skill assessment remains a key technical challenge. Existing methods rely on continuous video processing, which is power-hungry and hard to deploy long-term in the wild. Method: A two-stage framework: first train a teacher model that fuses gaze and egocentric video to predict skill level, then distill its knowledge into a student model that uses gaze input only. Result: Experiments on three real-world datasets covering cooking, music, and sports show that gaze is valuable for skill understanding; the teacher model achieves state-of-the-art performance, while the gaze-only student uses 73x less power than existing methods and maintains high accuracy. Conclusion: Incorporating attention (gaze) improves the efficiency and practicality of skill assessment, and the method paves the way for in-the-wild AI-supported skill learning. Abstract: Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
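
The abstract does not say which distillation loss is used. As one plausible reading, the sketch below shows the standard temperature-softened KL divergence between teacher and student class distributions; the temperature value and the three-level skill labels are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over three skill levels (novice, intermediate, expert).
teacher = [0.2, 1.0, 3.0]   # gaze + video teacher
student = [0.5, 1.5, 1.0]   # gaze-only student before training
loss = distillation_kl(teacher, student)
```

Minimizing this term pushes the gaze-only student toward the teacher's predictions, so the video stream is only needed at training time.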

[58] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

Ruimin Feng,Xingxin He,Ronald Mercer,Zachary Stewart,Fang Liu

Main category: cs.CV

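TL;DR: Proposes a semantic distribution-guided reconstruction framework built on a vision-language foundation model; adding high-level semantic information improves undersampled MRI reconstruction, and experiments show clear gains in perceptual quality and anatomical detail while maintaining data fidelity.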

Details Motivation: Conventional MRI reconstruction relies on low-level priors and makes little use of high-level semantic information, which limits performance at high undersampling rates. This work asks whether a vision-language foundation model can supply high-level context beyond conventional priors. Method: A semantic distribution-guided reconstruction framework that uses a pre-trained vision-language model to encode the reconstructed image and auxiliary information (image-only or image-language) into high-level semantic features, and a contrastive objective that aligns the reconstruction with the target semantic distribution; it is compatible with various deep learning reconstruction methods and accepts multimodal semantic priors. Result: On knee and brain datasets, image priors better preserve fine anatomical structures, with lower LPIPS, higher Tenengrad scores, and better reader-study scores; joint image-language priors further expand the semantic distribution and enable high-level control over reconstruction attributes; the contrastive objective steers features toward the target semantic distribution while maintaining data fidelity. Conclusion: Vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization, supporting the potential of high-level semantic priors in medical image reconstruction. Abstract: Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.
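
The exact form of the contrastive objective is not given in the abstract. The sketch below uses an InfoNCE-style loss as a stand-in: the feature of the reconstructed image is pulled toward a feature from the target semantic distribution and pushed away from other features. The 2-D feature vectors, the choice of negatives, and the temperature are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


recon_feature = [1.0, 0.0]          # encoded reconstruction (toy values)
target_feature = [1.0, 0.0]         # sample from the target semantic prior
off_target_feature = [0.0, 1.0]     # e.g. feature of an aliased image
aligned = info_nce(recon_feature, target_feature, [off_target_feature])
misaligned = info_nce(recon_feature, off_target_feature, [target_feature])
```

In a reconstruction pipeline a term like this would be added to the usual data-consistency loss, which is how fidelity to the measured k-space data is kept.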

[59] Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley,Kian R. Weihrauch,Katherine Latham,Andrew Z. Zhou,Padmini A. Manrai,Arjun K. Manrai

Main category: cs.CV

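TL;DR: Introduces GIANT, the first framework that lets large multimodal models iteratively navigate whole-slide images (WSIs) like a pathologist, and releases MultiPathQA, a benchmark of 934 questions; the approach substantially outperforms conventional patch- or thumbnail-based methods and approaches or even surpasses specialized models on some tasks.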

Details Motivation: General-purpose large models have performed poorly at interpreting medical images, especially gigapixel pathology images, largely because prior studies used low-resolution thumbnails or random patches, which may underestimate what the models can do. A method that explores and reasons over WSIs systematically is needed. Method: Proposes the GIANT framework, which lets large models iteratively navigate whole-slide images; builds and releases the MultiPathQA benchmark, covering multiple clinically relevant tasks and questions written by professional pathologists; evaluates models such as GPT-5 within the framework. Result: GIANT substantially outperforms conventional patch- and thumbnail-based baselines; on pathologist-authored questions, GPT-5 with GIANT reaches 62.5% accuracy, ahead of TITAN (43.8%) and SlideChat (37.5%), approaching or surpassing specialized models. Conclusion: Current foundation models show potential for expert-level reasoning in pathology, and a suitable agentic navigation framework such as GIANT can substantially improve performance, pointing to a new direction for medical multimodal models. Abstract: Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.

[60] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou,Shaoyuan Xu,Manan Biyani,Mayan Li,Jia Liu,Todd C. Hollon,Bryan Wang

Main category: cs.CV

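TL;DR: Proposes a new protocol for evaluating whether vision-language models reason faithfully when they use image tools, and introduces CodeV, a code-based visual agent, together with TAPO, a new reinforcement learning framework, to improve the trustworthiness and accuracy of multimodal reasoning.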

Details Motivation: Existing vision-language models can reach high accuracy without actually using the information returned by visual tools in their intermediate steps, so tool use can be unfaithful. Building more reliable visual reasoning systems requires effective supervision of intermediate behavior. Method: Proposes an evaluation protocol that measures whether a model uses visual tools faithfully; designs CodeV, which represents visual tools as executable Python code; applies the TAPO framework, which gives dense process-level rewards based on tool inputs and outputs rather than supervising chain-of-thought tokens alone. Result: Experiments show that current visual agents achieve high final-answer accuracy but low rates of faithful tool use; CodeV keeps competitive or better accuracy while substantially raising the faithful tool-use rate, and performs strongly on a range of multimodal reasoning and math benchmarks. Conclusion: Directly supervising the intermediate tool behavior of vision-language models is crucial for building trustworthy, agentic visual reasoning systems. Abstract: Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
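
One simple way to picture a faithfulness check on crops is a geometric one, sketched below. This is an illustrative assumption, not the paper's protocol (which may, for example, rely on a judge model): a crop counts as faithful when it covers at least a hypothetical `threshold` fraction of an annotated evidence box given as `(x0, y0, x1, y1)`.

```python
def evidence_coverage(crop, evidence):
    """Fraction of the evidence box (x0, y0, x1, y1) that lies inside the crop."""
    ix0, iy0 = max(crop[0], evidence[0]), max(crop[1], evidence[1])
    ix1, iy1 = min(crop[2], evidence[2]), min(crop[3], evidence[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (evidence[2] - evidence[0]) * (evidence[3] - evidence[1])
    return inter / area if area > 0 else 0.0


def faithful_rate(steps, threshold=0.5):
    """Share of tool calls whose crop covers enough of the evidence box."""
    if not steps:
        return 0.0
    hits = sum(1 for crop, ev in steps if evidence_coverage(crop, ev) >= threshold)
    return hits / len(steps)


evidence_box = (10.0, 10.0, 20.0, 20.0)
steps = [
    ((0.0, 0.0, 100.0, 100.0), evidence_box),    # crop contains the evidence
    ((0.0, 0.0, 15.0, 100.0), evidence_box),     # crop cuts the evidence in half
    ((50.0, 50.0, 100.0, 100.0), evidence_box),  # crop misses the evidence
]
rate = faithful_rate(steps)
```

A metric of this kind is independent of the final answer, so a model that guesses correctly after cropping an irrelevant region still scores as unfaithful on that step.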

[61] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis

Istiak Ahmed,Galib Ahmed,K. Shahriar Sanjid,Md. Tanzim Hossain,Md. Nishan Khan,Md. Misbah Khan,Md. Arifur Rahman,Sheikh Anisul Haque,Sharmin Akhtar Rupa,Mohammed Mejbahuddin Mia,Mahmud Hasan Mostofa Kamal,Md. Mostafa Kamal Sarker,M. Monir Uddin

Main category: cs.CV

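TL;DR: OncoVision is a multimodal AI system that combines mammography images with clinical data; an attention-based encoder-decoder architecture segments breast cancer lesions and predicts clinical features, late-fusion strategies improve diagnostic precision, and the system supports real-time diagnostic assistance and medical teaching.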

Details Motivation: Improve the accuracy of early breast cancer diagnosis, reduce inter-observer variability, and make screening more accessible in under-resourced regions. Method: An attention-based encoder-decoder backbone jointly segments four ROIs (masses, calcifications, axillary findings, breast tissues) and predicts ten clinical features; two late-fusion strategies integrate imaging and clinical data. Result: Achieves state-of-the-art segmentation accuracy, robustly predicts multiple structured clinical features, and improves diagnostic precision and consistency; the system is deployed as a secure, user-friendly web application. Conclusion: By fusing multimodal data, OncoVision improves the accuracy and interpretability of breast cancer diagnosis, integrates well into clinical practice, and helps advance equitable breast cancer screening worldwide. Abstract: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.

[62] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

Parsa Madinei,Ryan Solgi,Ziqi Wen,Jonathan Skaza,Miguel Eckstein,Ramtin Pedarsani

Main category: cs.CV

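TL;DR: INTERLACE is a new layer pruning framework for vision-language models (VLMs); by analyzing redundancy across triplets of consecutive layers and using an interleaved finetune-freeze strategy, it retains 88.9% of performance after removing 25% of the network's layers while finetuning on only 1% of the data.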

Details Motivation: Existing layer pruning methods cause a significant performance drop on vision-language models and lack effective mechanisms for identifying redundancy and recovering performance. Method: Analyzes local redundancy within triplets of consecutive layers, removes the more redundant of the first two layers, finetunes the remaining layer to recover performance, and freezes the third layer as a stable anchor, enabling sample-efficient finetuning. Result: After one epoch of finetuning on only 1% of the FineVision dataset, removing 25% of the network's layers retains 88.9% of average performance, better than existing methods. Conclusion: Through local redundancy analysis and an interleaved finetune-freeze strategy, INTERLACE compresses VLMs efficiently while preserving performance, advancing model compression for multimodal models. Abstract: We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to a significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy, remove the most redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We found that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, INTERLACE achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: https://github.com/pmadinei/Interlace.git
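
The drop/finetune/freeze assignment can be sketched as a small planning function. Everything beyond the triplet rule itself is an assumption here: the per-layer redundancy scores are hypothetical inputs (the abstract does not say how redundancy is measured), triplets are taken as non-overlapping, and the triplets to prune are chosen by highest redundancy.

```python
def plan_interlace(redundancy, num_drop):
    """Assign layers to drop / finetune / freeze from per-layer redundancy.

    Assumptions (not from the paper): non-overlapping triplets, and the
    triplets to prune are those whose dropped layer is most redundant.
    """
    n = len(redundancy)
    candidates = []
    for start in range(0, n - 2, 3):
        a, b, c = start, start + 1, start + 2
        # Drop the more redundant of the first two layers; tune the other.
        drop, tune = (a, b) if redundancy[a] >= redundancy[b] else (b, a)
        candidates.append((redundancy[drop], drop, tune, c))
    candidates.sort(reverse=True)
    chosen = candidates[:num_drop]
    return {
        "drop": sorted(d for _, d, _, _ in chosen),
        "finetune": sorted(t for _, _, t, _ in chosen),
        "freeze": sorted(f for _, _, _, f in chosen),
    }


# Hypothetical redundancy scores for a 12-layer model; drop 3 layers (25%).
scores = [0.9, 0.2, 0.1, 0.3, 0.8, 0.1, 0.1, 0.2, 0.1, 0.5, 0.4, 0.1]
plan = plan_interlace(scores, num_drop=3)
```

With these toy scores the plan drops layers 0, 4, and 9, finetunes layers 1, 3, and 10, and freezes layers 2, 5, and 11, so trainable and frozen layers alternate around each removed layer.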

[63] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants

Vivek Chavan,Yasmina Imgrund,Tung Dao,Sanwantri Bai,Bosong Wang,Ze Lu,Oliver Heimann,Jörg Krüger

Main category: cs.CV

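TL;DR: IndEgo is a multimodal egocentric and exocentric dataset of industrial tasks, with 3,460 egocentric videos (about 197 hours) and 1,092 exocentric videos (about 97 hours), focused on challenging tasks such as collaborative task understanding, mistake detection, and reasoning-based question answering.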

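Details Motivation: Existing datasets cover the collaborative, multimodal, and cognitively complex aspects of industrial scenarios poorly, and lack data for mistake detection and deeper reasoning tasks. Method: Collects videos of two-person collaborative tasks in real industrial environments across a variety of industrial activities; records egocentric data (including eye gaze, narration, sound, and motion) and exocentric data; provides detailed annotations (actions, summaries, mistakes, narrations), metadata, processed outputs (such as hand pose and point clouds), and several benchmark tasks. Result: The dataset contains 3,460 egocentric videos (about 197 hours) and 1,092 exocentric videos (about 97 hours), with baseline results for procedural and non-procedural task understanding, mistake detection, and question answering; experiments show that current state-of-the-art multimodal models still find the dataset challenging. Conclusion: IndEgo provides high-quality, high-complexity data for multimodal collaborative task understanding in industrial settings and supports progress on advanced tasks such as mistake detection and reasoning-based question answering. Abstract: We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo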

[64] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation

Abdurahman Ali Mohammed,Wallapak Tavanapong,Catherine Fonder,Donald S. Sakaguchi

Main category: cs.CV

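TL;DR: Proposes a prototype-based method for interpretable cell counting via density map estimation, with its interpretability and effectiveness confirmed through validation by biologists.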

Details Motivation: Deep learning models for cell counting in biomedical images lack interpretability, which limits their trustworthiness and clinical adoption. Method: Adds a prototype layer to the density estimation network to learn representative visual patterns for cells and background artifacts, and generates interpretations that highlight the input regions most similar to each prototype. Result: Experiments on two public datasets show that the method achieves good interpretability while maintaining counting performance; a survey of biologists confirmed the biological relevance of the prototypes. Conclusion: The method offers a transparent, reliable deep learning tool for cell counting, helping to build trust in the model and support its adoption in critical biomedical applications. Abstract: Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose a novel prototype-based method for interpretable cell counting via density map estimation. Our approach integrates a prototype layer into the density estimation network, enabling the model to learn representative visual patterns for both cells and background artifacts. The learned prototypes were evaluated through a survey of biologists, who confirmed the relevance of the visual patterns identified, further validating the interpretability of the model. By generating interpretations that highlight regions in the input image most similar to each prototype, our method offers a clear understanding of how the model identifies and counts cells. Extensive experiments on two public datasets demonstrate that our method achieves interpretability without compromising counting effectiveness. This work provides researchers and clinicians with a transparent and reliable tool for cell counting, potentially increasing trust and accelerating the adoption of deep learning in critical biomedical applications. Code is available at https://github.com/NRT-D4/CountXplain.

[65] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama,Darshil Jariwala,Avigyan Bhattacharya,Seungchan Kim,Wenshan Wang,Sebastian Scherer

Main category: cs.CV

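TL;DR: Proposes RADSeg, which uses the RADIO model to substantially improve zero-shot open-vocabulary semantic segmentation (OVSS), outperforming existing methods on mIoU, latency, and parameter efficiency at the same time.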

Details Motivation: Existing OVSS methods are limited by their training data or depend on combinations of multiple models, which leads to poor generalization and high compute and memory costs; an efficient, high-performing unified solution is missing. Method: Builds on the RADIO foundation model and adds self-correlating recursive attention, self-correlating global aggregation, and an efficient mask refinement strategy to improve zero-shot OVSS. Result: RADSeg improves mIoU by 6-30% in the base ViT class while running 3.95x faster with 2.5x fewer parameters; with only 105M parameters it surpasses earlier combinations of large models totaling 850-1350M parameters. Conclusion: RADSeg delivers zero-shot open-vocabulary semantic segmentation with higher accuracy, lower latency, and fewer parameters, offering a new paradigm for efficient use of vision foundation models. Abstract: Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g., CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

[66] Rethinking Vision Transformer Depth via Structural Reparameterization

Chengwei Zhou,Vipin Chaudhary,Gourav Datta

Main category: cs.CV

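TL;DR: Proposes a branch-based structural reparameterization method that trains with parallel branches and merges them into a single-path model for inference, substantially reducing the number of Vision Transformer layers without losing accuracy.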

Details Motivation: Existing acceleration methods focus on algorithm-level optimizations and overlook the option of cutting compute by reducing the number of stacked transformer layers. Method: Introduces parallel branches inside transformer blocks and gradually merges them at the entry points of nonlinear components, giving exact mathematical reparameterization of the FFN and MHSA modules. Result: Compresses ViT-Tiny from 12 layers to 3-6 layers while maintaining accuracy on ImageNet-1K, with inference speedups of up to 37% on mobile CPUs. Conclusion: Deep stacking is not strictly necessary; structural reparameterization can substantially improve Vision Transformer inference efficiency without sacrificing performance. Abstract: The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
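
The reason merging at the entry of a nonlinearity can be exact is that parallel linear branches on the same input collapse into one linear map: W1 x + b1 + W2 x + b2 = (W1 + W2) x + (b1 + b2). The sketch below checks this identity on toy 2x2 weights; it illustrates the underlying algebra only and is not the paper's full merging schedule for FFN and MHSA modules.

```python
def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def linear(branch, x):
    """Apply one linear branch given as (weight, bias)."""
    weight, bias = branch
    return [y + b for y, b in zip(matvec(weight, x), bias)]


def merge_branches(branches):
    """Collapse parallel linear branches into one by summing weights and biases."""
    first_w, first_b = branches[0]
    weight = [row[:] for row in first_w]
    bias = first_b[:]
    for w, b in branches[1:]:
        for i, row in enumerate(w):
            for j, value in enumerate(row):
                weight[i][j] += value
            bias[i] += b[i]
    return weight, bias


branch_a = ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
branch_b = ([[0.0, 2.0], [3.0, 0.0]], [1.0, -1.0])
x = [1.0, 2.0]

# Training-time form: evaluate both branches and add their outputs.
multi_branch = [u + v for u, v in zip(linear(branch_a, x), linear(branch_b, x))]
# Inference-time form: a single merged linear layer.
single_path = linear(merge_branches([branch_a, branch_b]), x)
```

Both forms produce the same output for any input, so the merged layer costs one matrix multiply at inference with no approximation error.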

[67] Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Debin Meng,Chen Jin,Zheng Gao,Yanran Li,Ioannis Patras,Georgios Tzimiropoulos

Main category: cs.CV

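TL;DR: Proposes TPSO, a training-free, model-agnostic module that improves the diversity of text-to-image generation models by optimizing the token-prompt embedding space while preserving image quality.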

Details Motivation: Existing methods for improving generation diversity tend to collapse to dominant modes or degrade image quality, so a new approach is needed that increases diversity without harming quality. Method: Proposes Token-Prompt embedding Space Optimization (TPSO), which introduces learnable parameters to explore underrepresented regions of the token embedding space and uses prompt-level semantic constraints to prevent distribution shift and quality degradation. Result: Experiments on MS-COCO and three diffusion models show that TPSO raises the generation diversity metric from 1.10 to 4.18 without sacrificing image quality. Conclusion: TPSO is an effective, general, training-free module that markedly improves the diversity of text-to-image diffusion models while keeping outputs high-fidelity. Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

[68] Maritime Small Object Detection from UAVs using Deep Learning with Altitude-Aware Dynamic Tiling

Sakib Ahmed,Oscar Pizarro

Main category: cs.CV

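TL;DR: Proposes an altitude-aware dynamic tiling method that improves detection accuracy and inference speed for small maritime objects in UAV search-and-rescue missions.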

Details Motivation: At high altitudes, small objects are hard to detect because of the low object-to-background pixel ratio, which limits the effectiveness of UAVs in maritime search and rescue. Method: Combines an altitude-dependent scaling strategy with an adaptive tiling factor to adjust image subdivision dynamically, and uses YOLOv5 with the SAHI framework for small object detection. Result: On the SeaDronesSee dataset, small-object mAP improves by 38% over the baseline, and inference is more than twice as fast as static tiling. Conclusion: The method markedly improves the efficiency and accuracy of UAV small-object detection in complex environments, helping make maritime search-and-rescue missions more automated and effective. Abstract: Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect from high altitudes due to low object-to-background pixel ratios. We propose an altitude-aware dynamic tiling method that scales and adaptively subdivides the image into tiles for enhanced small object detection. By integrating altitude-dependent scaling with an adaptive tiling factor, we reduce unnecessary computation while maintaining detection performance. Tested on the SeaDronesSee dataset [1] with YOLOv5 [2] and the Slicing Aided Hyper Inference (SAHI) framework [3], our approach improves Mean Average Precision (mAP) for small objects by 38% compared to a baseline and achieves more than double the inference speed compared to static tiling. This approach enables more efficient and accurate UAV-based SAR operations under diverse conditions.

[69] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho,Helder Dias,Bruno Martins

Main category: cs.CV

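TL;DR: CropVLM is a low-cost external method, trained with reinforcement learning, that lets vision-language models dynamically focus on key image regions to improve fine-grained image understanding, without modifying or fine-tuning the original model and thus avoiding catastrophic forgetting.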

Details Motivation: Existing vision-language models perform poorly on tasks that require fine-grained image understanding, limited by perception capability and visual fragmentation. Method: Proposes CropVLM, trained with reinforcement learning to dynamically crop and focus on relevant image regions, without human-labeled bounding boxes as supervision or expensive synthetic evaluations. Result: CropVLM substantially improves performance on high-resolution image understanding tasks, especially on benchmarks that are out-of-domain for the target VLM. Conclusion: CropVLM generalizes to both open-source and proprietary vision-language models and strengthens their fine-grained understanding without fine-tuning, making it practical and broadly applicable. Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

[70] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Xinran Liu,Elaheh Akbari,Rocio Diaz Martin,Navid NaderiAlizadeh,Soheil Kolouri

Main category: cs.CV

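TL;DR: Studies the transferability of optimized slicers across distribution pairs in the min-Sliced Transport Plan (min-STP) framework, proves their stability under small perturbations of the data distributions, and proposes a minibatch variant with statistical guarantees; experiments show good transfer performance and efficiency on point cloud alignment and generative modeling.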

Details Motivation: Classical optimal transport (OT) is computationally expensive, which limits its scalability. Slice-based methods improve efficiency, but it remains unclear whether the learned optimal slicers transfer to new tasks under distribution shift; this paper investigates that transferability. Method: Studies the min-Sliced Transport Plan (min-STP) framework, analyzes the behavior of optimized slicers under distribution perturbations, and proposes a minibatch formulation of min-STP with guarantees on statistical accuracy. Result: Theory shows that optimized slicers remain stable under slight distribution perturbations, supporting efficient cross-task transfer; the proposed minibatch min-STP improves scalability and shows good accuracy in experiments. Conclusion: Optimized slicers transfer well and can produce effective transport plans for new distribution pairs; combined with the minibatch formulation, this markedly improves scalability for tasks such as point cloud alignment and generative modeling. Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.

[71] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Chengyue Huang,Mellon M. Zhang,Robert Azarcon,Glen Chou,Zsolt Kira

Main category: cs.CV

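TL;DR: MAPS is a new fine-tuning framework for vision-language-action (VLA) models that uses module-wise proximity scheduling to improve adaptability while preserving the priors of the pretrained vision-language model; it needs no additional parameters or data and markedly improves performance on multiple benchmarks and in real-world settings.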

Details Motivation: Existing VLA fine-tuning approaches, such as freezing modules or uniform regularization, tend to disrupt the pretrained vision-language representations, hurt generalization, and ignore the functional differences between modules. Method: Proposes MAPS (Module-Wise Proximity Scheduling), which systematically analyzes each module's sensitivity to prior preservation and linearly schedules the relaxation of each module's constraint in an empirically determined order, keeping the visual encoder close to its pretrained prior while letting action-oriented language layers adapt more freely. Result: Validated on VLA models including MiniVLA-VQ, MiniVLA-OFT, and OpenVLA-OFT and on benchmarks including SimplerEnv, CALVIN, and LIBERO, MAPS consistently improves in-distribution and out-of-distribution performance by up to 30% and performs well on the Franka Emika Panda platform. Conclusion: Empirically guided, module-wise proximity constraints are an effective principle for efficient VLM-to-VLA transfer that retains strong generalization. Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes (freezing modules or applying uniform regularization) either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.

[72] Leveraging Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools

Abdul Rahman Diab,Emily E. Karn,Renchin Wu,Emily S. Ruiz,William Lotter

Main category: cs.CV

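TL;DR: PathFMTools is a lightweight, extensible Python package for efficiently running, analyzing, and visualizing pathology foundation models. The paper uses it to evaluate two state-of-the-art models, CONCH and MUSK, on histological grading of cSCC, validating the potential of training small specialist models on foundation model embeddings.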

Details Motivation: Adapting pathology foundation models to clinical tasks is hampered by the complexity of whole-slide image processing, opaque features, and the variety of adaptation strategies, so tooling is needed to improve their usability and interpretability. Method: Develops the PathFMTools package with support for two vision-language foundation models, CONCH and MUSK, and benchmarks multiple adaptation strategies on 440 cSCC H&E whole-slide images. Result: Compares multiple adaptation strategies, reveals trade-offs across prediction approaches, and validates training small specialist models on foundation model embeddings. Conclusion: Pathology foundation models hold promise for real-world clinical use, and PathFMTools supports their efficient analysis and validation. Abstract: Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC H&E WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.

[73] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Yuefei Chen,Jiang Liu,Xiaodong Lin,Ruixiang Tang

Main category: cs.CV

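TL;DR: Proposes CounterVQA, a new benchmark for evaluating counterfactual reasoning in video, and introduces CFGPT, a post-training method that improves model performance on this task.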

Details Motivation: Vision-language models have advanced in video understanding, but counterfactual reasoning (inferring possible outcomes under hypothetical conditions) remains underexplored, even though it is essential for understanding the causal structure in videos. Method: Designs CounterVQA, a video question-answering benchmark with three difficulty levels, to systematically evaluate the counterfactual reasoning of existing models, and proposes CFGPT, which strengthens models by distilling counterfactual reasoning capability from the language modality. Result: Current state-of-the-art models do reasonably well on simple counterfactual questions but degrade significantly on complex questions involving multi-hop causal chains; CFGPT yields consistent gains at all difficulty levels. Conclusion: Counterfactual reasoning is a key capability that video understanding still lacks; CounterVQA offers an effective benchmark for it, and CFGPT shows that cross-modal distillation is a viable way to improve model reasoning. Abstract: Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.

[74] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

Muchang Bahng,Charlie Berens,Jon Donnelly,Eric Chen,Chaofan Chen,Cynthia Rudin

Main category: cs.CV

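TL;DR: Proposes a multimodal, cost-aware species detection method based on prototype networks that fuses image and genetic data and allocates expensive genetic sequencing selectively, improving interpretability while maintaining high accuracy.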

Details Motivation: Conventional multimodal neural networks for species detection lack interpretability and depend on genetic data that is expensive and invasive to collect, limiting their use in conservation. Method: Extends prototype networks (ProtoPNets) to the multimodal setting, ensembling prototypes from each modality with weights that indicate how much a prediction relies on each; adds mechanisms to identify cases where confident predictions are possible without genetic data, so that image data is used first. Result: The method calls on expensive genetic information only when fine-grained distinctions require it and handles visually clear cases with abundant image data, with accuracy comparable to methods that always use both modalities. Conclusion: The method preserves interpretability while reducing reliance on expensive genetic data, balancing cost and performance, and suits automated species detection in ecological monitoring. Abstract: Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents the interpretability of their decision making process. Second, collecting genetic data is often expensive and requires invasive procedures, often necessitating researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.

[75] DesignPref: Capturing Personal Preferences in Visual Design Generation

Yi-Hao Peng,Jeffrey P. Bigham,Jason Wu

Main category: cs.CV

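TL;DR: Introduces DesignPref, a dataset of 12k pairwise comparisons of generated UI designs, annotated by 20 professional designers with multi-level preference ratings. It reveals substantial disagreement among designers (Krippendorff's alpha = 0.25) and shows that conventional majority voting does not accurately reflect individual preferences. The study explores several personalization strategies and finds that personalized models predict individual preferences significantly better than aggregated baselines, even with less data.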

Details Motivation: Visual design is highly subjective and varies across individuals, so annotation data based on group preferences cannot accurately reflect individual design preferences, leaving generative model evaluation disconnected from personalization needs. Method: Builds the DesignPref dataset of 12k pairwise comparisons with multi-level ratings, analyzes inter-designer agreement, and uses natural language rationales to investigate sources of disagreement; compares conventional aggregated models with several personalization approaches (such as fine-tuning and RAG) at predicting individual preferences. Result: Professional designers disagree substantially (Krippendorff's alpha = 0.25); personalized models outperform aggregated models at predicting individual preferences, even with 20 times fewer examples. Conclusion: Individual design preferences differ substantially, and conventional aggregated annotation is insufficient for personalized generation; personalized modeling is the more effective route, and DesignPref provides the first benchmark for studying personalized visual design evaluation. Abstract: Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.

[76] Vision--Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo,Mingzhen Li,Hanyu Su,Santiago López,Lexiaozi Fan,Daniel Kim,Aggelos Katsaggelos

Main category: cs.CV

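TL;DR: Proposes VESSA, a vision-language enhanced semi-supervised segmentation assistant that brings vision-language models (VLMs) into medical image segmentation and, through a template bank and a reference-guided mechanism, markedly improves segmentation accuracy when very few annotations are available.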

Details Motivation: Medical image segmentation relies on large amounts of costly expert annotation. Existing semi-supervised methods perform poorly under extremely limited annotation, whereas vision-language models offer strong generalization and few-shot capability worth exploiting. Method: Proposes a two-stage framework. Stage 1 trains VESSA as a reference-guided segmentation assistant that performs visual feature matching against a template bank of gold-standard exemplars to produce semantic and spatial cues, which drive a SAM2-style decoder to generate masks. Stage 2 integrates VESSA into a state-of-the-art semi-supervised learning framework with dynamic interaction with the student model: the student's predictions are fed back to VESSA to produce higher-quality pseudo-labels and stronger guidance. Result: Across multiple medical image segmentation datasets and settings, VESSA-augmented semi-supervised methods significantly outperform state-of-the-art baselines under extremely limited annotation. Conclusion: VESSA integrates foundation-level visual-semantic understanding into semi-supervised learning frameworks, demonstrating that vision-language models can reduce the annotation dependence of medical image segmentation and offering a new approach to accurate segmentation in low-resource settings. Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[77] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Yuwei Niu,Weiyang Jin,Jiaqi Liao,Chaoran Feng,Peng Jin,Bin Lin,Zongjian Li,Bin Zhu,Weihao Yu,Li Yuan

Main category: cs.CV

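TL;DR: Proposes UniSandbox, a framework for evaluating the gap between understanding and generation in unified multimodal models; identifies reasoning generation and knowledge transfer as the key dimensions and reveals the role of Chain-of-Thought (CoT) and query-based architectures in narrowing the gap.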

Details Motivation: Investigates whether understanding truly informs generation in unified multimodal models, addressing the difficulty of accurate evaluation caused by data leakage in existing approaches. Method: Proposes UniSandbox, a decoupled evaluation framework paired with controlled synthetic datasets, to analyze the understanding-generation gap, and uses explicit Chain-of-Thought (CoT) and self-training to improve reasoning generation and knowledge transfer. Result: Finds a significant understanding-generation gap; explicit CoT effectively improves reasoning generation, and self-training can internalize the reasoning ability; CoT helps retrieve newly learned knowledge, and query-based architectures exhibit latent CoT-like properties that affect knowledge transfer. Conclusion: Understanding does not automatically lead to generation; specific mechanisms such as CoT, self-training, and query-based architectures are needed to bridge the two. UniSandbox offers preliminary insights for designing future unified architectures and training strategies. Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox

[78] A Storage-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector

Linxin Hua,Jianghua Deng,Ye Lu

Main category: cs.CV

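TL;DR: Proposes a new feature, relative angle, for concrete surface defect detection in point cloud reconstructions; it retains directional information comparable to normal vectors while substantially reducing data storage and the number of input channels.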

Details Motivation: Image-based methods are vulnerable to background noise, and the large volume of 3D data limits practical use, so the paper seeks an efficient, low-storage feature for point cloud damage recognition. Method: Defines relative angle, a single-dimensional feature equal to the angle between a point's normal vector and the average normal vector of its parent point cloud, and validates it with entropy-based feature evaluation; trains and tests PointNet++ to compare models based on relative angle with models based on normal vectors. Result: Models based on relative angle perform close to models based on normal vectors while achieving a 27.6% storage reduction and 83% input channel compression. Conclusion: Relative angle is an effective lightweight feature that can improve computational efficiency on resource-constrained hardware without modifying the model architecture, with potential for large-scale point cloud damage detection. Abstract: Point cloud reconstruction of damage offers an effective alternative to image-based methods, which are vulnerable to background noise, yet its application is constrained by the high volume of 3D data. This study proposes a new feature, relative angle, computed as the angle between the normal vector of a point and the average normal vector of its parent point cloud. This single-dimensional feature provides directionality information equivalent to normal vectors for concrete surface defect characteristics. Through entropy-based feature evaluation, this study demonstrates the ability of relative angle to filter out redundant information in undamaged sections while retaining effective information in damaged sections. By training and testing with PointNet++, models based on the relative angles achieved similar performance to that of models based on normal vectors while delivering 27.6% storage reduction and 83% input channel compression. This novel feature has the potential to enable larger-batch execution on resource-constrained hardware without the necessity of architectural modifications to models.
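
The relative angle defined above can be computed directly from per-point normals. The following is a minimal sketch in plain Python that follows the abstract's definition (angle between each point's normal and the average normal of the parent cloud). The function name is hypothetical, angles are returned in radians, and the sketch assumes the average normal is non-zero; how the authors estimate normals or handle degenerate clouds is not stated.

```python
import math


def relative_angles(normals):
    """Return, for each normal, its angle (radians) to the cloud's mean normal.

    `normals` is a list of 3-component vectors, one per point.
    Assumes the mean normal has non-zero length.
    """
    n = len(normals)
    mean = [sum(v[i] for v in normals) / n for i in range(3)]
    mean_len = math.sqrt(sum(c * c for c in mean))
    angles = []
    for v in normals:
        v_len = math.sqrt(sum(c * c for c in v))
        cos_theta = sum(a * b for a, b in zip(v, mean)) / (v_len * mean_len)
        # Clamp to guard against floating-point drift outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, cos_theta))
        angles.append(math.acos(cos_theta))
    return angles


flat = relative_angles([(0.0, 0.0, 1.0)] * 3)          # flat, undamaged patch
corner = relative_angles([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

Each point then carries one scalar instead of a three-component normal, which is where the reduction in stored values and input channels comes from.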

[79] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation

Ali Torabi,Sanjog Gaihre,Yaqoob Majeed

Main category: cs.CV

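TL;DR: CrispFormer improves the SegFormer decoder to achieve sharper boundaries and greater robustness to noise in weakly supervised semantic segmentation, delivering significant gains with only small changes.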

Details Motivation: Weakly supervised semantic segmentation (WSSS) relies on incomplete, noisy supervision, which makes high-quality dense masks hard to produce; blurred boundaries and missed small objects are particular problems, calling for a more effective decoder design. Method: Proposes CrispFormer with three key changes: (1) a boundary branch that supervises thin contours with a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refinement module that predicts per-pixel uncertainty to weight losses and correct the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating, optionally modulated by uncertainty. Result: With the same seed labels, CrispFormer outperforms SegFormer baselines on boundary F-score, small-object recall, and mIoU at minimal extra compute, and fits standard WSSS pipelines. Conclusion: Small, synergistic decoder-level changes can markedly improve weakly supervised segmentation; CrispFormer offers a simple, general, and reproducible way to generate high-fidelity masks. Abstract: Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
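
One plausible reading of "spatial softmax gating over multi-resolution features" is a per-location softmax across scales that weights each scale's feature before summing. The sketch below shows that reading in plain Python for single-channel maps already resampled to a common resolution. The function names and the gate-logit input are assumptions for illustration; the abstract does not give CrispFormer's actual fusion layer or its uncertainty modulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Fuse per-scale maps with a per-pixel softmax over scales.

    features, gate_logits: lists of S maps, each an H x W list of floats,
    all assumed to share the same resolution.
    """
    num_scales = len(features)
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            weights = softmax([gate_logits[s][r][c] for s in range(num_scales)])
            fused[r][c] = sum(
                weights[s] * features[s][r][c] for s in range(num_scales)
            )
    return fused


feats = [[[1.0, 1.0]], [[3.0, 3.0]]]          # two scales, 1 x 2 maps
logits = [[[0.0, -50.0]], [[0.0, 50.0]]]      # second pixel strongly prefers scale 1
fused = gated_fusion(feats, logits)
```

With equal logits the layer reduces to a plain average of the scales; with a dominant logit it effectively selects one scale at that location, which matches the abstract's claim of selecting appropriate scales per location.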

[80] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Noah Frahm,Prakrut Patel,Yue Zhang,Shoubin Yu,Mohit Bansal,Roni Sengupta

Main category: cs.CV

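TL;DR: Proposes the Prune-Then-Plan framework, which stabilizes the exploration of vision-language models in embodied question answering through step-level calibration, markedly improving navigation efficiency and answer quality.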

Details Motivation: Large vision-language models exhibit frontier oscillations in embodied question answering, leading to inefficient exploration and degraded answer quality. Method: Filters implausible options with a Holm-Bonferroni inspired pruning strategy and delegates decisions to a coverage-based planner. Result: Markedly improves scene coverage on OpenEQA and EXPRESS-Bench, with relative improvements of up to 49% and 33% on the SPL and LLM-Match metrics. Conclusion: Separating pruning from planning effectively calibrates the step-level behavior of VLMs, giving more stable and efficient exploration. Abstract: Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
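
For readers unfamiliar with the procedure the pruning step is named after, the sketch below implements the standard Holm-Bonferroni step-down test in plain Python. It is the textbook procedure only: how Prune-Then-Plan turns VLM frontier scores into p-values, and what its "inspired" variant changes, is not described in the abstract, so the mapping to frontier pruning is left out.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by Holm's step-down test.

    P-values are visited in ascending order; the k-th smallest (0-based rank k)
    is compared with alpha / (m - k). Testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            rejected.append(idx)
        else:
            break  # all larger p-values are retained as well
    return sorted(rejected)


rejected = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

The step-down thresholds make the test less conservative than a plain Bonferroni correction while still controlling the family-wise error rate, which is presumably why it suits pruning several candidate frontiers at once.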

[81] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

Haoyu Wu,Jingyi Xu,Qiaomu Miao,Dimitris Samaras,Hieu Le

Main category: cs.CV

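TL;DR: Addresses the attention collapse that occurs when rotary positional embeddings (RoPE) are linearly interpolated for mixed-resolution denoising. The proposed Cross-Resolution Phase-Aligned Attention (CRPA) adjusts the RoPE index map to eliminate phase aliasing, enabling high-fidelity, efficient multi-resolution generation.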

Details Motivation: Applying standard linear interpolation to rotary positional embeddings (RoPE) for mixed-resolution denoising causes the attention mechanism to collapse, especially when tokens from different spatial grids are mixed. This structural problem produces blur, artifacts, or outright failure in generated images and videos, so a fix is needed that is effective and compatible with pretrained models. Method: Proposes Cross-Resolution Phase-Aligned Attention (CRPA), a training-free, plug-and-play fix. CRPA modifies only the RoPE index map for each attention call: all query/key positions are expressed on the query's stride so that equal physical distances always induce the same phase increment, restoring the precise phase patterns that DiTs rely on. Result: CRPA resolves the phase aliasing caused by linear coordinate remapping, stabilizes all layers and attention heads, and enables high-fidelity, efficient mixed-resolution generation for images and video, outperforming previous state-of-the-art methods. Conclusion: CRPA removes the core failure mechanism of linearly interpolated RoPE at its source and is compatible with existing pretrained DiT models, offering a simple and effective solution for mixed-resolution generation that markedly improves quality and stability. Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle (many heads exhibit extremely sharp, periodic phase selectivity), so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.

[82] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes

Jihan Yao,Achin Kulshrestha,Nathalie Rauschmayr,Reed Roberts,Banghua Zhu,Yulia Tsvetkov,Federico Tombari

Main category: cs.CV

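TL;DR: Proposes Latent Representation Probing (LRP), an uncertainty detection method for VLMs that trains lightweight probes on hidden states or attention patterns, markedly improving abstention accuracy in scene text visual question answering.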

Details Motivation: Existing abstention mechanisms for VLMs rely on poorly calibrated output probabilities or on semantic agreement that is unsuited to OCR tasks, so they cannot reliably identify uncertainty, particularly in safety-critical applications. Method: Designs three probe structures: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single-layer probes by majority vote; trains lightweight probes on hidden states or attention patterns to detect model uncertainty. Result: On four benchmarks across image and video modalities, LRP improves abstention accuracy by 7.6% over the best baselines; intermediate rather than final layers give the best signal, and probes generalize across uncertainty sources and datasets. Conclusion: LRP provides a principled framework for detecting confidence signals from internal representations, outperforming methods that depend on unreliable outputs and supporting deployable, reliable AI systems. Abstract: As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading "50 mph" as "60 mph" could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can't see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs' internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over best baselines. Our analysis reveals: probes generalize across various uncertainty sources and datasets, and optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs.
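
Of the three probe designs, the majority-vote ensemble is the easiest to illustrate. The sketch below assumes each layer has its own logistic probe (a weight vector and bias, hypothetical values here) that votes on whether the answer is unreliable, and abstains when a strict majority of layers vote so. The probe form, threshold, and tie-breaking rule are assumptions; the abstract specifies only that single-layer probes are ensembled by majority vote.

```python
import math


def probe_flags_uncertain(hidden, weights, bias, threshold=0.5):
    """Logistic probe on one layer's hidden state; True means 'likely wrong'."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= threshold


def should_abstain(hidden_by_layer, probes):
    """Abstain when a strict majority of per-layer probes flag uncertainty."""
    votes = [
        probe_flags_uncertain(hidden, weights, bias)
        for hidden, (weights, bias) in zip(hidden_by_layer, probes)
    ]
    return 2 * sum(votes) > len(votes)


# Toy hidden states for three layers and one hypothetical probe per layer.
hidden_by_layer = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probes = [([2.0, 0.0], -1.0), ([0.0, -3.0], 0.0), ([1.0, 1.0], -1.0)]
abstain = should_abstain(hidden_by_layer, probes)
```

Voting across layers keeps each probe small and lets disagreement between layers cancel out, at the cost of ignoring how confident each individual probe is.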

[83] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Byeongjun Park,Byung-Hoon Kim,Hyungjin Chung,Jong Chul Ye

Main category: cs.CV

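TL;DR: Proposes ReDirector, which corrects a misuse of RoPE and introduces Rotary Camera Encoding (RoCE) to enable controllable retake generation for dynamically captured, variable-length videos.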

Details Motivation: Existing methods misuse RoPE when aligning spatiotemporal positions and struggle with multi-view relationships across different camera trajectories and video lengths. Method: Proposes Rotary Camera Encoding (RoCE), which injects camera parameters as a condition into the RoPE phase shift, aligning the input and target videos spatiotemporally and modeling multi-view relationships. Result: Achieves better dynamic object localization, background preservation, geometric consistency, and video quality across varied camera trajectories and video lengths. Conclusion: By camera-conditioning RoPE, ReDirector improves the generalization and controllability of video retake generation and suits complex dynamic scenes. Abstract: We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

[84] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

Haoqing Li,Jun Shi,Xianmeng Chen,Qiwei Jia,Rui Wang,Wei Wei,Hong An,Xiaowen Hu

Main category: cs.CV

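TL;DR: Proposes BHD-RAG, a framework that combines multimodal retrieval-augmented generation with domain-specific knowledge and clinical precedents to improve the accuracy of CT-based diagnosis of Birt-Hogg-Dubé syndrome.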

Details Motivation: Deep learning struggles to improve BHD diagnosis when clinical samples are limited and inter-class differences among Diffuse Cystic Lung Diseases (DCLDs) are small; existing MLLMs are prone to hallucination because they lack specialized domain knowledge. Method: Builds a three-part multimodal retrieval-augmented generation framework: a specialized agent generates descriptions of CT images to construct a multimodal corpus, a cosine similarity-based retriever matches relevant image-description pairs, and an MLLM fuses the retrieved evidence with the input image for diagnosis. Result: Validated on a dataset covering four types of DCLDs, BHD-RAG achieves higher diagnostic accuracy and generates evidence-based descriptions that closely agree with expert opinion. Conclusion: By integrating domain-specific knowledge with retrieval augmentation, BHD-RAG improves diagnostic accuracy for the rare disease BHD, reduces model hallucination, and shows potential for clinical use. Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential for such rare diseases, the absence of domain-specific knowledge and referable radiological features intensifies hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHD-RAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLD cases, (2) a cosine similarity-based retriever pinpointing relevant image-description pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on a dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
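
The retrieval step (component 2) can be illustrated with a minimal cosine-similarity lookup. This is a sketch under stated assumptions: the corpus layout, the field names, and the three-dimensional embeddings are invented for the example, and a real system would obtain embeddings from an image encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, corpus, top_k=2):
    """Return the top_k corpus entries most similar to the query embedding."""
    scored = [(cosine_similarity(query_embedding, entry["embedding"]), entry)
              for entry in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]

# Hypothetical image-description pairs with toy embeddings.
corpus = [
    {"id": "case-a", "embedding": [1.0, 0.0, 0.0], "description": "placeholder A"},
    {"id": "case-b", "embedding": [0.0, 1.0, 0.0], "description": "placeholder B"},
    {"id": "case-c", "embedding": [0.9, 0.1, 0.0], "description": "placeholder C"},
]

top = retrieve([1.0, 0.05, 0.0], corpus, top_k=2)
top_ids = [entry["id"] for entry in top]
```

The retrieved descriptions would then be placed in the MLLM prompt alongside the query image, which is the step that grounds the diagnosis in prior cases.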

[85] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu

Main category: cs.CV

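TL;DR: This paper proposes Rectified SpaAttn, an attention sparsification method that addresses the high latency Diffusion Transformers incur in video generation because of the quadratic complexity of attention. Existing sparse methods show a systematic bias: they over-attend to critical tokens and completely ignore non-critical ones. The proposed method rectifies attention allocation against an implicit full-attention reference, using Isolated-Pooling Attention Reallocation for critical tokens and Gain-Aware Pooling Rectification for non-critical tokens. This improves the alignment between sparse and full attention maps and, combined with optimized Triton kernels, yields speedups of up to 3.33x while maintaining high generation quality.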

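Details Motivation: Diffusion Transformers perform well in video generation, but the quadratic complexity of attention causes high inference latency. Attention sparsification reduces the computational cost, yet existing methods introduce a systematic bias by amplifying the weights of critical tokens and discarding the information in non-critical tokens, which degrades performance. A sparsification method is therefore needed that corrects this bias while balancing efficiency and generation quality. Method: Proposes Rectified SpaAttn, with two core components: (1) Isolated-Pooling Attention Reallocation for critical tokens, which computes accurate rectification factors by reallocating multimodal pooled weights to remove the weight-amplification bias; (2) Gain-Aware Pooling Rectification for non-critical tokens, which weighs the attention gain against the pooling error when recovering attention weights so that the net benefit is positive. An efficient sparse attention kernel is also customized and integrated using Triton to improve real-world speed. Result: Achieves speedups of up to 3.33x on HunyuanVideo and 2.08x on Wan 2.1 while keeping generation quality close to full attention. Ablation studies confirm the effectiveness of both rectification mechanisms, and attention map visualizations show markedly better alignment between sparse and full attention. Conclusion: By introducing an implicit full-attention reference, Rectified SpaAttn corrects the systematic bias of existing sparse attention methods and enables efficient, high-quality video generation. It substantially speeds up inference without sacrificing performance, is practical and scalable, and has been released as open source. Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. 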
Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at https://github.com/BienLuky/Rectified-SpaAttn .
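
The amplification bias on critical tokens follows directly from how softmax renormalizes, and a few lines of arithmetic show it. The sketch below is illustrative only: the logits and the choice of critical tokens are made up, and the rectification factor is computed from full attention as an oracle, whereas the paper estimates it from pooled weights without computing full attention.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, -1.0]   # toy query-key logits for one query
critical = [0, 1]                 # indices kept by the sparse mask (assumed)

full = softmax(scores)
sparse = softmax([scores[i] for i in critical])

# Attention mass that full attention places on the kept tokens.
kept_mass = sum(full[i] for i in critical)

# Scaling the sparse weights by that mass recovers the full-attention weights.
rectified = [w * kept_mass for w in sparse]
```

Because the sparse softmax renormalizes over fewer tokens, every kept weight is inflated by the same factor `1 / kept_mass`. The bias is therefore proportional to the sparse weight itself, which matches observation (1) in the abstract.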

[86] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

Yiting Lu,Wei Luo,Peiyan Tu,Haoran Li,Hanxin Zhu,Zihao Yu,Xingrui Wang,Xinyi Chen,Xinge Peng,Xin Li,Zhibo Chen

Main category: cs.CV

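TL;DR: This paper introduces 4DWorldBench, a new benchmark for systematically evaluating world generation models along four dimensions: perceptual quality, condition-4D alignment, physical realism, and 4D consistency. It supports multimodal inputs and performs adaptive evaluation through a unified textual space and LLM/MLLM-as-judge, aiming to advance the shift from "visual generation" to "world generation."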

Details Motivation: Existing world generation models lack a unified, comprehensive evaluation benchmark. Different methods emphasize different evaluation dimensions, which makes it hard to measure their overall ability to build 3D/4D worlds, particularly cross-modal consistency and physical realism. Method: Introduces the 4DWorldBench benchmark, covering Image-to-3D/4D, Video-to-4D, and Text-to-3D/4D tasks. An adaptive multimodal conditioning mechanism maps all modality conditions into a unified textual space, and evaluation combines LLM-as-judge, MLLM-as-judge, and traditional network-based methods. Result: Provides a unified evaluation framework across modalities and tasks that tracks subjective human judgment more closely on several dimensions; preliminary human studies indicate that the adaptive evaluation tools agree with human ratings. Conclusion: 4DWorldBench offers a unified, extensible evaluation standard for world generation models and may become a foundational tool for the field, supporting the move from static visual generation to dynamic, physically plausible 3D world generation. Abstract: World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. 
Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from "visual generation" to "world generation." Our project can be found at https://yeppp27.github.io/4DWorldBench.github.io/.

[87] Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

Thomas M Metz,Matthew Q Hill,Alice J O'Toole

Main category: cs.CV

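TL;DR: Proposes IMIC, a multi-task training method that lets vision foundation models perform four tasks in a single embedding space (object recognition, face recognition from high- and low-quality images, and whole-body person recognition) without substantially forgetting their original capabilities.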

Details Motivation: To address the catastrophic forgetting that fine-tuned vision models show in multi-task settings while preserving their zero-shot generalization. Method: Designs two IMIC variants (IMIC A and B) that use gradient coupling and an interleaved training schedule to jointly fine-tune foundation models such as DINOv3, CLIP, and EVA-02 on four tasks. Result: With IMIC, EVA-02 and CLIP perform close to domain experts on all four tasks and exceed human multi-tasking ability. Task representations in the embedding space are linearly separable yet share a large number of features, and a small number of principal components suffices for cross-task recognition. Conclusion: IMIC enables unified modeling of multi-domain identity recognition tasks, balances performance with generalization, and offers a feasible path toward general-purpose vision models. Abstract: Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space -- without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.

[88] DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction

Jiahui Sun,Junran Lu,Jinhui Yin,Yishuo Xu,Yuanqi Li,Yanwen Guo

Main category: cs.CV

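TL;DR: This paper proposes DOGE, which uses the Bézier Graph, a differentiable parametric curve representation, to extract road networks automatically from aerial imagery. It learns directly from segmentation masks, without relying on hard-to-construct vector ground truth, and achieves state-of-the-art performance on large-scale datasets.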

Details Motivation: Most existing methods represent roads as polylines, which cannot accurately model continuous road geometry, and high-quality vector annotations are scarce. A representation that better matches the inherently curved nature of roads is needed, without depending on complex vector ground truth. Method: Introduces the Bézier Graph as a differentiable parametric curve representation of roads and proposes the DOGE framework, which recasts road extraction as a global optimization problem over the Bézier Graph. The framework alternates between two modules: DiffAlign optimizes geometry through differentiable rendering, and TopoAdapt adjusts topology with discrete operators, learning directly from segmentation masks. Result: Sets a new state of the art on the large-scale SpaceNet and CityScale benchmarks, confirming that the method produces high-fidelity vector maps. Conclusion: DOGE offers a new paradigm that needs no curve ground-truth supervision, achieves high-quality road network vectorization through differentiable optimization, and advances automatic mapping. Abstract: Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to this representation is to obtain the difficult-to-construct vector ground-truth (GT). We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.
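
As background for the Bézier Graph representation, the sketch below evaluates a single curved edge. It assumes cubic Bézier segments, which this summary does not confirm, and the control points are invented. Since each curve point is a smooth polynomial in the control points, gradients from a rendering loss can flow back to them, which is what makes such a representation differentiable.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein basis weights; they sum to 1 for every t.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)

def sample_curve(control_points, n=5):
    """Sample n evenly spaced parameter values along one curve."""
    return [cubic_bezier(*control_points, t=i / (n - 1)) for i in range(n)]

# One hypothetical road edge: two endpoints (graph nodes) and two interior handles.
control_points = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
points = sample_curve(control_points, n=5)
midpoint = points[2]
```

A polyline would need many vertices to approximate the same bend, while the Bézier edge describes it with four control points.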

[89] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

Jiankuo Zhao,Xiangyu Zhu,Zidu Wang,Zhen Lei

Main category: cs.CV

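TL;DR: This paper proposes STAvatar, a method based on 3D Gaussian Splatting for reconstructing high-fidelity, animatable 3D head avatars. Its UV-adaptive soft binding and temporal adaptive density control strategies markedly improve fine detail and the reconstruction of occluded regions.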

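Details Motivation: Existing methods based on 3D Gaussian Splatting produce rigid motion and limited expressiveness when modeling facial deformation and handling occluded regions (such as mouth interiors and eyelids), and they lack effective mechanisms for high-frequency detail and dynamic occlusion. Method: Proposes STAvatar, with two core components: (1) a UV-Adaptive Soft Binding framework that uses image and geometric priors to learn per-Gaussian feature offsets in UV space, supports dynamic resampling, and is compatible with adaptive density control; (2) a Temporal ADC strategy that clusters similar frames to improve computation of the densification criterion and introduces a fused perceptual error as the clone criterion, jointly capturing geometric and textural discrepancies to enhance detail in key regions. Result: Experiments on four benchmark datasets show that STAvatar substantially outperforms existing methods in reconstructing fine details (such as teeth, tongue, and eyelashes) and frequently occluded regions, reaching state-of-the-art visual quality and geometric accuracy. Conclusion: Through soft binding in UV space and temporally aware density control, STAvatar improves the reconstruction quality and animation expressiveness of 3D head avatars from monocular video, performs especially well in regions with complex occlusion and high variation, and offers a new approach to high-fidelity digital human modeling. Abstract: Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as the clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.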

[90] Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

Xiangkai Ma,Han Zhang,Wenzhong Li,Sanglu Lu

Main category: cs.CV

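TL;DR: TimeArtist is a new framework for generating visual content from time series. Its "warmup-align" paradigm achieves semantic-level alignment between time series fluctuations and visual concepts, supports high-quality, diverse image generation, and performs well on zero-shot temporal tasks.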

Details Motivation: Existing methods that convert time series into "pseudo-images" for forecasting lack semantic-level alignment, and the potential of non-visual continuous sequences as a conditioning signal for high-fidelity image generation is largely unexplored. Method: Proposes the TimeArtist framework. First, a dual autoencoder and a shared quantizer are trained in a self-supervised manner on large-scale data to learn representations shared across modalities. Then the encoders and quantizer are frozen and a projection module is introduced to align time series with visual samples at the representation level. Result: Experiments show that TimeArtist performs well on image generation metrics and outperforms existing methods on zero-shot temporal tasks, capturing temporal fluctuation patterns and rendering them as image styles. Conclusion: TimeArtist builds a new bridge between temporal dynamics and visual semantics, offers a new paradigm for cross-modal generation, and broadens the use of time series in visual generation. Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It pioneers a "warmup-align" paradigm: first, a dual-autoencoder and shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as style transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.

[91] GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld Team,Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Haoyun Li,Jiagang Zhu,Kerui Li,Mengyuan Xu,Qiuping Deng,Siting Wang,Wenkang Qin,Xinze Chen,Xiaofeng Wang,Yankai Wang,Yu Cao,Yifan Chang,Yuan Xu,Yun Ye,Yang Wang,Yukun Zhou,Zhengyuan Zhang,Zhehao Dong,Zheng Zhu

Main category: cs.CV

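TL;DR: GigaWorld-0 is a unified world model framework that serves as a data engine for Vision-Language-Action (VLA) learning. It jointly optimizes video generation and 3D modeling to produce high-quality, controllable embodied interaction data, and improves the generalization and task success of VLA models on physical robots without any real-world training.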

Details Motivation: Scalable, data-efficient embodied AI requires a world model framework that can generate diverse, physically plausible, and instruction-aligned interaction data. Method: Proposes the GigaWorld-0 framework, which comprises GigaWorld-0-Video (for generating spatiotemporally coherent video sequences) and GigaWorld-0-3D (combining 3D generation, Gaussian Splatting reconstruction, differentiable physical system identification, and motion planning), with efficient training provided by the GigaTrain framework. Result: The generated data shows high visual quality, spatial coherence, physical plausibility, and instruction alignment. VLA models trained on it (e.g., GigaBrain-0) show strong generalization and high task success on real robots. Conclusion: GigaWorld-0 works effectively as a data engine for VLA learning, enabling embodied intelligence that performs well in the physical world without real-world interaction during training. Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

[92] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images

Lei Ding,Tong Liu,Xuanguang Liu,Xiangyun Liu,Haitao Guo,Jun Lu

Main category: cs.CV

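TL;DR: This paper proposes ChessMamba, a structure-aware framework for change detection in multi-temporal remote sensing images. Chessboard interleaving and a snake scanning order enable efficient feature serialization and cross-temporal interaction, and the method substantially outperforms existing approaches across several change detection tasks.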

Details Motivation: Existing methods based on vision transformers or state-space models disrupt local structural consistency during temporal serialization when handling change detection in multi-temporal remote sensing images, which makes it hard to capture change cues accurately under spatiotemporal misalignment. Method: Proposes the ChessMamba framework, comprising a SpatialMamba encoder and a lightweight cross-source interaction module. Chessboard interleaving with a snake scanning order serializes multi-temporal features into a unified sequence, and multi-dilated convolutions provide structure-aware fusion that preserves local context. Result: Comprehensive experiments on three tasks (binary change detection, semantic change detection, and multimodal building damage assessment) show that ChessMamba fuses heterogeneous features effectively and substantially exceeds current state-of-the-art methods in accuracy. Conclusion: Through structure-aware sequence modeling, ChessMamba strengthens the localization of fine-grained changes in multi-temporal remote sensing images and offers a new approach to spatiotemporal misalignment and the loss of local consistency. Abstract: Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art methods. The relevant code will be available at: github.com/DingLei14/ChessMamba.

[93] Distilling Cross-Modal Knowledge via Feature Disentanglement

Junhong Liu,Yuan Zhang,Tao Huang,Wenchao Xu,Renyu Yang

Main category: cs.CV

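TL;DR: Proposes frequency-decoupled cross-modal knowledge distillation, which decouples features in the frequency domain to improve knowledge transfer across modalities (e.g., vision to language).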

Details Motivation: Conventional knowledge distillation is less effective in cross-modal settings because representations are inconsistent across modalities, so cross-modal knowledge transfer needs improvement. Method: Decomposes features into low-frequency and high-frequency parts and applies a strong alignment loss to the former and a relaxed alignment loss to the latter. A scale consistency loss mitigates distribution shift, and a shared classifier unifies the feature spaces. Result: Substantially outperforms traditional and state-of-the-art cross-modal knowledge distillation methods on multiple benchmark datasets. Conclusion: The frequency decoupling strategy improves cross-modal knowledge distillation and offers a new approach to compressing multimodal models. Abstract: Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
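
The low/high frequency split with different alignment strengths can be sketched on a one-dimensional feature vector. Everything concrete here is an assumption made for illustration: the naive DFT, the cutoff, and in particular the use of a down-weighted MSE as the "relaxed" high-frequency loss. The exact loss forms are not given in this summary; the authors' repository is the reference for them.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def split_frequencies(x, cutoff):
    """Split x into low- and high-frequency parts that sum back to x."""
    n = len(x)
    spectrum = dft(x)
    # Keep bins whose frequency index (or its mirror) is within the cutoff.
    low_spectrum = [spectrum[k] if min(k, n - k) <= cutoff else 0j for k in range(n)]
    low = idft(low_spectrum)
    high = [a - b for a, b in zip(x, low)]
    return low, high

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def decoupled_loss(student, teacher, cutoff=1, high_weight=0.1):
    """Strong alignment on low frequencies, weaker (hypothetical) on high ones."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return mse(s_low, t_low) + high_weight * mse(s_high, t_high)

teacher = [0.0] * 8
low_only = [1.0] * 8            # pure constant (lowest frequency) mismatch
high_only = [1.0, -1.0] * 4     # pure alternating (highest frequency) mismatch
loss_low = decoupled_loss(low_only, teacher)
loss_high = decoupled_loss(high_only, teacher)
```

Both mismatches have the same energy, yet the low-frequency one is penalized ten times more. That asymmetry is the intended effect of aligning the modality-consistent component strongly while leaving the modality-specific component more freedom.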

[94] LiMT: A Multi-task Liver Image Benchmark Dataset

Zhe Liu,Kai Han,Siqi Ma,Yan Zhu,Jun Chen,Chongwen Lyu,Xinyi Qiu,Chengxuan Qian,Yuqing Song,Yi Liu,Liyuan Tian,Yang Ji,Yuefeng Li

Main category: cs.CV

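TL;DR: This paper presents LiMT, a multi-task liver dataset built on arterial phase-enhanced CT images that supports liver and tumor segmentation, multi-label lesion classification, and lesion detection, with the aim of advancing computer-aided diagnosis.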

Details Motivation: Existing liver datasets usually support only a single task, which limits the development of CAD technology, and heterogeneity between task-specific datasets hampers model training. A unified multi-task dataset is therefore needed to explore the correlations between tasks. Method: Builds a multi-task liver dataset (LiMT) of 150 arterial phase-enhanced CT cases covering four types of liver disease as well as normal cases. All data are annotated by experienced clinicians and support three tasks: liver/tumor segmentation, multi-label classification, and lesion detection. Result: The dataset is publicly available and comes with baseline experimental results and a review of existing liver-related datasets and methods; it may become an important resource for medical imaging research. Conclusion: LiMT is a valuable public multi-task liver dataset that can advance multi-task learning for liver disease diagnosis and reduce the heterogeneity introduced by combining different datasets. Abstract: Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the application scope of existing datasets remains relatively limited, typically supporting only single tasks, which has somewhat constrained the development of CAD technology. To address the above limitation, in this paper, we construct a multi-task liver dataset (LiMT) used for liver and tumor segmentation, multi-label lesion classification, and lesion detection based on arterial phase-enhanced computed tomography (CT), potentially providing an exploratory solution that is able to explore the correlation between tasks and does not need to worry about the heterogeneity between task-specific datasets during training. The dataset includes CT volumes from 150 different cases, comprising four types of liver diseases as well as normal cases. Each volume has been carefully annotated and calibrated by experienced clinicians. This public multi-task dataset may become a valuable resource for the medical imaging research community in the future. In addition, this paper not only provides relevant baseline experimental results but also reviews existing datasets and methods related to liver-related tasks. Our dataset is available at https://drive.google.com/drive/folders/1l9HRK13uaOQTNShf5pwgSz3OTanWjkag?usp=sharing.

[95] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

Yuyi Li,Daoyuan Chen,Zhen Wang,Yutong Lu,Yaliang Li

Main category: cs.CV

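TL;DR: This paper proposes a verification-centric Generate-then-Verify framework for building VeriSciQA, a high-quality Scientific Visual Question Answering (SVQA) dataset that markedly improves the performance of open-source large models on SVQA tasks.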

Details Motivation: Existing open-source large vision-language models perform poorly on Scientific Visual Question Answering (SVQA), mainly because large-scale, high-quality public SVQA datasets are lacking, while existing data synthesis methods suffer from systematic errors. Method: Proposes a Generate-then-Verify framework that first generates QA pairs using the textual context of each figure, then removes erroneous samples through cross-modal consistency checks and auxiliary filters, producing the VeriSciQA dataset. Result: VeriSciQA contains 20,351 QA pairs spanning 20 scientific domains and 12 figure types. Models trained on it improve consistently on SVQA benchmarks and outperform models trained on existing datasets, and human evaluation confirms its high correctness. Conclusion: The verification-driven framework improves the quality of synthetic data and advances SVQA capability in the open-source community; performance scales with data size, so the approach is scalable. Abstract: Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs' inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, these results demonstrate that continued data expansion by our scalable framework can further advance SVQA capability in the open-source community.

[96] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu,Kaiwen Xiong,Peng Xia,Yiyang Zhou,Haonian Ji,Lu Feng,Siwei Han,Mingyu Ding,Huaxiu Yao

Main category: cs.CV

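TL;DR: Proposes Agent0-VL, a vision-language agent that evolves continually through tool-integrated reasoning. Without external rewards or human annotation, its self-verification and self-repair mechanisms yield a 12.5% improvement over the base model on geometric problem solving and visual scientific analysis.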

Details Motivation: Existing vision-language agents depend on human-annotated supervision, and purely text-based self-evaluation struggles to verify complex visual reasoning and is prone to evaluation hallucinations. Method: Integrates tool use into reasoning, self-evaluation, and self-repair, and designs a dual-role framework with a Solver (multi-turn tool-integrated reasoning) and a Verifier (structured feedback and fine-grained self-rewards). A Self-Evolving Reasoning Cycle aligns the reasoning and evaluation distributions. Result: Achieves a 12.5% improvement over the base model on geometric problem solving and visual scientific analysis, with continual self-improvement and no external rewards. Conclusion: Through tool-augmented self-verification and self-repair, Agent0-VL achieves continual evolution of a vision-language agent with zero external reward, offering an effective way to reduce reliance on human supervision. Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. 
Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.

[97] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Mingyu Zhao,Zhanfu Yang,Yang Zhou,Zhaoyang Xia,Can Jin,Xiaoxiao He,Carol Neidle,Dimitris N. Metaxas

Main category: cs.CV

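TL;DR: Proposes a multimodal approach to continuous sign language recognition based on 3D skeletal and handshape features. Fusing a pretrained handshape classification model improves boundary detection accuracy, and the method outperforms previous work on the ASLLRP corpus.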

Details Motivation: Existing continuous sign language recognition methods are not robust at detecting sign boundaries and make little use of linguistically meaningful handshape information, so a more reliable method is needed to segment and recognize signs in continuous ASL sentences. Method: First uses machine learning to detect the start and end frames of signs in video, extracting 3D skeletal features to capture sign dynamics. A classifier for 87 canonical handshape categories is built and pretrained for boundary detection. A multimodal fusion module combines the handshape classifier with the video segmentation framework. Finally, the estimated boundaries are used for sign recognition, with the model trained jointly on isolated and continuous signing data. Result: Experiments on the ASLLRP corpus show that the method substantially outperforms previous work, particularly in boundary detection and continuous sign recognition accuracy. Conclusion: Fusing 3D skeletal dynamics with linguistically motivated handshape information improves the robustness and accuracy of continuous sign language recognition and confirms the value of multimodal fusion for sign language understanding. Abstract: This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

[98] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Haoxuan Wang,Jiachen Tao,Junyi Wu,Gaowen Liu,Ramana Rao Kompella,Yan Yan

Main category: cs.CV

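TL;DR: Proposes Motion Marionette, a training-free framework for rigid motion transfer that achieves zero-shot transfer through a spatial-temporal prior shared between the source video and the target image.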

Details Motivation: Existing methods depend on external geometric, generative, or simulation priors, which forces a trade-off between generalizability and temporal consistency and limits adaptation to diverse objects. Method: Lifts the source video and the target image into a unified 3D representation space, extracts motion trajectories from the source video to build a spatial-temporal (SpaT) prior that is independent of geometry and semantics, combines the prior with the target object to produce a controllable velocity field, and refines that field with Position-Based Dynamics (PBD) to reduce artifacts and improve visual coherence. Result: Experiments show good motion transfer across different objects, producing temporally consistent videos that align closely with the source motion and supporting controllable video generation. Conclusion: Through an internally shared spatial-temporal prior, Motion Marionette achieves high-quality, zero-shot motion transfer with good generalization and visual quality, and applies to diverse objects and scenes without retraining. Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
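
For readers unfamiliar with Position-Based Dynamics, the following sketch shows the generic PBD loop with distance constraints in 2D: predict positions from velocities, project constraints, then recompute velocities from the corrected positions. It illustrates the general technique only. Which constraints Motion Marionette actually uses is not stated here, and the two-particle setup is invented.

```python
import math

def project_distance_constraint(p1, p2, rest_length, w1=1.0, w2=1.0):
    """Move two points so their distance equals rest_length (w = inverse mass)."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0 or w1 + w2 == 0.0:
        return p1, p2
    nx, ny = dx / dist, dy / dist
    s = (dist - rest_length) / (w1 + w2)
    new_p1 = (p1[0] + w1 * s * nx, p1[1] + w1 * s * ny)
    new_p2 = (p2[0] - w2 * s * nx, p2[1] - w2 * s * ny)
    return new_p1, new_p2

def pbd_step(positions, velocities, edges, dt, iterations=10):
    """One PBD step: predict, project constraints, derive refined velocities."""
    predicted = [(x + vx * dt, y + vy * dt)
                 for (x, y), (vx, vy) in zip(positions, velocities)]
    for _ in range(iterations):
        for i, j, rest in edges:
            predicted[i], predicted[j] = project_distance_constraint(
                predicted[i], predicted[j], rest)
    refined = [((px - x) / dt, (py - y) / dt)
               for (px, py), (x, y) in zip(predicted, positions)]
    return predicted, refined

# Two particles joined by a rigid link; only the second one is given velocity.
positions = [(0.0, 0.0), (1.0, 0.0)]
velocities = [(0.0, 0.0), (1.0, 0.0)]
edges = [(0, 1, 1.0)]            # (index_a, index_b, rest_length)

new_positions, new_velocities = pbd_step(positions, velocities, edges, dt=1.0)
```

The inconsistent input velocities would stretch the link, and the projection redistributes them so both particles move together at half the speed. This is the sense in which PBD can clean up a velocity field that would otherwise tear a rigid object apart.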

[99] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Dapeng Zhang,Zhenlong Yuan,Zhangquan Chen,Chih-Ting Liao,Yinda Chen,Fei Shen,Qingguo Zhou,Tat-Seng Chua

Main category: cs.CV

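TL;DR: Proposes Reasoning-VLA, a general and efficient vision-language-action (VLA) model for continuous action generation in autonomous driving. It uses learnable action queries and reasoning-enhanced multimodal features for parallel inference, and shows strong performance, generalization, and inference speed on a unified multi-dataset benchmark.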

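Details Motivation: Existing VLA models for autonomous driving suffer from slow inference and generalize poorly to new vehicle configurations and driving scenarios, so a more efficient VLA framework with strong generalization is needed. Method: Proposes Reasoning-VLA, which uses learnable action queries initialized by Gaussian sampling from ground-truth trajectories in the training data and combines them with reasoning-enhanced vision-language features to generate actions in parallel. Eight public autonomous driving datasets are consolidated into a standardized, Chain-of-Thought reasoning-based data format, and the model is fine-tuned with both supervised learning and reinforcement learning. Result: Experiments on multiple benchmarks show that Reasoning-VLA achieves state-of-the-art action generation performance, strong generalization, and the best inference speed reported to date. Conclusion: Reasoning-VLA is an efficient, general VLA framework that, through learnable queries and reasoning enhancement, delivers high performance, fast inference, and cross-scenario generalization for autonomous driving decisions, advancing the practical deployment of VLA models. Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. The model is trained with both supervised learning and reinforcement learning fine-tuning. Extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the best inference speed reported to date.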

[100] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

Maryam Eftekharifar,Churun Zhang,Jialiang Wei,Xudong Cao,Hossein Heidari

Main category: cs.CV

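TL;DR: Proposes C-PGA, a new framework for predicting dense physical properties of photochemical conversion from 3D visual data. It uses physical parameters to dynamically modulate visual features, enabling accurate prediction of the chemical state of complex 3D-printed objects.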

Details Motivation: Conventional computer vision models cannot handle the non-linear coupling between optical physics and material physics, and so struggle to predict the internal chemical conversion state of 3D-printed objects. A new method that fuses multimodal physical information is needed. Method: Proposes the Coupled Physics-Gated Adaptation (C-PGA) architecture, which uses geometrical and process parameters as a query and applies FiLM to dynamically modulate two visual feature streams, extracted by parallel 3D-CNNs from the raw projections and from their diffusion-diffraction corrected counterparts, in order to predict dense voxel-level chemical properties. Result: Validated on what is described as the largest dataset of optically 3D-printed specimens to date, C-PGA clearly outperforms conventional models and accurately predicts the chemical conversion distribution in complex minimal surface structures. Conclusion: C-PGA offers a new paradigm for virtual chemical characterisation, enabling precise control of the chemical state in 3D printing without post-print measurements and advancing intelligent additive manufacturing. Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams (extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts), allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.
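To make the FiLM step mentioned in the abstract concrete, here is a minimal sketch of feature-wise linear modulation: each channel is scaled by a factor and shifted by an offset. In C-PGA these per-channel factors would presumably be predicted from the physical parameters by a learned network. Here they are passed in directly, and the function name `film` and the toy values are assumptions for illustration.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: list of channels, each a list of voxel values.
    gamma, beta: one scale and one shift per channel (given directly here;
    a real model would predict them from the conditioning input).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("one gamma and one beta are needed per channel")
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Two channels with three voxels each (toy values).
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
modulated = film(visual, gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

Unlike concatenation, the conditioning input never appears as extra features; it only rescales and shifts the visual ones.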

[101] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren,Yufei Wang,Lanqing Guo,Wen Zhang,Zhiwen Fan,Chenyu You

Main category: cs.CV

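TL;DR: Proposes LoTTS, the first training-free framework for localized test-time scaling, which adaptively resamples defective image regions to improve generation quality while substantially reducing compute cost.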

Details Motivation: Existing test-time scaling methods operate on the full image and ignore the spatial heterogeneity of image quality, which wastes computation and leaves localized defects under-corrected. Method: LoTTS localizes defective regions by contrasting cross- and self-attention signals under quality-aware prompts and turns them into coherent masks. It then perturbs and locally denoises only the defective regions, preserving global consistency. Result: Experiments on SD2.1, SDXL, and FLUX show that LoTTS reaches state-of-the-art local quality and global fidelity while reducing GPU cost by 2-4x compared to Best-of-N sampling. Conclusion: Localized test-time scaling is a promising new direction for scaling models at inference time, and LoTTS provides an efficient, training-free way to realize it. Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
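The idea of confining corrections to a defect mask can be illustrated with a simple masked update. The real method perturbs and denoises inside a diffusion model. The sketch below only shows the final compositing step on a toy 2D grid, and the function name `localized_update` and the binary mask format are assumptions for illustration.

```python
def localized_update(original, candidate, mask):
    """Take the candidate value where mask is 1, keep the original where 0."""
    return [
        [m * c + (1 - m) * o for o, c, m in zip(o_row, c_row, m_row)]
        for o_row, c_row, m_row in zip(original, candidate, mask)
    ]


# Only the top-right cell is flagged as defective, so only it changes.
original = [[1, 1], [1, 1]]
candidate = [[9, 9], [9, 9]]
mask = [[0, 1], [0, 0]]
updated = localized_update(original, candidate, mask)
```

Because unmasked cells are copied through unchanged, the resampling cost and the search space both scale with the masked area instead of the full image.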

[102] HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen,Omar Moured,Ruiping Liu,Junwei Zheng,Kunyu Peng,Jiaming Zhang,Rainer Stiefelhagen

Main category: cs.CV

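TL;DR: Proposes HybriDLA, a new document layout analysis framework that combines diffusion and autoregressive decoding and achieves state-of-the-art detection performance on complex modern documents.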

Details Motivation: Traditional document layout analysis methods struggle with the varied element counts and complex layouts of modern documents. Method: Proposes the HybriDLA framework, in which a diffusion mechanism iteratively refines bounding-box hypotheses and autoregressive decoding injects semantic and contextual information. A multi-scale feature-fusion encoder captures both fine-grained and high-level visual cues. Result: Reaches 83.5% mAP in experiments on the DocLayNet and M$^6$Doc benchmarks, clearly outperforming previous methods. Conclusion: By unifying diffusion and autoregressive generation, HybriDLA improves the accuracy of layout analysis on complex documents and advances the field. Abstract: Document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.

[103] Intelligent Image Search Algorithms Fusing Visual Large Models

Kehan Wang,Tingqiong Cui,Yang Zhang,Yu Chen,Shifeng Wu,Zhenzhang Li

Main category: cs.CV

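TL;DR: Proposes DetVLM, a fine-grained image retrieval framework that fuses object detection with Visual Large Models (VLMs). Its two-stage pipeline provides efficient, high-recall component-level retrieval and supports state search and zero-shot search.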

Details Motivation: Existing methods for fine-grained image retrieval are limited: hand-crafted features lack robustness, deep learning detectors cannot judge component state or perform zero-shot retrieval, and Visual Large Models have semantic ability but weak spatial grounding and high computational cost. A more efficient and accurate retrieval framework is needed. Method: Proposes the DetVLM framework with a two-stage pipeline: a YOLO detector first performs efficient component-level screening, then a VLM acts as a recall-enhancement module that performs secondary verification for components the detector missed. Task-specific prompts enable state search, and the VLM's zero-shot capability enables retrieval of unseen components or attributes. Result: On a vehicle component dataset, DetVLM reaches 94.82% overall retrieval accuracy, clearly outperforming detection-only methods. It attains 94.95% accuracy in zero-shot search for drivers wearing masks and over 90% average accuracy in state search. Conclusion: DetVLM combines the efficiency of object detection with the semantic understanding of VLMs, establishing a high-accuracy approach to fine-grained image retrieval that supports state judgment and zero-shot search, with broad application potential. Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., "sun visor lowered"), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM's inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., "driver wearing a mask") without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
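The control flow of the two-stage pipeline can be sketched as follows: accept an image if the detector finds the component, and fall back to the VLM only for images the detector rejects. The `detector` and `vlm_verify` callables are hypothetical stand-ins supplied by the caller. They do not correspond to any real YOLO or VLM API, and the toy data is invented for illustration.

```python
def two_stage_search(images, detector, vlm_verify, component):
    """Return the images that contain `component`.

    detector(image) -> set of component labels found by the fast detector.
    vlm_verify(image, component) -> bool, a slower secondary check that is
    called only when the detector did not report the component.
    """
    hits = []
    for image in images:
        if component in detector(image):
            hits.append(image)   # stage 1: detector found the component
        elif vlm_verify(image, component):
            hits.append(image)   # stage 2: VLM recovered a detector miss
    return hits


# Toy stand-ins: the detector misses the visor in image "b".
detected = {"a": {"sun visor"}, "b": set(), "c": set()}
truth = {"a": {"sun visor"}, "b": {"sun visor"}, "c": set()}
vlm_calls = []


def toy_vlm(image, component):
    vlm_calls.append(image)
    return component in truth[image]


found = two_stage_search(["a", "b", "c"], detected.__getitem__, toy_vlm, "sun visor")
```

The expensive verifier runs only on detector negatives, which is where the efficiency of this arrangement would come from.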

[104] Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking

Janani Kugarajeevan,Thanikasalam Kokul,Amirthalingam Ramanan,Subha Fernando

Main category: cs.CV

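TL;DR: Proposes CPDATrack, a new one-stream Transformer tracking framework that uses learnable importance estimation for search tokens and a discriminative selective attention mechanism to suppress background and distractor interference while improving computational efficiency, reaching state-of-the-art performance on multiple benchmarks.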

Details Motivation: In existing one-stream Transformer trackers, letting too many background search tokens attend to the template tokens weakens discriminative ability. Existing token pruning methods tend to remove important context around the target and cope poorly with distractors. Method: 1) A learnable module placed between encoder layers estimates the probability that each search token is associated with the target, and low-information background tokens are pruned accordingly while context around the target is kept. 2) A discriminative selective attention mechanism fully blocks search-to-template attention in the early layers to suppress background interference, and in later layers lets only high-probability target tokens from a localized region interact with the template. Result: CPDATrack achieves state-of-the-art performance on multiple benchmarks, including an average overlap of 75.1% on GOT-10k. Conclusion: Through selective token retention and staged attention control, CPDATrack balances context preservation against interference suppression, improving tracking accuracy and robustness. Abstract: One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However, enabling an excessive proportion of background search tokens to attend to the target template tokens weakens the tracker's discriminative capability. Several token pruning methods have been proposed to mitigate background interference; however, they often remove tokens near the target, leading to the loss of essential contextual information and degraded tracking performance. Moreover, the presence of distractors within the search tokens further reduces the tracker's ability to accurately identify the target. To address these limitations, we propose CPDATrack, a novel tracking framework designed to suppress interference from background and distractor tokens while enhancing computational efficiency. First, a learnable module is integrated between two designated encoder layers to estimate the probability of each search token being associated with the target. Based on these estimates, less-informative background tokens are pruned from the search region while preserving the contextual cues surrounding the target. To further suppress background interference, a discriminative selective attention mechanism is employed that fully blocks search-to-template attention in the early layers. In the subsequent encoder layers, high-probability target tokens are selectively extracted from a localized region to attend to the template tokens, thereby reducing the influence of background and distractor tokens. The proposed CPDATrack achieves state-of-the-art performance across multiple benchmarks, particularly on GOT-10k, where it attains an average overlap of 75.1%.
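As a simplified illustration of probability-based token pruning, the sketch below keeps the search tokens with the highest estimated target probability and preserves their original order. It does not model the context-preserving rule or the learned estimator described in the abstract. The function name `prune_tokens` and the `keep_ratio` parameter are assumptions for illustration.

```python
def prune_tokens(tokens, target_prob, keep_ratio):
    """Keep the highest-probability search tokens, preserving their order.

    tokens: list of token representations (any objects).
    target_prob: estimated probability that each token belongs to the target.
    keep_ratio: fraction of tokens to keep (at least one token is kept).
    """
    if len(tokens) != len(target_prob):
        raise ValueError("one probability is needed per token")
    n_keep = max(1, int(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: target_prob[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore spatial order
    return [tokens[i] for i in keep], keep


tokens = ["a", "b", "c", "d", "e", "f"]
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.05]
kept, kept_idx = prune_tokens(tokens, probs, keep_ratio=0.5)
```

Returning the kept indices alongside the tokens lets later layers map outputs back to positions in the search region.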

[105] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Youngseo Kim,Dohyun Kim,Geonhee Han,Paul Hongsuck Seo

Main category: cs.CV

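TL;DR: Proposes DRIFT, a framework that uses a pretrained image diffusion model with SAM-guided mask refinement and reinterprets self-attention maps as semantic label propagation kernels to perform zero-shot video object tracking and segmentation, achieving state-of-the-art performance.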

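Details Motivation: To explore the potential of image diffusion models for recognition and localization tasks beyond generation, in particular whether their self-attention can be used for pixel-level semantic correspondence and temporal propagation in video. Method: Treats the diffusion model's self-attention maps as semantic label propagation kernels and extends them across frames to form a temporal propagation kernel. Test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) are applied, and SAM is used for mask refinement to improve segmentation quality. Result: Achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks, supporting the feasibility and robustness of video object tracking with diffusion models without fine-tuning. Conclusion: Diffusion models are useful not only for generation but also as strong visual representation models for complex recognition tasks, and DRIFT offers a new approach to zero-shot video understanding. Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we show that their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.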
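Treating an attention map as a label propagation kernel can be sketched in a few lines: each target pixel accumulates the attention weight it assigns to each labelled reference pixel and takes the label with the largest total. The function name `propagate_labels` and the toy attention values are assumptions for illustration. A real system would take the attention maps from the diffusion model itself.

```python
def propagate_labels(attention, ref_labels, num_classes):
    """Assign each target pixel the label with the most attention mass.

    attention: one row per target pixel, one weight per reference pixel.
    ref_labels: integer class label of each reference pixel.
    """
    predicted = []
    for row in attention:
        scores = [0.0] * num_classes
        for weight, label in zip(row, ref_labels):
            scores[label] += weight
        predicted.append(max(range(num_classes), key=scores.__getitem__))
    return predicted


# Two target pixels attending to three labelled reference pixels.
attention = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6]]
labels = propagate_labels(attention, ref_labels=[0, 1, 1], num_classes=2)
```

Applying the same step with attention computed between consecutive frames would carry a first-frame mask forward through a video.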

[106] Low-Resolution Editing is All You Need for High-Resolution Editing

Junsung Lee,Hyunsoo Lee,Yong Jae Lee,Bohyung Han

Main category: cs.CV

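TL;DR: Introduces the task of high-resolution image editing and a test-time optimization framework that achieves high-quality high-resolution edits through patch-wise optimization, detail transfer, and a synchronization strategy.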

Details Motivation: Existing image editing methods are largely limited to low resolutions (e.g., up to 1K) and cannot meet the demand for high-resolution content creation, so an effective and controllable method for high-resolution image editing is needed. Method: Proposes a test-time optimization framework that processes high-resolution images with patch-wise optimization, combined with a fine-grained detail transfer module and a new synchronization strategy that keeps patches consistent with one another. Result: Experiments show that the method produces high-quality edits at high resolution and outperforms existing methods. Conclusion: The work demonstrates high-resolution image editing and offers an effective solution for controllable generation of high-resolution visual content. Abstract: High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, paving the way toward high-resolution content creation.

[107] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

Wen Zhang,Qin Ren,Wenjing Liu,Haibin Ling,Chenyu You

Main category: cs.CV

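TL;DR: Proposes SPROUT, a training- and annotation-free framework for nuclear instance segmentation that uses histology priors to build slide-specific reference prototypes, guides feature alignment with partial optimal transport, and prompts the SAM model to produce precise segmentations.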

Details Motivation: Most existing biomedical segmentation methods depend on dense supervision and expensive fine-tuning, and training-free approaches remain underexplored. Method: SPROUT uses histology information to build slide-specific reference prototypes, performs progressive feature alignment through partial optimal transport, and converts the resulting foreground and background features into positive and negative point prompts for SAM. Result: Experiments on multiple pathology benchmarks show that SPROUT achieves competitive performance without supervision or retraining. Conclusion: SPROUT offers a new paradigm for scalable, training-free nuclear instance segmentation in pathology. Abstract: Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

[108] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Hichem Felouat,Hanrui Wang,Isao Echizen

Main category: cs.CV

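TL;DR: Proposes GFT-GCN, a privacy-preserving 3D face recognition framework based on spectral graph learning and diffusion. It combines the Graph Fourier Transform with Graph Convolutional Networks to extract compact, discriminative features and uses spectral diffusion to produce irreversible, renewable, and unlinkable templates, maintaining high recognition accuracy while resisting reconstruction attacks.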

Details Motivation: 3D face recognition is strongly resistant to spoofing, but stored biometric templates remain a security risk, so user privacy needs effective protection. Method: Proposes the GFT-GCN framework, which uses the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract spectral-domain features from 3D face meshes and applies a spectral diffusion mechanism to transform the features irreversibly. A lightweight client-server architecture ensures raw data never leaves the client. Result: Experiments on the BU-3DFE and FaceScape datasets show high recognition accuracy and strong resistance to reconstruction attacks. Conclusion: GFT-GCN strikes a good balance between privacy protection and recognition performance, offering a practical solution for secure 3D face recognition. Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.
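For readers unfamiliar with the Graph Fourier Transform, the sketch below shows the standard construction: build the graph Laplacian, take its eigenvectors as the Fourier basis, and project a vertex signal onto them. The use of the combinatorial Laplacian L = D - A, the function name, and the three-node path graph are assumptions for illustration. The paper may use a different Laplacian or mesh weighting.

```python
import numpy as np


def graph_fourier_transform(adjacency, signal):
    """Project a vertex signal onto the eigenvectors of L = D - A.

    Returns the eigenvalues (graph frequencies), the eigenvectors (Fourier
    basis, one per column), and the spectral coefficients of the signal.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # eigh is appropriate because the Laplacian of an undirected graph
    # is symmetric; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    coefficients = eigenvectors.T @ np.asarray(signal, dtype=float)
    return eigenvalues, eigenvectors, coefficients


# Path graph with three vertices and a constant signal.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
signal = [1.0, 1.0, 1.0]
eigenvalues, eigenvectors, coefficients = graph_fourier_transform(adjacency, signal)
reconstructed = eigenvectors @ coefficients  # inverse transform
```

A constant signal on a connected graph has energy only at the zero frequency, and the inverse transform recovers the original signal because the basis is orthonormal.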

[109] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

Changho Choi,Minho Kim,Jinkyu Kim

Main category: cs.CV

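TL;DR: Proposes MambaEye, a causal visual encoder built on unidirectional Mamba2 that achieves input-size-agnostic image processing through relative move embedding and a diffusion-inspired loss function.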

Details Motivation: Existing visual encoders do not achieve truly input-size-agnostic processing, a characteristic of human vision. Method: Uses a pure Mamba2 backbone, introduces relative move embedding to strengthen translation invariance, and designs a diffusion-inspired loss function that provides step-wise supervision. Result: Shows strong robustness across many resolutions (notably up to $1536^2$) on ImageNet-1K classification while keeping linear time and memory complexity. Conclusion: MambaEye delivers causal, size-adaptive visual encoding and points to a new direction for building truly input-size-agnostic vision models. Abstract: Despite decades of progress, a truly input-size agnostic visual encoder (a fundamental characteristic of human vision) has remained elusive. We address this limitation by proposing MambaEye, a novel, causal sequential encoder that leverages a pure Mamba2 backbone with low complexity and causal processing. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.

[110] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Hongji Yang,Yucheng Zhou,Wencheng Han,Runzhou Tao,Zhongying Qiu,Jianfei Yang,Jianbing Shen

Main category: cs.CV

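TL;DR: Proposes HiCoGen, a hierarchical generation framework based on a "Chain of Synthesis" paradigm. It decomposes complex prompts with a large language model and combines reinforcement learning with a decaying stochasticity schedule to improve the compositionality and accuracy of diffusion models in multi-object, hierarchical text-to-image generation.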

Details Motivation: When handling complex prompts with multiple objects and hierarchical structure, existing diffusion models often omit or confuse concepts and compose scenes poorly, failing to follow instructions accurately. A new framework that strengthens compositional generation is needed. Method: Proposes the HiCoGen framework, which uses a large language model to decompose complex prompts into minimal semantic units and generates the image iteratively under the Chain of Synthesis paradigm. A reinforcement learning framework is added, with a decaying stochasticity schedule to improve exploration and a hierarchical reward mechanism that evaluates results at the global, subject, and relationship levels. Result: Experiments show that the method clearly outperforms existing methods on the newly built hierarchical prompt benchmark HiCoPrompt, improving concept coverage and compositional accuracy. Conclusion: Through hierarchical decomposition and chained synthesis, together with an optimized stochasticity schedule and hierarchical reinforcement learning, HiCoGen addresses the compositionality problem of image generation under complex prompts and offers a new paradigm for text-to-image generation. Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

[111] VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

Yu Hu,Chong Cheng,Sicheng Yu,Xiaoyang Guo,Hao Wang

Main category: cs.CV

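TL;DR: Proposes VGGT4D, a training-free framework that achieves robust 4D scene reconstruction by mining and amplifying the global dynamic cues in the 3D foundation model VGGT.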

Details Motivation: Existing 4D methods rely on external priors, post-optimization, or fine-tuning on 4D datasets, and 3D foundation models degrade markedly when dynamic objects dominate the scene. Method: Exploits the dynamic cues implicit in VGGT's global attention layers, mining them via gram similarity and aggregating dynamic features across a temporal window. A refinement strategy driven by projection gradient sharpens mask boundaries, and the resulting masks are integrated into VGGT's early-stage inference to reduce motion interference. Result: Across six datasets, the method achieves the best performance in dynamic object segmentation, camera pose estimation, and dense reconstruction, and supports single-pass inference on sequences longer than 500 frames. Conclusion: VGGT4D achieves efficient training-free 4D reconstruction, separates dynamic from static elements effectively, and improves geometry and pose estimation in complex dynamic scenes. Abstract: Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors, heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by projection gradient. We then integrate these precise masks into VGGT's early-stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single-pass inference on sequences longer than 500 frames.

[112] Boosting Reasoning in Large Multimodal Models via Activation Replay

Yun Xing,Xiaobin Hu,Qingdong He,Jiangning Zhang,Shuicheng Yan,Shijian Lu,Yu-Gang Jiang

Main category: cs.CV

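TL;DR: Studies how Reinforcement Learning with Verifiable Rewards (RLVR) affects input activations in Large Multimodal Models (LMMs) and finds that RLVR unexpectedly shifts low-entropy activations. Building on this, it proposes Activation Replay, a training-free method that manipulates visual tokens at test time to improve multimodal reasoning across mathematics, o3-like visual agents, and video reasoning, and that outperforms alternative designs.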

Details Motivation: Although RLVR has been shown to incentivize reasoning in LMMs, the underlying mechanism is unclear. This work investigates how RLVR affects input activations and uses the findings to improve existing methods. Method: Uses the logit lens to analyze the effect of RLVR on input activations across several post-trained LMMs, runs controlled experiments to examine the link between modulating low-entropy activations and reasoning, and proposes Activation Replay, which at test time replays low-entropy activations from the base LMM to regulate the behavior of the RLVR model. Result: Experiments indicate that RLVR mainly affects low-entropy activations. Activation Replay improves performance on a range of reasoning tasks, raises Pass@K, mitigates the narrower reasoning coverage of RLVR, and outperforms alternatives such as replaying high-entropy activations or direct cross-model intervention. Conclusion: Modulating low-entropy activations helps strengthen LMM reasoning, and Activation Replay, a simple and effective training-free method, offers a new way to improve multimodal reasoning. Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.

[113] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Jingyang Jia,Kai Shu,Gang Yang,Long Xing,Xun Chen,Aiping Liu

Main category: cs.CV

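TL;DR: Proposes the EmoFeedback2 framework, which uses the reasoning ability of a vision-language model to provide emotion-aware reward feedback and self-promotion textual feedback, improving the quality and emotional fidelity of continuous emotional image generation.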

Details Motivation: Existing methods lack an emotional feedback mechanism for generated images and cannot adapt emotional prompts to image content, which leads to insufficient emotional continuity and fidelity. Method: Proposes EmoFeedback2, a generation-understanding-feedback reinforcement paradigm in which a fine-tuned large vision-language model (LVLM) provides reward and textual feedback. An emotion-aware reward strategy evaluates image emotion and guides reinforcement fine-tuning of the generative model, and a self-promotion textual feedback framework iteratively refines the prompt. Result: Experiments on a custom dataset show that the method outperforms existing state-of-the-art methods in image quality, emotional continuity, and emotional fidelity. Conclusion: By introducing LVLM feedback, EmoFeedback2 improves the precision of emotional control and the emotional consistency of image content in continuous emotional image generation. Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods on our custom dataset. The code and dataset will be released soon.

[114] SONIC: Spectral Optimization of Noise for Inpainting with Consistency

Seungyeon Baek,Erqun Dong,Shadan Namazifard,Mark J. Matthews,Kwang Moo Yi

Main category: cs.CV

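TL;DR: Proposes a training-free inpainting method that optimizes the initial seed noise in the spectral domain using a linear approximation, substantially improving how well generic text-to-image models perform on inpainting.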

Details Motivation: Existing guidance-based methods have limited effectiveness in practice, which has made specialized inpainting models necessary. This work aims to achieve efficient training-free inpainting with generic models. Method: Optimizes the initial seed noise to match the unmasked regions, using spectral-domain optimization and a linear approximation to avoid costly unrolling, and then applies existing training-free inpainting methods on top. Result: Outperforms current state-of-the-art methods on a variety of inpainting tasks, needing only a few tens of optimization steps. Conclusion: Optimizing the initial noise and stabilizing the optimization in the spectral domain enables efficient training-free inpainting with generic text-to-image models. Abstract: We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice, their effectiveness is limited, leading to the necessity of specialized inpainting-specific models. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data, using only a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to effectively implement this idea: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we perform linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: https://ubc-vision.github.io/sonic/

[115] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR

Farhaan Ebadulla,Chiraag Mudlpaur,Shreya Chaurasia,Gaurav BV

Main category: cs.CV

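TL;DR: This paper proposes a multimodal VR gaze prediction method that combines gaze history, head movement, and scene content, using a gated fusion mechanism to improve prediction accuracy without relying on expensive eye-tracking hardware.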

Details Motivation: Accurately predicting user gaze behavior in virtual reality matters for rendering optimization and interface design, but existing methods still face challenges. Method: Uses a gated fusion mechanism with cross-modal attention to fuse temporal gaze patterns, head movement data, and visual scene information, adaptively adjusting the weight of each modality. Result: On a dataset spanning 22 VR scenes and 5.3M gaze samples, multimodal fusion improves gaze prediction accuracy 1-3 frames ahead; cross-scene testing reaches 93.1% validation accuracy and shows good temporal consistency. Conclusion: The method improves gaze prediction in VR environments, contributes to understanding attention mechanisms in virtual environments, and offers a feasible technical path for rendering optimization, interaction design, and user experience evaluation. Abstract: Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1-3 future frames. Cross-scene generalization testing shows consistent performance with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye tracking hardware.
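
Illustrative sketch (not the authors' code): gated fusion can be read as computing one softmax weight per modality from the concatenated features and returning the weighted sum. The sketch below assumes each modality has already been encoded into a vector of the same size and leaves out the cross-modal attention part; `w_gate` and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(gaze, head, scene, w_gate):
    """Fuse three modality embeddings of shape (d,) with context-dependent gates.

    w_gate has shape (3, 3 * d): a hypothetical linear layer that maps the
    concatenated features to one logit per modality.
    """
    feats = np.stack([gaze, head, scene])   # (3, d)
    context = feats.reshape(-1)             # (3 * d,) concatenated context
    gates = softmax(w_gate @ context)       # (3,) weights that sum to 1
    fused = gates @ feats                   # (d,) weighted sum of modalities
    return fused, gates
```

With all-zero gate weights every modality receives weight 1/3, so the fused vector is the plain average; learned weights would shift that balance per input.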

[116] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Yaoli Liu,Ziheng Ouyang,Shengtao Lou,Yiren Song

Main category: cs.CV

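TL;DR: Proposes OmniRefiner, a detail-aware reference-guided image generation framework that improves pixel-level consistency through two stages of refinement, markedly improving fine-grained detail preservation while maintaining structural fidelity.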

Details Motivation: Existing VAE-based diffusion models lose texture details in reference-guided image generation because of latent-space compression, and post-editing methods tend to produce local modifications that are inconsistent with the original image in lighting, texture, or shape. Method: Proposes OmniRefiner with two consecutive stages: first, a single-image diffusion editor is fine-tuned to jointly take the draft image and the reference image as input for globally coherent refinement; then reinforcement learning strengthens localized editing, explicitly optimizing detail accuracy and semantic consistency. Result: OmniRefiner significantly outperforms open-source and commercial models on several challenging benchmarks, showing stronger reference alignment and fine-grained detail preservation. Conclusion: OmniRefiner addresses detail loss in reference-guided generation and produces more faithful, visually coherent edits. Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

[117] CREward: A Type-Specific Creativity Reward Model

Jiyeon Han,Ali Mahdavi-Amiri,Hao Zhang,Haedong Jeong

Main category: cs.CV

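TL;DR: This paper proposes CREward, the first type-specific creativity reward model, covering three creativity dimensions (geometry, material, and texture); it is trained on labels generated by large vision-language models and applied to creativity assessment, explainable creativity, and creative sample acquisition.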

Details Motivation: Treating creativity as a single metric is an oversimplification; creativity should be modeled and assessed in finer detail along different aspects of image formation, such as geometry, material, and texture. Method: First runs a human benchmark evaluation to capture how people perceive each type of creativity, then analyzes the correlation between human judgments and predictions by large vision-language models (LVLMs), and finally trains CREward on LVLM-generated labels. Result: LVLMs align strongly with human perception of creativity, and the resulting CREward model can be applied to creativity assessment, explanation, and generation. Conclusion: CREward supports multi-dimensional modeling of creativity and offers an interpretable, practical tool for evaluating and generating creative images. Abstract: Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes" (geometry, material, and texture) to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

[118] On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation

Changyue Li,Jiaying Li,Youliang Yuan,Jiaming He,Zhicong Huang,Pinjia He

Main category: cs.CV

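TL;DR: This paper proposes SAUP, a new adversarial attack that manipulates a model's outputs across multiple decisions with a single perturbation; experiments show a high attack success rate against multimodal large language models in real-world scenarios.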

Details Motivation: Conventional adversarial attacks target a single decision, whereas real-world models make sequences of decisions in which an isolated mistake is easily corrected but cascading errors can cause severe risks; this calls for studying attacks that induce systematic errors. Method: Proposes Semantic-Aware Universal Perturbations (SAUPs), optimized by searching in a normalized space with a semantic separation strategy so that the attack effect depends on input semantics, and builds RIST, a real-world image dataset with fine-grained semantic annotations, for evaluation. Result: Experiments on three multimodal large language models show that a single adversarial frame can control five distinct targets with a 70% attack success rate. Conclusion: SAUP exposes a new security threat to real-world sequential decision systems, in which one perturbation triggers cascading semantic errors, and highlights how fragile current models are in complex semantic environments. Abstract: Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag". To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame.

[119] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network

Yuanzhe Li,Steffen Müller

Main category: cs.CV

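TL;DR: Proposes a multimodal fusion network that combines seven modality features from visual and motion branches, extracts features with Transformers, and uses depth-guided, modality, and temporal attention to improve pedestrian crossing intention prediction, outperforming baseline methods on the JAAD dataset.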

Details Motivation: The diversity of pedestrian behavior and its dependence on many contextual factors make crossing intention prediction challenging, yet accurate prediction is essential for autonomous vehicles to operate safely in urban environments. Method: Designs a multimodal fusion network in which Transformer-based modules extract features from raw visual and motion inputs; a depth-guided attention module uses depth information to steer attention toward salient regions in another modality; modality attention and temporal attention dynamically emphasize informative modalities and key frames. Result: Extensive experiments on the JAAD dataset validate the method, which outperforms the baseline methods. Conclusion: The network effectively integrates complementary cross-modal information, improves the accuracy of pedestrian crossing intention prediction, and helps make autonomous driving safer in complex urban environments. Abstract: Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. A depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.

[120] Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments

Yuanzhe Li,Hang Zhong,Steffen Müller

Main category: cs.CV

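TL;DR: This paper proposes a multi-context fusion Transformer (MFT) that fuses contextual information along four dimensions (pedestrian behavior, environment, pedestrian localization, and vehicle motion) to improve the accuracy of pedestrian crossing intention prediction for autonomous driving.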

Details Motivation: In urban environments, the many factors that influence pedestrian behavior make accurate crossing intention prediction challenging, so a method that effectively integrates multi-source contextual information is needed. Method: MFT uses a progressive fusion strategy: mutual intra-context attention first lets features interact within each context and yields a context token; mutual cross-context attention then fuses the contexts together with a global CLS token; finally, guided attention refines the context tokens and the CLS token for deeper integration. Result: MFT reaches accuracies of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively, outperforming state-of-the-art methods; ablation studies confirm the effectiveness of the architecture and the contribution of each input context. Conclusion: Through its multi-level attention design, MFT fuses multiple contexts efficiently and improves pedestrian crossing intention prediction, with good application potential and interpretability. Abstract: Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions, encompassing pedestrian behavior context, environmental context, pedestrian localization context and vehicle motion context, to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and contribution of different input context. Our code is open-source: https://github.com/ZhongHang0307/Multi-Context-Fusion-Transformer.
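
Illustrative sketch (not the authors' code): the cross-context step, in which a global CLS token gathers information from the per-context tokens, can be pictured as standard scaled dot-product attention with the CLS token as the query. The sketch assumes a single head with no learned projections; the function name and shapes are hypothetical.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention.

    query: (q, d), keys: (k, d), values: (k, d_v).
    Returns the attended output (q, d_v) and the weights (q, k).
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Example: one CLS token attending over four context tokens
# (behavior, environment, localization, vehicle motion), each of size 8.
rng = np.random.default_rng(0)
cls_token = rng.normal(size=(1, 8))
context_tokens = rng.normal(size=(4, 8))
cls_updated, context_weights = attention(cls_token, context_tokens, context_tokens)
```

The four entries of `context_weights` sum to 1 and indicate how much each context contributes to the updated CLS token in this simplified view.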

[121] ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

Yuanzhe Li,Steffen Müller

Main category: cs.CV

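TL;DR: This paper proposes the attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction; it combines six visual and motion modalities and uses a dual-path attention mechanism and cross-modal interaction modules to improve prediction, outperforming existing methods on the JAAD dataset.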

Details Motivation: Effectively extracting and fusing complementary cues from multimodal data remains a major challenge in pedestrian crossing intention prediction, and existing methods struggle to model cross-modal relations fully. Method: ACIT groups six modalities into three interaction pairs; the visual pairs use a dual-path attention mechanism (self-attention plus optical flow-guided attention) and the motion pair uses cross-modal attention; a multi-modal feature fusion module and a Transformer-based temporal aggregation module then capture spatio-temporal dependencies. Result: ACIT reaches 70% and 89% accuracy on JAADbeh and JAADall, respectively, outperforming existing methods, and ablation studies confirm the contribution of each module. Conclusion: Fine-grained cross-modal interaction substantially improves pedestrian crossing intention prediction and shows the potential of multimodal attention and Transformer architectures for this task. Abstract: Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian's bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.

[122] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

Seungjun Yu,Seonho Lee,Namho Kim,Jaeyo Shin,Junsung Park,Wonjeong Ryu,Raehyuk Jung,Hyunjung Shim

Main category: cs.CV

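TL;DR: This paper defines a new task, Safety-Critical Reasoning, which uses multi-view inputs to improve reasoning in high-risk autonomous driving scenarios, and releases WaymoQA, a dataset of 35,000 human-annotated question-answer pairs; multimodal large language models fine-tuned on it reason significantly better in safety-critical scenarios.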

Details Motivation: Multimodal large language models have made progress in understanding driving scenes, but high-level reasoning in safety-critical scenarios remains difficult, in particular because a single front view does not provide enough information about the environment to handle complex risks. Method: Defines the Safety-Critical Reasoning task and splits it into two stages: first resolve the immediate risk, then mitigate the downstream risks induced by that decision; uses multi-view inputs for a more complete view of the environment and releases the WaymoQA dataset for training and evaluation. Result: Existing MLLMs perform poorly in safety-critical scenarios, but fine-tuning on WaymoQA significantly improves their reasoning, which supports the usefulness of the dataset. Conclusion: Multi-view inputs and staged reasoning improve decision quality in safety-critical driving scenarios, and WaymoQA supports training safer, more reasoning-capable driving agents. Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.

[123] SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM

Lin Chen,Yingjian Zhu,Qi Yang,Xin Niu,Kun Ding,Shiming Xiang

Main category: cs.CV

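TL;DR: This paper proposes SAM-MI, a new mask-injected framework that addresses over-segmentation and the hard combination of fixed masks with labels in open-vocabulary semantic segmentation.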

Details Motivation: Existing SAM-based methods for open-vocabulary semantic segmentation suffer from over-segmentation and from rigid combinations of masks and labels, which limits their performance. Method: A text-guided sparse point prompter generates sparse prompts to speed up mask generation; Shallow Mask Aggregation (SMAgg) merges partial masks to reduce over-segmentation; Decoupled Mask Injection (DMI) injects SAM-generated masks as guidance at low and high frequencies separately. Result: SAM-MI is validated on multiple benchmarks; on the MESS benchmark it achieves a 16.7% relative mIoU improvement over Grounded-SAM together with a 1.6× speedup. Conclusion: SAM-MI offers a new way to integrate SAM into open-vocabulary semantic segmentation models, improving both segmentation quality and efficiency. Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, there are still some challenges: (1) SAM's tendency to over-segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask-injected framework, SAM-MI, which effectively integrates SAM with OVSS models to address these challenges. Initially, SAM-MI employs a Text-guided Sparse Point Prompter to sample sparse prompts for SAM instead of previous dense grid-like prompts, thus significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks to mitigate SAM's over-segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM-generated masks for guidance at low-frequency and high-frequency separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM-MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded-SAM on the MESS benchmark, along with a 1.6× speedup. We hope SAM-MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.

[124] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng,Zhixing Tan

Main category: cs.CV

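TL;DR: Proposes Vision-Guided Attention (VGA), a training-free method that uses the semantic content of visual tokens to build precise visual grounding, reducing hallucinations in MLLMs and achieving state-of-the-art dehallucination performance across multiple models and benchmarks.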

Details Motivation: MLLMs can extract visual semantics accurately from visual tokens but do not fully exploit this during inference, and the limited localization ability of visual attention makes them prone to hallucination. Method: Proposes Vision-Guided Attention (VGA), which builds precise visual grounding from the semantics of visual tokens and uses it to steer the model toward relevant regions; for image captioning, it dynamically suppresses regions that have already been described. VGA requires no training and is compatible with efficient attention implementations such as FlashAttention. Result: VGA achieves state-of-the-art dehallucination results across multiple MLLMs and hallucination benchmarks, adding only 4.36% latency with a single forward pass per token. Conclusion: Explicit visual guidance is crucial for improving the visual understanding of MLLMs, and VGA offers an efficient, general, and practical way to reduce hallucinations. Abstract: Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.

[125] Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

Xingyue Lin,Shuai Peng,Xiangyu Xie,Jianhua Zhu,Yuxuan Zhou,Liangcai Gao

Main category: cs.CV

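TL;DR: This paper proposes COVec, an illumination-aware image vectorization framework inspired by the Clair-Obscur principle of light-shade contrast; it is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers, and it improves visual fidelity and editability through semantic-guided initialization and two-stage optimization.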

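Details Motivation: Existing image vectorization methods struggle to represent complex real-world images and often produce fragmented shapes or sacrifice semantic conciseness, so a method is needed that preserves both visual fidelity and structural coherence. Method: Proposes the COVec framework, which follows the Clair-Obscur principle and performs intrinsic image decomposition in the vector domain, separating albedo, shade, and light layers; it uses a semantic-guided initialization strategy and a two-stage optimization with differentiable rendering to refine each vector layer. Result: Experiments on several datasets show that COVec clearly improves visual fidelity and editability over existing methods. Conclusion: COVec is the first work to bring intrinsic image decomposition to vector representations; it addresses fragmentation in the vectorization of complex images and produces high-quality, editable, semantically clean vector output. Abstract: Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light-shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.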

[126] MFM-point: Multi-scale Flow Matching for Point Cloud Generation

Petr Molodyk,Jaemoo Choi,David W. Romero,Ming-Yu Liu,Yongxin Chen

Main category: cs.CV

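TL;DR: This paper proposes MFM-Point, a point cloud generation framework based on multi-scale flow matching; its coarse-to-fine generation substantially improves the performance and scalability of point-based methods while keeping their efficiency.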

Details Motivation: Existing point-based generation methods are simple and efficient but usually fall short of representation-based methods in generation quality, and they lack effective multi-scale modeling. Method: Proposes the multi-scale flow matching framework MFM-Point, which uses a geometry-preserving structured downsampling and upsampling strategy to obtain smooth distributional transitions across resolutions and thus a coarse-to-fine generation process. Result: MFM-Point achieves the best performance among point-based methods and approaches, or even challenges, state-of-the-art representation-based methods on multi-category and high-resolution generation tasks. Conclusion: Without adding training or inference overhead, MFM-Point improves the quality and scalability of point-based point cloud generation and suggests a new direction for efficient, high-quality 3D generation. Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-Point demonstrates strong results in multi-category and high-resolution generation tasks.
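
Illustrative sketch (not the authors' code): the single-scale building block behind flow matching can be written in a few lines. The sketch assumes the common linear interpolation path between a noise cloud `x0` and a data cloud `x1`, for which the regression target is the constant velocity `x1 - x0`; the paper's multi-scale downsampling and upsampling are not modeled, and the function names are hypothetical.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Return the point on the linear path at time t and its target velocity.

    x0: noise points (N, 3), x1: data points (N, 3), t: scalar in [0, 1].
    """
    x_t = (1.0 - t) * x0 + t * x1   # interpolate between noise and data
    v_target = x1 - x0              # velocity of the linear path
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target per-point velocities."""
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))
```

A network trained with this loss predicts a velocity for each point at time `t`; sampling then integrates that velocity field from noise toward data. In a coarse-to-fine scheme the same loss would be applied at each resolution.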

[127] History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images

Huijia Zhao,Jie Lu,Yunqing Jiang,Xiao-Ping Lu,Kaichang Di

Main category: cs.CV

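TL;DR: Proposes HACBSR, an unsupervised blind super-resolution framework that needs neither ground-truth images nor external kernel priors; it combines a contrastive kernel sampling mechanism with contrastive learning based on historical models, and releases the Ceres-50 dataset for planetary remote sensing images.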

Details Motivation: Planetary remote sensing images suffer from diverse, unknown degradations caused by imaging environments and hardware constraints, and the lack of ground-truth high-resolution images holds back supervised blind super-resolution, so an unsupervised method that needs no ground truth is required. Method: Proposes the HACBSR framework: (1) a contrastive kernel sampling mechanism with kernel similarity control that mitigates the distribution bias of Gaussian sampling; (2) history-augmented contrastive learning, which uses historical models to generate negative samples, enabling less greedy optimization and inducing strong convexity. The Ceres-50 dataset is built for evaluation. Result: Across multiple upscaling factors, HACBSR is competitive with state-of-the-art unsupervised methods; a convergence analysis is provided, and the code and the Ceres-50 dataset are publicly available. Conclusion: HACBSR is an effective unsupervised blind super-resolution framework that achieves high-quality planetary image super-resolution without ground-truth images and has good application potential. Abstract: Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervised blind super-resolution due to the lack of ground-truth images. This work presents History-Augmented Contrastive Blind Super-Resolution (HACBSR), an unsupervised framework for blind super-resolution that operates without ground-truth images and external kernel priors. HACBSR comprises two components: (1) a contrastive kernel sampling mechanism with kernel similarity control to mitigate distribution bias from Gaussian sampling, and (2) a history-augmented contrastive learning that uses historical models to generate negative samples to enable less greedy optimization and to induce strong convexity without ground-truth. A convergence analysis of the history-augmented contrastive learning is given in the Appendix. To support evaluation in planetary applications, we introduce Ceres-50, a dataset with diverse geological features and simulated degradation patterns. Experiments show that HACBSR achieves competitive performance compared with state-of-the-art unsupervised methods across multiple upscaling factors. The code is available at https://github.com/2333repeat/HACBSR, and the dataset is available at https://github.com/2333repeat/Ceres-50.

[128] DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination

Mingyang Ou,Haojin Li,Yifeng Zhang,Ke Niu,Zhongxi Qiu,Heng Li,Jiang Liu

Main category: cs.CV

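TL;DR: Proposes DeLight-Mono, a self-supervised monocular depth estimation framework that uses illumination decoupling to reduce the impact of uneven illumination in endoscopic images on depth estimation.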

Details Motivation: Existing low-light enhancement techniques fail to guide the depth network effectively, and methods from other fields do not suit unevenly lit endoscopic images, which degrades performance. Method: Designs an illumination-reflectance-depth model, decomposes images with auxiliary networks, and proposes a self-supervised joint optimization framework with novel losses that use the decoupled components for depth estimation. Result: Extensive comparisons and an ablation study on two public datasets confirm the effectiveness of the method. Conclusion: DeLight-Mono mitigates the effect of uneven illumination on endoscopic depth estimation and improves self-supervised depth estimation performance. Abstract: Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inherent in endoscopic images, particularly in low-intensity regions. Existing low-light enhancement techniques fail to effectively guide the depth network. Furthermore, solutions from other fields, like autonomous driving, require well-lit images, making them unsuitable and increasing data collection burdens. To this end, we present DeLight-Mono, a novel self-supervised monocular depth estimation framework with illumination decoupling. Specifically, endoscopic images are represented by a designed illumination-reflectance-depth model, and are decomposed with auxiliary networks. Moreover, a self-supervised joint-optimizing framework with novel losses leveraging the decoupled components is proposed to mitigate the effects of uneven illumination on depth estimation. The effectiveness of the proposed methods was rigorously verified through extensive comparisons and an ablation study performed on two public datasets.

[129] FLaTEC: Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds

Xiaoge Zhang,Zijie Wu,Mingtao Feng,Zichen Geng,Mehwish Nasim,Saeed Anwar,Ajmal Mian

Main category: cs.CV

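TL;DR: This paper proposes FLaTEC, a frequency-aware point cloud compression model that decouples low-frequency structures from high-frequency textures and uses a hybrid latent triplane representation, reaching high compression ratios while maintaining high-quality reconstruction.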

Details Motivation: Conventional point cloud compression methods struggle to balance compression ratio against reconstruction quality because different frequency components contribute differently at the same resolution, so a method that separates and handles these components is needed. Method: Introduces a frequency-aware mechanism and converts voxelized embeddings into triplane representations to reduce sparsity and computational cost; a frequency-disentangling technique extracts compact low-frequency content and collects high-frequency details across scales; the decoupled components are stored in binary format; during decoding a modulation block progressively recovers the full-spectrum signal; a frequency-based attention mechanism strengthens local connectivity. Result: On the SemanticKITTI and Ford datasets, FLaTEC improves BD-rate over standard codecs by 78% and 94%, respectively, achieving state-of-the-art rate-distortion performance. Conclusion: With its frequency-aware triplane compression framework, FLaTEC improves both the efficiency and the reconstruction quality of point cloud compression and suits efficient coding of high-resolution point clouds. Abstract: Point cloud compression methods jointly optimize bitrates and reconstruction distortion. However, balancing compression ratio and reconstruction quality is difficult because low-frequency and high-frequency components contribute differently at the same resolution. To address this, we propose FLaTEC, a frequency-aware compression model that enables the compression of a full scan with high compression ratios. Our approach introduces a frequency-aware mechanism that decouples low-frequency structures and high-frequency textures, while hybridizing latent triplanes as a compact proxy for point cloud. Specifically, we convert voxelized embeddings into triplane representations to reduce sparsity, computational cost, and storage requirements. We then devise a frequency-disentangling technique that extracts compact low-frequency content while collecting high-frequency details across scales. The decoupled low-frequency and high-frequency components are stored in binary format. During decoding, full-spectrum signals are progressively recovered via a modulation block. Additionally, to compensate for the loss of 3D correlation, we introduce an efficient frequency-based attention mechanism that fosters local connectivity and outputs arbitrary resolution points. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78% and 94% in BD-rate on the SemanticKITTI and Ford datasets, respectively.
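
Illustrative sketch (not the authors' code): two of the ideas above, collapsing a voxel grid into three axis-aligned planes and splitting a plane into low- and high-frequency parts, can be mimicked with simple pooling. The sketch assumes mean pooling for the projection and average pooling plus a residual for the frequency split; both are stand-ins for the learned components, and the function names are hypothetical.

```python
import numpy as np

def voxels_to_triplanes(vox):
    """Collapse a dense voxel embedding (X, Y, Z, C) into three feature planes.

    Each plane averages over the axis it drops: xy (X, Y, C), xz (X, Z, C),
    yz (Y, Z, C). Storage falls from X*Y*Z cells to X*Y + X*Z + Y*Z cells.
    """
    return vox.mean(axis=2), vox.mean(axis=1), vox.mean(axis=0)

def split_frequencies(plane, factor=2):
    """Split a plane (H, W, C) into a coarse low-frequency part and a residual.

    H and W must be divisible by factor. The low part is an average-pooled
    plane; the high part is what remains after nearest-neighbor upsampling.
    """
    h, w, c = plane.shape
    low = plane.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    upsampled = low.repeat(factor, axis=0).repeat(factor, axis=1)
    high = plane - upsampled
    return low, high
```

Upsampling `low` and adding `high` reproduces the plane exactly, so the split itself loses nothing; compression would come from how each part is quantized and entropy coded.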

[130] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Simon Damm,Jonas Ricker,Henning Petzka,Asja Fischer

Main category: cs.CV

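TL;DR: Proposes PRADA, a probability-ratio-based method for detecting and attributing autoregressively generated images that is both effective and interpretable.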

Details Motivation: There is a lack of detection methods aimed specifically at images produced by autoregressive (AR) image generators, and reliable detection is urgently needed as AI-generated images become more realistic. Method: Inspects the ratio of an AR model's conditional and unconditional probabilities for the token sequence representing an image, exploits the distinctive behavior of this ratio to calibrate a simple model-specific score function, and performs threshold-based detection and attribution. Result: PRADA is validated on eight class-to-image and four text-to-image models, accurately detecting and attributing AR-generated images. Conclusion: PRADA is a simple, interpretable, and effective method for detecting AR-generated images and tracing them back to their source model. Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently generating deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model's conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
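
Illustrative sketch (not the authors' code): the key quantity, the ratio of conditional to unconditional probability of an image's token sequence, is easiest to handle in log space as a sum of per-token differences. The sketch assumes the per-token log-probabilities have already been obtained from the AR model, and it uses the raw log-ratio with a "higher means generated" threshold as a stand-in for the calibrated, model-specific score function described in the abstract; the function names are hypothetical.

```python
import numpy as np

def log_prob_ratio(cond_logprobs, uncond_logprobs):
    """Log of p(tokens | condition) / p(tokens), summed over the token sequence.

    Both inputs are per-token log-probabilities of the same token sequence,
    scored once with and once without the conditioning (class or text).
    """
    cond = np.asarray(cond_logprobs, dtype=float)
    uncond = np.asarray(uncond_logprobs, dtype=float)
    return float(np.sum(cond - uncond))

def is_generated_by_model(cond_logprobs, uncond_logprobs, threshold):
    """Threshold rule (assumed direction): flag the image if the score is high."""
    return log_prob_ratio(cond_logprobs, uncond_logprobs) > threshold
```

In practice the threshold would be calibrated per model on held-out real and generated images; attribution could then compare scores across candidate source models.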

[131] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

Jinghan Zhao,Yifei Huang,Feng Lu

Main category: cs.CV

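TL;DR: Proposes the Task-Step-State (TSS) framework, which introduces observable "states" as visual semantic anchors to improve the procedural understanding captured by video representations.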

Details Motivation: Existing methods align visual content with text at the task and step levels, but such high-level, abstract descriptions are hard to align robustly with concrete visual details. Method: Introduces "states", textual snapshots of object configurations, to build the TSS hierarchy, and uses a progressive pre-training strategy that aligns tasks, steps, and states level by level. Result: On the COIN and CrossTask datasets, the model outperforms baselines on task recognition, step recognition, and next step prediction; ablation studies show that state supervision is the key driver of the gains. Conclusion: Visually grounded state representations combined with progressive pre-training make it possible to learn hierarchical, procedure-aware video representations more effectively. Abstract: Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, 'task' and 'step' descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce 'states', i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

[132] Blind Adaptive Local Denoising for CEST Imaging

Chu Chen,Aitor Artola,Yang Liu,Se Weon Park,Raymond H. Chan,Jean-Michel Morel,Kannie W. Y. Chan

Main category: cs.CV

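TL;DR: Proposes a new Blind Adaptive Local Denoising (BALD) method to address noise in Chemical Exchange Saturation Transfer (CEST) MRI, significantly improving quantitative contrast mapping and downstream clinical tasks.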

Details Motivation: Clinical translation of CEST MRI is limited by spatially varying noise and by heteroscedastic noise introduced by complex imaging protocols; traditional denoising methods cannot handle such noise effectively and may destroy critical biomedical information. Method: Exploits the self-similarity of CEST data to design an adaptive variance-stabilizing transform that equalizes the noise distribution across pixels without prior noise information; two-stage denoising with a local SVD decomposition separates molecular signals from noise while avoiding spatial and spectral artifacts. Result: Across multiple phantom and in vivo CEST scans, BALD outperforms state-of-the-art methods on denoising metrics and on downstream tasks such as molecular concentration map estimation and cancer detection. Conclusion: BALD is an effective denoiser for CEST MRI that overcomes heteroscedastic noise, improves quantitative imaging accuracy, and shows good potential for clinical use. Abstract: Chemical Exchange Saturation Transfer (CEST) MRI enables molecular-level visualization of low-concentration metabolites by leveraging proton exchange dynamics. However, its clinical translation is hindered by inherent challenges: spatially varying noise arising from hardware limitations, and complex imaging protocols introduce heteroscedasticity in CEST data, perturbing the accuracy of quantitative contrast mapping such as amide proton transfer (APT) imaging. Traditional denoising methods are not designed for this complex noise and often alter the underlying information that is critical for biomedical analysis. To overcome these limitations, we propose a new Blind Adaptive Local Denoising (BALD) method. BALD exploits the self-similar nature of CEST data to derive an adaptive variance-stabilizing transform that equalizes the noise distributions across CEST pixels without prior knowledge of noise characteristics. Then, BALD performs two-stage denoising on a linear transformation of data to disentangle molecular signals from noise. A local SVD decomposition is used as a linear transform to prevent spatial and spectral denoising artifacts. We conducted extensive validation experiments on multiple phantoms and in vivo CEST scans. In these experiments, BALD consistently outperformed state-of-the-art CEST denoisers in both denoising metrics and downstream tasks such as molecular concentration maps estimation and cancer detection.

[133] Explainable Visual Anomaly Detection via Concept Bottleneck Models

Arianna Stropeni,Valentina Zaccaria,Francesco Borsatti,Davide Dalle Pezze,Manuel Barusco,Gian Antonio Susto

Main category: cs.CV

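TL;DR: Proposes CONVAD, which extends Concept Bottleneck Models (CBMs) to Visual Anomaly Detection (VAD); by learning meaningful concepts it provides human-interpretable anomaly descriptions, improving interpretability while maintaining detection performance.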

Details Motivation: Existing VAD methods can localize anomalies visually but do not give users an intuitive, semantically clear explanation, so a method that offers semantic-level interpretability is needed. Method: Adapts and improves the Concept Bottleneck Model (CBM) for VAD, builds a Concept Dataset to support this research, and designs a pipeline for synthesizing artificial anomalies to reduce dependence on rare anomalous samples; the model outputs both concept-based and visual explanations. Result: CONVAD achieves detection performance comparable to classic VAD methods on multiple benchmarks while producing richer, more semantically meaningful concept-driven explanations that improve interpretability and user trust. Conclusion: By introducing concept learning, CONVAD delivers interpretability at both the visual and semantic levels without sacrificing detection performance, giving VAD systems a more intuitive and trustworthy way to explain anomalies. Abstract: In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.

[134] WPT: World-to-Policy Transfer via Online World Model Distillation

Guangfeng Jiang,Yueru Luo,Jun Liu,Yi Huang,Yiyao Zhu,Zhan Qu,Dave Zhenyu Chen,Bingbing Liu,Xu Yan

Main category: cs.CV

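TL;DR: Proposes WPT (World-to-Policy Transfer), a new training paradigm that uses online distillation guided by an end-to-end world model to achieve efficient, real-time policy learning, outperforming existing methods on multiple benchmarks.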

Details Motivation: Existing world-model approaches suffer from tight runtime coupling or depend on offline reward signals, which causes substantial inference overhead or hinders end-to-end optimization and limits their use in real-time decision-making. Method: Proposes the WPT training paradigm, which includes a trainable reward model that aligns candidate trajectories with the future dynamics predicted by the world model to form a teacher policy; policy distillation and world reward distillation then transfer the teacher's knowledge to a lightweight student policy. Result: WPT attains a 0.11 collision rate on the open-loop benchmark and a 79.23 driving score on the closed-loop benchmark, surpassing world-model-based and imitation-learning methods, and the student policy runs up to 4.9x faster at inference. Conclusion: WPT decouples complex modeling from real-time decision-making and balances performance with efficiency, offering a practical route to deploying world models in real-time systems such as autonomous driving. Abstract: Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent's actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher's reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that our WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference, while retaining most of the gains.

[135] Exploring State-of-the-art models for Early Detection of Forest Fires

Sharjeel Ahmed,Daim Armaghan,Fatima Naweed,Umair Yousaf,Ahmad Zubair,Murtaza Taj

Main category: cs.CV

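TL;DR: Presents a vision-based detection approach for early warning of forest fires, builds a dataset synthesized with a game simulator and combined with public images to improve detection, and compares YOLOv7 with detection transformer models on recognizing smoke and incipient fire.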

Details Motivation: Existing forest fire detection methods lack sizeable task-specific datasets and tuned models, which leads to missed detections and makes early warning difficult. Method: Builds a new dataset focused on early fire cues such as smoke, drawn from a game simulator (e.g., Red Dead Redemption 2) and from published images; on this dataset, compares YOLOv7 with several detection transformer models on image classification and localization. Result: Contributes a new dataset better suited to early fire detection and experimentally evaluates mainstream deep learning models on the task, providing a baseline for later work. Conclusion: Combining synthetic data with real images can ease the data scarcity in early forest fire detection; the resulting dataset helps models recognize incipient fires and supports the development of early warning systems. Abstract: There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due to the lack of sizeable datasets and models tuned for this task, existing methods suffer from missed detection. In this work, we first propose a dataset for early identification of forest fires through visual analysis. Unlike existing image corpuses that contain images of wide-spread fire, our dataset consists of multiple instances of smoke plumes and fire that indicate the initiation of fire. We obtained this dataset synthetically by utilising game simulators such as Red Dead Redemption 2. We also combined our dataset with already published images to obtain a more comprehensive set. Finally, we compared image classification and localisation methods on the proposed dataset. More specifically we used YOLOv7 (You Only Look Once) and different models of detection transformer.

[136] Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection

Abishek Karthik,Pandiyaraju V

Main category: cs.CV

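TL;DR: Proposes an integrated approach that combines deep learning with multi-head attention to detect cardiomegaly automatically from X-ray images. An Inception V3 model with attention mechanisms improves diagnostic performance and scores well on several evaluation metrics.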

Details Motivation: To improve the accuracy and clinical usefulness of automatic cardiomegaly detection in X-ray images, and to address the reliance of traditional practice on manual reading, which is slow and prone to misdiagnosis. Method: A CNN architecture based on Inception V3, extended with a multilayer, multi-head attention mechanism, extracts features from X-ray images and focuses on weighted regions; after data collection and preprocessing, attention scores are used to strengthen the representation of key regions. Result: The model reports accuracy of 95.6%, precision of 95.2%, recall of 96.2%, sensitivity of 95.7%, specificity of 96.1%, and AUC of 96.0%, with the corresponding plots provided for visualization. Conclusion: The proposed attention-augmented deep learning model detects cardiomegaly efficiently and accurately and shows good prospects for clinical use. Abstract: The healthcare industry has been revolutionized significantly by novel imaging technologies, not just in the diagnosis of cardiovascular diseases but also by the visualization of structural abnormalities like cardiomegaly. This article explains an integrated approach to the use of deep learning tools and attention mechanisms for automatic detection of cardiomegaly using X-ray images. The initiation of the project is grounded on a strong Data Collection phase and gathering the data of annotated X-ray images of various types. Then, while the Preprocessing module fine-tunes image quality, it is feasible to utilize the best out of the data quality in the proposed system. In our proposed system, the process is a CNN configuration leveraging the inception V3 model as one of the key blocks. Besides, we also employ a multilayer attention mechanism to enhance the strength. The most important feature of the method is the multi-head attention mechanism that can learn features automatically. By exact selective focusing on only some regions of input, the model can thus identify cardiomegaly in a sensitive manner. Attention rating is calculated, duplicated, and applied to enhance representation of main data, and therefore there is a successful diagnosis. The Evaluation stage will be extremely strict and it will thoroughly evaluate the model based on such measures as accuracy and precision. This will validate that the model can identify cardiomegaly and will also show the clinical significance of this method. 
The model has an accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1 and an Area Under Curve (AUC) of 96.0, and their respective graphs are plotted for visualisation.
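
As a reading aid for the metrics listed above, the following minimal sketch computes them from confusion-matrix counts using the standard binary-classification definitions. The function name and the counts are hypothetical and are not taken from the paper. Note that under these standard definitions recall and sensitivity are the same quantity, so the two different values reported in the abstract presumably come from different evaluation settings that the abstract does not describe.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # identical to sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```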

[137] LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening

Johannes Brandt,Maulik Chevli,Rickmer Braren,Georgios Kaissis,Philip Müller,Daniel Rueckert

Main category: cs.CV

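TL;DR: LungEvaty is a fully transformer-based framework that predicts 1-6 year lung cancer risk from a single low-dose CT scan; it is highly scalable, needs no region supervision, and matches state-of-the-art performance.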

Details Motivation: Existing lung cancer risk prediction methods rely on pixel-level annotations or analyze the lung in fragments, which limits scalability and performance and makes it hard to meet the demands of large-scale screening data. Method: Proposes LungEvaty, a fully transformer-based model trained end to end on whole-lung CT inputs; it learns anatomical and pathological cues relevant to malignancy risk from large-scale screening data and adds an optional Anatomically Informed Attention Guidance (AIAG) loss to sharpen attention localization. Result: Trained on more than 90,000 CT scans in total, including over 28,000 for fine-tuning and 6,000 for evaluation, the model matches state-of-the-art performance using imaging data alone. Conclusion: LungEvaty offers a simple, efficient, open-source approach to lung cancer risk prediction and a scalable foundation for future longitudinal and multimodal research. Abstract: Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

[138] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Min Zhao,Hongzhou Zhu,Yingze Wang,Bokai Yan,Jintao Zhang,Guande He,Ling Yang,Chongxuan Li,Jun Zhu

Main category: cs.CV

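TL;DR: Proposes UltraViCo, a training-free method that suppresses attention to tokens beyond the training window, addressing the quality degradation and content repetition that video diffusion transformers show under length extrapolation and markedly improving generation quality and extrapolation range.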

Details Motivation: Existing video diffusion transformers perform poorly beyond their training length, showing content repetition and quality degradation, and prior methods have not resolved this at its root. Method: Starting from attention maps, the analysis finds that both failure modes share one cause, attention dispersion, and proposes UltraViCo, which suppresses attention to tokens beyond the training window with a constant decay factor. Result: UltraViCo outperforms existing methods across models and extrapolation ratios, raises the extrapolation limit from 2x to 4x, and at 4x extrapolation improves Dynamic Degree and Imaging Quality by 233% and 40.5% respectively. Conclusion: Suppressing attention to tokens beyond the training range effectively counters attention dispersion in video length extrapolation, and UltraViCo gives video generation models a general, plug-and-play route to longer sequences. Abstract: Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
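
The idea of damping attention to tokens beyond the training window can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say where the constant decay factor is applied or how "beyond the training window" is measured, so the sketch assumes that a key counts as out-of-window when its distance from the query is at least `train_window`, and that its unnormalized softmax weight is multiplied by `decay` before renormalization. The function and parameter names are hypothetical.

```python
import math


def decayed_attention(logits, query_pos, train_window, decay):
    """Softmax attention weights for one query, with out-of-window keys damped.

    Assumption of this sketch: a key is "beyond the training window" when
    abs(key_pos - query_pos) >= train_window, and its unnormalized weight is
    multiplied by `decay` (0 < decay <= 1) before the weights are renormalized.
    """
    peak = max(logits)  # subtract the max for numerical stability
    weights = []
    for key_pos, logit in enumerate(logits):
        w = math.exp(logit - peak)
        if abs(key_pos - query_pos) >= train_window:
            w *= decay
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


# Four keys with equal logits; the last two lie outside a window of 2.
print(decayed_attention([0.0, 0.0, 0.0, 0.0], query_pos=0, train_window=2, decay=0.5))
```

With `decay=1.0` the function reduces to an ordinary softmax, which makes the effect of the factor easy to compare against the unmodified attention weights.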

[139] Vision-Language Models for Automated 3D PET/CT Report Generation

Wenpei Jiao,Kun Shang,Hui Li,Ke Yan,Jiajin Zhang,Guangjie Yang,Lijuan Guo,Yan Wan,Xing Yang,Dakai Jin,Zhaoheng Xie

Main category: cs.CV

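TL;DR: Proposes PETRG-3D, an end-to-end 3D dual-branch framework for automated PET/CT report generation, and builds the multi-center lymphoma dataset PETRG-Lym and the public benchmark AutoPET-RG-Lym; together with style-adaptive prompts and the new evaluation protocol PETRG-Score, it substantially improves natural language and clinical metrics.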

Details Motivation: PET/CT scanners are multiplying quickly while trained specialists remain scarce, so automated report generation is urgently needed; in addition, the functional nature of PET (metabolic patterns, whole-body context) poses greater challenges than structural imaging, which existing methods handle poorly. Method: Proposes PETRG-3D, an end-to-end 3D dual-branch model that encodes PET and CT volumes separately and uses style-adaptive prompts to cope with differences in reporting style between hospitals; builds the multi-center dataset PETRG-Lym and the public benchmark AutoPET-RG-Lym; designs PETRG-Score, a lymphoma-specific evaluation protocol that jointly assesses metabolic and structural findings. Result: PETRG-3D substantially outperforms existing methods on natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), confirming the value of 3D dual-modality modeling and style-aware prompting. Conclusion: The work lays a foundation for PET/CT-specific report generation models and stresses disease-aware reasoning and clinically reliable evaluation; it may help reduce clinical workload and advance AI in nuclear medicine. Abstract: Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports w/ 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. 
Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.

[140] Hybrid Convolution and Frequency State Space Network for Image Compression

Haodong Pan,Hao Wei,Yusong Wang,Nanning Zheng,Caigui Jiang

Main category: cs.CV

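TL;DR: Proposes HCFSSNet, a hybrid architecture for learned image compression that combines convolutional neural networks (CNNs) with a Vision Frequency State Space (VFSS) block; it models long-range low-frequency dependencies while preserving structural information, improves entropy coding efficiency through adaptive frequency modulation and a frequency-aware attention module, and achieves competitive rate-distortion performance on several datasets with fewer parameters.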

Details Motivation: Existing image compression methods based on Transformers and state space models (SSMs) capture long-range dependencies but tend to lose structural information or ignore the frequency characteristics that matter for compression, whereas CNNs are good at capturing local high-frequency detail; a hybrid architecture that combines the strengths of both is therefore needed. Method: Proposes HCFSSNet with two core modules: 1) the Vision Frequency State Space (VFSS) block, composed of an Omni-directional Neighborhood State Space (VONSS) module for modeling multi-directional long-range low-frequency information and an Adaptive Frequency Modulation Module (AFMM) for content-adaptive weighting of DCT frequency bands; 2) the Frequency Swin Transformer Attention Module (FSTAM), which combines AFMM with a Swin Transformer for frequency-aware side information modeling in the entropy model. Result: On the Kodak, Tecnick, and CLIC Professional Validation datasets, HCFSSNet performs on par with recent SSM-based methods such as MambaIC while using significantly fewer parameters. Relative to the VTM anchor, BD-rate drops by 18.06% (Kodak), 24.56% (Tecnick), and 22.44% (CLIC). Conclusion: By fusing CNNs with a frequency-domain state space mechanism, HCFSSNet provides an efficient and interpretable image compression architecture that balances rate-distortion performance against model complexity and points to a new direction for learned image compression systems. Abstract: Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. 
Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
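
The AFMM description above, weighting discrete cosine transform frequency components, can be made concrete with a toy 1D sketch. It is only an illustration of frequency-band reweighting under stated assumptions: the real module presumably works on 2D feature maps with weights predicted from content, whereas here the signal is one-dimensional and `band_weights` is a fixed, hypothetical list. All function names are illustrative.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n)))
    return out


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
        out.append(total)
    return out


def modulate(x, band_weights):
    """Reweight each DCT coefficient, then return to the signal domain."""
    coeffs = dct(x)
    return idct([w * c for w, c in zip(band_weights, coeffs)])


signal = [1.0, 2.0, 3.0, 4.0]
# Hypothetical weights that keep only the lowest-frequency (DC) band.
smoothed = modulate(signal, [1.0, 0.0, 0.0, 0.0])
print(smoothed)
```

Keeping only the DC band replaces every sample with the signal mean, while all-ones weights return the input unchanged, which bounds the behaviour any intermediate weighting falls between.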

[141] Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Arnela Hadzic,Franz Thaler,Lea Bogensperger,Simon Johannes Joham,Martin Urschler

Main category: cs.CV

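TL;DR: Proposes Restora-Flow, a training-free method that guides flow matching sampling with a degradation mask and a trajectory correction mechanism, achieving superior perceptual quality and processing speed in image restoration.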

Details Motivation: Existing flow-model-based image restoration methods suffer from long processing times or over-smoothed results, so a more efficient, higher-quality method is needed. Method: Introduces Restora-Flow, which guides flow matching sampling with a degradation mask and adds a trajectory correction mechanism to enforce consistency with the degraded input; it applies to mask-based restoration tasks such as inpainting, super-resolution, and denoising. Result: Experiments on natural and medical datasets show that Restora-Flow achieves better perceptual quality and processing time than diffusion-based and flow-matching-based reference methods. Conclusion: Restora-Flow is an effective training-free method that delivers high-quality results across several image restoration tasks while significantly reducing processing time. Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.

[142] Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data

Xin Hong,Ying Shi,Yinhao Li,Yen-Wei Chen

Main category: cs.CV

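TL;DR: Proposes R-TNAG, a temporal modeling framework on a Riemannian manifold for irregularly sampled longitudinal sMRI data, to predict Alzheimer's disease progression more accurately.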

Details Motivation: The uncertainty of clinical examinations produces irregular observation intervals in longitudinal imaging data, and existing Euclidean-space models struggle to capture the intrinsic continuity and nonlinear geometric structure of disease progression. Method: Maps features extracted from sMRI into a Riemannian manifold space, models the continuous evolution of latent states with a time-aware Neural Ordinary Differential Equation (TNODE), and uses an Attention-based Riemannian Gated Recurrent Unit (ARGRU) to adaptively fuse historical and current information. Result: Outperforms state-of-the-art models on both disease status classification and cognitive score regression, and remains stable across sequence lengths, missing-data rates, and datasets. Conclusion: R-TNAG preserves the intrinsic geometry of disease progression and improves the robustness and temporal consistency of AD progression prediction under irregular sampling. Abstract: The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression. Most existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimer's disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular sampling. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. 
Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.

[143] Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

Bin Hu,Zijian Lu,Haicheng Liao,Chengran Yuan,Bin Rao,Yongkang Li,Guofa Li,Zhiyong Cui,Cheng-zhong Xu,Zhenning Li

Main category: cs.CV

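TL;DR: Proposes MAP-World, a prior-free multi-modal planning framework that uses masked action planning and a path-weighted world model to produce diverse, consistent trajectory predictions, and that trains on the full distribution of futures to improve planning performance.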

Details Motivation: Existing motion planners for autonomous driving handle multi-modal futures by relying on handcrafted anchors or on reinforcement learning to select a single mode, which discards information and complicates optimization. Method: Designs a Masked Action Planning (MAP) module that treats future ego motion as masked sequence completion, uses a driving-intent path as a scaffold, and injects noise to generate diverse trajectory queries; adds a lightweight world model that rolls out BEV semantics conditioned on each candidate trajectory and, during training, takes the expectation of the semantic loss weighted by trajectory probabilities. Result: On the NAVSIM dataset, achieves state-of-the-art performance among world-model-based methods and matches anchor-based approaches, without reinforcement learning and at real-time inference speed. Conclusion: By combining multi-modal prediction with expectation-based optimization, MAP-World learns from multiple plausible futures, avoids the information loss of mode selection, and delivers efficient end-to-end driving planning while preserving diversity. Abstract: Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
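
The training objective described above, a semantic loss taken as an expectation over modes with trajectory probabilities as weights, can be sketched in a few lines. This is an illustrative reading of the abstract rather than the authors' code: it assumes the mode probabilities come from a softmax over per-mode scores, and the function names and numbers are hypothetical. A single-mode selection baseline is included only for contrast.

```python
import math


def softmax(scores):
    """Convert raw per-mode scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_semantic_loss(mode_scores, per_mode_losses):
    """Probability-weighted sum of per-mode losses (expectation over modes)."""
    probs = softmax(mode_scores)
    return sum(p * loss for p, loss in zip(probs, per_mode_losses))


def best_mode_loss(mode_scores, per_mode_losses):
    """Contrast case: keep only the loss of the highest-scoring mode."""
    best = max(range(len(mode_scores)), key=mode_scores.__getitem__)
    return per_mode_losses[best]


# Hypothetical scores and losses for three candidate trajectories.
scores = [0.1, 2.0, 0.3]
losses = [1.0, 2.0, 3.0]
print(expected_semantic_loss(scores, losses), best_mode_loss(scores, losses))
```

Under the expectation, every candidate trajectory contributes to the loss in proportion to its probability, whereas the selection baseline ignores all but one mode.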

[144] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li,Ji-Ping Jin,Xuanlong Yu,Wei Liu,Xiaodong Cun,Kai Chen,Rui Fan,Jiangang Kong,Shen Xi

Main category: cs.CV

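TL;DR: Proposes SKEL-CF, a coarse-to-fine framework for estimating the parameters of the anatomically accurate SKEL human model from images. An improved Transformer architecture, the SKEL-aligned training dataset 4DHuman-SKEL, and explicit camera modeling markedly improve estimation accuracy, with large gains over prior methods on the MOYO dataset.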

Details Motivation: Existing parametric 3D human models such as SMPL lack biomechanical realism because of their simplified skeletal structure; the SKEL model improves anatomical accuracy, but estimating its parameters directly is hampered by limited training data, perspective ambiguities, and complex joint articulation. Method: Proposes SKEL-CF with a Transformer-based encoder-decoder: the encoder predicts initial camera and SKEL parameters, and the decoder refines them layer by layer; builds the 4DHuman-SKEL dataset to provide anatomically consistent supervision; explicitly models the camera within the framework to reduce depth and scale ambiguity. Result: Achieves 85.0 MPJPE / 51.4 PA-MPJPE on the MOYO dataset, well ahead of HSMR, the previous SKEL-based state of the art (104.5 / 79.6), and validates the design across diverse viewpoints. Conclusion: SKEL-CF is a scalable, anatomically faithful framework for human motion analysis that bridges the gap between computer vision and biomechanics. Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. 
Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
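
For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that definition and the relative reduction implied by the reported numbers. The joint coordinates are hypothetical, and PA-MPJPE, which first aligns the prediction to the ground truth with a Procrustes fit, is intentionally left out.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Hypothetical two-joint example: per-joint errors are 3 and 4.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, gt))

# Relative error reduction computed from the figures quoted in the abstract.
print(round(relative_reduction(104.5, 85.0), 1))  # MPJPE
print(round(relative_reduction(79.6, 51.4), 1))   # PA-MPJPE
```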

[145] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Ziqi Wang,Chang Che,Qi Wang,Hui Ma,Zenglin Shi,Cees G. M. Snoek,Meng Wang

Main category: cs.CV

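TL;DR: Studies the safety degradation and task forgetting that safety-aligned multimodal large language models (MLLMs) show when adapting to new tasks under continual visual instruction tuning (CVIT), and proposes Harmonious Parameter Adaptation (HPA), a framework that preserves safety while improving task performance.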

Details Motivation: Existing CVIT studies overlook the fact that real-world MLLMs need safety alignment, so during continual learning a model may lose safety as well as forget earlier tasks; a method that balances safety and task performance is needed. Method: Proposes the Harmonious Parameter Adaptation (HPA) framework, consisting of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment: parameters are classified by whether they focus on safety or on task performance, key parameters are selected for preservation, and orthogonality constraints reduce catastrophic forgetting. Result: Experiments on the CVIT benchmark and safety evaluation datasets show that HPA maintains safety better than existing baselines while mitigating task forgetting, giving a better overall balance. Conclusion: HPA offers an effective continual learning approach for safety-aligned MLLMs that improves task adaptation without sacrificing safety, supporting safe and reliable deployment in real-world settings. Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
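
One common way to impose an orthogonality constraint on a parameter update is to project out its component along a direction that should be preserved. The sketch below shows that projection on plain Python lists. It is a generic illustration and an assumption on our part: the abstract does not state how HPA formulates its constraint, and the names `project_out` and `protected` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(update, protected):
    """Remove from `update` its component along the `protected` direction.

    The returned vector is orthogonal to `protected`, so applying it leaves
    the model's position along that direction unchanged.
    """
    norm_sq = dot(protected, protected)
    if norm_sq == 0.0:
        return list(update)  # nothing to protect
    scale = dot(update, protected) / norm_sq
    return [u - scale * p for u, p in zip(update, protected)]


# Hypothetical 2D example: protect the first axis.
adjusted = project_out([1.0, 1.0], [1.0, 0.0])
print(adjusted, dot(adjusted, [1.0, 0.0]))
```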

[146] While recognizing actions, LMMs struggle to detect core interaction events

Daniel Harari,Michael Sidorov,Liel David,Chen Shterental,Abrham Kahsay Gebreselasie,Muhammad Haris Khan

Main category: cs.CV

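TL;DR: Examines the perceptual grounding of large multi-modal models (LMMs) in understanding dynamic visual interactions and finds that, although the models recognize objects and actions, they struggle to determine when and where an interaction begins or ends.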

Details Motivation: To test whether LMMs truly ground their semantic understanding in the visual input, particularly in localizing dynamic interaction events in time and space. Method: Builds a large-scale dataset of more than 20K annotated interactions on Something-Something-V2 videos, in which 250 AMTurk annotators labeled 'contact' and 'release' events, and asks Qwen-2.5VL and GPT-4o to locate these events. Result: The models reliably name objects, identify actions, and give coherent reasoning, but cannot pinpoint the frame where an interaction begins or ends and cannot localize the event within the scene. Conclusion: Current LMMs fall short in perceptually grounding key physical interactions and lack the perceptual basis needed for deeper understanding of dynamic scenes. Abstract: Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.

[147] ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories

Hai Ling,Jia Guo,Zhulin Tao,Yunkang Cao,Donglin Di,Hongyan Xu,Xiu Su,Yang Song,Lei Fan

Main category: cs.CV

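TL;DR: This paper introduces ADNet, a large-scale, multi-domain anomaly detection benchmark with 380 categories and 196,294 RGB images, built to evaluate cross-context generalization and scalability. Experiments show that existing methods degrade markedly in the multi-class setting; the proposed Dinomaly-m model uses a context-guided Mixture-of-Experts mechanism to improve both I-AUROC and P-AUROC.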

Details Motivation: Existing anomaly detection benchmarks such as MVTec-AD cover few categories, which makes it hard to evaluate generalization and scalability across domains and at large scale and holds back progress in anomaly detection. Method: Builds ADNet, a large-scale multi-domain anomaly detection benchmark with 380 categories drawn from 49 public datasets, with a unified annotation format and text descriptions that support multimodal tasks; proposes Dinomaly-m, a Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. Result: ADNet contains 196,294 images: 116,192 normal training images and 80,102 test images (60,311 of them anomalous), all with pixel-level annotations and text descriptions. State-of-the-art methods reach 90.6% I-AUROC in the one-for-one setting but drop to 78.5% in the multi-class setting; Dinomaly-m achieves 83.2% I-AUROC and 93.1% P-AUROC. Conclusion: As a standardized and extensible benchmark, ADNet advances multi-domain anomaly detection research and exposes the limits of current methods at scale; Dinomaly-m improves multi-class anomaly detection and offers a viable path toward anomaly detection foundation models. Abstract: Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. 
ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet
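
For readers unfamiliar with the I-AUROC figure quoted in this entry, the following is a minimal sketch of an image-level AUROC computed from per-image anomaly scores. It uses the standard pairwise-ranking formulation; the function name `auroc` and the toy scores are illustrative assumptions, not part of the ADNet evaluation code.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: anomaly scores, higher means more anomalous
    labels: 1 for anomalous, 0 for normal
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    # Count (anomalous, normal) pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): one anomalous image scores below a normal one.
image_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
image_labels = [1, 1, 1, 0, 0, 0]
print(auroc(image_scores, image_labels))  # 8 of 9 pairs ranked correctly
```

The pixel-level variant (P-AUROC) applies the same computation to per-pixel scores and mask labels; the quadratic loop here is for clarity only and would be replaced by a sort-based implementation on real data.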

[148] Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware

Federico Paredes-Valles,Yoshitaka Miyatani,Kirk Y. W. Scheper

Main category: cs.CV

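TL;DR: This paper presents the first battery-powered wearable pupil-tracking system with complete on-device integration. It combines an event camera with neuromorphic computing to achieve robust binocular tracking at 100 Hz with ultra-low power consumption.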

Details Motivation: Existing wearable eye-tracking systems struggle to combine high frequency, robustness, and ultra-low power consumption, and event-based vision sensors have lacked a complete low-power, real-time processing solution. Method: Pairs an event camera with the Speck2f neuromorphic chip, designs a spiking neural network with uncertainty estimation and gated temporal decoding, and performs lightweight coordinate decoding on a microcontroller. Result: Validated on a new multi-user dataset; achieves binocular pupil tracking at 100 Hz with average power consumption below 5 mW per eye. Conclusion: End-to-end neuromorphic computing offers a practical route to always-on eye tracking in next-generation energy-efficient wearables. Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.

[149] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Mohammad Mahdi,Yuqian Fu,Nedko Savov,Jiancheng Pan,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

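TL;DR: This paper proposes Exo2EgoSyn, which adapts WAN 2.2 to generate egocentric (first-person) video from exocentric (third-person) video, using three modules to achieve high-fidelity cross-view synthesis.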

Details Motivation: Existing foundation video generation models are limited to same-view generation and cannot effectively synthesize video across views (for example, third-person to first-person), which limits their practical use. Method: Proposes the Exo2EgoSyn framework with three modules: EgoExo-Align aligns the latent spaces, MultiExoCon fuses multi-view exocentric videos into a unified conditioning input, and PoseInj injects camera pose information to guide geometry-aware generation. Result: On ExoEgo4D, Exo2EgoSyn significantly improves exocentric-to-egocentric video generation quality and achieves efficient cross-view synthesis without training from scratch. Conclusion: Exo2EgoSyn extends foundation video models to cross-view generation and opens a new path toward scalable video generation with large models. Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.

[150] SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA

Haibin He,Qihuang Zhong,Juhua Liu,Bo Du,Peng Wang,Jing Zhang

Main category: cs.CV

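TL;DR: Proposes SFA, a training-free framework for video text-based visual question answering (Video TextVQA). Mimicking how humans answer questions, it adaptively scans frames, selectively focuses on key regions, and amplifies important text cues, substantially improving on existing methods.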

Details Motivation: Existing methods struggle to capture question-relevant text in videos whose scene text varies widely, and they do not integrate spatiotemporal context well, so a more efficient, training-free approach is needed to improve accuracy and generalization on Video TextVQA. Method: Proposes SFA, built on a Video-LLM, which adaptively scans video frames, selectively focuses on key regions, and directly amplifies them, steering the model's attention to the most relevant information and improving answer accuracy. Result: SFA achieves state-of-the-art performance on several public Video TextVQA datasets, clearly outperforming previous methods and showing strong effectiveness and generalization. Conclusion: As a training-free, Video-LLM-based method, SFA performs well on Video TextVQA, confirming that guiding the model toward key text cues is effective and pointing to new directions for future work. Abstract: The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

[151] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Dionysia Danai Brilli,Dimitrios Mallis,Vassilis Pitsikalis,Petros Maragos

Main category: cs.CV

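TL;DR: This paper proposes GHR-VQA, a graph-guided hierarchical relational reasoning framework for video question answering. It uses scene graphs to model human-object interactions explicitly, links the per-frame graphs through a global human root node, and combines graph neural networks with a hierarchical fusion network to better capture spatiotemporal dynamics, clearly outperforming existing methods on the AGQA dataset.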

Details Motivation: Traditional pixel-based video question answering methods struggle to capture complex spatiotemporal human-object interactions and lack interpretability and fine-grained reasoning, so a more structured, human-centric representation is needed to improve understanding of complex actions. Method: Converts each frame into a scene graph and links the human nodes of all frames to a global human root node to form a video-level graph; processes this graph with a graph neural network (GNN) to produce context-aware embeddings; then fuses these embeddings hierarchically with question features in a multi-level network for joint local-to-global reasoning. Result: On the AGQA dataset, the method improves object-relation reasoning by 7.3% over the previous best method and markedly strengthens understanding and reasoning over complex video content. Conclusion: By introducing a human-centric graph structure and hierarchical reasoning, GHR-VQA strengthens relational understanding and cross-frame reasoning in video question answering while improving interpretability, offering a new structured route to video understanding. Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
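
The graph construction described in this entry (per-frame scene graphs whose human nodes are tied to one global root) can be sketched roughly as follows. This is only an illustration of the data structure under stated assumptions: the triple format, the `person` node name, the `has_frame` edge label, and the function `build_video_graph` are all hypothetical and are not taken from the GHR-VQA implementation.

```python
def build_video_graph(frame_graphs, human="person", root="human_root"):
    """Merge per-frame scene graphs into one video-level edge list.

    frame_graphs: list of frames, each a list of (subject, relation, object) triples.
    Nodes are suffixed with the frame index so that frames stay distinct, and every
    frame containing the human node gets an edge from the global root (assumed design).
    """
    edges = []
    for t, triples in enumerate(frame_graphs):
        for subj, rel, obj in triples:
            edges.append((f"{subj}@{t}", rel, f"{obj}@{t}"))
        nodes = {n for subj, _, obj in triples for n in (subj, obj)}
        if human in nodes:
            # Cross-frame link: the root connects the human node of each frame.
            edges.append((root, "has_frame", f"{human}@{t}"))
    return edges


frames = [
    [("person", "holding", "cup")],
    [("person", "touching", "table"), ("cup", "on", "table")],
    [("cup", "on", "table")],  # no human visible, so no root edge for this frame
]
for edge in build_video_graph(frames):
    print(edge)
```

In the paper's pipeline a GNN would then consume such a graph; the edge list here stops short of that step.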

[152] Robust 3D Brain MRI Inpainting with Random Masking Augmentation

Juexin Zhang,Ying Weng,Ke Chen

Main category: cs.CV

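TL;DR: This paper proposes a U-Net-based deep learning framework for synthesizing healthy tissue in brain tumor MRI, using a random masking augmentation strategy to improve generalization; it took first place in the ASNR-MICCAI BraTS-Inpainting 2025 challenge.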

Details Motivation: To mitigate the dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. Method: Trains a U-Net architecture to inpaint synthetically corrupted regions and adds a random masking augmentation strategy to improve generalization. Result: On the validation set, SSIM 0.873±0.004, PSNR 24.996±4.694, and MSE 0.005±0.087; on the final test set, SSIM 0.919±0.088, PSNR 26.932±5.057, and RMSE 0.052±0.026. Conclusion: The method ranked first in the BraTS-Inpainting 2025 challenge, surpassing the winning solutions of 2023 and 2024, and improves the synthesis of healthy tissue in brain tumor MRI. Abstract: The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873$\pm$0.004, a PSNR of 24.996$\pm$4.694, and an MSE of 0.005$\pm$0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919$\pm$0.088, a PSNR of 26.932$\pm$5.057, and an RMSE of 0.052$\pm$0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.
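
Because this entry reports MSE on the validation set but RMSE on the test set, a short sketch of how MSE, RMSE, and PSNR relate may help when comparing the numbers. It assumes intensities normalized to a data range of 1.0 and operates on flat lists of voxel values; the challenge's actual normalization and masking are not stated here, so the helper names and toy values are illustrative only.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def rmse(pred, target):
    """Root mean squared error, i.e. the square root of the MSE."""
    return math.sqrt(mse(pred, target))


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / err)


# Toy voxel values (hypothetical): every prediction is off by 0.1.
pred = [0.5, 0.5, 0.5, 0.5]
target = [0.6, 0.4, 0.6, 0.4]
print(mse(pred, target), rmse(pred, target), psnr(pred, target))
```

With a data range of 1.0, an MSE of 0.01 corresponds to an RMSE of 0.1 and a PSNR of about 20 dB, which is the conversion needed to put MSE and RMSE figures on a common footing.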

[153] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Hao Yu,Jiabo Zhan,Zile Wang,Jinglin Wang,Huaisong Zhang,Hongyu Li,Xinrui Chen,Yongxian Wei,Chun Yuan

Main category: cs.CV

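TL;DR: This paper proposes OmniAlpha, the first unified multi-task generative framework for sequence-to-sequence RGBA image generation and editing. With the new MSRoPE-BiL architecture and the new AlphaLayers dataset, it clearly outperforms specialized baselines across a range of tasks.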

Details Motivation: Existing generative models handle RGBA in a fragmented way: specialized single-task models lack versatility, while unified multi-task frameworks are confined to RGB, so neither meets real-world needs for manipulating transparency (alpha). Method: Proposes OmniAlpha, built on a Diffusion Transformer (DiT) with MSRoPE-BiL, a RoPE variant whose bi-directionally extendable layer axis allows multiple RGBA layers to be processed concurrently; builds AlphaLayers, a new dataset of 1,000 high-quality multi-layer triplets produced by an automated synthesis and filtering pipeline; trains jointly on 21 diverse tasks. Result: Across comprehensive experiments, OmniAlpha consistently outperforms strong specialized baselines, achieving an 84.8% relative reduction in SAD for mask-free matting on AIM-500 and winning over 90% of human preferences in layer-conditioned completion. Conclusion: A unified multi-task model can learn a better shared representation for RGBA images, paving the way for more powerful, layer-aware generative systems. Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. We jointly train OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, and extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.

[154] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Yuhang Qian,Haiyan Chen,Wentong Li,Ningzhong Liu,Jie Qin

Main category: cs.CV

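TL;DR: Proposes CT-CIG, a controllable text-guided camouflage image generation method that uses a vision-language model and a frequency interaction refinement module to generate more natural and logically plausible camouflage images.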

Details Motivation: Existing camouflage image generation methods overlook the logical relationship between camouflaged objects and their background environments, which leads to unnatural results. Method: Proposes CT-CIG, which uses a large vision-language model to design a Camouflage-Revealing Dialogue Mechanism (CRDM) that produces high-quality text prompts, finetunes Stable Diffusion, and adds a lightweight controller and a Frequency Interaction Refinement Module (FIRM) to refine object location, shape, and texture detail. Result: Experiments show that the method outperforms existing methods on CLIPScore and camouflage effectiveness assessment and generates semantically aligned, realistic camouflage images. Conclusion: Through text guidance and frequency-feature refinement, CT-CIG markedly improves the logical plausibility and visual realism of generated camouflage images. Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.

[155] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder

Juexin Zhang,Qifeng Zhong,Ying Weng,Ke Chen

Main category: cs.CV

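TL;DR: This study proposes a classification method based on a pre-trained Vision Transformer (ViT) for whole slide image analysis of glioblastoma. It placed second in the BraTS-Path 2025 Challenge and establishes a solid ViT baseline for pathology image analysis.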

Details Motivation: Glioblastoma is highly heterogeneous at the molecular and pathological levels, and traditional histopathological assessment is subjective, so an objective, automated method for analyzing whole slide images is needed to improve diagnosis and patient stratification. Method: Fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training set for the BraTS-Path 2025 Challenge, with validation through the Synapse platform. Result: On the online validation set, a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676; on the final test set, an MCC of 0.6509 and an F1-score of 0.5330, ranking second in the challenge. Conclusion: The ViT-based method provides an effective baseline for brain tumor pathology image analysis, but performance drops on unseen data, and future work will focus on closing this gap. Abstract: The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model's performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
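
The two metrics reported in this entry can diverge noticeably, as the test-set figures show. The sketch below computes both for the simplest binary case from confusion-matrix counts. The challenge itself is a multi-class subregion task and its averaging scheme is not given here, so this is only an assumption-laden illustration; the function name `binary_mcc_f1` and the counts are hypothetical.

```python
import math


def binary_mcc_f1(tp, fp, fn, tn):
    """Return (MCC, F1) for a binary confusion matrix.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    F1  = 2*tp / (2*tp + fp + fn)
    Both fall back to 0.0 when their denominator is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return mcc, f1


# Hypothetical counts for 100 patches, balanced between the two classes.
mcc, f1 = binary_mcc_f1(tp=40, fp=10, fn=10, tn=40)
print(mcc, f1)
```

MCC uses all four cells of the confusion matrix, whereas F1 ignores true negatives, which is one reason the two scores need not move together across datasets.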

[156] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

Sen Nie,Jie Zhang,Jianxin Yan,Shiguang Shan,Xilin Chen

Main category: cs.CV

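TL;DR: This paper proposes V-Attack, a precise local semantic attack on large vision-language models (LVLMs) that manipulates the value features (V) inside transformer attention blocks to make adversarial attacks more controllable.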

Details Motivation: Existing adversarial attacks offer poor control when manipulating the semantics of specific concepts in an image, mainly because self-attention in the vision encoder entangles semantics. Method: Finds that value features (V) suppress global context and retain high-entropy, disentangled local semantic information; building on this, proposes V-Attack, which comprises a Self-Value Enhancement module and a Text-Guided Value Manipulation module that locates the source concept and optimizes it toward a target concept. Result: Experiments on LVLMs including LLaVA, InternVL, DeepseekVL, and GPT-4o show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods. Conclusion: By exploiting value features, V-Attack manipulates local semantics in LVLMs precisely and efficiently, exposing critical vulnerabilities in current vision-language understanding models. Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate the source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. 
Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available at https://github.com/Summu77/V-Attack.

[157] HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers

Jawaria Maqbool,M. Imran Cheema

Main category: cs.CV

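TL;DR: Proposes HistoSpeckle-Net, a deep learning model for multimode fiber imaging that uses a distribution-aware learning strategy to reconstruct complex medical images with high fidelity and good data efficiency.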

Details Motivation: Existing methods are limited to simple datasets and need large amounts of data when applied to complex real-world scenes, and they ignore the statistics of speckles and images, which restricts clinical use. Method: Designs HistoSpeckle-Net with a histogram-based mutual information loss and a Three-Scale Feature Refinement Module, and uses an optical setup to capture the speckle patterns corresponding to OrganAMNIST images, enabling distribution-aware image reconstruction. Result: On the OrganAMNIST dataset the model outperforms U-Net and Pix2Pix, remains superior with few training samples and under varying fiber bending, and accurately reconstructs complex anatomical structures. Conclusion: By reducing data dependence and improving robustness to fiber perturbations, HistoSpeckle-Net moves multimode fiber imaging closer to practical clinical use. Abstract: Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. 
By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.

[158] Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

Daniel Kienzle,Katja Ludwig,Julian Lorenz,Shin'ichi Satoh,Rainer Lienhart

Main category: cs.CV

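TL;DR: Proposes a two-stage pipeline for precisely recovering the 3D trajectory and spin of a table tennis ball from monocular video. The front end is trained with 2D supervision from the newly created TTHQ dataset, and the back end is trained on synthetic data and hardened against real-world noise.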

Details Motivation: Existing methods trained on synthetic data generalize poorly to the noise and imperfect detections of the real world, and real-world video lacks ground-truth 3D trajectory and spin annotations. Method: Uses a two-stage pipeline: a front-end perception stage (ball and table keypoint detectors trained on the TTHQ dataset) provides 2D detections, and a back-end 2D-to-3D uplifting network trained on physically correct synthetic data estimates the 3D trajectory and spin, with the uplifting model re-engineered for real-world issues such as missing detections and varying frame rates. Result: The method reconstructs the 3D trajectory and analyzes the spin of a table tennis ball in real monocular video with high accuracy, robustness, and practicality. Conclusion: Separating the front end from the back end and combining real 2D annotations with synthetic 3D data addresses the challenges of 3D table tennis ball tracking in real scenes and advances practical sports motion analysis. Abstract: Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.

[159] PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

Bo-Kai Ruan,Teng-Fang Hsiao,Ling Lo,Yi-Lun Wu,Hong-Han Shuai

Main category: cs.CV

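TL;DR: This paper studies the trade-off between fidelity and diversity in text-to-image models under long prompts. It introduces the LPD-Bench benchmark and the training-free PromptMoG method, which samples from a Mixture-of-Gaussians in the embedding space to increase diversity while preserving semantic consistency.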

Details Motivation: Long prompts improve the fidelity of generated images but tend to reduce diversity, which limits creative expression, and how current models behave in this regime has not been studied systematically. Method: Builds LPD-Bench, a benchmark for evaluating long-prompt generation, and proposes PromptMoG, which models the prompt embedding space as a Mixture-of-Gaussians and resamples from it to raise generation entropy and thereby diversity. Result: On four state-of-the-art models (SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image), PromptMoG clearly improves generation diversity under long prompts without causing semantic drift. Conclusion: PromptMoG offers an effective, training-free answer to the fidelity-diversity dilemma of text-to-image models under long prompts and supports more creative content generation. Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
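
To make the idea of sampling a prompt embedding from a Mixture-of-Gaussians concrete, here is a generic sketch of drawing one sample from an isotropic mixture. It is not the PromptMoG algorithm: how the paper chooses the component centers, weights, and variance is not described in this entry, so the centers below (imagined as embeddings of reformulated prompts), the shared `sigma`, and the function name are all assumptions.

```python
import random


def sample_mixture_of_gaussians(centers, sigma, weights=None, rng=None):
    """Draw one vector from an isotropic Mixture-of-Gaussians.

    centers: list of equal-length vectors (the component means)
    sigma:   shared standard deviation of every component (assumed isotropic)
    weights: optional mixing weights; uniform if omitted
    """
    rng = rng or random.Random()
    if weights is None:
        weights = [1.0] * len(centers)
    # Step 1: pick a component in proportion to its mixing weight.
    center = rng.choices(centers, weights=weights, k=1)[0]
    # Step 2: perturb the chosen mean with independent Gaussian noise per dimension.
    return [c + rng.gauss(0.0, sigma) for c in center]


# Hypothetical 4-dimensional "prompt embeddings" standing in for reformulated prompts.
centers = [
    [0.10, 0.20, 0.30, 0.40],
    [0.12, 0.18, 0.33, 0.38],
    [0.09, 0.22, 0.28, 0.41],
]
rng = random.Random(0)  # seeded for repeatability
samples = [sample_mixture_of_gaussians(centers, sigma=0.01, rng=rng) for _ in range(3)]
for s in samples:
    print(s)
```

A small `sigma` keeps samples close to their centers, which is the intuition behind gaining diversity without drifting semantically; each sample would then condition a separate generation.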

[160] Zoo3D: Zero-Shot 3D Object Detection at Scene Level

Andrey Lemeshko,Bulat Gabdullin,Nikita Drozdov,Anton Konushin,Danila Rukhovich,Maksim Kolodiazhnyi

Main category: cs.CV

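TL;DR: Proposes Zoo3D, the first training-free open-vocabulary 3D object detection framework. It builds 3D bounding boxes by graph clustering of 2D instance masks and assigns semantic labels with a novel open-vocabulary module, reaching state-of-the-art results on ScanNet200 and ARKitScenes.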

Details Motivation: Existing 3D detection methods depend on training data and generalize poorly to new objects and scenes, so a general detection framework that needs no training and recognizes open-vocabulary objects is required. Method: Builds 3D bounding boxes by graph clustering of 2D instance masks and assigns open-vocabulary semantic labels using best-view selection and view-consensus mask generation; offers a zero-shot mode, Zoo3D_0, and a self-supervised mode, Zoo3D_1, and extends to posed and unposed images. Result: On ScanNet200 and ARKitScenes, both Zoo3D_0 and Zoo3D_1 achieve state-of-the-art performance, and the zero-shot Zoo3D_0 outperforms all existing self-supervised methods. Conclusion: Training-free methods have strong potential for open-vocabulary 3D detection, and Zoo3D demonstrates the effectiveness and adaptability of off-the-shelf strategies for real-world 3D understanding. Abstract: 3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .

[161] XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface

Alexander C. Jenke,Gregor Just,Claas de Boer,Martin Wagner,Sebastian Bodenstedt,Stefanie Speidel

Main category: cs.CV

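TL;DR: Proposes a lightweight ResNet18-based pipeline that automatically detects the camera activation state and the camera tile position in endoscopic video from the Da Vinci Xi system, enabling accurate, real-time extraction of surgical video metadata.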

Details Motivation: Robot-assisted minimally invasive surgery relies on endoscopic video as its only visual feedback, and metadata such as the camera activation state is valuable for downstream tasks including tool tracking, skill assessment, and automated control. Method: Builds a lightweight pipeline on a ResNet18 convolutional neural network, fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated on three public datasets. Result: For binary detection of the camera activation state, F1-scores range from 0.993 to 1.000, and the camera tile is localized correctly in every sample with no false multiple-camera detections. Conclusion: The method extracts camera activation metadata from surgical videos efficiently and reliably, supports automated preprocessing and analysis for a range of downstream applications, and its code, models, and annotations are all publicly available. Abstract: Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.

[162] The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Weijia Mao,Hao Chen,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

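TL;DR: This paper proposes Adv-GRPO, a reinforcement learning framework that uses an adversarial reward to address two weaknesses of conventional scalar rewards in image generation: vulnerability to reward hacking and poor agreement with human perception. The method uses reference images and vision foundation models such as DINO to supply dense visual reward signals that guide the generator directly, improving image quality, aesthetics, and task-specific metrics while supporting style customization and distribution transfer.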

Details Motivation: Scalar rewards from pre-trained preference models fail to reflect human perception accurately and are prone to reward hacking, which degrades generated images, and constraints such as KL regularization do not fix the underlying reward bias. Method: Proposes the Adv-GRPO framework, which adversarially trains the reward model and the generator together; the reward model is supervised with reference images as positive samples, and a vision foundation model extracts dense visual features of the image itself as the reward signal instead of a single scalar; image-level comparison (for example, DINO feature matching) provides richer feedback to guide generation. Result: The method outperforms Flow-GRPO and SD3, with human-evaluation win rates of 70.0% in image quality and 72.4% in aesthetics; it mitigates reward hacking, improves the fidelity and aesthetic consistency of generated images, and supports style customization and distribution transfer based on reference samples. Conclusion: Treating the image itself as the reward signal, with dense feedback from reference images and vision foundation models, aligns generation with human preferences more effectively, avoids the limits of conventional scalar rewards, and provides a more reliable and flexible reward mechanism for reinforcement learning in image generation. Abstract: A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. 
In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.

[163] Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

Xiaohan Wang,Zhangtao Cheng,Ting Zhong,Leiting Chen,Fan Zhou

Main category: cs.CV

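TL;DR: This paper proposes MBCD, a unified collaborative distillation framework for multi-modal domain generalization. Through adaptive modality dropout, a gradient consistency constraint, and weight-averaging-based cross-modal distillation, it mitigates the modality bias that conventional weight averaging incurs when modalities optimize at different speeds, improving generalization and robustness.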

Details Motivation: Weight Averaging (WA) promotes convergence to flat loss landscapes and improves out-of-distribution performance, but in multi-modal domain generalization the modalities optimize at different speeds, so WA tends to overfit to fast-converging modalities and suppress slower yet complementary ones, hurting modality fusion. Method: MBCD has three parts: 1) adaptive modality dropout in the student model to prevent early bias toward dominant modalities; 2) a gradient consistency constraint that aligns the learning signals of the uni-modal branches with the fused representation; 3) cross-modal distillation by a WA-based teacher, which transfers fused knowledge back to each uni-modal branch, strengthening cross-modal interaction and steering convergence toward flatter solutions. Result: Experiments on multiple multi-modal domain generalization benchmarks show that MBCD consistently outperforms existing methods, with higher accuracy and stronger robustness across diverse unseen domains. Conclusion: MBCD overcomes the limitations of conventional weight averaging in multi-modal settings, combining flat-minimum optimization with balanced modality fusion and offering a better convergence path for multi-modal domain generalization. Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
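The WA-based teacher mentioned above can be pictured with a minimal sketch of uniform weight averaging over student checkpoints. This is an illustration only: the abstract does not say whether MBCD uses a uniform or exponential average, and the function name `update_running_average` and the toy scalar "weights" are assumptions made for the example.

```python
def update_running_average(avg_weights, new_weights, num_averaged):
    """Fold one more checkpoint into a uniform running average of weights.

    avg_weights  -- current average, a dict of parameter name -> value
    new_weights  -- the newest student checkpoint, same keys
    num_averaged -- how many checkpoints avg_weights already summarizes
    """
    return {
        name: (avg_weights[name] * num_averaged + new_weights[name]) / (num_averaged + 1)
        for name in avg_weights
    }


# Toy student checkpoints (hypothetical scalar parameters).
student_checkpoints = [
    {"w": 1.0, "b": 0.0},
    {"w": 2.0, "b": 3.0},
    {"w": 3.0, "b": 6.0},
]

# The teacher starts as a copy of the first checkpoint and absorbs the rest.
teacher = dict(student_checkpoints[0])
for count, checkpoint in enumerate(student_checkpoints[1:], start=1):
    teacher = update_running_average(teacher, checkpoint, count)

print(teacher)
```

In a real model each value would be a tensor rather than a float, but the update rule is the same element-wise.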

[164] Advancing Image Classification with Discrete Diffusion Classification Modeling

Omer Belhasin,Shelly Golan,Ran El-Yaniv,Michael Elad

Main category: cs.CV

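TL;DR: Proposes DiDiCM, a new diffusion-based framework for image classification that models the posterior distribution of class labels given the input image and clearly outperforms conventional classifiers under high-uncertainty conditions.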

Details Motivation: Under high-uncertainty conditions, such as corrupted input images or limited training data, conventional methods that directly predict class labels perform poorly, so a more robust classification approach is needed. Method: Discrete Diffusion Classification Modeling (DiDiCM) uses a diffusion process to model the posterior distribution of class labels, supporting diffusion-based prediction on either class probabilities or discrete labels to trade off computation against memory. Result: On ImageNet, DiDiCM surpasses standard classifiers with only a few diffusion iterations, and the gain grows as the task becomes harder. Conclusion: DiDiCM offers a more robust diffusion-based modeling paradigm for image classification and performs especially well in high-uncertainty scenarios. Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .

[165] DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection

Amirhossein Khadivi Noghredeh,Abdollah Safari,Fatemeh Ziaeetabar,Firoozeh Haghighi

Main category: cs.CV

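TL;DR: Proposes a semi-supervised deep reinforcement learning framework for anomaly detection in industrial visual inspection. By combining a neural batch sampler, an autoencoder, and a predictor, it makes effective use of a small amount of labeled data to improve detection and localization of subtle defects.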

Details Motivation: Defective samples are scarce in industrial visual inspection, and existing unsupervised reconstruction methods tend to overfit and struggle to detect subtle defects. Method: A semi-supervised RL-based framework with three jointly optimized components: an RL-driven neural batch sampler (balancing exploration and exploitation), an autoencoder (producing loss profiles that highlight abnormal regions), and a predictor (performing segmentation in the loss-profile space). Result: On the MVTec AD dataset, the method improves on recent approaches by an average of 0.15 in F1_max and 0.06 in AUC, with a best-case F1_max gain of 0.37, while keeping complexity low. Conclusion: The framework makes effective use of limited labeled data, balances learning of normal and defective patterns, and markedly improves anomaly detection accuracy and localization. Abstract: Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
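For readers unfamiliar with the F1_max metric quoted above, the following sketch computes it under the common assumption that F1_max is the best F1 score obtained over all decision thresholds on the anomaly scores. The abstract does not define the metric, so this reading, the function name `f1_max`, and the toy scores are assumptions.

```python
def f1_max(scores, labels):
    """Return the highest F1 over all thresholds taken from the scores.

    scores -- anomaly scores, higher means more anomalous
    labels -- ground truth, 1 for anomalous and 0 for normal
    A sample is predicted anomalous when its score >= threshold.
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy pixel scores: one normal pixel (0.4) outranks one anomalous pixel (0.35).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(f1_max(toy_scores, toy_labels))
```

The quadratic loop is fine for a handful of pixels; a full-resolution evaluation would normally sort the scores once and sweep the threshold.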

[166] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Tianxiang Jiang,Sheng Xia,Yicheng Xu,Linquan Wu,Xiangyu Zeng,Limin Wang,Yu Qiao,Yi Wang

Main category: cs.CV

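TL;DR: This paper introduces the concept of visual knowledge and builds the VKnowU benchmark to evaluate how well multimodal LLMs understand physical and social common sense, finding that current models still fall well short of human performance. To close the gap, the authors build the VKnowQA dataset and propose the VideoKnow+ model, which adds a visual-knowledge-based reward and improves performance on several video understanding tasks.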

Details Motivation: Existing multimodal LLMs can recognize objects but lack a deeper understanding of the world's physical laws and human social behavior, i.e., "visual knowledge". This high-level semantics is key to genuine scene understanding yet remains underexplored in current models, so it needs systematic evaluation and enhancement. Method: The VKnowU benchmark contains 1,680 questions and 1,249 videos covering 8 core types of visual knowledge (e.g., intuitive physics, subjective intentions); the authors also build the VKnowQA dataset and design VideoKnow+, which follows a See-Think-Answer paradigm and is trained with reinforcement learning using a visual knowledge reward. Result: Evaluating 23 SOTA MLLMs on VKnowU shows performance well below human level, with especially large gaps on world-centric knowledge; VideoKnow+ gains +3.7% on VKnowU and shows consistent gains on MVBench, Video-MME, and MMVU. Conclusion: Visual knowledge is a key missing piece for general-purpose multimodal LLMs; modeling it explicitly improves understanding of and reasoning about complex scenes, moving models from "seeing" toward "understanding". Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

[167] ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Advik Sinha,Saurabh Atreya,Aashutosh A,Sk Aziz Ali,Abhijit Das

Main category: cs.CV

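TL;DR: This paper proposes ScenarioCLIP, which improves CLIP's ability to model relations among multiple objects and actions in complex scenes by introducing grounded relations and focused regions, and performs well in both zero-shot and fine-tuned settings.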

Details Motivation: Existing CLIP-style models focus on single-object classification or short-description retrieval and do not explicitly model the complex relational structure among multiple objects and actions in real scenes. Method: ScenarioCLIP takes input text, grounded relations, images, and relation-focused image regions; it is pretrained on a curated scenario dataset and applied to downstream tasks such as cross-modal retrieval and fine-grained visual understanding. Result: It achieves strong zero-shot and fine-tuned performance on several domain-specific tasks, establishes a comprehensive benchmark for scenario-based tasks, and outperforms a range of baselines. Conclusion: ScenarioCLIP improves CLIP's relational modeling for complex scene understanding and points to a new direction for open-world scene analysis. Abstract: Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as a single-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation.
We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and fine-tuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP

[168] DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion

Yinghui Li,Qianyu Zhou,Di Shao,Hao Yang,Ye Zhu,Richard Dazeley,Xuequan Lu

Main category: cs.CV

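TL;DR: This paper proposes DAPointMamba, the first framework to apply State Space Models (SSMs) to domain adaptive point cloud completion (DA PCC). Three novel modules address the disrupted spatial topology and cross-domain semantic gaps that arise when SSMs are applied directly; the method outperforms existing approaches on several benchmarks while offering linear complexity and low inference latency.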

Details Motivation: Existing DA PCC methods are limited by the local receptive fields of CNNs or the quadratic complexity of vision Transformers; directly serializing point clouds into 1D sequences disrupts spatial structure, and the lack of designs for learning domain-invariant features limits cross-domain adaptation. Method: The DAPointMamba framework has three modules: Cross-Domain Patch-Level Scanning establishes local geometric correspondences for effective local alignment; Cross-Domain Spatial SSM Alignment modulates patch features based on cross-domain similarity to mitigate fine-grained structural discrepancies; Cross-Domain Channel SSM Alignment narrows the global semantic gap by interleaving and aligning feature channels. The overall architecture has a global receptive field and linear complexity. Result: Extensive experiments on synthetic and real-world benchmarks show that DAPointMamba outperforms state-of-the-art methods with lower computational complexity and inference latency. Conclusion: DAPointMamba demonstrates the adaptability and efficiency of SSMs for DA PCC and offers a new, effective solution for cross-domain point cloud completion. Abstract: Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC will encounter several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Besides, overlooking designs for learning domain-agnostic representations hinders adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels.
Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.

[169] SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors

Fabian Gülhan,Emil Mededovic,Yuli Wu,Johannes Stegmaier

Main category: cs.CV

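TL;DR: This paper proposes SelfMOTR, a new tracking transformer that uses self-generated detection priors to improve multi-object tracking and is competitive with recent end-to-end tracking methods on the DanceTrack dataset.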

Details Motivation: Despite progress toward end-to-end tracking with Transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain key problems. Method: Motivated by the success of integrating detection priors and by the insight that MOTR-like models are in fact strong detectors, the authors propose SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Result: Extensive analysis and ablation studies reveal the hidden detection capabilities of MOTR-like models and yield a practical set of tools for exploiting them. Conclusion: On DanceTrack, SelfMOTR achieves strong performance, competitive with recent state-of-the-art end-to-end tracking methods. Abstract: Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.

[170] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Yang Liu,Xilin Zhao,Peisong Wen,Siran Dai,Qingming Huang

Main category: cs.CV

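TL;DR: Proposes a training-free iterative self-refinement framework in which large language models and vision-language models provide physics-aware guidance, using a multimodal chain-of-thought (MM-CoT) process to improve physical consistency in video generation.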

Details Motivation: Current video generation models achieve good visual quality, but their outputs often violate real-world physical laws and lack physical consistency. Method: A multimodal chain-of-thought (MM-CoT) process uses large language models and vision-language models to detect physical inconsistencies in generated videos and iteratively refines the text prompt accordingly, improving generation quality. The method is training-free and plug-and-play. Result: On the PhyIQ benchmark, the method raises the Physics-IQ score from 56.31 to 62.38. Conclusion: The method offers an effective, general approach to physics-consistent video generation that applies to a wide range of models and may inform future research. Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
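The control flow of such a training-free refinement loop can be sketched as follows. This is a schematic under stated assumptions, not the authors' pipeline: `generate_video`, `find_violations`, and `rewrite_prompt` are hypothetical callables standing in for the video generator, the VLM-based inconsistency check, and the LLM-based prompt rewriter, and the stubs below only mimic their behavior with strings.

```python
def refine_prompt(prompt, generate_video, find_violations, rewrite_prompt, max_rounds=3):
    """Iteratively rewrite a prompt until no physical violations are reported.

    All three callables are placeholders supplied by the caller; no model is
    trained or updated anywhere in this loop.
    """
    history = []
    video = generate_video(prompt)
    for _ in range(max_rounds):
        violations = find_violations(prompt, video)
        history.append((prompt, violations))
        if not violations:
            break  # the critic found nothing to fix
        prompt = rewrite_prompt(prompt, violations)
        video = generate_video(prompt)
    return prompt, video, history


# String-based stubs so the sketch runs without any model.
def stub_generate(prompt):
    return f"video<{prompt}>"


def stub_find_violations(prompt, video):
    return [] if "gravity" in prompt else ["ball floats upward"]


def stub_rewrite(prompt, violations):
    return prompt + ", obeying gravity"


final_prompt, final_video, history = refine_prompt(
    "a ball is dropped", stub_generate, stub_find_violations, stub_rewrite
)
print(final_prompt)
```

Because the loop only touches the text prompt, the same wrapper could sit in front of any generator that accepts a prompt, which is what plug-and-play suggests here.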

[171] Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

Chao Wang,Chengan Che,Xinyue Chen,Sophia Tsoka,Luis C. Garcia-Peraza-Herrera

Main category: cs.CV

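TL;DR: This paper proposes Back To The Feature (BTTF), an optimization framework for generating counterfactual explanations (CFEs) for video classifiers; its new optimization strategies yield counterfactual videos that are physically plausible, temporally coherent, and smooth in motion.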

Details Motivation: Existing image-based counterfactual explanation methods cannot generate video counterfactuals that are temporally coherent with smooth motion, and the interpretability of video classifiers remains underexplored. Method: The BTTF framework comprises: 1) an optimization scheme that retrieves the initial latent noise conditioned on the first frame of the input video; 2) a two-stage optimization strategy that searches for counterfactual videos in the vicinity of the input video; 3) progressive optimization to accelerate convergence. The whole process is guided solely by the target classifier, which keeps the explanation faithful. Result: Experiments on several video datasets, including Shape-Moving, MEAD, and NTU RGB+D, show that BTTF generates valid, visually similar, and realistic counterfactual videos. Conclusion: BTTF produces high-quality counterfactual explanations for video classifiers, revealing the model's decision mechanism and improving interpretability. Abstract: Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing image-based CFE methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps.
Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.

[172] Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction

Baoshun Shi,Ke Jiang,Qiusheng Lian,Xinran Yu,Huazhu Fu

Main category: cs.CV

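TL;DR: This paper proposes PromptCT, a storage-efficient deep unfolding framework for multiple-in-one sparse-view CT reconstruction. It uses LipNet, a network with an explicit Lipschitz constraint, as the prior network and adds a prompt module to adapt to different sampling settings, achieving high-quality reconstruction at low storage cost while guaranteeing convergence of the algorithm.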

Details Motivation: Existing deep learning methods for sparse-view CT reconstruction have two main problems: the prior network is hard to prove Lipschitz-constrained, and the multi-view setting requires training several models, whose storage cost limits clinical use. Method: The authors design LipNet, a network that provably satisfies a Lipschitz constraint, and build PromptCT, a deep unfolding framework with an explicit prompt module, so that a single model handles multiple sparse sampling configurations while the iterative algorithm is guaranteed to converge. Result: In simulated and real data experiments, PromptCT outperforms benchmark methods on multiple-in-one sparse-view CT reconstruction, with higher-quality images and lower storage cost. Theoretical analysis also establishes LipNet's boundary property and Lipschitz continuity. Conclusion: By combining the provable LipNet with a prompt mechanism, PromptCT addresses the theoretical-guarantee and storage-efficiency limitations of existing methods and offers a more practical route to multi-view sparse CT reconstruction in clinical settings. Abstract: Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we design an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies the boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.

[173] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao,Ziyang Gong,Hehai Lin,Yang Liu,Jiashun Cheng,Xiaoxing Hu,Haoyuan Liang,Guowen Li,Chengwei Qin,Hong Cheng,Xue Yang,Juepeng Zheng,Haohuan Fu

Main category: cs.CV

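TL;DR: This paper proposes CrossEarth-Gate, a parameter-efficient fine-tuning method for remote sensing semantic segmentation. It builds a toolbox of spatial, semantic, and frequency modules and uses a Fisher-information-guided adaptive selection mechanism to dynamically activate the most critical modules against multifaceted domain shift, achieving state-of-the-art performance on 16 cross-domain benchmarks.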

Details Motivation: Existing parameter-efficient fine-tuning methods struggle with the complex, varied domain gaps (e.g., spatial, semantic, and frequency shifts) in large-scale remote sensing data, limiting how well foundation models generalize to downstream tasks. Method: CrossEarth-Gate has two core components: a remote-sensing module toolbox (spatial, semantic, and frequency modules), and a Fisher-information-based adaptive selection mechanism that dynamically evaluates and activates the modules contributing most to the task-specific gradient flow. Result: Extensive experiments on 16 cross-domain remote sensing semantic segmentation benchmarks show that the method outperforms existing PEFT methods in both performance and efficiency, reaching the state of the art. Conclusion: Through modular design and gradient-aware selection, CrossEarth-Gate handles complex domain shift in remote sensing and improves the adaptability and generalization of foundation models on downstream tasks. Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
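One plausible reading of Fisher-guided selection is sketched below: score each candidate module by its empirical diagonal Fisher information (the mean squared per-sample gradient of the loss with respect to the module's parameters) and activate the top-k. The abstract does not give the exact estimator or the layer-wise selection rule, so the functions `fisher_scores` and `select_modules`, the value of k, and the toy gradients are all assumptions.

```python
def fisher_scores(per_sample_grads):
    """Empirical diagonal Fisher score per module.

    per_sample_grads -- dict: module name -> list of per-sample gradient
                        vectors (each a list of floats)
    Returns dict: module name -> mean squared gradient entry.
    """
    scores = {}
    for name, grads in per_sample_grads.items():
        total = sum(g * g for sample in grads for g in sample)
        count = sum(len(sample) for sample in grads)
        scores[name] = total / count
    return scores


def select_modules(scores, k):
    """Return the names of the k highest-scoring modules, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy gradients for three hypothetical adapter modules, two samples each.
toy_grads = {
    "spatial": [[0.1, -0.2], [0.3, 0.0]],
    "semantic": [[1.0, -1.0], [0.5, 0.5]],
    "frequency": [[0.0, 0.1], [0.0, -0.1]],
}

scores = fisher_scores(toy_grads)
active = select_modules(scores, k=2)
print(active)
```

Modules left out of `active` would stay frozen or bypassed for that layer, which is where the efficiency gain would come from under this reading.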

[174] TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection

Han Guo,Chenyang Liu,Haotian Zhang,Bowen Chen,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

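TL;DR: This paper proposes TaCo, a spatio-temporal semantic consistency network for remote sensing change detection. It introduces a text-guided transition generator and a spatio-temporal semantic joint constraint, improving performance without adding inference-time computation.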

Details Motivation: Existing remote sensing change detection methods rely mainly on mask supervision, which localizes spatial changes effectively but under-constrains temporal semantic transitions, leading to semantic inconsistencies. Method: TaCo models change as a semantic transition between bi-temporal states. A Text-guided Transition Generator fuses textual semantics with bi-temporal visual features, and a spatio-temporal semantic joint constraint combines bi-temporal reconstruction constraints with a transition constraint. Result: Extensive experiments on six public datasets, covering binary and semantic change detection, show that TaCo consistently achieves SOTA performance. Conclusion: By strengthening spatio-temporal semantic consistency, TaCo improves the accuracy and robustness of remote sensing change detection without adding inference cost. Abstract: Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistency network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruction constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.

[175] TReFT: Taming Rectified Flow Models For One-Step Image Translation

Shengqian Li,Ming Gao,Yi Liu,Zuzeng Lin,Feng Wang,Feng Dai

Main category: cs.CV

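TL;DR: This paper proposes TReFT, a new method for taming Rectified Flow models for one-step image translation. It resolves the convergence problems of one-step inference under adversarial training and achieves near state-of-the-art performance with real-time inference.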

Details Motivation: Existing Rectified Flow models depend on multi-step denoising for image-to-image translation, which hurts real-time use, and directly applying adversarial training methods such as CycleGAN-Turbo to RF models causes severe convergence problems. Method: TReFT uses the velocity predicted by a pretrained DiT or UNet directly as the output, motivated by the observation that the velocity near the end of denoising converges to the vector pointing to the clean image; training adds latent cycle-consistency and identity losses together with lightweight architectural simplifications. Result: Across multiple image translation datasets, pretrained RF models (such as SD3.5 and FLUX) finetuned with TReFT perform comparably to state-of-the-art methods while supporting real-time inference. Conclusion: TReFT resolves the convergence difficulty of Rectified Flow models in one-step image translation, combining strong performance with efficient inference and easing practical deployment. Abstract: Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output, a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to state-of-the-art methods across multiple image translation datasets while enabling real-time inference.

[176] IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection

Xuelin Qian,Jiaming Lu,Zixuan Wang,Wenxuan Wang,Zhongling Huang,Dingwen Zhang,Junwei Han

Main category: cs.CV

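TL;DR: Proposes IrisNet, a meta-learning-based framework for infrared small target detection. An image-to-decoder transformer adapts the detection strategy dynamically, and high-frequency information strengthens perception, giving state-of-the-art performance on several datasets.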

Details Motivation: To address the limited robustness of existing deep learning methods, whose learned patterns drift across scenarios, and to improve infrared small target detection against complex backgrounds. Method: IrisNet uses an image-to-decoder transformer to establish a dynamic mapping between infrared image features and the entire set of decoder parameters. The decoder is represented as a 2D tensor that preserves hierarchical layer correlations; self-attention models inter-layer dependencies and cross-attention generates adaptive decoding patterns. High-frequency components are fused in to enrich target-position and edge information. Result: Experiments on the NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets show that IrisNet outperforms existing methods and reaches state-of-the-art detection performance. Conclusion: Through dynamic parameter generation and high-frequency information fusion, IrisNet improves the robustness and accuracy of infrared small target detection and suits varied real-world scenarios. Abstract: Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (e.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.

[177] AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Tianyi Yan,Tao Tang,Xingtai Gui,Yongkang Li,Jiasen Zhesng,Weiyao Huang,Lingdong Kong,Wencheng Han,Xia Zhou,Xueyang Zhang,Yifei Zhan,Kun Zhan,Cheng-zhong Xu,Jianbing Shen

Main category: cs.CV

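TL;DR: This paper proposes an Impartial World Model framework for reinforcement learning. Counterfactual synthesis generates dangerous scenarios so that the model predicts risk honestly, improving the safety of autonomous driving policies.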

Details Motivation: Existing end-to-end autonomous driving models fall short on safety and long-tail events, and although reinforcement learning is promising, optimistic bias in world models has held back progress. Method: A Counterfactual Synthesis data generation method trains an Impartial World Model that faithfully reflects the causal link between actions and dangerous outcomes; the model is integrated as an internal critic into a closed-loop RL framework for policy refinement. Result: On a newly proposed Risk Foreseeing Benchmark and in challenging simulations, the model significantly outperforms baselines, predicting failures more accurately and substantially reducing safety violations. Conclusion: Teaching a model to "dream of danger" is a critical step toward truly safe and intelligent autonomous driving agents. Abstract: End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.

[178] 3D Motion Perception of Binocular Vision Target with PID-CNN

Shi Jiazhao,Pan Pan,Shi Haotian

Main category: cs.CV

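TL;DR: This paper proposes a small PID convolutional neural network for 3D motion perception of binocular vision targets. It predicts 3D coordinates, velocity, and acceleration in real time, and experiments show accuracy close to the theoretical limit set by the input image resolution.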

Details Motivation: To better understand how neural networks fit nonlinear problems and to build an efficient, lightweight 3D motion perception system, the paper analyzes neural network design from the perspective of PID control. Method: A single-layer network is viewed as a second-order difference equation combined with a nonlinearity, and a multilayer network transforms representations through repeated such combinations. The authors design a small PID-CNN with 17 layers and 413 thousand parameters, reuse features through concatenation and pooling, and train and test it on simulated datasets of randomly moving balls. Result: The network's predictions of 3D motion parameters approach the accuracy limit of the input image resolution; the paper also analyzes error sources and current shortcomings and discusses the potential advantages of high-dimensional convolution and PID-based memory and attention mechanisms. Conclusion: Viewing neural networks through control theory helps guide lightweight network design, and the proposed PID-CNN shows potential for accurate and efficient 3D motion perception. Abstract: This article trains a network that perceives three-dimensional motion information of a binocular vision target; it provides real-time three-dimensional coordinates, velocity, and acceleration, and has a basic spatiotemporal perception capability. We interpret the ability of neural networks to fit nonlinear problems from the perspective of PID. A single-layer neural network is regarded as using a second-order difference equation and a nonlinearity to describe a local problem, and multilayer networks gradually transform the raw representation into the desired representation through multiple such combinations. We analyse some reference principles for designing neural networks and design a relatively small PID convolutional neural network with a total of 17 layers and 413 thousand parameters. A simple but practical feature reuse method is implemented by concatenation and pooling. The network was trained and tested on simulated datasets of randomly moving balls, and the experimental results show that the prediction accuracy is close to the upper limit that the input image resolution can represent. We analyse the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, we discuss the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization, as well as the potential advantages of using PID information to implement memory and attention mechanisms.
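To make the "second-order difference" idea concrete, here is the classical finite-difference relation between sampled 3D positions, velocity, and acceleration. It is not the paper's learned network, only the textbook quantity such a network would need to approximate; the function name `motion_from_positions` and the toy trajectory are assumptions.

```python
def motion_from_positions(positions, dt):
    """Estimate velocity and acceleration from the last three 3D positions.

    positions -- list of (x, y, z) tuples sampled every dt seconds
    Velocity uses a first-order backward difference; acceleration uses the
    second-order difference p[t] - 2*p[t-1] + p[t-2], divided by dt squared.
    """
    p0, p1, p2 = positions[-3], positions[-2], positions[-1]
    velocity = tuple((c2 - c1) / dt for c1, c2 in zip(p1, p2))
    acceleration = tuple(
        (c2 - 2 * c1 + c0) / dt ** 2 for c0, c1, c2 in zip(p0, p1, p2)
    )
    return velocity, acceleration


# Toy trajectory sampled at t = 0, 0.5, 1.0 s: x = 2t, y = 0, z = t**2.
track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.25), (2.0, 0.0, 1.0)]
velocity, acceleration = motion_from_positions(track, dt=0.5)
print(velocity, acceleration)
```

The backward-difference velocity lags the true instantaneous value for accelerating motion, while the second difference recovers the constant acceleration of this quadratic path exactly.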

[179] ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation

Onur Berk Tore,Ibrahim Samil Yalciner,Server Calap

Main category: cs.CV

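TL;DR: This paper proposes a deep learning framework for single-image homography estimation that rectifies shelf images captured from arbitrary angles. It uses a ConvNeXt backbone with normalized coordinate regression and augments the data with synthetic homographies, improving accuracy and generalization in retail scenes.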

Details Motivation: In retail and similar real-world settings, usually only a single view of a shelf is available. Classical homography estimation is limited by viewpoint and data scarcity and struggles to rectify such images accurately, so a method that estimates homography robustly from a single view is needed. Method: A ConvNeXt-based deep learning framework predicts a 4-point parameterized homography matrix. It uses normalized coordinate regression for stability and a new augmentation strategy that models and samples synthetic homographies to mitigate data scarcity. Result: The method reaches a mean corner error of 1.298 pixels on the test set and is competitive with classical and deep learning approaches in both accuracy and inference speed. Conclusion: The method is an efficient and robust solution for single-view image rectification with practical value for retail applications. The authors also plan to release the ShelfRectSet dataset and code to support follow-up research. Abstract: Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for real-world single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available.
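To make the 4-point parameterization and the mean corner error metric concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and the network is replaced by hand-written corner offsets. It assumes the standard formulation in which four corner correspondences determine a homography with its last entry fixed to 1.

```python
import numpy as np

def homography_from_4pt(src, dst):
    """Solve the 3x3 homography H (with H[2, 2] = 1) that maps four src corners to dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the eight unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Apply H to an (N, 2) array of points using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def mean_corner_error(pred, gt):
    """Average Euclidean distance (in pixels) between predicted and true corners."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Illustrative values only: a 640x480 image and made-up corner offsets.
corners = np.array([[0, 0], [639, 0], [639, 479], [0, 479]], dtype=float)
offsets = np.array([[12, 8], [-20, 15], [-5, -30], [25, -10]], dtype=float)
H = homography_from_4pt(corners, corners + offsets)
```

In a normalized-coordinate setup, the regressed offsets would first be scaled back to pixels before this step; the exact normalization used in the paper is not stated in the abstract.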

[180] AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

Hengyi Wang,Lourdes Agapito

Main category: cs.CV

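TL;DR: AMB3R is a multi-view feed-forward model for dense 3D reconstruction that uses a compact volumetric scene representation. Without task-specific fine-tuning it extends to visual odometry and large-scale structure from motion, and it reaches state-of-the-art performance on several benchmarks.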

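Details Motivation: Existing 3D reconstruction methods typically rely on pointmap representations or need task-specific optimization, which limits their generalization and efficiency. AMB3R aims for cross-task generality and efficiency through a compact volumetric representation. Method: AMB3R is a multi-view feed-forward network that uses a sparse yet compact volumetric scene representation as its backend, enabling geometric reasoning with spatial compactness. Although trained only for multi-view reconstruction, it transfers to other tasks. Result: AMB3R outperforms prior pointmap-based methods on camera pose, depth, and metric-scale estimation, and surpasses optimization-based SLAM and SfM methods, with particularly strong results on dense reconstruction benchmarks. Conclusion: With a unified compact volumetric representation, AMB3R generalizes across tasks without fine-tuning or test-time optimization, offering an efficient and general solution for metric-scale 3D reconstruction. Abstract: We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.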

[181] Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

João Malheiro Silva,Andy Huynh,Tong Duy Son,Holger Caesar

Main category: cs.CV

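TL;DR: The paper proposes a camera-only 3D reconstruction pipeline that reconstructs scenes from multi-view images with 3D Gaussian Splatting, extracts semantic material information with vision models, and assigns physics-based material properties, enabling high-fidelity sensor simulation without LiDAR or complex calibration.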

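Details Motivation: Traditional LiDAR-camera fusion requires complex calibration and represents materials such as glass poorly. The paper exploits the textures and semantics that cameras capture naturally to build digital twin reconstructions with both geometric accuracy and physical realism. Method: Scenes are reconstructed from multi-view images with 3D Gaussian Splatting. Vision models extract semantic material masks, the Gaussian representation is converted to mesh surfaces with projected material labels, and physics-based material properties are assigned to support sensor simulation in modern graphics engines. Result: The method is validated on an internal dataset from an instrumented test vehicle, using LiDAR as ground truth for reflectivity together with image similarity metrics. Sensor simulation fidelity is comparable to LiDAR-camera fusion. Conclusion: A camera-only pipeline can deliver high-quality 3D reconstruction with geometric, semantic, and physical properties, suitable for digital twins and autonomous driving simulation, while reducing hardware complexity and calibration requirements. Abstract: 3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.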

[182] Thinking in 360°: Humanoid Visual Search in the Wild

Heyang Yu,Yinan Han,Xiangyu Zhang,Baiqiao Yin,Bowen Chang,Xiangyu Han,Xinhao Liu,Jing Zhang,Marco Pavone,Chen Feng,Saining Xie,Yiming Li

Main category: cs.CV

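TL;DR: The paper proposes humanoid visual search, in which an agent actively rotates its head in a 360° panoramic environment to search for objects or paths, and builds the more challenging H* Bench benchmark. Experiments show that existing models perform poorly, but post-training substantially improves an open-source model.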

Details Motivation: Existing visual search methods are limited to static images and ignore the interaction between the body and the 3D environment. The goal is to build embodied visual search agents that approach human efficiency while bypassing the constraints of real-world hardware. Method: In the proposed humanoid visual search framework, the agent actively controls head rotation to search for targets in a 360° panoramic image. The H* Bench benchmark covers complex real-world scenes such as transportation hubs and urban streets, and post-training techniques are used to improve Qwen2.5-VL. Result: Top-tier proprietary models reach only about 30% success in object and path search. After post-training, Qwen2.5-VL improves substantially on both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Path search is harder, reflecting its demand for spatial commonsense reasoning. Conclusion: Current MLLM agents still face major challenges in complex real-world scenes, especially on tasks that need advanced spatial reasoning, but post-training is an effective way to improve open-source models. Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
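As a rough illustration of what "rotating the head" over a 360° panorama involves, the sketch below tracks a yaw angle and computes which columns of an equirectangular image fall inside the current horizontal field of view, wrapping around at the image seam. This is a simplification under stated assumptions, not the paper's environment: the function names are hypothetical, pitch is ignored, and a real renderer would reproject the panorama to a perspective view rather than crop columns.

```python
def rotate_head(yaw_deg, delta_deg):
    """Return the new yaw after turning by delta_deg, kept within [0, 360)."""
    return (yaw_deg + delta_deg) % 360.0

def visible_columns(pano_width, yaw_deg, fov_deg):
    """Column indices of an equirectangular panorama inside the horizontal field of view.

    Assumes column 0 corresponds to yaw 0 and columns map linearly to yaw,
    so a view that crosses the seam wraps around to the other image edge.
    """
    center = round(yaw_deg / 360.0 * pano_width)
    half = round(fov_deg / 360.0 * pano_width / 2)
    return [(center - half + i) % pano_width for i in range(2 * half)]

# Illustrative values: a 360-pixel-wide panorama and a 90-degree field of view.
cols_front = visible_columns(360, 0.0, 90.0)
yaw_after_turn = rotate_head(350.0, 30.0)
```

With these values, the forward view straddles the seam: it starts at column 315 and ends at column 44.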

[183] GS-Checker: Tampering Localization for 3D Gaussian Splatting

Haoliang Han,Ziyuan Luo,Jun Qi,Anderson Rocha,Renjie Wan

Main category: cs.CV

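TL;DR: This paper proposes GS-Checker, a method for localizing tampered regions in 3D Gaussian Splatting (3DGS) models. By introducing a 3D tampering attribute and a contrastive mechanism, it localizes tampering precisely without supervision from expensive 3D labels.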

Details Motivation: Existing 3DGS editing techniques can be abused, so a method that localizes tampered regions is needed to guard against malicious manipulation of 3D content. Method: A 3D tampering attribute is added to the 3D Gaussian parameters, and a 3D contrastive mechanism compares the similarity of key attributes between Gaussians to find tampering cues. A cyclic optimization strategy refines the tampering attribute. Result: Experiments show that the method localizes tampered regions in 3DGS models effectively without relying on expensive 3D annotations for supervised training. Conclusion: GS-Checker localizes tampering in 3DGS models at the 3D level without 3D label supervision, which makes it a practical tool for protecting 3D content. Abstract: Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered with. Additionally, we design a 3D contrastive mechanism by comparing the similarity of key attributes between 3D Gaussians to seek tampering cues at the 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method in locating the tampered 3DGS area.

[184] From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Zhiqing Guo,Dongdong Xi,Songlin Li,Gaobo Yang

Main category: cs.CV

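TL;DR: This paper proposes BoxPromptIML, a weakly supervised image manipulation localization framework that uses bounding-box prompts and knowledge distillation to achieve high localization accuracy at low annotation cost.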

Details Motivation: Fully supervised methods depend on dense pixel-level annotations, while most weakly supervised methods use image-level labels and lack precise spatial localization. A method is needed that cuts annotation cost while keeping good localization performance. Method: The paper proposes a coarse region annotation strategy based on bounding-box prompts and an efficient lightweight student model that learns fine-grained localization through knowledge distillation from a SAM-based teacher model. Inspired by the human subconscious memory mechanism, a dual-guidance feature fusion module dynamically combines prototypical patterns recalled from long-term memory with real-time observational cues from the current input. Result: Experiments on several in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully supervised models in localization accuracy while offering strong generalization, low annotation cost, and efficient deployment. Conclusion: BoxPromptIML balances annotation cost against localization accuracy and provides a practical, efficient solution for image manipulation localization in real applications. Abstract: Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.

[185] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

Xin Ming,Yuxuan Han,Tianyu Huang,Feng Xu

Main category: cs.CV

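TL;DR: The paper proposes VGGTFace, an automatic method built on the 3D foundation model VGGT that reconstructs topologically consistent facial geometry from in-the-wild multi-view images. It combines Pixel3DMM with a topology-aware optimization strategy for fast, high-quality reconstruction.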

Details Motivation: Existing methods fall short in automation, generalization, or expressiveness, and struggle with multi-view face reconstruction in real-world settings. Method: The method builds on the VGGT foundation model and uses Pixel3DMM to inject topology information through pixel-aligned UV values, converting the point map into a point cloud with topology. A Topology-Aware Bundle Adjustment strategy with a Laplacian energy term fuses the result. Result: High-fidelity reconstruction takes only 10 seconds for 16 views. The method achieves state-of-the-art results on standard benchmarks and generalizes well to in-the-wild data. Conclusion: VGGTFace delivers efficient, topologically consistent, and well-generalizing facial geometry reconstruction in the wild, advancing the automation of digital avatar creation pipelines. Abstract: Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e., VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.

[186] FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Xinwan Wen,Bowen Li,Jiajun Luo,Ye Li,Zhi Wang

Main category: cs.CV

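TL;DR: The paper proposes FREE, a framework that accelerates Diffusion Transformers losslessly through feature-level autoregression with parallel verification, and adds an uncertainty-guided relaxation strategy that further raises the sampling acceptance rate.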

Details Motivation: Existing speculative inference methods give limited acceleration on DiTs because draft accuracy during verification is insufficient, and because the prediction variance of DiTs grows in later denoising steps, lowering acceptance rates. Method: The authors analyze the feature dynamics of DiTs and find that top-block features show strong temporal consistency and rich semantics. Based on this, they design a lightweight drafter that performs feature-level autoregression, combine it with parallel verification to guarantee lossless acceleration, and propose an uncertainty-guided relaxation strategy that dynamically adjusts the acceptance probability. Result: On ImageNet-512², FREE achieves up to 1.86× acceleration and FREE (relax) reaches 2.25×, while maintaining high perceptual and quantitative fidelity in generation quality. Conclusion: FREE and its relaxed variant improve the inference efficiency of DiTs and provide an efficient parallel sampling scheme for transformer-based diffusion models. Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.

[187] A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

Jiawei Lin,Guanlong Jiao,Jianjin Xu

Main category: cs.CV

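TL;DR: This paper proposes MultiID, a training-free method for multi-ID customization. An ID-decoupled cross-attention mechanism and several text control strategies address the copy-paste issue and weak text controllability in generated images, with performance comparable to or better than training-based methods.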

Details Motivation: Multi-ID customization, which generates an image containing several individuals while preserving each identity, suffers from the copy-paste issue and weak text controllability. Existing methods mostly rely on training and lack an efficient, flexible alternative. Method: MultiID uses an ID-decoupled cross-attention mechanism to inject distinct ID embeddings into the corresponding image regions, and combines local prompts, depth-guided spatial control, and extended self-attention to improve text consistency, achieving training-free multi-ID image generation. Result: Extensive experiments on the self-built IDBench benchmark show that MultiID mitigates the copy-paste issue and improves text controllability and generation quality. Qualitative and quantitative results are comparable to or better than existing training-based methods. Conclusion: MultiID offers an efficient, training-free approach to multi-ID customization. Its improved attention mechanism and control strategies preserve identity and text alignment well. Abstract: Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than the training-based multi-ID customization methods.

[188] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Bao Tang,Shuai Zhang,Yueting Zhu,Jijun Xiang,Xin Yang,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

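TL;DR: This paper proposes the Trajectory-Backward Consistency Model (TBCM), a continuous-time consistency distillation method that needs no external training data. It extracts latent representations from the teacher model's generation trajectory for efficient, self-contained distillation, cutting training time by about 40% and saving substantial GPU memory while reaching 6.52 FID and a 28.08 CLIP score under one-step generation.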

Details Motivation: Existing continuous-time consistency distillation methods rely on large amounts of training data and compute, which limits their use in resource-constrained scenarios and their extension to new domains. A more efficient distillation method that does not depend on external data is needed. Method: TBCM distills from latent samples extracted along the teacher model's own generation trajectory, avoiding VAE encoding and large-scale datasets. The paper builds a self-contained distillation paradigm and analyzes how sampling strategies affect distillation. Result: On MJHQ-30k, TBCM reaches 6.52 FID and a 28.08 CLIP score under one-step generation, reduces training time by about 40%, and saves a substantial amount of GPU memory. The paper also reveals a discrepancy between the diffusion and generation spaces. Conclusion: By working backward along trajectories, TBCM achieves efficient, low-resource consistency distillation with no external data and strong performance, and offers new insights on distribution gaps and sampling strategies for future distillation research. Abstract: Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.

[189] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

Zilong Huang,Jun He,Xiaobin Huang,Ziyi Xiong,Yang Luo,Junyan Ye,Weijia Li,Yiping Chen,Ting Han

Main category: cs.CV

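TL;DR: This paper proposes MajutsuCity, a natural language-driven, aesthetically adaptive framework for 3D city generation. A four-stage pipeline produces structurally consistent and stylistically diverse urban scenes, and the paper also introduces MajutsuAgent, an interactive language-based editing agent, and MajutsuDataset, a high-quality multimodal dataset, improving generation quality and controllability.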

Details Motivation: Existing methods struggle to combine the creative flexibility of text-based generation with the object-level editability of explicit structural representations, and no solution balances stylistic diversity, fine-grained control, and structural consistency. Method: MajutsuCity represents a city as a composition of layouts, assets, and materials and generates it through a four-stage pipeline. The integrated MajutsuAgent supports five object-level language-driven editing operations. The authors build MajutsuDataset, which contains 2D semantic layouts, 3D building assets, PBR materials, and more, and design evaluation metrics covering structural consistency, complexity, material fidelity, and related dimensions. Result: MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% compared with CityCraft, and ranks first on all AQS and RDR scores, clearly outperforming existing methods. Conclusion: MajutsuCity sets a new state of the art in geometric fidelity, stylistic adaptability, and semantic controllability, and provides a more capable, flexible, and extensible framework for 3D city generation. Abstract: Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must offer both stylistic diversity and fine-grained controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.

[190] StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections

Matvei Shelukhan,Timur Mamedov,Karina Kvanchiani

Main category: cs.CV

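TL;DR: StableTrack is a multi-object tracking method that stabilizes tracking quality on low-frequency detections. It introduces a two-stage matching strategy and a bounding-box-based distance, substantially improving HOTA in low-frequency settings while keeping pace with the best methods on mainstream benchmarks.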

Details Motivation: Existing MOT methods depend on high-frequency detections, so their performance degrades when computing resources are limited, which makes them hard to use in practice. Method: StableTrack uses a two-stage matching strategy, replaces the Mahalanobis distance with a Bbox-Based Distance, and combines a Re-ID model with the Kalman Filter for cross-frame association and track refinement. Result: On MOT17-val with 1 Hz input, HOTA improves by 11.6%, while performance on MOT17, MOT20, and DanceTrack with full-frequency detections keeps up with the best approaches. Conclusion: StableTrack improves the stability and accuracy of multi-object tracking on low-frequency detections, balancing resource efficiency and performance. Abstract: Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving 11.6% HOTA improvement at 1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.

[191] Block Cascading: Training Free Acceleration of Block-Causal Video Models

Hmrishav Bandyopadhyay,Nikhil Pinnaparaju,Rahim Entezari,Jim Scott,Yi-Zhe Song,Varun Jampani

Main category: cs.CV

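TL;DR: Block Cascading is a training-free parallelization method that generates video blocks in parallel using partially denoised context, substantially speeding up block-causal video generation while preserving generation quality.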

Details Motivation: Block-causal video generation faces a speed-quality trade-off: small models are fast but lower in quality, and large models are higher in quality but slow. A method is needed that speeds up inference without sacrificing quality. Method: Block Cascading lets the next block start generating before the previous block is fully denoised, turning a sequential process into a parallel cascade. It uses multiple GPUs for parallelism along the temporal dimension and removes the overhead of KV-cache switching. Result: With 5 GPUs the method achieves roughly 2x acceleration: the 1.3B model goes from 16 to 30 FPS and the 14B model from 4.5 to 12.5 FPS. It also removes about 200 ms of KV-recaching overhead, with no significant loss in generation quality. Conclusion: Block Cascading eases the speed-quality trade-off in block-causal video generation and supports efficient, high-quality generation suitable for interactive applications. Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
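The scheduling idea can be illustrated with a toy timing model. This is a sketch under simplifying assumptions, not the paper's scheduler: every denoising step costs one time unit, each block needs the same number of steps, a block may start once its predecessor has finished `lag` steps, and enough devices are available. The function names and the `lag` parameter are hypothetical.

```python
import math

def sequential_makespan(num_blocks, steps):
    """Time units when each block is fully denoised before the next one starts."""
    return num_blocks * steps

def cascaded_makespan(num_blocks, steps, lag):
    """Time units when block k starts after block k-1 has completed `lag` steps."""
    if not 1 <= lag <= steps:
        raise ValueError("lag must be between 1 and steps")
    # The last block starts at (num_blocks - 1) * lag and then runs all its steps.
    return (num_blocks - 1) * lag + steps

def max_concurrent_blocks(steps, lag):
    """Largest number of blocks denoising at once, i.e. the devices the schedule needs."""
    return math.ceil(steps / lag)

# Illustrative values: 10 blocks, 4 denoising steps each, next block starts after 2 steps.
seq = sequential_makespan(10, 4)
cas = cascaded_makespan(10, 4, 2)
```

In this toy model, setting `lag` equal to `steps` recovers the sequential pipeline, and smaller lags trade more concurrent devices for a shorter makespan. Real speedups also depend on communication and memory costs that the model ignores.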

[192] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Dohun Lim,Minji Kim,Jaewoon Lim,Sungchan Kim

Main category: cs.CV

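TL;DR: BRIC is a test-time adaptation framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based motion planners and reinforcement learning controllers.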

Details Motivation: Diffusion models can generate diverse motions but often produce physically implausible outputs, which cause execution drift in simulation. Method: At test time, BRIC dynamically adapts the physics controller to noisy motion plans while preserving pre-trained skills through a loss function that mitigates catastrophic forgetting. It also introduces a lightweight test-time guidance mechanism that steers the diffusion model without updating its parameters. Result: BRIC achieves state-of-the-art performance on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction. Conclusion: By combining controller adaptation with signal-space guidance, BRIC produces consistent and physically plausible long-term motion across diverse environments. Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

[193] Object-Centric Vision Token Pruning for Vision Language Models

Guangyuan Li,Rongzhen Zhao,Jinhong Deng,Yanbo Wang,Joni Pajarinen

Main category: cs.CV

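TL;DR: The paper proposes OC-VTP, a direct and guaranteed vision token pruning method that improves the inference efficiency of vision language models while preserving accuracy, requires no fine-tuning, and is interpretable.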

Details Motivation: In vision language models, vision tokens are numerous but their information is dispersed, which makes inference expensive. Existing pruning methods are indirect and offer no guarantee of effectiveness. Method: OC-VTP is a lightweight object-centric vision token pruner that keeps the most representative vision tokens by minimizing the reconstruction error. It can be inserted into existing VLMs without fine-tuning. Result: Across pruning ratios, OC-VTP improves inference efficiency while consistently preserving the highest inference accuracy among pruning methods, and it shows good interpretability. Conclusion: OC-VTP is an efficient, general, fine-tuning-free vision token pruning scheme that offers a reliable route to efficient VLM inference. Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.

[194] Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Jeonghyeon Na,Sangwon Baik,Inhee Lee,Junyoung Lee,Hanbyul Joo

Main category: cs.CV

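TL;DR: This paper introduces the new research problem of Human-Human-Object Interactions (HHOIs), builds a dedicated dataset, and proposes a unified diffusion-based generative framework that synthesizes complex interactions among multiple people and an object.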

Details Motivation: Existing methods focus mainly on interactions between a single person and an object and cannot model the complex interactive behavior of several people in a scene context. A new approach is needed that understands collaborative multi-person interaction in context. Method: The paper formulates Human-Human-Object Interactions (HHOIs), builds captured and synthetic HHOI datasets, trains text-to-HOI and text-to-HHI sub-models with score-based diffusion, and fuses them into a unified generative framework that synthesizes complete HHOIs in a single sampling process. The framework extends to multi-human settings. Result: Experiments show that the method generates realistic HHOIs from textual descriptions and outperforms previous approaches that handle only single-human HOIs. It is also applied to multi-human motion generation involving objects. Conclusion: The paper shows that jointly modeling HHI and HOI is effective for generating HHOIs. The framework captures complex interactions among multiple people and objects and offers a new direction for context-aware interaction understanding. Abstract: The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train text-to-HOI and text-to-HHI models using a score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual models, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Yunqi Zhou,Chengjie Jiang,Chun Yuan,Jing Li

Main category: cs.CV

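TL;DR: ZoomSearch is a training-free, plug-and-play method for ultra-high-resolution remote sensing visual question answering (RS-VQA) that improves accuracy and efficiency through adaptive multi-branch search and layout-aware patch reassembly.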

Details Motivation: Existing remote sensing foundation models struggle with ultra-high-resolution images: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses critical details, so the model must be guided toward the important regions before prediction. Method: Proposes ZoomSearch, which combines Adaptive Multi-Branch Zoom Search (a hierarchical search for relevant image patches) with Layout-Aware Patch Reassembly (reorganizing the selected patches into a compact, layout-preserving canvas) for efficient inference. Result: On the Ultra-HR RS-VQA benchmarks LRS-VQA and MME-RealWorld-RS, integrating ZoomSearch with LLaVA-ov improves the baseline by 26.3% and 114.8%, respectively, and inference is 20% to 44% faster than prior search-based methods. Conclusion: ZoomSearch addresses the computation and detail-preservation problems of VQA on Ultra-HR remote sensing imagery and reaches state-of-the-art accuracy and efficiency. Abstract: With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20% to 44% in speed.

[196] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Jiatao Gu,Ying Shen,Tianrong Chen,Laurent Dinh,Yuyang Wang,Miguel Angel Bautista,David Berthelot,Josh Susskind,Shuangfei Zhai

Main category: cs.CV

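TL;DR: This paper presents STARFlow-V, a normalizing flow-based video generator that uses a global-local architecture and flow-score matching to achieve high-quality, efficient autoregressive video generation while preserving causality, and that supports multiple generation tasks.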

Details Motivation: Existing video generation models rely mainly on diffusion, and normalizing flows remain underexplored in this domain. The paper explores the potential of normalizing flows for video generation and addresses the challenges of spatiotemporal complexity and computational cost. Method: Building on the STARFlow framework, proposes STARFlow-V, which uses a global-local architecture in a spatiotemporal latent space and introduces flow-score matching and a video-aware Jacobi iteration scheme to improve generation consistency and sampling efficiency. Result: STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines, and natively supports text-to-video, image-to-video, and video-to-video tasks. Conclusion: The study shows that normalizing flows are capable of high-quality autoregressive video generation and offers a new research direction for building world models. Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
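The abstract mentions recasting sequential inner updates as parallelizable Jacobi iterations. The sketch below illustrates only the generic idea on a toy scalar recurrence; it is not the video-aware scheme from the paper. The update rule `step` and the function names are hypothetical. Every position is refreshed from the previous iterate at once, and because position t depends only on position t-1, the result matches sequential decoding after at most T sweeps.

```python
import math


def step(prev, z):
    # Hypothetical causal update: each element depends only on its predecessor.
    return 0.5 * math.tanh(prev) + z


def sequential_decode(z, x0=0.0):
    # Reference: strictly left-to-right autoregressive decoding.
    xs, prev = [], x0
    for zt in z:
        prev = step(prev, zt)
        xs.append(prev)
    return xs


def jacobi_decode(z, x0=0.0, tol=1e-12):
    # Jacobi fixed-point iteration: all positions are updated from the
    # previous iterate, so each sweep could run in parallel.
    T = len(z)
    if T == 0:
        return [], 0
    xs = [0.0] * T
    for sweep in range(1, T + 1):
        new = [step(x0 if t == 0 else xs[t - 1], z[t]) for t in range(T)]
        delta = max(abs(a - b) for a, b in zip(new, xs))
        xs = new
        if delta < tol:
            # Converged early; remaining sweeps would not change the result.
            return xs, sweep
    # After T sweeps every position is exact because dependencies are causal.
    return xs, T
```

In this toy setting the benefit comes from stopping early once successive sweeps agree; whether that pays off in practice depends on how quickly the iteration contracts.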

[197] Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features

Ben Hamscher,Arnold Brosch,Nicolas Binninger,Maksymilian Jan Dejna,Kira Maag

Main category: cs.CV

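TL;DR: Proposes a lightweight dance style classification framework that uses pose-estimate-based temporal-spatial features and frequency-domain features for efficient, interpretable recognition of dance movements.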

Details Motivation: Because many dance styles share similar poses, gestures, and temporal motion patterns, identifying and distinguishing styles from motion data is a complex problem; existing methods often rely on complex models and lack interpretability. Method: Extracts pose estimates from videos and designs temporal-spatial descriptors inspired by Laban Movement Analysis that capture local joint dynamics (velocity, acceleration, and angular movement); adds Fast Fourier Transform (FFT) features to encode rhythmic and periodic aspects of movement, yielding a classification framework with low computational cost. Result: The method classifies dance styles robustly without complex model architectures and provides interpretable motion representations. Conclusion: Fusing pose-based temporal-spatial and frequency-domain features captures stylistic nuances between dance styles and demonstrates the potential of lightweight, interpretable methods for human activity recognition. Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
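To make the two feature families concrete, here is a minimal sketch that computes finite-difference velocity and acceleration plus a dominant frequency for one joint coordinate over time. The feature names, the single-coordinate input, and the use of a naive DFT in place of an FFT library are assumptions for illustration; the paper's actual descriptors are not specified in the abstract.

```python
import cmath
import math


def finite_diff(xs, dt):
    # First-order finite difference: approximates the time derivative.
    return [(b - a) / dt for a, b in zip(xs, xs[1:])]


def dft_magnitudes(xs):
    # Naive DFT of the mean-removed signal, bins 0..n/2 (an FFT gives the
    # same values faster; this keeps the sketch dependency-free).
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    mags = []
    for k in range(n // 2 + 1):
        s = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                for i, c in enumerate(centered))
        mags.append(abs(s) / n)
    return mags


def motion_features(track, fps):
    # track: one joint coordinate per frame (hypothetical input format).
    dt = 1.0 / fps
    vel = finite_diff(track, dt)
    acc = finite_diff(vel, dt)
    mags = dft_magnitudes(track)
    peak = max(range(1, len(mags)), key=lambda i: mags[i])
    return {
        "mean_speed": sum(abs(v) for v in vel) / len(vel),
        "mean_abs_acc": sum(abs(a) for a in acc) / len(acc),
        "dominant_hz": peak * fps / len(track),  # bin index -> frequency
    }
```

For a joint oscillating twice per second and sampled at 30 frames per second, `dominant_hz` recovers the 2 Hz rhythm, which is the kind of periodicity cue a frequency-domain feature is meant to expose.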

[198] Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification

Akshit Pramod Anchan,Jewelith Thomas,Sritama Roy

Main category: cs.CV

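TL;DR: This study proposes and evaluates a modular architecture inspired by 'Smart Eye' with three independent sensing modules for eye state detection, facial expression recognition, and voice-based speaker identification, built on CNN and LSTM networks. The modules achieve high accuracy on their respective datasets, supporting the feasibility of lightweight specialized models for multimodal integration in assistive technology.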

Details Motivation: Comprehensive assistive technology requires seamless integration of visual and auditory perception, which motivates exploring an efficient, extensible modular perception architecture. Method: Uses a modular design with a CNN for eye state detection, a deep CNN for facial expression recognition, and an LSTM for voice-based speaker identification, trained and evaluated on the Eyes Image, FER2013, and customized audio datasets. Result: The three models reach accuracies of 93.0%, 97.8%, and 96.89% on their respective tasks, showing that lightweight specialized models perform well on specific perception tasks. Conclusion: Lightweight, domain-specific models handle independent perception tasks efficiently and lay a feasible foundation for real-time multimodal assistive technology on resource-constrained devices. Abstract: Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.

[199] A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences

Muhammad Irfan,Nasir Rahim,Khalid Mahmood Malik

Main category: cs.CV

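TL;DR: Proposes a Physics-Informed Loss (PIL) that models the interaction between vessel boundaries as an elastic process using dislocation theory from materials physics, improving the accuracy and boundary consistency of cerebral artery segmentation in DSA images.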

Details Motivation: Conventional loss functions rely only on pixel-wise overlap and ignore the geometric and physical consistency of vessel boundaries, which leads to fragmented or unstable segmentations. Method: Designs a new Physics-Informed Loss (PIL) that models the interaction between predicted and ground-truth vessel boundaries as an elastic process and adds a physics-based regularization term that promotes contour smoothness and structural consistency; the loss can be integrated into several segmentation networks (such as U-Net and SegFormer). Result: On the two public datasets DIAS and DSCA, PIL outperforms conventional losses such as Cross-Entropy, Dice, and Active Contour in sensitivity, F1 score, and boundary coherence. Conclusion: Incorporating physics-driven boundary interactions into deep learning models improves the precision and robustness of vessel segmentation in dynamic angiographic imaging. Abstract: Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel Physics-Informed Loss (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at https://github.com/irfantahir301/Physicsis_loss.

[200] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Kuniaki Saito,Risa Shinoda,Shohei Tanaka,Tosho Hirasawa,Fumio Okura,Yoshitaka Ushiku

Main category: cs.CV

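TL;DR: AlignBench is a new image-text alignment benchmark that measures fine-grained alignment performance by evaluating detailed image-caption pairs generated by diverse models.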

Details Motivation: Existing benchmarks rely on rule-based perturbations or short captions and cannot accurately measure fine-grained image-text alignment. Method: Builds the AlignBench benchmark from detailed image-caption pairs produced by diverse image-to-text and text-to-image models and annotates each sentence for correctness, enabling direct assessment of vision-language models as alignment evaluators. Result: The evaluation finds that (i) CLIP-based models perform poorly on fine-grained alignment; (ii) detectors systematically over-score early sentences; and (iii) detectors show strong self-preference, which harms detection performance. Conclusion: Current mainstream VLMs have blind spots and biases when used to evaluate image-text alignment, and more robust and objective evaluation mechanisms are needed. Abstract: Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.

[201] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Xiang Wang,Zhifei Zhang,He Zhang,Zhe Lin,Yuqian Zhou,Qing Liu,Shiwei Zhang,Yijun Li,Shaoteng Liu,Haitian Zheng,Jason Kuen,Yuehuan Wang,Changxin Gao,Nong Sang

Main category: cs.CV

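TL;DR: This paper proposes HBridge, an asymmetric H-shaped architecture for unified multimodal generative models that selectively bridges intermediate layers and introduces semantic reconstruction tokens, improving generation efficiency and cross-modal coherence.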

Details Motivation: Existing unified models adopt a symmetric design (such as the MoT paradigm) that does not account for inherent differences between modalities, which makes fusion suboptimal. Method: Proposes the HBridge architecture, an asymmetric H-shaped structure that bridges only selected intermediate layers and introduces semantic reconstruction tokens to strengthen visual semantic reconstruction; shallow and deep layers stay decoupled to preserve modality-specific representations. Result: Compared with dense fusion strategies, HBridge reduces attention sharing by more than 40% and shows higher generation quality and efficiency across multiple benchmarks. Conclusion: HBridge offers a new paradigm for unified multimodal generation that balances the use of pretrained priors with cross-modal alignment. Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing attention sharing by over 40%, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.

[202] Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

Yayuan Li,Aadit Jain,Filippos Bellos,Jason J. Corso

Main category: cs.CV

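TL;DR: This paper introduces the Mistake Attribution (MATT) task for fine-grained understanding of human mistakes in egocentric video, builds large-scale annotated datasets with MisEngine, and proposes MisFormer, a unified attention-based model that attributes mistakes along semantic, temporal, and spatial dimensions.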

Details Motivation: Prior work on understanding human mistakes lacks fine-grained output and cannot pinpoint which part of the instruction a mistake violates or when and where it occurs, so a finer-grained task formulation is needed. Method: Proposes the MATT task and defines three attribution dimensions: semantic role, Point-of-No-Return (PNR), and spatial location; uses MisEngine to automatically construct the attribution-rich datasets EPIC-KITCHENS-M and Ego4D-M from existing datasets; designs MisFormer, a unified attention-based model for multi-dimensional mistake attribution. Result: Experiments on the new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines. Conclusion: MATT provides a finer-grained framework for evaluating mistake understanding; the effectiveness of MisEngine and MisFormer supports the feasibility of automatically constructing high-quality mistake attribution data and advances behavior understanding from the egocentric perspective. Abstract: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.

[203] Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation

Andrea Ranieri,Giorgio Palmieri,Silvia Biasotti

Main category: cs.CV

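TL;DR: This paper studies semantic segmentation with U-Net architectures and different CNN encoders for crack detection in cultural heritage, evaluating quantitatively on the OmniCrack30k dataset and testing generalization on real-world scenes.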

Details Motivation: Cultural heritage preservation urgently needs automated crack detection that identifies cracks finely and accurately on complex structures such as statues and monuments. Method: Compares U-Net architectures with various CNN encoders, evaluates them quantitatively on the OmniCrack30k test set with mIoU, Dice coefficient, and Jaccard index, and analyzes them qualitatively on unlabeled real-world images. Result: The selected models perform well on pixel-level crack segmentation and generalize well to unseen cultural heritage scenes, even though no statue or monument images were used in training. Conclusion: CNN-based U-Net models are effective for crack detection in cultural heritage, generalize well across domains, and provide a feasible automated solution for practical preservation work. Abstract: This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN-based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
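The three metrics named in the abstract are closely related, and a small sketch may help. It assumes binary crack masks flattened into lists of 0/1; the function names are illustrative and the averaging convention for mIoU (foreground and background classes) is an assumption, since the abstract does not state it. The Jaccard index is the same quantity as IoU, and Dice can be derived from it as 2J / (1 + J).

```python
def _counts(pred, target):
    # True positives, false positives, false negatives for the positive class.
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def iou(pred, target):
    # Intersection over Union, identical to the Jaccard index.
    tp, fp, fn = _counts(pred, target)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # both masks empty: perfect match


def dice(pred, target):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|).
    tp, fp, fn = _counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target):
    # mIoU over the two classes of a binary task (crack and background).
    fg = iou(pred, target)
    bg = iou([1 - p for p in pred], [1 - t for t in target])
    return (fg + bg) / 2
```

Because Dice is a monotone function of IoU for a single mask pair, the two rank individual predictions identically; they can differ once scores are averaged over many images.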

[204] New York Smells: A Large Multimodal Dataset for Olfaction

Ege Ozguroglu,Junbang Liang,Ruoshi Liu,Mia Chiquier,Michael DeTienne,Wesley Wei Qian,Alexandra Horowitz,Andrew Owens,Carl Vondrick

Main category: cs.CV

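TL;DR: This paper presents New York Smells, a large in-the-wild multimodal dataset of 7,000 smell-image pairs with about 70 times more objects than existing olfactory datasets, intended to advance machine olfaction research.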

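Details Motivation: Machines have difficulty capturing and understanding smell in natural environments; the main bottleneck is the lack of diverse multimodal olfactory training data from real settings. Method: Collects 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments to build the New York Smells dataset, and designs three benchmark tasks: cross-modal smell-to-image retrieval; recognizing scenes, objects, and materials from smell alone; and fine-grained discrimination between grass species. Result: Experiments show that visual data enables cross-modal olfactory representation learning and that the learned olfactory representations outperform widely used hand-crafted features. Conclusion: New York Smells is an important resource for machine olfaction, demonstrates the potential of cross-modal learning for olfactory representations, and advances machine understanding of chemical senses. Abstract: While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured "in the wild." Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70× more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.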
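Cross-modal smell-to-image retrieval is commonly scored by ranking candidate images by embedding similarity. The sketch below shows one standard way to do that with cosine similarity and recall@k. It assumes paired embeddings where smell i matches image i; the metric choice and all names are assumptions, since the abstract does not say how the benchmark is scored.

```python
import math


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(smell_embs, image_embs, k):
    # Fraction of smell queries whose paired image (same index) appears
    # among the k most similar images.
    hits = 0
    for i, query in enumerate(smell_embs):
        ranked = sorted(range(len(image_embs)),
                        key=lambda j: cosine(query, image_embs[j]),
                        reverse=True)
        hits += i in ranked[:k]
    return hits / len(smell_embs)
```

Reporting recall at several values of k shows both how often the correct image is ranked first and how often it is at least close to the top.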

[205] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Guanjie Chen,Shirui Huang,Kai Liu,Jianchen Zhu,Xiaoye Qu,Peng Chen,Yu Cheng,Yifu Sun

Main category: cs.CV

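TL;DR: This paper proposes Flash-DMD, a framework that accelerates diffusion model generation and stabilizes reinforcement learning fine-tuning through joint training, achieving fast convergence, high-quality generation, and strong performance in few-step sampling.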

Details Motivation: Diffusion models are powerful but slow to sample, and distillation-based acceleration is costly to train and degrades quality; reinforcement learning fine-tuning is prone to reward hacking and lacks stability. An efficient, stable, high-quality framework for accelerated generation is therefore needed. Method: Proposes the Flash-DMD framework with two key parts: (1) an efficient timestep-aware distillation strategy that greatly reduces training cost and improves realism; (2) joint training of the distillation and RL objectives, using the distillation loss as a regularizer to stabilize RL training and prevent policy collapse. Result: Experiments on score-based and flow matching models show that Flash-DMD surpasses DMD2 with only 2.1% of its training cost, reaches state-of-the-art image quality, human preference, and text alignment metrics in few-step sampling, and trains more stably with faster convergence. Conclusion: Flash-DMD offers an effective paradigm for efficient, high-fidelity, and stable generative models and addresses the efficiency and stability problems of distillation-based acceleration and RL fine-tuning. Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.

[206] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Haoze Zhang,Tianyu Huang,Zichen Wan,Xiaowei Jin,Hongzhi Zhang,Hui Li,Wangmeng Zuo

Main category: cs.CV

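TL;DR: This paper proposes PhysChoreo, a framework that generates videos with diverse controllability and physical realism from a single image. It combines part-aware physical property reconstruction with temporally instructed, editable physical simulation in two stages to produce high-quality, dynamically rich videos, and it outperforms state-of-the-art methods on multiple metrics.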

Details Motivation: Existing video generation models achieve high visual quality but lack explicit physical controllability and plausibility. Attempts to guide generation with physics-based rendering still struggle to model complex physical properties accurately and to control behavior over long sequences. Method: Proposes the PhysChoreo framework in two stages: the first estimates the static initial physical properties of objects in the image through part-aware physical property reconstruction; the second uses temporally instructed, physically editable simulation to generate videos with rich dynamic behaviors and physical realism. Result: Experiments show that PhysChoreo outperforms state-of-the-art methods on multiple evaluation metrics and generates videos with rich dynamic behaviors and high physical realism. Conclusion: By combining physical property reconstruction with editable physical simulation, PhysChoreo generates long, dynamic videos from a single image with good controllability and physical plausibility. Abstract: While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.

[207] A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Shengqiong Wu,Weicai Ye,Yuanxing Zhang,Jiahao Wang,Quande Liu,Xintao Wang,Pengfei Wan,Kun Gai,Hao Fei,Tat-Seng Chua

Main category: cs.CV

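TL;DR: ReaDe is a universal, model-agnostic interpreter that follows a reason-then-describe paradigm to turn ambiguous user instructions into precise generation guidance, improving intent-output consistency in controllable video generation.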

Details Motivation: Diffusion Transformers produce high-quality video but offer limited control when user inputs are ambiguous, complex, or concise, creating a mismatch between the detailed prompts used in training and actual user intent. Method: Proposes ReaDe, which follows a reason-then-describe paradigm: it first parses the user request and resolves ambiguities, then produces detailed generation guidance. It is trained with a two-stage optimization: (i) reasoning-augmented supervision with stepwise analysis and dense captions, and (ii) a multi-dimensional reward assigner for stable, feedback-driven refinement. Result: Experiments in single- and multi-condition scenarios show that ReaDe improves instruction fidelity, caption accuracy, and generated video quality, and generalizes well to reasoning-intensive and unseen inputs. Conclusion: ReaDe offers a practical, general route to aligning controllable video generation with user intent. Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.

[208] DINO-Tok: Adapting DINO for Visual Tokenizers

Mingkai Jia,Mingxiao Li,Liaoyuan Fan,Tianxing Shi,Jiaxin Guo,Zeming Li,Xiaoyang Guo,Xiao-Xiao Long,Qian Zhang,Ping Tan,Wei Yin

Main category: cs.CV

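TL;DR: This paper proposes DINO-Tok, a DINO-based visual tokenizer that fuses shallow detail features with deep semantic features into an information-complete hierarchical latent space, improving semantic alignment and reconstruction fidelity for generative models. It also introduces a global PCA reweighting mechanism that mitigates information loss and codebook collapse in high-dimensional vector quantization, and achieves state-of-the-art reconstruction on ImageNet 256×256.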

Details Motivation: Existing visual tokenizers are usually trained from scratch and struggle to balance semantic representation and reconstruction fidelity, especially in high-dimensional latent spaces, which limits generative model quality. Method: Proposes DINO-Tok, which extracts multi-level features from a pretrained DINO model and fuses shallow details with deep semantics; designs a global PCA reweighting mechanism that stabilizes vector quantization in high-dimensional space and prevents loss of key information and codebook collapse. Result: On ImageNet 256×256, DINO-Tok reaches 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to models trained on billion-level data (such as Hunyuan and Wan). Conclusion: Adapting powerful pretrained vision models such as DINO for tokenization yields semantically aligned, high-fidelity latent representations and provides an effective basis for next-generation visual generative models. Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256×256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
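The reconstruction numbers above are reported in PSNR, which is defined as 10·log10(MAX² / MSE) in decibels. The following is a minimal sketch of that standard definition, assuming 8-bit pixel values flattened into lists; the function name and input format are illustrative and not taken from the paper's evaluation code.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    # Peak signal-to-noise ratio in dB between two equally sized pixel lists.
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gap of d dB corresponds to an MSE ratio of 10^(d/10) for a single image. Read that way, the 4.56 dB difference between 28.54 and 23.98 would be roughly a 2.9 times difference in mean squared error, although dataset-level PSNR averages do not translate exactly.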

[209] VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Chenhui Gou,Zilong Chen,Zeyu Wang,Feng Li,Deyao Zhu,Zicheng Duan,Kunchang Li,Chaorui Deng,Hongyi Yuan,Haoqi Fan,Cihang Xie,Jianfei Cai,Hamid Rezatofighi

Main category: cs.CV

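TL;DR: This paper presents VQ-VA World, an open-source data framework for Visual Question-Visual Answering (answering a visual question with a generated image), and releases the IntelligentBench evaluation benchmark; training on its data substantially improves the open-source LightFusion model on this task.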

Details Motivation: To bring Visual Question-Visual Answering (VQ-VA), a capability currently limited to proprietary systems such as NanoBanana and GPT-Image, to open-source models. Method: Builds VQ-VA World, a data-centric framework centered on an agentic pipeline that, deployed at web scale, crawls about 1.8M high-quality interleaved image-text samples for training; releases IntelligentBench, a human-curated benchmark that evaluates VQ-VA models on world knowledge, design knowledge, and reasoning. Result: LightFusion trained with VQ-VA World data scores 53.06 on IntelligentBench, far above prior open-source baselines (7.78 for vanilla LightFusion and 1.94 for UniWorld-V1) and closer to leading proprietary systems (81.67 for NanoBanana and 82.64 for GPT-Image). Conclusion: Releasing the model weights, datasets, and full pipeline is intended to stimulate open-source research on VQ-VA. Abstract: This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.

[210] The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Ziheng Ouyang,Yiren Song,Yaoli Liu,Shihao Zhu,Qibin Hou,Ming-Ming Cheng,Mike Zheng Shou

Main category: cs.CV

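TL;DR: This paper proposes ImageCritic, a reference-guided post-editing method that addresses inconsistent fine-grained details in generated images. It builds a dataset of reference-degraded-target triplets and designs an attention alignment loss and a detail encoder to correct the inconsistencies.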

Details Motivation: Existing customized generation methods struggle to keep fine-grained details consistent with the reference image, producing visible inconsistencies. Method: Builds a dataset of reference-degraded-target triplets through VLM-based selection and explicit degradation; designs an attention alignment loss and a detail encoder, informed by the model's attention mechanisms and intrinsic representations, to correct inconsistencies precisely. Result: Experiments show that ImageCritic resolves detail inconsistencies in various customized generation scenarios and clearly improves on existing methods. Conclusion: Through its reference-guided post-editing framework, ImageCritic repairs details in generated images precisely and can be integrated into an agent framework for multi-round, local editing, improving consistency in complex scenarios. Abstract: Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images with a reference-guided post-editing approach, and we present ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.

[211] Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

Seyede Niloofar Hosseini,Ali Mojibi,Mahdi Mohseni,Navid Arjmand,Alireza Taheri

Main category: cs.CV

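TL;DR: This study uses deep neural networks with bidirectional long short-term memory (BLSTM) and transformer architectures to predict whole-body posture during dynamic load-reaching activities, and proposes a new method that keeps body segment lengths constant by optimizing a new cost function, which improves prediction accuracy.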

Details Motivation: 探索深度神经网络在动态负载伸展活动中对人体姿态进行准确预测的应用潜力,特别是在手动物料搬运任务中理解与预测运动动态的需求。 Method: 使用来自20名正常体重健康男性个体的3D全身运动数据训练两种时间序列模型(BLSTM和Transformer),输入包括手-负载位置、举重与操作方式、身体参数及任务前25%时间内的姿态数据,预测剩余75%时段的姿态;并引入保持身体节段长度不变的新损失函数以提升模型精度。 Result: 新损失函数使手臂和腿部模型的预测误差分别降低约8%和21%;Transformer模型比BLSTM模型长期预测准确性高约58%,均方根误差为47.0毫米。 Conclusion: Transformer架构结合约束身体节段长度的损失函数可有效提升动态姿态预测精度,验证了深度学习在理解人工搬运过程中运动动态方面的可行性与优势。 Abstract: This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. 
This study supports the use of neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
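The constant-segment-length idea can be pictured as a penalty term added to an ordinary coordinate error. The sketch below is only an illustration under stated assumptions: the abstract does not give the actual cost function, so the skeleton definition (`segments`), the reference lengths (`ref_lengths`), the weight `lam`, and the function names are all hypothetical.

```python
import math

def segment_length(pose, a, b):
    """Euclidean distance between joints a and b of one 3D pose."""
    return math.dist(pose[a], pose[b])

def posture_loss(pred, target, segments, ref_lengths, lam=1.0):
    """Mean squared coordinate error plus a segment-length penalty.

    pred, target: lists of (x, y, z) joint coordinates for one frame.
    segments: (joint_a, joint_b) index pairs (an assumed skeleton).
    ref_lengths: the subject's known length for each segment.
    lam: hypothetical weight of the length term.
    """
    coord = sum((p - t) ** 2
                for pj, tj in zip(pred, target)
                for p, t in zip(pj, tj)) / (3 * len(pred))
    length = sum((segment_length(pred, a, b) - ref) ** 2
                 for (a, b), ref in zip(segments, ref_lengths)) / len(segments)
    return coord + lam * length
```

A prediction that stretches a 300 mm segment to 330 mm is penalized twice: once for the coordinate error and once for violating the segment length, which is what pushes the network toward anatomically consistent poses.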

[212] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Xinhao Liu,Jiaqi Li,Youming Deng,Ruxin Chen,Yingjia Zhang,Yifei Ma,Li Guo,Yiming Li,Jing Zhang,Chen Feng

Main category: cs.CV

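TL;DR: Wanderland is a real-to-sim framework that targets the bottleneck of reproducible closed-loop evaluation in embodied AI through high-fidelity simulation. It supports multi-sensor capture, reliable reconstruction, and robust view synthesis in urban environments, providing a trusted benchmark for visual navigation, 3D reconstruction, and novel view synthesis.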

Details Motivation: Existing video-3DGS methods leave large visual and geometric sim-to-real gaps, which makes them unsuitable for reliable closed-loop evaluation of embodied AI (e.g., visual navigation); a high-fidelity, reproducible evaluation platform for complex open-world environments is missing. Method: The Wanderland framework combines multi-sensor capture, accurate geometric reconstruction, and robust view synthesis to build a dataset of indoor-outdoor urban scenes, and systematically evaluates how image-only pipelines scale and how geometry quality affects novel view synthesis and navigation policy learning. Result: Image-only pipelines scale poorly, and geometry quality strongly affects both novel view synthesis and the reliability of navigation policy learning; the resulting high-quality, diverse urban scene dataset supports benchmarking of several tasks. Conclusion: Wanderland provides a new foundation for open-world embodied AI research, enabling more realistic, reliable, and reproducible closed-loop evaluation and supporting the joint development of visual navigation, 3D reconstruction, and novel view synthesis models. Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.

[213] ShapeGen: Towards High-Quality 3D Shape Synthesis

Yangguang Li,Xianglong He,Zi-Xin Zou,Zexiang Liu,Wanli Ouyang,Ding Liang,Yan-Pei Cao

Main category: cs.CV

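TL;DR: This paper presents ShapeGen, which achieves high-quality image-to-3D shape generation by improving 3D representation and supervision, scaling up resolution, and exploiting the advantages of linear transformers, markedly improving detail and structural integrity and reaching new state-of-the-art performance.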

Details Motivation: Existing 3D shape generation methods suffer from missing details, overly smoothed surfaces, and fragmented thin-shell structures, and fall short of the quality artists expect from 3D assets. Method: ShapeGen improves the 3D representation and supervision strategy, generates at higher resolution, and exploits the advantages of linear transformers to raise generation quality. Result: Experiments validate the contribution of each improvement; the generated shapes are markedly better than those of existing methods in detail, surface quality, and structural coherence. Conclusion: Through the combined effect of these improvements, ShapeGen achieves a significant leap in image-to-3D generation, advancing high-fidelity 3D asset generation with good potential for industrial use. Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.

[214] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen,Zhonghao Wang,Qi Chen,Zhifan Ye,Min Shi,Yue Zhao,Yinan Zhao,Hui Qu,Wei-An Lin,Yiru Shen,Ajinkya Kale,Irfan Essa,Humphrey Shi

Main category: cs.CV

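TL;DR: This paper proposes two complementary methods (MapReduce LoRA and RaTE) to address the "alignment tax" in multi-preference alignment, improving multiple dimensions at once on text-to-image, text-to-video, and language tasks.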

Details Motivation: Existing reward-model-based reinforcement learning alignment methods incur an "alignment tax" when optimizing multiple preferences: improving one dimension can degrade others, which limits how well a generative model can be aligned overall. Method: MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model. Reward-aware Token Embedding (RaTE) learns reward-specific token embeddings that compose at inference time for flexible preference control. Result: On text-to-image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev), GenEval, PickScore, and OCR improve by 36.1%, 4.6%, and 55.7%, and by 32.7%, 4.3%, and 67.1%, respectively. On text-to-video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%. On the language task (Llama-2 7B), helpful and harmless improve by 43.4% and 136.7%. Conclusion: The framework achieves state-of-the-art multi-preference alignment across modalities and offers a new recipe for aligning generative models with multi-dimensional human preferences. Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
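The map-then-reduce loop described for MapReduce LoRA can be sketched in a few lines. This is a toy illustration, not the paper's procedure: the abstract says only that experts are trained in parallel and iteratively merged, so the uniform averaging in `merge_experts`, the stand-in "training" in `make_expert`, and all names here are assumptions.

```python
def make_expert(target, lr=0.5):
    """Hypothetical expert: returns a weight delta that moves toward its own target."""
    def train(base):
        return {name: [lr * (t - w) for w, t in zip(weights, target[name])]
                for name, weights in base.items()}
    return train

def merge_experts(base, deltas):
    """Reduce step: add the uniform average of the expert deltas to the base (assumed rule)."""
    n = len(deltas)
    return {name: [w + sum(d[name][i] for d in deltas) / n
                   for i, w in enumerate(weights)]
            for name, weights in base.items()}

def mapreduce_round(base, experts):
    """One round: every expert adapts the same base (map), then the deltas are merged (reduce)."""
    deltas = [train(base) for train in experts]  # independent, so parallelizable
    return merge_experts(base, deltas)
```

Because every expert starts from the same merged base in each round, repeated rounds move the shared model toward a compromise between the preference targets rather than toward any single one.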

[215] iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

Zhoujie Fu,Xianfang Zeng,Jinghong Lan,Xinyao Liao,Cheng Chen,Junyi Chen,Jiacheng Wei,Wei Cheng,Shiyu Liu,Yunuo Chen,Gang Yu,Guosheng Lin

Main category: cs.CV

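TL;DR: iMontage is a unified framework that repurposes a pre-trained video model for image generation, injecting the diversity of image data to widen the dynamic range while preserving temporal coherence.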

Details Motivation: Existing video generation models are constrained by the dynamic range of their continuous training data and lack the rich diversity of image data, which makes it hard to generate image sets with both natural transitions and large variation. Method: iMontage uses a lightweight adaptation strategy, together with tailored data curation and a tailored training paradigm, to turn a video model into a many-in-many-out image generator while preserving its original motion priors. Result: iMontage performs strongly across several many-in-many-out tasks, generating contextually consistent and highly dynamic image sets beyond the scope of conventional methods. Conclusion: By fusing the temporal coherence of video with the content diversity of images, iMontage extends the reach of generative models and enables flexible, powerful image-set generation and editing. Abstract: Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.

[216] MotionV2V: Editing Motion in a Video

Ryan Burgert,Charles Herrmann,Forrester Cole,Michael S Ryoo,Neal Wadhwa,Andrey Voynov,Nataniel Ruiz

Main category: cs.CV

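TL;DR: This paper proposes a new approach to video motion editing based on editing sparse trajectories. By building a dataset of "motion counterfactuals" and fine-tuning a motion-conditioned video diffusion model, it achieves precise motion control over existing videos with edits that propagate naturally.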

Details Motivation: Generative video models have made remarkable progress in fidelity and consistency, but applying them to video editing remains challenging. Existing work on motion control focuses on text-to-video generation or image animation, while precise motion control for editing existing videos is still under-explored. Method: Sparse trajectories are extracted from the input video, and the deviation between input and output trajectories is defined as a "motion edit". A pipeline generates "motion counterfactual" video pairs (same content, different motion), and a motion-conditioned video diffusion model is fine-tuned on this data so that edits can start at any timestamp and propagate naturally. Result: The method edits motion flexibly without changing the video content and supports edit propagation from any timestamp; in a four-way head-to-head user study, the model achieves over 65% preference against prior work. Conclusion: The sparse-trajectory representation of motion edits is an effective and promising paradigm for video editing; coupled with a generative model it enables precise, natural motion editing and suggests a new direction for future video editing techniques. Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V

[217] Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

Wei Tang,Zuo-Zheng Wang,Kun Zhang,Tong Wei,Min-Ling Zhang

Main category: cs.CV

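TL;DR: This paper proposes CAPNET, a new framework that addresses class imbalance in long-tailed multi-label visual recognition by modeling label correlations with CLIP's text encoder and combining a graph convolutional network with learnable prompts to improve generalization to tail classes.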

Details Motivation: Existing methods for long-tailed multi-label recognition learn label relationships directly from imbalanced data, which yields unreliable correlations for tail classes, and CLIP's single-label matching paradigm is poorly suited to multi-label tasks. Method: CAPNET extracts semantic correlations from CLIP's text encoder, uses a graph convolutional network for label-aware propagation, introduces learnable soft prompts to refine the embeddings, trains with a distribution-balanced Focal loss with class-aware re-weighting, and combines test-time ensembling with parameter-efficient fine-tuning to realign the two modalities. Result: CAPNET substantially outperforms existing methods on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE, validating its effectiveness for long-tailed multi-label recognition. Conclusion: By explicitly modeling label correlations and realigning modalities in a parameter-efficient way, CAPNET mitigates the bias of multi-label learning under long-tailed distributions and improves overall performance. Abstract: Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance.
Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.

[218] Concept-Aware Batch Sampling Improves Language-Image Pretraining

Adhiraj Ghosh,Vishaal Udandarao,Thao Nguyen,Matteo Farina,Mehdi Cherti,Jenia Jitsev,Sewoong Oh,Elisa Ricci,Ludwig Schmidt,Matthias Bethge

Main category: cs.CV

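TL;DR: The paper proposes CABS, a flexible, task-adaptive, online concept-based data sampling method built on DataConcept, a dataset of 128M image-text pairs. Two variants (diversity maximization and frequency maximization) construct training batches on the fly and significantly improve CLIP/SigLIP models across 28 benchmarks.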

Details Motivation: Most existing data curation methods are offline and unaware of semantic concepts, which introduces biases and prevents adaptation to specific tasks; a more flexible, online, concept-based sampling strategy is needed. Method: The authors build DataConcept, a large-scale dataset with fine-grained concept annotations, and propose the Concept-Aware Batch Sampling (CABS) framework with two sampling strategies, Diversity Maximization (CABS-DM) and Frequency Maximization (CABS-FM), which construct batches matching a target distribution during training. Result: Evaluations across 28 benchmarks show that CABS significantly benefits CLIP/SigLIP models, outperforms conventional offline curation, and supports customization for downstream tasks. Conclusion: CABS is an effective open-source approach to online data sampling that advances task-adaptive, concept-aware training of vision-language models. Abstract: What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
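To make the diversity-maximization idea concrete, the sketch below greedily fills a batch with samples whose concepts are still under-represented. It is an assumption-laden illustration: the abstract does not specify the CABS-DM selection rule, so the inverse-count scoring and the `diversity_batch` name are hypothetical, and the concept tags are taken as given (as they would be from DataConcept-style annotations).

```python
from collections import Counter

def diversity_batch(pool, batch_size):
    """Greedily pick samples whose concepts are least represented so far.

    pool: list of (sample_id, set_of_concepts) pairs.
    Returns the chosen sample ids in selection order.
    """
    counts = Counter()
    remaining = list(pool)
    batch = []
    while remaining and len(batch) < batch_size:
        # Score = sum of 1 / (1 + count) over a sample's concepts,
        # so concepts already in the batch contribute less.
        best = max(remaining,
                   key=lambda s: sum(1.0 / (1 + counts[c]) for c in s[1]))
        batch.append(best[0])
        counts.update(best[1])
        remaining.remove(best)
    return batch
```

With a pool of two "dog" images, one "cat" image, and one "car" image, a batch of three skips the second dog image in favor of the unseen concepts. A frequency-oriented variant would swap in a score that rewards samples with many annotated objects instead.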

[219] Vision-Language Memory for Spatial Reasoning

Zuntao Liu,Yi Du,Taimeng Fu,Shaoshu Su,Cherie Ho,Chen Wang

Main category: cs.CV

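TL;DR: This paper presents VLM$^2$, a vision-language model with persistent memory. A dual-memory module enables 3D-aware, long-horizon spatial reasoning from 2D video, reaching state-of-the-art performance among video-only models on multiple benchmarks.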

Details Motivation: Existing vision-language models perform poorly at video-based spatial reasoning, mainly because of semantic-geometric misalignment and the lack of persistent memory for 3D representations. Method: VLM$^2$ introduces a view-consistent, 3D-aware representation and a dual-memory module consisting of a working memory (a sliding window) and an episodic memory (long-term storage of critical information) to support efficient, long-horizon spatial reasoning. Result: Experiments on multiple spatial reasoning benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing visual-spatial intelligence. Conclusion: With a persistent memory mechanism and a 3D-aware representation, VLM$^2$ addresses semantic-geometric misalignment and missing memory, moving long-horizon spatial reasoning closer to human-level performance. Abstract: Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
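The dual-memory design can be illustrated with a small data structure: a bounded sliding window for recent frames and a bounded store for frames worth keeping. This is a sketch under assumptions, not the model's actual module: the scalar `importance` score, the keep-the-top-scores consolidation rule, and the class name `DualMemory` are all hypothetical.

```python
from collections import deque

class DualMemory:
    """Toy working + episodic memory with a fixed total size."""

    def __init__(self, window=4, capacity=8):
        self.working = deque(maxlen=window)  # sliding window of recent frames
        self.episodic = []                   # (importance, frame) kept long-term
        self.capacity = capacity

    def add(self, frame, importance):
        # A frame about to leave the window is offered to episodic memory.
        if len(self.working) == self.working.maxlen:
            old_frame, old_importance = self.working[0]
            self._consolidate(old_frame, old_importance)
        self.working.append((frame, importance))

    def _consolidate(self, frame, importance):
        # Keep only the highest-scoring frames (assumed consolidation rule).
        self.episodic.append((importance, frame))
        self.episodic.sort(key=lambda item: item[0], reverse=True)
        del self.episodic[self.capacity:]

    def context(self):
        """Frames visible to the model: long-term entries, then the recent window."""
        return [f for _, f in self.episodic] + [f for f, _ in self.working]
```

Because both stores are bounded, the context never exceeds `window + capacity` frames no matter how long the video runs, which is the sense in which such a design has a fixed computational cost.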

[220] PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu,Wei Xiong,Weili Nie,Yichen Sheng,Shiqiu Liu,Jiebo Luo

Main category: cs.CV

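TL;DR: The paper proposes PixelDiT, a single-stage, end-to-end model that runs the diffusion process directly in pixel space. With a dual-level Transformer architecture, it surpasses existing pixel-space generative models in image generation quality.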

Details Motivation: Latent-space modeling relies on a two-stage pipeline in which the pretrained autoencoder introduces lossy reconstruction, causing error accumulation and hindering joint optimization. Method: PixelDiT adopts a dual-level DiT design with a patch-level and a pixel-level component and is trained end to end directly in pixel space, with no autoencoder. Result: PixelDiT reaches 1.61 FID on ImageNet 256x256, well ahead of existing pixel generative models; extended to text-to-image generation at 1024x1024 resolution, it scores 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Conclusion: Through effective pixel-level token modeling, PixelDiT achieves high-quality end-to-end diffusion in pixel space, combining detail preservation with efficient training. Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

[221] 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

Xiaoye Wang,Chen Tang,Xiangyu Yue,Wei-Hong Li

Main category: cs.CV

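TL;DR: This paper proposes a 3D-aware multi-task learning method that introduces a Cross-view Module (CvM) to model geometric consistency through cost volumes, improving multi-task dense prediction such as segmentation and depth estimation.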

Details Motivation: Existing MTL methods mainly model cross-task relations in the 2D image space and lack 3D-awareness, so they struggle to capture the geometric consistency that scene understanding depends on. Method: A Cross-view Module (CvM) builds on the multi-task encoder features and uses cross-view correlations (i.e., a cost volume) to model geometric consistency; the module is lightweight, architecture-agnostic, and applicable to both single- and multi-view data. Result: Experiments on NYUv2 and PASCAL-Context validate the method, which improves the performance of existing MTL methods. Conclusion: Introducing 3D-aware cross-view consistency helps model task correlations and improves multi-task dense prediction, while remaining general and extensible. Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.

[222] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Tahira Kazimi,Connor Dunlop,Pinar Yanardag

Main category: cs.CV

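TL;DR: This paper proposes DPP-GRPO, a new framework for diverse video generation that combines Determinantal Point Processes (DPPs) with Group Relative Policy Optimization (GRPO), markedly increasing the diversity of text-to-video generation while preserving prompt fidelity and perceptual quality.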

Details Motivation: Existing text-to-video diffusion models tend to produce low-diversity outputs when sampling multiple videos from a single text prompt; the paper addresses this through set-level policy optimization. Method: The DPP-GRPO framework uses a DPP to impose diminishing returns on redundant samples, which explicitly rewards diversity, and uses GRPO to provide groupwise feedback over candidate sets; the method is plug-and-play and model-agnostic. Result: Implemented on WAN and CogVideoX, the method consistently improves video diversity on benchmarks such as VBench and VideoScore and in human preference studies. Conclusion: DPP-GRPO improves the diversity of text-to-video generation while maintaining prompt alignment and visual quality, and it generalizes well; the authors also release their code and a new benchmark dataset of 30,000 diverse prompts to support future research. Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
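The "diminishing returns on redundant samples" property of a DPP comes from scoring a set by the determinant of its similarity kernel, which shrinks toward zero as items become alike. The sketch below shows that behavior on toy feature vectors. It is illustrative only: how the paper embeds videos and turns the determinant into a reward is not stated in the abstract, so the cosine kernel and the function names are assumptions.

```python
import math

def cosine_kernel(feats):
    """Pairwise cosine similarity matrix of the feature vectors."""
    unit = [[x / math.sqrt(sum(v * v for v in f)) for x in f] for f in feats]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0  # singular: at least one item is redundant
        if p != i:
            a[i], a[p] = a[p], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return det

def diversity_score(feats):
    """DPP-style set score: 1 for orthogonal items, 0 if any two coincide."""
    return determinant(cosine_kernel(feats))
```

Two orthogonal embeddings score 1, two identical embeddings score 0, and a pair at 45 degrees scores 0.5, so adding a near-duplicate video to a group contributes almost nothing to the set-level signal.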

[223] LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Yunze Man,Shihao Wang,Guowen Zhang,Johan Bjorck,Zhiqi Li,Liang-Yan Gui,Jim Fan,Jan Kautz,Yu-Xiong Wang,Zhiding Yu

Main category: cs.CV

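TL;DR: This paper presents LocateAnything3D, a new approach that casts 3D detection as a next-token prediction problem inside a VLM. Using Chain-of-Sight reasoning, it performs multi-object 3D detection with open vocabulary and visual prompting, reaching state-of-the-art performance on the Omni3D benchmark.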

Details Motivation: Existing vision-language models (VLMs) excel at 2D description and grounding but lack support for multi-object 3D detection, whereas real applications need a model to name what it sees and know where it is in 3D. Method: A VLM-native 3D detection framework introduces an explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: detect in 2D first, then infer distance, size, and pose. It follows an easy-to-hard curriculum, generating objects in near-to-far order and factorizing each object into center, dimensions, and rotation. Result: The model achieves 49.89 AP_3D on the Omni3D benchmark, 15.51 above the previous best, and keeps that lead even when the baseline is given ground-truth 2D boxes; it also shows strong zero-shot generalization and robustness on held-out categories. Conclusion: By modeling 3D detection as an ordered token-generation task, LocateAnything3D gives vision-language models a practical foundation for 3D perception while preserving open-vocabulary and visual-prompting capability without specialized detection heads. Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by an absolute +15.51 even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
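The ordering that Chain-of-Sight imposes on the output sequence can be sketched as a serialization routine: 2D boxes first, then 3D boxes from near to far, each split into center, dimensions, and rotation. The tuple-based "tokens", the dictionary keys, and the choice to keep 2D detections in input order are assumptions made for illustration; the model's real tokenization is not described in the abstract.

```python
import math

def chain_of_sight(objects):
    """Serialize detections: all 2D boxes, then 3D boxes in near-to-far order.

    objects: dicts with 'label', 'box2d', 'center' (x, y, z from the camera),
    'dims', and 'rot' (hypothetical field names).
    """
    # Stage 1: 2D detections act as the visual chain-of-thought.
    tokens = [("box2d", o["label"], o["box2d"]) for o in objects]
    # Stage 2: nearest objects first, each factorized center -> dims -> rot.
    for o in sorted(objects, key=lambda o: math.hypot(*o["center"])):
        tokens.append(("center", o["label"], o["center"]))
        tokens.append(("dims", o["label"], o["dims"]))
        tokens.append(("rot", o["label"], o["rot"]))
    return tokens
```

For a scene with a chair 5 m away and a cup 1 m away, the 3D part of the sequence starts with the cup, so the easier, nearer object is resolved before the farther one.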

[224] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe,Tuna Han Salih Meral,Adil Kaan Akan,Kaan Oktay,Pinar Yanardag

Main category: cs.CV

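TL;DR: This paper presents ∞-RoPE, a training-free inference-time framework whose three components (Block-Relativistic RoPE, KV Flush, and RoPE Cut) address three bottlenecks of autoregressive video diffusion models in temporal modeling, prompt responsiveness, and scene transitions, enabling infinite-length, controllable video generation with cuts between shots.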

Details Motivation: Existing autoregressive video diffusion models are limited by fixed temporal positional encodings, slow prompt responsiveness during long rollouts, and the inability to produce discontinuous shot transitions within a single generation, which keeps them from long, controllable, cinematic video generation. Method: The ∞-RoPE framework has three core components: (1) Block-Relativistic RoPE recasts temporal encoding as a moving local reference frame, rotating each newly generated block relative to the maximum frame horizon so that relative temporal geometry is preserved; (2) KV Flush keeps only the global sink and the last frame in the KV cache, giving fast prompt response without re-encoding; (3) RoPE Cut introduces controlled discontinuities in the RoPE coordinates, supporting multiple shot cuts within a single rollout. Result: Experiments show that ∞-RoPE consistently outperforms previous autoregressive models on VBench and other metrics, producing longer, more coherent, and more controllable videos and achieving cinematic cut transitions. Conclusion: ∞-RoPE is a unified, training-free framework that removes the temporal limits, control latency, and shot-continuity constraints of existing video diffusion models, offering an effective solution for infinite-horizon, highly controllable video generation with complex editing. Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout.
Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
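Of the three components, KV Flush is the simplest to picture: the cache is cut down to the global sink and the newest latent frame. The toy rollout below uses strings as stand-ins for cached key/value entries. It rests on assumptions the abstract does not confirm: that the sink is the first cache entry, and that the flush happens whenever the prompt changes; the function names are hypothetical.

```python
def kv_flush(cache):
    """Keep only the global sink (assumed to be entry 0) and the newest latent frame."""
    if len(cache) <= 2:
        return list(cache)
    return [cache[0], cache[-1]]

def rollout(prompts, frames_per_prompt):
    """Toy rollout that flushes the cache at each prompt switch (assumed trigger)."""
    cache, sizes = ["sink"], []
    for prompt in prompts:
        cache = kv_flush(cache)  # drop frames conditioned on the old prompt
        for i in range(frames_per_prompt):
            cache.append(f"{prompt}:{i}")  # stand-in for a new latent frame's KV
        sizes.append(len(cache))
    return cache, sizes
```

After a switch from "walk" to "jump", the only remnant of the old prompt is its final frame, which preserves visual continuity while letting the new prompt dominate the context immediately.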

[225] MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

Tooba Tehreem Sheikh,Jean Lahoud,Rao Muhammad Anwer,Fahad Shahbaz Khan,Salman Khan,Hisham Cholakkal

Main category: cs.CV

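TL;DR: This paper presents MedROV, the first real-time open-vocabulary object detection model for medical imaging. It builds the large-scale Omnis dataset and uses a pseudo-labeling strategy to handle missing annotations, and it leverages contrastive learning and cross-modal representations to detect both known and novel structures, with substantial gains in both accuracy and speed.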

Details Motivation: Most existing medical image detection models operate in a closed-set setting and cannot recognize objects of novel classes, while open-vocabulary detection has been held back in the medical domain by data scarcity and weak text-image alignment; a real-time detection method that generalizes to new classes is needed. Method: MedROV is trained on Omnis, a large-scale dataset of 600K detection samples across multiple imaging modalities, with a pseudo-labeling strategy to handle missing annotations in multi-source data; it incorporates knowledge from a large foundation model and combines contrastive learning with cross-modal representations to strengthen generalization. Result: MedROV improves on the previous state-of-the-art foundation model by an average of 40 mAP50 absolute and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS. Conclusion: MedROV delivers efficient, accurate open-vocabulary detection for medical images, combining real-time speed with strong generalization and setting a new benchmark for medical image analysis. Abstract: Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.

[226] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

Xuelu Feng,Yunsheng Li,Ziyu Wan,Zixuan Gao,Junsong Yuan,Dongdong Chen,Chunming Qiao

Main category: cs.CV

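TL;DR: The paper proposes RubricRL, a framework that improves the alignment of text-to-image generation models through an interpretable, modular reward mechanism based on fine-grained visual criteria.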

Details Motivation: Existing reinforcement learning methods for aligning text-to-image models rely on composite metrics with fixed weights or on black-box scalar rewards, which limits interpretability and flexibility. Method: RubricRL dynamically builds a structured rubric for each prompt, with independent criteria such as object correctness, attribute accuracy, OCR fidelity, and realism; each criterion is scored by a multimodal judge model, and a prompt-adaptive weighting mechanism combines them. Result: Experiments with an autoregressive text-to-image model show that RubricRL improves prompt faithfulness, visual detail, and generalization while providing interpretable and extensible supervision signals. Conclusion: With a decomposable, adjustable reward structure, RubricRL increases user control and the transparency of alignment, offering a flexible and general solution for RL alignment of text-to-image models. Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
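A rubric-based reward reduces to combining per-criterion judge scores with per-prompt weights. The sketch below is a minimal illustration, assuming scores in [0, 1] and a normalized weighted average; the actual aggregation and weighting used by RubricRL are not specified in the abstract, and `rubric_reward` and the criterion names are hypothetical.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores (each assumed in [0, 1]).

    scores: {criterion: score} as returned by a judge for one image.
    weights: {criterion: weight} chosen per prompt; zero disables a criterion.
    """
    total = sum(weights[c] for c in scores)
    if total == 0:
        return 0.0
    return sum(scores[c] * weights[c] for c in scores) / total
```

For a prompt with no text to render, setting the OCR weight to zero removes that criterion from the reward without touching the others, which is the kind of direct user control a decomposable rubric allows.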