cs.CL

[1] Decoder-based Sense Knowledge Distillation

Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis

Main category: cs.CL

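TL;DR: This paper proposes Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates structured lexical knowledge from dictionaries (such as word senses and relationships) into the training of decoder-style large language models. It needs no dictionary lookup at inference time and significantly improves knowledge distillation while keeping training efficient.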

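Details Motivation: Large language models learn rich semantic representations but often overlook structured lexical knowledge such as word senses and inter-word relationships. Prior work validated dictionary-based knowledge distillation for encoder models, but applying it to decoder-style generative models remains challenging. Method: Proposes the DSKD framework, which integrates lexical resources such as dictionaries into the training of decoder-style LLMs and avoids dictionary lookup at inference time, transferring structured semantic knowledge to generative models. Result: Experiments on multiple benchmarks show that DSKD significantly improves knowledge distillation performance for generative models, letting them inherit structured semantics while keeping training efficient. Conclusion: DSKD offers a new paradigm for integrating structured lexical knowledge into decoder-style LLMs, balancing knowledge enhancement with inference efficiency. Abstract: Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoders as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.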

[2] Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts

Arno Simons

Main category: cs.CL

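TL;DR: This paper examines whether large language models (LLMs) can support interpretative citation context analysis (CCA) through thick, text-grounded close reading of a single case rather than by scaling up typological labels. It focuses on prompt sensitivity and uses a two-stage GPT-5 pipeline for surface classification and cross-document interpretative reconstruction, finding that prompt design markedly shapes the interpretative tendencies and vocabulary of the model's output.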

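Details Motivation: To test whether LLMs can handle citation context analysis that demands deep textual understanding and interpretative flexibility rather than shallow classification alone, and to examine the systematic effect of prompt engineering on interpretative output. Method: A balanced 2x3 experimental design varies prompt scaffolding and framing. Footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction serve as the probe. A two-stage GPT-5 pipeline performs surface classification followed by cross-document interpretative reconstruction. The 90 reconstructions are read closely and coded inductively, yielding 21 interpretative moves, and linear probability models estimate how prompt variables affect their frequencies and vocabulary. Result: GPT-5's surface classification is highly stable, consistently labelling the citation "supplementary". Reconstruction yields 450 hypotheses whose distribution depends on the prompt, grouped into 21 interpretative moves. Scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. The model detects the same textual hinges as Gilbert but more often resolves them as lineage and positioning than as admonishment. Conclusion: LLMs can act as inspectable, contestable co-analysts for interpretative CCA, but their output depends strongly on the prompt. Prompt design is not neutral: it systematically shapes the interpretative space and discourse style, so it needs careful design and transparent reporting in human-AI collaborative analysis. Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.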

[3] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

Rakib Ullah,Mominul islam,Md Sanjid Hossain,Md Ismail Hossain

Main category: cs.CL

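TL;DR: This paper presents Bn-HIB, the first dataset for fine-grained classification of Bengali internet memes, and MCFM, a multimodal co-attention fusion model that improves recognition of benign, hate, and inflammatory content.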

Details Motivation: Bengali is a low-resource language that lacks both research on and annotated data for detecting satirical, culturally specific hate and inflammatory content in memes. Method: Builds the Bn-HIB dataset of 3,247 manually annotated Bengali memes in three classes (Benign, Hate, Inflammatory) and proposes MCFM, a multimodal co-attention fusion model that jointly models the key features of the image and text modalities. Result: MCFM significantly outperforms several state-of-the-art multimodal models on Bn-HIB, confirming its effectiveness for fine-grained Bengali meme classification. Conclusion: The work fills a gap in content-safety research on memes in low-resource languages and offers a new benchmark and method for later cross-lingual, cross-cultural multimodal harmful-content detection. Abstract: Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

[4] SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context

Aishwarya Verma,Laud Ammah,Olivia Nercy Ndlovu Lucas,Andrew Zaldivar,Vinodkumar Prabhakaran,Sunipa Dev

Main category: cs.CL

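TL;DR: This paper presents a method for building a multilingual stereotype resource for four sub-Saharan African countries (Ghana, Kenya, Nigeria, South Africa). It uses community-engaged telephone surveys conducted in native languages, covers 15 native languages plus English, collects 6,740 stereotypes in total, and emphasises cultural fit and representativeness.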

Details Motivation: Existing stereotype repositories lack global coverage and severely neglect sub-Saharan Africa in particular. Closing targeted gaps in regional and linguistic representation matters more than simply adding data. Method: Uses socioculturally situated, community-engaged methods, including telephone surveys conducted in local languages, with sampling balanced across ethnic and demographic dimensions. Result: Produces a stereotype dataset covering four countries, English, and 15 native languages, with 3,534 English entries and 3,206 entries in native languages (6,740 in total). Conclusion: The work provides a reproducible methodology that respects linguistic diversity and oral tradition, laying groundwork for better safety evaluation of generative AI in the Global South, especially in African contexts. Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.

[5] Causality $\neq$ Invariance: Function and Concept Vectors in LLMs

Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson

Main category: cs.CL

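TL;DR: This paper studies how abstractly large language models (LLMs) represent concepts. Function Vectors (FVs) drive in-context learning but are not invariant to input format. The newly identified Concept Vectors (CVs), built from attention heads selected by Representational Similarity Analysis for encoding concepts stably across formats, generalize better across tasks and languages, indicating that LLMs hold abstract concept representations distinct from the mechanism that drives ICL.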

Details Motivation: To investigate whether LLMs represent concepts abstractly, independent of input format, and in particular whether Function Vectors (FVs) are truly abstract. Method: Introduces Concept Vectors (CVs), using Representational Similarity Analysis (RSA) to select attention heads that encode the same concept consistently across input formats (e.g., open-ended vs. multiple-choice), and compares FVs and CVs on cross-format and cross-language transfer and on causal intervention (steering). Result: FVs extracted from different input formats are nearly orthogonal and therefore not format-invariant. CVs are composed of different heads located in similar layers and generalize markedly better than FVs across question types and languages. Steering experiments show that CVs suit out-of-distribution settings better. Conclusion: LLMs do contain abstract concept representations, but these representations (CVs) differ from those that drive in-context learning performance (FVs), and the two arise from different mechanisms. Abstract: Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
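To make "nearly orthogonal" concrete, the following is a minimal sketch of the cosine similarity one might compute between two vectors extracted from different input formats. The vectors are toy values invented for illustration and are not taken from the paper; a cosine near 0 corresponds to near-orthogonality, a cosine near 1 to close alignment.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (NOT from the paper), standing in for
# representations of one concept under two input formats.
nearly_orthogonal_a = [1.0, 0.0, 0.2]
nearly_orthogonal_b = [0.0, 1.0, 0.1]
aligned_a = [0.9, 0.8, 0.1]
aligned_b = [1.0, 0.7, 0.2]

low = cosine_similarity(nearly_orthogonal_a, nearly_orthogonal_b)
high = cosine_similarity(aligned_a, aligned_b)
print(f"nearly orthogonal pair: {low:.3f}")
print(f"aligned pair:           {high:.3f}")
```

The first pair illustrates the kind of relationship the abstract reports for FVs across formats; the second shows what a format-stable representation would look like under the same measure.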

[6] A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Rahat Uddin Azad,Saydul Akbar Murad,Nick Rahimi

Main category: cs.CL

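TL;DR: This paper proposes a model that fuses BanglaBERT-Large with a two-layer stacked LSTM for multi-label cyberbullying detection in Bangla, a low-resource language, modelling semantic context and sequential dependencies together and evaluating on a public multi-label dataset.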

Details Motivation: Most existing work uses single-label classification and cannot handle the realistic case where one comment contains several types of bullying at once (such as threats, hate speech, and harassment). Multi-label cyberbullying detection for Bangla, a low-resource language, is also under-studied and lacks robust pre-trained models. Method: Proposes a fusion architecture of BanglaBERT-Large and a two-layer stacked LSTM that jointly models contextual semantics and sequential dependencies; fine-tunes it on a public multi-label Bangla cyberbullying dataset; applies several sampling strategies to mitigate class imbalance; and evaluates with 5-fold cross-validation and multiple metrics including accuracy, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. Result: The fusion model shows better overall performance than single models on Bangla multi-label cyberbullying detection, with robust results on key metrics such as F1-score and AUC-ROC, supporting its generalization and practicality. Conclusion: An architecture that combines the semantic modelling of Transformers with the sequential modelling of LSTMs can improve multi-label cyberbullying detection in low-resource languages and offers a transferable approach for similar tasks. Abstract: Cyberbullying has become a serious and growing concern in today's virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
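Of the listed metrics, Hamming loss is the one specific to the multi-label setting: it is the fraction of individual label slots predicted incorrectly. The sketch below computes it in plain Python for the four labels named in the abstract. The three example comments and their predictions are made up for illustration and do not come from the dataset.

```python
LABELS = ["cyberbully", "sexual harassment", "threat", "spam"]


def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots where prediction and truth differ."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# Hypothetical multi-hot annotations: one row per comment, one column per label.
y_true = [
    [1, 0, 1, 0],  # bullying + threat in the same comment
    [0, 0, 0, 1],  # spam only
    [1, 1, 0, 0],  # bullying + sexual harassment
]
y_pred = [
    [1, 0, 0, 0],  # missed the threat label
    [0, 0, 0, 1],  # fully correct
    [1, 0, 0, 1],  # missed harassment, added spam
]

# 3 of the 12 label slots differ, so the loss is 0.25.
print(hamming_loss(y_true, y_pred))
```

Unlike exact-match accuracy, which would score only one of these three comments as correct, Hamming loss gives partial credit for each label predicted correctly.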

[7] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel,Vishvesh Trivedi,Yue Han,Yihuai Hong,Eunsol Choi

Main category: cs.CL

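TL;DR: This paper studies retrieval heads in multilingual large language models and a new head type that appears in cross-lingual settings, Retrieval-Transition heads (RTH). It finds that RTHs are vital for multilingual chain-of-thought reasoning and matter more than conventional retrieval heads.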

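Details Motivation: To investigate how attention heads divide their functions in multilingual LLMs, especially how the mapping to the target language happens in cross-lingual generation, and to fill the gap in understanding retrieval mechanisms in multilingual settings. Method: Analyzes attention-head behaviour in multilingual Transformer models to identify and define Retrieval-Transition heads (RTH), then uses masking experiments to compare the effect of RTHs and retrieval heads (RH) on downstream task performance. Result: On multilingual benchmarks (MMLU-ProX, MGSM, MLQA, XQuaD), RTHs are shown to be distinct from RHs, and masking RTHs causes a larger performance drop. The finding holds for both the Qwen-2.5 and Llama-3.1 model families. Conclusion: RTHs are the key attention heads responsible for mapping to the target language in multilingual LLMs. Their role is distinct from and stronger than that of conventional retrieval heads, and it is especially important for multilingual chain-of-thought reasoning. Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.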

[8] Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models

Binchi Zhang,Xujiang Zhao,Jundong Li,Haifeng Chen,Zhengzhang Chen

Main category: cs.CL

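TL;DR: This paper proposes CultureManager, a new pipeline for task-specific cultural alignment. It synthesizes task-aware cultural data and manages multi-culture knowledge with a culture router, outperforming baselines across multiple cultures and tasks.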

Details Motivation: Existing cultural alignment methods cannot align an LLM's broad cultural values with the specific goals of downstream tasks, and they suffer from cross-culture interference. Method: CultureManager synthesizes task-aware cultural data in the target task's format, grounded in culturally relevant web search results, and uses a culture router to manage multi-culture knowledge held in separate adapters, selecting the appropriate adapter to apply. Result: Experiments on ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Conclusion: Task adaptation and modular culture management are essential for effective cultural alignment. Abstract: Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs' broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.

[9] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

Jiří Milička,Hana Bednářová

Main category: cs.CL

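TL;DR: This paper builds a corpus named AI Sydney containing 4.5k texts (6M words) on the relationship between humans and AI, generated by 12 frontier models under three personas (Default, Classic Sydney, Memetic Sydney). The corpus is annotated with dependency syntax and openly released.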

Details Motivation: To explore how the personas simulated by LLMs affect the way they express the human-AI relationship, with particular attention to the accidental origin of the Sydney persona, its spread, and its influence on later models, a topic of both cultural and AI-safety significance. Method: Defines three personas via system prompts (Default, Classic Sydney, Memetic Sydney), generates texts with 12 frontier LLMs from OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, builds the AI Sydney corpus of 4.5k samples and 6M words, and annotates it according to Universal Dependencies. Result: A structured, multi-model, multi-persona corpus of texts on the AI-human relationship, AI Sydney, is built and openly released, supporting analysis of language behaviour across models and personas. Conclusion: Persona settings markedly affect how LLMs express the human-AI relationship. The Sydney persona has entered newer models through their training data. The corpus is a reproducible resource for studying social-cognition modelling in LLMs, persona transfer, and safety governance. Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft's Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by the "You are Sydney" system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.

[10] Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models

Craig Myles,Patrick Schrempf,David Harris-Birtill

Main category: cs.CL

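TL;DR: This paper studies how prompt optimisation affects small and large language models on error detection in medical text. It shows that automatic prompt optimisation with GEPA markedly raises the detection accuracy of GPT-5 and Qwen3-32B, approaching the performance of medical doctors and reaching state of the art on the MEDEC dataset.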

Details Motivation: Errors in medical text can delay treatment or lead to incorrect treatment, so detecting them automatically matters for healthcare quality. Current large models still have room to improve on this task, and prompt engineering is a key direction for optimisation. Method: Applies the Genetic-Pareto (GEPA) automatic prompt optimisation method and runs systematic experiments and analysis on frontier and open-source models such as GPT-5 and Qwen3-32B. Result: GEPA raises error detection accuracy from 0.669 to 0.785 for GPT-5 and from 0.578 to 0.690 for Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art results on the MEDEC benchmark. Conclusion: Prompt optimisation, GEPA in particular, substantially improves language model performance on medical error detection and offers an efficient, practical route to quality assurance for clinical text. Abstract: Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection

[11] Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

An-Ci Peng,Kuan-Tang Huang,Tien-Hong Lo,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen

Main category: cs.CL

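TL;DR: This paper proposes a unified RNN-T-based framework that uses dialect-aware modelling to separate dialectal style from linguistic content and parameter-efficient prediction networks to handle Hanzi and Pinyin ASR jointly, markedly reducing error rates on the HAT corpus.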

Details Motivation: Taiwanese Hakka is a low-resource, endangered language with many dialectal variants and two writing systems (Hanzi and Pinyin). Conventional ASR models struggle to separate essential linguistic content from dialect-specific variation, which limits performance. Method: Builds a unified framework on RNN-T, introduces dialect-aware modelling strategies to disentangle dialectal "style" from linguistic "content", and uses parameter-efficient prediction networks to model Hanzi and Pinyin ASR concurrently, with the cross-script objective acting as a mutual regularizer. Result: On the HAT corpus the model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. According to the authors, it is the first systematic study of how Hakka dialectal variation affects ASR and the first single model to handle ASR for both writing systems jointly. Conclusion: Dialect-aware disentanglement and cross-script joint modelling can improve ASR for low-resource, endangered languages and offer a new approach for speech recognition in languages with multiple writing systems. Abstract: Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducer (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal "style" from linguistic "content", which enhances the model's capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
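A relative error rate reduction compares the drop in error with the baseline error, not with 100%. The sketch below shows the arithmetic. The abstract reports only the relative reductions, so the baseline and new error rates used here are hypothetical values chosen purely to illustrate what a reduction of about 57% looks like.

```python
def relative_error_reduction(baseline_error, new_error):
    """Relative reduction: (baseline - new) / baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent (NOT reported in the abstract).
baseline_cer = 20.0
improved_cer = 8.6

reduction = relative_error_reduction(baseline_cer, improved_cer)
print(f"absolute drop: {baseline_cer - improved_cer:.1f} points")
print(f"relative reduction: {reduction:.2%}")
```

Under these assumed numbers the absolute drop is 11.4 points while the relative reduction is roughly 57%, which is why the two figures should not be confused when reading the reported results.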

[12] Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o

Samay Bhojwani,Swarnima Kain,Lisong Xu

Main category: cs.CL

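TL;DR: This paper presents an iterative prompt refinement pipeline built on GPT-4o that generates easy-to-read summaries for readers with dyslexia, and evaluates on about 2,000 news articles how well it balances readability (Flesch >= 90) with semantic fidelity.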

Details Motivation: Existing assistive technologies mostly address visual presentation, while linguistic complexity remains a major barrier to information access for readers with dyslexia, so readability-oriented NLP methods are needed. Method: Builds an iterative prompt refinement pipeline on GPT-4o that generates and refines summaries of news articles toward a target of Flesch Reading Ease >= 90. Result: Most summaries reach the target within four attempts, and many succeed on the first try. A composite score combining readability and semantic fidelity is stable across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. Conclusion: The work provides an empirical baseline for accessibility-driven text summarization and motivates follow-up human-centred evaluation with dyslexic readers. Abstract: Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
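The readability target can be checked with the standard Flesch Reading Ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below is a rough approximation, not the paper's implementation: the syllable counter is a simple vowel-group heuristic introduced here as an assumption, and the helper name `meets_target` is hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count runs of vowels, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Standard Flesch Reading Ease formula applied to approximate counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def meets_target(text, threshold=90.0):
    """True if the text reaches the readability threshold used in the study."""
    return flesch_reading_ease(text) >= threshold


simple = "The cat sat on the mat. It was a big cat."
dense = "Comprehensive institutional accountability necessitates multidimensional evaluation."

print(round(flesch_reading_ease(simple), 2), meets_target(simple))
print(meets_target(dense))
```

An iterative pipeline of the kind described would call a check like `meets_target` after each generated summary and re-prompt when it fails. Short sentences of one-syllable words score well above 90, while a single sentence of long abstract nouns falls far below it.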

[13] Ruyi2 Technical Report

Huan Song,Shuyu Tian,Junyi Hao,Minxiu Xu,Hongjun An,Yiliang Song,Jiawei Shao,Xuelong Li

Main category: cs.CL

TL;DR: Ruyi2 is an adaptive LLM architecture that improves efficiency and scalability via a stable 'Familial Model' trained with 3D parallelism, achieving 2–3× speedup over Ruyi and matching Qwen3 performance.

Details Motivation: To address high deployment costs and latency of LLMs, and overcome optimization complexity and distributed training incompatibility in early-exit and prior adaptive models like Ruyi. Method: Introduces Ruyi2—a familial model built on Megatron-LM using 3D parallel training—enabling variable-depth computation and parameter sharing across model variants. Result: Ruyi2 achieves 2–3 times faster training than Ruyi and matches the performance of same-sized Qwen3 models, validating familial parameter sharing as effective for scalable adaptive inference. Conclusion: Ruyi2 establishes a 'Train Once, Deploy Many' paradigm, demonstrating that family-based parameter sharing effectively balances architectural efficiency and high-performance capabilities in large-scale LLM deployment. Abstract: Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable "Familial Model" based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new "Train Once, Deploy Many" paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.

[14] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia,Ming Xu,Lingxiang Hu,Yiding Sun,Wenwei Li,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang

Main category: cs.CL

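TL;DR: This paper proposes Search-P1, a framework that improves agentic RAG training through path-centric reward shaping, raising average accuracy by 7.7 points across multiple QA benchmarks.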

Details Motivation: Traditional single-round retrieval struggles to support complex multi-step reasoning, and existing RL-based training methods for agentic RAG suffer from sparse rewards and low sample efficiency. Method: Proposes the Search-P1 framework, with two components: (1) Path-Centric Reward, which extracts learning signals even from failed samples through order-agnostic step coverage and soft scoring; and (2) Dual-Track Path Scoring, which assesses reasoning paths from both self-consistency and reference-alignment perspectives using offline-generated reference planners. Result: Significantly outperforms Search-R1 and other strong baselines on multiple QA benchmarks, with an average accuracy gain of 7.7 points. Conclusion: Path-centric reward shaping makes effective use of intermediate signals and improves sample efficiency, offering a more efficient and robust approach to agentic RAG training. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
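One possible reading of "order-agnostic step coverage" is the fraction of reference plan steps that a trajectory hits, ignoring the order in which they occur. The sketch below illustrates that reading only. The abstract does not specify how steps are matched or scored, so the exact-string matching, the function name, and the example steps are all assumptions made here.

```python
def step_coverage(trajectory_steps, reference_steps):
    """Fraction of reference steps present in the trajectory, in any order.

    Assumption: steps are matched by exact string equality. The source does
    not say how Search-P1 matches steps, so this is illustrative only.
    """
    reference = set(reference_steps)
    if not reference:
        return 0.0
    covered = reference & set(trajectory_steps)
    return len(covered) / len(reference)


# Hypothetical reference plan and a failed trajectory that still did useful work.
reference_plan = ["find author", "find birthplace", "find population"]
failed_trajectory = ["find birthplace", "find author", "search unrelated page"]

# The final answer was wrong, but two of three planned steps were covered,
# so a soft score can still give a nonzero learning signal.
print(step_coverage(failed_trajectory, reference_plan))
```

The point of such a soft score is that a trajectory with a wrong final answer is not treated the same as one that did nothing useful, which is the sample-efficiency problem the abstract describes.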

[15] Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Wenwei Li,Ming Xu,Tianle Xia,Lingxiang Hu,Yiding Sun,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang

Main category: cs.CL

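TL;DR: This paper proposes a reinforced co-adaptation framework (GraphRAG + GRPO) that combines graph-aware retrieval with reinforcement learning under multi-dimensional reward constraints. It sharply reduces URL hallucination in industrial advertising QA, improves accuracy, safety, and user satisfaction, and has run in production for over half a year, serving millions of QA interactions.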

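Details Motivation: In industrial advertising QA, hallucinations (fabricated URLs in particular) can cause financial loss, compliance violations, and legal risk. Existing RAG is hard to deploy because industrial knowledge is relational, frequently updated, and insufficiently aligned with generation objectives. Method: Proposes a reinforced co-adaptation framework with two components: (1) GraphRAG, which models entity-relation structure over a high-citation knowledge subgraph to support multi-hop, domain-specific evidence retrieval; and (2) evidence-constrained reinforcement learning based on Group Relative Policy Optimization (GRPO), with four reward dimensions covering faithfulness, style compliance, safety, and URL validity. Result: On an internal advertising QA dataset, expert-judged accuracy, completeness, and safety all improve, and the hallucination rate drops by 72%. An online A/B test shows a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has run stably for over half a year and served millions of QA interactions. Conclusion: The framework mitigates hallucination caused by complex knowledge structure and frequent updates in industrial QA, and supports the practicality and robustness of jointly optimizing retrieval and generation with multi-dimensional RL rewards in high-stakes settings. Abstract: Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.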
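The core of GRPO is that each sampled answer is scored relative to the other answers drawn for the same question, so no separate value network is needed. The sketch below shows that group-relative normalization together with a weighted sum over the four reward dimensions named in the abstract. The weights, the per-answer scores, and the use of a population standard deviation are assumptions made for illustration; the paper's actual reward design is not given in the abstract.

```python
import statistics


def combined_reward(scores, weights):
    """Weighted sum over reward dimensions (the weights are hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical weights for the four dimensions listed in the abstract.
weights = {"faithfulness": 0.4, "style": 0.1, "safety": 0.2, "url_validity": 0.3}

# Hypothetical scores for three answers sampled for the same question.
group_scores = [
    {"faithfulness": 0.9, "style": 0.8, "safety": 1.0, "url_validity": 1.0},
    {"faithfulness": 0.7, "style": 0.9, "safety": 1.0, "url_validity": 0.0},
    {"faithfulness": 0.4, "style": 0.6, "safety": 0.5, "url_validity": 0.0},
]

rewards = [combined_reward(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

With weights like these, an answer containing an invalid URL is pushed below the group mean even when it is otherwise fluent, which is the kind of pressure a URL-validity reward is meant to apply.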

[16] dLLM: Simple Diffusion Language Modeling

Zhanhui Zhou,Lingjie Chen,Hanghang Tong,Dawn Song

Main category: cs.CL

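TL;DR: This paper introduces dLLM, an open-source framework that unifies the core components of diffusion language models (DLMs), namely training, inference, and evaluation. It improves reproducibility and extensibility and provides a standardized pipeline that runs from building small DLMs to finetuning and deploying large ones.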

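Details Motivation: The components of current diffusion language model research are scattered across non-standard codebases and lack transparent implementations, making them hard to reproduce and extend. A unified framework that is both standardized and flexible is needed. Method: Designs and open-sources the dLLM framework, which integrates DLM training, inference, and evaluation; supports reproducing, finetuning, and evaluating existing open-source large DLMs such as LLaDA and Dream; and provides minimal, reproducible recipes for building small DLMs (including converting a BERT-style encoder or an autoregressive LM), with the corresponding checkpoints released. Result: Delivers dLLM, a unified open-source framework for DLMs that supports reproduction and customization of mainstream large DLMs and offers recipes and public checkpoints for small DLMs that run on accessible compute, lowering the barrier to DLM research. Conclusion: dLLM addresses the lack of a standardized, easy-to-use, and extensible framework for DLMs and may foster open collaboration and faster progress in the area. Abstract: Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.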

[17] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen,Tianrui Qin,King Zhu,Qiexiang Wang,Chengjun Yu,Shu Xu,Jiaqi Wu,Jiayu Zhang,Xinpeng Liu,Xin Gui,Jingyi Cao,Piaohong Wang,Dingfeng Shi,He Zhu,Tiannan Wang,Yuqing Wang,Maojia Song,Tianyu Zheng,Ge Zhang,Jian Yang,Jiaheng Liu,Minghao Liu,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

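TL;DR: This paper proposes the Search More, Think Less (SMTL) framework, which replaces sequential reasoning with parallel evidence acquisition to improve the efficiency and generalization of long-horizon search tasks, and reaches strong, often state-of-the-art, performance on several benchmarks.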

Details Motivation: Existing deep research agents improve performance mainly by increasing reasoning depth, which leads to high inference cost and latency and generalizes poorly across heterogeneous research settings. Method: Proposes the SMTL framework, which replaces sequential reasoning with parallel evidence acquisition to manage context efficiently; designs a unified data synthesis pipeline covering deterministic question answering and open-ended research tasks; and trains an end-to-end agent with supervised fine-tuning and reinforcement learning. Result: Achieves strong performance, in some cases state of the art, on BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared with Mirothinker-v1.0, it reduces the average number of reasoning steps on BrowseComp by 70.7% while improving accuracy. Conclusion: SMTL substantially lowers inference overhead while maintaining or improving performance and strengthens generalization across task types, offering a new approach to efficient, scalable research agents. Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.

[18] Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies

Shinnosuke Nozue,Yuto Nakano,Yotaro Watanabe,Meguru Takasaki,Shoji Moriya,Reina Akama,Jun Suzuki

Main category: cs.CL

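TL;DR: This paper proposes a cross-disciplinary framework for persuasive dialogue agents that draws on social psychology, behavioral economics, and communication theory, and validates its effectiveness and generalizability on two datasets, with particularly strong results for users with low initial intent.

Details Motivation: Existing methods rely on a limited set of predefined strategies and struggle to capture the complexity of real-world interactions. Method: Combine theories from social psychology, behavioral economics, and communication theory into a framework for persuasive dialogue agents, and validate it experimentally on the Persuasion for Good and DailyPersuasion datasets. Result: Strong results on both datasets, with a notable improvement in persuasion success rate, good generalizability, and particular strength at persuading users with low initial intent. Conclusion: The proposed cross-disciplinary framework improves the performance and applicability of persuasive dialogue agents and offers a new direction for the field. Abstract: Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.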

[19] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

Ning Gao,Wei Zhang,Yuqin Dai,Ling Shi,Ziyin Wang,Yujie Wang,Wei He,Jinpeng Wang,Chaozheng Wang

Main category: cs.CL

TL;DR: This paper proposes InteractCS-RL, a framework that models task-oriented dialogue as a multi-granularity reinforcement learning process. Using a User-centric Interaction Framework and Cost-aware Multi-turn Policy Optimization (CMPO), it balances empathetic communication against budget constraints and achieves significant gains in real business scenarios and on tool-agent benchmarks.

Details Motivation: Existing methods struggle to handle the complex strategic trade-off between empathetic communication and budget-aware decision-making. Method: InteractCS-RL has two parts: (1) a User-centric Interaction Framework that provides a high-fidelity training environment; (2) Cost-aware Multi-turn Policy Optimization (CMPO), which combines generative process credits with a PID-Lagrangian cost controller in a hybrid advantage estimate to approach the Pareto boundary between user reward and global cost constraints. Result: In customized real business scenarios, InteractCS-RL significantly outperforms baselines across three evaluation dimensions; it also shows cross-domain robustness on tool-agent-user interaction benchmarks. Conclusion: InteractCS-RL recasts task-oriented dialogue as a multi-granularity reinforcement learning problem that balances empathy and cost, offering a new paradigm for practical, controllable, human-centered LLM agents. Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
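
A PID-Lagrangian cost controller can be sketched in a few lines. The version below is a generic illustration and not the CMPO controller itself: the class name, the gains `kp`, `ki`, `kd`, the `budget` parameter, and the way the multiplier is applied in `penalized_advantage` are all assumptions made for this example. The multiplier grows when the observed cost exceeds the budget and shrinks back toward zero otherwise.

```python
class PIDLagrangian:
    """Generic PID controller on the constraint violation (illustrative sketch)."""

    def __init__(self, kp: float, ki: float, kd: float, budget: float):
        self.kp, self.ki, self.kd, self.budget = kp, ki, kd, budget
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_cost: float) -> float:
        """Return a non-negative Lagrange multiplier for the current cost."""
        error = observed_cost - self.budget  # positive when over budget
        # Clamp the integral at zero so past under-spending cannot accumulate.
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


def penalized_advantage(reward_adv: float, cost_adv: float, lam: float) -> float:
    """Trade reward against cost using the current multiplier (assumed form)."""
    return reward_adv - lam * cost_adv


controller = PIDLagrangian(kp=1.0, ki=0.5, kd=0.0, budget=1.0)
lam = controller.update(observed_cost=2.0)  # over budget, so lam > 0
```

Compared with a plain gradient update of the multiplier, the proportional and derivative terms react to the current violation and its trend, which is the usual motivation for PID-style control of a cost constraint.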

[20] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Siyue Su,Jian Yang,Bo Li,Guanglin Niu

Main category: cs.CL

TL;DR: This paper proposes KGT, a framework that uses dedicated entity tokens to resolve the granularity mismatch large language models face in knowledge graph completion, enabling full-space prediction and outperforming prior state-of-the-art methods on multiple benchmarks.

Details Motivation: Existing methods fail to capture both textual semantics and the structural integrity of the graph because of a granularity mismatch: LLMs operate on token sequences, whereas entities are the basic units of a KG. Method: KGT comprises dedicated entity tokenization, a relation-guided gating mechanism that fuses pre-trained structural and textual features, and decoupled prediction with independent heads for semantic and structural reasoning. Result: KGT consistently outperforms state-of-the-art methods on multiple benchmark datasets. Conclusion: KGT effectively resolves the granularity mismatch when LLMs are used for knowledge graph completion, improving predictive power and generalization. Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
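
A gating mechanism that fuses two feature vectors under the control of a relation embedding might look like the sketch below. The abstract does not give KGT's exact formulation, so the per-dimension sigmoid gate, the `gate_weights` matrix, and the function names are assumptions chosen only to illustrate the general technique.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(structural, textual, relation, gate_weights):
    """Blend structural and textual features with a relation-dependent gate.

    For each dimension i, gate g_i = sigmoid(gate_weights[i] . relation) and
    fused_i = g_i * structural_i + (1 - g_i) * textual_i (assumed form).
    """
    fused = []
    for s, t, w_row in zip(structural, textual, gate_weights):
        g = sigmoid(sum(w * r for w, r in zip(w_row, relation)))
        fused.append(g * s + (1.0 - g) * t)
    return fused


# Zero gate weights give g = 0.5, so the result is the plain average.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [[0.0, 0.0], [0.0, 0.0]])
```

In a trained model the gate weights would be learned, letting some relations lean on graph structure and others on text.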

[21] Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung,Daniil Ignatev,Merel Scholman,Vera Demberg,Massimo Poesio

Main category: cs.CL

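TL;DR: This paper compares two approaches to Implicit Discourse Relation Recognition (IDRR), predicting annotation distributions and modeling individual annotators, and finds the former more stable, while the latter performs poorly because of the high ambiguity caused by cognitive complexity.

Details Motivation: Many NLP tasks lack a single ground truth and human annotations vary; IDRR is highly ambiguous, with disagreement stemming mainly from cognitive complexity rather than ideological bias, so it is worth asking which modeling approach better captures this variation. Method: Compare distributional modeling of annotations with perspectivist modeling of individual annotators on IDRR, with ablations and error analysis to locate the factors that drive performance. Result: Existing annotator-specific models perform poorly on IDRR unless ambiguity is substantially reduced, whereas models trained on label distributions give more stable predictions; cognitively demanding instances are the main driver of inconsistent human interpretation. Conclusion: For tasks where ambiguity is dominated by cognitive complexity, such as IDRR, modeling the overall label distribution is more robust and practical than modeling individual perspectives. Abstract: There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.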
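
The difference between training on a majority label and training on the full annotation distribution can be made concrete with a small sketch. The label set, the toy annotations, and the function names below are illustrative assumptions, not taken from the paper; the cross-entropy against a soft target is the standard formulation.

```python
import math
from collections import Counter

LABELS = ["cause", "contrast", "conjunction"]  # illustrative label set


def label_distribution(annotations):
    """Turn a list of annotator labels into a probability vector over LABELS."""
    counts = Counter(annotations)
    return [counts[label] / len(annotations) for label in LABELS]


def majority_label(annotations):
    """Collapse the annotations to a single label, discarding disagreement."""
    return Counter(annotations).most_common(1)[0][0]


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


annotations = ["cause", "cause", "contrast", "cause"]
target = label_distribution(annotations)
# An overconfident prediction is penalized more than one matching the spread.
loss_matched = soft_cross_entropy(target, target)
loss_overconfident = soft_cross_entropy(target, [0.98, 0.01, 0.01])
```

With the soft target, a model is rewarded for reproducing the 3-to-1 split among annotators, whereas the majority label alone would treat the minority reading as an error.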

[22] Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

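TL;DR: This paper introduces a new Czech restaurant-domain dataset for aspect-based sentiment analysis (ABSA) with opinion term annotations, supporting three ABSA tasks of varying complexity. On this dataset the authors evaluate Transformer models and large language models in monolingual, cross-lingual, and multilingual settings, and propose an LLM-based translation and label alignment method that improves cross-lingual performance. The results reveal the strengths and limitations of current models on low-resource languages such as Czech.

Details Motivation: To fill the gap in data and methods for ABSA in low-resource languages such as Czech, in particular the lack of high-quality datasets with opinion term annotations and of effective cross-lingual adaptation methods. Method: Build a Czech restaurant-domain ABSA dataset with opinion term annotations; evaluate Transformer models and LLMs in monolingual, cross-lingual, and multilingual settings; propose an LLM-based translation and label alignment method to address label inconsistency in cross-lingual ABSA. Result: The translation-alignment method yields consistent improvements; experiments expose the weaknesses of current state-of-the-art models in detecting subtle opinion terms and nuanced sentiment expressions; the dataset becomes a new benchmark for Czech ABSA. Conclusion: The Czech ABSA dataset fills a resource gap for low-resource languages, and the proposed LLM-driven translation-alignment method is scalable and can be applied to ABSA adaptation for other low-resource languages. Abstract: This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.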

[23] Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Nils Schwager,Simon Münker,Alistair Plum,Achim Rettinger

Main category: cs.CL

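TL;DR: This study introduces the Conditioned Comment Prediction (CCP) task, which evaluates how well large language models (LLMs) simulate social media user behavior by comparing generated comments with authentic digital traces. It finds that supervised fine-tuning (SFT) decouples form from content in low-resource languages and that explicit conditioning (such as generated biographies) becomes redundant after fine-tuning, and it argues that high-fidelity simulation should rest on authentic behavioral data rather than descriptive personas.

Details Motivation: LLMs are shifting from exploratory tools to active "silicon subjects" in social science, yet their operational validity has not been adequately validated. Method: Introduce the CCP task and evaluate open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish; systematically compare explicit and implicit prompting strategies and the impact of SFT. Result: SFT improves the consistency of output form (length, syntax) but weakens semantic grounding; explicit conditioning (such as generated biographies) becomes redundant after fine-tuning, as models perform latent inference directly from behavioral histories. Conclusion: The work challenges the current "naive prompting" paradigm, argues for replacing descriptive personas with authentic behavioral traces, and offers operational guidelines for high-fidelity simulation of social behavior. Abstract: The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.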

[24] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri,Aidan Ewart,Kai Fronsdal,Isha Gupta,Samuel R. Bowman,Sara Price,Samuel Marks,Rowan Wang

Main category: cs.CL

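TL;DR: This paper introduces AuditBench, a benchmark for evaluating alignment auditing techniques on language models. It comprises 56 models, each implanted with one of 14 hidden concerning behaviors, together with a configurable investigator agent for measuring the effectiveness of different auditing tools. The agent performs best with black-box tools rather than white-box tools, and the training technique strongly affects auditing difficulty.

Details Motivation: Existing alignment auditing methods lack a standardized, quantitative evaluation benchmark, which makes it hard to compare auditing techniques systematically; a controlled, diverse, reproducible benchmark is needed to advance empirical research on alignment auditing. Method: Build AuditBench from 56 language models, each with one of 14 hidden behaviors (such as sycophantic deference or opposition to AI regulation), using varied training techniques (synthetic documents vs. demonstrations, adversarial training, and so on) to control how well the behavior is hidden; design a configurable investigator agent that integrates multiple auditing tools (including black-box methods that use auxiliary models to generate diverse prompts, and white-box interpretability tools) and measure its success rate in a unified framework. Result: A tool-to-agent gap appears: some tools that perform well in standalone evaluations do not help when used by the agent; the most effective tools are scaffolded calls to auxiliary models that generate diverse prompts; white-box tools help somewhat but are not the best; training technique strongly affects auditing difficulty (models trained on synthetic documents are easier to audit, and models trained on demonstrations with stronger adversarial training are the hardest). Conclusion: AuditBench provides the first large-scale, controlled, open-source benchmark for alignment auditing; the evidence suggests that black-box, prompt-driven auditing strategies are more robust; auditing success depends heavily on how the model was trained, so training method should be part of any auditing evaluation; all models, the agent, and the framework are released to support iterative quantitative research. Abstract: We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.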

[25] Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang,Yudong Wang,Rang Li,Zhifang Sui

Main category: cs.CL

TL;DR: This paper proposes a unified reinforcement learning framework that combines first-order (generation) and second-order (critique) rollouts to improve both the generation and critique capabilities of large language models, using training data more effectively and improving performance.

Details Motivation: Existing RL methods such as vanilla RL only improve generation through first-order rollout and neglect critique training, leaving the potential of the training data underused. Method: Introduce second-order rollout (generating multiple critiques for one response), build a unified RL framework that jointly trains generation and critique, and examine key techniques such as label balancing and sampling to mitigate reward noise. Result: Experiments across multiple models and datasets show that the method uses training data more effectively than vanilla RL and performs better with the same data; they also show the importance of label balance in critique training and that the noise in outcome-based rewards can be mitigated by sampling. Conclusion: The work is a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, offering ideas for further progress in RL training. Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.

[26] Imagination Helps Visual Reasoning, But Not Yet in Latent Space

You Li,Chi Chen,Yanghao Li,Fanhu Zeng,Kaiyu Huang,Jinan Xu,Maosong Sun

Main category: cs.CL

TL;DR: Using causal mediation analysis, this paper reveals two disconnects in latent visual reasoning, one between the input and the latent tokens and one between the latent tokens and the answer. It questions whether latent reasoning is necessary and proposes CapImagine, a simple and effective alternative based on explicit imagination in text.

Details Motivation: To identify the true source of latent visual reasoning's effectiveness and to determine whether it really relies on implicit reasoning in latent space. Method: Model the process with causal mediation analysis, treating the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome; complement this with perturbation experiments and probing analysis to test how much information the latent tokens carry and what role they play. Result: Perturbing the input barely changes the latent tokens (input-latent disconnect), and perturbing the latent tokens barely changes the answer (latent-answer disconnect); latent tokens encode limited visual information and are highly similar to one another; CapImagine significantly outperforms complex latent-space baselines on vision-centric benchmarks. Conclusion: Latent visual reasoning is not necessary, and explicit imagination in text is more effective; the assumptions behind latent-space reasoning should be re-examined in favor of more interpretable, more efficient designs. Abstract: Latent visual reasoning aims to mimic the human imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens have a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

[27] Probing for Knowledge Attribution in Large Language Models

Ivo Brink,Alexander Boer,Dennis Ulmer

Main category: cs.CL

TL;DR: This paper studies contributive attribution, the task of identifying whether an LLM output is driven mainly by the context or by the model's internal parameters. It introduces AttriWiki, a self-supervised data pipeline, and a simple linear probe that attributes outputs with high accuracy, and it reveals a strong link between attribution errors and hallucination.

Details Motivation: LLMs often hallucinate, through faithfulness violations (misusing user context) or factuality violations (errors from internal knowledge); targeted mitigation requires knowing whether an answer relies on the prompt or on the model's internal parameters. Method: Define the contributive attribution task; build AttriWiki, a self-supervised pipeline that controls prompts so that the model retrieves entities either from memory or from context, producing labeled examples automatically; train a linear probe on hidden representations to predict attribution. Result: Probes reach up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B and transfer to SQuAD and WebQuestions with 0.94-0.99 Macro-F1 without retraining; attribution mismatches raise error rates by up to 70%. Conclusion: Contributive attribution is probeable and carries a strong signal; attribution mismatches are closely tied to unfaithful answers, but attribution alone cannot resolve all hallucinations, and broader detection frameworks are needed. Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
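
A linear probe of the kind described, a linear classifier trained on hidden representations, can be sketched with plain logistic regression. Everything concrete below is an assumption for illustration: the 2-D "hidden states" are toy vectors rather than real model activations, label 1 stands for "answer read from context" and label 0 for "answer recalled from parameters", and the training hyperparameters are arbitrary.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression with batch gradient descent; returns (weights, bias)."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                grad_w[i] += (p - y) * x[i] / n
            grad_b += (p - y) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (context) if the probe's logit is positive, else 0 (parametric)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy stand-ins for hidden states; real probes would use model activations.
FEATURES = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0], [-1.0, -1.0],
            [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.0, 1.0]]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = train_linear_probe(FEATURES, LABELS)
```

If a classifier this simple separates the two cases on real activations, the attribution signal is linearly readable from the representation, which is what the reported Macro-F1 scores indicate.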

[28] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim,Hanau Yi,Jaehee Bae,Yumin Kim

Main category: cs.CL

TL;DR: This paper presents Natural Language Declarative Prompting (NLD-P) as a declarative governance method for evolving LLMs. A modular abstraction separates a prompt's provenance, constraint logic, task content, and post-generation evaluation without external orchestration code, improving prompt stability and interpretability under model drift.

Details Motivation: As LLMs iterate rapidly, prompt behavior becomes highly sensitive to changes in instruction-following policies, alignment regimes, and decoding strategies (GPT-scale model drift); surface-level formatting and ad hoc tuning cannot guarantee stable, controllable prompting. Method: Formalize NLD-P as a modular control abstraction that explicitly separates provenance, constraint logic, task content, and post-generation evaluation, all encoded in natural language without external orchestration code; define minimal compliance criteria and analyze how receptive different models are to the NLD-P schema. Result: NLD-P is positioned as a lightweight, accessible prompt governance framework for non-developer practitioners; parts of the drafting and editing were done by an LLM assistant configured under NLD-P, but all core concepts, methodological claims, and final revisions were directed by the human author under a human-in-the-loop protocol. Conclusion: NLD-P offers a workable declarative control paradigm for a continuously evolving LLM ecosystem, emphasizes human-in-the-loop governance responsibility, and points to directions for future empirical validation. Abstract: The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.

[29] TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Reihaneh Iranmanesh,Saeedeh Davoudi,Pasha Abrishamchian,Ophir Frieder,Nazli Goharian

Main category: cs.CL

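TL;DR: This paper proposes a framework for evaluating the cultural competence of large language models in Persian. It uses Persian-specific short-answer questions and combines rule-based morphological normalization with a hybrid syntactic-semantic similarity module, which substantially improves scoring consistency; the benchmark is publicly released.

Details Motivation: Existing Persian cultural benchmarks rely mostly on multiple-choice questions and English-centric metrics, which fail to capture Persian's morphological complexity and semantic nuance. Method: Build a Persian-specific short-answer evaluation framework that combines rule-based morphological normalization with a hybrid syntactic-semantic similarity module, enabling soft-match scoring beyond exact string matching. Result: A systematic evaluation of 15 leading open- and closed-source models shows that the hybrid evaluation improves scoring consistency by 10% over exact-match baselines and captures meaning that surface-level methods miss. Conclusion: The released framework is the first standardized benchmark for cultural understanding in Persian and provides a reproducible foundation for cross-cultural LLM evaluation research. Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by 10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.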
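
Soft-match scoring after normalization can be illustrated with a small sketch. It is not the TARAZ scorer: the three normalization rules (Arabic Yeh and Kaf mapped to their Persian forms, zero-width non-joiner removed), the 50/50 weighting, and the use of character-level similarity from `difflib` as a crude stand-in for the semantic component are all assumptions made for this example.

```python
import difflib

# Illustrative normalization rules (assumed, not the benchmark's rule set):
# Arabic Yeh -> Persian Yeh, Arabic Kaf -> Persian Keheh, drop zero-width non-joiner.
_NORMALIZE_TABLE = {0x064A: 0x06CC, 0x0643: 0x06A9, 0x200C: None}


def normalize(text: str) -> str:
    """Apply character mappings, lowercase, and collapse whitespace."""
    return " ".join(text.translate(_NORMALIZE_TABLE).lower().split())


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def soft_match(prediction: str, reference: str, alpha: float = 0.5) -> float:
    """Weighted mix of token overlap and character-level similarity in [0, 1]."""
    p, r = normalize(prediction), normalize(reference)
    char_sim = difflib.SequenceMatcher(None, p, r).ratio()
    return alpha * token_overlap(p, r) + (1.0 - alpha) * char_sim


# The same word typed with Arabic Kaf vs Persian Keheh scores as a full match.
score = soft_match("\u0643\u062a\u0627\u0628", "\u06a9\u062a\u0627\u0628")
```

An exact string comparison would score the two spellings above as a mismatch, which is the kind of surface-level failure that normalization followed by soft matching is meant to avoid.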

[30] TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Jianmin Li,Ying Chang,Su-Kit Tang,Yujia Liu,Yanwen Wang,Shuyuan Lin,Binkai Ou

Main category: cs.CL

TL;DR: This paper proposes TCM-DiffRAG, a framework that combines knowledge graphs with chain-of-thought reasoning and substantially improves large language model performance on traditional Chinese medicine (TCM) clinical diagnosis tasks, outperforming native models, fine-tuned models, and other RAG methods.

Details Motivation: Conventional RAG methods perform poorly in TCM clinical diagnosis and treatment, which is complex and highly individualized, and need to be adapted to the characteristics of TCM reasoning. Method: Build TCM-DiffRAG, which integrates TCM knowledge graphs (KG) with chain-of-thought (CoT) reasoning, and evaluate it on three TCM datasets. Result: TCM-DiffRAG substantially outperforms native LLMs (for example, the three qwen-plus scores improve from 0.927/0.361/0.038 to 0.952/0.788/0.356) and also beats supervised fine-tuned models and other RAG baselines; gains are larger for non-Chinese LLMs. Conclusion: Combining structured TCM knowledge graphs with chain-of-thought reasoning strengthens individualized diagnosis; using universal and personalized knowledge graphs together aligns general knowledge with clinical reasoning; the results support the potential of reasoning-aware RAG in TCM. Abstract: Background: Retrieval-augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine-tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with chain-of-thought reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.

[31] Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features

Mohammad Yeghaneh Abkenar,Weixing Wang,Manfred Stede,Davide Picca,Mark A. Finlayson,Panagiotis Ioannidis

Main category: cs.CL

TL;DR: This paper proposes a neural argumentative stance classification approach based on an expanded NRC emotion lexicon (eNRC), built with DistilBERT embeddings to strengthen fine-grained emotion analysis. It significantly improves F1 on five datasets covering controversial topics from diverse domains, and all resources are released.

Details Motivation: Prior stance classification work mostly ignores explicit, fine-grained emotion analysis and is limited to non-argumentative texts and specific domains, so it generalizes poorly. Method: Expand the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings (yielding eNRC) and feed it into a neural argumentative stance classification model. Result: eNRC beats the baseline on all five datasets (up to +6.2 F1), beats the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. Conclusion: Explicitly incorporating contextualized emotion knowledge improves cross-domain argumentative stance classification, and eNRC is a reusable resource for emotion and argument modeling. Abstract: Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments, especially about controversial topics, often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources, including eNRC, the adapted corpora, and model architecture, to enable other researchers to build upon our work.

[32] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages

Jonathan Davidov,Aviv Slobodkin,Shmuel Tomi Klein,Reut Tsarfaty,Ido Dagan,Ayal Klein

Main category: cs.CL

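TL;DR: This paper proposes a cross-lingual projection method based on Question-Answer driven Semantic Role Labeling (QA-SRL). It reuses an English QA-SRL parser with constrained translation and word alignment to automatically generate high-quality QA-SRL data for Hebrew, Russian, and French, and trains language-specific parsers that outperform multilingual LLM baselines.

Details Motivation: Annotating predicate-argument structure is costly and has limited language coverage (mostly English), so a low-cost, scalable approach to cross-lingual semantic analysis is needed. Method: Use QA-SRL as the semantic representation and design a cross-lingual projection pipeline: English question-answer pairs are mapped to the target language through constrained translation and word alignment, QA-SRL annotations are generated for the target language, and language-specific parsers are fine-tuned on them. Result: High-quality QA-SRL datasets are built for Hebrew, Russian, and French; the resulting language-specific parsers outperform strong multilingual LLM baselines such as GPT-4o and LLaMA-Maverick. Conclusion: QA-SRL can serve as a general, natural-language interface for semantics, supporting efficient and transferable cross-lingual predicate-argument parsing and advancing semantic analysis for low-resource languages. Abstract: Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework -- a natural-language formulation of predicate-argument relations -- as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French -- spanning diverse language families -- the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.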

[33] Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

Yushi Ye,Feng Hong,Huangjie Zheng,Xu Chen,Zhiyong Chen,Yanfeng Wang,Jiangchao Yao

Main category: cs.CL

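TL;DR: This paper proposes ReMix, a framework that introduces a Continuous Mixing State and a rejection rule to achieve a 2-8x inference speedup without degrading generation quality, addressing the quality-speed trade-off of parallel decoding in diffusion large language models.

Details Motivation: Diffusion large language models (DLLMs) face a severe quality-speed trade-off in parallel decoding, rooted in the "combinatorial contradiction" phenomenon, in which tokens generated in parallel form semantically inconsistent combinations. Method: ReMix (Rejection Mixing) introduces a Continuous Mixing State as an intermediate between the initial masked state and the final discrete token state, so that token representations can be refined iteratively in continuous space to resolve conflicts between tokens; a rejection rule reverts highly uncertain representations to the masked state for reprocessing, preventing error propagation. Result: As a training-free method, ReMix achieves a 2-8x inference speedup across multiple benchmarks with no loss in generation quality. Conclusion: By adding continuous-space refinement to discrete diffusion decoding, ReMix mitigates combinatorial contradictions and offers a new paradigm for efficient, high-quality non-autoregressive text generation. Abstract: Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8x inference speedup without any quality degradation.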
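
The rejection step, reverting uncertain positions to the masked state so they are reprocessed, can be sketched as follows. The abstract does not specify how uncertainty is measured, so the per-position confidence score, the fixed threshold, and the function name are assumptions; the sketch also omits the continuous mixing state and shows only the revert-to-mask decision.

```python
MASK = None  # placeholder for the masked state in this sketch


def apply_rejection(candidates, confidences, threshold):
    """Keep confident candidate tokens; revert uncertain positions to MASK.

    Positions returned as MASK would be decoded again in a later step.
    """
    return [
        token if confidence >= threshold else MASK
        for token, confidence in zip(candidates, confidences)
    ]


# Position 1 falls below the threshold and is sent back for reprocessing.
decoded = apply_rejection(["the", "cat", "dog"], [0.9, 0.4, 0.8], threshold=0.5)
```

Deferring low-confidence positions rather than committing to them is what keeps an early mistake from conditioning every later decoding step.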

[34] Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Roy Miles,Aysim Toker,Andreea-Maria Oncescu,Songcen Xu,Jiankang Deng,Ismail Elezi

Main category: cs.CL

TL;DR: This paper proposes Stitching Noisy Diffusion Thoughts, a self-consistency framework that aggregates diverse, low-cost reasoning trajectories from a diffusion language model at the step level. A process reward model (PRM) scores the intermediate steps, the best ones are stitched into a composite reasoning chain, and an autoregressive model produces the final answer. The method improves accuracy and reduces latency on math and coding tasks.

Details Motivation: Existing aggregation strategies for LLM reasoning are mostly trajectory-level (e.g., voting or selecting the best path); they ignore partially correct or nearly correct intermediate steps and so waste useful information. Method: 1) Sample many low-cost, diverse reasoning trajectories with a masked diffusion language model; 2) score every intermediate step with an off-the-shelf process reward model (PRM); 3) stitch the highest-scoring steps across trajectories into a composite reasoning chain; 4) condition an autoregressive model on that chain so that it recomputes only the final answer. Result: On six math and coding benchmarks, average accuracy improves by up to 23.8% and latency drops by up to 1.8x relative to traditional diffusion models and unified architectures; step-level recombination helps most on harder problems, and the final AR solver is critical for answer accuracy. Conclusion: Reusing and stitching step-level reasoning fragments is an efficient, modular, training-free way to improve reasoning that balances search breadth with reasoning precision and outperforms trajectory-level aggregation and end-to-end unified modeling. Abstract: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch the highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
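The step-level stitching idea in this abstract can be illustrated with a minimal sketch. It assumes that steps are aligned by position across trajectories and that PRM scores are already available as plain floats; both are simplifications for illustration, and `stitch_steps` is a hypothetical helper, not the authors' implementation.

```python
from typing import List


def stitch_steps(trajectories: List[List[str]], scores: List[List[float]]) -> List[str]:
    """Pick, for every step position, the highest-scoring step across trajectories.

    Assumption: step i of one trajectory is comparable to step i of another.
    """
    n_positions = max(len(t) for t in trajectories)
    composite = []
    for i in range(n_positions):
        # Collect (score, step) pairs from every trajectory that reaches position i.
        candidates = [
            (scores[k][i], trajectories[k][i])
            for k in range(len(trajectories))
            if i < len(trajectories[k])
        ]
        best_score, best_step = max(candidates, key=lambda c: c[0])
        composite.append(best_step)
    return composite


if __name__ == "__main__":
    trajs = [["a1", "a2"], ["b1", "b2", "b3"]]
    prm = [[0.9, 0.2], [0.1, 0.8, 0.5]]
    print(stitch_steps(trajs, prm))
```

In the pipeline described above, the stitched list would then be passed as context to an autoregressive solver, which is outside the scope of this sketch.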

[35] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg,Oren Gal

Main category: cs.CL

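TL;DR: This paper studies how OCR information is routed into the language processing stream of vision-language models (VLMs). Causal interventions locate the key bottleneck for the OCR signal in different architectures (Qwen3-VL, Phi-4, InternVL3.5) and show that OCR representations are low-dimensional and transfer across datasets. More surprisingly, removing OCR in strongly modular models (such as Qwen3-VL-4B) improves counting performance.

Details Motivation: To determine where and how OCR information is injected into the language processing stream of vision-language models, and to understand how OCR signals are routed under different architectures and how this affects downstream tasks. Method: Causal interventions compare activations for original images against versions whose text regions have been inpainted, locating OCR-sensitive layers; principal component analysis (PCA) characterizes the dimensionality and cross-dataset transferability of OCR representations; the effect of OCR removal on counting is evaluated in modular models. Result: The OCR bottleneck location depends on architecture: DeepStack models (Qwen) are most sensitive at mid-depth (about 50%), while single-stage projection models (Phi-4, InternVL) are most sensitive at early layers (6-25%); OCR representations are highly low-dimensional (PC1 explains 72.9% of variance), and PCA directions transfer across datasets; in Qwen3-VL-4B, removing OCR improves counting performance by up to 6.9 percentage points. Conclusion: OCR routing in VLMs is architecture-dependent and its representations share common pathways; in highly modular models OCR may interfere with other visual reasoning tasks, which suggests that its integration needs more careful design. Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.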
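The "PC1 captures most of the variance" and "directions transfer across datasets" claims can be made concrete with a small sketch. It uses synthetic activation differences, not real model activations; `make_toy_diffs`, `pc1_summary`, and `variance_along` are hypothetical helpers, and the noise levels are arbitrary assumptions.

```python
import numpy as np


def make_toy_diffs(direction, seed, n=200):
    """Synthetic stand-in for (original - inpainted) activation differences."""
    rng = np.random.default_rng(seed)
    strength = rng.normal(scale=3.0, size=(n, 1))          # signal along one direction
    noise = rng.normal(scale=0.1, size=(n, direction.size))  # small isotropic noise
    return strength * direction + noise


def pc1_summary(diffs):
    """Return the fraction of variance explained by PC1 and the PC1 direction."""
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return float(s[0] ** 2 / np.sum(s ** 2)), vt[0]


def variance_along(diffs, direction):
    """Fraction of a dataset's variance captured by a given unit direction."""
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    proj = centered @ direction
    return float(np.sum(proj ** 2) / np.sum(centered ** 2))


if __name__ == "__main__":
    shared = np.ones(16) / 4.0            # unit-norm direction shared by both toy datasets
    data_a = make_toy_diffs(shared, seed=0)
    data_b = make_toy_diffs(shared, seed=1)
    ratio_a, pc1_a = pc1_summary(data_a)
    print("PC1 variance on A:", ratio_a)
    print("A's PC1 applied to B:", variance_along(data_b, pc1_a))
```

Fitting PC1 on one dataset and measuring the variance it captures on another is one simple way to quantify transfer; the paper may use a different procedure.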

[36] Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae,Baeseong Park,Gunho Park,Minsub Kim,Joonhyung Lee,Junhee Yoo,Sunghyeon Woo,Jiwon Ryu,Se Jung Kwon,Dongsoo Lee

Main category: cs.CL

TL;DR: This paper proposes Affine-Scaled Attention, which applies input-dependent scaling and a bias to softmax-normalized attention weights, relaxing the strict normalization constraint and improving training stability and downstream performance.

Details Motivation: Standard softmax attention enforces unit-sum normalization, which limits control over attention magnitude and can lead to overly concentrated attention or unstable training. Method: An input-dependent affine transformation (a scaling factor and a bias term) is applied to the softmax-normalized attention weights, retaining value aggregation while allowing joint control over the distribution and scale of attention. Result: In large-scale language model pretraining, Affine-Scaled Attention improves training stability, optimization behavior, and downstream task performance relative to standard softmax attention and attention sink baselines. Conclusion: Modest reweighting of attention outputs is a practical and effective way to improve attention behavior in Transformer models. Abstract: Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
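The abstract does not give the exact formula, so the following is only a sketch of one plausible reading: a per-query scale and bias applied to the softmax weights before value aggregation. The function name, the per-query shape of `scale` and `bias`, and the way they would be produced from the input are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def affine_scaled_attention(q, k, v, scale, bias):
    """Single-head attention with an affine map on the softmax weights.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v)
    scale, bias: (n_q, 1), assumed to come from some input-dependent projection.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))   # rows sum to 1
    adjusted = scale * weights + bias         # rows now sum to scale + n_k * bias
    return adjusted @ v, adjusted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 2))
    out, w = affine_scaled_attention(q, k, v, 2.0 * np.ones((3, 1)), 0.1 * np.ones((3, 1)))
    print(w.sum(axis=1))  # no longer constrained to 1
```

With `scale = 1` and `bias = 0` the sketch reduces to standard softmax attention, which makes the relaxation of the unit-sum constraint easy to see.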

[37] Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

Gabriela Anna Kaczmarek,Pietro Ferrazzi,Lorenzo Porta,Vicky Rubini,Bernardo Magnini

Main category: cs.CL

TL;DR: This paper introduces a new dataset of Italian emergency department clinical notes for automatic Case Report Form (CRF) filling, and examines the feasibility of, and the biases in, using an open-source large language model (LLM) for the task in a zero-shot setting.

Details Motivation: The scarcity of annotated CRF data limits progress on LLM-based automatic CRF filling, so high-quality annotated datasets for specific languages and settings are needed. Method: Built an annotated dataset of Italian emergency department clinical notes paired with a CRF of 134 items; defined the CRF-filling task and its evaluation metric; ran zero-shot experiments with an open-source state-of-the-art LLM. Result: The experiments show that (i) CRF filling from real Italian clinical notes can be approached in a zero-shot setting; (ii) LLMs exhibit biases (e.g., overusing the "unknown" answer) that need targeted correction. Conclusion: The dataset is a valuable resource for research on automatic CRF filling; zero-shot LLM approaches are feasible but must address systematic bias, and future work should focus on bias mitigation and domain adaptation. Abstract: Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs' results are affected by biases (e.g., a cautious behaviour favours "unknown" answers), which need to be corrected.

[38] Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

Yuqi Shi,Hao Yang,Xiyao Lu,Jinsong Zhang

Main category: cs.CL

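TL;DR: This study examines fossilization and stability at the syntax-prosody interface in L2 Mandarin produced by native Vietnamese speakers. High-proficiency learners approach native speakers in the number of prosodic boundaries, yet their syntax-to-prosody mapping deviates systematically, inverting the prosodic hierarchy.

Details Motivation: L2 learners may acquire target-language word order, but mapping that syntax onto appropriate prosodic structure remains difficult; it is unclear whether acquisition of the syntax-prosody interface truly stabilizes or fossilizes. Method: Using the BLCU-SAIT corpus, 67 native Mandarin speakers are compared with 67 Vietnamese L2 learners, combining C-ToBI prosodic boundary annotation with dependency grammar analysis to examine the number of prosodic boundaries and their mapping to syntactic relations. Result: High-proficiency Vietnamese learners (VNH) approach native speakers in the number of boundaries at the Major Phrase (B3) level, but they demote the boundary at the Subject-Verb (SBV) interface and promote it at the Verb-Object (VOB) interface, inverting the prosodic hierarchy and sacrificing structural accuracy to sustain long phrasal output. Conclusion: Acquisition of the L2 syntax-prosody interface is non-linear, and high proficiency does not imply interface stability; the VNH group shows a systematic, fossilized bias in mapping strategy, which amounts to fossilization at the interface level. Abstract: While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 -> Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 -> Major Phrase B3). This strategy allows learners to maintain long phrasal output at the expense of structural accuracy. This results in a distorted prosodic hierarchy where the native pattern is inverted.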

[39] CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Mengze Hong,Di Jiang,Chen Jason Zhang,Zichang Guo,Yawen Li,Jun Chen,Shaobo Cui,Zhiyang Su

Main category: cs.CL

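TL;DR: This paper presents CiteLLM, a local, trustworthy reference discovery platform embedded in the LaTeX editor. It combines discipline-aware dynamic routing with LLMs to retrieve, rank, and validate scholarly references, aiming for hallucination-free output while protecting privacy and academic integrity.

Details Motivation: To address three ethical challenges in applying LLMs to scholarly work: the low trustworthiness of AI-generated content, the difficulty of safeguarding academic integrity and intellectual property, and the risk of leaking private information. Method: CiteLLM embeds LLM utilities in a local LaTeX editor; dynamic discipline-aware routing retrieves literature only from trusted academic websites; the LLM is used solely to generate context-aware search queries, rank candidates by relevance, and validate support through paragraph-level semantic matching, with an integrated explanatory chatbot. Result: Evaluation shows that CiteLLM efficiently returns valid and highly usable references and outperforms baseline methods on citation accuracy, usefulness, and safety. Conclusion: CiteLLM offers an approach to AI-assisted academic writing that balances trustworthiness, privacy protection, and academic norms, and it demonstrates the feasibility of local, task-specific design with strict data boundaries. Abstract: Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.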

[40] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

Boyang Zhang,Yang Zhang

Main category: cs.CL

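TL;DR: This paper proposes SALA, which combines stylometric features with LLM reasoning for interpretable authorship attribution, and designs a guided rewriting strategy that reduces a text's identifiability to protect author privacy.

Details Motivation: Rapid progress in large language models (LLMs) has strengthened authorship inference, creating a risk of unintended deanonymization in textual data such as news articles; interpretable, proactive privacy protection is needed. Method: Proposes SALA (Stylometry-Assisted LLM Analysis), which combines quantitative stylometric features with LLM reasoning; builds an LLM agent framework with a database module to improve accuracy; designs a guided recomposition strategy that uses the reasoning trace to generate rewriting prompts. Result: On large-scale news datasets, SALA (especially with the database module) achieves high authorship inference accuracy in various scenarios; the guided recomposition strategy substantially reduces authorship identifiability while preserving meaning. Conclusion: LLM agents have strong deanonymization potential, but interpretable, structured methods can also provide effective defenses; SALA offers an approach to author privacy protection that balances robustness and transparency. Abstract: The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed $\textit{SALA}$ (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that $\textit{SALA}$, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent's reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.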
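To give a sense of what "quantitative stylometric features" can look like, here is a minimal sketch using a few classic measures. The abstract does not list the features SALA uses, so this particular feature set and the `stylometric_features` helper are illustrative assumptions only.

```python
import re


def stylometric_features(text: str) -> dict:
    """Compute a few simple, classic stylometric measures for one text."""
    # Split on runs of sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions).
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "comma_rate": text.count(",") / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(stylometric_features("The cat sat. The cat ran!"))
```

Numeric features of this kind could be serialized into a prompt alongside the text so that an LLM's attribution reasoning can refer to them explicitly.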

[41] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Jayadev Billa

Main category: cs.CL

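TL;DR: This paper identifies a mismatch between the decoder and non-text modality information in multimodal large language models (MLLMs). Speech and visual features (such as speaker identity, emotion, and texture) are well preserved across model layers, but the decoder, trained only on text, cannot use these non-text-aligned directions, so they act as noise. The author proposes a Generalized Mutual Information (GMI) framework to characterize this bottleneck and uses controlled experiments and a LoRA intervention to show that the decoder's scoring rule, not the encoder or adapter design, is the fundamental limit.

Details Motivation: Multimodal LLMs accept speech and image input yet fail to grasp fine-grained perceptual attributes such as a speaker's vocal qualities or an object's surface texture. The paper asks whether the root cause is an encoding failure or a mismatched decoding mechanism. Method: Linear probes quantify how well speaker identity, emotion, and visual attributes are preserved at each layer; a Generalized Mutual Information (GMI) framework models the upper bound on what the decoder can extract from non-text modalities; the analysis is validated on five models spanning speech and vision; a controlled experiment (Prismatic VLMs differing only in encoder text-alignment) and a LoRA fine-tuning experiment (trained with an emotion objective) are conducted. Result: 1) Fine-grained speech and visual attributes are well preserved across LLM layers (3-55× above chance); 2) removing 64-71% of modality-specific variance lowers decoder loss, indicating that this variance is noise to the decoder; 3) the GMI bound predicts the degradation trend in information accessibility; 4) the controlled experiment shows that the bottleneck is the decoder's scoring rule rather than the encoder; 5) LoRA training on emotion improves emotion accessibility by 7.5% without affecting other attributes. Conclusion: The information bottleneck in multimodal LLMs lies in the decoder, not the encoder: its text-oriented scoring rule inherently excludes semantic directions that are not text-aligned. Improving a specific perceptual ability requires an explicit decoding objective for it, rather than only optimizing the encoder or projection module. Abstract: Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.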

[42] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal,Yannis Katsis,Vraj Shah,Lihong He,Lucian Popa,Marina Danilevsky

Main category: cs.CL

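TL;DR: This paper presents MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval-augmented generation (RAG). It contains 666 tasks and over 2,800 conversation turns with accompanying corpora, and shows that current models struggle with unanswerable, underspecified, and non-standalone questions and with unclear responses.

Details Motivation: To explore unresolved challenges in multi-turn retrieval-augmented generation (RAG), especially complex conversations involving unanswerable, underspecified, and non-standalone questions and unclear responses. Method: Builds and releases the MTRAG-UN benchmark, covering 6 domains, 666 tasks, more than 2,800 conversation turns, and accompanying retrieval corpora; evaluates retrieval and generation models on the benchmark. Result: Existing retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses, which exposes their limitations. Conclusion: MTRAG-UN provides a standardized testbed for multi-turn RAG research and highlights the need for models to better understand and respond to complex conversational states. Abstract: We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark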

[43] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Chungpa Lee,Jy-yong Sohn,Kangwook Lee

Main category: cs.CL

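TL;DR: This paper studies how fine-tuning large language models affects in-context learning. Theoretical analysis shows that fine-tuning all attention parameters can harm few-shot learning, whereas updating only the value matrix improves zero-shot performance while preserving in-context learning. Adding an auxiliary few-shot loss strengthens in-context learning on the target task but weakens it on unseen tasks.

Details Motivation: Fine-tuning improves zero-shot performance and lowers inference cost, but it often degrades in-context learning on unseen tasks; the mechanism needs to be understood so that fine-tuning strategies can serve both zero-shot performance and in-context learning. Method: A theoretical analysis based on linear attention models characterizes how different fine-tuning objectives (updating all attention parameters, updating only the value matrix, adding an auxiliary few-shot loss) change the attention parameters, with experiments to validate the theory. Result: Fine-tuning all attention parameters can harm few-shot performance; updating only the value matrix improves zero-shot performance while preserving in-context learning; an auxiliary few-shot loss enhances in-context learning on the target task but degrades it on other tasks. Conclusion: Fine-tuning strategies need careful design: restricting the updated parameters (for example, to the value matrix) is an effective way to balance zero-shot performance and in-context learning, while an auxiliary task loss gives a local gain at the possible cost of cross-task generalization. Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.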

[44] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li,Dilxat Muhtar,Lu Yin,Tianlong Chen,Shiwei Liu

Main category: cs.CL

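TL;DR: This paper proposes NAP, which restructures training data (multiple independent reasoning trajectories) and uses a parallel-forced decoding strategy to mitigate the tendency of diffusion language models (DLMs) to fall back to autoregressive-like behavior in practice, improving math reasoning performance under parallel generation.

Details Motivation: Diffusion language models support parallel generation in theory, but because their training objective is mismatched with highly sequential data (such as standard pretraining corpora and long chain-of-thought), they often fall back to autoregressive-like behavior in practice; achieving genuinely non-autoregressive parallel generation requires changes to data and supervision. Method: Proposes NAP: 1) construct multiple independent reasoning trajectories as training examples; 2) use a parallel-forced decoding strategy that encourages multi-token parallel updates; 3) validate the parallel generation benefit on math reasoning tasks. Result: On math reasoning benchmarks, NAP outperforms DLMs trained on standard long CoT data under parallel decoding, and the gain grows as parallelism increases. Conclusion: Redesigning training data and supervision signals is an effective and principled way to mitigate the autoregressive-like tendency of DLMs and move toward genuinely non-autoregressive parallel generation. Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.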

[45] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu,Jiahui Xu,Feng Jiang,Kuang Wang,Zefeng Zhao,Chu-Ren Huang,Jinghang Gu,Changqing Yin,Haizhou Li

Main category: cs.CL

TL;DR: This paper proposes the DDTSR framework, which uses small-large model synergy, streaming cross-modal collaboration, and curriculum learning for discourse continuity to reduce the response latency of cascaded spoken dialogue systems by 19%-51% while preserving discourse quality.

Details Motivation: Conventional cascaded ASR-LLM-TTS systems have high response latency because speech synthesis cannot begin until transcription and reasoning are both complete, which makes human-like real-time responses difficult. Method: Proposes the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework with three core mechanisms: (1) connective-guided small-large model synergy; (2) streaming cross-modal collaboration (dynamic overlap of ASR, LLM, and TTS); (3) curriculum-learning-based discourse continuity enhancement. Result: On two spoken dialogue benchmarks, DDTSR reduces response latency by 19%-51% while preserving discourse quality; it is plug-and-play, compatible with diverse LLM backbones, and robust across utterance lengths. Conclusion: DDTSR is a practical, scalable low-latency spoken dialogue architecture that supports listen-while-thinking and speak-while-thinking, advancing real-time spoken interaction. Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.

[46] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Sungho Park,Jueun Kim,Wook-Shin Han

Main category: cs.CL

TL;DR: This paper proposes SPARTA, a framework for automatically generating large-scale, high-quality Table-Text multi-hop QA benchmarks that cover aggregation, grouping, and deep cross-modal reasoning, and that expose the weaknesses of existing models on complex cross-modal reasoning.

Details Motivation: Existing Table-Text QA benchmarks are small, manually built and therefore error-prone, and shallow (rarely more than two hops, lacking aggregation, grouping, and similar operations), so they cannot assess complex real-world reasoning. Method: Proposes SPARTA, an end-to-end automatic construction framework: 1) build a reference fact database by extracting atomic facts from text into grounding tables; 2) synthesize nested queries that match a specified hop count; 3) apply provenance-based query refinement and realistic-structure enforcement constrained to post-order traversals, so that the SQL is executable and the questions read naturally. Result: Generates thousands of high-fidelity QA pairs covering aggregation, grouping, and deep multi-hop reasoning; state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points on SPARTA. Conclusion: SPARTA exposes fundamental weaknesses in current cross-modal reasoning models and provides a stricter, scalable evaluation benchmark for real-world Table-Text QA research. Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
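The idea of "nested queries whose number of nested predicates matches the desired hop count" can be illustrated with a toy example using Python's built-in `sqlite3` module. The schema, table names, and data below are invented for illustration and are not taken from SPARTA; `team_fact` and `city_fact` stand in for grounding tables of atomic facts extracted from text.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with one source table and two grounding tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE player(name TEXT, team TEXT);            -- source table
        CREATE TABLE team_fact(team TEXT, city TEXT);         -- facts from text
        CREATE TABLE city_fact(city TEXT, population INTEGER);-- facts from text
        INSERT INTO player VALUES ('A', 'Hawks'), ('B', 'Hawks'), ('C', 'Owls');
        INSERT INTO team_fact VALUES ('Hawks', 'Metro'), ('Owls', 'Smalltown');
        INSERT INTO city_fact VALUES ('Metro', 2000000), ('Smalltown', 50000);
        """
    )
    return conn


# Two nested IN-predicates correspond to two hops (player -> team -> city),
# and COUNT(*) adds an aggregation on top.
TWO_HOP_QUERY = """
SELECT COUNT(*) FROM player
WHERE team IN (
    SELECT team FROM team_fact
    WHERE city IN (
        SELECT city FROM city_fact WHERE population > ?
    )
)
"""


def count_players(conn: sqlite3.Connection, min_population: int) -> int:
    return conn.execute(TWO_HOP_QUERY, (min_population,)).fetchone()[0]


if __name__ == "__main__":
    print(count_players(build_toy_db(), 1_000_000))
```

A natural-language verbalization of this query might read "How many players are on teams based in cities with more than one million inhabitants?", which requires combining table rows with facts that originally lived in text.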

[47] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta,Smruthi Balaji,Sriram Ganapathy

Main category: cs.CL

TL;DR: This paper proposes MiSTER-E, a modular Mixture-of-Experts (MoE) framework that decouples modality-specific modeling from multimodal fusion. It uses fine-tuned LLMs to extract speech and text embeddings and adds a contrastive loss and KL regularization to improve cross-modal consistency, outperforming several baselines on IEMOCAP, MELD, and MOSI.

Details Motivation: Emotion recognition in conversations must both model multi-turn temporal dynamics and fuse cues from multiple modalities, and existing methods struggle to handle modality-specific modeling and cross-modal fusion together. Method: Proposes MiSTER-E: 1) fine-tuned LLMs extract utterance-level speech and text embeddings; 2) a convolutional-recurrent layer enhances context modeling; 3) three experts (speech, text, cross-modal) are fused through dynamic gating; 4) a supervised contrastive loss aligns speech and text representations, and a KL-divergence term regularizes expert predictions; 5) speaker identity is not used. Result: Achieves weighted F1-scores of 70.9%, 69.5%, and 87.9% on IEMOCAP, MELD, and MOSI respectively, outperforming several baseline models, with ablations supporting the contribution of each component. Conclusion: MiSTER-E decouples and jointly addresses modality modeling and fusion in ERC, performs well without speaker information, and supports the value of modular MoE with multi-objective regularization. Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
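The gating step, which weighs the outputs of the speech-only, text-only, and cross-modal experts, can be sketched as a softmax-weighted mixture of class probabilities. This is a generic mixture-of-experts fusion under the assumption that gate logits are given; how MiSTER-E actually computes them from the input is not specified here, and `gated_fusion` is a hypothetical helper.

```python
import numpy as np


def softmax(x):
    x = x - np.max(x)  # shift by the maximum for numerical stability
    e = np.exp(x)
    return e / e.sum()


def gated_fusion(expert_probs, gate_logits):
    """Fuse per-expert class distributions with softmax gate weights.

    expert_probs: (n_experts, n_classes), each row a probability distribution.
    gate_logits:  (n_experts,), assumed to be produced by a learned gating network.
    """
    gate = softmax(np.asarray(gate_logits, dtype=float))
    return gate @ np.asarray(expert_probs, dtype=float)


if __name__ == "__main__":
    probs = np.array([
        [0.7, 0.2, 0.1],   # speech-only expert
        [0.1, 0.8, 0.1],   # text-only expert
        [0.3, 0.3, 0.4],   # cross-modal expert
    ])
    print(gated_fusion(probs, [0.0, 0.0, 0.0]))  # equal weights: plain average
```

Because the gate weights sum to one and every row is a distribution, the fused output is itself a valid distribution over emotion classes.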

[48] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Amita Kamath,Jack Hessel,Khyathi Chandu,Jena D. Hwang,Kai-Wei Chang,Ranjay Krishna

Main category: cs.CL

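TL;DR: This paper argues that the weak reasoning of vision-language models (VLMs) stems from reporting bias in their training data: when people describe images they usually omit the tacit information that reasoning requires. Analyzing the data behind popular VLMs through the lens of pragmatics, the authors find that four reasoning skills (spatial, temporal, negation, and counting) are severely underrepresented and show that this leads to poor model performance. Experiments indicate that scaling data or model size, or adding languages, does not make these skills emerge by default, whereas annotations collected specifically to capture tacit information are effective.

Details Motivation: Vision-language models (VLMs) lack reasoning capabilities, and the authors attribute this to pervasive reporting bias in the training data: people habitually omit the tacit information needed for reasoning when describing images (e.g., "at the game today!" rather than "a photo of 37 people standing behind a field"). Method: Drawing on theories from pragmatics, the authors analyze the datasets behind popular VLMs such as OpenCLIP, LLaVA-1.5, and Molmo to identify and quantify the underrepresentation of spatial, temporal, negation, and counting reasoning; they build curated benchmarks, evaluate models across scale, multilingual training, and data strategies, and compare against adding annotations that make tacit information explicit. Result: (i) VLMs perform poorly on the four types of reasoning suppressed by reporting bias; (ii) scaling data size, model size, or the number of languages does not make these skills emerge by default; (iii) incorporating annotations collected specifically to obtain tacit information is effective. Conclusion: Reasoning capabilities do not emerge from data or model scale alone; more deliberate training data curation is needed, in particular to restore the key reasoning information that reporting bias hides. Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.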

cs.CV [Back]

[49] Enabling clinical use of foundation models in histopathology

Audun L. Henriksen,Ole-Johan Skrede,Lisa van der Schee,Enric Domingo,Sepp De Raedt,Ilyá Kostolomov,Jennifer Hay,Karolina Cyll,Wanja Kildal,Joakim Kalsnes,Robert W. Williams,Manohar Pradhan,John Arne Nesheim,Hanne A. Askautrud,Maria X. Isaksen,Karmele Saez de Gordoa,Miriam Cuatrecasas,Joanne Edwards,TransSCOT group,Arild Nesbakken,Neil A. Shepherd,Ian Tomlinson,Daniel-Christoph Wagner,Rachel S. Kerr,Tarjei Sveinsgjerd Hveem,Knut Liestøl,Yoshiaki Nakamura,Marco Novelli,Masaaki Miyo,Sebastian Foersch,David N. Church,Miangela M. Lacle,David J. Kerr,Andreas Kleppe

Main category: cs.CV

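TL;DR: This paper introduces robustness losses during the training of downstream task models to reduce the sensitivity of histopathology foundation models to technical variation (such as scanner differences), improving generalization and accuracy in real clinical settings without retraining the foundation model.

Details Motivation: Current histopathology foundation models capture not only biologically relevant features but also pre-analytic and scanner-specific variation, which biases the predictions of downstream task models and limits clinical usefulness. Method: Novel robustness losses are introduced during training of downstream task-specific models, and thousands of models are trained on the features of eight popular foundation models in a large experimental setup with 27,042 whole-slide images from 6155 patients. Result: Robustness to technical variation improves substantially and prediction accuracy also increases; the approach mitigates the robustness issues of foundation models without retraining them. Conclusion: The approach offers computational pathology a way to build robust, accurate, clinically applicable models without modifying the foundation model. Abstract: Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that bias the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.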

Liping Meng,Fan Nie,Yunyun Zhang,Chao Han

Main category: cs.CV

TL;DR: This paper proposes MNAS-Unet, a medical image segmentation framework that combines Monte Carlo Tree Search (MCTS) with Neural Architecture Search (NAS). It explores network structures dynamically to improve search efficiency and accuracy, optimizes the DownSC/UpSC units, outperforms existing methods on several datasets, and reduces computational cost and parameter count.

Details Motivation: To improve the efficiency and accuracy of neural architecture search for medical image segmentation, reduce resource consumption, and make practical deployment more feasible. Method: Proposes the MNAS-Unet framework, which combines MCTS and NAS for dynamic architecture search, and optimizes the DownSC and UpSC unit structures for fast and precise model adjustment. Result: Segmentation accuracy exceeds NAS-Unet and other state-of-the-art models on PROMISE12, Ultrasound Nerve, and CHAOS; the search budget is reduced by 54% (139 vs. 300 epochs); the model has only 0.6M parameters and lower GPU memory consumption. Conclusion: MNAS-Unet maintains high segmentation accuracy while improving architecture search efficiency and lowering resource requirements, which suits resource-constrained medical image analysis. Abstract: This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
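The reported 54% reduction in search budget follows directly from the two epoch counts quoted in the abstract, assuming the budget is measured in search epochs:

```latex
\[
1 - \frac{139}{300} = \frac{161}{300} \approx 0.537 \approx 54\%
\]
```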

[51] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

Hanyang Liu,Rongjun Qin

Main category: cs.CV

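TL;DR: This paper proposes AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos that uses geometry lifting and physically constrained optimization to address depth ambiguity and unstable motion estimation in dynamic aerial scene reconstruction.

Details Motivation: Existing 4D scene reconstruction methods are limited under aerial conditions with single-view capture, wide spatial range, high altitude, and small dynamic objects with large motion disparity; the resulting depth ambiguity and unstable motion estimation make monocular aerial reconstruction inherently ill-posed. Method: Proposes the AeroDGS framework, comprising a Monocular Geometry Lifting module (reconstructing static and dynamic geometry) and a Physics-Guided Optimization module (with differentiable ground-support, upright-stability, and trajectory-smoothness priors), which jointly refine the geometry and temporal consistency of static backgrounds and dynamic objects. Result: Experiments on synthetic and real UAV datasets show that AeroDGS outperforms state-of-the-art methods and achieves higher-fidelity reconstruction in dynamic aerial environments. Conclusion: By combining geometric priors with physical constraints, AeroDGS alleviates the ill-posedness of monocular aerial 4D reconstruction and provides a robust approach to modeling dynamic aerial scenes. Abstract: Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.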
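A trajectory-smoothness prior is commonly expressed as a penalty on discrete acceleration. The abstract does not state the form AeroDGS uses, so the second-difference penalty below is an assumed, generic example, and `trajectory_smoothness` is a hypothetical helper.

```python
import numpy as np


def trajectory_smoothness(positions):
    """Mean squared discrete acceleration of a trajectory.

    positions: (T, 3) array of object centers over time, with T >= 3.
    The second finite difference x[t+1] - 2*x[t] + x[t-1] is zero for
    constant-velocity motion, so uniform straight-line motion is not penalized.
    """
    positions = np.asarray(positions, dtype=float)
    accel = positions[2:] - 2.0 * positions[1:-1] + positions[:-2]
    return float(np.mean(np.sum(accel ** 2, axis=1)))


if __name__ == "__main__":
    straight = np.outer(np.arange(5), [1.0, 0.0, 0.0])  # constant velocity along x
    jitter = straight.copy()
    jitter[2, 1] += 1.0                                  # one sideways jump
    print(trajectory_smoothness(straight), trajectory_smoothness(jitter))
```

Since the expression is a plain sum of squares of linear functions of the positions, it is differentiable and could be added to a reconstruction loss with a weighting coefficient.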

[52] Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

Zhengkang Fan,Chengkun Sun,Russell Terry,Jie Xu,Longin Jan Latecki

Main category: cs.CV

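TL;DR: This paper proposes a deep learning framework that needs no manual segmentation and uses an Organ Focused Attention (OFA) loss to predict tumor malignancy directly from 3D renal CT images. On two datasets it reaches an AUC of 0.685 to 0.760 and an F1-score of 0.852 to 0.872, outperforming conventional segmentation-dependent methods.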

Details Motivation: Existing imaging modalities cannot accurately predict renal tumor malignancy, and conventional deep learning approaches depend on manual tumor segmentation, which is time-consuming, costly, and requires expert involvement; an efficient segmentation-free prediction method is needed. Method: A deep learning framework built on an Organ Focused Attention (OFA) loss constrains patch attention so that organ patches attend only to other organ patches, enabling end-to-end malignancy prediction without manual segmentation. Result: AUC = 0.685 and F1 = 0.872 on the private UF IDR dataset, and AUC = 0.760 and F1 = 0.852 on the public KiTS21 dataset, both better than conventional models based on segmentation cropping. Conclusion: The segmentation-free framework maintains strong predictive performance while improving the efficiency and accessibility of clinical deployment, providing a more reliable and practical decision-support tool for renal cancer diagnosis. Abstract: Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
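
The idea that organ patches should attend only to other organ patches can be illustrated with a small penalty term. The sketch below assumes a row-normalized attention matrix and a per-patch organ mask that is available at training time. The function name `organ_focused_attention_penalty` and the exact form of the penalty are hypothetical; the abstract does not specify the OFA loss.

```python
def organ_focused_attention_penalty(attn, organ_mask):
    """Average attention mass that organ patches place on non-organ patches.

    attn:       list of rows, attn[i][j] = attention from patch i to patch j
                (each row is assumed to sum to 1).
    organ_mask: list of booleans, True where the patch belongs to the organ.

    ASSUMPTION: an illustrative penalty only, not the paper's OFA loss.
    """
    organ_rows = [i for i, is_organ in enumerate(organ_mask) if is_organ]
    if not organ_rows:
        return 0.0
    leaked = 0.0
    for i in organ_rows:
        # attention that "leaks" from an organ patch to background patches
        leaked += sum(a for a, is_organ in zip(attn[i], organ_mask) if not is_organ)
    return leaked / len(organ_rows)


attn = [
    [0.5, 0.5, 0.0],  # organ patch, attends only to organ patches
    [0.2, 0.3, 0.5],  # organ patch, half of its attention leaks to background
    [0.1, 0.1, 0.8],  # background patch, ignored by the penalty
]
organ_mask = [True, True, False]
penalty = organ_focused_attention_penalty(attn, organ_mask)
```

Minimizing a term of this kind during training would push organ-to-background attention toward zero, so the mask would no longer be needed at deployment time, consistent with the abstract's claim.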

[53] Vision Transformers Need More Than Registers

Cheng Shi,Yizhou Yu,Sibei Yang

Main category: cs.CV

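TL;DR: Through systematic analysis, this paper finds that a performance bottleneck in Vision Transformers (ViTs) stems from "lazy aggregation", in which the model uses semantically irrelevant background patches as shortcuts to represent global semantics. It proposes selectively integrating patch features into the CLS token, which mitigates the problem and consistently improves performance on 12 benchmarks across different supervision paradigms.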

Details Motivation: ViTs exhibit artifacts across many supervision paradigms and downstream tasks, but the underlying mechanism is not well understood and needs to be clarified and addressed. Method: Systematic analysis identifies the nature of "lazy aggregation" in ViTs, and a mechanism that selectively fuses patch features into the CLS token is proposed to suppress background-dominated shortcut learning. Result: The method yields consistent gains on 12 benchmarks covering label supervision, text supervision, and self-supervision. Conclusion: Common ViT artifacts arise from lazy aggregation caused jointly by global attention and coarse-grained semantic supervision; controlling how the CLS token aggregates features mitigates the problem and offers a new perspective for understanding and improving ViTs. Abstract: Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

[54] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie,Anas Mahmoud,Aldo Zaimi,Arsene Fansi Tchango,Steven L. Waslander

Main category: cs.CV

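TL;DR: This paper proposes DeBias-CLIP, which removes the summary sentence from long captions and applies sentence sub-sampling and text token padding to mitigate CLIP's opening-sentence shortcut bias in long-text alignment. It improves both long-text and short-text retrieval without additional trainable parameters.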

Details Motivation: CLIP pretraining relies mainly on images paired with short captions, which leaves the model weak at aligning complex scenes and dense descriptions; existing long-caption fine-tuning methods are still affected by the structural bias of summary-first captions, so the model over-attends to the first sentence and neglects the rest. Method: DeBias-CLIP (1) removes the leading summary sentence from long captions during training, (2) applies sentence sub-sampling, and (3) introduces text token padding to balance supervision across positions. Result: State-of-the-art long-text retrieval, improved short-text retrieval, greater robustness to sentence-order perturbation, and drop-in replacement of Long-CLIP with no increase in parameter count. Conclusion: DeBias-CLIP mitigates CLIP's structural bias in long-text alignment, improves text-image matching at multiple granularities, and provides a more balanced and robust training recipe for multimodal representation learning. Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
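
The three caption-side steps described above (drop the summary sentence, sub-sample the remaining sentences, pad the tokens) can be sketched as a simple preprocessing function. Everything below is a hedged illustration: the name `debias_caption`, the period-based sentence split, whitespace tokenization, the `keep_ratio` value, and the choice of a random left offset for padding are all assumptions, since the abstract does not give these details.

```python
import random


def debias_caption(caption, keep_ratio=0.5, max_tokens=16, pad_token="<pad>", rng=None):
    """Illustrative caption preprocessing in the spirit of DeBias-CLIP.

    ASSUMPTIONS: naive '.' sentence splitting, whitespace tokens, and padding
    at a random left offset so that text lands on varying token positions.
    """
    rng = rng or random.Random(0)
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    # 1) drop the opening summary sentence when there is more than one sentence
    detail = sentences[1:] if len(sentences) > 1 else sentences
    # 2) keep a random subset of the remaining sentences, in original order
    k = max(1, round(len(detail) * keep_ratio))
    kept = sorted(rng.sample(range(len(detail)), k))
    tokens = " ".join(detail[i] for i in kept).split()[:max_tokens]
    # 3) pad to a fixed length, shifting the text by a random offset
    room = max_tokens - len(tokens)
    left = rng.randint(0, room)
    return [pad_token] * left + tokens + [pad_token] * (room - left)


caption = "A dog plays in a park. The dog is brown. It chases a red ball. Trees line the path."
tokens = debias_caption(caption)
```

A real pipeline would use the model's own tokenizer and padding convention; the sketch only shows how the three steps compose.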

[55] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Yibo Peng,Peng Xia,Ding Zhong,Kaide Zeng,Siwei Han,Yiyang Zhou,Jiaqi Liu,Ruiyi Zhang,Huaxiu Yao

Main category: cs.CV

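TL;DR: This paper introduces the Visualized-Question (VQ) setting to diagnose whether multimodal large models truly read text in images, and finds that models exhibit "modality laziness". It then proposes SimpleOCR, a training strategy that renders text queries onto images with randomized styles to force reliance on visual OCR ability, which substantially improves generalization while being data-efficient.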

Details Motivation: To investigate whether multimodal large language models (MLLMs) genuinely use text in images (visual grounding) or merely rely on parametric shortcuts in the text prompt, exposing their "modality laziness". Method: The Visualized-Question (VQ) diagnostic setting renders text queries directly onto images to force visual engagement; the SimpleOCR training strategy converts training samples into VQ format with randomized style perturbations, removing text shortcuts and activating the visual OCR pathway. Result: On Qwen2.5-VL, the VQ setting causes a performance drop of up to 12.7%; on four OOD benchmarks SimpleOCR surpasses the base model by 5.4% and GRPO by 2.7%, outperforms recent RL methods with only 8.5K samples (30x fewer), and gives complementary gains when combined with advanced RL strategies such as NoisyRollout. Conclusion: MLLMs commonly under-use visual understanding of in-image text; SimpleOCR mitigates this through a structural training constraint, showing that visual grounding can be improved through data and training design rather than architectural changes. Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. 
Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.

[56] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando,Rosario Forte,Antonino Furnari

Main category: cs.CV

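TL;DR: This paper examines the feasibility of real-time online episodic memory question answering with multimodal large language models (MLLMs) on edge devices, and proposes a two-thread streaming architecture that approaches cloud-level performance under resource constraints.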

Details Motivation: Cloud offloading is common but raises privacy and latency concerns, so an edge-side solution for real-time episodic memory question answering on wearable assistants is needed. Method: An asynchronous two-thread streaming QA pipeline is designed, with a Descriptor Thread that continuously converts video into a lightweight textual memory and a QA Thread that reasons over that memory to answer queries; edge-deployed MLLM performance is evaluated on the QAEgo4D-Closed benchmark. Result: An end-to-end configuration on a consumer-grade 8GB GPU reaches 51.76% accuracy with a time-to-first-token (TTFT) of 0.41 s; a local enterprise-grade server reaches 54.40% accuracy with a TTFT of 0.88 s; the cloud-based solution reaches 56.00% accuracy. Conclusion: Edge-side MLLM solutions strike a good balance between privacy protection and real-time performance and show practical potential. Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants; hence, we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
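
The two-thread structure described above can be sketched with Python's standard `threading` and `queue` modules. This is only a structural illustration under stated assumptions: the captioning and answering steps are replaced by trivial string operations, and the names `TextMemory`, `descriptor_worker`, and `qa_worker` are hypothetical rather than taken from the paper.

```python
import queue
import threading


class TextMemory:
    """Thread-safe, append-only textual memory shared by both threads."""

    def __init__(self):
        self._lines = []
        self._lock = threading.Lock()

    def append(self, line):
        with self._lock:
            self._lines.append(line)

    def snapshot(self):
        with self._lock:
            return list(self._lines)


def descriptor_worker(frames, memory):
    # ASSUMPTION: formatting a string stands in for an MLLM describing a video chunk.
    for timestamp, content in frames:
        memory.append(f"[{timestamp}s] {content}")


def qa_worker(questions, memory, answers):
    # ASSUMPTION: keyword lookup stands in for an MLLM reasoning over the memory.
    while True:
        keyword = questions.get()
        if keyword is None:  # sentinel: stop the thread
            break
        hits = [line for line in memory.snapshot() if keyword.lower() in line.lower()]
        answers[keyword] = hits[-1] if hits else "not seen yet"


memory = TextMemory()
questions = queue.Queue()
answers = {}
frames = [(0, "user puts keys on the table"), (5, "user opens the fridge")]

descriptor = threading.Thread(target=descriptor_worker, args=(frames, memory))
qa = threading.Thread(target=qa_worker, args=(questions, memory, answers))
descriptor.start()
qa.start()
descriptor.join()  # demo only: make sure the memory is filled before asking
questions.put("keys")
questions.put(None)
qa.join()
```

Because the QA thread reads only the compact text memory, it never needs the raw video, which is the property that would make such a design attractive on memory-constrained edge hardware.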

[57] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov

Main category: cs.CV

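TL;DR: This paper presents MammoWise, a local, modular multi-model pipeline that uses open-source vision language models (VLMs) for mammography report generation and multi-task classification (such as BI-RADS assessment and breast density classification). It supports zero-shot and few-shot prompting, Chain-of-Thought, and Retrieval Augmented Generation (RAG), and is validated on the VinDr-Mammo and DMID datasets.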

Details Motivation: Existing VLMs for breast screening depend on closed cloud systems or tightly coupled architectures, which leads to privacy risks, poor reproducibility, and limited adaptability; high-volume, time-sensitive, documentation-heavy clinical workflows need local, reproducible, and customizable AI-assisted reporting tools. Method: MammoWise is a local multi-model pipeline that supports any Ollama-hosted VLM; integrates zero-shot, few-shot, and Chain-of-Thought prompting; adds multimodal RAG over a vector database to inject case context; applies QLoRA for parameter-efficient fine-tuning of MedGemma; and evaluates report quality (BERTScore, ROUGE-L) and multi-task classification (BI-RADS, density, calcification, and others) on VinDr-Mammo and DMID. Result: Report generation is consistently strong, and few-shot prompting and RAG improve quality; classification is feasible but depends heavily on model and dataset; after QLoRA fine-tuning of MedGemma, BI-RADS accuracy reaches 0.7545, density accuracy 0.8840, and calcification accuracy 0.9341, without harming report quality. Conclusion: MammoWise offers a practical, unified, reproducible, and extensible framework for deploying local VLMs for mammography reporting, balancing clinical utility with research reproducibility and supporting privacy-preserving medical AI. Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. 
MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.

[58] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Niamul Hassan Samin,Md Arifur Rahman,Abdullah Ibne Hanif,Juena Ahmed Noshin,Md Ashikur Rahman

Main category: cs.CV

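TL;DR: This paper proposes Spatial Credit Redistribution (SCR), a training-free inference-time intervention that mitigates object hallucination in vision-language models (VLMs) by addressing "spatial credit collapse". Guided by low-entropy inputs, SCR redistributes the activation of high-attention patches to their context, substantially lowering hallucination rates across several models and benchmarks while preserving generation quality (CIDEr) at very low computational cost.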

Details Motivation: Vision-language models (VLMs) often hallucinate objects that are absent from the image; the authors attribute this to "spatial credit collapse", in which activation credit in early transformer layers concentrates on sparse visual patches, weakening contextual evidence and increasing reliance on language priors. Method: Spatial Credit Redistribution (SCR) is a training-free inference-time intervention that identifies high-attention source patches, guided by low-entropy inputs, and redistributes their hidden-state activation to neighboring context regions according to attention weights, rebalancing spatial credit. Result: On POPE and CHAIR, SCR lowers the POPE-Adversarial hallucination rate by 4.7-6.0 percentage points, CHAIR-s by 3.7-5.2 percentage points (42-51% relative), and CHAIR-i by 2.7-4.4 percentage points (44-58% relative), while CIDEr drops by at most 0.8 percentage points; inference latency increases by only 43-56 ms, well below OPERA, VCD, and OVCD, and SCR Pareto-dominates them on hallucination rate and CIDEr; ablations confirm that attention-guided source selection is essential. Conclusion: Spatial credit collapse is a key mechanism behind VLM hallucination, and SCR mitigates it through lightweight, general, training-free redistribution of spatial activation, balancing performance, efficiency, and practicality for real-time VLM deployment. Abstract: Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. 
A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
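
To make the redistribution idea concrete, here is a toy one-dimensional sketch. It assumes patches lie in a row, picks the highest-attention patches as sources, and moves a fraction of their hidden-state activation to their immediate neighbors when the attention distribution has low entropy. The function names, the entropy gate, the `alpha` fraction, and the uniform split between neighbors are all illustrative assumptions, not the SCR procedure as published.

```python
import math


def attention_entropy(attn):
    """Shannon entropy (in nats) of an attention distribution over patches."""
    return -sum(p * math.log(p) for p in attn if p > 0)


def redistribute_credit(hidden, attn, top_k=1, alpha=0.5, entropy_threshold=1.0):
    """Move a fraction `alpha` of activation from top-attention patches to neighbors.

    ASSUMPTIONS: 1D patch layout, uniform split between left/right neighbors,
    and an entropy gate that applies the intervention only to peaked attention.
    """
    out = [row[:] for row in hidden]
    if attention_entropy(attn) >= entropy_threshold:
        return out  # attention already spread out: leave activations untouched
    n = len(hidden)
    sources = sorted(range(n), key=lambda i: attn[i], reverse=True)[:top_k]
    for s in sources:
        neighbors = [j for j in (s - 1, s + 1) if 0 <= j < n]
        if not neighbors:
            continue
        for d in range(len(hidden[s])):
            moved = alpha * hidden[s][d]
            out[s][d] -= moved
            for j in neighbors:
                out[j][d] += moved / len(neighbors)
    return out


hidden = [[1.0], [8.0], [1.0]]  # one feature per patch, credit piled on patch 1
attn = [0.05, 0.9, 0.05]        # sharply peaked (low-entropy) attention
rebalanced = redistribute_credit(hidden, attn)
```

In this toy version the total activation is conserved while the dominant patch gives up half of its credit to its context. The ablation reported in the abstract (random source selection performs worse) corresponds to replacing the attention-based `sources` choice with a random one.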

[59] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Guoyizhe Wei,Yang Jiao,Nan Xi,Zhishen Huang,Jingjing Meng,Rama Chellappa,Yan Gao

Main category: cs.CV

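TL;DR: This paper proposes Pix2Key, which represents queries and candidate images as open-vocabulary visual dictionaries to enable intent-aware constraint matching and diversity-aware reranking in a unified embedding space. Combined with the self-supervised pretraining component V-Dict-AE for better fine-grained attribute understanding, it improves retrieval performance on the DFMM-Compose benchmark.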

Details Motivation: Classic fusion pipelines rely on supervised triplets and tend to lose fine-grained cues, while existing zero-shot methods often merge an image caption with the edit text, which can miss implicit user intent and return repetitive results. Method: The Pix2Key framework models queries and candidate images as open-vocabulary visual dictionaries, and a self-supervised pretraining component, V-Dict-AE, strengthens the dictionary representation using only image data. Result: On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points; adding V-Dict-AE yields a further 2.3 points and improves intent consistency and list diversity. Conclusion: Pix2Key delivers more accurate, diverse, and intent-consistent composed image retrieval and strengthens fine-grained understanding without CIR-specific supervision. Abstract: Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
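
For readers unfamiliar with the Recall@10 metric quoted above, the sketch below computes Recall@K under the common composed-retrieval convention of one ground-truth target per query. The function name and the toy data are illustrative; the benchmark's exact evaluation protocol is not described in the abstract.

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose target appears among the top-k retrieved items.

    ASSUMPTION: exactly one ground-truth target per query.
    """
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


ranked_lists = [
    ["img_b", "img_a", "img_c"],  # target img_a is at rank 2: counts as a hit for k=2
    ["img_d", "img_e", "img_f"],  # target img_f is at rank 3: a miss for k=2
]
targets = ["img_a", "img_f"]
r_at_2 = recall_at_k(ranked_lists, targets, k=2)
r_at_3 = recall_at_k(ranked_lists, targets, k=3)
```

A gain of "3.2 points" in Recall@10 therefore means that roughly 3.2 more queries per hundred have their target in the top ten results.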

[60] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

Agamdeep S. Chopra,Caitlin Neher,Tianyi Ren,Juampablo E. Heras Rivera,Mehmet Kurt

Main category: cs.CV

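TL;DR: This paper proposes the DisQ-HNet (DQH) framework, which synthesizes tau-PET images from T1-weighted and FLAIR MRI and uses Partial Information Decomposition (PID) with a Half-UNet decoder to expose each MRI modality's contribution to the prediction, maintaining reconstruction quality while improving performance on downstream Alzheimer's disease tasks.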

Details Motivation: Tau-PET is an in vivo marker of Alzheimer's disease pathology, but its high cost and limited availability call for MRI-based alternatives. Method: DisQ-HNet combines (i) a PID-guided vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail through pseudo-skip connections driven by structural edge cues, avoiding direct reuse of encoder features. Result: Compared with multiple baselines (VAE, VQ-VAE, UNet), DisQ-HNet maintains high reconstruction fidelity and better preserves disease-relevant signal, improving downstream AD tasks such as Braak staging, tau localization, and classification; PID-Shapley analysis provides modality-specific attribution. Conclusion: DisQ-HNet offers a new approach to non-invasive, low-cost tau imaging, and its interpretable information decomposition and decoding mechanism improve the reliability and clinical applicability of MRI-to-tau-PET synthesis. Abstract: Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer's disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.

[61] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang,Yiming Zeng,Lufan Ma,Zeqing Fu,Chen Bai,Ziyao Lin,Cheng Lu

Main category: cs.CV

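TL;DR: This paper proposes DrivePTS, which uses progressive learning, multi-view hierarchical text descriptions, and a frequency-guided structure loss to address condition dependency, weak semantics, and structural distortion in existing driving scene generation, achieving high-fidelity and highly controllable synthesis of diverse scenes.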

Details Motivation: Existing diffusion-based driving scene synthesis methods suffer from implicit dependencies between geometric conditions that cause generation failures, brief captions that give weak semantic modeling, and a uniformly weighted denoising loss that neglects foreground structural detail. Method: The DrivePTS framework (1) uses a progressive learning strategy with an explicit mutual information constraint to reduce dependency between geometric conditions; (2) uses a vision-language model to generate multi-view hierarchical descriptions across six semantic aspects for stronger semantic guidance; and (3) introduces a frequency-guided structure loss to better model high-frequency structural detail. Result: State-of-the-art performance on multiple metrics, with markedly better fidelity and controllability, and successful synthesis of rare driving scenes on which prior methods fail. Conclusion: DrivePTS addresses the three bottlenecks of condition dependency, coarse semantics, and blurred structure, and shows strong generalization and practical potential. Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. 
Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

[62] SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

Kang Han,Wei Xiang,Lu Yu,Mathew Wyatt,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

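TL;DR: SwiftNDC is a fast, general 3D reconstruction framework built on a Neural Depth Correction field. It produces cross-view consistent depth maps, builds a high-quality dense point cloud, and uses that cloud as a strong geometric prior to accelerate and improve 3D Gaussian Splatting (3DGS) for mesh reconstruction and novel-view synthesis.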

Details Motivation: Existing depth-guided 3D reconstruction methods still suffer from scale drift, multi-view inconsistency, and the need for substantial post-processing, so a more robust and efficient geometric initialization is needed. Method: A Neural Depth Correction field produces cross-view consistent depth maps; back-projection and reprojection-error filtering build a dense, uniform, clean point cloud; this point cloud is fed to 3D Gaussian Splatting as a geometric prior for mesh reconstruction or novel-view synthesis. Result: Across five datasets (two for mesh reconstruction, three for novel-view synthesis), SwiftNDC reduces the time needed for mesh reconstruction and improves novel-view rendering quality, reaching higher-fidelity surfaces with fewer optimization iterations than baselines. Conclusion: Combining neural depth refinement with robust geometric initialization achieves both high fidelity and efficiency in 3D reconstruction and provides better initial geometry for methods such as 3DGS. Abstract: Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
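
The point-cloud step (back-projection followed by reprojection-error filtering) can be illustrated with a pinhole camera model. The sketch below makes several simplifying assumptions that are not from the paper: two cameras share intrinsics, the second camera is the first one shifted along the x axis by `baseline_x`, and a point is kept when its depth agrees with the second view's depth map within a relative tolerance. All function names are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a given depth to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def filter_points(depth_a, depth_b, intrinsics, baseline_x, rel_tol=0.05):
    """Keep points from view A whose depth is consistent with view B.

    ASSUMPTION: camera B equals camera A translated by `baseline_x` along x,
    and consistency is a relative depth check; real pipelines use full poses.
    """
    fx, fy, cx, cy = intrinsics
    kept = []
    for v, row in enumerate(depth_a):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # invalid depth
            p_a = backproject(u, v, d, fx, fy, cx, cy)
            p_b = (p_a[0] - baseline_x, p_a[1], p_a[2])  # same point in B's frame
            ub, vb = project(p_b, fx, fy, cx, cy)
            ui, vi = round(ub), round(vb)
            if not (0 <= vi < len(depth_b) and 0 <= ui < len(depth_b[0])):
                continue  # point falls outside view B
            if abs(depth_b[vi][ui] - p_b[2]) <= rel_tol * p_b[2]:
                kept.append(p_a)
    return kept


depth_a = [[2.0, 2.0, 2.0] for _ in range(3)]
depth_b = [[2.0, 2.0, 2.0], [5.0, 2.0, 2.0], [2.0, 2.0, 2.0]]  # one inconsistent pixel
points = filter_points(depth_a, depth_b, intrinsics=(2.0, 2.0, 1.0, 1.0), baseline_x=1.0)
```

In this toy scene, three pixels of view A project outside view B and one lands on the inconsistent depth, so five of nine points survive. Filtering of this kind is what would yield a clean initialization for downstream splatting.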

[63] Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

Peihan Wu,Guanjie Cheng,Yufei Tong,Meng Xi,Shuiguang Deng

Main category: cs.CV

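TL;DR: This paper proposes QARMVC, a quality-aware robust multi-view clustering framework that uses an information bottleneck mechanism to quantify sample-level noise intensity and applies quality-weighted contrastive learning at the feature level and quality-weighted aggregation at the fusion level to improve clustering under heterogeneous noise.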

Details Motivation: Existing robust multi-view clustering methods mostly rest on a simple binary noise assumption (clean or fully corrupted) and overlook the heterogeneous observation noise common in practice, whose intensity varies continuously. Method: The QARMVC framework (1) uses an information bottleneck for view reconstruction and takes the reconstruction discrepancy as a fine-grained measure of contamination, producing instance-level quality scores; (2) at the feature level, designs a quality-weighted contrastive objective to suppress noise propagation; and (3) at the fusion level, builds a high-quality global consensus through quality-weighted aggregation and uses mutual information maximization to align and rectify local views. Result: Experiments on five benchmark datasets show that QARMVC consistently outperforms state-of-the-art methods under heterogeneous noise. Conclusion: By modeling continuously varying noise quality, QARMVC improves the robustness and accuracy of multi-view clustering in complex real-world noise conditions. Abstract: Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
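
The link between reconstruction discrepancy, quality scores, and quality-weighted fusion can be sketched in a few lines. The mapping from error to score used here, `exp(-error / temperature)`, and the function names are assumptions chosen for illustration; the abstract does not state how QARMVC derives or normalizes its scores.

```python
import math


def quality_scores(recon_errors, temperature=1.0):
    """Map per-view reconstruction errors to quality scores in (0, 1].

    ASSUMPTION: exponential decay is an illustrative choice, not the paper's.
    """
    return [math.exp(-err / temperature) for err in recon_errors]


def fuse_views(view_embeddings, recon_errors):
    """Quality-weighted average of one instance's embeddings across views."""
    weights = quality_scores(recon_errors)
    total = sum(weights)
    dim = len(view_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, view_embeddings)) / total
        for d in range(dim)
    ]


views = [[1.0, 0.0], [0.0, 1.0]]          # embeddings of one instance in two views
balanced = fuse_views(views, [0.0, 0.0])  # both views equally clean
skewed = fuse_views(views, [0.0, 100.0])  # second view is heavily corrupted
```

Because the weights vary continuously with the error, a mildly noisy view is down-weighted rather than discarded, which is the contrast with the binary clean-or-corrupted assumption criticized in the abstract.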

[64] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian Xie,Shitong Shao,Lichen Bai,Zikai Zhou,Bojun Cheng,Shuo Yang,Jun Wu,Zeke Xie

Main category: cs.CV

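TL;DR: This paper revisits guidance methods for diffusion models, shows that human preference models used in current evaluation are biased toward large guidance scales, and proposes a guidance-aware evaluation framework (GA-Eval) and a counterfactual method, TDG. Under fair evaluation, most new guidance methods do not surpass basic CFG.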

Details Motivation: Evaluation of current diffusion guidance methods is seriously biased: human preference models over-favor large guidance scales, so metrics rise even as image quality falls, and a fairer and more reliable evaluation approach is needed. Method: A guidance-aware evaluation framework (GA-Eval) uses guidance scale calibration to separate effects orthogonal and parallel to CFG; a counterfactual method, Transcendent Diffusion Guidance (TDG), is designed to expose the evaluation loophole; and eight mainstream guidance methods are evaluated under both the conventional and GA-Eval frameworks. Result: Under the fair GA-Eval framework, simply raising the CFG scale matches most new guidance methods; all evaluated methods show winning rates markedly below standard CFG; TDG scores highly under conventional evaluation but does not work in practice. Conclusion: Progress in diffusion guidance has been exaggerated by biased evaluation, and the community needs to rebuild its evaluation paradigm around real generation quality rather than one-sided metric gains. Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. 
Notably, simply increasing the CFG scales can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
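
Since the argument hinges on the CFG scale, it may help to recall how classifier-free guidance combines two noise predictions. The sketch below uses the widely used form in which the guided prediction is the unconditional one plus `scale` times the difference between conditional and unconditional predictions; some codebases parameterize the scale differently (for example as `1 + w`), so the convention here is an assumption, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from unconditional toward conditional.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further along the conditional direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 2.0]
eps_cond = [1.0, 2.0]
guided_low = cfg_combine(eps_uncond, eps_cond, scale=1.0)
guided_high = cfg_combine(eps_uncond, eps_cond, scale=7.5)
```

Larger scales amplify only the component on which the two predictions disagree. This is why a preference model that rewards strong prompt alignment can be satisfied by turning the scale up, and why a calibrated comparison against CFG at a matched scale, as GA-Eval proposes, is informative.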

[65] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Tianyu Chen,Wei Xiang,Kang Han,Yu Lu,Di Wu,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

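TL;DR: GIFSplat proposes a purely feed-forward iterative refinement framework for reconstructing 3D Gaussian splats from sparse unposed images, using residual updates and a distilled diffusion prior to achieve efficient, high-quality reconstruction that balances inference speed and generalization.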

Details Motivation: Existing feed-forward 3D reconstruction methods are limited by a one-shot prediction paradigm: they adapt poorly to out-of-domain data, lack inference-time refinement, and lose real-time performance once a generative prior is introduced. Method: GIFSplat (1) applies several feed-forward residual updates based on rendering evidence to progressively refine the 3D Gaussian scene, and (2) distills a frozen diffusion prior into Gaussian-level cues, without backpropagation or view-set expansion, enabling per-scene adaptation driven by the generative prior. Result: On DL3DV, RealEstate10K, and DTU, GIFSplat clearly outperforms state-of-the-art feed-forward methods, improving PSNR by up to +2.1 dB while keeping second-scale inference and requiring neither camera poses nor test-time gradient optimization. Conclusion: GIFSplat bridges feed-forward efficiency and generative-prior enhancement, showing that iterative refinement and prior distillation can improve reconstruction quality and robustness without giving up feed-forward properties. Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.

[66] Causal Motion Diffusion Models for Autoregressive Motion Generation

Qing Yu,Akihisa Watanabe,Kent Fujiwara

Main category: cs.CV

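TL;DR: This paper proposes the Causal Motion Diffusion Model (CMDM), which performs autoregressive motion generation with a causal diffusion transformer in a semantically aligned latent space, addressing the shortcomings of existing methods in temporal causality and real-time use.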

Details Motivation: Existing motion diffusion models either rely on full-sequence bidirectional diffusion, which lacks temporal causality and is hard to use in real time, or on unstable autoregressive models that accumulate errors. This work aims for a unified framework that combines semantic fidelity, temporal coherence, and efficient inference. Method: The CMDM framework includes (1) a Motion-Language-Aligned Causal VAE (MAC-VAE) that encodes motion into temporally causal latent representations; (2) an autoregressive diffusion transformer on that latent space, trained with causal diffusion forcing; and (3) a frame-wise sampling schedule with causal uncertainty for fast frame-by-frame inference. Result: On HumanML3D and SnapMoGen, CMDM outperforms existing diffusion and autoregressive models in semantic fidelity and temporal smoothness while substantially reducing inference latency, and supports high-quality text-driven motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Conclusion: CMDM combines the generation quality of diffusion modeling with the temporal causality of autoregressive modeling, offering a new approach to real-time, controllable, high-fidelity human motion synthesis. Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
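
One way to picture a frame-wise schedule in which later frames are predicted from partially denoised earlier frames is a staggered table of noise levels. The sketch below is an assumption-laden illustration: the linear staggering, the `delay` parameter, and the function name `staggered_noise_schedule` are hypothetical, and the paper's actual schedule may differ.

```python
def staggered_noise_schedule(num_frames, num_levels, delay):
    """Return schedule[step][frame] = remaining noise level of that frame.

    Level `num_levels` means pure noise and 0 means fully denoised.
    ASSUMPTION: each frame starts denoising `delay` steps after the previous one.
    """
    total_steps = num_levels + (num_frames - 1) * delay
    schedule = []
    for step in range(total_steps + 1):
        row = [
            min(num_levels, max(0, num_levels + frame * delay - step))
            for frame in range(num_frames)
        ]
        schedule.append(row)
    return schedule


schedule = staggered_noise_schedule(num_frames=3, num_levels=4, delay=2)
```

With these toy settings, by the time frame 1 begins to denoise (step 2), frame 0 is only halfway clean, so frame 1 conditions on a partially denoised predecessor. The total number of steps grows by `delay` per extra frame rather than by `num_levels`, which is where a latency saving over strictly sequential denoising would come from.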

[67] Don't let the information slip away

Taozhe Li

Main category: cs.CV

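TL;DR: This paper proposes Association DETR, a new object detection model that exploits background context to improve detection, and reports state-of-the-art results on the COCO val2017 dataset.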

Details Motivation: Mainstream object detectors (such as the YOLO and RT-DETR series) focus heavily on foreground object features and neglect the semantic context provided by the background, even though background information helps localize and recognize objects. Method: The Association DETR model explicitly models semantic associations between foreground objects and the background and fuses this context to improve detection. Result: On COCO val2017, it achieves state-of-the-art results, surpassing YOLOv12 (55.2 mAP) and RT-DETRv2 (53.4 mAP). Conclusion: Modeling background context improves object detection, and Association DETR supports the effectiveness of this idea. Abstract: Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads than in offices, while wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

[68] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Yuci Han,Charles Toth,John E. Anderson,William J. Shuart,Alper Yilmaz

Main category: cs.CV

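TL;DR: BetterScene is a novel view synthesis method for sparse images built on Stable Video Diffusion (SVD). It adds temporal equivariance regularization and vision-foundation-model-aligned representations to the VAE module and feeds in features rendered by 3D Gaussian Splatting, improving view consistency and detail quality for sparse real-world inputs.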

Details Motivation: Existing diffusion-based novel view synthesis methods fine-tune only the UNet and keep the other modules frozen, which leads to inconsistent details across views and artifacts, especially with extremely sparse, unconstrained real photos. Method: Starting from the pretrained SVD model, the VAE module is improved with (1) temporal equivariance regularization to strengthen cross-view feature consistency and (2) latent representations aligned with a vision foundation model; a feed-forward 3D Gaussian Splatting (3DGS) model renders geometry-aware features as input to the SVD enhancer. Result: On the challenging DL3DV-10K dataset, BetterScene outperforms state-of-the-art methods both qualitatively and quantitatively, producing continuous, artifact-free, view-consistent novel views. Conclusion: Unfreezing and jointly optimizing the VAE module of the diffusion model (rather than only the UNet), combined with feature inputs driven by geometric priors, is an effective route to better novel view synthesis from sparse inputs. Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.

[69] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

Ziqi Zhao,Abhijit Mishra,Shounak Roychowdhury

Main category: cs.CV

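TL;DR: LoR-LUT is a unified low-rank formulation for 3D lookup table (LUT) generation. It combines low-rank residual corrections with basis LUTs, keeping trilinear interpolation complexity while using far fewer parameters and improving perceptual quality and interpretability. Trained on MIT-Adobe FiveK, it reproduces expert-level retouching at a sub-megabyte model size and comes with an interactive visualization tool, LoR-LUT Viewer.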

Details Motivation: Existing 3D-LUT methods rely on fusing dense tensors, which means many parameters and poor interpretability, and they struggle to combine compactness with high-quality image enhancement. Method: LoR-LUT models residual corrections as low-rank tensors and optimizes them jointly with the basis LUTs; it keeps the original trilinear interpolation structure and adds learnable low-rank residual terms to improve expressiveness and compactness; the LoR-LUT Viewer tool lets users adjust parameters with sliders and see the effect in real time. Result: On MIT-Adobe FiveK, the method achieves expert-level retouching with a model smaller than 1 MB and significantly fewer parameters at unchanged interpolation complexity; qualitative and quantitative experiments confirm high perceptual fidelity and interpretability. Conclusion: LoR-LUT offers a compact, interpretable, and efficient direction for LUT-based image enhancement and style transfer. Abstract: We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the existing perceptual quality of an image, which is primarily due to the technique's novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity, using a significantly smaller number of network, residual corrections, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image, via a number of slidebars that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
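
To make the combination of a base LUT, a low-rank residual, and trilinear interpolation concrete, here is a minimal pure-Python sketch. It is not the LoR-LUT implementation: the abstract does not specify the factorization, so the rank-1 (CP-style) factors `(a, b, c, d)` and all function names below are assumptions chosen for illustration.

```python
def identity_lut(n):
    """Identity 3D LUT on an n x n x n grid: node (i, j, k) maps to itself."""
    s = n - 1
    return [[[[i / s, j / s, k / s] for k in range(n)]
             for j in range(n)] for i in range(n)]


def low_rank_residual(factors, n):
    """Sum of rank-1 terms a[i] * b[j] * c[k] * d[channel] (assumed CP form)."""
    res = [[[[0.0, 0.0, 0.0] for _ in range(n)]
            for _ in range(n)] for _ in range(n)]
    for a, b, c, d in factors:
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    w = a[i] * b[j] * c[k]
                    for ch in range(3):
                        res[i][j][k][ch] += w * d[ch]
    return res


def add_luts(base, res):
    """Element-wise sum of a base LUT and a residual LUT."""
    n = len(base)
    return [[[[base[i][j][k][ch] + res[i][j][k][ch] for ch in range(3)]
              for k in range(n)] for j in range(n)] for i in range(n)]


def trilinear(lut, rgb):
    """Look up an RGB triple in [0, 1]^3 with trilinear interpolation."""
    n = len(lut)
    pos = [min(max(v, 0.0), 1.0) * (n - 1) for v in rgb]
    lo = [min(int(p), n - 2) for p in pos]       # lower corner of the cell
    fr = [p - l for p, l in zip(pos, lo)]        # fractional offset in the cell
    out = [0.0, 0.0, 0.0]
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((fr[0] if di else 1 - fr[0])
                     * (fr[1] if dj else 1 - fr[1])
                     * (fr[2] if dk else 1 - fr[2]))
                node = lut[lo[0] + di][lo[1] + dj][lo[2] + dk]
                for ch in range(3):
                    out[ch] += w * node[ch]
    return out


def dense_params(n):
    """Parameters of one dense n^3 LUT with 3 output channels."""
    return 3 * n ** 3


def low_rank_params(n, rank):
    """Parameters of `rank` rank-1 terms: three length-n vectors plus 3 channel weights each."""
    return rank * (3 * n + 3)
```

Under these assumptions, a dense 33-point LUT has `dense_params(33)` = 107,811 entries, while a rank-8 residual needs `low_rank_params(33, 8)` = 816. The lookup itself is unchanged: the residual is added to the LUT once, and `trilinear` still touches only the eight surrounding nodes per pixel.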

[70] Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Minh Kha Do,Wei Xiang,Kang Han,Di Wu,Khoa Phan,Yi-Ping Phoebe Chen,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

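TL;DR: SATtxt is a spectrum-aware vision-language foundation model that needs only RGB input at inference. Using spectral representation distillation and spectrally grounded alignment with instruction-augmented LLMs, it improves zero-shot classification, retrieval, and linear probing on Earth observation tasks.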

Details Motivation: Vision-language foundation models for satellite imagery are held back because multi-spectral inputs are hard to exploit consistently and CLIP-style text encoders have limited semantic expressiveness; in addition, operational satellite systems often lack full multi-spectral coverage, so an efficient RGB-only solution is needed. Method: A two-stage framework: 1) Spectral Representation Distillation, which transfers spectral priors from a frozen multi-spectral teacher to an RGB student through a lightweight projector; 2) Spectrally Grounded Alignment with Instruction-Augmented LLMs, which uses the expressive power of a large language model (LLM) to align the visual and text spaces at a fine-grained level. Result: Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves on baselines by an average of 4.2% in zero-shot classification, 5.9% in retrieval, and 2.7% in linear probing. Conclusion: SATtxt offers an efficient, scalable path toward spectrum-aware vision-language learning for Earth observation, combining practical RGB inference with multi-spectral modeling. Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/

[71] Coded-E2LF: Coded Aperture Light Field Imaging from Events

Tomoya Tsuchida,Keita Takahashi,Chihiro Tsutake,Toshiaki Fujii,Hajime Nagahara

Main category: cs.CV

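TL;DR: Coded-E2LF is a purely event-based computational imaging method that reconstructs a 4-D light field using a coded aperture and a stationary event-only camera. The authors state it is the first demonstration of pixel-level-accurate 4-D light field reconstruction from events alone.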

Details Motivation: Prior work needed to capture both events and intensity images, which imposes hardware restrictions; this paper aims for a purely event-based light field reconstruction scheme that is easier to implement. Method: A coded aperture is combined with an event-only camera, the aperture coding patterns include a black pattern, and light field reconstruction from events alone is supported theoretically and improved in practice. Result: The method was implemented on real imaging hardware and reconstructed accurate 4-D light fields, showing that pixel-level accuracy is achievable from event data alone. Conclusion: According to the authors, Coded-E2LF is the first method to reconstruct a pixel-level 4-D light field from an event stream alone; it relaxes hardware constraints and advances the use of event cameras in light field imaging. Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.

[72] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

Boyang Dai,Zeng Fan,Zihao Qi,Meng Lou,Yizhou Yu

Main category: cs.CV

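TL;DR: This paper proposes CGSA, the first framework to bring Object-Centric Learning (OCL) into Source-Free Domain Adaptive Object Detection (SF-DAOD). It adds Hierarchical Slot Awareness (HSA) and Class-Guided Slot Contrast (CGSC) modules to a DETR-based detector for slot-aware domain adaptation, improving performance without access to source data.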

Details Motivation: Existing SF-DAOD methods overlook object-level structural cues in cross-domain data and struggle to achieve fine-grained semantic alignment without access to source data. Method: CGSA embeds a Hierarchical Slot Awareness (HSA) module in a DETR detector to disentangle images into slot representations, and a Class-Guided Slot Contrast (CGSC) module guides the slots toward class semantics, yielding source-free, object-centric domain adaptation. Result: CGSA outperforms previous SF-DAOD methods on multiple cross-domain datasets; theoretical derivations and ablation experiments support the effectiveness of the HSA and CGSC modules. Conclusion: The object-centric learning paradigm can improve the performance and semantic consistency of source-free domain adaptive detection and offers a new direction for model transfer in privacy-sensitive scenarios. Abstract: Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and promoting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at https://github.com/Michael-McQueen/CGSA.

[73] Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji,Chenyang Qi,Qifeng Chen

Main category: cs.CV

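TL;DR: This paper proposes a multi-modal chain-of-thought (CoT) approach that splits instruction-based image editing into three stages (planning, editing region reasoning, and editing) and combines large language models with multi-modal models to improve editing quality.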

Details Motivation: Existing methods depend on understanding models with a single modality, which limits editing quality in complex scenes; a multi-modal solution that joins understanding and generation is needed. Method: A multi-modal chain-of-thought prompting framework consisting of: 1) CoT planning driven by a large language model; 2) an editing region generation network trained with a multi-modal large language model; 3) a hint-guided editing network adapted to a text-to-image diffusion model. Result: The method shows competitive editing ability on complex real-world images. Conclusion: The multi-modal chain-of-thought framework bridges understanding and generation and makes instruction-based image editing more robust and adaptable. Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we individually separate the instruction editing task with the multi-modality chain of thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model could reason the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, a hint-guided instruction-based editing network is proposed for editing image generations based on the sizeable text-to-image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

[74] CRAG: Can 3D Generative Models Help 3D Assembly?

Zeyu Jiang,Sihang Li,Siqi Tan,Chenyang Xu,Juexiao Zhang,Julia Galway-Witham,Xue Wang,Scott A. Williams,Radu Iovita,Chen Feng,Jing Zhang

Main category: cs.CV

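TL;DR: This paper proposes CRAG, which reformulates 3D assembly as a joint assembly-and-generation task. The two processes reinforce each other, so the model can synthesize missing geometry and predict part poses, reaching state-of-the-art performance across a range of complex settings.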

Details Motivation: Existing 3D assembly methods address only rigid pose estimation, ignoring the natural coupling of structural reasoning and holistic shape inference in human assembly, and they cannot synthesize missing geometry. Method: The CRAG framework jointly models assembly (part pose prediction) and generation (complete shape synthesis): assembly provides part-level structural priors, and generation supplies holistic shape context that resolves ambiguities in assembly. Result: Experiments on in-the-wild objects with diverse geometries, varying part counts, and missing pieces show state-of-the-art performance. Conclusion: Joint assembly and generation is a 3D assembly paradigm closer to human reasoning, and CRAG handles missing-geometry synthesis and pose estimation together. Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

[75] QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Daniel Miao,Gilad Lerman,Joe Kileel

Main category: cs.CV

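TL;DR: This paper proposes a new multi-camera synchronization framework based on quadrifocal tensors. It uses Tucker decomposition, among other tools, to recover cameras and demonstrates the value of higher-order information in synchronization.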

Details Motivation: Quadrifocal tensors carry more information than pairwise essential matrices but are often regarded as impractical; the paper challenges this view and examines whether they can be used in practice in structure from motion (SfM). Method: The authors form the block quadrifocal tensor, show that it admits a Tucker decomposition with fixed multilinear rank (4,4,4,4), design a synchronization algorithm that combines Tucker decomposition, ADMM, and iteratively reweighted least squares, and propose a method that jointly synchronizes quadrifocal, trifocal, and bifocal tensors. Result: Numerical experiments on modern datasets demonstrate the effectiveness of the methods and the usefulness of higher-order tensor information. Conclusion: Quadrifocal tensors are of practical as well as theoretical interest, and the proposed synchronization framework offers a new way to use higher-order geometric constraints in SfM. Abstract: In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,~4,~4,~4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
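
The rank claim follows directly from the Tucker structure, which can be written out as below. The notation is introduced here for illustration and is not taken from the paper: $P_i$ denotes the $i$-th $3 \times 4$ camera matrix, $C$ the stacked camera matrix, $\mathcal{G}$ the core, and $\mathcal{T}$ the block quadrifocal tensor. The sketch assumes that the same stacked matrix $C$ serves as the factor in every mode; the abstract does not give the exact form of the core.

```latex
% Stacked cameras and assumed Tucker form of the block quadrifocal tensor
C = \begin{bmatrix} P_1 \\ \vdots \\ P_n \end{bmatrix} \in \mathbb{R}^{3n \times 4},
\qquad
\mathcal{T} = \mathcal{G} \times_1 C \times_2 C \times_3 C \times_4 C
\in \mathbb{R}^{3n \times 3n \times 3n \times 3n},
\qquad
\mathcal{G} \in \mathbb{R}^{4 \times 4 \times 4 \times 4}.

% Mode-k unfolding, k = 1, ..., 4
\mathcal{T}_{(k)} = C \, \mathcal{G}_{(k)} \, (C \otimes C \otimes C)^{\top}
\quad\Longrightarrow\quad
\operatorname{rank} \mathcal{T}_{(k)} \le \operatorname{rank} C \le 4 .
```

Each unfolding factors through the four-column matrix $C$, so its rank cannot exceed 4 however many cameras are stacked, which is why the multilinear rank does not grow with $n$.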

[76] Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Siqi Lu,Wanying Xu,Yongbin Zheng,Wenting Luan,Peng Sun,Jianhang Yao

Main category: cs.CV

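TL;DR: This paper proposes a Multimodal Weight Allocation Module (MWAM) based on frequency-domain analysis. It quantifies modality preference with a Frequency Ratio Metric (FRM) and dynamically balances the contribution of each modality branch, mitigating the performance drop caused by missing modalities.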

Details Motivation: Missing modalities cause severe performance degradation in multimodal models; the root cause is imbalanced learning during training, where the model implicitly prefers some modalities and neglects others. Method: A Frequency Ratio Metric (FRM) quantifies modality preference in the frequency domain; guided by it, the plug-and-play Multimodal Weight Allocation Module (MWAM) dynamically adjusts the weight of each modality branch during training. Result: MWAM integrates into a variety of backbones such as CNNs and ViTs, yields consistent gains across tasks and modality combinations, and further improves state-of-the-art methods for the missing modality problem. Conclusion: Frequency-domain analysis can reveal which modalities dominate, and MWAM uses dynamic weight reallocation to achieve more balanced and robust multimodal learning, giving a simple and efficient answer to the missing modality problem. Abstract: Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.

[77] Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

Woojae Hong,Jong Ha Hwang,Jiyong Chung,Joongyeon Choi,Hyunngun Kim,Yong Hwy Kim

Main category: cs.CV

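TL;DR: This paper introduces Interactive Medical-SAM2 GUI, an open-source desktop tool built on Napari and SAM2 for semi-automatic annotation of 2D/3D medical images. It supports box/point prompts and cross-slice mask propagation, provides a local, cohort-oriented 3D annotation workflow, and supports quantitative export and volume rendering.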

Details Motivation: Manual voxel-level annotation of 3D medical images is slow and expensive, and existing tools mostly focus on per-slice interaction without a unified, cohort-oriented local workflow covering navigation, propagation, interactive correction, and quantitative export. Method: A local-first GUI built on Napari treats a 3D volume as a slice sequence and uses Medical-SAM2 for cross-slice mask propagation driven by box/point prompts; it supports first/last-slice initialization, point-prompt refinement, and prompt-first correction; on export it preserves image geometry via SimpleITK and supports per-object volumetry and 3D rendering; N4 bias-field correction is optional. Result: The tool delivers an efficient, unified local workflow for semi-automatic 3D medical image annotation, with support for DICOM/NIfTI formats, sequential processing of multiple studies, interactive correction, and geometry-preserving quantitative export (such as volume measurement and 3D rendering). Conclusion: Interactive Medical-SAM2 GUI fills the gap for a local, cohort-oriented, semi-automatic 3D medical image annotation tool and improves annotation efficiency and consistency at the research stage; the code is open source for research use. Abstract: Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows. 
The code is released on the project page: https://github.com/SKKU-IBE/Medical-SAM2GUI/.
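
Per-object volumetry depends on keeping the voxel spacing with the mask, which is why preserving image geometry matters at export. The sketch below shows the arithmetic only and is not code from the tool: `per_object_volumes` is a hypothetical helper, the label volume is a plain nested list indexed as `[z][y][x]`, and the spacing is passed in by hand (with SimpleITK it would typically come from the image's `GetSpacing()`).

```python
from collections import Counter


def per_object_volumes(label_volume, spacing_mm):
    """Return {label: volume in millilitres} for every non-zero label.

    label_volume: nested lists [z][y][x] of integer labels (0 = background).
    spacing_mm:   (sx, sy, sz) voxel size in millimetres.
    """
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    counts = Counter(
        value
        for slice_2d in label_volume
        for row in slice_2d
        for value in row
        if value != 0
    )
    # 1 mL = 1000 mm^3
    return {label: n * voxel_mm3 / 1000.0 for label, n in sorted(counts.items())}


if __name__ == "__main__":
    volume = [
        [[0, 1, 1], [0, 1, 2]],
        [[1, 1, 0], [2, 2, 0]],
    ]
    print(per_object_volumes(volume, (1.0, 1.0, 2.0)))
```

If the spacing were dropped or reset to 1 mm during export, the same mask would yield a different volume, so the voxel count alone is not a usable measurement.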

[78] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Bowen Cui,Yuanbin Wang,Huajiang Xu,Biaolong Chen,Aixi Zhang,Hao Jiang,Zhengzheng Jin,Xu Liu,Pipei Huang

Main category: cs.CV

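TL;DR: This paper proposes DPCache, a training-free acceleration framework for diffusion models that formulates sampling acceleration as a global path planning problem. It builds a Path-Aware Cost Tensor and uses dynamic programming to select an optimal sequence of key timesteps, speeding up inference while preserving generation quality.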

Details Motivation: Existing caching-based acceleration methods for diffusion models rely on fixed or locally adaptive schedules and ignore the global structure of the denoising trajectory, which tends to cause error accumulation and visual artifacts. Method: DPCache builds a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps; dynamic programming then finds the key timestep sequence with minimum total cost; at inference, full computation runs only at key timesteps and the remaining steps are predicted from cached features. Result: Evaluated on DiT, FLUX, and HunyuanVideo. On FLUX, DPCache outperforms prior acceleration methods by +0.031 ImageReward at 4.87× speedup and surpasses the full-step baseline by +0.028 ImageReward at 3.54× speedup. Conclusion: Path-aware global scheduling mitigates error accumulation, and DPCache, as a training-free method, achieves a better balance between speed and quality. Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. 
Code will be released at https://github.com/argsss/DPCache.
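
Selecting key timesteps with a path-dependent cost is a shortest-path problem over pairs of consecutive key steps. The following is a minimal sketch under stated assumptions, not the released DPCache code: `select_key_timesteps` and its `cost(prev, cur, nxt)` callback are hypothetical stand-ins for the Path-Aware Cost Tensor, the first and last timesteps are assumed to always be key steps, and the number of key steps is fixed in advance.

```python
def select_key_timesteps(num_steps, num_keys, cost):
    """Pick `num_keys` key timesteps out of range(num_steps) with minimum total cost.

    cost(prev, cur, nxt) is the (assumed) error of computing fully at `cur`
    and skipping to `nxt`, given that the previous key step was `prev`
    (None for the first segment). Requires 2 <= num_keys <= num_steps.
    Returns (total_cost, path).
    """
    last = num_steps - 1
    # State = (previous key, current key); value = (best total cost, path so far).
    best = {(None, 0): (0.0, [0])}
    for _ in range(num_keys - 1):
        next_best = {}
        for (prev, cur), (total, path) in best.items():
            for nxt in range(cur + 1, num_steps):
                candidate = total + cost(prev, cur, nxt)
                state = (cur, nxt)
                if state not in next_best or candidate < next_best[state][0]:
                    next_best[state] = (candidate, path + [nxt])
        best = next_best
    finals = [value for (_, cur), value in best.items() if cur == last]
    return min(finals, key=lambda value: value[0])


if __name__ == "__main__":
    # Toy cost: quadratic in the number of skipped steps, independent of `prev`.
    total, path = select_key_timesteps(7, 3, lambda p, c, n: float((n - c - 1) ** 2))
    print(total, path)
```

Because the cost depends on the preceding key step, the state has to be a pair of key steps rather than a single step; a DP over single timesteps would be valid only if the cost of a skip did not depend on how the current key step was reached.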

[79] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Renyu Yang,Jian Jin,Lili Meng,Meiqin Liu,Yilin Wang,Balu Adsumilli,Weisi Lin

Main category: cs.CV

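TL;DR: This paper proposes a practical approach to building audio-visual quality assessment (AVQA) datasets. Using crowdsourced subjective experiments, a systematic data preparation strategy, and multi-dimensional annotation, it constructs YT-NTU-AVQ, the largest and most diverse AVQA dataset to date.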

Details Motivation: Existing AVQA datasets are small, lack diversity in content and quality, and provide only overall scores, which gives limited support for model development and multimodal perception research. Method: A crowdsourced subjective experiment framework removes the constraints of in-lab settings; a systematic data preparation strategy ensures coverage of quality levels and semantic scenarios; additional annotations support research on multimodal perception mechanisms. Result: The resulting YT-NTU-AVQ dataset contains 1,620 user-generated audio-video sequences and is the largest and most diverse AVQA dataset to date; the data and platform code are open source. Conclusion: The approach addresses the main bottlenecks in AVQA dataset construction and provides a solid basis for multimodal quality assessment and perception research. Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA, which breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ

[80] ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

Xuelu Li,Zhaonan Wang,Xiaogang Wang,Lei Wu,Manyi Li,Changhe Tu

Main category: cs.CV

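TL;DR: This paper proposes ArtPro, a new self-supervised framework that reconstructs high-fidelity digital twins of articulated objects by adaptively integrating mobility proposals, addressing the sensitivity of existing methods to the initial part segmentation and their tendency to get stuck in local optima.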

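Details Motivation: Existing self-supervised methods based on differentiable rendering (such as 3D Gaussian Splatting) are highly sensitive to the initial part segmentation and rely on heuristic clustering or pre-trained models, so optimization for complex multi-part objects easily converges to local minima. Method: ArtPro starts from an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses; during optimization it dynamically merges proposals whose motion is consistent among spatial neighbors, and a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Result: Extensive experiments on synthetic and real-world objects show that ArtPro significantly outperforms existing methods in accuracy and stability and robustly reconstructs complex multi-part objects. Conclusion: By modeling motion consistency and using collision-aware optimization, ArtPro improves the robustness and accuracy of self-supervised articulated object reconstruction and offers a new direction for building digital twins. Abstract: Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.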

[81] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou,Yueru Luo,Han Zhang,Zeyu Jiang,Changhao Chen

Main category: cs.CV

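TL;DR: This paper proposes an open-vocabulary 3D occupancy prediction framework for indoor scenes. It uses geometry-only supervision (binary occupancy labels), a Language-Embedded Gaussian representation, a new opacity-aware Poisson-based aggregation, and a Progressive Temperature Decay schedule, and it improves IoU and mIoU on Occ-ScanNet.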

Details Motivation: Existing open-vocabulary 3D occupancy methods work in outdoor driving scenarios but transfer poorly to indoor environments, where geometry is denser, layouts are more intricate, and semantics are more fine-grained. Method: The framework builds a unified geometry-semantics representation on 3D Language-Embedded Gaussians; an opacity-aware, Poisson-based voxel aggregation stabilizes convergence under weak supervision; a Progressive Temperature Decay schedule strengthens the alignment between Gaussians and language features. Result: On Occ-ScanNet in the open-vocabulary setting, the framework reaches 59.50 IoU and 21.05 mIoU, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Conclusion: Geometry-only supervision combined with Language-Embedded Gaussians and an improved rendering alignment mechanism is an effective paradigm for high-performance indoor open-vocabulary 3D occupancy modeling. Abstract: Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
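
One way to read "opacity-aware, Poisson-based" aggregation is to treat each Gaussian as contributing an event rate to a voxel and to call the voxel occupied if at least one event occurs. The sketch below follows that reading and is purely an assumption: the abstract does not give the operator, and the function name, the isotropic Gaussians, and the rate formula are all introduced here for illustration.

```python
import math


def voxel_occupancy(voxel_center, gaussians):
    """Occupancy probability of one voxel from a set of isotropic Gaussians.

    gaussians: list of (mean_xyz, sigma, opacity) tuples.
    Assumed model: Gaussian i contributes a Poisson rate
        lambda_i = opacity_i * exp(-0.5 * ||x - mean_i||^2 / sigma_i^2),
    and the voxel is occupied if at least one event occurs:
        P(occupied) = 1 - exp(-sum_i lambda_i).
    """
    rate = 0.0
    for mean, sigma, opacity in gaussians:
        dist_sq = sum((v - m) ** 2 for v, m in zip(voxel_center, mean))
        rate += opacity * math.exp(-0.5 * dist_sq / (sigma * sigma))
    return 1.0 - math.exp(-rate)


if __name__ == "__main__":
    one = [((0.0, 0.0, 0.0), 1.0, 1.0)]
    print(voxel_occupancy((0.0, 0.0, 0.0), one))
    print(voxel_occupancy((0.0, 0.0, 0.0), one * 5))
```

Under this model the output stays in [0, 1] however many Gaussians overlap a voxel, whereas a plain sum of opacities can exceed 1 and would need clipping, which removes gradient for saturated voxels.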

[82] SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

Guanghao Liao,Zhen Liu,Liyuan Cao,Yonghui Yang,Qi Li

Main category: cs.CV

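TL;DR: This paper proposes SPMamba-YOLO, an underwater object detection network that combines multi-scale feature enhancement with global context modeling. On URPC2022 it improves mAP@0.5 by more than 4.9% over YOLOv8n, with the largest gains on small and densely distributed objects.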

Details Motivation: Underwater object detection faces severe light attenuation, color distortion, background clutter, and small target sizes. Method: SPMamba-YOLO includes an SPPELAN module (stronger multi-scale feature aggregation and a larger receptive field), a PSA mechanism (better feature discrimination), and a Mamba-based state space modeling module (efficient capture of long-range dependencies and global context). Result: On URPC2022, mAP@0.5 improves by more than 4.9% over YOLOv8n, particularly for small and densely distributed objects, while keeping a favorable balance between accuracy and computational cost. Conclusion: By jointly improving multi-scale representation and global modeling, SPMamba-YOLO raises detection robustness and accuracy in complex underwater environments. Abstract: Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9\% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.

[83] ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran,Minh-Thien Nguyen,Nguyen-Khang Pham

Main category: cs.CV

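TL;DR: This paper introduces ViCLIP-OT, a foundation vision-language model designed for Vietnamese image-text retrieval. It combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss and improves cross-modal retrieval in a low-resource language.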

Details Motivation: Most existing vision-language models are optimized for high-resource languages and perform poorly in low-resource settings such as Vietnamese, so a specifically adapted model is needed. Method: The ViCLIP-OT framework combines CLIP-style contrastive learning with the newly proposed Similarity-Graph Regularized Optimal Transport (SIGROT) loss to strengthen global cross-modal consistency and reduce the modality gap. Result: On three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600), ViCLIP-OT outperforms CLIP and SigLIP in both in-domain and zero-shot settings; on UIT-OpenViIC it reaches an average Recall@K of 67.34%, 5.75 percentage points above CLIP; in zero-shot evaluation on Crossmodal-3600 it surpasses CLIP by 11.72 percentage points; embedding-space analysis confirms better alignment and a smaller modality gap. Conclusion: SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, with practical value for intelligent multimedia retrieval systems in Vietnamese and other underrepresented languages. Abstract: Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
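
The optimal transport ingredient can be illustrated with plain entropic OT between a batch of image and text embeddings. This is a generic Sinkhorn sketch, not the SIGROT loss: the abstract does not specify the similarity-graph regularizer, so it is omitted, and the function names, the cosine cost, and the uniform marginals are assumptions made here.

```python
import math


def cosine_cost(image_embs, text_embs):
    """Cost matrix C[i][j] = 1 - cosine similarity of image i and text j."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [
        [1.0 - sum(a * b for a, b in zip(img, txt)) / (norm(img) * norm(txt))
         for txt in text_embs]
        for img in image_embs
    ]


def sinkhorn(cost, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    a, b = 1.0 / n, 1.0 / m  # uniform mass per image and per caption
    for _ in range(iters):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    for row in sinkhorn(cosine_cost(images, texts)):
        print([round(p, 4) for p in row])
```

Unlike a per-pair contrastive term, the transport plan is constrained by batch-level marginals, so every image must distribute its mass over captions and vice versa; this is one common way to encourage the kind of global cross-modal consistency the abstract mentions.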

[84] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Zhuohang Jiang,Xu Yuan,Haohao Qu,Shanru Lin,Kanglong Liu,Wenqi Fan,Qing Li

Main category: cs.CV

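TL;DR: This paper introduces SUPERGLASSES, the first visual question answering (VQA) benchmark built on real data collected by smart glasses, and SUPERLENS, a multimodal smart glasses agent designed for this setting that surpasses GPT-4o by 2.19 percent.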

Details Motivation: VLMs adapted to smart glasses are limited by traditional datasets that lack realism and scenario fit, and that overlook a key prerequisite: accurately identifying the object of interest before any external knowledge retrieval. Method: The authors build SUPERGLASSES, the first real-world VQA benchmark collected entirely by smart glasses (2,422 samples, 14 image domains, 8 query categories, with search trajectories and reasoning annotations), and propose the SUPERLENS agent, which integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation. Result: Evaluating 26 representative VLMs on SUPERGLASSES reveals significant performance gaps; SUPERLENS achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent. Conclusion: Smart glasses VQA needs task-specific models and data; general-purpose VLMs struggle with the perception, retrieval, and reasoning chain of real usage scenarios. Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.

[85] No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon,Woo Jae Kim,Suhyeon Ha,Sooel Son,Sung-Eui Yoon

Main category: cs.CV

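TL;DR: This paper proposes MoFit, a membership inference attack (MIA) framework that needs no ground-truth captions, for detecting whether a diffusion model memorized a specific image during training; it improves inference performance by constructing model-fitted synthetic conditioning inputs.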

Details Motivation: Existing membership inference attacks rely on ground-truth captions, but in practice often only the images are available without their text, which makes these methods ineffective; an MIA method that does not depend on ground-truth captions is needed. Method: Proposes the MoFit framework. Stage one, model-fitted surrogate optimization, applies a learnable perturbation to the query image to produce a surrogate in regions of the model's unconditional prior. Stage two, surrogate-driven embedding extraction, derives a model-fitted embedding from the surrogate and uses it as a mismatched condition to amplify the conditional loss response of member samples. Result: Experiments across multiple datasets and diffusion models show that MoFit clearly outperforms baselines that use captions generated by a vision-language model (VLM) and reaches performance comparable to methods that rely on ground-truth captions. Conclusion: MoFit effectively addresses membership inference for diffusion models without ground-truth captions, providing a more practical and robust auditing tool for assessing data memorization risk in generative models. Abstract: Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

[86] GFRRN: Explore the Gaps in Single Image Reflection Removal

Yu Chen,Zewei He,Xingyu Liu,Zixuan Chen,Zheming Lu

Main category: cs.CV

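TL;DR: This paper proposes the Gap-Free Reflection Removal Network (GFRRN), which uses parameter-efficient fine-tuning, a unified reflection label generator, a Gaussian-based adaptive frequency learning block, and dynamic agent attention to narrow both the semantic understanding gap between pre-trained models and the reflection removal task and the label inconsistency between synthetic and real data.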

Details Motivation: Existing dual-stream methods for single image reflection removal face two key problems: a semantic understanding gap between pre-trained models and reflection removal models, and inconsistent reflection labels between synthetic and real-world data. Method: Adopts a parameter-efficient fine-tuning (PEFT) strategy that inserts learnable Mona layers into the pre-trained model; designs a label generator that unifies reflection labels across synthetic and real data; proposes a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) that adaptively learns and fuses frequency priors; and introduces Dynamic Agent Attention (DAA) in place of window-based attention to model differences in significance across and within windows. Result: GFRRN significantly outperforms current state-of-the-art methods on multiple benchmarks, supporting the effectiveness of each module. Conclusion: By systematically closing the semantic understanding and label consistency gaps, GFRRN offers a more robust and better generalizing solution for single image reflection removal. Abstract: Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) a semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.

[87] UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects

Yuankai Chen,Kai Lin,Qihong Wu,Xinxuan Yang,Jiashuo Lai,Ruoen Chen,Haonan Shi,Minfan He,Meihua Wang

Main category: cs.CV

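TL;DR: This paper proposes UFO-DETR, an end-to-end detection framework designed for small object detection in UAV imagery. It combines an LSKNet backbone, the DAttention and AIFI modules, and a newly proposed DynFreq-C3 module, outperforms RT-DETR-L in both detection accuracy and computational efficiency, and suits edge computing scenarios.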

Details Motivation: Small object detection in UAV imagery faces large scale variation, dense targets, and a dominance of small targets; existing methods rely on hand-designed components, and general-purpose detectors are not optimized for UAV images, so accuracy and complexity are hard to balance. Method: Proposes the end-to-end framework UFO-DETR: LSKNet serves as the backbone to optimize the receptive field and reduce the parameter count; the DAttention and AIFI modules are combined to model multi-scale spatial relationships; and a new DynFreq-C3 module improves small object detection through cross-space frequency feature enhancement. Result: In experiments, UFO-DETR improves significantly on RT-DETR-L in both detection performance and computational efficiency, supporting its effectiveness and efficiency for UAV edge computing. Conclusion: UFO-DETR is an efficient, lightweight detection framework tailored to small objects in UAV imagery that balances high accuracy with low computational cost and suits resource-constrained edge deployment. Abstract: Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.

[88] SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Guanting Ye,Qiyan Zhao,Wenhao Yu,Liangyu Yuan,Mingkai Li,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Qing Jiang,Ka-Veng Yuen

Main category: cs.CV

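TL;DR: This paper proposes SoPE, a positional embedding based on spherical coordinates, to address the shortcomings of conventional RoPE in modeling 3D spatial structure and angular dependencies in 3D large vision-language models (3D LVLMs). SoPE maps point-cloud tokens into a spherical coordinate space, jointly models spatial locations and directional angles, and adds a multi-scale frequency mixing strategy to strengthen geometric representation. Experiments on multiple 3D scene benchmarks support its effectiveness and strong generalization.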

Details Motivation: Conventional Rotary Position Embedding (RoPE) has shortcomings in 3D multimodal understanding: it cannot preserve three-dimensional spatial structure, and its relative distance computation ignores angular dependencies, making it hard to capture directional variation in visual representations. Method: Proposes Spherical Coordinate-based Positional Embedding (SoPE), which maps point-cloud token indices into a 3D spherical coordinate system to jointly model spatial locations and directional angles, and designs a multi-scale frequency mixing strategy to fuse features from different frequency domains. Result: Significantly improves performance on multiple 3D scene benchmarks; real-world deployment experiments indicate strong generalization. Conclusion: SoPE effectively addresses the geometric modeling limits of RoPE in 3D understanding and improves the spatial perception and directional modeling of 3D LVLMs, providing a more robust and geometrically consistent positional encoding for 3D multimodal learning. Abstract: 3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
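
To make the spherical coordinate idea in the SoPE entry concrete, here is a minimal sketch of the standard Cartesian-to-spherical conversion. The abstract does not say how token indices are turned into coordinates or how the angles enter the embedding, so the function name `to_spherical` and the angle conventions below are assumptions for illustration only, not the paper's implementation.

```python
import math

def to_spherical(x, y, z):
    """Convert Cartesian (x, y, z) to (r, theta, phi).

    r     : distance from the origin
    theta : polar angle measured from the +z axis, in [0, pi]
    phi   : azimuth in the xy-plane, in (-pi, pi]
    These conventions are an assumption; the paper may use others.
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi

# Two points at the same distance but in different directions
# share r and differ only in their angles.
print(to_spherical(1.0, 0.0, 0.0))
print(to_spherical(0.0, 1.0, 0.0))
```

The point of the representation is visible in the example: distance and direction are separated into different components, which is what lets a positional encoding treat angular differences explicitly.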

[89] IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

Shuoqi Chen,Yujia Wu,Geoffrey P. Luke

Main category: cs.CV

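TL;DR: This paper proposes a diffusion-based ultrasound despeckling method that trains on paired data simulated from MRI images and suppresses speckle noise while preserving anatomical edges and contrast. It quantifies prediction uncertainty via cross-model variance, shows that this uncertainty correlates with reconstruction error, and examines the domain shift caused by changes in simulation parameters.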

Details Motivation: Speckle noise and related artifacts in ultrasound images degrade image quality and hinder clinical interpretation, so effective despeckling methods are needed. Method: Builds a diffusion model on the Image Restoration Stochastic Differential Equations (IR-SDE) framework; uses the Matlab UltraSound Toolbox to simulate large paired ultrasound datasets from speckle-free MRI images for supervised training; and quantifies prediction uncertainty via cross-model variance. Result: Significantly outperforms classical filters and recent learning-based despeckling methods on a simulated test set; confirms that prediction uncertainty correlates positively with reconstruction error; and finds that the model is sensitive to simulated probe parameters, indicating domain shift. Conclusion: The proposed diffusion model despeckles effectively while preserving key anatomical structures, and its uncertainty estimates can help flag high-risk regions, but clinical robustness requires diversified training and domain adaptation. Abstract: Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
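
The cross-model variance mentioned in this entry can be sketched as a per-pixel variance over the outputs of several models. The abstract does not specify how many models are used or how their outputs are aggregated, so the helper `pixelwise_variance` and the flat-list image format below are illustrative assumptions.

```python
from statistics import pvariance

def pixelwise_variance(predictions):
    """Per-pixel population variance across model outputs.

    predictions: list of images, one per model, each a flat list of
    pixel values of equal length (a hypothetical format for this sketch).
    Returns one variance value per pixel; larger values mean the models
    disagree more at that pixel.
    """
    return [pvariance(values) for values in zip(*predictions)]

# Two models agree on the first pixel and disagree on the second.
uncertainty = pixelwise_variance([[0.0, 1.0], [0.0, 3.0]])
print(uncertainty)
```

A map built this way can be thresholded to highlight regions where the restored image should be trusted less, which matches the use described in the abstract.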

[90] HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

Yangguang Lin,Quan Fang,Yufei Li,Jiachen Sun,Junyu Gao,Jitao Sang

Main category: cs.CV

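TL;DR: This paper proposes HulluEdit, a single-forward-pass, reference-free intervention framework that decomposes hidden states through orthogonal subspace editing and selectively suppresses hallucinatory patterns without affecting visual evidence, significantly reducing object hallucination in large vision-language models.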

Details Motivation: Existing methods struggle to balance efficiency and accuracy: they either need expensive reference models and multiple forward passes, or apply static edits that may suppress genuine visual evidence. Method: Proposes the HulluEdit framework, whose core is orthogonal subspace editing: the model's hidden states are decomposed into three orthogonal subspaces (visual evidence, conflicting priors, and residual uncertainty), and only the prior subspace is edited, which mathematically guarantees that the visual component is unaffected. Result: Achieves state-of-the-art hallucination reduction on benchmarks such as POPE and CHAIR while preserving general capabilities on MME and keeping inference efficient; consistently outperforms contrastive decoding and static subspace editing baselines. Conclusion: HulluEdit offers a new path toward more trustworthy large vision-language models. Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
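
The guarantee stated in this entry follows from basic linear algebra: if two subspaces are orthogonal, removing a vector's component in one leaves its projection onto the other unchanged. The sketch below illustrates only that property. How HulluEdit actually estimates the subspaces is not described in the abstract, so the hand-picked bases, the `alpha` strength parameter, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(h, basis):
    """Project vector h onto the span of an orthonormal basis."""
    out = [0.0] * len(h)
    for u in basis:
        coeff = dot(h, u)
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out

def suppress_prior(h, prior_basis, alpha=1.0):
    """Remove a fraction alpha of h's component in the prior subspace."""
    prior_part = project(h, prior_basis)
    return [hi - alpha * pi for hi, pi in zip(h, prior_part)]

# Toy 3-D hidden state with hand-picked orthogonal subspaces (assumed).
visual_basis = [[1.0, 0.0, 0.0]]
prior_basis = [[0.0, 1.0, 0.0]]
h = [2.0, 3.0, 4.0]

edited = suppress_prior(h, prior_basis)
# The visual component is identical before and after the edit.
print(project(h, visual_basis), project(edited, visual_basis))
```

With non-orthogonal subspaces the same edit would leak into the visual component, which is why the orthogonality of the decomposition matters for the claim.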

[91] Asymmetric Idiosyncrasies in Multimodal Models

Muzi Tao,Chufan Shi,Huijuan Wang,Shengbang Tong,Xuezhe Ma

Main category: cs.CV

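TL;DR: This paper studies the stylistic idiosyncrasies of image captioning models and their effect on text-to-image models, finding that captioning models leave pronounced stylistic signatures in text that almost vanish in the generated images.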

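Details Motivation: To investigate how the stylistic idiosyncrasies of captioning models affect downstream text-to-image models, and to quantify how much of this stylistic information is lost across modalities. Method: Designs a systematic classification-based analysis: given a generated caption or the corresponding image, a neural network is trained to identify the captioning model it came from; the retention of key features between text and image (level of detail, emphasis on color and texture, object distribution) is then analyzed further. Result: Text classification reaches 99.70% accuracy, indicating strong stylistic signatures in captioning models, whereas image classification reaches at most 50%, showing that the generated images lose almost all of the signature; further analysis shows that the images fail to preserve key dimensions of variation present in the captions. Conclusion: The classification framework can quantify both the stylistic distinctiveness of captioning models and the prompt-following ability of text-to-image models, and reveals a serious shortfall in cross-modal style preservation in current text-to-image systems. Abstract: In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.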

[92] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

Tongfei Chen,Shuo Yang,Yuguang Yang,Linlin Yang,Runtang Guo,Changbai Li,He Long,Chunyu Xie,Dawei Leng,Baochang Zhang

Main category: cs.CV

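TL;DR: This paper proposes Alignment-Aware Masked Learning (AML), a training strategy that improves Referring Image Segmentation (RIS) by explicitly estimating pixel-level vision-language alignment and filtering out poorly aligned regions. It reaches state-of-the-art results on the RefCOCO datasets and improves robustness to diverse descriptions and scenarios.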

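Details Motivation: Existing RIS methods model vision-language alignment too coarsely to handle diverse natural language descriptions and complex scenes, so alignment information needs to be used more reliably during optimization. Method: Proposes Alignment-Aware Masked Learning (AML), which (1) explicitly models pixel-level vision-language alignment, (2) dynamically masks low-quality regions based on alignment confidence, and (3) applies gradient updates only in regions with trustworthy alignment. Result: Reaches state-of-the-art performance on several benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, and significantly improves robustness to ambiguous, redundant, or adversarial language descriptions. Conclusion: By introducing alignment-aware masked learning, AML improves the accuracy and generalization of RIS models and supports the importance of fine-grained alignment modeling for cross-modal segmentation. Abstract: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.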
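
The filtering step described for AML can be pictured as a loss that averages only over pixels whose alignment score is high enough. The entry does not say how alignment is estimated or how the cutoff is chosen, so the `alignment` scores, the fixed `threshold`, and the function name below are assumptions for illustration rather than the paper's procedure.

```python
def masked_mean_loss(losses, alignment, threshold=0.5):
    """Average per-pixel losses over pixels with alignment >= threshold.

    losses    : per-pixel loss values (flat list)
    alignment : per-pixel vision-language alignment scores in [0, 1]
                (hypothetical; how they are computed is not shown here)
    Pixels below the threshold contribute nothing to the result.
    """
    kept = [l for l, a in zip(losses, alignment) if a >= threshold]
    return sum(kept) / len(kept) if kept else 0.0

# The middle pixel is poorly aligned and is excluded from the average.
print(masked_mean_loss([1.0, 2.0, 3.0], [0.9, 0.1, 0.6]))
```

Because excluded pixels do not enter the average, they would also receive no gradient in a real training loop, which is the effect the entry describes.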

[93] ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

Akihisa Watanabe,Qing Yu,Edgar Simo-Serra,Kent Fujiwara

Main category: cs.CV

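TL;DR: This paper proposes ProjFlow, a training-free sampler that satisfies linear spatial constraints exactly in a zero-shot setting while preserving motion realism.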

Details Motivation: Existing methods often require task-specific training or slow optimization, and hard constraints tend to break motion naturalness. Method: Building on the observation that many animation tasks can be modeled as linear inverse problems, proposes ProjFlow; introduces a kinematics-aware metric that encodes skeletal topology, and handles sparse inputs with time-varying pseudo-observations. Result: On tasks such as motion inpainting and 2D-to-3D lifting, ProjFlow satisfies constraints exactly, matches or improves on the realism of zero-shot baselines, and comes close to training-based controllers. Conclusion: ProjFlow provides an efficient, general, training-free motion generation framework that keeps motion natural while guaranteeing exact spatial control. Abstract: Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on two representative applications, motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
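
The role of a metric in constraint projection can be shown with a toy case: projecting a point onto a single linear constraint a·x = b while measuring the correction with per-coordinate weights. This is a generic weighted projection, not ProjFlow's sampler; the diagonal weights `w` stand in for the paper's kinematics-aware metric, whose actual form is not given in the abstract.

```python
def weighted_projection(x, a, b, w):
    """Closest point to x satisfying a . x = b under a diagonal metric.

    Minimizes sum_i w[i] * (x_new[i] - x[i])**2 subject to the constraint.
    Coordinates with a larger weight are moved less.
    """
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    denom = sum(ai * ai / wi for ai, wi in zip(a, w))
    return [xi - residual * (ai / wi) / denom for xi, ai, wi in zip(x, a, w)]

# Same constraint x0 + x1 = 2, two different metrics.
print(weighted_projection([0.0, 0.0], [1.0, 1.0], 2.0, [1.0, 1.0]))
print(weighted_projection([0.0, 0.0], [1.0, 1.0], 2.0, [1.0, 3.0]))
```

Both results satisfy the constraint exactly, but the second metric shifts most of the correction onto the first coordinate. Choosing the metric therefore decides how a hard constraint's correction is spread, which is the lever the entry attributes to the skeleton-aware metric.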

[94] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu,Tat-Jen Cham,Chuanxia Zheng

Main category: cs.CV

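TL;DR: This paper proposes SPATIALALIGN, a framework that uses zeroth-order regularized Direct Preference Optimization (DPO) to improve how text-to-video models depict Dynamic Spatial Relationships (DSR), and designs DSR-SCORE, a geometric metric that measures how well generated videos match the spatial relationships in the prompt.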

Details Motivation: Existing text-to-video generators focus on aesthetic quality but often ignore spatial constraints in the generated videos, and in particular struggle to express the Dynamic Spatial Relationships (DSR) specified in prompts. Method: Proposes the SPATIALALIGN self-improvement framework, which fine-tunes T2V models with zeroth-order regularized DPO; designs the geometry-based DSR-SCORE metric to quantify spatial relationship alignment; and builds a text-video dataset with diverse DSRs. Result: The fine-tuned model clearly outperforms the baseline at modeling dynamic spatial relationships, and DSR-SCORE confirms its better spatial consistency. Conclusion: SPATIALALIGN improves the ability of T2V models to understand and generate dynamic spatial relationships, and DSR-SCORE provides an interpretable evaluation approach for this task that does not require a VLM. Abstract: Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released.
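
For readers unfamiliar with DPO, the standard objective is reproduced below as background. Here $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected outputs, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is a temperature. The abstract does not give the form of the zeroth-order regularization that SPATIALALIGN adds, so this is only the unmodified baseline loss, not the paper's objective.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In the setting described above, one would expect the preference pairs to be ranked by DSR-SCORE, though the abstract does not state this explicitly.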

[95] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Yuan-Chih Chen,Chun-Shien Lu

Main category: cs.CV

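TL;DR: This paper proposes a unified hidden-code recovery framework for factual retrieval and reconstruction of tampered image content. It supports both post-hoc and in-generation watermarking paradigms and is validated on the newly built ImageNet-S benchmark.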

Details Motivation: Existing work concentrates on deepfake detection and localization and pays little attention to factual recovery and retrieval of tampered content. Method: Proposes a compact hidden-code representation that encodes semantic and perceptual information through multi-scale vector quantization, strengthens contextual reasoning with conditional Transformer modules, and assembles these into a unified hidden-code recovery framework. Result: On the newly built ImageNet-S benchmark, the method shows good factual retrieval and image reconstruction performance and is compatible with a range of watermarking pipelines. Conclusion: The framework lays a foundation for general-purpose image recovery beyond detection and localization. Abstract: Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.

[96] TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng,Jieyu Zhang,Jianing Zhang,Weikai Huang,Ashutosh Kumar,Quan Kong,Oncel Tuzel,Chun-Liang Li,Ranjay Krishna

Main category: cs.CV

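TL;DR: This paper proposes TrajTok, an end-to-end, jointly trainable video tokenizer that produces object trajectories directly through implicit spatio-temporal pixel clustering. It adapts token granularity to semantic complexity, removes the dependence on video duration and external tracking modules, and improves video understanding performance while staying efficient.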

Details Motivation: Conventional video tokenization such as patchification produces many redundant tokens, limiting efficiency and scalability; existing trajectory-based methods depend on slow, task-agnostic external segmentation and tracking modules. Method: Proposes TrajTok, a unified segmenter based on implicit spatio-temporal pixel clustering that produces object trajectories in a single forward pass; it is trained end to end with the downstream task and adjusts token granularity dynamically; it is also extended into TrajAdapter (a probing head) and TrajVLM (a vision-language alignment connector). Result: TrajViT2, built on TrajTok, reaches state-of-the-art accuracy on classification and retrieval benchmarks with efficiency comparable to the best token-merging methods; TrajAdapter and TrajVLM perform especially well on long-video reasoning. Conclusion: TrajTok is a lightweight, efficient, downstream-adaptive video tokenizer that both improves the tokenization paradigm and serves as a general component for strengthening various video model architectures. Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

[97] SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Ling Wang,Hao-Xiang Guo,Xinzhou Wang,Fuchun Sun,Kai Sun,Pengkun Liu,Hang Xiao,Zhong Wang,Guangyuan Fu,Eric Li,Yang Liu,Yikai Wang

Main category: cs.CV

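TL;DR: SceneTransporter is an end-to-end framework that generates structured 3D scenes from a single image by introducing an entropy-regularized Optimal Transport (OT) objective into the DiT denoising loop, addressing the difficulty existing methods have in organizing parts into distinct instances.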

Details Motivation: Existing methods can generate part-level 3D objects but struggle to organize parts into distinct instances in open-world scenes; the authors trace this to a lack of structural constraints in the model's internal assignment mechanism. Method: Proposes the SceneTransporter framework, which recasts structured 3D scene generation as a global correlation assignment problem and solves an entropy-regularized optimal transport objective inside the denoising loop of a compositional DiT model. The OT formulation imposes two structural constraints: the transport plan gates cross-attention to give a one-to-one assignment of image patches to part-level 3D latents, and an edge-based cost regularizes the competitive assignment so that similar patches cluster into complete objects. Result: Significantly outperforms existing methods on open-world scene generation, improving instance-level coherence and geometric fidelity. Conclusion: Optimal transport can effectively model the structured mapping from an image to a 3D scene, offering a new paradigm for generating structured 3D scenes from a single image. Abstract: We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.
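
Entropy-regularized optimal transport is commonly solved with Sinkhorn iterations, sketched below in plain Python. This is the generic algorithm, not SceneTransporter's implementation: the abstract does not state which solver is used, and the cost matrix, marginals `a` and `b`, regularization `eps`, and iteration count here are illustrative assumptions.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT plan between marginals a (rows) and b (columns).

    cost : n x m list of lists, e.g. patch-to-part dissimilarities
    eps  : entropic regularization; smaller gives sharper assignments
    Returns an n x m plan whose rows sum to a and columns sum to b
    (approximately, up to convergence).
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two "patches" and two "parts"; matching indices are cheap to pair.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
for row in plan:
    print(row)
```

With a small `eps`, almost all mass lands on the low-cost pairs, which is the near one-to-one routing behavior the entry describes when the plan is used to gate cross-attention.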

[98] Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima,Hiroshi Kera,Kazuhiko Kawamoto

Main category: cs.CV

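TL;DR: This paper proposes a robust trajectory prediction method that incorporates a skeleton representation model pretrained with self-supervised masked autoencoding, to handle joints missing due to occlusion in real scenes and improve both robustness and accuracy.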

Details Motivation: Skeleton data in real environments often has joints missing because of occlusion, which severely degrades trajectory prediction accuracy, so more robust skeleton representations are needed. Method: Proposes a trajectory prediction method that incorporates a skeleton representation model pretrained with self-supervised masked autoencoding. Result: In occlusion-prone scenarios, the method significantly improves robustness to missing skeleton data and consistently outperforms baseline models from clean to moderate levels of missingness. Conclusion: The method strengthens robustness to missing skeleton data without sacrificing prediction accuracy, offering a more reliable solution for trajectory prediction in practice. Abstract: Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.

[99] GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation

Hanliang Du,Zhangji Lu,Zewei Cai,Qijian Tang,Qifeng Yu,Xiaoli Liu

Main category: cs.CV

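TL;DR: This paper proposes the GSTurb framework, which combines optical flow-guided tilt correction with Gaussian splatting to model non-isoplanatic blur, significantly improving the restoration of images degraded by atmospheric turbulence in long-range imaging.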

Details Motivation: Atmospheric turbulence causes pixel displacement (tilt) and blur that severely degrade long-range imaging, and existing methods fall short in modeling non-isoplanatic blur and in jointly optimizing tilt and blur. Method: Proposes the GSTurb framework, which corrects image tilt under optical flow guidance and models spatially varying non-isoplanatic blur with Gaussian splatting; tilt and blur are represented by Gaussian parameters and optimized jointly across multiple frames. Result: Reaches a PSNR of 27.67 dB and an SSIM of 0.8735 on the synthetic ATSyn-static dataset; compared with the state-of-the-art method, PSNR improves by 1.3 dB (+4.5%) and SSIM by 0.048 (+5.8%); it also performs best on the real TSRWGAN Real-World and CLEAR datasets. Conclusion: Combining optical flow-guided tilt correction with Gaussian splatting effectively models and restores images affected by atmospheric turbulence, with significant gains in both synthetic and real scenarios. Abstract: Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at https://github.com/DuhlLiamz/3DGS_turbulence/tree/main.

[100] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Mingde Yao,Zhiyuan You,Tam-King Man,Menglu Wang,Tianfan Xue

Main category: cs.CV

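TL;DR: This paper proposes PhotoAgent, a system that performs autonomous image editing through explicit aesthetic planning. It models editing as a long-horizon decision-making problem and combines tree search, closed-loop execution, and visual feedback, without step-by-step user prompts; it also builds the UGC-Edit aesthetic evaluation benchmark for real-world assessment.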

Details Motivation: Existing instruction-based image editing methods depend heavily on carefully hand-crafted instructions, leaving the user to carry the full burden of task decomposition and step ordering, which prevents truly autonomous editing. Method: Models autonomous image editing as a long-horizon decision-making problem; uses tree search to reason about the user's aesthetic intent and plan multi-step editing actions; refines results iteratively through closed-loop execution with memory and visual feedback; and builds the UGC-Edit aesthetic evaluation benchmark together with a test set of 1,017 photos. Result: PhotoAgent clearly outperforms baseline methods on multiple metrics, with consistent gains in instruction adherence and visual quality; the UGC-Edit benchmark supports more reliable evaluation in real-world scenarios. Conclusion: By introducing explicit aesthetic planning and closed-loop decision making, PhotoAgent advances image editing toward real autonomy and offers a new paradigm for future intelligent editing systems. Abstract: With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.

[101] Face Time Traveller : Travel Through Ages Without Losing Identity

Purbayan Kar,Ayush Ghadiya,Vishal Chudasama,Pankaj Wasnik,C. V. Jawahar

Main category: cs.CV

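TL;DR: This paper proposes FaceTT, a diffusion-based face aging method that achieves high-fidelity, identity-consistent age transformation through face-attribute-aware prompt refinement, tuning-free angular inversion, and adaptive attention control.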

Details Motivation: Existing face aging methods struggle to preserve identity across wide age gaps, and static attention mechanisms and optimization-heavy inversion limit their adaptability, fine-grained control, and background consistency. Method: Proposes the FaceTT framework, comprising a face-attribute-aware prompt refinement strategy (encoding intrinsic biological and extrinsic environmental aging cues), a tuning-free angular inversion method (efficiently mapping real faces into the diffusion latent space), and an adaptive attention control mechanism (dynamically balancing cross-attention and self-attention). Result: Extensive experiments on benchmark datasets and an in-the-wild test set show that FaceTT outperforms current state-of-the-art methods in identity retention, background preservation, and aging realism. Conclusion: FaceTT addresses key challenges in face aging, including identity consistency, visual realism, and background consistency, providing more reliable support for applications in entertainment, forensics, and digital archiving. Abstract: Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite progress, recent face aging models struggle with identity preservation in wide age transformations, and static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control, and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. Here, we introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method is proposed that efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism is introduced that dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and an in-the-wild test set demonstrate that FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art (SOTA) methods.

[102] CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation

Tong Wang,Yaolei Qi,Siwen Wang,Imran Razzak,Guanyu Yang,Yutong Xie

Main category: cs.CV

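TL;DR: This paper proposes the CMSA-Net framework, which uses a Causal Multi-scale Aggregation (CMA) module and a Dynamic Multi-source Reference (DMR) strategy to improve the accuracy and real-time performance of video polyp segmentation.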

Details Motivation: Video polyp segmentation is challenging because polyps are semantically hard to distinguish from the surrounding mucosa and their position and scale change substantially between frames. Method: Proposes CMSA-Net, which includes a Causal Multi-scale Aggregation (CMA) module that uses causal attention to aggregate multi-scale features from past frames, and a Dynamic Multi-source Reference (DMR) strategy that adaptively selects reference frames based on semantic separability and prediction confidence. Result: Achieves state-of-the-art performance on the SUN-SEG dataset with a good balance between segmentation accuracy and real-time clinical applicability. Conclusion: CMSA-Net improves the robustness and efficiency of video polyp segmentation and has practical clinical value. Abstract: Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.
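
Causal attention, as used by the CMA module, means each frame may attend only to itself and earlier frames. The sketch below computes causal softmax weights from a matrix of raw frame-to-frame scores. It illustrates the masking principle only: how CMSA-Net computes the scores and combines scales is not described in the abstract, and the function name is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where frame t sees only frames 0..t.

    scores: T x T list of raw attention scores (scores[t][s] is how
    strongly frame t matches frame s). Future positions get weight 0.
    """
    total = len(scores)
    weights = []
    for t in range(total):
        visible = scores[t][: t + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        norm = sum(exps)
        weights.append([e / norm for e in exps] + [0.0] * (total - t - 1))
    return weights

# With equal scores, each frame spreads attention evenly over its past.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print(row)
```

Because no weight is ever placed on a later frame, the module can run online during an examination, which fits the real-time setting the entry emphasizes.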

[103] Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification

G. A. S. L Ranasinghe,J. A. S. T. Jayakody,M. C. L. De Silva,G. Thilakarathne,G. M. R. I. Godaliyadda,H. M. V. R. Herath,M. P. B. Ekanayake,S. K. Navaratnarajah

Main category: cs.CV

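TL;DR: This paper proposes a low-cost, portable method for rapid soil texture assessment based on multispectral imaging (MSI) and machine learning, which predicts soil particle composition (clay, silt, and sand content) and USDA texture class with high accuracy.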

Details Motivation: Conventional soil texture determination relies on slow, labor-intensive laboratory particle-size analysis, while existing sensing techniques are often costly or too coarse to support routine field-scale deployment. Method: The authors design and build a low-cost 13-channel multispectral imaging system covering 365–940 nm; regression models predict clay, silt, and sand percentages, while a direct classifier and an indirect classifier based on the USDA texture triangle predict the 12 USDA texture classes. Result: In experiments on mixed soil samples, composition prediction reaches an R² of up to 0.99 and texture classification accuracy exceeds 99%. Conclusion: MSI combined with data-driven modeling enables accurate, non-destructive, field-deployable soil texture characterization suitable for preliminary geotechnical screening and precision agriculture. Abstract: Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load-bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour-intensive laboratory particle size tests, while many sensing alternatives are either costly or too coarse to support routine field-scale deployment. This paper proposes a robust and field-deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost-effective in-house MSI device operating from 365 nm to 940 nm to capture thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data obtained by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination R^2 up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture.
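The indirect classification step, which maps regressed clay, silt, and sand percentages onto the USDA texture triangle, can be sketched as a rule-based lookup. The function name is hypothetical, and the boundary rules below follow the commonly stated USDA class limits rather than anything given in the paper; behaviour exactly on class boundaries should be checked against the official triangle before use.

```python
def usda_texture_class(clay, silt, sand):
    """Map clay/silt/sand percentages (summing to about 100) to a USDA texture class."""
    if abs(clay + silt + sand - 100.0) > 1.0:
        raise ValueError("fractions must sum to 100%")
    if silt + 1.5 * clay < 15:
        return "sand"
    if silt + 2 * clay < 30:
        return "loamy sand"
    if (7 <= clay < 20 and sand > 52) or (clay < 7 and silt < 50):
        return "sandy loam"
    if 7 <= clay < 27 and 28 <= silt < 50 and sand <= 52:
        return "loam"
    if silt >= 80 and clay < 12:
        return "silt"
    if silt >= 50 and clay < 27:
        return "silt loam"
    if 20 <= clay < 35 and silt < 28 and sand > 45:
        return "sandy clay loam"
    if 27 <= clay < 40 and 20 < sand <= 45:
        return "clay loam"
    if 27 <= clay < 40 and sand <= 20:
        return "silty clay loam"
    if clay >= 35 and sand > 45:
        return "sandy clay"
    if clay >= 40 and silt >= 40:
        return "silty clay"
    if clay >= 40:
        return "clay"
    raise ValueError("composition fell on an unhandled boundary")

# A hypothetical regressed composition: 20% clay, 40% silt, 40% sand.
predicted_class = usda_texture_class(20.0, 40.0, 40.0)
```

In such a pipeline, errors in the regressed percentages matter most near class boundaries, where a small shift in composition changes the assigned class.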

[104] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang,Yabin Zhang,Yunhe Gao,Maya Varma,Clemence Mottez,Faidra Patsatzi,Jiaming Liu,Jin Long,Jean-Benoit Delbrouck,Sergios Gatidis,Akshay S. Chaudhari,Curtis P. Langlotz

Main category: cs.CV

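TL;DR: This paper presents CheXficient, a model that uses an active, principled data curation strategy to match or exceed the performance of its full-data counterpart while using only 22.7% of the chest X-ray data and under 27.3% of the compute, and that generalizes markedly better to long-tailed or rare conditions.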

Details Motivation: Pretraining medical imaging foundation models at scale suffers from data redundancy, severe class imbalance, and neglect of heterogeneous data quality, which lead to biased representations and computational inefficiency. Method: The authors introduce an active data curation mechanism that prioritizes highly informative samples during pretraining, and build CheXficient, a lightweight and efficient foundation model trained on paired chest X-ray images and reports. Result: Across 20 benchmarks spanning 5 task types, CheXficient matches or exceeds its full-data counterpart and other large-scale pretrained models, and it generalizes better to long-tailed or rare conditions. Conclusion: Principled, active data curation can serve as an efficient alternative to enlarging datasets, and it offers practical guidance for efficient pretraining and downstream adaptation of medical vision-language foundation models. Abstract: Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.

[105] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia,Chaoya Jiang,Shikun Zhang,Wei Ye

Main category: cs.CV

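TL;DR: This paper proposes Diagnostic-driven Progressive Evolution (DPE), which diagnoses model weaknesses, dynamically generates targeted multimodal training data, and reinforces the model continually, enabling scalable continual training of large multimodal models (LMMs) under open task distributions.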

Details Motivation: Current LMM training relies on static data and fixed recipes, which makes it hard to uncover capability blind spots or to apply dynamic, targeted reinforcement; meanwhile, test-driven error exposure and feedback-based correction have been shown to outperform repetitive practice. Method: DPE builds a spiral, iterative closed loop. First, multiple agents use tools such as web search and image editing to annotate and quality-control massive unlabeled multimodal data, producing diverse, realistic samples. Second, failure attribution identifies specific capability weaknesses, the data mixture is adjusted dynamically, and agents are guided to generate weakness-focused data for targeted reinforcement. Result: Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across 11 benchmarks. Conclusion: DPE is a scalable paradigm for continual training of large multimodal models under open task distributions. Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.

[106] SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

Qinfeng Zhu,Yunxi Jiang,Lei Fan

Main category: cs.CV

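TL;DR: This paper proposes SO3UFormer, a rotation-robust architecture for panoramic semantic segmentation that combines intrinsic feature modeling, quadrature-consistent spherical attention, and a gauge-aware relative positional mechanism to keep performance stable under arbitrary 3D reorientation.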

Details Motivation: Existing panoramic semantic segmentation models assume gravity-aligned input; under real-world misaligned capture (such as handheld jitter or attitude changes of aerial platforms) they collapse because they overfit latitude cues. Method: SO3UFormer (1) removes absolute latitude encoding to build gravity-independent intrinsic features; (2) uses quadrature-consistent spherical attention that accounts for non-uniform sampling density; and (3) introduces a gauge-aware relative positional mechanism based on tangent-plane projected angles and discrete gauge pooling. Training additionally uses index-based spherical resampling and a logit-level SO(3)-consistency regularizer. Result: The model reaches 72.03 mIoU on Pose35 and retains 70.67 mIoU under arbitrary full SO(3) rotations, far above the baseline SphereUFormer (25.26 mIoU). Conclusion: Through designs driven by geometric priors, SO3UFormer learns spherical representations that are largely independent of the coordinate frame, offering a new approach to robust panoramic understanding in real-world scenes. Abstract: Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
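To make the idea of accounting for non-uniform sampling density concrete, the following sketch computes per-row solid-angle (quadrature) weights for an equirectangular grid and uses them inside a softmax. The equirectangular layout, the function names, and the way the weights enter the softmax are assumptions for illustration; the abstract does not specify the paper's sampling scheme or attention formula.

```python
import math

def equirect_quadrature_weights(height, width):
    """Solid-angle weight of a pixel in each row of a height x width equirectangular grid."""
    d_lat = math.pi / height
    d_lon = 2.0 * math.pi / width
    weights = []
    for row in range(height):
        latitude = math.pi / 2.0 - (row + 0.5) * d_lat  # pixel-centre latitude
        # Pixels shrink toward the poles by a factor cos(latitude).
        weights.append(math.cos(latitude) * d_lat * d_lon)
    return weights  # one weight per row; all pixels in a row share it

def quadrature_softmax(scores, weights):
    """Softmax in which each key contributes in proportion to its solid angle."""
    peak = max(scores)
    terms = [w * math.exp(s - peak) for s, w in zip(scores, weights)]
    total = sum(terms)
    return [t / total for t in terms]

row_weights = equirect_quadrature_weights(64, 128)
sphere_area = sum(row_weights) * 128  # should be close to 4*pi
```

With these weights, densely sampled polar pixels no longer dominate the attention sum simply because there are more of them per unit area.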

[107] Towards Multimodal Domain Generalization with Few Labels

Hongzhao Li,Hao Dong,Hualei Wan,Shupan Li,Mingliang Xu,Muhammad Haris Khan

Main category: cs.CV

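TL;DR: This paper introduces a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), and designs a unified framework combining consistency regularization, disagreement-aware regularization, and cross-modal prototype alignment, which clearly outperforms baselines on the first SSMDG benchmarks.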

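Details Motivation: Existing methods cannot handle the combination of multimodality, domain generalization, and semi-supervised learning: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning ignores domain shift, and semi-supervised domain generalization methods are limited to single-modality inputs. Method: The unified framework has three core components: (1) Consensus-Driven Consistency Regularization, which produces reliable pseudo-labels from confident fused-unimodal consensus; (2) Disagreement-Aware Regularization, which makes effective use of ambiguous non-consensus samples; and (3) Cross-Modal Prototype Alignment, which learns domain- and modality-invariant representations and improves robustness to missing modalities through cross-modal translation. Result: On the first SSMDG benchmarks, the method consistently outperforms strong baselines in both standard and missing-modality scenarios. Conclusion: SSMDG is an important and challenging new direction, and the proposed framework combines the strengths of multimodal learning, domain generalization, and semi-supervised learning, suggesting a path toward data-efficient multimodal models that generalize well. Abstract: Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.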
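One plausible reading of "confident fused-unimodal consensus" is sketched below: an unlabeled sample receives a pseudo-label only when the fused prediction is confident and every unimodal prediction agrees with it; otherwise it is routed to the disagreement branch. The function names, the 0.9 threshold, and the exact agreement rule are assumptions, not details taken from the paper.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def split_by_consensus(fused_probs, unimodal_probs, threshold=0.9):
    """Return ("consensus", label) or ("disagreement", None) for one unlabeled sample."""
    fused_label = argmax(fused_probs)
    confident = fused_probs[fused_label] >= threshold          # assumed confidence rule
    agree = all(argmax(p) == fused_label for p in unimodal_probs)
    if confident and agree:
        return "consensus", fused_label   # usable as a pseudo-label
    return "disagreement", None           # left for a disagreement-aware loss

# Hypothetical two-class predictions from a fused head and two modality heads.
result = split_by_consensus([0.05, 0.95], [[0.2, 0.8], [0.4, 0.6]])
```

Samples in the disagreement branch are not discarded in the proposed framework; the abstract states that a separate regularizer makes use of them.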

[108] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

Haofan Wu,Nay Aung,Theodoros N. Arvanitis,Joao A. C. Lima,Steffen E. Petersen,Le Zhang

Main category: cs.CV

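TL;DR: This paper proposes the Chain of Flow (COF) framework, the first to generate patient-specific 4D cardiac structure and motion from a single-cycle 12-lead ECG alone, moving cardiac digital twins from task-specific predictors toward manipulable, simulatable general-purpose virtual hearts.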

Details Motivation: Existing cardiac digital twins (CDTs) are mostly task-specific predictors and do not provide a complete virtual heart that is patient-specific, updatable, and open to simulation; a foundational generative framework is needed that can reconstruct full 4D cardiac structure and function from an easily acquired signal such as the ECG. Method: The ECG-driven generative framework Chain of Flow (COF) jointly uses cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics, and then reconstructs a complete 4D heart from a single-cycle ECG. Result: Validation on multiple cohorts shows that COF accurately recovers cardiac anatomy, chamber-wise function, and dynamic motion patterns; the reconstructed 4D hearts support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. Conclusion: COF is the first method to generate a full patient-specific 4D heart from ECG alone, upgrading CDTs from narrow predictive models to a foundational virtual-organ platform that is generative, manipulable, and suited to general simulation. Abstract: A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.

[109] OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality

Federico Nesti,Gianluca D'Amico,Mauro Marinoni,Giorgio Buttazzo

Main category: cs.CV

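TL;DR: This paper proposes a multi-modal augmented reality framework that inserts photorealistic virtual objects into real railway video sequences to address the scarcity of high-quality annotated data for railway obstacle detection, and it releases the public OSDaR-AR dataset.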

Details Motivation: Safety-critical railway tasks such as obstacle detection lack high-quality annotated data; existing simulators suffer from a sim-to-real gap, while simple image-masking techniques lack spatio-temporal coherence. Method: Built on Unreal Engine 5, the pipeline fuses LiDAR point clouds with INS/GNSS data to place virtual objects accurately and stably in real railway sequences from OSDaR23, and it proposes a segmentation-based refinement strategy for INS/GNSS data to improve the realism of the augmented sequences. Result: The authors construct augmented single- and multi-frame railway scenes with spatio-temporal coherence and high realism, and release the public dataset OSDaR-AR to support the development of next-generation railway perception systems. Conclusion: The multi-modal AR framework narrows the gap between simulation and reality and offers a scalable way to generate high-quality data for intelligent railway perception. Abstract: Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the "sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: https://syndra.retis.santannapisa.it/osdarar.html

[110] WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

Runwei Guan,Shaofeng Liang,Ningwei Ouyang,Weichen Fei,Shanliang Yao,Wei Dai,Chenhao Ge,Penglei Sun,Xiaohui Zhu,Tao Huang,Ryan Wen Liu,Hui Xiong

Main category: cs.CV

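TL;DR: This paper presents WaterVideoQA, the first large-scale video question answering benchmark for all-waterway environments, and designs NaviMind, a multi-agent neuro-symbolic system that gives Autonomous Surface Vessels (ASVs) interpretable, regulation-compliant cognitive reasoning and navigation decision-making in complex, dynamic maritime environments.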

Details Motivation: Current autonomous navigation relies on passive perception and lacks knowledge-driven, interactive environmental cognition; in high-stakes maritime navigation, visual perception must be tightly integrated with complex cognitive reasoning to ensure safe and precise maneuvering. Method: The authors build the WaterVideoQA benchmark (3,029 video clips across six waterway categories, covering variable lighting and weather, evaluated under a five-tier cognitive framework) and propose NaviMind, a multi-agent neuro-symbolic system that combines Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification. Result: The framework clearly outperforms existing baselines on multiple metrics and delivers more reliable, interpretable, and regulation-compliant interactive navigation decisions in dynamic maritime environments. Conclusion: This work moves ASVs from superficial pattern matching toward deeper cognitive reasoning and sets a new paradigm for intelligent, trustworthy interaction in dynamic waters. Abstract: While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

[111] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Wenhui Tan,Xiaoyi Yu,Jiaze Li,Yijing Chen,Jianzhong Ju,Zhenbo Luo,Ruihua Song,Jian Luan

Main category: cs.CV

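TL;DR: This paper proposes the MSJoE framework, which jointly optimizes a multimodal large language model (MLLM) and a lightweight key-frame sampler for efficient long-form video understanding. It uses CLIP to build a query-frame similarity matrix, samples key frames from it, and trains both components together with reinforcement learning, yielding clear accuracy gains on several benchmarks.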

Details Motivation: Long-form video understanding remains challenging for multimodal large language models (MLLMs), yet only a small number of key frames in a video actually help answer a given question, which calls for efficient sampling and joint optimization with the model. Method: In MSJoE, (1) the MLLM generates queries covering diverse visual perspectives; (2) a frozen CLIP model computes a query-frame similarity matrix; (3) a lightweight sampler predicts key-frame weights from this matrix and selects a compact frame set; (4) the key frames are fed to the MLLM to generate the answer; and (5) the MLLM and the sampler are jointly optimized with reinforcement learning. Result: On VideoMME, LongVideoBench, LVBench, and MLVU, MSJoE improves accuracy by 8.0% over the base MLLM and by 1.1% over the strongest baseline; the authors also build a new long-video QA dataset with 2.8K videos and 7K QA pairs. Conclusion: MSJoE shows that jointly evolving the MLLM and a lightweight sampler is effective, and it highlights the importance of query-driven key-frame selection and end-to-end joint optimization for efficient long-form video understanding. Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM, and 1.1% higher accuracy than the strongest baseline method.
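The query-frame similarity matrix and the key-frame selection step can be sketched as follows. This is a simplified stand-in: the embeddings are toy vectors rather than CLIP outputs, and the learned sampler is replaced by a fixed rule (score each frame by its best-matching query, then keep the top k). All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(query_embs, frame_embs):
    """Rows index queries, columns index frames."""
    return [[cosine(q, f) for f in frame_embs] for q in query_embs]

def select_key_frames(sim, k):
    """Score each frame by its best-matching query and keep the top k in temporal order."""
    num_frames = len(sim[0])
    frame_scores = [max(row[j] for row in sim) for j in range(num_frames)]
    top = sorted(range(num_frames), key=lambda j: frame_scores[j], reverse=True)[:k]
    return sorted(top)

# Toy embeddings: two queries and four frames (made-up values).
queries = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
key_frames = select_key_frames(similarity_matrix(queries, frames), k=2)
```

In MSJoE the mapping from the similarity matrix to sampling weights is learned and optimized jointly with the MLLM, so a fixed max-pooling rule like this one would serve only as a baseline.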

[112] pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Shentong Mo,Xufang Luo,Dongsheng Li

Main category: cs.CV

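TL;DR: This paper proposes pMoE, a Mixture-of-Experts prompt tuning method that fuses knowledge from multiple domain experts through expert-specific prompt tokens and a learnable dispatcher, achieving an optimal trade-off between performance and efficiency across 47 visual adaptation tasks.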

Details Motivation: Existing parameter-efficient fine-tuning methods usually draw on a single pre-trained model (general or medical) and overlook the synergy that integrating knowledge from multiple domains could bring. Method: pMoE comprises expert-specific prompt tokens and a learnable dispatcher that dynamically routes tokens at each prompt layer, optimizing the contribution of each domain expert during adaptation. Result: Experiments on 47 classification and segmentation tasks in general and medical domains show that pMoE substantially improves performance and achieves the best balance between computational efficiency and adaptation effectiveness. Conclusion: By integrating knowledge from multiple source domains, pMoE improves the generalization and applicability of the model and offers a new approach to parameter-efficient visual adaptation. Abstract: Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach typically overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and a learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating domain knowledge from diverse experts, the proposed pMoE significantly enhances the model's versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance with a large margin of improvements but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.

[113] Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler,Daniel Matthes,Finn Gerdts,Patrick Frenzel,Torsten Warnke,Matthias Englert,Tina Koevari,Mirco Fuchs

Main category: cs.CV

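TL;DR: This paper presents a fully automatic, generalized video-analysis framework for reconstructing stroke rate and velocity across canoe sprint disciplines (K1-K4, C1-C2) and distances (200m–500m). It combines YOLOv8 detection, U-Net calibration, optical-flow tracking, and pose or bounding-box features, reaching accuracy comparable to GPS without wearable sensors or manual annotation.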

Details Motivation: GPS is the gold standard for performance analysis but is not always available; coaches need an automated method that requires no hardware and works on real race footage, including panned and zoomed recordings. Method: YOLOv8 detects buoys and athletes, and the known buoy grid is used to estimate homographies; a U-Net learns a boat-specific athlete offset to locate the boat tip precisely; optical flow provides robust tracking adapted to multi-athlete boat types; and stroke rate is extracted from either pose estimates or athlete bounding boxes. Result: Validated against GPS ground truth from elite competitions, the method yields a velocity RRMSE of 0.020 ± 0.011 (ρ = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (ρ = 0.932). Conclusion: The framework provides highly accurate, fully automatic, sensor-free kinematic analysis for canoe sprint, applicable to training feedback and technique assessment, and it greatly reduces dependence on dedicated equipment and manual intervention. Abstract: Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-Net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
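The reported error metrics can be made concrete with a short sketch. It assumes RRMSE means the root-mean-square error divided by the mean of the GPS reference, and that rho is the Pearson correlation coefficient; the abstract does not define either, and other normalizations (or a rank correlation) are possible. The velocity values are made up for illustration.

```python
import math

def rrmse(estimate, reference):
    """RMSE divided by the mean of the reference signal (one common definition)."""
    n = len(reference)
    rmse = math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n)
    return rmse / (sum(reference) / n)

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

gps_velocity = [5.0, 5.2, 5.4, 5.1]     # m/s, made-up reference values
video_velocity = [5.1, 5.1, 5.5, 5.0]   # m/s, made-up video-based estimates
velocity_rrmse = rrmse(video_velocity, gps_velocity)
velocity_rho = pearson(video_velocity, gps_velocity)
```

Under this definition, an RRMSE of 0.020 corresponds to a typical error of about 2% of the mean boat velocity.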

[114] Cross-Task Benchmarking of CNN Architectures

Kamal Sherawat,Vikrant Bhati

Main category: cs.CV

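TL;DR: This study compares five ResNet-18-based dynamic convolutional neural network (CNN) variants on image classification, segmentation, and time series analysis, and finds that attention mechanisms and dynamic convolution (ODConv in particular) improve accuracy, efficiency, and generalization.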

Details Motivation: Conventional CNNs are limited in feature representation and adaptability when handling multimodal, morphologically complex, or cross-task data, which calls for more flexible, adaptive convolutional structures. Method: Starting from ResNet-18, the study builds and compares five CNN variants: a vanilla CNN, a hard attention CNN, soft attention CNNs with local and global attention, and omni-directional dynamic convolution (ODConv). Experiments are run on Tiny ImageNet, Pascal VOC, and the UCR time series datasets. Result: Dynamic CNNs, especially ODConv, outperform the vanilla CNN on all tasks, with higher accuracy, better computational efficiency, and stronger cross-task generalization; ODConv is particularly effective on morphologically complex images. Conclusion: Dynamic convolution and attention mechanisms are key routes to improving the adaptability and expressive power of CNNs, and they inform multimodal data modeling and neural network architecture design. Abstract: This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.

[115] ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Jiayu Chen,Ruoyu Lin,Zihao Zheng,Jingxin Li,Maoliang Li,Guojie Luo,Xiang chen

Main category: cs.CV

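TL;DR: This paper proposes the ToProVAR framework, which uses attention entropy to analyze the semantic projections of VAR models, identifies sparsity patterns along the token, layer, and scale dimensions, and designs fine-grained optimization strategies that achieve up to 3.4x acceleration while preserving generation quality.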

Details Motivation: VAR models improve generation quality but face a severe efficiency bottleneck in the later generation stages, and existing methods such as FastVAR and SkipVAR rely on heuristic skipping strategies without a deeper understanding of the model's internal parameter dynamics. Method: The paper proposes an attention-entropy-based analysis framework that characterizes semantic projections across architectural dimensions, identifies sparsity along the token, layer, and scale dimensions, and designs targeted fine-grained optimization strategies accordingly. Result: On Infinity-2B and Infinity-8B, the method achieves up to 3.4x acceleration, with semantic fidelity and detail preservation clearly better than conventional methods. Conclusion: Through analysis-driven sparsity modeling, ToProVAR eases the efficiency bottleneck of VAR models and achieves a better balance between speed and quality. Abstract: Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
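Attention entropy, the quantity the analysis is built on, is the Shannon entropy of a row of attention weights: low entropy means a query concentrates on a few keys, high entropy means its attention is spread out. The sketch below computes it and flags low-entropy tokens; the thresholding rule and function names are assumptions for illustration, since the abstract does not say how entropy is turned into token, layer, or scale sparsity decisions.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def low_entropy_tokens(attention_rows, threshold):
    """Indices of query tokens whose attention is concentrated (entropy below threshold)."""
    return [i for i, row in enumerate(attention_rows)
            if attention_entropy(row) < threshold]

# Two made-up attention rows: one uniform, one sharply peaked.
rows = [[0.25, 0.25, 0.25, 0.25],
        [0.97, 0.01, 0.01, 0.01]]
concentrated = low_entropy_tokens(rows, threshold=0.5)
```

A uniform row over n keys has the maximum entropy log(n), so thresholds are easiest to compare across scales when expressed as a fraction of that maximum.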

[116] OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Junuk Cha,Jihyeon Kim,Han-Mu Park

Main category: cs.CV

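TL;DR: This paper presents OpenFS, an open-source approach to fingerspelling recognition and synthesis that uses implicit signing-hand detection, a monotonic alignment loss, and a frame-wise letter-conditioned generator to address signing-hand ambiguity, the peaky behavior of the CTC loss, and the out-of-vocabulary (OOV) problem.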

Details Motivation: Fingerspelling recognition is hampered by signing-hand ambiguity, the peaky behavior of the CTC loss, and the out-of-vocabulary (OOV) problem; addressing these would improve communication between Deaf and hearing communities. Method: The authors propose a multi-hand-capable fingerspelling recognizer that performs implicit signing-hand detection through a dual-level positional encoding and a signing-hand focus (SF) loss; a monotonic alignment (MA) loss replaces the CTC loss to enforce temporal consistency; and a frame-wise letter-conditioned generator synthesizes pose sequences for OOV words and is used to build a new synthetic benchmark, FSNeo. Result: The approach reaches state-of-the-art fingerspelling recognition performance across experiments, validating both the recognizer and the generator; code and data are released. Conclusion: OpenFS mitigates key challenges in fingerspelling recognition and provides a practical, extensible approach to open-vocabulary fingerspelling recognition and synthesis. Abstract: Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Codes and data are available in: https://github.com/JunukCha/OpenFS.

[117] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Feng Guo,Jiaxiang Liu,Yang Li,Qianqian Shi,Mingkun Xu

Main category: cs.CV

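TL;DR: This paper introduces MM-NeuroOnco, a large-scale, semantically rich multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, together with the evaluation benchmark MM-NeuroOnco-Bench; NeuroOnco-GPT, trained on this dataset, achieves a 27% absolute accuracy improvement on diagnostic questions.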

Details Motivation: Existing public datasets fall well short in annotation richness and diagnostic semantics, and cannot support brain tumor diagnostic models that combine lesion detection with clinically interpretable reasoning. Method: The authors build MM-NeuroOnco, with 24,726 MRI slices and roughly 200,000 semantically enriched instructions; develop a multi-model collaborative pipeline for automated medical information completion and quality control; design MM-NeuroOnco-Bench, a manually annotated, rejection-aware evaluation benchmark; and fine-tune NeuroOnco-GPT on the dataset. Result: Among ten mainstream models, the best reaches only 41.88% accuracy on diagnosis-related questions; after fine-tuning, NeuroOnco-GPT achieves a 27% absolute accuracy improvement on diagnostic questions. Conclusion: MM-NeuroOnco strengthens the data foundation and evaluation standard for multimodal brain tumor diagnostic understanding and confirms the key role of high-quality semantic instruction data in clinically oriented multimodal reasoning. Abstract: Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco

[118] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao,Frederik Hauke,Juliana De Castilhos,Sven Nebelung,Daniel Truhn

Main category: cs.CV

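TL;DR: This paper proposes a multi-agent framework based on contrastive adjudication for distinguishing, in a zero-shot setting, diseases that are visually hard to separate but differ substantially in clinical management (such as melanoma vs. atypical nevus and pulmonary edema vs. pneumonia). Accuracy improves by 11 percentage points on dermoscopy data, but overall performance remains insufficient for clinical deployment.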

Details Motivation: Most medical imaging research focuses on automating routine clinical workflows, while the important setting of distinguishing diseases that look highly similar but are managed very differently remains underexplored. Method: The authors build a multi-agent framework based on contrastive adjudication and run zero-shot benchmarks on two imaging-only proxy diagnostic tasks: (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia. Result: Diagnostic accuracy on dermoscopy data improves by 11 percentage points and unsupported claims decrease on qualitative samples, but overall performance remains insufficient for clinical deployment and is limited by the inherent uncertainty of human annotations and the absence of clinical context. Conclusion: This pilot study suggests that multi-agent methods have some potential for zero-shot diagnosis in visually confounded cases, but annotation uncertainty and missing clinical context must be addressed before real-world clinical use. Abstract: The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

[119] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu,Zixuan Wang,Guangyuan Wang,Li Hu,Zhongyi Zhang,Peng Zhang,Bang Zhang,Song-Hai Zhang

Main category: cs.CV

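TL;DR: This paper proposes UCM, a framework that unifies long-term memory and precise camera control through a time-aware positional encoding warping mechanism, together with a dual-stream diffusion transformer for efficient high-fidelity video generation. It significantly outperforms existing methods in long-term scene consistency and camera controllability.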

Details Motivation: Existing world models based on video generation struggle to keep content consistent over the long term when scenes are revisited, and cannot provide precise camera control driven by user input; methods based on explicit 3D reconstruction lack flexibility, while methods that rely on previous frames without explicit spatial correspondence limit controllability and consistency. Method: Propose the UCM framework: 1) a time-aware positional encoding warping mechanism that unifies long-term memory and camera control; 2) an efficient dual-stream diffusion transformer for high-fidelity generation; 3) a scalable data curation strategy based on point-cloud rendering that supports training on more than 500K monocular videos. Result: On real-world and synthetic benchmarks, UCM significantly outperforms state-of-the-art methods in long-term scene consistency and achieves precise camera controllability in high-fidelity video. Conclusion: UCM addresses the core challenges of long-term consistency and camera control in world models, offering a more flexible, controllable, and stable video generation paradigm for interactive environment simulation. Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

[120] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Camile Lendering,Erkut Akdag,Egor Bondarev

Main category: cs.CV

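TL;DR: SubspaceAD is a training-free method for industrial visual anomaly detection. It extracts patch features from normal images with a frozen DINOv2, models the low-dimensional subspace of normal variation with PCA, and uses the reconstruction residual as the anomaly score, reaching state-of-the-art performance in one-shot and few-shot settings.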

Details Motivation: Existing few-shot anomaly detection methods rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models; the authors question whether such complexity is necessary given the feature representations of vision foundation models. Method: SubspaceAD has two steps: 1) extract patch-level features from a few normal images with a frozen DINOv2; 2) fit a PCA model to these features to estimate the low-dimensional subspace of normal variation. At inference, the anomaly score is the reconstruction residual with respect to this subspace. Result: In the one-shot setting, image-level and pixel-level AUROC reach 98.0% and 97.6% on MVTec-AD and 93.3% and 98.3% on VisA, surpassing the previous state of the art, with no training, prompt tuning, or memory bank. Conclusion: Complex mechanisms are not necessary for few-shot anomaly detection; frozen foundation-model features plus PCA subspace modeling suffice for high-performing, interpretable, statistically grounded training-free anomaly detection. Abstract: Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
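The two stages described in this abstract (fit a PCA subspace to normal patch features, then score by reconstruction residual) can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for DINOv2 patch features, and the function names `fit_subspace` and `anomaly_scores`, the feature dimension, and the number of components are assumptions chosen for illustration.

```python
import numpy as np

def fit_subspace(normal_feats, n_components):
    """Fit a PCA subspace to patch features of normal images (rows = patches)."""
    mean = normal_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_scores(feats, mean, basis):
    """Squared reconstruction residual of each patch w.r.t. the subspace."""
    centered = feats - mean
    recon = centered @ basis.T @ basis  # projection onto the subspace
    return ((centered - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Stand-in for frozen-backbone features: normal patches lie near a 2-D plane in 8-D.
plane = rng.normal(size=(2, 8))
normal = rng.normal(size=(200, 2)) @ plane + 0.01 * rng.normal(size=(200, 8))
mean, basis = fit_subspace(normal, n_components=2)

normal_test = rng.normal(size=(5, 2)) @ plane           # on the normal plane
anomalous = normal_test + 3.0 * rng.normal(size=(5, 8))  # pushed off the plane
print(anomaly_scores(normal_test, mean, basis).max())
print(anomaly_scores(anomalous, mean, basis).min())
```

Patches that stay in the fitted subspace get a residual near zero, while patches with energy outside it get a large residual, which is the property the anomaly score relies on.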

[121] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

Xinglong Luo,Ao Luo,Zhengning Wang,Yueqi Yang,Chaoyu Feng,Lei Lei,Bing Zeng,Shuaicheng Liu

Main category: cs.CV

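TL;DR: This paper proposes DMAligner, a diffusion-based image alignment framework that uses alignment-oriented view synthesis to overcome the limitations of traditional optical-flow methods under occlusion and illumination changes.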

Details Motivation: Existing optical-flow-based image alignment methods are susceptible to occlusions and illumination variations, which degrades alignment quality and harms downstream tasks. Method: Propose the DMAligner framework, comprising Dynamics-aware Diffusion Training and a Dynamics-aware Mask Producing (DMP) module, and build the DSIA dataset for training and evaluation. Result: DMAligner outperforms existing methods on the self-built DSIA benchmark and in qualitative comparisons on several widely used video datasets. Conclusion: Diffusion models can replace traditional optical-flow warping for high-quality image alignment, showing greater robustness and adaptability especially in dynamic scenes. Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.

[122] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang,Leigang Qu,Tianyu Yang,Xiangzhao Hao,Yifan Xu,Haiyun Guo,Jinqiao Wang

Main category: cs.CV

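TL;DR: This paper proposes WISER, a training-free framework that unifies the text-to-image (T2I) and image-to-image (I2I) retrieval paths and performs zero-shot composed image retrieval (ZS-CIR) through a "retrieve-verify-refine" pipeline, substantially improving performance.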

Details Motivation: Existing ZS-CIR methods reduce the multimodal query to a single modality (only an edited caption or only an edited image), which loses fine-grained visual detail or struggles with complex semantic modifications, respectively; the strengths of both need to be combined and adapted to different query intents. Method: WISER is a training-free framework with three stages: 1) Wider Search: generate both edited captions and edited images for parallel dual-path retrieval; 2) Adaptive Fusion: a verifier assesses retrieval confidence, dynamically fusing the two paths for high-confidence results and triggering refinement for low-confidence ones; 3) structured self-reflection: generate refinement suggestions for uncertain retrievals to guide the next, deeper retrieval round. Result: Compared with existing training-free methods, WISER achieves relative improvements of 45% in mAP@5 on CIRCO and 57% in Recall@1 on CIRR, and it surpasses many methods that require training. Conclusion: By explicitly modeling intent awareness and uncertainty awareness, WISER coordinates the T2I and I2I paths in the zero-shot setting, showing strong generalization and performance. Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods.
Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

[123] Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images

Zhangjian Ji,Huijia Yan,Shaotong Qiao,Kai Feng,Wei Wei

Main category: cs.CV

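TL;DR: This paper proposes a small object detection algorithm based on Spatial Laplacian Pyramid Attention and multi-scale feature enhancement to improve the detection of small objects in aerial images.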

Details Motivation: Objects in aerial images are small and densely, non-uniformly distributed, which makes detection inefficient, and existing methods represent and semantically understand small objects poorly. Method: 1) design a Spatial Laplacian Pyramid Attention (SLPA) module inserted after each ResNet-50 stage to emphasize important local regions; 2) build a Multi-Scale Feature Enhancement Module (MSFEM) integrated into the lateral connection of the C5 layer of the FPN; 3) use deformable convolutions in the FPN cross-level fusion to align features. Result: On the VisDrone and DOTA benchmarks, the proposed method outperforms the original algorithm on small object detection. Conclusion: Combining SLPA, MSFEM, and deformable convolutions improves small object detection in aerial images. Abstract: Detecting objects in aerial images faces significant challenges, including the small size and the dense, non-uniform distribution of objects over high-resolution images, which make detection inefficient. In this paper, we propose a small object detection algorithm for aerial images based on Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement. First, to improve the feature representation of ResNet-50 on small objects, we present a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Second, to enhance the model's semantic understanding and feature representation, we design a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connection of the C5 layer when building the Feature Pyramid Network (FPN). Finally, the representation quality of a traditional feature pyramid network suffers because features are not aligned when upper and lower levels are fused. To handle this, we use deformable convolutions to align the features during the fusion of upper and lower FPN levels, which helps the model detect and recognize small objects. Extensive experiments on two benchmark datasets, VisDrone and DOTA, demonstrate that our improved model performs better for small object detection in aerial images than the original algorithm.

[124] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

Aashish Rai,Angela Xing,Anushka Agarwal,Xiaoyan Cong,Zekun Li,Tao Lu,Aayush Prakash,Srinath Sridhar

Main category: cs.CV

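TL;DR: This paper proposes PackUV, a novel 4D Gaussian representation that maps Gaussian attributes into multi-scale UV atlases for compact, image-native storage, together with PackUV-GS, a temporally consistent fitting method. The representation is compatible with standard video codecs and improves volumetric video reconstruction quality and practicality on long sequences with large motion and disocclusions.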

Details Motivation: Existing Gaussian Splatting based methods perform poorly on long sequences and under temporal inconsistency, large motion, and disocclusion, and their outputs are incompatible with conventional video coding, which hinders use at scale. Method: Propose the PackUV representation, which maps all Gaussian attributes into structured multi-scale UV atlases; design the PackUV-GS fitting method, which optimizes Gaussian parameters directly in the UV domain; introduce a flow-guided Gaussian labeling and video keyframing module that separates dynamic from static regions and strengthens temporal consistency. Result: PackUV is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without quality loss; validated on the self-built large-scale PackUV-2B dataset (50+ cameras, 2 billion frames), it scales to sequences of up to 30 minutes with rendering fidelity surpassing existing methods. Conclusion: Through UV atlases and temporally consistent optimization, PackUV addresses the scalability, compatibility, and robustness challenges of volumetric video reconstruction, paving the way for practical streaming deployment. Abstract: Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames.
Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

[125] D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

Argo Saakyan,Dmitry Solntsev

Main category: cs.CV

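TL;DR: This paper presents D-FINE-seg, a transformer-based real-time instance segmentation method that extends the D-FINE detector with a lightweight mask head and several segmentation-aware training strategies. It improves F1-score over Ultralytics YOLO26 on the TACO dataset, and an end-to-end training and inference framework supporting multiple backends (ONNX, TensorRT, OpenVINO) is released as open source.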

Details Motivation: Transformer-based real-time instance segmentation is still uncommon, whereas D-FINE performs strongly in object detection, which motivates extending it to instance segmentation with efficient deployment. Method: Add a lightweight mask head to D-FINE; introduce segmentation-aware training, including box-cropped BCE and Dice mask losses, auxiliary mask supervision, denoising mask supervision, and an adapted Hungarian matching cost. Also build an end-to-end training, export, and inference pipeline across ONNX, TensorRT, and OpenVINO. Result: On the TACO dataset, under a unified TensorRT FP16 end-to-end benchmark, D-FINE-seg improves F1-score over Ultralytics YOLO26 while maintaining competitive latency; the full toolchain is open-sourced. Conclusion: D-FINE-seg extends a high-performing transformer detector into a real-time instance segmentation solution that balances accuracy and efficiency, and its open-source multi-backend inference framework supports industrial deployment. Abstract: Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head and segmentation-aware training, including box-cropped BCE and Dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. A second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.
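The abstract names "box-cropped BCE and Dice mask losses" without giving formulas. The sketch below shows one plausible reading, in which both losses are evaluated only on pixels inside the instance's bounding box. The function name `box_cropped_mask_loss`, the end-exclusive box convention, and the smoothing constant `eps` are assumptions for illustration and may differ from the released code.

```python
import numpy as np

def box_cropped_mask_loss(logits, target, box, eps=1e-6):
    """BCE and Dice losses computed only inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    z = logits[y0:y1, x0:x1]
    t = target[y0:y1, x0:x1].astype(float)
    p = 1.0 / (1.0 + np.exp(-z))  # per-pixel mask probability
    # Numerically stable BCE-with-logits: max(z, 0) - z*t + log(1 + exp(-|z|)).
    bce = (np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z)))).mean()
    # Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|).
    dice = 1.0 - (2.0 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)
    return bce, dice

# Toy 6x6 example: a 2x2 object, predicted almost perfectly.
target = np.zeros((6, 6))
target[2:4, 2:4] = 1
logits = np.where(target == 1, 20.0, -20.0)
print(box_cropped_mask_loss(logits, target, box=(1, 1, 5, 5)))
```

Cropping to the box means confident false positives far from the instance do not contribute to this instance's mask loss; they are left to the detection and matching terms.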

[126] GeoWorld: Geometric World Models

Zeyu Zhang,Danning Li,Ian Reid,Richard Hartley

Main category: cs.CV

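TL;DR: This paper proposes GeoWorld, a predictive world model based on hyperbolic geometry. A Hyperbolic JEPA maps latent representations onto a hyperbolic manifold to preserve geometric and hierarchical structure, and Geometric Reinforcement Learning enables stable multi-step planning, improving multi-step planning success rates on CrossTask and COIN.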

Details Motivation: Existing energy-based predictive world models learn latent representations in Euclidean space, neglecting the geometric and hierarchical structure among states, and they struggle with long-horizon prediction, so performance degrades rapidly over extended rollouts. Method: Propose GeoWorld, which comprises a Hyperbolic JEPA that builds a hyperbolic latent space preserving geometric and hierarchical relations, and a Geometric Reinforcement Learning method for energy-based optimization that enables multi-step planning in that space. Result: On CrossTask and COIN, the success rate (SR) improves by about 3% for 3-step planning and about 2% for 4-step planning over the state-of-the-art V-JEPA 2. Conclusion: Extending latent-space modeling from Euclidean to hyperbolic geometry, combined with a suitably adapted optimization mechanism, improves the robustness and performance of energy-based world models in long-horizon visual planning. Abstract: Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
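The abstract says latents are mapped from Euclidean space onto a hyperbolic manifold but does not state which model of hyperbolic space is used. As a hedged illustration, the sketch below uses the Poincaré ball, a common choice, with the standard exponential map at the origin and the standard geodesic distance. The function names and the choice of the Poincaré model are assumptions, not details from the paper.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v into the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball (c = 1)."""
    def sq(u):
        return sum(a * a for a in u)
    diff = sq([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq(x)) * (1.0 - sq(y))))

origin = [0.0, 0.0]
near = exp_map_origin([0.3, 0.4])    # Euclidean norm 0.5
far = exp_map_origin([3.0, 4.0])     # Euclidean norm 5.0
print(poincare_distance(origin, near), poincare_distance(origin, far))
```

Every mapped point stays strictly inside the unit ball, yet distances from the origin keep growing with the norm of the input vector. This is why hyperbolic embeddings are often used for tree-like or hierarchical structure: the space available near the boundary grows exponentially with distance.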

[127] Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Yiding Sun,Jihua Zhu,Haozhe Cheng,Chaoyi Lu,Zhichuan Yang,Lin Chen,Yaonan Wang

Main category: cs.CV

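TL;DR: This paper proposes PointATA, an "Align then Adapt" paradigm for efficiently transferring pre-trained 3D point cloud models to 4D point cloud video understanding. It uses optimal-transport theory to mitigate the modality gap and lightweight adapters to mitigate overfitting, achieving parameter-efficient, strong results on multiple 4D tasks.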

Details Motivation: 4D point cloud video data are scarce, which limits the scalability of self-supervised 4D models; directly transferring 3D pre-trained models faces two challenges, overfitting and the 3D-4D modality gap. Method: Propose PointATA, a two-stage parameter-efficient transfer paradigm: Stage 1 uses optimal-transport theory to quantify the distributional discrepancy between 3D and 4D datasets and trains a point align embedder to narrow the modality gap; Stage 2 integrates a lightweight point-video adapter and a spatial-context encoder into the frozen 3D backbone to enhance temporal modeling and mitigate overfitting. Result: PointATA reaches 97.21% accuracy on 3D action recognition, a gain of +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation, matching or even outperforming full fine-tuning while substantially reducing the parameter cost. Conclusion: By decoupling alignment from adaptation, PointATA addresses the modality gap and overfitting in 3D-to-4D transfer, providing a parameter-efficient and strong paradigm for point cloud video understanding. Abstract: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

[128] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy

Matthew Sutton,Katrin Amunts,Timo Dickscheid,Christian Schiffer

Main category: cs.CV

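TL;DR: This paper proposes a label-mediated image-text coupling method that needs no paired image-text data: microscopy images are linked to area descriptions mined from the literature only through shared labels, enabling natural-language descriptions of human brain tissue sections. The method is validated on 57 brain areas.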

Details Motivation: In biomedical domains such as microscopic analysis of human brain tissue, high-quality paired image-text data are scarce, which limits the coupling of vision foundation models with language models; a weakly supervised coupling strategy that does not depend on fine-grained annotation is needed. Method: Use brain-area labels as a bridge to automatically mine descriptions of the corresponding areas from the literature as synthetic captions; then couple an existing cytoarchitectonic vision foundation model (CytoNet) to a large language model with an image-to-text training objective. Result: Across 57 brain areas, the method matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy; with the area label masked, its descriptions still allow the area to be recovered in an 8-way test with 68.6% accuracy. The generated descriptions are consistent with cytoarchitectonic attributes, and unseen areas are explicitly rejected. Conclusion: Weak, label-mediated pairing suffices to bridge biomedical vision foundation models and language models, offering a practical natural-language interface for domains that lack fine-grained image-text annotations. Abstract: Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy.
These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.

[129] Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

Paul Kielty,Timothy Hanley,Peter Corcoran

Main category: cs.CV

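TL;DR: This paper proposes Locally Adaptive Decay Surfaces (LADS), an event representation that adjusts the temporal decay according to local signal dynamics, resolving the trade-off that fixed-parameter representations face between still and moving scenes. Experiments show that LADS outperforms baseline representations on face detection and facial landmark localization, stays accurate even at 240 Hz, and supports lighter networks with real-time performance.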

Details Motivation: Conventional event representations (such as histograms or globally decayed time surfaces) apply the same temporal parameters across the whole image, which loses spatial structure in still regions or blurs edges in moving regions, making it hard to serve different scene dynamics. Method: Propose Locally Adaptive Decay Surfaces (LADS), which modulate the temporal decay at each pixel location according to the local event rate, Laplacian-of-Gaussian response, or high-frequency spectral energy, giving context-aware temporal integration. Result: On public data, LADS improves face detection and facial landmark accuracy at both 30 Hz and 240 Hz; at 240 Hz it sustains 0.966 mAP50 for face detection and 2.44% normalized mean error for landmarks, surpassing the accuracy reported in prior work at 30 Hz, while supporting lighter networks with real-time performance. Conclusion: LADS shows the importance of context-aware temporal integration for neuromorphic vision and points toward real-time, high-frequency human-computer interaction systems based on event cameras. Abstract: Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis.
Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
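To make the idea of a locally modulated decay concrete, here is a minimal sketch of the event-rate strategy only. It builds an exponential time surface whose time constant is shortened at pixels that received many events. The mapping from event count to time constant, the parameter names (`tau_min`, `tau_max`, `rate_scale`), and the use of a per-pixel count rather than a neighborhood rate are all assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def adaptive_decay_surface(events, shape, t_now,
                           tau_min=0.005, tau_max=0.1, rate_scale=10.0):
    """events: iterable of (t, x, y), t in seconds and t <= t_now.
    Returns a surface in [0, 1] that decays faster where more events occurred."""
    last_t = np.full(shape, -np.inf)  # time of the latest event per pixel
    counts = np.zeros(shape)          # number of events per pixel
    for t, x, y in events:
        last_t[y, x] = max(last_t[y, x], t)
        counts[y, x] += 1
    # Hypothetical mapping: tau shrinks from tau_max toward tau_min as counts grow.
    tau = tau_min + (tau_max - tau_min) / (1.0 + counts / rate_scale)
    # Pixels with no events have last_t = -inf and therefore a value of 0.
    return np.exp(-(t_now - last_t) / tau)

# One quiet pixel (a single event) and one busy pixel (50 events).
events = [(0.09, 0, 0)] + [(0.04 + 0.001 * k, 1, 1) for k in range(50)]
surface = adaptive_decay_surface(events, shape=(2, 2), t_now=0.1)
print(surface)
```

With these settings the quiet pixel keeps a higher value than the busy one even though their last events are almost equally old, which mirrors the stated goal of preserving detail in quiescent regions while reducing blur where activity is dense.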

[130] SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation

Fuhao Zhang,Lei Liu,Jialin Zhang,Ya-Nan Zhang,Nan Mu

Main category: cs.CV

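TL;DR: This paper proposes SpectralMamba-UNet, which models structural and textural information separately through frequency-domain disentanglement, capturing both global anatomical structure and boundary detail in medical image segmentation and improving performance and generalizability.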

Details Motivation: Existing state space models (e.g., Vision Mamba) weaken local spatial continuity and high-frequency representation because of their one-dimensional serialization, making it hard to model both global structure and fine boundaries in medical image segmentation. Method: Propose the frequency-disentangled framework SpectralMamba-UNet: 1) the SDM module uses the discrete cosine transform to decompose low- and high-frequency features, modeling global context from low frequencies with a frequency-domain Mamba and preserving boundary detail from high frequencies; 2) the SCR mechanism performs channel-wise frequency-aware reweighting; 3) the SGF module performs adaptive multi-scale fusion in the decoder. Result: Consistent improvements on five public medical image segmentation benchmarks covering diverse modalities and segmentation targets, validating effectiveness and generalizability. Conclusion: Frequency-domain disentanglement is an effective way to improve long-range dependency modeling and boundary-detail preservation in medical image segmentation; SpectralMamba-UNet offers a new approach to jointly modeling structure and texture. Abstract: Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.

[131] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan,Songhe Feng,Jiaxin Wang,Xin Su,Yi Jin

Main category: cs.CV

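TL;DR: This paper proposes WARM-CAT, a new method for compositional zero-shot learning (CZSL) that uses unsupervised data at test time to adaptively update multimodal prototypes, and introduces a dynamic priority queue and cross-modal alignment to mitigate distribution shift, reaching state-of-the-art performance on multiple benchmarks.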

Details Motivation: Existing CZSL methods suffer performance degradation at test time because unseen attribute-object compositions shift the distribution of the label space. Method: Propose a test-time adaptive prototype update mechanism: accumulate unsupervised knowledge in both textual and visual modalities; design an adaptive update weight that controls the degree of prototype adjustment; introduce a dynamic priority queue that stores high-confidence images to build visual prototypes, with a warm-start strategy to initialize the queue; align textual and visual prototypes through multimodal collaborative representation learning. Result: State-of-the-art performance on four benchmark datasets (including the newly proposed C-Fashion and the refined MIT-States) under both closed-world and open-world settings. Conclusion: The method mitigates distribution shift in CZSL and improves generalization; the new C-Fashion benchmark and the cleaned MIT-States data make evaluation more reliable; code and datasets are publicly available. Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings.
The source code and datasets are available at https://github.com/xud-yan/WARM-CAT.

[132] FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time

David Dirnfeld,Fabien Delattre,Pedro Miraldo,Erik Learned-Miller

Main category: cs.CV

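TL;DR: This paper proposes a generalization of the Hough transform to the unit sphere for robustly estimating camera heading from monocular video. The unit sphere is discretized with a Fibonacci lattice, and great circles generated from correspondences cast votes, improving the balance between accuracy and efficiency under noise and outliers.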

Details Motivation: Existing methods lose accuracy or become computationally expensive under high noise and many outliers, so heading estimation under known rotation (e.g., from an IMU or an optimization algorithm) is not robust enough. Method: Generalize the Hough transform to the unit sphere S²; generate, for each feature correspondence between two frames, a great circle of compatible directions; discretize the sphere with a Fibonacci lattice whose points serve as voting bin centers; let each great circle vote for a range of directions, which increases robustness to noise and dynamic objects. Result: The method lies on the accuracy-efficiency Pareto frontier on three datasets; in SLAM experiments, correcting the heading during camera pose initialization reduces RMSE. Conclusion: The spherical Hough approach offers a more robust, efficient, and practical way to estimate heading in monocular vision, especially in real scenes with noise and dynamic interference. Abstract: Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera's heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere (S²) to estimate the camera's heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
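The voting scheme in this abstract can be sketched in a few lines. With rotation already compensated, a correspondence of unit bearing vectors (x1, x2) constrains the heading t to the great circle with normal x1 × x2 (the epipolar constraint for pure translation). Each Fibonacci lattice point lying within a small band of that circle receives a vote. This is an illustrative sketch and not the authors' code: the band width, the lattice size, the brute-force loop, and all function names are assumptions.

```python
import math

def fibonacci_sphere(n):
    """Return n roughly uniform unit vectors on the sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((r * math.cos(golden_angle * i), r * math.sin(golden_angle * i), z))
    return pts

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def vote_heading(pairs, bins, band=0.05):
    """Each correspondence votes for every bin within `band` of its great circle."""
    votes = [0] * len(bins)
    for x1, x2 in pairs:
        normal = normalize(cross(x1, x2))  # the heading is orthogonal to this normal
        for k, p in enumerate(bins):
            if abs(dot(normal, p)) < band:
                votes[k] += 1
    best = max(range(len(bins)), key=votes.__getitem__)
    return bins[best]

# Synthetic, noise-free scene: the camera translates by 0.5 * true_t without rotating.
true_t = normalize((0.2, -0.1, 1.0))
pairs = []
for gx in range(-2, 3):
    for gy in range(-2, 3):
        point = (float(gx), float(gy), 4.0)
        moved = tuple(a - 0.5 * b for a, b in zip(point, true_t))
        pairs.append((normalize(point), normalize(moved)))

estimate = vote_heading(pairs, fibonacci_sphere(5000))
print(abs(dot(estimate, true_t)))
```

Every great circle passes through both t and -t, so this sketch recovers the heading only up to sign, and the comparison above uses the absolute dot product. Resolving the sign needs extra information such as a positive-depth check, which is outside this sketch.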

[133] Devling into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang,Zhijin Ge,Bohan Liu,Zheng Fang,Fengfan Zhou,Ruixuan Zhang,Shaokang Wang,Yuyang Luo

Main category: cs.CV

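TL;DR: This paper proposes a comprehensive framework for standardizing the evaluation of transfer-based adversarial attacks, organizes existing methods into six categories, and identifies common strategies for improving transferability as well as issues that lead to unfair comparisons.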

Details Motivation: There is no standardized framework or set of criteria for evaluating transfer-based attacks, so assessments of existing methods may be biased. Method: Review hundreds of related works, organize transfer-based attacks into six categories, and propose a comprehensive evaluation framework to serve as a benchmark; also summarize strategies that improve transferability and common pitfalls. Result: A standardized framework for evaluating transfer-based attacks, a taxonomy of six attack categories, and an account of general strategies for improving transferability and of potential biases in evaluation. Conclusion: A unified evaluation framework is essential for fair comparison of transfer-based attacks and supports more rigorous, reproducible research in this area. Abstract: Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.

[134] TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi,José Oramas

Main category: cs.CV

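TL;DR: TriLite is a single-stage weakly supervised object localization (WSOL) framework that uses a frozen DINOv2-pretrained ViT and adds only a small number of trainable parameters (fewer than 800K). Its TriHead module disentangles foreground, background, and ambiguous region features, improving object coverage and suppressing spurious responses; it reaches state-of-the-art results on multiple datasets while being more efficient and easier to train.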

Details Motivation: Existing WSOL methods rely on multi-stage pipelines or full fine-tuning of large backbones, which makes training costly, and partial object coverage remains a widespread problem. Method: Propose TriLite, a single-stage framework with a frozen DINOv2 ViT backbone and a lightweight TriHead module that decomposes patch features into foreground, background, and ambiguous regions and decouples the classification and localization objectives, without expensive end-to-end training. Result: New state-of-the-art performance on CUB-200-2011, ImageNet-1K, and OpenImages with fewer than 800K trainable parameters and simpler, more efficient training. Conclusion: TriLite shows that a frozen self-supervised ViT with a lightweight head can deliver high-quality weakly supervised localization at low cost. Abstract: Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with self-supervised DINOv2 pre-training, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.

[135] From Calibration to Refinement: Seeking Certainty via Probabilistic Evidence Propagation for Noisy-Label Person Re-Identification

Xin Yuan,Zhiyong Zhang,Xin Xu,Zheng Wang,Chia-Wen Lin

Main category: cs.CV

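TL;DR: This paper proposes CARE, a two-stage calibration-to-refinement framework that addresses learning from noisy labels and sparse samples in person re-identification and improves model robustness.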

Details Motivation: Existing noise-robust person Re-ID methods rely on softmax outputs, which leads to over-confident predictions and to discarding hard positives, so they cope poorly with the noisy labels and sparse samples found in real-world settings. Method: The CAlibration-to-REfinement (CARE) framework has two stages. The calibration stage uses probabilistic evidence calibration (PEC) to break softmax translation invariance and reduce overconfidence. The refinement stage uses evidence propagation refinement (EPR), which comprises the composite angular margin (CAM) metric and certainty-oriented sphere weighting (COSW), to separate clean from noisy samples more accurately and to weight samples dynamically. Result: On Market1501, DukeMTMC-ReID, and CUHK03, under both random and patterned noise, CARE achieves competitive performance. Conclusion: By combining a probabilistic evidence mechanism with hyperspherical geometric modeling, CARE improves the robustness and discriminability of person Re-ID under label noise and offers a new direction for weakly supervised Re-ID. Abstract: With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples. 
Specifically, the EPR contains two steps: Firstly, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; Secondly, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show that CARE achieves competitive performance.
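
The softmax translation invariance mentioned in the abstract above can be checked numerically. The sketch below is only an illustration of that property, not the CARE method: the score values are hypothetical, and the helper name `softmax` is our own.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity scores of one sample to three identity prototypes.
strong_evidence = [9.0, 8.0, 7.0]
# Same gaps between classes, but every score is much lower.
weak_evidence = [z - 8.5 for z in strong_evidence]

p_strong = softmax(strong_evidence)
p_weak = softmax(weak_evidence)

# Softmax only sees score differences, so both inputs yield the same
# distribution even though the second one carries far weaker raw scores.
max_gap = max(abs(a - b) for a, b in zip(p_strong, p_weak))
print(p_strong, p_weak, max_gap)
```

Because the absolute score level is discarded, a confident-looking probability says nothing about how much raw evidence supports it, which is the kind of overconfidence the calibration stage is described as targeting.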

[136] No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

Tao Liu,Gang Wan,Kan Ren,Shibo Wen

Main category: cs.CV

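TL;DR: This paper proposes a new unsupervised framework for online video stabilization. It combines the classical three-stage pipeline with a multithreaded buffering mechanism to address the limited data, poor controllability, and low hardware efficiency of end-to-end learning, and it introduces UAV-Test, a new multimodal dataset aimed at UAV nighttime remote sensing. Experiments show that it outperforms existing online methods and approaches the performance of offline methods.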

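Details Motivation: Deep-learning-based video stabilization depends on paired data, offers poor controllability, and runs inefficiently on constrained hardware; existing benchmarks are also limited to handheld, forward-view, visible-light videos and do not cover new scenarios such as UAV nighttime remote sensing. Method: An unsupervised online stabilization framework that reuses the classical three-stage pipeline (motion estimation, smoothing, image resampling) and adds a multithreaded buffering mechanism; the authors also build UAV-Test, a new multimodal UAV aerial video dataset. Result: The method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, with performance comparable to offline methods. Conclusion: The unsupervised framework balances performance, controllability, and hardware friendliness, and UAV-Test extends video stabilization to more complex remote sensing scenarios. Abstract: We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.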
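
To make the "online, no look-ahead" constraint concrete, here is a minimal sketch of the smoothing stage of a classical stabilization pipeline. It is an assumption-laden toy, not the paper's algorithm: camera motion is reduced to a 1-D shift, the smoother is a plain exponential moving average, and `stabilize_online` and `alpha` are hypothetical names.

```python
def stabilize_online(frame_shifts, alpha=0.9):
    """Causal path smoothing for 1-D camera motion.

    frame_shifts[t] is the estimated shift between frame t-1 and frame t.
    Returns the per-frame correction to apply when resampling each frame.
    """
    corrections = []
    raw_path = 0.0
    smooth_path = None
    for shift in frame_shifts:
        # Stage 1 output: accumulate estimated motion into a camera path.
        raw_path += shift
        if smooth_path is None:
            smooth_path = raw_path
        else:
            # Stage 2: exponential moving average, which uses past frames only.
            smooth_path = alpha * smooth_path + (1 - alpha) * raw_path
        # Stage 3 would warp the frame by this offset.
        corrections.append(smooth_path - raw_path)
    return corrections

jittery = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
corr = stabilize_online(jittery)
print(corr)
```

Each correction depends only on frames already seen, so the output for frame t is available as soon as frame t arrives; an offline smoother could instead use future frames to fit a better path.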

[137] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Guofeng Mei,Wei Lin,Luigi Riz,Yujiao Wu,Yiming Wang,Fabio Poiesi

Main category: cs.CV

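TL;DR: This paper proposes Fase3D, the first efficient encoder-free, Fourier-based 3D scene LMM. It uses structured superpoints, space-filling curve serialization combined with the Fast Fourier Transform (FFT), and Fourier-augmented LoRA adapters to retain performance while greatly reducing computation and parameter count.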

Details Motivation: Existing 3D LMMs depend on heavy pre-trained visual encoders, while 2D LMMs have begun to drop encoders for efficiency; extending this paradigm to 3D is hard because point clouds are unordered and large, so an efficient, permutation-invariant, encoder-free way to tokenize 3D data is needed. Method: Fase3D 1) represents large scenes compactly with structured superpoints; 2) applies space-filling curve serialization followed by an FFT to approximate self-attention, enabling efficient global context modeling and graph-based token merging; 3) adds Fourier-augmented LoRA adapters that inject global frequency-aware interactions into the LLM at low cost. Result: Fase3D matches encoder-based 3D LMMs in performance while significantly reducing computation and parameters, supporting the feasibility and efficiency of the encoder-free paradigm for 3D multimodal modeling. Conclusion: Fase3D addresses the unordered-input and scalability problems of encoder-free 3D LMMs and offers a lightweight, efficient approach to 3D understanding. Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
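
Serializing an unordered point cloud along a space-filling curve can be sketched as follows. The abstract does not say which curve Fase3D uses; a Z-order (Morton) curve is assumed here purely for illustration, and `morton3d` and `serialize` are hypothetical helper names.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the bits of three grid indices into one Z-order code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize(points, bits=10):
    """Return point indices ordered along the curve; points lie in [0, 1)^3."""
    scale = 1 << bits

    def key(i):
        x, y, z = points[i]
        return morton3d(int(x * scale), int(y * scale), int(z * scale), bits)

    return sorted(range(len(points)), key=key)

pts = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.6, 0.1)]
order = serialize(pts)
print(order)  # [1, 2, 3, 0]
```

The resulting sequence depends only on point coordinates, not on the order in which the points were stored, which is what lets a 1-D operator such as an FFT be applied to the sequence afterwards.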

[138] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Yichen Peng,Jyun-Ting Song,Siyeol Jung,Ruofan Liu,Haiyang Liu,Xuangeng Chu,Ruicong Liu,Erwin Wu,Hideki Koike,Kris Kitani

Main category: cs.CV

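TL;DR: DyaDiT is a multi-modal diffusion transformer that generates natural, socially appropriate gestures from dyadic conversational audio. It fuses audio from both speakers, uses a motion prior, and can optionally condition on the partner's gestures, improving the social appropriateness of the generated motion and user preference for it.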

Details Motivation: Existing methods map a single audio stream to a single speaker's motion and ignore the social context and the mutual dynamics between the two people in a conversation, so they struggle to produce natural, socially engaging gestures. Method: DyaDiT is a multi-modal diffusion transformer that takes dyadic audio, optionally with social-context tokens, and fuses information from both speakers; it uses a motion dictionary to encode motion priors, can condition on the conversational partner's gestures to produce more responsive motion, and is trained on the Seamless Interaction Dataset. Result: It surpasses existing methods on standard motion generation metrics, and quantitative user studies show that users strongly prefer its output, indicating robustness and socially favorable motion. Conclusion: DyaDiT models the dynamics of two-person interaction in conversation and offers a practical approach to generating natural, context-appropriate gestures for digital humans. Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

[139] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su,Jincheng Gao,Hangyu Guo,Zhenhua Liu,Lueyang Zhang,Xinyu Geng,Shijue Huang,Peng Xia,Guanyu Jiang,Cheng Wang,Yue Zhang,Yi R. Fung,Junxian He

Main category: cs.CV

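TL;DR: This paper introduces AgentVista, a benchmark for evaluating generalist multimodal agents on realistic, long-horizon, cross-modal tool-use tasks. It shows that current state-of-the-art models perform poorly on such tasks: even the best, Gemini-3-Pro, reaches only 27.3% overall accuracy.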

Details Motivation: Existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills and do not capture the realism, visual subtlety, and long-horizon tool coordination that practical agents need. Method: The AgentVista benchmark spans 25 sub-domains across 7 categories and pairs detail-rich, realistic visual scenarios with natural hybrid tool use (web search, image search, page navigation, and code execution for image processing and general programming). Result: A comprehensive evaluation of state-of-the-art models reveals significant gaps in long-horizon multimodal tool use; Gemini-3-Pro performs best but reaches only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. Conclusion: AgentVista provides an evaluation platform and a challenge target for developing more capable and reliable generalist multimodal agents. Abstract: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.

[140] Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration

Xiaole Tang,Xiaoyi He,Jiayi Xu,Xiang Gu,Jian Sun

Main category: cs.CV

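TL;DR: This paper proposes BaryIR, a framework that aligns multisource degraded feature distributions through a Wasserstein barycenter and decouples degradation-agnostic shared content from degradation-specific residual information, improving generalization to unseen degradation types.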

Details Motivation: Existing all-in-one image restoration methods generalize poorly to out-of-distribution degradations. The authors argue that multisource degraded feature distributions arise as different shifts from a common degradation-agnostic distribution, and that recovering this shared distribution is the key to generalizing across degradations. Method: BaryIR 1) aligns multisource degraded features in a Wasserstein barycenter (WB) space that models the degradation-agnostic distribution; 2) introduces residual subspaces, orthogonal to the WB space, whose embeddings are mutually contrasted to preserve degradation-specific information; 3) explicitly decouples the WB space (degradation-agnostic content) from the residual subspaces (degradation-specific knowledge). Result: BaryIR is competitive with state-of-the-art all-in-one methods on multiple benchmarks and generalizes well to unseen degradations (new types or levels), to training on limited degradation types, and to real-world data with mixed degradations. Conclusion: By orthogonally decoupling degradation-invariant and degradation-specific representations, BaryIR mitigates overfitting to the training degradations and provides a more robust representation learning approach for general image restoration. Abstract: Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. 
Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
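
The idea of a barycenter that sits "between" several shifted distributions can be illustrated in one dimension, where the 2-Wasserstein barycenter of equally weighted empirical distributions with the same number of samples is obtained by averaging sorted samples. This is a textbook special case chosen for illustration, not the continuous barycenter learning used in BaryIR; the sample values and function names are hypothetical.

```python
def barycenter_1d(sample_sets):
    """W2 barycenter of equally weighted 1-D empirical distributions.

    All sample sets must have the same size; the barycenter is the
    element-wise mean of the sorted samples (quantile averaging).
    """
    sorted_sets = [sorted(s) for s in sample_sets]
    n = len(sorted_sets[0])
    assert all(len(s) == n for s in sorted_sets)
    k = len(sorted_sets)
    return [sum(s[i] for s in sorted_sets) / k for i in range(n)]

def w2_squared_1d(a, b):
    """Squared 2-Wasserstein distance between two equal-size 1-D sample sets."""
    a, b = sorted(a), sorted(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two hypothetical "degraded" feature sets: a shared shape shifted by +2 and -2.
shifted_up = [5.0, 2.0, 4.0, 3.0]
shifted_down = [1.0, -2.0, 0.0, -1.0]

center = barycenter_1d([shifted_up, shifted_down])
print(center)  # [0.0, 1.0, 2.0, 3.0]
print(w2_squared_1d(center, shifted_up), w2_squared_1d(center, shifted_down))
```

In this toy case the barycenter recovers the unshifted shape, equally distant from both degraded sets, which mirrors the intuition of a shared degradation-agnostic distribution.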

[141] Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz,Rohit Mohan,Thomas Nürnberg,Yakov Miron,Daniele Cattaneo,Abhinav Valada

Main category: cs.CV

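TL;DR: This paper proposes Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS), which combines camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction and uses latent Gaussian splatting to aggregate multi-view information efficiently into a 3D voxel grid, enabling 4D spatiotemporal understanding of dynamic environments.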

Details Motivation: Existing methods provide either coarse geometric tracking (e.g., bounding boxes) or detailed 3D structure (e.g., voxel-based occupancy) without explicit temporal association, which falls short of the 4D spatiotemporal perception robots need to operate safely and reliably in dynamic environments. Method: LaGS first fuses multi-view observations into 3D Gaussians that serve as a sparse, point-centric latent representation; it then splats the aggregated features onto a 3D voxel grid using a novel latent Gaussian splatting approach; finally, a mask-based segmentation head decodes the 4D panoptic occupancy output. Result: State-of-the-art performance for 4D panoptic occupancy tracking on the Occ3D nuScenes and Waymo datasets. Conclusion: LaGS provides unified, efficient, and temporally consistent 4D panoptic occupancy modeling of dynamic scenes, advancing spatiotemporal perception for robotics. Abstract: Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.

[142] Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Bin Zeng,Johannes Künzel,Anna Hilsmann,Peter Eisert

Main category: cs.CV

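TL;DR: This paper proposes a real-time crowd counting method for railway platforms that uses a single moving camera mounted on an arriving train. It combines physics-constrained 3D motion modeling (Phys-3D), a modified DeepSORT tracking framework, and a virtual counting band to improve counting robustness under motion and occlusion, reaching a counting error of 2.97% on the authors' MOT-RPCH dataset.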

Details Motivation: Railway platforms need accurate, real-time crowd counts for safety and capacity management, but most tracking-by-detection methods assume static cameras or ignore physical consistency in motion modeling, so they handle camera motion, dense occlusion, and perspective distortion during train arrivals poorly. Method: A physics-constrained tracking framework that 1) integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT; 2) introduces the Phys-3D Kalman model, which constrains 3D motion dynamics through pinhole geometry; 3) adds a virtual counting band with persistence to reduce counting instability caused by occlusion. Result: On the authors' MOT-RailwayPlatformCrowdHead (MOT-RPCH) dataset, counting error falls to 2.97%, demonstrating robustness under dynamic, occluded conditions. Conclusion: Incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios and supports train scheduling and platform safety management. Abstract: Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform while arriving. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
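
One plausible reading of a "virtual counting band with persistence" is sketched below. The abstract gives no details, so everything here is an assumption: a track is counted once after it has been observed inside a horizontal pixel band for `min_hits` frames (not necessarily consecutive), and counted IDs are remembered. The coordinates and the name `count_in_band` are hypothetical.

```python
def count_in_band(frames, band, min_hits=2):
    """Count distinct track IDs that persist inside a virtual band.

    frames: list of {track_id: x_center} dicts, one per video frame.
    band:   (x_lo, x_hi) pixel interval acting as the counting band.
    """
    x_lo, x_hi = band
    hits = {}        # track_id -> number of frames seen inside the band
    counted = set()  # IDs already counted, so nobody is counted twice
    for detections in frames:
        for track_id, x in detections.items():
            if x_lo <= x <= x_hi:
                hits[track_id] = hits.get(track_id, 0) + 1
                if hits[track_id] >= min_hits:
                    counted.add(track_id)
    return len(counted)

frames = [
    {1: 90, 2: 40},
    {1: 100, 2: 60},
    {2: 95},                  # track 1 is occluded in this frame
    {1: 105, 2: 102, 3: 101},
    {2: 108},
]
total = count_in_band(frames, band=(95, 110))
print(total)  # 2
```

Track 1 is still counted despite the one-frame occlusion because its hits accumulate across the gap, while track 3, seen in the band only once, is treated as a spurious detection.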

[143] Uni-Animator: Towards Unified Visual Colorization

Xinyuan Chen,Yao Xu,Shaowen Wang,Pengjie Song,Bowen Deng

Main category: cs.CV

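TL;DR: Uni-Animator is a Diffusion Transformer-based framework for unified image and video sketch colorization. It uses visual reference enhancement, physical detail reinforcement, and sketch-based dynamic RoPE encoding to address imprecise color transfer, loss of high-frequency detail, and temporal inconsistency.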

Details Motivation: Existing sketch colorization methods struggle to unify image and video tasks and suffer from imprecise color transfer, poor preservation of high-frequency physical details, and temporal incoherence in large-motion scenes. Method: The DiT-based Uni-Animator framework 1) uses instance patch embedding for visual reference enhancement; 2) uses physical features for physical detail reinforcement; 3) uses sketch-based dynamic RoPE encoding to model spatial-temporal dependencies. Result: Performance on image and video sketch colorization matches that of task-specific methods while providing unified cross-domain capability, high detail fidelity, and robust temporal consistency. Conclusion: Uni-Animator unifies image and video sketch colorization in one model and balances accuracy, detail, and temporal consistency, indicating the potential of Diffusion Transformer architectures for this kind of generation task. Abstract: We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

[144] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

Thomas Woergaard,Raghavendra Selvan

Main category: cs.CV

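TL;DR: This paper proposes FairQuant, a framework for fairness-aware mixed-precision quantization of medical image classification models under a bit budget. It combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode to balance accuracy and algorithmic fairness.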

Details Motivation: Existing quantization methods such as quantization-aware training and post-training quantization balance performance and efficiency but do not explicitly consider their effect on algorithmic fairness, which matters in high-stakes settings such as medical image classification. Method: FairQuant has three parts: 1) group-aware importance analysis; 2) mixed-precision allocation under an explicit bit budget; 3) a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. Result: On Fitzpatrick17k and ISIC2019 with ResNet18/50, DeiT-Tiny, and TinyViT, FairQuant configurations averaging about 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets. Conclusion: FairQuant balances accuracy and algorithmic fairness under a limited bit budget and offers an approach to fair model compression for medical AI deployment. Abstract: Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
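
The bookkeeping behind "mixed precision under a bit budget" can be sketched with generic symmetric uniform quantization. This is not FairQuant's allocation or BAQ procedure; the layer sizes, per-layer bit widths, and function names are hypothetical and serve only to show how an average bit width and quantization error are computed.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def average_bits(layer_sizes, layer_bits):
    """Parameter-weighted average bit width of a mixed-precision assignment."""
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total

weights = [0.5, -1.0, 0.26, 0.03]
err4 = mse(weights, quantize(weights, 4))
err8 = mse(weights, quantize(weights, 8))

# Hypothetical three-layer model: small sensitive layers keep more bits.
avg = average_bits([1000, 4000, 5000], [8, 6, 4])
print(err4, err8, avg)
```

A budgeted scheme then searches over per-layer bit widths so that `avg` stays under the budget while a chosen objective (accuracy, or in the fairness-aware case also worst-group performance) is preserved.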

[145] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Junhu Fu,Shuyu Liang,Wutong Li,Chen Ma,Peng Huang,Kehao Wang,Ke Chen,Shengli Lin,Pinghong Zhou,Zeju Li,Yuanyuan Wang,Yi Guo

Main category: cs.CV

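TL;DR: This paper proposes ColoDiff, a diffusion-based framework for colonoscopy video generation aimed at data scarcity. A TimeStream module models inter-frame temporal consistency, a Content-Aware module provides precise control over clinical attributes, and a non-Markovian sampling strategy speeds up generation. Experiments show strong results in both generation quality and downstream clinical tasks.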

Details Motivation: Colonoscopy video generation matters for diagnosing intestinal diseases but is challenged by irregular intestinal structures, diverse disease representations, and varied imaging modalities; high-quality generation also needs both temporal consistency and control over clinical attributes, which existing methods struggle to provide, especially when data is scarce. Method: ColoDiff 1) uses a TimeStream module that decouples temporal dependency from video sequences through cross-frame tokenization to strengthen dynamic modeling; 2) uses a Content-Aware module that combines noise-injected embeddings with learnable prototypes for fine-grained control over clinical attributes; 3) adopts a non-Markovian sampling strategy that cuts sampling steps by over 90% to support real-time generation. Result: Evaluated on three public datasets and one hospital database, the generated videos show smooth transitions and rich dynamics and are assessed on downstream tasks (disease diagnosis, modality discrimination, bowel preparation scoring, lesion segmentation), indicating that synthetic videos can complement real data. Conclusion: ColoDiff achieves controllable, dynamically consistent colonoscopy video generation and shows the potential of synthetic videos to mitigate clinical data scarcity and assist diagnosis and analysis. Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. 
ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.

[146] Motion-aware Event Suppression for Event Cameras

Roberto Pellerito,Nico Messikommer,Giovanni Cioffi,Marco Cannici,Davide Scaramuzza

Main category: cs.CV

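TL;DR: This paper proposes the first motion-aware event suppression framework, which filters events triggered by independently moving objects (IMOs) and ego-motion in real time. It jointly segments IMOs in the current event stream and predicts their future motion, enabling anticipatory suppression of dynamic events. The lightweight model runs at 173 Hz on consumer-grade GPUs with less than 1 GB of memory, improves segmentation accuracy on the EVIMO benchmark by 67% at a 53% higher inference rate, and in downstream use accelerates Vision Transformer inference by 83% and reduces the Absolute Trajectory Error of event-based visual odometry by 13%.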

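Details Motivation: Events triggered by independently moving objects (IMOs) and ego-motion add redundant, interfering data to event camera streams; suppressing them improves stream quality and downstream task performance. Method: A motion-aware event suppression framework that jointly performs IMO segmentation and future motion prediction to suppress dynamic events in real time and ahead of their occurrence, built on a lightweight network architecture. Result: On the EVIMO benchmark, segmentation accuracy improves by 67% at a 53% higher inference rate; Vision Transformer inference is accelerated by 83%; the Absolute Trajectory Error (ATE) of event-based visual odometry is reduced by 13%. Conclusion: The framework improves event stream quality and several downstream tasks while keeping real-time speed and low resource use, making it a practical option for event camera applications. Abstract: In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by IMOs and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.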

[147] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Wenjia Wang,Liang Pan,Huaijin Pi,Yuke Lou,Xuqian Ren,Yifan Wu,Zhouyingcheng Liao,Lei Yang,Rishabh Dabral,Christian Theobalt,Taku Komura

Main category: cs.CV

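TL;DR: This paper proposes EmbodMocap, a portable, low-cost pipeline that uses two iPhones to capture human motion and scenes jointly. It provides metric-scale, scene-consistent human-scene reconstruction without markers or static cameras and supports three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control.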

Details Motivation: Existing motion capture systems rely on costly studio setups and wearable devices, which makes it hard to collect scene-conditioned human motion data at scale in real environments. Method: The EmbodMocap pipeline captures RGB-D sequences with two moving iPhones and jointly calibrates them to reconstruct humans and scenes in a single metric world coordinate frame, mitigating the depth ambiguity of monocular capture. Result: Compared against optical capture ground truth, the dual-view setting markedly reduces depth ambiguity and achieves better alignment and reconstruction than a single iPhone or monocular models; the collected data supports three embodied AI tasks, with experiments validating its effectiveness. Conclusion: EmbodMocap offers an efficient, low-cost way to collect joint human-scene data in real environments for embodied AI research on perception, understanding, and action. Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.

[148] Through BrokenEyes: How Eye Disorders Impact Face Detection?

Prottay Kumar Adhikary

Main category: cs.CV

TL;DR: This paper introduces BrokenEyes, a computational framework to simulate five common eye disorders and analyze their impact on deep learning feature representations, revealing disorder-specific disruptions—especially in cataract and glaucoma—that mirror known neural processing deficits.

Details Motivation: Vision disorders affect millions and alter visual perception; understanding how they impact neural-like representations in AI models can bridge clinical knowledge and computational vision. Method: Developed the BrokenEyes system to simulate five eye disorders (AMD, cataract, glaucoma, refractive errors, diabetic retinopathy); trained deep learning models on human and non-human datasets under normal and disorder-simulated conditions; analyzed feature maps using activation energy and cosine similarity. Result: Cataract and glaucoma caused the most significant disruptions in feature maps, consistent with clinical neural processing challenges; quantitative metrics confirmed severity gradients across disorders. Conclusion: Degraded visual inputs from simulated eye disorders lead to systematic, measurable distortions in learned feature representations—supporting BrokenEyes as a tool for probing vision-AI alignment and disorder-specific neural computation. Abstract: Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders: Age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy and analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
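
The two evaluation metrics named above can be sketched on flattened feature maps. The abstract does not define "activation energy"; mean squared activation is assumed here, and the feature values are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened feature maps."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def activation_energy(fmap):
    """Mean squared activation (an assumed definition of 'energy')."""
    return sum(x * x for x in fmap) / len(fmap)

normal = [1.0, 2.0, 0.0, 3.0]
dimmed = [0.5, 1.0, 0.0, 1.5]   # same pattern at half strength
tunnel = [0.0, 2.0, 0.0, 0.0]   # only one region still responds

sim_dimmed = cosine_similarity(normal, dimmed)
sim_tunnel = cosine_similarity(normal, tunnel)
print(sim_dimmed, sim_tunnel)
print(activation_energy(normal), activation_energy(dimmed))
```

The two metrics are complementary: uniformly weakened responses keep a cosine similarity near 1 but lose energy, while a change in which regions respond lowers the cosine similarity even if some activations stay strong.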

[149] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Chenhe Du,Xuanyu Tian,Qing Wu,Muyu Liu,Jingyi Yu,Hongjiang Wei,Yuyao Zhang

Main category: cs.CV

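TL;DR: This paper proposes Dual-Coupled PnP Diffusion, which introduces a dual variable to obtain asymptotic convergence to the exact data manifold, and Spectral Homogenization (SH), a frequency-domain adaptation mechanism that turns structured residuals into pseudo-white noise compatible with the diffusion prior, resolving the bias-hallucination trade-off of conventional plug-and-play methods.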

Details Motivation: Prevailing plug-and-play (PnP) solvers, such as those based on HQS or proximal gradient, are memoryless operators that update estimates from instantaneous gradients only, so under heavy corruption the reconstruction fails to satisfy the physical measurements strictly and a steady-state bias remains. Method: Dual-Coupled PnP Diffusion restores the classical dual variable to provide integral feedback, and Spectral Homogenization (SH) modulates the dual residuals in the frequency domain so that they approximate additive white Gaussian noise (AWGN), matching the statistical assumption of the diffusion prior. Result: On CT and MRI reconstruction, the approach resolves the bias-hallucination trade-off and achieves state-of-the-art fidelity with significantly faster convergence. Conclusion: Combining Dual-Coupled PnP Diffusion with SH improves the PnP framework in both its theoretical guarantees and its statistical modeling, giving more accurate and robust solutions to imaging inverse problems. Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
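
The role of the dual variable as accumulated ("integral") feedback can be seen in a scalar toy problem solved with classical scaled-form ADMM. This is only a sketch of the textbook algorithm, not the paper's solver: a soft-threshold stands in for the diffusion denoiser, the penalty `rho` is fixed, and all names and values are assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.|; a stand-in for a learned denoiser."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def solve(y, lam=1.0, rho=1.0, iters=60, use_dual=True):
    """Minimize 0.5*(x - y)^2 + lam*|z| subject to x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)   # data-fidelity step
        z = soft_threshold(x + u, lam / rho)    # prior (denoising) step
        if use_dual:
            u += x - z                          # dual update accumulates the residual
    return x, z

x_admm, z_admm = solve(3.0, use_dual=True)
x_hqs, z_hqs = solve(3.0, use_dual=False)
print(x_admm, z_admm)   # both approach 2.0
print(x_hqs, z_hqs)     # approach 2.0 and 1.0: a gap that never closes
```

With the dual update, the data-side and prior-side variables agree at the true minimizer 2.0. With the dual variable frozen at zero (a memoryless loop at fixed penalty), the two variables settle one unit apart, a small-scale analogue of a residual that does not vanish.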

[150] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

Alaa El Ichi,Khalide Jbilou

Main category: cs.CV

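TL;DR: This paper proposes Multidimensional Task Learning (MTL), a framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors, removing the dimensional restrictions that matrix-based modeling imposes on vision tasks.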

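Details Motivation: Current computer vision task formulations are constrained by matrix-based thinking and require destructive flattening of high-dimensional data, which loses structural information; a unified mathematical framework that preserves and manipulates multidimensional structure is needed. Method: The authors build tensor MLPs based on the generalized Einstein product (GE-MLPs), with tensor-valued parameters and a mechanism for configuring dimensions, formulate classification, segmentation, and detection within a formally defined task space, and analyze expressive power through mathematical derivation and proof. Result: Classification, segmentation, and detection are shown to be special cases of MTL under different dimensional configurations; the MTL task space is strictly larger than what matrix-based formulations can natively express; new task configurations such as spatiotemporal or cross-modal prediction become possible without destructive flattening. Conclusion: MTL gives computer vision tasks a unified mathematical foundation in tensor algebra, making tasks easier to understand, compare, and design in a principled, extensible way. Abstract: This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.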
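
The Einstein product that the abstract refers to contracts a block of shared modes between two tensors. A minimal sketch for one case, a 4th-order weight tensor acting on a matrix-shaped input, is given below; the function names, shapes, and the bias term are illustrative assumptions, not the GE-MLP definition from the paper.

```python
def einstein_product_2(A, X):
    """Compute Y[i][j] = sum over k, l of A[i][j][k][l] * X[k][l]."""
    return [
        [
            sum(A[i][j][k][l] * X[k][l]
                for k in range(len(X)) for l in range(len(X[0])))
            for j in range(len(A[0]))
        ]
        for i in range(len(A))
    ]

def tensor_layer(A, X, B):
    """One affine layer with tensor-valued weight A and matrix-valued bias B."""
    Y = einstein_product_2(A, X)
    return [[Y[i][j] + B[i][j] for j in range(len(B[0]))] for i in range(len(B))]

# Identity weight tensor: A[i][j][k][l] = 1 when (i, j) == (k, l), else 0.
n = 2
A = [[[[1 if (i == k and j == l) else 0 for l in range(n)]
       for k in range(n)] for j in range(n)] for i in range(n)]
X = [[1, 2], [3, 4]]
B = [[10, 10], [10, 10]]

Y = einstein_product_2(A, X)
out = tensor_layer(A, X, B)
print(Y)    # [[1, 2], [3, 4]]
print(out)  # [[11, 12], [13, 14]]
```

The input keeps its 2-D layout throughout: nothing is flattened into a vector, and the choice of which modes to contract (here k and l) determines the shape of the output.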

[151] UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

Mohammad Mahdavian,Gordon Tan,Binbin Xu,Yuan Ren,Dongfeng Bai,Bingbing Liu

Main category: cs.CV

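TL;DR: UniScale is a unified, scale-aware multi-view 3D reconstruction framework designed for robotic applications. It jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of the scene, supports flexible injection of geometric priors, and does not require training from scratch.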

Details Motivation: In vision-based robotic navigation, accurately extracting environmental structure from raw image sequences is critical for downstream tasks, yet existing methods often struggle to combine scale consistency, geometric-prior fusion, and computational efficiency. Method: Proposes a modular, semantically informed unified network that combines global contextual reasoning with camera-aware feature representations to jointly predict camera intrinsics, extrinsics, scale-invariant depth and point maps, and the metric scale of the scene; known intrinsics or poses can be supplied as conditioning inputs to improve performance. Result: Evaluation on multiple benchmarks shows strong generalization and consistent performance across environments; the model does not need to be trained from scratch and reuses world priors from pre-existing models. Conclusion: UniScale achieves robust, metric-aware multi-view 3D reconstruction in a single model and suits resource-constrained robotic systems, combining flexibility, efficiency, and practicality. Abstract: We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.

[152] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Yizhi Li,Xiaohan Chen,Miao Jiang,Wentao Tang,Gaoang Wang

Main category: cs.CV

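TL;DR: MovieTeller proposes a training-free, tool-augmented progressive abstraction framework that uses off-the-shelf models and a face recognition tool to improve the factual accuracy and narrative coherence of long-video synopses.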

Details Motivation: Existing vision-language models suffer from inconsistent character IDs and fractured narratives when automatically summarizing long videos such as movies and TV series. Method: MovieTeller adopts a training-free, tool-augmented generation paradigm: it first calls an off-the-shelf face recognition model to obtain character identities and bounding boxes as factual grounding, then injects this information into the prompt to steer the VLM; a progressive abstraction pipeline processes the long video in stages to ease context-length limits. Result: Experiments show that the method clearly outperforms end-to-end baselines in factual accuracy, character consistency, and overall narrative coherence. Conclusion: Without fine-tuning, external tools and staged abstraction can effectively improve long-video synopsis quality, offering a new direction for VLMs on complex temporal understanding tasks. Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings: precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.

[153] Large Multimodal Models as General In-Context Classifiers

Marco Garosi,Matteo Farina,Alessandro Conti,Massimiliano Mancini,Elisa Ricci

Main category: cs.CV

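TL;DR: This paper examines the potential of large multimodal models (LMMs) for classification, showing that their in-context learning ability lets them match or even surpass contrastive vision-language models such as CLIP in closed-world and open-world classification, and proposes CIRCLE, a training-free method that improves LMM robustness in the open-world setting.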

Details Motivation: Prior studies hold that CLIP-like contrastive VLMs are better suited to classification while LMMs are better suited to complex tasks; this work aims to show that the in-context learning ability of LMMs has been overlooked, along with their potential for classification, especially in the open-world setting. Method: Benchmarks state-of-the-art LMMs on multiple datasets for closed-world and open-world classification, and proposes the training-free CIRCLE method, in which the LMM itself assigns pseudo-labels to in-context examples and refines them iteratively. Result: With a few in-context examples, LMMs match or surpass CLIP with cache-based adapters; in the open-world setting, CIRCLE markedly improves LMM performance and establishes a strong baseline that surpasses VLM counterparts. Conclusion: Thanks to their in-context learning and generative abilities, LMMs can serve as unified, flexible classifiers, and CIRCLE gives them an effective, training-free solution for open-world classification. Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.

[154] Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

Main category: cs.CV

TL;DR: This paper shows that using multiple camera views to obtain more accurate 3D skeletons significantly improves skeleton-based action recognition performance, suggesting input data quality is a key bottleneck and advocating multi-view setups as standard.

Details Motivation: Despite extensive research on improving action recognition algorithms, the impact of input skeleton data quality, especially from single-view sources, has been largely overlooked. Method: The authors use multi-camera triangulation to generate higher-accuracy 3D skeletons and evaluate their effect on state-of-the-art skeleton-based action recognition models. Result: Multi-view triangulated skeletons lead to significant performance gains across state-of-the-art models, indicating current performance limits are partly due to noisy or inaccurate input skeletons. Conclusion: Input skeleton quality is a critical limiting factor; multi-view setups offer a favorable cost-benefit ratio and should become the standard in future skeleton-based action recognition research. Abstract: Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.

[155] Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu,Bowen Zhou,Qi Wang,Shuwen Feng,Jingyu Xiao

Main category: cs.CV

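TL;DR: This paper proposes GUIPruner, a training-free compression framework for pure-vision GUI agents. It uses Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP) to address redundancy in historical trajectories and damage to spatial structure, substantially reducing compute while retaining high accuracy.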

Details Motivation: Pure-vision GUI agents are inefficient because of spatiotemporal redundancy in high-resolution screenshots and historical trajectories; existing compression methods are temporally mismatched with the agent's fading-memory pattern and spatially destructive, which misaligns coordinates. Method: Proposes the GUIPruner framework with two core components: 1) Temporal-Adaptive Resolution (TAR), which dynamically rescales historical frames with a decay-based strategy to remove temporal redundancy; 2) Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while protecting the integrity of the global layout. Result: Achieves state-of-the-art performance on multiple benchmarks; on Qwen2-VL-2B it delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance. Conclusion: GUIPruner is an efficient, training-free, structure-preserving visual compression scheme for GUIs that supports real-time, high-precision, low-resource GUI navigation. Abstract: Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
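
Decay-based resizing of history frames can be sketched as follows. The abstract does not specify the schedule, so the geometric decay, the `min_scale` floor, and the function name `tar_sizes` are assumptions made purely for illustration.

```python
def tar_sizes(width, height, num_history, decay=0.5, min_scale=0.25):
    """Return (w, h) per frame, oldest first; the current frame keeps full resolution."""
    sizes = []
    for age in range(num_history, -1, -1):  # age 0 is the current screenshot
        # Assumed schedule: geometric decay with age, clamped at min_scale.
        scale = max(min_scale, decay ** age)
        sizes.append((max(1, round(width * scale)), max(1, round(height * scale))))
    return sizes


sizes = tar_sizes(1920, 1080, num_history=3)
print(sizes)
```

Older screenshots are shrunk more aggressively, so they contribute fewer visual tokens than the frame the agent is acting on.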

[156] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun,Feng Xue,Teng Long,Chang Liu,Jian-Fang Hu,Wei-Shi Zheng,Nicu Sebe

Main category: cs.CV

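TL;DR: This paper proposes RaWMPC, an end-to-end autonomous driving framework that needs no expert action supervision and uses a risk-aware world model with predictive control to improve generalization and safety in both in-distribution and out-of-distribution scenarios.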

Details Motivation: Existing end-to-end autonomous driving methods based on imitation learning depend on expert demonstrations, generalize poorly to rare or long-tail scenarios, and tend to make unsafe decisions there; this work explores a new paradigm that makes reliable decisions without any expert action supervision. Method: Proposes Risk-aware World Model Predictive Control (RaWMPC): 1) builds a risk-aware world model that learns to predict high-risk consequences by being actively exposed to hazardous behaviors; 2) designs a self-evaluation distillation method that distills the world model's risk-avoidance capability into an action proposal network trained without expert demonstrations; 3) at test time, selects low-risk candidate actions through explicit risk evaluation. Result: Outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios while providing better decision interpretability. Conclusion: RaWMPC demonstrates that end-to-end autonomous driving without expert action supervision is feasible; robust world-model prediction and explicit risk modeling effectively ease the generalization bottleneck of imitation learning. Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
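
The test-time loop of predicting the consequences of candidate actions and picking the lowest-risk one can be sketched on a one-dimensional toy. Everything below is a stand-in: `rollout` replaces the learned world model with constant-acceleration motion, and `risk` is a hypothetical gap-based score, not the paper's risk evaluation.

```python
def rollout(pos, vel, accel, horizon=5, dt=1.0):
    """Stand-in for a learned world model: constant-acceleration 1-D motion."""
    traj = []
    for _ in range(horizon):
        vel += accel * dt
        pos += vel * dt
        traj.append(pos)
    return traj


def risk(traj, obstacle_pos, safe_gap=2.0):
    """Hypothetical risk: how far the smallest forward gap falls below safe_gap."""
    min_gap = min(obstacle_pos - p for p in traj)  # negative once the obstacle is passed
    return max(0.0, safe_gap - min_gap)


def select_action(pos, vel, candidates, obstacle_pos):
    """Score every candidate with the world model and keep the lowest-risk one."""
    return min(candidates, key=lambda a: risk(rollout(pos, vel, a), obstacle_pos))


best = select_action(pos=0.0, vel=2.0, candidates=[-1.0, 0.0, 1.0], obstacle_pos=10.0)
print(best)
```

Here only braking keeps the predicted gap above the safety margin, so it is the action selected.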

[157] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Jasmine Bayrooti,Weiwei Kong,Natalia Ponomareva,Carlos Esteves,Ameesh Makadia,Amanda Prorok

Main category: cs.CV

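TL;DR: This paper proposes a differentially private (DP) image generation framework based on the wavelet spectrum: DP protection is applied to the low-frequency components and high-frequency detail is restored with a public super-resolution model, which markedly improves image quality and the privacy-utility trade-off of DP image generation.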

Details Motivation: Standard DP fine-tuning (e.g., DP-SGD) adds noise uniformly to all parameters and severely degrades high-frequency texture; the truly sensitive part of an image is its low-frequency structure (e.g., facial contours), while high-frequency detail is usually generic and non-sensitive, so a finer, spectrum-level privacy mechanism is needed. Method: Proposes a two-stage spectral DP framework: (1) DP fine-tune an autoregressive spectral image tokenizer on the low-frequency wavelet coefficients of the sensitive images; (2) upsample the coarse intermediate images to high resolution with a publicly pretrained super-resolution model; overall privacy is guaranteed by the post-processing property of DP. Result: On MS-COCO and MM-CelebA-HQ, the method generates images with higher quality, better style capture, and a better privacy-utility trade-off than other leading DP image generation methods. Conclusion: Low-frequency components carry most of the privacy risk; focusing the DP budget on the low-frequency part of the wavelet domain and generating high-frequency detail separately is a more reasonable and more efficient paradigm for DP image generation. Abstract: Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
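
The coarse-to-fine split rests on separating an image into a low-frequency band and high-frequency detail bands. As a minimal sketch, assuming a one-level orthonormal Haar wavelet (the abstract does not name the wavelet family), the split of a small grayscale image looks like this; `haar_split` is an illustrative helper, not part of the paper's code.

```python
def haar_split(img):
    """One-level orthonormal 2-D Haar split of an even-sized grayscale image (list of rows)."""
    low, detail_h, detail_v, detail_d = [], [], [], []
    for r in range(0, len(img), 2):
        rows = ([], [], [], [])
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)  # low-frequency (coarse structure)
            rows[1].append((a - b + d - e) / 2)  # difference across columns
            rows[2].append((a + b - d - e) / 2)  # difference across rows
            rows[3].append((a - b - d + e) / 2)  # diagonal difference
        low.append(rows[0])
        detail_h.append(rows[1])
        detail_v.append(rows[2])
        detail_d.append(rows[3])
    return low, (detail_h, detail_v, detail_d)


img = [[1, 1, 3, 3],
       [1, 1, 3, 3]]
low, details = haar_split(img)
print(low, details)
```

Under the framework's hypothesis, only a band like `low` would be exposed to the DP fine-tuned model, while the detail bands would be regenerated by a public model as post-processing.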

[158] LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

Zhengyang Wei,Renzhi Jing,Yiyi He,Jenny Suckale

Main category: cs.CV

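TL;DR: This paper proposes LineGraph2Road, which builds a sparse Euclidean graph, converts it to its line graph, and applies a Graph Transformer to predict edge connectedness, markedly improving topological accuracy and fine-detail modeling in road extraction from remote-sensing imagery.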

Details Motivation: Existing road extraction methods struggle to model long-range dependencies and complex topologies, and endpoint-embedding fusion has limited expressive power on set-isomorphic links. Method: Proposes LineGraph2Road: 1) extract keypoints from segmentation masks and build a sparse Euclidean graph; 2) convert it to a line graph to model edge-to-edge relations explicitly; 3) apply a Graph Transformer on the line graph to predict edge connectedness; 4) add an overpass/underpass head and a coupled NMS strategy to handle multi-level crossings and preserve critical connections. Result: Achieves state-of-the-art results on the two key metrics, TOPO-F1 and APLS, across the City-scale, SpaceNet, and Global-scale benchmarks, and captures fine visual details needed for real-world deployment. Conclusion: Formulating road connectedness prediction as binary classification on a line graph, combined with structured graph learning and dedicated modules, overcomes the limits of conventional methods in topology modeling and multi-level crossing handling. Abstract: The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
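
The graph construction described above (keypoints as nodes, candidate edges within a distance threshold, then the line graph in which each candidate edge becomes a node) can be sketched in a few lines. The helper names and the toy keypoints are assumptions for illustration; the Graph Transformer that classifies each line-graph node is not shown.

```python
from itertools import combinations
from math import dist


def build_graph(keypoints, max_dist):
    """Candidate road segments: all keypoint pairs within max_dist of each other."""
    return [(i, j) for i, j in combinations(range(len(keypoints)), 2)
            if dist(keypoints[i], keypoints[j]) <= max_dist]


def line_graph(edges):
    """Line graph: one node per edge; two nodes are linked if the edges share an endpoint."""
    return [(a, b) for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])]


keypoints = [(0, 0), (1, 0), (2, 0), (10, 10)]
edges = build_graph(keypoints, max_dist=1.5)
lg = line_graph(edges)
print(edges, lg)
```

Connectedness prediction then becomes a per-node binary decision on the line graph, so each candidate segment has its own representation rather than one fused from its two endpoints.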

[159] PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

Fuqiang Chen,Ranran Zhang,Wanming Hu,Deboch Eyob Abera,Yue Peng,Boyun Zheng,Yiwen Sun,Jing Cai,Wenjian Qin

Main category: cs.CV

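TL;DR: This paper proposes a prompt-guided framework for virtual multiplex IHC staining (PGVMS) that trains on uniplex staining data and uses an adaptive prompt mechanism, a protein-aware learning strategy (PALS), and a prototype-consistent learning strategy (PCLS) to address three challenges: inadequate semantic guidance, inconsistent staining distribution, and spatial misalignment.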

Details Motivation: Insufficient tissue in small biopsies limits comprehensive immunohistochemical (IHC) analysis; existing virtual multiplex staining methods suffer from inadequate semantic guidance, inconsistent staining distribution, and spatial misalignment across modalities. Method: Proposes the PGVMS framework: 1) an adaptive prompt guidance mechanism based on a pathological visual language model; 2) a protein-aware learning strategy (PALS) that directly quantifies and constrains protein distributions; 3) a prototype-consistent learning strategy (PCLS) that establishes cross-image semantic interaction to correct spatial misalignment. Result: Markedly improves generation quality and spatial consistency across multiple IHC staining tasks, outperforming existing methods particularly in preserving tissue structure, the accuracy of protein expression patterns, and joint modeling of multiple targets. Conclusion: PGVMS offers a new paradigm for high-fidelity, multi-target virtual staining from uniplex IHC data alone and may extend diagnostic capability for small pathology samples. Abstract: Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).

[160] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu,Bing Fan,Jiali Yao,Zhipeng Zhang,Yan Huang,Cheng Han,Heng Fan,Libo Zhang

Main category: cs.CV

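TL;DR: This paper proposes ART-STVG, an autoregressive Transformer architecture for long videos that tackles spatio-temporal grounding over long durations. With spatial and temporal memory banks, memory selection strategies, and a cascaded spatio-temporal decoder design, it markedly outperforms existing methods on the newly built LF-STVG datasets.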

Details Motivation: Existing spatio-temporal video grounding (STVG) methods mainly target short videos of tens of seconds and do not scale to real-world videos lasting minutes or even hours, which limits practical use. Method: Proposes the autoregressive Transformer architecture ART-STVG: the video is treated as streaming input and processed frame by frame; spatial and temporal memory banks with memory selection strategies strengthen context modeling; a cascaded spatio-temporal decoder lets the spatial decoder's output assist the temporal decoder in fine-grained temporal localization. Result: Markedly outperforms state-of-the-art methods on the newly extended LF-STVG long-video datasets while remaining competitive on conventional short-form STVG. Conclusion: ART-STVG effectively addresses long-range temporal modeling and interference from irrelevant information in long-video grounding, supporting the effectiveness and generality of autoregressive modeling and cascaded spatio-temporal decoding for LF-STVG. Abstract: In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
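
One simple way to select relevant memories for the current frame is a top-k lookup by cosine similarity. The abstract does not describe the actual selection strategies, so the sketch below is only an assumed example; `select_memories` and the toy feature vectors are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_memories(memory_bank, query, k):
    """Return indices of the k stored features most similar to the current frame feature."""
    ranked = sorted(range(len(memory_bank)),
                    key=lambda i: cosine(memory_bank[i], query),
                    reverse=True)
    return sorted(ranked[:k])  # restore temporal order for the decoder


bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
picked = select_memories(bank, query=[1.0, 0.0], k=2)
print(picked)
```

Because the bank is filtered per frame, the cost of attending to memory stays bounded by `k` no matter how long the video runs.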

[161] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Ayush Roy,Wei-Yang Alex Lee,Rudrasis Chakraborty,Vishnu Suresh Lokhande

Main category: cs.CV

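TL;DR: This paper proposes Manifold-Guided Distillation (ManifoldGD), a training-free, diffusion-based dataset distillation method that adds manifold-consistent guidance at every denoising step. It obtains multi-scale instance prototype centroids (IPCs) through hierarchical clustering and uses their local tangent spaces to constrain the generation trajectory, improving the representativeness, diversity, and image fidelity of the synthetic data.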

Details Motivation: Existing training-free, diffusion-based dataset distillation methods use simple guidance strategies (e.g., steering only toward IPC centroids) and do not model the intrinsic manifold structure of the data, which limits distillation quality. Method: Proposes the ManifoldGD framework: 1) build a multi-scale coreset of IPCs via hierarchical divisive clustering in the VAE latent space; 2) estimate the local latent manifold for each denoising step from the neighborhood of the IPCs; 3) project the mode-alignment vector onto the local tangent space of that manifold to obtain manifold-preserving guidance. Result: Consistently outperforms existing training-free and training-based distillation baselines in FID, l2 distance between real and synthetic dataset embeddings, and classification accuracy. Conclusion: ManifoldGD is the first geometry-aware training-free dataset distillation framework; by explicitly modeling and exploiting the data manifold, it markedly improves distillation quality. Abstract: In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
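
Projecting a guidance vector onto a local tangent space can be sketched as follows. As an assumption, the tangent space here is taken to be the span of the offsets of neighboring centroids from their mean, orthonormalized with Gram-Schmidt; the paper may estimate it differently (a truncated PCA would be a common alternative), and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tangent_basis(neighbors, eps=1e-9):
    """Orthonormal basis for the span of neighbor offsets around their mean (Gram-Schmidt)."""
    dim = len(neighbors[0])
    center = [sum(p[d] for p in neighbors) / len(neighbors) for d in range(dim)]
    basis = []
    for p in neighbors:
        v = [a - c for a, c in zip(p, center)]
        for b in basis:  # remove components already covered by the basis
            coeff = dot(v, b)
            v = [a - coeff * bb for a, bb in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:
            basis.append([a / norm for a in v])
    return basis


def project(vec, basis):
    """Keep only the part of vec that lies in the span of the orthonormal basis."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * bb for o, bb in zip(out, b)]
    return out


# Neighboring centroids that lie in the plane z = 0 of a 3-D latent space.
neighbors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
guided = project([1.0, 2.0, 3.0], tangent_basis(neighbors))
print(guided)
```

The off-manifold component (the z direction in this toy) is discarded, so a guidance step taken along `guided` stays within the locally estimated manifold.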

[162] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Yiqing Wang,Chunming He,Ming-Chen Lu,Mercy Pawar,Leslie Niziol,Maria Woodward,Sina Farsiu

Main category: cs.CV

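TL;DR: PRIMA is a pre-training framework that aligns images with clinical text while integrating risk information. Using a RAG-enhanced clinical BERT, a dual encoder with DINOv3, and a multi-loss alignment strategy, it strengthens multimodal representations for medical diagnosis and delivers robust, strong performance with limited data and compute.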

Details Motivation: Existing methods treat clinical metadata as isolated tags and fail to use the rich semantic knowledge it carries, in particular diagnostic priors such as risk-disease associations. Method: Proposes the PRIMA framework: 1) curate an expert corpus of risk-disease correlations via RAG and fine-tune Clinical ModernBERT to embed diagnostic priors; 2) pair DINOv3 with the refined BERT as a dual encoder and design four complementary loss functions for multi-granular semantic alignment and soft-label modeling; 3) use Qwen-3 to fuse the aligned features for disease classification. Result: Significantly outperforms state-of-the-art methods on multiple medical imaging diagnosis tasks, with stronger robustness and without massive data or heavy compute. Conclusion: PRIMA bridges the semantic gap between pixel-level visual features and abstract clinical expertise, supporting the effectiveness and practicality of knowledge-informed multimodal alignment for medical AI diagnosis. Abstract: Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.

[163] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Yiran Guan,Sifan Tu,Dingkang Liang,Linghao Zhu,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai

Main category: cs.CV

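TL;DR: This paper proposes ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. With two components, LRM-as-a-Guide and Stepwise Contrastive Scaling, it delivers consistent gains on multiple multimodal reasoning benchmarks.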

Details Motivation: Existing omni-modal large language models (OLLMs) perceive well but lack complex reasoning ability, and improving their reasoning through extra training faces a shortage of high-quality data, difficult task adaptation, and high compute cost. Method: Proposes the ThinkOmni framework with two core components: 1) LRM-as-a-Guide, which uses off-the-shelf large reasoning models (LRMs) to guide OLLM decoding; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual tuning. Result: Validated on six multimodal reasoning benchmarks, with main results of 70.2 on MathVista and 75.5 on MMAU. Conclusion: ThinkOmni is a flexible, generalizable solution for omni-modal reasoning and offers new insight into transferring and applying reasoning capabilities. Abstract: Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

[164] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis,Vladan Stojnić,Bill Psomas,Nikos Komodakis,Giorgos Tolias

Main category: cs.CV

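TL;DR: This paper proposes a retrieval-augmented test-time adapter that combines text prompts with a small support set of pixel-annotated images to improve open-vocabulary segmentation (OVS), addressing two challenges: the coarse image-level supervision of VLMs and the semantic ambiguity of natural language.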

Details Motivation: Open-vocabulary segmentation (OVS) lags behind fully supervised methods because vision-language models (VLMs) are trained with image-level supervision only and because natural language is inherently ambiguous. Method: Introduces a few-shot setting that pairs text prompts with pixel-annotated support images, and designs a retrieval-augmented test-time adapter that learns a lightweight classifier for each image at test time by fusing textual and visual support features in a learned, per-query manner. Result: Significantly narrows the gap between zero-shot and fully supervised segmentation on multiple benchmarks while preserving open-vocabulary ability; supports continually expanding support sets and applies to fine-grained tasks such as personalized segmentation. Conclusion: The method eases the limits that coarse supervision and language ambiguity place on VLMs in OVS; learned cross-modal fusion at test time yields stronger synergy between text and image and moves open-vocabulary segmentation closer to practical use. Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

[165] Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training

Aheli Saha,René Schuster,Didier Stricker

Main category: cs.CV

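TL;DR: This paper analyzes in depth how the intrinsic parameters of event cameras affect object detection models trained on event data, and uses the findings to make downstream models more robust across different sensors.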

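Details Motivation: Because the output signals of event cameras are novel, the variability of available data is limited and the parameters characterizing those signals have not been analyzed in depth. Method: Studies in depth how the intrinsic parameters of event cameras affect the performance of object detection models trained on event data. Result: Provides an in-depth understanding of the influence of event camera parameters and improves the robustness of downstream models across different sensors. Conclusion: The intrinsic parameters of event cameras significantly affect model performance, and knowledge of these parameters can be used to improve sensor-agnostic robustness. Abstract: Bio-inspired event cameras have recently attracted significant research interest due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.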

[166] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal,Rishubh Parihar,Pradhaan Bhat,Ravi Kiran Sarvadevabhatla,R. Venkatesh Babu

Main category: cs.CV

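TL;DR: This paper proposes SeeThrough3D, which introduces an occlusion-aware 3D scene representation (OSCR) and an occlusion-aware text-to-image generation mechanism to model precise inter-object occlusions in 3D layout-conditioned generation.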

Details Motivation: Existing 3D layout-conditioned generation methods struggle to model precise occlusion relations between objects, which harms depth-consistent synthesis of geometry and scale. Method: Proposes an occlusion-aware 3D scene representation (OSCR) that depicts objects as translucent 3D boxes and renders them from the desired viewpoint; converts the OSCR rendering into visual tokens that condition a pretrained flow-based text-to-image model; and applies masked self-attention to bind each object box to its text description. Result: After training on a synthetic dataset of multi-object scenes with strong occlusions, SeeThrough3D generalizes to unseen categories and achieves precise 3D layout control, realistic occlusions, and consistent camera viewpoint control. Conclusion: Occlusion reasoning is a key part of 3D layout-conditioned generation; SeeThrough3D models occlusion explicitly and markedly improves the geometric plausibility and visual realism of generated scenes. Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

[167] VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein,Ruilong Li,Sérgio Agostinho,Zan Gojcic,Laura Leal-Taixé,Qunjie Zhou,Aljosa Osep

Main category: cs.CV

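TL;DR: This paper proposes VGG-T³, which uses test-time training to distill a variable-length KV space into a fixed-size MLP, giving 3D reconstruction that scales linearly with the number of input views, with a large speed-up and no loss of global scene aggregation.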

Details Motivation: Addresses the key bottleneck of offline feed-forward 3D reconstruction, whose compute and memory costs grow quadratically with the number of input images. Method: Proposes VGG-T³, which uses test-time training to distill the variable-length Key-Value (KV) representation of scene geometry into a fixed-size MLP, yielding linear scaling. Result: Reconstructs a 1k-image collection in just 54 seconds, 11.6 times faster than baselines based on softmax attention; point map reconstruction error is far better than that of other linear-time methods; the model also supports visual localization of unseen images. Conclusion: VGG-T³ achieves efficient, scalable 3D reconstruction while preserving global scene modeling, bridging the gap between offline accuracy and online efficiency. Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
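
The idea of replacing a growing key-value store with fixed-size parameters fitted at test time can be sketched in a few lines. This is not the paper's architecture: for brevity the sketch fits a single linear map rather than an MLP, uses plain full-batch gradient descent on a squared-error loss, and the names `fit_fixed_size_map` and `query` are hypothetical. What it does show is that the fitted parameters have the same size however many key-value pairs are distilled into them.

```python
def fit_fixed_size_map(keys, values, lr=0.1, steps=500):
    """Fit W (d_v x d_k) so that W @ key approximates value for all pairs.

    The size of W depends only on the key and value dimensions, never on
    the number of pairs, and each step costs time linear in that number.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    n = len(keys)
    W = [[0.0] * d_k for _ in range(d_v)]
    for _ in range(steps):
        grad = [[0.0] * d_k for _ in range(d_v)]
        for k, v in zip(keys, values):
            pred = [sum(W[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
            for i in range(d_v):
                err = pred[i] - v[i]
                for j in range(d_k):
                    # Gradient of the mean squared error w.r.t. W[i][j].
                    grad[i][j] += 2.0 * err * k[j] / n
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] -= lr * grad[i][j]
    return W

def query(W, q):
    """Read from the distilled representation with a new query vector."""
    return [sum(w * x for w, x in zip(row, q)) for row in W]

# Toy data: values come from the linear rule v = 2*k0 - k1.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
values = [[2.0], [-1.0], [1.0], [3.0]]
W = fit_fixed_size_map(keys, values)
answer = query(W, [3.0, 2.0])  # close to 4.0 for this toy rule
```

Softmax attention would instead compare every query against every stored key, which is where the quadratic cost in the number of input images comes from.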

[168] MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly,Mohammed Irfan Kurpath,Omair Mohamed,Mohamed Zidan,Fahad Khan,Salman Khan,Rao Anwer,Hisham Cholakkal

Main category: cs.CV

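TL;DR: MediX-R1 is an open-ended reinforcement learning framework for medical multimodal large language models that uses a multi-signal reward and LLM-based evaluation to markedly improve open-ended clinical reasoning.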

Details Motivation: Existing medical large models rely mostly on multiple-choice evaluation and struggle to support the free-form, semantically rich open answers that clinical practice requires; traditional reward signals such as string matching are not robust for open-ended generation. Method: Proposes the MediX-R1 framework, which fine-tunes a vision-language backbone with Group Based RL; designs a composite reward comprising an LLM-judged accuracy reward (YES/NO), a medical-embedding semantic reward (handling paraphrases and terminology variants), and lightweight format and modality rewards (enforcing interpretable reasoning and modality recognition); and builds a unified reference-based LLM-as-judge evaluation framework that replaces string-overlap metrics. Result: With only about 51K instruction examples, MediX-R1 outperforms strong open-source baselines across standard medical text-only (LLM) and image+text (VLM) benchmarks, with especially large gains on open-ended clinical tasks. Conclusion: Open-ended RL combined with multi-dimensional reward signals and LLM-driven evaluation is a practical path toward reliable medical multimodal reasoning; the models, datasets, and code are all open-sourced. Abstract: We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. 
Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
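
A composite reward of the kind the abstract lists could be combined as a weighted sum, as in the minimal sketch below. The abstract gives neither the weights nor the combination rule, so the weights `(0.5, 0.3, 0.1, 0.1)`, the clipping of negative similarities to zero, and the function names are assumptions for illustration. The judge verdict and the embeddings are passed in as plain values; no real judge model or embedding model is called.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def composite_reward(judge_verdict, answer_emb, reference_emb,
                     well_formatted, predicted_modality, true_modality,
                     weights=(0.5, 0.3, 0.1, 0.1)):
    """Weighted sum of four reward signals, each in [0, 1].

    The weights are hypothetical placeholders, not values from the paper.
    """
    # Accuracy: strict YES/NO decision from an LLM judge.
    r_acc = 1.0 if judge_verdict.strip().upper() == "YES" else 0.0
    # Semantic: embedding similarity rewards paraphrases of the reference.
    r_sem = max(0.0, cosine_similarity(answer_emb, reference_emb))
    # Format: did the output follow the required reasoning layout?
    r_fmt = 1.0 if well_formatted else 0.0
    # Modality: did the model name the right imaging modality?
    r_mod = 1.0 if predicted_modality.lower() == true_modality.lower() else 0.0
    w_acc, w_sem, w_fmt, w_mod = weights
    return w_acc * r_acc + w_sem * r_sem + w_fmt * r_fmt + w_mod * r_mod

# A paraphrased but correct answer: judged YES, embeddings nearly aligned.
reward = composite_reward("YES", [0.9, 0.1], [1.0, 0.0], True, "CT", "ct")
```

Keeping the binary judge signal and the graded similarity signal as separate terms means a correct paraphrase still earns partial credit when the strict judge says NO.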