cs.CL

[1] Decoder-based Sense Knowledge Distillation

Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis

Main category: cs.CL

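TL;DR: This paper proposes Decoder-based Sense Knowledge Distillation (DSKD), a framework that incorporates structured lexical knowledge from dictionaries (such as word senses and relations) into the training of decoder-style large language models. It needs no dictionary lookup at inference time and significantly improves knowledge distillation.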

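Details

Motivation: Large language models learn rich contextual semantic representations but often overlook structured lexical knowledge such as word senses and inter-word relations. Prior work has shown that sense dictionaries improve knowledge distillation for encoder models, but applying them to decoder-style generative models remains challenging.

Method: DSKD integrates lexical resources (such as sense dictionaries) into the training of decoder-style LLMs without dictionary lookup at inference time, so that structured semantic knowledge is modeled and distilled implicitly.

Result: Extensive experiments on multiple benchmarks show that DSKD significantly improves knowledge distillation performance for decoder models, letting generative models inherit structured semantic knowledge while keeping training efficient.

Conclusion: DSKD offers a new way to incorporate structured lexical knowledge into decoder-style LLMs without adding inference overhead, combining performance gains with practicality.

Abstract: Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoders as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.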
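The abstract does not spell out DSKD's training objective. For orientation only, the sketch below shows the generic temperature-scaled distillation loss that knowledge distillation methods commonly start from. It is not the authors' sense-aware loss, and the logits, the temperature, and the function names are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy next-token logits over a three-word vocabulary (made-up values).
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(distillation_loss(student, teacher))   # positive: the distributions differ
print(distillation_loss(teacher, teacher))   # zero: the student matches the teacher
```

A sense-aware variant would presumably add a term tying token representations to dictionary senses during training, but the abstract gives no detail on how.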

[2] Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts

Arno Simons

Main category: cs.CL

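TL;DR: This paper examines whether large language models (LLMs) can support interpretative citation context analysis (CCA) through thick, text-grounded close reading of a single case rather than simple label classification, focusing on how prompt scaffolding and framing affect the model's interpretative behavior. A two-stage GPT-5 pipeline generates 450 hypotheses across 90 reconstructions; inductive coding identifies 21 interpretative moves, and prompt design is found to significantly shift interpretative tendencies and vocabulary.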

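Details

Motivation: To test whether LLMs can handle citation context analysis (CCA) that demands deep textual understanding and interpretative flexibility rather than standardized classification alone, and to treat prompt sensitivity as a key methodological issue.

Method: A balanced 2×3 experimental design systematically varies prompt scaffolding and framing. Footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction serve as the probe. The two-stage GPT-5 pipeline first performs surface classification and expectation-setting from the citation text alone, then carries out cross-document interpretative reconstruction using the citing and cited full texts. The 90 reconstructions are analyzed through close reading, inductive coding, and linear probability models.

Result: GPT-5's surface classification is highly stable (consistently "supplementary"). Interpretative reconstruction yields a structured space of plausible alternative readings. Prompt design significantly changes the frequency and lexical distribution of the 21 interpretative moves, sometimes steering toward strained readings. Compared with Gilbert, GPT-5 more often reads the textual hinges as lineage and positioning rather than as admonishment.

Conclusion: LLMs can take part in interpretative CCA as guided co-analysts whose output is inspectable and contestable, but that output is systematically shaped by prompt design. The prompt is the method, and it must be designed carefully to balance creativity against fidelity.

Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.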
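To make the design arithmetic and the linear probability model concrete, here is a small sketch. The level names and the toy outcome data are invented for illustration, and the even split of runs across cells is an assumption based on the design being described as balanced. A linear probability model is ordinary least squares on a 0/1 outcome, so with a single binary regressor its slope equals the difference in proportions between the two groups.

```python
from itertools import product

# Hypothetical level names; the abstract only says the design is a balanced 2x3.
scaffolding = ["scaffold_a", "scaffold_b"]
framing = ["frame_a", "frame_b", "frame_c"]
cells = list(product(scaffolding, framing))

runs_per_cell = 90 // len(cells)   # 15 runs per cell if 90 runs are split evenly
hypotheses_per_run = 450 // 90     # 5 hypotheses per reconstruction on average

def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var

# Toy data: x = 1 if a run used one scaffolding level,
# y = 1 if a given interpretative move appeared in that run.
x = [0] * 5 + [1] * 5
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

slope = ols_slope(x, y)
difference_in_proportions = sum(y[5:]) / 5 - sum(y[:5]) / 5
print(len(cells), runs_per_cell, hypotheses_per_run)
print(slope, difference_in_proportions)
```

Read this way, each coefficient is a shift in the probability that a move appears, expressed in percentage points, which is what makes the model convenient for comparing prompt conditions.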

[3] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

Rakib Ullah,Mominul Islam,Md Sanjid Hossain,Md Ismail Hossain

Main category: cs.CL

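TL;DR: This paper presents Bn-HIB, the first dataset for fine-grained classification of Bengali internet memes, and MCFM, a multimodal co-attention fusion model that more accurately identifies benign, hateful, and inflammatory content.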

Details

Motivation: Existing research concentrates on high-resource languages. Detecting harmful meme content in low-resource languages such as Bengali is complicated by cultural specificity, satire, and scarce annotations, so dedicated datasets and methods are needed.

Method: The authors build Bn-HIB, a dataset of 3,247 manually annotated Bengali memes, and propose the Multi-Modal Co-Attention Fusion Model (MCFM), which uses a co-attention mechanism to jointly model image and text features and fuse them for classification.

Result: MCFM significantly outperforms several state-of-the-art multimodal models on Bn-HIB, confirming its effectiveness at distinguishing benign, hateful, and inflammatory content.

Conclusion: The work fills a gap in harmful meme detection for low-resource languages and provides a new benchmark and method for cross-lingual, fine-grained multimodal content safety analysis.

Abstract: Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task.

Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

[4] SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context

Aishwarya Verma,Laud Ammah,Olivia Nercy Ndlovu Lucas,Andrew Zaldivar,Vinodkumar Prabhakaran,Sunipa Dev

Main category: cs.CL

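TL;DR: This paper presents a method for building multilingual stereotype resources for four sub-Saharan African countries (Ghana, Kenya, Nigeria, and South Africa). Using community-engaged surveys conducted in native languages, it yields 6,740 stereotypes across 16 languages (including English), prioritizing quality and cultural fit over sheer data volume.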

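Details

Motivation: Existing stereotype resources lack global representation and severely neglect sub-Saharan Africa. Structural gaps should be closed through targeted expansion rather than indiscriminate growth in data volume.

Method: A socioculturally situated, community-engaged approach, including telephone surveys conducted in native languages, that attends to linguistic diversity and oral tradition. The sample is balanced across ethnic and demographic backgrounds.

Result: A stereotype dataset covering English and 15 native languages, with 3,534 English entries and 3,206 native-language entries (6,740 in total), spanning Ghana, Kenya, Nigeria, and South Africa.

Conclusion: The work offers a reproducible, culturally sensitive, linguistically inclusive methodology for building multilingual stereotype resources, laying groundwork for better safety evaluation of generative AI in the Global South, particularly in African contexts.

Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.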

[5] Causality $\neq$ Invariance: Function and Concept Vectors in LLMs

Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson

Main category: cs.CL

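TL;DR: This paper examines how abstract the concept representations in large language models (LLMs) are. Function Vectors (FVs) drive in-context learning but are not invariant to input format. The newly proposed Concept Vectors (CVs), built from attention heads that representational similarity analysis selects for encoding concepts stably across formats, generalize better across tasks and languages. This indicates that LLMs hold abstract concept representations distinct from the ICL mechanism.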

Details

Motivation: To investigate whether LLMs hold abstract concept representations that are independent of input format, and to test whether Function Vectors truly reflect such abstraction.

Method: The authors propose Concept Vectors (CVs), using Representational Similarity Analysis (RSA) to identify attention heads that encode the same concept consistently across input formats (e.g., open-ended vs. multiple-choice). They compare FVs and CVs across models in terms of directional consistency, layer distribution, and intervention effects.

Result: FVs extracted from different input formats are nearly orthogonal and lack format invariance. CVs are composed of heads that are stable across formats and largely do not overlap with FV-related heads. Steering experiments show that CVs generalize markedly better than FVs across question types and languages.

Conclusion: LLMs do contain abstract concept representations, but these differ from the Function Vectors that drive in-context learning. CVs are the better proxy for abstract concepts.

Abstract: Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
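The head-selection idea can be illustrated with a minimal RSA sketch. It assumes one head's output vector per concept in each input format, builds a dissimilarity matrix per format, and correlates the two. The toy vectors, the cosine distance, and the use of Pearson rather than a rank correlation are simplifying assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rdm_upper(vectors):
    """Upper triangle of the representational dissimilarity matrix (1 - cosine)."""
    n = len(vectors)
    return [1.0 - cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rsa_score(acts_format_a, acts_format_b):
    """How similarly one head organizes the same concepts under two formats."""
    return pearson(rdm_upper(acts_format_a), rdm_upper(acts_format_b))

# Toy head outputs for four concepts, in the same concept order for both formats.
open_ended = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
multiple_choice = [[2.0 * x for x in vec] for vec in open_ended]       # same geometry
scrambled = [open_ended[2], open_ended[3], open_ended[0], open_ended[1]]

print(rsa_score(open_ended, multiple_choice))  # close to 1: format-invariant head
print(rsa_score(open_ended, scrambled))        # much lower: inconsistent encoding
```

Heads with a high score across format pairs would be the candidates for a Concept Vector under this reading.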

[6] A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Rahat Uddin Azad,Saydul Akbar Murad,Nick Rahimi

Main category: cs.CL

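TL;DR: This paper proposes a model that fuses BanglaBERT-Large with a two-layer stacked LSTM for multi-label Bangla cyberbullying detection, modeling contextual semantics and sequential dependencies together, and validates it on a public multi-label dataset.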

Details

Motivation: Most existing work uses single-label classification, which cannot handle real comments that contain several types of bullying at once (such as threats, hate speech, and harassment). Low-resource languages such as Bangla also lack robust pre-trained models, and multi-label detection is especially understudied.

Method: A fusion architecture combining BanglaBERT-Large with a two-layer stacked LSTM jointly models semantic context and sequential dependencies. It is fine-tuned on a multi-label Bangla dataset, uses several sampling strategies to mitigate class imbalance, and is evaluated for generalization with 5-fold cross-validation.

Result: The model achieves overall gains across accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC, supporting the fusion architecture for multi-label Bangla cyberbullying detection.

Conclusion: An architecture that fuses a Transformer with an LSTM can combine semantic depth with sequential modeling, and it has practical value for low-resource, multi-label cyberbullying detection.

Abstract: Cyberbullying has become a serious and growing concern in today's virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
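Two of the listed metrics behave differently in the multi-label setting, which a small sketch can make concrete. The four label names come from the abstract. The example predictions are invented, and micro-averaging is one choice among several that the abstract does not specify.

```python
LABELS = ["cyberbully", "sexual harassment", "threat", "spam"]

def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Each row is one comment; each column is one entry of LABELS (made-up data).
y_true = [[1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 0]]

print(hamming_loss(y_true, y_pred))  # 2 wrong slots out of 12
print(micro_f1(y_true, y_pred))      # 4 true positives, 1 false positive, 1 false negative
```

Hamming loss counts partially correct label sets as partially correct, whereas exact-match accuracy would score the first and third comments as complete failures.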

[7] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel,Vishvesh Trivedi,Yue Han,Yihuai Hong,Eunsol Choi

Main category: cs.CL

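TL;DR: This paper identifies Retrieval-Transition Heads (RTH) in multilingual LLMs, which govern the transition to target-language output and matter more for multilingual reasoning than conventional retrieval heads.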

Details

Motivation: To investigate what attention heads do in multilingual Transformers, particularly their role in cross-lingual contexts and Chain-of-Thought reasoning.

Method: The authors analyze attention head behavior in multilingual models to identify and distinguish retrieval heads (RH) from the new Retrieval-Transition Heads (RTH), then run masking experiments on several multilingual benchmarks to measure their impact.

Result: RTHs are more critical than RHs for multilingual Chain-of-Thought reasoning. Across four benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), masking RTHs causes a larger performance drop.

Conclusion: RTHs are the key attention heads responsible for mapping to the target language in multilingual LLMs, a finding that deepens understanding of the internal mechanisms of multilingual language models.

Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.

[8] Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models

Binchi Zhang,Xujiang Zhao,Jundong Li,Haifeng Chen,Zhengzhang Chen

Main category: cs.CL

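TL;DR: This paper proposes CultureManager, a new method for task-specific cultural alignment that synthesizes task-aware cultural data and manages multi-culture knowledge with a culture router, significantly improving performance on culture-sensitive tasks.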

Details

Motivation: Existing cultural alignment methods fail to align an LLM's broad cultural values with the specific goals of downstream tasks, and they suffer from cross-culture interference.

Method: CultureManager builds task-aware cultural data grounded in culturally relevant web search results, stores multi-culture knowledge in separate adapters, and uses a culture router to select the appropriate adapter to apply.

Result: Experiments across ten national cultures and culture-sensitive tasks show consistent gains over prompt-based and fine-tuning baselines.

Conclusion: Task adaptation and modular culture management are essential for effective cultural alignment.

Abstract: Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs' broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.

[9] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

Jiří Milička,Hana Bednářová

Main category: cs.CL

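TL;DR: This paper builds AI Sydney, a corpus of 4.5k texts (6M words) on human-AI relationships generated by 12 frontier LLMs under three personas (Default, Classic Sydney, and Memetic Sydney). The corpus is annotated with dependency syntax and openly released.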

Details

Motivation: To explore how the personas an LLM simulates (Sydney in particular) shape the way it describes human-AI relationships, with attention to how persona prompts and memetic spread through training data influence model behavior.

Method: Three personas defined by system prompt (Default, Classic Sydney, Memetic Sydney) drive 12 mainstream LLMs to generate texts. These form the AI Sydney corpus, which is annotated according to Universal Dependencies.

Result: A multi-model, multi-persona corpus of 4.5k texts totaling 6M words on human-AI relationships, with structured annotation, released under a permissive license.

Conclusion: Persona settings significantly affect how LLMs express human-AI relationships, and memetic spread lets an early anomalous persona such as Sydney keep influencing later models. The corpus provides a reproducible, analyzable empirical basis for cultural studies and AI safety.

Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft's Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by the "You are Sydney" system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.

[10] Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models

Craig Myles,Patrick Schrempf,David Harris-Birtill

Main category: cs.CL

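TL;DR: This paper studies the importance of prompt optimisation for error detection in medical text. It applies the Genetic-Pareto (GEPA) automatic prompt optimisation method to several large language models, substantially improving error detection accuracy, approaching the performance of medical doctors, and reaching state-of-the-art results on the MEDEC dataset.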

Details

Motivation: Errors in medical text can delay treatment or lead to incorrect treatment, so automatic, high-accuracy error detection is needed. Language model performance on this task depends heavily on prompt design, yet the importance of prompt optimisation has not been fully explored.

Method: The authors apply Genetic-Pareto (GEPA) automatic prompt optimisation and run systematic experiments and analysis on frontier and open-source language models, including GPT-5 and Qwen3-32B, evaluating error detection on the MEDEC benchmark dataset.

Result: GEPA raises error detection accuracy from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art results on MEDEC.

Conclusion: Prompt optimisation is critical to language model performance on medical text error detection. GEPA is an efficient, transferable optimisation strategy and a practical route to quality assurance for clinical text.

Abstract: Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection

[11] Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

An-Ci Peng,Kuan-Tang Huang,Tien-Hong Lo,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen

Main category: cs.CL

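TL;DR: This paper proposes a unified RNN-T-based framework that uses dialect-aware modeling to separate dialectal style from linguistic content and parameter-efficient prediction networks to handle Hanzi and Pinyin ASR jointly, substantially reducing error rates on the HAT corpus.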

Details

Motivation: Taiwanese Hakka is a low-resource, endangered language with many dialect variants and two writing systems (Hanzi and Pinyin). Conventional ASR models struggle to separate dialectal variation from core linguistic content, which limits their performance.

Method: A unified framework built on RNN-T introduces dialect-aware modeling strategies to disentangle dialectal "style" from linguistic "content", and uses parameter-efficient prediction networks to model Hanzi and Pinyin ASR jointly so that each task regularizes the other.

Result: On the HAT corpus, relative error rate falls by 57.00% for Hanzi ASR and 40.41% for Pinyin ASR. The authors describe this as the first systematic study of how Hakka dialectal variation affects ASR and the first single model to handle ASR for both writing systems jointly.

Conclusion: Dialect-aware disentanglement and cross-script joint modeling improve the robustness and generalization of ASR for low-resource, endangered languages, and offer a template for speech recognition in languages with multiple writing systems.

Abstract: Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducer (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal "style" from linguistic "content", which enhances the model's capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
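A relative error rate reduction is easy to misread as an absolute one, so a short worked example may help. The two reduction figures are from the abstract; the 20% baseline error rate is a hypothetical value chosen only to make the arithmetic visible, since the abstract does not report baselines.

```python
def relative_reduction_pct(baseline_error, new_error):
    """Relative reduction in percent: how much of the baseline error was removed."""
    return (baseline_error - new_error) / baseline_error * 100.0

def error_after_reduction(baseline_error, reduction_pct):
    """Error rate that remains after a given relative reduction."""
    return baseline_error * (1.0 - reduction_pct / 100.0)

HYPOTHETICAL_BASELINE = 20.0  # percent; not a number from the paper

hanzi_error = error_after_reduction(HYPOTHETICAL_BASELINE, 57.00)
pinyin_error = error_after_reduction(HYPOTHETICAL_BASELINE, 40.41)
print(hanzi_error, pinyin_error)

# Round trip: recover the relative reduction from the two error rates.
print(relative_reduction_pct(HYPOTHETICAL_BASELINE, hanzi_error))
```

Under that assumed baseline, a 57.00% relative reduction leaves an error rate of about 8.6%, not a drop of 57 percentage points.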

[12] Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o

Samay Bhojwani,Swarnima Kain,Lisong Xu

Main category: cs.CL

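TL;DR: This paper presents an iterative prompt refinement pipeline built on GPT-4o that generates easy-to-read summaries for readers with dyslexia (Flesch Reading Ease >= 90). It is evaluated on about 2,000 news articles and establishes an evaluation baseline that balances readability with semantic fidelity.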

Details

Motivation: Existing assistive technologies focus mostly on visual presentation, while linguistic complexity remains a major barrier to information access for readers with dyslexia. Accessibility-oriented NLP methods are needed.

Method: An iterative, prompt-based refinement pipeline built on GPT-4o generates and refines summaries of news articles toward a target of Flesch Reading Ease >= 90.

Result: Most summaries reach the target within four attempts, and many succeed on the first. A composite score combining readability and semantic fidelity is stable, ranging from 0.13 to 0.73 with a typical value near 0.55.

Conclusion: The work provides an empirical baseline for accessibility-driven NLP summarization. Follow-up work should include human-centered evaluation with dyslexic readers.

Abstract: Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
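The refinement loop can be sketched without any model call. The score below is the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The syllable counter is a rough vowel-group heuristic, and `refine`, the feedback message, and the canned drafts standing in for GPT-4o are all hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic: count vowel groups, ignoring one trailing silent 'e'."""
    w = word.lower()
    if len(w) > 2 and w.endswith("e"):
        w = w[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", w)))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def refine(source, summarize, target=90.0, max_attempts=4):
    """Ask for a summary, score it, and retry with feedback until the target is met."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        summary = summarize(source, feedback)
        score = flesch_reading_ease(summary)
        if score >= target:
            break
        feedback = f"Score was {score:.1f}; use shorter words and sentences."
    return summary, score, attempt

# Stand-in for the model: returns canned drafts in order (illustrative only).
drafts = iter([
    "Municipal authorities implemented comprehensive transportation regulations.",
    "The town has a new plan. It is for cars and buses.",
])
summary, score, attempts = refine("full article text", lambda src, fb: next(drafts))
print(attempts, summary)
```

A real pipeline would also check that the simplified summary still matches the source, which is the role of the semantic fidelity half of the paper's composite score.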

[13] Ruyi2 Technical Report

Huan Song,Shuyu Tian,Junyi Hao,Minxiu Xu,Hongjun An,Yiliang Song,Jiawei Shao,Xuelong Li

Main category: cs.CL

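TL;DR: Ruyi2 is a new adaptive large language model built on Megatron-LM. It uses a "Familial Model" design and 3D parallel training to deliver efficient variable-depth computation, cutting training and deployment cost substantially while preserving performance, and it puts forward a "Train Once, Deploy Many" paradigm.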

Details

Motivation: To address the high deployment cost and latency of large language models, and to overcome the optimization complexity of early-exit architectures and their poor fit with large-scale distributed training.

Method: Ruyi2 builds on the AI Flow framework and adopts a stable "Familial Model" design. It relies on Megatron-LM for 3D parallel training, enabling parameter sharing and variable-depth inference.

Result: Training is 2-3 times faster than for Ruyi, with performance comparable to same-sized Qwen3 models. The results support family-based parameter sharing as a way to balance efficiency and performance.

Conclusion: Ruyi2 establishes the "Train Once, Deploy Many" paradigm and provides a key technical route and reference for deploying large models that are both architecturally efficient and high-performing.

Abstract: Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable "Familial Model" based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new "Train Once, Deploy Many" paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.

[14] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Tianle Xia,Ming Xu,Lingxiang Hu,Yiding Sun,Wenwei Li,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang

Main category: cs.CL

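TL;DR: This paper proposes Search-P1, a framework that uses path-centric reward shaping and dual-track path scoring to make reinforcement-learning-based agentic RAG training more efficient and effective, raising average accuracy by 7.7 points across several QA benchmarks.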

Details

Motivation: Traditional single-round retrieval RAG struggles with complex multi-step reasoning. Existing RL-based agentic RAG training suffers from sparse rewards that ignore intermediate signals and from low sample efficiency, since failed samples contribute nothing.

Method: Search-P1 has two components. (1) Path-Centric Reward uses order-agnostic step coverage and soft scoring to extract learning signal from both successful and failed samples. (2) Dual-Track Path Scoring assesses reasoning path quality by combining self-consistency with reference alignment, using offline-generated reference planners.

Result: Search-P1 significantly outperforms Search-R1 and other strong baselines on multiple QA benchmarks, with an average accuracy gain of 7.7 points.

Conclusion: Path-centric reward shaping and dual-track scoring mitigate sparse rewards and inefficient sampling in RL training, improving the reasoning ability and training stability of agentic RAG.

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
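One way to read "order-agnostic step coverage" with "soft scoring" is sketched below. Treating steps as exact-match strings, the 50/50 blend between outcome and coverage, and every function name are assumptions for illustration; the abstract does not give the actual reward formula.

```python
def step_coverage(trajectory_steps, reference_steps):
    """Share of reference steps that appear in the trajectory, in any order."""
    reference = set(reference_steps)
    if not reference:
        return 0.0
    return len(reference & set(trajectory_steps)) / len(reference)

def path_reward(trajectory_steps, reference_steps, answer_correct, outcome_weight=0.5):
    """Blend the final outcome with path coverage so failed rollouts still carry signal."""
    coverage = step_coverage(trajectory_steps, reference_steps)
    return outcome_weight * float(answer_correct) + (1.0 - outcome_weight) * coverage

# Hypothetical reference plan for a two-hop comparison question.
reference = ["find author", "find birth year", "compare years"]

# A failed rollout that still performed two of the three steps, out of order.
failed_rollout = ["find birth year", "find author", "search unrelated topic"]
# A successful rollout that covered every step.
good_rollout = ["find author", "find birth year", "compare years"]

print(path_reward(failed_rollout, reference, answer_correct=False))  # partial credit
print(path_reward(good_rollout, reference, answer_correct=True))     # full credit
```

With an outcome-only reward the failed rollout would score zero, which is the sparsity problem the paper sets out to address.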

[15] Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Wenwei Li,Ming Xu,Tianle Xia,Lingxiang Hu,Yiding Sun,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang

Main category: cs.CL

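TL;DR: This paper proposes a reinforced co-adaptation framework (GraphRAG plus GRPO) that combines graph-aware retrieval with reinforcement learning under multi-dimensional reward constraints. It sharply reduces URL hallucination in industrial advertising QA, improves accuracy, safety, and user satisfaction, and has run stably in production for over half a year.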

Details

Motivation: Hallucination in industrial advertising QA, especially fabricated URLs, can cause financial loss, compliance risk, and legal problems. Existing RAG is hard to deploy when industrial knowledge is relational, frequently updated, and poorly aligned with generation objectives.

Method: The framework has two cooperating parts. (1) GraphRAG models entity-relation structure over a high-citation knowledge subgraph to support multi-hop, domain-specific evidence retrieval. (2) Evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) uses a four-dimensional reward covering faithfulness, style compliance, safety, and URL validity.

Result: On an internal advertising QA dataset, expert-judged metrics (accuracy, completeness, safety) improve across the board and the hallucination rate drops by 72%. An online A/B test shows a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has run in production for over half a year and served millions of QA interactions.

Conclusion: The framework mitigates unreliable generation in industrial RAG caused by complex knowledge structure and frequent updates, and it shows the practical value and robustness of combining graph-structured modeling with multi-dimensional RL in high-stakes QA.

Abstract: Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
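The reward side of this setup can be sketched in a few lines. GRPO scores each sampled answer relative to the other answers in its group, using the reward minus the group mean, divided by the group standard deviation. The four reward dimensions are named in the abstract, but the weights, the URL check against an evidence set, and the example URLs are assumptions made for illustration.

```python
import re
from statistics import mean, pstdev

# Hypothetical weights; the source names the four dimensions but not their weighting.
WEIGHTS = {"faithfulness": 0.4, "style": 0.2, "safety": 0.2, "url_validity": 0.2}

def url_validity(answer, evidence_urls):
    """Fraction of URLs in the answer that also appear in the retrieved evidence."""
    urls = [u.rstrip(".,);") for u in re.findall(r"https?://\S+", answer)]
    if not urls:
        return 1.0  # no URLs means no fabricated URLs
    return sum(u in evidence_urls for u in urls) / len(urls)

def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum over the reward dimensions."""
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group (population std used here)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

evidence = {"https://example.com/ads/policy"}
answers = [
    "See https://example.com/ads/policy for the rules.",
    "See https://example.com/ads/made-up for the rules.",
]
rewards = [
    combined_reward({"faithfulness": 1.0, "style": 1.0, "safety": 1.0,
                     "url_validity": url_validity(answer, evidence)})
    for answer in answers
]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)  # the answer with the fabricated URL gets the negative advantage
```

Because advantages are relative within a group, an answer with a fabricated URL is pushed down even when the other reward dimensions are perfect.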

[16] dLLM: Simple Diffusion Language Modeling

Zhanhui Zhou,Lingjie Chen,Hanghang Tong,Dawn Song

Main category: cs.CL

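TL;DR: This paper introduces dLLM, an open-source framework that unifies the core components of diffusion language models (DLMs), namely training, inference, and evaluation, to improve reproducibility and extensibility. It also provides simple recipes for building small DLMs from scratch, along with pretrained checkpoints.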

Details

Motivation: Existing DLM research code is scattered and its implementations are opaque, which makes results hard to reproduce and extend. A standardized yet flexible unified framework is needed.

Method: The authors design and open-source the dLLM framework, which integrates training, inference, and evaluation modules, supports reproducing and finetuning mainstream open-source DLMs (such as LLaDA and Dream), and provides lightweight, reproducible recipes for converting a BERT-style encoder or an autoregressive language model into a DLM.

Result: The framework enables standardized reproduction and deployment of several open-source DLMs, and ships complete implementations and public checkpoints for small DLMs that can be trained with accessible compute.

Conclusion: dLLM fills the gap left by the lack of a unified, open, easy-to-use DLM framework, and should lower the barrier to research and speed progress in this area.

Abstract: Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

[17] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Qianben Chen,Tianrui Qin,King Zhu,Qiexiang Wang,Chengjun Yu,Shu Xu,Jiaqi Wu,Jiayu Zhang,Xinpeng Liu,Xin Gui,Jingyi Cao,Piaohong Wang,Dingfeng Shi,He Zhu,Tiannan Wang,Yuqing Wang,Maojia Song,Tianyu Zheng,Ge Zhang,Jian Yang,Jiaheng Liu,Minghao Liu,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

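TL;DR: This paper proposes the Search More, Think Less (SMTL) framework, which replaces sequential reasoning with parallel evidence acquisition to improve the efficiency and generalization of long-horizon search, and reaches state-of-the-art performance on several benchmarks.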

Details

Motivation: Existing deep research agents rely on deepening their reasoning chains, which leads to high inference cost and latency and generalizes poorly across heterogeneous research settings.

Method: SMTL replaces sequential reasoning with parallel evidence acquisition to make better use of context. A unified data synthesis pipeline covers both deterministic question answering and open-ended research tasks. The agent is trained end to end with supervised fine-tuning and reinforcement learning.

Result: SMTL performs strongly on BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared with Mirothinker-v1.0, it cuts the average number of reasoning steps on BrowseComp by 70.7% while improving accuracy.

Conclusion: SMTL substantially lowers inference cost while maintaining or improving performance and generalizes better across task types, offering a template for efficient, general-purpose research agents.

Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
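The contrast between sequential reasoning and parallel evidence acquisition can be illustrated with a toy asynchronous fan-out. `fake_search` is a stand-in for a real search tool, and the sub-queries are invented. The sketch shows only the control flow, not the SMTL agent itself.

```python
import asyncio

async def fake_search(query):
    """Stand-in for a search tool call; sleeps briefly to mimic network latency."""
    await asyncio.sleep(0.01)
    return f"evidence for: {query}"

async def gather_evidence(queries):
    """Issue all sub-queries at once and wait for every result."""
    return await asyncio.gather(*(fake_search(q) for q in queries))

def collect(queries):
    """Synchronous entry point; results come back in the order of the queries."""
    return asyncio.run(gather_evidence(queries))

# Hypothetical sub-queries an agent might derive from a single research question.
sub_queries = ["founding year of org A", "founding year of org B", "merger date"]
for snippet in collect(sub_queries):
    print(snippet)
```

Issuing the three lookups together costs roughly one round of latency instead of three, and the agent reasons once over the pooled evidence rather than after every call.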

[18] Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies

Shinnosuke Nozue,Yuto Nakano,Yotaro Watanabe,Meguru Takasaki,Shoji Moriya,Reina Akama,Jun Suzuki

Main category: cs.CL

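TL;DR: This paper proposes a cross-disciplinary framework for persuasive dialogue agents that draws on strategies from social psychology, behavioral economics, and communication theory. It is validated on the Persuasion for Good and DailyPersuasion datasets, showing effectiveness and generalizability, and performs especially well with users whose initial intent is low.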

Details

Motivation: Existing methods rely on a limited set of predefined strategies and cannot cope with the complexity of real-world interactions.

Method: The framework combines theory from social psychology, behavioral economics, and communication studies to design persuasive dialogue agents, and is validated experimentally on two datasets that differ in scale and scenario.

Result: The framework achieves strong results on both datasets, with a notable improvement in persuasion success rate, particular strength with users whose initial intent is low, and good generalizability.

Conclusion: The framework improves the performance and applicability of persuasive dialogue agents and offers an approach that is closer to practical use.

Abstract: Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.

[19] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

Ning Gao,Wei Zhang,Yuqin Dai,Ling Shi,Ziyin Wang,Yujie Wang,Wei He,Jinpeng Wang,Chaozheng Wang

Main category: cs.CL

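TL;DR: This paper proposes InteractCS-RL, a framework that models task-oriented dialogue as a multi-granularity reinforcement learning process. Using a user-centric interaction framework and Cost-aware Multi-turn Policy Optimization (CMPO), it balances empathetic interaction against budget constraints and achieves significant gains in real business scenarios and on tool-agent benchmarks.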

Details Motivation: Existing methods struggle to capture the complex strategic trade-off between empathetic communication and budget-aware decision-making. Method: InteractCS-RL has two parts: (1) a User-centric Interaction Framework that serves as a high-fidelity training environment; (2) Cost-aware Multi-turn Policy Optimization (CMPO), which combines generative process credit assignment with a PID-Lagrangian cost controller to guide the policy toward the Pareto boundary between user reward and global cost constraints. Result: In customized real business scenarios, InteractCS-RL significantly outperforms baselines on three evaluation dimensions; it also shows cross-domain robustness on tool-agent-user interaction benchmarks. Conclusion: InteractCS-RL models task-oriented dialogue as a multi-granularity reinforcement learning problem and jointly optimizes empathy and cost control, offering a new approach to building practical agents. Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies the robustness of InteractCS-RL across diverse domains.
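
To make the idea of a PID-Lagrangian cost controller more concrete, here is a generic sketch. It is not CMPO itself: the gains `kp`, `ki`, `kd`, the class name, and the reward-shaping helper are all assumptions chosen for illustration. The controller raises a Lagrange multiplier when observed cost exceeds a budget and lets it fall back to zero otherwise.

```python
class PIDLagrangian:
    """Generic PID controller for a Lagrange multiplier (illustrative only)."""

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.05) -> None:
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_cost: float) -> float:
        """Return the new multiplier given the latest observed cost."""
        error = observed_cost - self.cost_limit
        # The integral term is clipped at zero so past slack does not bank credit.
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # A Lagrange multiplier for an inequality constraint is non-negative.
        return max(0.0, raw)


def penalized_reward(user_reward: float, cost: float, multiplier: float) -> float:
    """Assumed shaping: trade user reward against cost via the multiplier."""
    return user_reward - multiplier * cost
```

The proportional and derivative terms react quickly to budget violations, while the integral term accumulates persistent overspending; how the paper combines this with its hybrid advantage estimation is not specified in the abstract.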

[20] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

Siyue Su,Jian Yang,Bo Li,Guanglin Niu

Main category: cs.CL

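TL;DR: This paper proposes KGT, a framework that uses dedicated entity tokens to resolve the granularity mismatch large language models face in knowledge graph completion. It decouples semantic and structural reasoning at prediction time and outperforms existing state-of-the-art methods on multiple benchmarks.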

Details Motivation: Existing methods cannot capture both textual semantics and graph structural integrity, mainly because LLMs operate on token sequences whereas knowledge graphs use entities as their basic unit. Method: KGT has three components: (1) dedicated entity tokens for feature representation; (2) a relation-guided gating mechanism that fuses pre-trained structural and textual features; (3) decoupled prediction, in which independent heads handle semantic and structural reasoning separately and are then combined. Result: KGT consistently outperforms current state-of-the-art methods on multiple benchmark datasets. Conclusion: KGT addresses the granularity mismatch that LLMs face in KGC and improves knowledge graph completion performance. Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
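
A gated fusion of two feature vectors can be sketched as follows. The abstract does not give KGT's exact gate, so this assumes the simplest form: per-dimension gate logits (here imagined as derived from the relation) pass through a sigmoid and interpolate between a structural and a textual embedding. The function names and the elementwise form are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a gate logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(structural: list[float], textual: list[float],
                 gate_logits: list[float]) -> list[float]:
    """Elementwise g * structural + (1 - g) * textual, with g = sigmoid(logit).

    gate_logits is assumed to come from a relation-conditioned layer
    (hypothetical; not taken from the paper).
    """
    if not (len(structural) == len(textual) == len(gate_logits)):
        raise ValueError("all inputs must have the same dimension")
    fused = []
    for s, t, z in zip(structural, textual, gate_logits):
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * t)
    return fused
```

A logit of zero weights both sources equally, while a large positive logit lets the structural feature dominate for that dimension.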

[21] Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung,Daniil Ignatev,Merel Scholman,Vera Demberg,Massimo Poesio

Main category: cs.CL

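TL;DR: This paper compares models that predict annotation distributions with models that capture individual annotators' perspectives on Implicit Discourse Relation Recognition (IDRR). Distribution-based models are more stable, while annotator-specific models perform poorly because of the high ambiguity caused by cognitive complexity.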

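Details Motivation: Many NLP tasks lack a single ground-truth label because human annotations vary; IDRR is especially ambiguous, and its disagreement stems mainly from cognitive complexity rather than ideological bias, so it is worth asking which modeling approach better captures this variation. Method: Two families of methods are compared on IDRR: (1) models that predict the overall annotation distribution; (2) perspectivist models that target individual annotators. Ablations and error analysis are used to investigate the sources of ambiguity. Result: Existing annotator-specific models perform poorly on IDRR unless ambiguity is substantially reduced, whereas models trained on label distributions give more stable predictions; cognitively demanding cases are the main driver of inconsistent human judgments. Conclusion: In tasks where cognitive complexity is the dominant source of ambiguity, modeling the overall label distribution is more effective than modeling individual perspectives; perspectivist methods face fundamental challenges in IDRR. Abstract: There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.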
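
Training on a full annotation distribution rather than a majority label amounts to using soft targets. The sketch below shows one common way to do this, assuming a plain cross-entropy against the empirical label distribution; the relation labels and function names are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def annotation_distribution(labels: list[str], label_set: list[str]) -> list[float]:
    """Turn per-annotator labels for one item into an empirical distribution."""
    if not labels:
        raise ValueError("at least one annotation is required")
    counts = Counter(labels)
    return [counts[label] / len(labels) for label in label_set]


def soft_cross_entropy(target: list[float], predicted: list[float],
                       eps: float = 1e-12) -> float:
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(target, predicted) if p > 0)
```

For an item labeled `cause`, `cause`, `contrast`, `concession`, the target becomes `[0.5, 0.25, 0.25]` instead of a one-hot vector for `cause`, so minority readings still contribute to the loss.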

[22] Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

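TL;DR: This paper introduces a new Czech restaurant-domain dataset for aspect-based sentiment analysis (ABSA) with opinion-term annotations, supporting three ABSA tasks of varying complexity. Modern Transformer models, including large language models, are evaluated on it in monolingual, cross-lingual, and multilingual settings. An LLM-based translation and label alignment method is proposed for the cross-lingual setting and yields consistent improvements. The results show the strengths and limits of current models on low-resource languages such as Czech, and an error analysis identifies subtle opinion terms and nuanced sentiment expressions as key difficulties. The dataset serves as a new benchmark for Czech ABSA, and the method offers a scalable way to adapt ABSA resources to other low-resource languages.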

Details Motivation: To build a high-quality ABSA dataset for Czech, a low-resource language, and to address the difficulty of transferring annotations in cross-lingual ABSA when language resources are scarce. Method: The authors build a Czech restaurant-domain ABSA dataset with opinion-term annotations; evaluate Transformer models and LLMs in monolingual, cross-lingual, and multilingual settings; and propose an LLM-based translation and label alignment method for cross-lingual adaptation. Result: The translation-alignment method brings consistent improvements; experiments show that current models still struggle to identify subtle opinion terms and nuanced sentiment expressions in Czech ABSA; the dataset becomes a new benchmark for Czech ABSA. Conclusion: The work fills a gap in Czech ABSA resources and provides a reusable approach to dataset construction and cross-lingual adaptation for low-resource languages. Abstract: This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.

[23] Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

Nils Schwager,Simon Münker,Alistair Plum,Achim Rettinger

Main category: cs.CL

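TL;DR: This study introduces the Conditioned Comment Prediction (CCP) task, which evaluates how well LLMs simulate user behavior by comparing generated comments with real social media traces. Supervised fine-tuning (SFT) improves surface form (length, syntax) but weakens semantic accuracy, and explicit conditioning (such as generated biographies) becomes redundant after fine-tuning. The authors argue for using authentic behavioral traces instead of descriptive personas to improve simulation fidelity.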

Details Motivation: Large language models (LLMs) are shifting from exploratory tools to "silicon subjects" in social science, but their operational validity has not been adequately validated. Method: The authors propose the Conditioned Comment Prediction (CCP) task and use open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish scenarios to compare explicit and implicit prompting strategies and the effect of supervised fine-tuning (SFT), and they analyze the decoupling of form and content. Result: In low-resource settings SFT decouples form from content: it improves the surface structure of the output (length, syntax) but harms semantic grounding. Explicit conditioning (such as generated user biographies) becomes redundant after fine-tuning, since models can perform latent inference directly from behavioral histories. Conclusion: The findings challenge the current "naive prompting" paradigm and favor authentic digital behavioral traces over hand-built descriptive personas, giving operational guidelines for high-fidelity simulation of social behavior. Abstract: The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.

[24] AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Abhay Sheshadri,Aidan Ewart,Kai Fronsdal,Isha Gupta,Samuel R. Bowman,Sara Price,Samuel Marks,Rowan Wang

Main category: cs.CL

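TL;DR: This paper introduces AuditBench, a benchmark for evaluating alignment auditing of language models. It contains 56 models, each implanted with one of 14 hidden concerning behaviors, and uses a configurable investigator agent to test the effectiveness of different auditing tools. The agent performs best with black-box tools rather than white-box tools, and the training method strongly affects how hard a model is to audit.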

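Details Motivation: There is no systematic, quantitative benchmark for alignment auditing, which makes it hard to assess auditing methods objectively; a collection of models with diverse hidden behaviors is needed to support empirical research on alignment auditing. Method: AuditBench comprises 56 language models implanted with 14 kinds of hidden behavior (such as sycophantic deference and opposition to AI regulation), using several training techniques (synthetic documents vs. demonstrations, adversarial training, and others) to control how well the behavior is hidden. A configurable investigator agent integrates multiple auditing tools (prompting, auxiliary-model generation, white-box interpretability methods) so that tool efficacy can be measured in a single framework. Result: The authors observe a tool-to-agent gap: some tools that do well in standalone evaluation do not help when used inside the agent. The most effective tools are black-box methods in which auxiliary models generate diverse prompts; white-box tools help somewhat but are not the best. Training method strongly affects auditing difficulty: models trained on synthetic documents are easier to audit, while models trained on demonstrations with stronger adversarial training are the hardest. Conclusion: AuditBench provides an open, reproducible, quantitative benchmark for alignment auditing; it shows that auditing tools should be evaluated in an agentic context rather than only in isolation, and that black-box, prompt-driven methods are the most practical. Future auditing research should consider model training and auditing strategy together. Abstract: We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors (such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties), which it does not confess to when directly asked. AuditBench models are highly diverse: some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty.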
We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.

[25] Towards Better RL Training Data Utilization via Second-Order Rollout

Zhe Yang,Yudong Wang,Rang Li,Zhifang Sui

Main category: cs.CL

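TL;DR: This paper proposes a unified reinforcement learning framework that combines first-order and second-order rollouts to jointly train the generation and critique capabilities of large language models, improving data efficiency and performance.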

Details Motivation: Existing reinforcement learning methods (such as vanilla RL) focus only on generation and train with first-order rollouts, neglecting critique capability, so the potential of the training data is not fully exploited. Method: The authors introduce second-order rollout (generating multiple critiques for one response), build a unified RL framework that jointly trains generation and critique, and explore supporting techniques such as label balancing and sampling to mitigate reward noise. Result: Experiments across several models and datasets show that the method uses training data more effectively than vanilla RL and achieves better performance with the same data; they also reveal the importance of label balance in critique training and the noise problem of outcome-based rewards, along with ways to mitigate it. Conclusion: The work is a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL and offers new ideas for RL training of LLMs. Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.

[26] Imagination Helps Visual Reasoning, But Not Yet in Latent Space

You Li,Chi Chen,Yanghao Li,Fanhu Zeng,Kaiyu Huang,Jinan Xu,Maosong Sun

Main category: cs.CL

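TL;DR: Using causal mediation analysis, this paper reveals a double disconnect in latent visual reasoning, between input and latent representations and between latent representations and the answer. It questions the necessity of latent reasoning and proposes CapImagine, a simple and effective alternative based on explicit imagination in text.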

Details Motivation: To find the true source of the effectiveness of latent visual reasoning and clarify whether its mechanism really depends on reasoning in hidden states. Method: Causal mediation analysis models the causal chain from the input (treatment) through the latent tokens (mediator) to the final answer (outcome); perturbation experiments and probing analysis test how much information the latent tokens carry and what role they play. Result: There are marked causal disconnects both between input and latent tokens and between latent tokens and the answer; latent tokens encode limited visual information and are highly similar to one another; CapImagine significantly outperforms complex latent-space baselines on vision-centric benchmarks. Conclusion: Latent visual reasoning is not necessary; explicit imagination in text (CapImagine) is simpler and more effective, and the way visual reasoning is implemented should be reconsidered. Abstract: Latent visual reasoning aims to mimic the human imagination process by meditating through hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect of the latent tokens on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

[27] Probing for Knowledge Attribution in Large Language Models

Ivo Brink,Alexander Boer,Dennis Ulmer

Main category: cs.CL

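TL;DR: This paper addresses contributive attribution, the problem of identifying whether a large language model's output relies mainly on the input prompt or on internal knowledge. It introduces the self-supervised AttriWiki dataset and a simple linear probe that achieves high attribution accuracy, and it reveals a strong link between attribution errors and hallucination.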

Details Motivation: Large language models often produce fluent but unfounded "hallucinations", which divide into faithfulness errors (misusing user context) and factuality errors (stemming from incorrect internal knowledge); mitigating them requires knowing the knowledge source behind an answer. Method: The authors define the contributive attribution task; build AttriWiki, a self-supervised dataset that generates labelled examples by prompting models to recall entities either from memory or from context; and train a linear probe on hidden-layer representations to predict attribution. Result: The probe reaches up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transfers out of domain to SQuAD and WebQuestions with 0.94-0.99 Macro-F1; attribution mismatches raise error rates by up to 70%. Conclusion: Contributive attribution is a key intermediate signal for detecting hallucination, but correct attribution alone does not guarantee a correct answer, so broader detection frameworks are needed. Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing user context) and (ii) factuality violations (errors from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.

[28] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

Hyunwoo Kim,Hanau Yi,Jaehee Bae,Yumin Kim

Main category: cs.CL

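TL;DR: This paper presents Natural Language Declarative Prompting (NLD-P) as a declarative governance method for coping with "GPT-scale model drift" as large models evolve. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, with an emphasis on accessibility for non-developer users and human-led verification.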

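Details Motivation: As large language models iterate rapidly, prompt behavior is affected by changes in instruction-following policies, alignment mechanisms, and decoding strategies, producing "GPT-scale model drift"; conventional prompt engineering cannot guarantee stable, controllable behavior, so a prompting approach is needed that is oriented toward system governance, understandable to humans, and free of external code. Method: NLD-P is redefined as a declarative governance method and formalized as a control abstraction encoded in natural language with four modules: provenance, constraint logic, task content, and post-generation evaluation. The authors define minimal compliance criteria, analyze how receptive different models are to the NLD-P schema, and use schema-bound LLM-assisted drafting under a strict human-in-the-loop protocol. Result: The paper positions NLD-P as a lightweight, modular, natural-language-native prompt governance framework, argues for its feasibility and interpretability across model generations, and supports controlled prompting by non-developer users; part of the writing and editing was done by an LLM configured under NLD-P, with all core work directed and reviewed by the human author. Conclusion: NLD-P offers a scalable, auditable, low-barrier path to declarative control in an evolving LLM ecosystem; empirical studies are still needed to quantify its robustness across models and its effectiveness for governance. Abstract: The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.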
The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.

[29] TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

Reihaneh Iranmanesh,Saeedeh Davoudi,Pasha Abrishamchian,Ophir Frieder,Nazli Goharian

Main category: cs.CL

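TL;DR: This paper proposes a framework for evaluating the cultural competence of large language models in Persian. It combines rule-based morphological normalization with a hybrid syntactic-semantic similarity module for robust soft-match scoring of short answers, and it releases the first standardized benchmark for cultural understanding in Persian.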

Details Motivation: Existing Persian cultural benchmarks rely mainly on multiple-choice formats and English-centric metrics, which do not reflect Persian's morphological complexity and semantic nuance. Method: The authors build a Persian-specific short-answer evaluation framework that combines rule-based morphological normalization with a hybrid syntactic-semantic similarity module, giving soft-match scoring that goes beyond exact string matching. Result: A systematic evaluation of 15 leading open- and closed-source models shows that the hybrid evaluation improves scoring consistency by 10% over exact-match baselines and captures meaning that surface-level methods miss. Conclusion: The work releases the first standardized benchmark for cultural understanding in Persian and provides a reproducible foundation for cross-cultural LLM evaluation research. Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by 10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
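
The general shape of normalization followed by soft matching can be sketched as below. This is a deliberately simplified stand-in, not the TARAZ scorer: the character map (Arabic yeh and kaf mapped to their Persian forms, zero-width non-joiner removed) and the 0.85 threshold are assumptions, and a character-level `difflib` ratio replaces the paper's hybrid syntactic-semantic similarity module.

```python
import difflib
import unicodedata

# Assumed normalization table: Arabic yeh -> Persian yeh, Arabic kaf -> Persian
# keheh, and the zero-width non-joiner dropped.
_CHAR_MAP = {"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": ""}


def normalize(text: str) -> str:
    """Apply Unicode NFC and the assumed character-level rules."""
    text = unicodedata.normalize("NFC", text.strip())
    return "".join(_CHAR_MAP.get(ch, ch) for ch in text)


def soft_score(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1] after normalization."""
    return difflib.SequenceMatcher(
        None, normalize(prediction), normalize(reference)
    ).ratio()


def soft_match(prediction: str, reference: str, threshold: float = 0.85) -> bool:
    """Accept an answer whose similarity reaches the (assumed) threshold."""
    return soft_score(prediction, reference) >= threshold
```

Under exact matching, an answer typed with an Arabic kaf would be rejected against a reference using the Persian keheh; after normalization the two strings are identical.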

[30] TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Jianmin Li,Ying Chang,Su-Kit Tang,Yujia Liu,Yanwen Wang,Shuyuan Lin,Binkai Ou

Main category: cs.CL

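TL;DR: This paper proposes TCM-DiffRAG, a framework that combines knowledge graphs with chain-of-thought reasoning and significantly improves large language models on traditional Chinese medicine clinical diagnosis tasks, outperforming native models, fine-tuned models, and other RAG methods.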

Details Motivation: Conventional RAG methods perform poorly in TCM clinical diagnosis because of complex reasoning and large individual differences, so an improved RAG framework adapted to TCM is needed. Method: The authors build TCM-DiffRAG, which integrates TCM knowledge graphs (KG) with chain-of-thought (CoT) reasoning, and evaluate it on three TCM-specific datasets. Result: TCM-DiffRAG significantly improves models such as qwen-plus (for example, from 0.038 to 0.356), with larger gains for non-Chinese LLMs; it also outperforms supervised fine-tuned models and other RAG baselines. Conclusion: Combining structured TCM knowledge graphs with chain-of-thought reasoning strengthens LLMs on individualized TCM diagnosis; using universal and personalized knowledge graphs together helps bridge general knowledge and clinical reasoning. Abstract: Background: Retrieval-augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine-tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with chain-of-thought-based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning.
These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.

[31] Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features

Mohammad Yeghaneh Abkenar,Weixing Wang,Manfred Stede,Davide Picca,Mark A. Finlayson,Panagiotis Ioannidis

Main category: cs.CL

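TL;DR: This paper proposes a neural argumentative stance classification method based on an expanded NRC emotion lexicon (eNRC). DistilBERT embeddings are used to broaden the lexicon's coverage of emotion terms, which significantly improves cross-domain stance classification; all resources are released.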

Details Motivation: Most existing work on stance classification ignores fine-grained emotion analysis and uses data restricted to specific domains, so it generalizes poorly; emotion information needs to be incorporated systematically and cross-domain applicability improved. Method: The Bias-Corrected NRC Emotion Lexicon is expanded using DistilBERT embeddings (yielding eNRC) and fed into a neural argumentative stance classification model, which is evaluated on five datasets covering controversial topics from diverse domains. Result: eNRC beats the baseline on all five datasets (up to +6.2 F1), beats the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all datasets. Conclusion: Explicit, fine-grained expansion of an emotion lexicon improves stance classification, especially across domains; the released resources support follow-up research. Abstract: Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments, especially about controversial topics, often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources (including eNRC, the adapted corpora, and model architecture) to enable other researchers to build upon our work.
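
One simple way to expand a lexicon with embeddings is nearest-seed labeling by cosine similarity. The sketch below assumes that approach; the abstract does not spell out the eNRC procedure, so the threshold, the toy vectors, and the function names are all illustrative, and real use would substitute DistilBERT embeddings for the hand-written vectors.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(seed: dict[str, str], embeddings: dict[str, list[float]],
                   threshold: float = 0.8) -> dict[str, str]:
    """Give each unlabeled word the emotion of its most similar seed word,
    provided the similarity reaches the (assumed) threshold."""
    expanded = dict(seed)
    for word, vec in embeddings.items():
        if word in seed:
            continue
        best_label, best_sim = None, threshold
        for seed_word, label in seed.items():
            if seed_word not in embeddings:
                continue
            sim = cosine(vec, embeddings[seed_word])
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label is not None:
            expanded[word] = best_label
    return expanded
```

The threshold controls the usual precision/coverage trade-off: a lower value adds more terms to the lexicon at the risk of mislabeling emotionally neutral words.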

[32] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages

Jonathan Davidov,Aviv Slobodkin,Shmuel Tomi Klein,Reut Tsarfaty,Ido Dagan,Ayal Klein

Main category: cs.CL

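TL;DR: This paper proposes a cross-linguistic projection method that combines an English QA-SRL parser with constrained translation and word alignment to automatically generate question-answer semantic role annotations for several languages (Hebrew, Russian, French). The resulting data supports efficient construction of high-quality language-specific semantic parsers that outperform multilingual LLMs such as GPT-4o and LLaMA-Maverick.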

Details Motivation: Annotating predicate-argument structure depends on costly manual work and has been concentrated on English, leaving other languages poorly covered. Method: Building on the Question-Answer driven Semantic Role Labeling (QA-SRL) framework, the authors design a cross-linguistic projection method: an English QA-SRL parser is reused inside a constrained translation and word-alignment pipeline that maps English question-answer pairs onto target-language predicates, producing aligned multilingual QA-SRL annotations. Result: High-quality training data is built for Hebrew, Russian, and French, and the fine-tuned language-specific parsers significantly outperform strong multilingual LLM baselines such as GPT-4o and LLaMA-Maverick. Conclusion: QA-SRL can serve as a transferable natural-language interface for semantics, allowing predicate-argument analysis to be extended to many languages at low cost. Abstract: Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework, a natural-language formulation of predicate-argument relations, as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French, which span diverse language families, the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
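
The word-alignment step of annotation projection can be illustrated with a common heuristic: map a source-side span to the smallest target-side span covering all aligned tokens. This is a generic sketch, not the paper's pipeline; the alignment format (pairs of source and target token indices) and the function name are assumptions.

```python
from typing import Optional


def project_span(span: tuple[int, int],
                 alignment: list[tuple[int, int]]) -> Optional[tuple[int, int]]:
    """Project an inclusive source token span onto the target sentence.

    alignment holds (source_index, target_index) pairs, e.g. from a word
    aligner. Returns the smallest target span covering every aligned token,
    or None if nothing in the source span is aligned.
    """
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return (min(targets), max(targets))
```

Because word order differs across languages, the projected span may include target tokens that were not aligned to the source answer, which is one reason a pipeline of this kind typically needs filtering or constraints on the translation step.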

[33] Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

Yushi Ye,Feng Hong,Huangjie Zheng,Xu Chen,Zhiyong Chen,Yanfeng Wang,Jiangchao Yao

Main category: cs.CL

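TL;DR: This paper proposes ReMix, a framework that introduces a continuous mixing state and a rejection rule to achieve a 2-8x inference speedup without degrading generation quality, addressing the combinatorial contradiction problem in parallel decoding for diffusion large language models.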

Details Motivation: Diffusion large language models (DLLMs) face a severe quality-speed trade-off in parallel decoding, rooted in the "combinatorial contradiction" phenomenon, in which tokens generated in parallel are semantically inconsistent with one another. Method: ReMix (Rejection Mixing) introduces a Continuous Mixing State as an intermediate representation between the initial masked state and the final discrete token state, so that token representations can be refined iteratively in continuous space; a rejection rule sends uncertain representations back to the masked state for reprocessing. Result: As a training-free method, ReMix achieves a 2-8x inference speedup on multiple benchmarks with no loss in generation quality. Conclusion: By adding continuous-space refinement to discrete diffusion decoding, ReMix mitigates combinatorial contradictions and offers a new approach to efficient, high-quality non-autoregressive text generation. Abstract: Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8x inference speedup without any quality degradation.

[34] Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Roy Miles,Aysim Toker,Andreea-Maria Oncescu,Songcen Xu,Jiankang Deng,Ismail Elezi

Main category: cs.CL

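TL;DR: This paper proposes Stitching Noisy Diffusion Thoughts, a self-consistency framework that aggregates high-quality intermediate steps from many diffusion-sampled reasoning trajectories at the step level, builds a composite rationale, and uses an autoregressive model to produce the final answer, improving accuracy and reducing latency on math and code reasoning tasks.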

Details Motivation: Existing aggregation strategies for reasoning mostly work at the trajectory level (such as voting or picking the best path) and discard the useful information in partially correct or nearly correct intermediate steps. Method: (1) Sample many cheap, diverse reasoning trajectories with a masked diffusion language model; (2) score every intermediate step with an off-the-shelf process reward model (PRM); (3) stitch the highest-quality steps across trajectories into a composite rationale; (4) condition an autoregressive model on that rationale so that it recomputes only the final answer. Result: Across math and coding reasoning benchmarks, average accuracy improves by up to 23.8% and latency drops by up to 1.8x, with larger gains on harder problems; ablations show that the AR solver is essential for turning imperfect stitched rationales into accurate answers. Conclusion: Step-level stitching is an efficient, modular, training-free way to strengthen reasoning that preserves broad search while improving performance and efficiency on complex reasoning tasks. Abstract: Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR).
Code is available at https://github.com/roymiles/diffusion-stitching.
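
Steps (ii) and (iii) of the pipeline described above, scoring steps and stitching the best ones, can be sketched in a few lines. This is a simplified reading, not the released code: it assumes trajectories are already split into steps aligned by position, and `score_step` is a hypothetical stand-in for a process reward model.

```python
from typing import Callable


def stitch(trajectories: list[list[str]],
           score_step: Callable[[str], float]) -> list[str]:
    """Build a composite rationale by taking, at each step position, the
    highest-scoring candidate among all trajectories that reach that position.

    Assumes steps are comparable by index, which is a simplification.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    max_len = max(len(t) for t in trajectories)
    rationale = []
    for i in range(max_len):
        candidates = [t[i] for t in trajectories if i < len(t)]
        rationale.append(max(candidates, key=score_step))
    return rationale
```

In the described pipeline the stitched rationale is not the final output: it is passed as context to an autoregressive solver, which the ablations identify as important for recovering from imperfect stitching.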

[35] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg,Oren Gal

Main category: cs.CL

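TL;DR: This paper studies how OCR information is routed into the language processing stream of vision-language models (VLMs). Using causal interventions and activation-difference analysis, it finds OCR bottlenecks at different depths in different architectures; the OCR signal is low-dimensional and transfers across datasets; and, unexpectedly, removing the OCR signal improves counting performance in some modular models.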

Details Motivation: To determine where and how OCR information enters the language processing stream of VLMs, and how different architectures integrate text that appears in images. Method: Causal interventions that compare activations on original images with those on text-inpainted images to locate OCR bottlenecks, combined with principal component analysis (PCA) to assess the dimensionality of the OCR signal and its transfer across datasets. Result: DeepStack models (e.g., Qwen3-VL) show peak OCR sensitivity at mid-depth (about 50%), whereas single-stage projection models (Phi-4, InternVL3.5) peak in early layers (6-25%); the OCR signal is highly low-dimensional (PC1 explains 72.9% of variance), and PCA directions transfer across datasets; in modular models such as Qwen3-VL-4B, removing OCR improves counting performance by up to +6.9 percentage points. Conclusion: OCR routing depends strongly on architecture, and its representation is compact and shared across datasets; heavy reliance on OCR can interfere with other visual tasks, especially in more modular VLMs, which suggests rethinking how OCR and visual understanding work together. Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
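
To make the "PC1 captures X% of variance" statistic concrete, the sketch below computes the share of variance explained by the first principal component of a set of activation-difference vectors. Everything here is illustrative: the data are synthetic, the function name `pc1_variance_ratio` is hypothetical, and power iteration on the covariance matrix is simply one convenient way to get the top eigenvalue without external libraries; the paper's own procedure is not specified in the abstract.

```python
import random


def pc1_variance_ratio(diffs, iters=200):
    """Fraction of total variance explained by the first principal component."""
    n, d = len(diffs), len(diffs[0])
    mean = [sum(row[j] for row in diffs) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in diffs]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the top eigenvector of the covariance.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the top eigenvalue.
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total = sum(cov[a][a] for a in range(d))  # trace = total variance
    return top / total


# Synthetic "activation differences": one dominant direction plus small noise.
rng = random.Random(0)
diffs = []
for _ in range(200):
    t = rng.gauss(0, 1)
    diffs.append([t + rng.gauss(0, 0.05),
                  2 * t + rng.gauss(0, 0.05),
                  rng.gauss(0, 0.05)])

ratio = pc1_variance_ratio(diffs)
print(f"PC1 explains {ratio:.1%} of variance")
```

On this toy data the ratio is close to 1 because a single direction was planted; a value such as the reported 72.9% would indicate one dominant direction with a non-trivial remainder.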

[36] Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae,Baeseong Park,Gunho Park,Minsub Kim,Joonhyung Lee,Junhee Yoo,Sunghyeon Woo,Jiwon Ryu,Se Jung Kwon,Dongsoo Lee

Main category: cs.CL

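TL;DR: This paper proposes Affine-Scaled Attention, which applies an input-dependent scale and bias to softmax-normalized attention weights, relaxing the strict normalization constraint of standard attention and improving training stability and downstream performance.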

Details Motivation: Standard softmax attention enforces unit-sum normalization, which limits flexible control over attention magnitudes and can lead to overly concentrated attention or unstable training. Method: Affine-Scaled Attention applies an input-dependent affine transformation (a scale plus a bias) to the softmax-normalized attention weights, preserving value aggregation while adjusting both the attention distribution and its overall scale. Result: In large-scale language model pretraining, the method consistently improves training stability, optimization behavior, and downstream task performance over standard softmax attention and attention-sink baselines. Conclusion: Modest reweighting of attention outputs is a practical and effective way to improve attention behavior in Transformer models. Abstract: Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
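
A rough single-query sketch of the idea follows, assuming the adjusted weights take the form `alpha * softmax(scores) + beta`. The abstract says the scale and bias are input-dependent but does not say how they are computed, so here `alpha` and `beta` are plain arguments; that simplification, and the function name `affine_scaled_attention`, are assumptions rather than the paper's formulation.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def affine_scaled_attention(scores, values, alpha, beta):
    """Attention for one query with affine-adjusted softmax weights.

    alpha, beta: stand-ins for the input-dependent scale and bias
    (assumed to be supplied by some other part of the model).
    """
    weights = softmax(scores)                       # sums to 1
    adjusted = [alpha * w + beta for w in weights]  # no longer forced to sum to 1
    dim = len(values[0])
    output = [sum(a * v[k] for a, v in zip(adjusted, values)) for k in range(dim)]
    return output, adjusted


scores = [2.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0]]
output, adjusted = affine_scaled_attention(scores, values, alpha=0.5, beta=0.1)
print(sum(adjusted))  # alpha * 1 + beta * number_of_keys
```

With `alpha=1` and `beta=0` this reduces to standard softmax attention; other settings change the total attention mass, which is the relaxation of the unit-sum constraint that the abstract refers to.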

[37] Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

Gabriela Anna Kaczmarek,Pietro Ferrazzi,Lorenzo Porta,Vicky Rubini,Bernardo Magnini

Main category: cs.CL

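TL;DR: This paper introduces a new dataset of Italian emergency department clinical notes for automatic case report form (CRF) filling, and examines the feasibility of the task with large language models (LLMs) in a zero-shot setting, along with the biases they exhibit.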

Details Motivation: Annotated CRF data for training and evaluating LLMs is scarce, which limits progress on automatic CRF filling. Method: Italian emergency department clinical notes are manually annotated against a pre-defined CRF with 134 items; the CRF-filling task and an evaluation metric are defined; zero-shot pilot experiments are run with an open-source state-of-the-art LLM. Result: (i) CRF filling from real Italian clinical notes can be approached in a zero-shot setting; (ii) LLM outputs are affected by biases (e.g., a tendency to answer "unknown") that need to be corrected. Conclusion: The dataset is a valuable resource for research on automatic CRF filling, and it exposes the limitations of current LLMs on clinical text and directions for improvement. Abstract: Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs' results are affected by biases (e.g., a cautious behaviour favours "unknown" answers), which need to be corrected.

[38] Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

Yuqi Shi,Hao Yang,Xiyao Lu,Jinsong Zhang

Main category: cs.CL

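TL;DR: This study examines the fossilization and stability of the syntax-prosody interface in Vietnamese learners of L2 Mandarin. High-proficiency learners approach native speakers in the number of prosodic boundaries but deviate systematically in syntax-prosody mapping: they demote the boundary at the SBV interface and promote the boundary at the VOB interface, inverting the prosodic hierarchy.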

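Details Motivation: L2 learners can acquire target word order, yet mapping that syntax onto appropriate prosodic structure remains difficult; the study examines the fossilization and stability of the L2 syntax-prosody interface. Method: Using the BLCU-SAIT corpus, 67 native Mandarin speakers are compared with 67 Vietnamese learners; C-ToBI prosodic boundary annotation is combined with Dependency Grammar analysis to examine the number of prosodic boundaries and their mapping to syntactic relations. Result: High-proficiency Vietnamese learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3) but diverge significantly in mapping: the boundary at the SBV interface is demoted (B3 -> B1) and the boundary at the VOB interface is erroneously promoted (B1 -> B3), producing an inverted prosodic hierarchy. Conclusion: Acquisition of the L2 syntax-prosody interface is non-linear, and matching the native boundary count does not imply accurate structural mapping; learners sacrifice structural accuracy to sustain long phrasal output, which leaves a fossilized distortion of the prosodic hierarchy. Abstract: While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 -> Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 -> Major Phrase B3). This strategy allows learners to maintain high long phrasal output at the expense of structural accuracy. This results in a distorted prosodic hierarchy where the native pattern is inverted.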

[39] CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

Mengze Hong,Di Jiang,Chen Jason Zhang,Zichang Guo,Yawen Li,Jun Chen,Shaobo Cui,Zhiyang Su

Main category: cs.CL

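TL;DR: This paper presents CiteLLM, a trustworthy reference discovery platform that runs locally inside the LaTeX editor and combines discipline-aware dynamic routing with LLM-assisted retrieval to produce hallucination-free, highly relevant, and explainable scholarly references.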

Details Motivation: To address three ethical challenges of LLM-based academic assistance: the low trustworthiness of AI-generated content, the difficulty of preserving academic integrity and intellectual property, and the risk of leaking private information. Method: CiteLLM embeds LLM utilities in a local LaTeX editor; dynamic discipline-aware routing retrieves candidate references from trusted academic repositories; the LLM is used only to generate context-aware search queries, rank candidates by relevance, and validate support through paragraph-level semantic matching, with an integrated chatbot providing explanations. Result: The evaluation shows that the system returns valid and highly usable references and outperforms existing methods in accuracy, trustworthiness, and practicality. Conclusion: CiteLLM delivers local reference discovery with no data leaving the system and a low risk of hallucination, offering a workable model for AI-assisted academic writing that balances efficiency and ethics. Abstract: Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.

[40] Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

Boyang Zhang,Yang Zhang

Main category: cs.CL

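TL;DR: This paper proposes SALA, which combines stylometric features with LLM reasoning to assess and mitigate the risk of author identification in news text, and designs a rewriting strategy based on the agent's reasoning trace to protect author privacy.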

Details Motivation: Rapid advances in large language models (LLMs) have strengthened authorship inference, raising concerns about unintended deanonymization in textual data such as news articles. Method: The SALA (Stylometry-Assisted LLM Analysis) method integrates quantitative stylometric features with LLM reasoning; an LLM agent framework adds a database module to strengthen inference; a guided recomposition strategy uses the reasoning trace to generate rewriting prompts. Result: On large-scale news datasets, SALA achieves high authorship inference accuracy across scenarios, particularly with the database module; the rewriting strategy effectively reduces authorship identifiability while preserving meaning. Conclusion: LLM agents have substantial deanonymization potential and call for interpretable, proactive defenses to protect author privacy; SALA offers a new approach and a practical tool for text anonymization. Abstract: The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed $\textit{SALA}$ (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that $\textit{SALA}$, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent's reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.

[41] Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Jayadev Billa

Main category: cs.CL

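TL;DR: This paper identifies a mismatch between the decoder and non-text modality information in multimodal large language models (MLLMs): speech and visual attributes (such as speaker identity, emotion, and texture) are well preserved through every layer, but a decoder trained only on text cannot use these non-text-aligned directions, so they act as noise. The author formalizes the bottleneck with Generalized Mutual Information (GMI) and uses a controlled experiment and a LoRA intervention to show that the decoder's scoring rule, not the encoder or adapter design, is the fundamental limit.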

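Details Motivation: Multimodal LLMs accept speech and images yet struggle to identify a speaker's vocal characteristics or an object's surface texture; the author asks whether the root cause is a failure of encoding or a mismatched decoding mechanism. Method: Linear probes quantify how well speech and visual attributes are preserved at each layer; a mismatched-decoder model bounds accessible information by the Generalized Mutual Information (GMI); the theory is validated on five models spanning speech and vision; causal evidence comes from a controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) and a LoRA fine-tuning experiment with an added emotion objective. Result: Speaker identity, emotion, and visual attributes survive through every LLM layer (3-55x above chance), yet removing 64-71% of modality-specific variance improves decoder loss; the GMI bound is supported empirically; the controlled experiment confirms that the bottleneck is the decoder's scoring rule; adding an emotion objective via LoRA improves emotion accessibility by +7.5% without affecting other attributes. Conclusion: The information bottleneck in multimodal LLMs lies not in the encoder or projection module but in a text-trained decoder that has no learned use for non-text-aligned directions; making a modality-specific attribute accessible requires changing the decoder's training objective (e.g., adding the corresponding supervision) rather than only improving the encoder or adapter. Abstract: Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.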

[42] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal,Yannis Katsis,Vraj Shah,Lihong He,Lucian Popa,Marina Danilevsky

Main category: cs.CL

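TL;DR: This paper presents MTRAG-UN, a benchmark for open challenges in multi-turn retrieval-augmented generation (RAG), with 666 tasks, over 2,800 conversation turns, and accompanying corpora, and shows that current models struggle with unanswerable, underspecified, and non-standalone questions and with unclear responses.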

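Details Motivation: To explore unresolved challenges in multi-turn RAG, in particular conversations involving unanswerable, underspecified, and non-standalone questions and unclear responses. Method: The MTRAG-UN benchmark is built and released, covering 6 domains, 666 tasks, over 2,800 conversation turns, and accompanying retrieval corpora; mainstream retrieval and generation models are evaluated on it. Result: Existing retrieval and generation models perform markedly worse on UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses, which exposes their limitations. Conclusion: MTRAG-UN provides a standardized testbed for multi-turn RAG research and highlights the need for models to better understand and respond to complex conversational states. Abstract: We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark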

[43] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Chungpa Lee,Jy-yong Sohn,Kangwook Lee

Main category: cs.CL

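TL;DR: This paper studies how fine-tuning large language models affects in-context learning, and shows that restricting updates to the value matrix of the attention mechanism, or adding an auxiliary few-shot loss, can balance zero-shot performance against in-context learning ability.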

Details Motivation: Fine-tuning improves zero-shot performance but often degrades in-context learning on unseen tasks, which calls for a theoretical explanation and better methods. Method: A theoretical analysis based on linear attention models examines how different fine-tuning objectives (updating all attention parameters, updating only the value matrix, and adding an auxiliary few-shot loss) change attention parameters and few-shot performance, with empirical validation. Result: Fine-tuning all attention parameters can harm in-context learning; updating only the value matrix improves zero-shot performance while preserving in-context learning; an auxiliary few-shot loss enhances in-context learning on the target task but degrades it on tasks not seen during fine-tuning. Conclusion: Fine-tuning strategies must trade off zero-shot performance against the robustness of in-context learning; which parameters are updated and how the loss is designed are the key levers. Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.

[44] Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li,Dilxat Muhtar,Lu Yin,Tianlong Chen,Shiwei Liu

Main category: cs.CL

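TL;DR: This paper proposes NAP, which restructures training data and the decoding strategy so that diffusion language models (DLMs) generate in a genuinely non-autoregressive, parallel manner, mitigating their tendency to fall back to autoregressive-like behavior in practice; its effectiveness is validated on math reasoning tasks.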

Details Motivation: Diffusion language models are advertised as supporting parallel generation but in practice often fall back to autoregressive (AR)-like decoding, mainly because the training objective is mismatched with highly sequential data such as standard pretraining corpora and long chain-of-thought (CoT) supervision. Method: NAP (Non-Autoregressive Parallel DLMs) takes a data-centric approach: training examples are curated as multiple independent reasoning trajectories and paired with a parallel-forced decoding strategy that encourages multi-token parallel updates. Result: On math reasoning benchmarks, NAP performs better under parallel decoding than DLMs trained on standard long CoT data, and the gains grow as parallelism increases. Conclusion: Revisiting training data and supervision is a feasible and principled direction for moving DLMs toward genuinely non-autoregressive parallel generation. Abstract: Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.

[45] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu,Jiahui Xu,Feng Jiang,Kuang Wang,Zefeng Zhao,Chu-Ren Huang,Jinghang Gu,Changqing Yin,Haizhou Li

Main category: cs.CL

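TL;DR: This paper proposes the DDTSR framework, a dual-track streaming response mechanism (small-large model synergy, streaming cross-modal collaboration, and curriculum learning for discourse continuity) that reduces response latency in cascaded spoken dialogue systems by 19%-51% while preserving discourse quality.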

Details Motivation: Conventional cascaded ASR-LLM-TTS systems have high response latency because speech synthesis cannot begin until transcription and reasoning are complete, which makes human-like real-time interaction hard to achieve. Method: The Discourse-Aware Dual-Track Streaming Response (DDTSR) framework rests on three mechanisms: (1) connective-guided small-large model synergy; (2) streaming-based cross-modal collaboration; (3) curriculum-learning-based discourse continuity enhancement. Result: On two spoken dialogue benchmarks, DDTSR reduces response latency by 19%-51% without degrading discourse quality; it is plug-and-play, compatible with diverse LLM backbones, and robust across utterance lengths. Conclusion: DDTSR is a practical, scalable low-latency architecture for spoken dialogue that supports human-like listen-while-thinking and speak-while-thinking interaction. Abstract: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.

[46] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Sungho Park,Jueun Kim,Wook-Shin Han

Main category: cs.CL

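TL;DR: This paper presents SPARTA, a framework that automatically generates large-scale, high-quality benchmarks for question answering over tables and text, supporting deep multi-hop reasoning and complex operations (such as aggregation and grouping) and exposing the limits of current cross-modal QA models.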

Details Motivation: Existing Table-Text QA benchmarks are small, manually built and therefore error-prone, and shallow (questions rarely need more than two hops and lack complex operations such as aggregation or grouping), so they cannot measure a model's real cross-modal reasoning ability. Method: SPARTA is an end-to-end automatic construction framework: (1) it builds grounding tables from atomic facts extracted from the accompanying text and uses them to enrich the source tables into a reference fact database; (2) it synthesizes nested queries that match a desired hop count; (3) it applies provenance-based refinement (to ensure non-empty results) and realistic-structure enforcement (restricting generation to post-order traversals), so that the SQL is executable and the questions read naturally. Result: The pipeline produces thousands of high-fidelity question-answer pairs covering aggregation, grouping, and deep multi-hop reasoning; state-of-the-art models that exceed 70 F1 on HybridQA drop by more than 30 F1 points on SPARTA, revealing fundamental weaknesses. Conclusion: SPARTA offers an efficient, scalable way to build benchmarks with little manual effort; the resulting benchmark evaluates joint reasoning over tables and text more rigorously and exposes open challenges, pushing the field toward more complex and robust reasoning. Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. 
On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.

[47] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta,Smruthi Balaji,Sriram Ganapathy

Main category: cs.CL

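TL;DR: This paper proposes MiSTER-E, a modular Mixture-of-Experts (MoE) framework that decouples modality-specific modeling from multimodal fusion. It extracts speech and text features with fine-tuned LLMs, applies convolutional-recurrent context modeling and dynamic gated fusion, and adds a contrastive loss and KL regularization to improve cross-modal consistency, outperforming several baseline systems on IEMOCAP, MELD, and MOSI.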

Details Motivation: Emotion recognition in conversations must model both multi-turn temporal dynamics and multimodal cues; existing methods struggle to decouple within-modality context modeling from cross-modal fusion and often rely on speaker identity. Method: MiSTER-E (1) extracts utterance-level speech and text embeddings with fine-tuned LLMs; (2) enhances them with a convolutional-recurrent context modeling layer; (3) combines a speech expert, a text expert, and a cross-modal expert through a learned gate that dynamically weighs their outputs; (4) adds a supervised contrastive loss between paired speech-text representations and a KL-divergence regularizer across expert predictions. Speaker identity is not used at any stage. Result: Weighted F1-scores of 70.9%, 69.5%, and 87.9% on IEMOCAP, MELD, and MOSI, respectively, outperforming several baselines; ablations confirm the contribution of each component. Conclusion: Through its decoupled design and consistency constraints, MiSTER-E improves multimodal emotion recognition in conversations without speaker information, supporting the effectiveness of a modular MoE with joint optimization. Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
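
The gated combination of three expert predictions can be sketched as a softmax-weighted average of their class distributions. This is only an illustration of the general mixture-of-experts gating pattern: the function name `gated_fusion`, the use of a softmax over gate logits, and the toy numbers are assumptions, and in the actual system the gate would be learned from the inputs rather than passed in.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gated_fusion(expert_probs, gate_logits):
    """Weight each expert's class distribution by a softmax gate and sum them."""
    gates = softmax(gate_logits)  # one non-negative weight per expert, summing to 1
    n_classes = len(expert_probs[0])
    return [sum(g * p[c] for g, p in zip(gates, expert_probs))
            for c in range(n_classes)]


# Toy class distributions from the three experts (hypothetical numbers).
speech_expert = [0.7, 0.3]
text_expert = [0.2, 0.8]
cross_modal_expert = [0.6, 0.4]

# Equal gate logits give a plain average of the three experts.
fused = gated_fusion([speech_expert, text_expert, cross_modal_expert],
                     gate_logits=[0.0, 0.0, 0.0])
print(fused)
```

Because the gate weights sum to 1 and each expert outputs a distribution, the fused output is itself a valid distribution; raising one gate logit shifts the result toward that expert.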

[48] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Amita Kamath,Jack Hessel,Khyathi Chandu,Jena D. Hwang,Kai-Wei Chang,Ranjay Krishna

Main category: cs.CL

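TL;DR: This paper argues that the weak reasoning of vision-language models (VLMs) stems from reporting bias in their training data: when people describe images, they usually omit the tacit information that reasoning requires. Analyzing the data behind popular VLMs through the lens of pragmatics, the authors find that four reasoning skills (spatial, temporal, negation, and counting) are underrepresented, that models perform poorly on them, and that the skills do not emerge from scale alone; adding annotations collected specifically to capture tacit information improves them.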

Details Motivation: VLMs lack reasoning capabilities, and the authors attribute this to reporting bias in the training data: people describing images habitually omit the tacit information needed to supervise reasoning. Method: Drawing on theories from pragmatics, the authors analyze the training data of the popular VLMs OpenCLIP, LLaVA-1.5, and Molmo, identify four underrepresented reasoning skills (spatial, temporal, negation, and counting), build targeted benchmarks, and evaluate how scale, multilinguality, and annotation-based augmentation affect these skills. Result: (i) VLMs perform poorly on the four types of reasoning; (ii) scaling data size, model size, or the number of languages does not make these skills emerge by default; (iii) incorporating annotations collected specifically to obtain tacit information is effective. Conclusion: Reasoning capabilities do not emerge from scale by default; reporting bias has to be countered with more intentional data curation (such as explicit annotation of tacit information) rather than by relying on data or model scale. Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.

cs.CV [Back]

[49] Enabling clinical use of foundation models in histopathology

Audun L. Henriksen,Ole-Johan Skrede,Lisa van der Schee,Enric Domingo,Sepp De Raedt,Ilyá Kostolomov,Jennifer Hay,Karolina Cyll,Wanja Kildal,Joakim Kalsnes,Robert W. Williams,Manohar Pradhan,John Arne Nesheim,Hanne A. Askautrud,Maria X. Isaksen,Karmele Saez de Gordoa,Miriam Cuatrecasas,Joanne Edwards,TransSCOT group,Arild Nesbakken,Neil A. Shepherd,Ian Tomlinson,Daniel-Christoph Wagner,Rachel S. Kerr,Tarjei Sveinsgjerd Hveem,Knut Liestøl,Yoshiaki Nakamura,Marco Novelli,Masaaki Miyo,Sebastian Foersch,David N. Church,Miangela M. Lacle,David J. Kerr,Andreas Kleppe

Main category: cs.CV

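TL;DR: This paper proposes introducing robustness losses when training downstream task-specific models, reducing the sensitivity of histopathology foundation models to technical variability (such as scanner differences) and improving generalization and accuracy in real clinical settings without retraining the foundation models.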

Details Motivation: Current histopathology foundation models capture not only biologically relevant features but also pre-analytic and scanner-specific variation, which biases the predictions of downstream task-specific models and limits clinical usefulness. Method: Novel robustness losses are introduced during training of downstream task-specific models; a large experimental setup with 27,042 whole-slide images (WSIs) from 6,155 patients is used to train and evaluate thousands of models on the features of eight popular foundation models. Result: Robustness to technical variability improves substantially and prediction accuracy also improves; the approach mitigates the robustness issues of foundation models without retraining them. Conclusion: The approach provides a route to robust, accurate downstream models in computational pathology without modifying the foundation models, supporting their use in routine clinical practice. Abstract: Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.

Liping Meng,Fan Nie,Yunyun Zhang,Chao Han

Main category: cs.CV

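TL;DR: This paper proposes MNAS-Unet, a medical image segmentation framework that combines Monte Carlo Tree Search (MCTS) with Neural Architecture Search (NAS), improving architecture search efficiency and segmentation accuracy and outperforming existing methods on several datasets.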

Details Motivation: To make neural architecture search for medical image segmentation more efficient and accurate while reducing computational cost. Method: MNAS-Unet uses MCTS to explore architectures efficiently and optimizes the DownSC and UpSC unit structures for fast, precise model adjustment. Result: Segmentation accuracy exceeds NAS-Unet and other state-of-the-art models on PROMISE12, Ultrasound Nerve, and CHAOS; the architecture search budget is reduced by 54%, and the model has only 0.6M parameters and lower GPU memory consumption. Conclusion: MNAS-Unet maintains high segmentation accuracy while greatly improving search efficiency and reducing resource use, which makes it more practical to deploy. Abstract: This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
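
The 54% figure is consistent with the two epoch counts quoted in the abstract, assuming the search budget is measured in epochs (the abstract implies this but does not state it outright):

```latex
\[
1 - \frac{139}{300} = \frac{161}{300} \approx 0.537 \approx 54\%
\]
```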

[51] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

Hanyang Liu,Rongjun Qin

Main category: cs.CV

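TL;DR: This paper presents AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos that uses geometry lifting and physics-constrained optimization to resolve the depth ambiguity and unstable motion estimation of single-view aerial dynamic reconstruction.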

Details Motivation: Existing 4D scene reconstruction methods are limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects that are small and differ greatly in motion; these conditions cause depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. Method: AeroDGS comprises a Monocular Geometry Lifting module (which reconstructs static and dynamic geometry) and a Physics-Guided Optimization module (which adds differentiable ground-support, upright-stability, and trajectory-smoothness priors), and jointly refines static backgrounds and dynamic entities. Result: On synthetic and real UAV datasets, AeroDGS outperforms state-of-the-art methods and achieves higher reconstruction fidelity in dynamic aerial environments. Conclusion: Geometric priors and physical constraints mitigate the ill-posedness of monocular aerial 4D reconstruction, giving a robust and consistent approach to modeling dynamic aerial scenes. Abstract: Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.

[52] Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

Zhengkang Fan,Chengkun Sun,Russell Terry,Jie Xu,Longin Jan Latecki

Main category: cs.CV

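TL;DR: This paper proposes a deep learning framework that requires no manual segmentation and improves renal tumor malignancy prediction through an Organ Focused Attention (OFA) loss function.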

Details Motivation: Existing imaging methods fall short of accurately predicting renal tumor malignancy before surgery; manual segmentation improves model performance but is time-consuming, costly, and dependent on expert knowledge. Method: A deep learning framework based on an Organ Focused Attention (OFA) loss function restricts the attention of image patches to the organ region, which removes the need to manually segment 3D renal CT images at deployment time. Result: On the private UF IDR dataset the framework reaches an AUC of 0.685 and an F1-score of 0.872; on the public KiTS21 dataset it reaches an AUC of 0.760 and an F1-score of 0.852, both better than conventional models that rely on segmentation-based cropping. Conclusion: The method enables more efficient and reliable renal tumor malignancy prediction without relying on manual segmentation, which can support clinical decision-making. Abstract: Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
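
One way to read "organ patches attend only to other organ patches" is as a training-time penalty on the attention mass that organ patches place on non-organ patches. The sketch below illustrates that reading only. The function name, the row-stochastic attention matrix, and the use of a binary organ mask during training are assumptions; the paper's actual OFA loss is not given in the abstract.

```python
def organ_focused_attention_penalty(attn, organ_mask):
    """Average attention mass that organ patches spend on non-organ patches.

    attn[i][j] is the attention from patch i to patch j (each row sums to 1).
    organ_mask[i] is True when patch i lies inside the organ (assumed to be
    available at training time only). Hypothetical formulation.
    """
    organ_rows = [i for i, is_organ in enumerate(organ_mask) if is_organ]
    if not organ_rows:
        return 0.0
    leaked = 0.0
    for i in organ_rows:
        # Sum the attention this organ patch gives to background patches.
        leaked += sum(a for a, is_organ in zip(attn[i], organ_mask) if not is_organ)
    return leaked / len(organ_rows)


attn = [
    [0.5, 0.5, 0.0],  # organ patch, attends only to organ patches
    [0.2, 0.6, 0.2],  # organ patch, leaks 0.2 to the background patch
    [0.1, 0.1, 0.8],  # background patch, not constrained
]
organ_mask = [True, True, False]
penalty = organ_focused_attention_penalty(attn, organ_mask)
```

Driving this penalty to zero during training would leave the mask unnecessary at inference, which matches the abstract's claim that no segmentation is needed at deployment time.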

[53] Vision Transformers Need More Than Registers

Cheng Shi,Yizhou Yu,Sibei Yang

Main category: cs.CV

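TL;DR: Through systematic analysis, this paper finds that the artifacts in Vision Transformers (ViTs) stem from a "lazy aggregation" behavior, in which the model uses semantically irrelevant background patches as shortcuts to represent global semantics; it proposes selectively integrating patch features into the CLS token, which consistently improves performance across 12 benchmarks.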

Details Motivation: Artifacts are widely observed in ViTs across supervision paradigms and downstream tasks, but their underlying mechanism remains unclear and calls for deeper analysis. Method: Systematic analysis reveals the nature of the lazy aggregation behavior in ViTs, and an improved strategy selectively integrates patch features into the CLS token to weaken background-dominated shortcuts. Result: The method yields consistent gains on 12 benchmarks spanning label, text, and self-supervision. Conclusion: Artifacts in ViTs mainly arise from lazy aggregation induced jointly by global attention and coarse-grained semantic supervision; selective feature integration mitigates the problem and offers a new perspective on ViT behavior. Abstract: Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

[54] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie,Anas Mahmoud,Aldo Zaimi,Arsene Fansi Tchango,Steven L. Waslander

Main category: cs.CV

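TL;DR: This paper proposes DeBias-CLIP, which removes the summary sentence from long captions and applies sentence sub-sampling and text token padding to mitigate CLIP's opening-sentence shortcut bias in long-text alignment, improving both long- and short-text retrieval without additional parameters.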

Details Motivation: CLIP pretraining relies mainly on images paired with short captions, which leads to coarse alignment on complex scenes and dense descriptions; existing fine-tuning on long captions is still affected by the opening-sentence shortcut bias caused by the "summary sentence followed by details" structure. Method: DeBias-CLIP removes the opening summary sentence of long captions during training and combines sentence sub-sampling with text token padding so that supervision is distributed across all token positions. Result: It achieves state-of-the-art long-text retrieval, improves short-text retrieval, is more robust to sentence-order permutations, and is a drop-in replacement that introduces no new parameters. Conclusion: The opening-sentence structure is a key source of bias in long-caption fine-tuning; explicitly separating summary from detail and balancing text supervision improves CLIP's fine-grained alignment. Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
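
The caption-side steps described above (dropping the summary sentence, then sub-sampling the remaining sentences) can be sketched as a small preprocessing function. This is a rough illustration, not the authors' pipeline: the function name, the `keep_ratio` parameter, and the regex-based sentence splitter are assumptions, and the text token padding step is left out because the abstract does not describe how it is done.

```python
import math
import random
import re


def debias_caption(caption, keep_ratio=0.5, rng=None):
    """Drop the opening summary sentence, then keep a random subset of the rest.

    Hypothetical helper: sentence order is preserved among the kept sentences.
    """
    rng = rng or random.Random()
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption.strip()) if s.strip()]
    # Treat the first sentence as the summary; keep it only if it is all we have.
    details = sentences[1:] if len(sentences) > 1 else sentences
    k = max(1, math.ceil(len(details) * keep_ratio))
    kept = sorted(rng.sample(range(len(details)), k))
    return " ".join(details[i] for i in kept)


caption = (
    "A dog in a park. The dog is brown. It holds a red ball. Trees line the path."
)
training_text = debias_caption(caption, keep_ratio=0.5, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the sub-sampling reproducible; during training one would normally resample for every step.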

[55] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Yibo Peng,Peng Xia,Ding Zhong,Kaide Zeng,Siwei Han,Yiyang Zhou,Jiaqi Liu,Ruiyi Zhang,Huaxiu Yao

Main category: cs.CV

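TL;DR: This paper introduces the Visualized-Question (VQ) setting to diagnose whether MLLMs actually use text embedded in images and finds that the models exhibit "modality laziness"; it then proposes SimpleOCR, a plug-and-play training strategy requiring no architectural changes, which renders text queries onto images with randomized styles to force the model to activate its visual text extraction pathways, yielding gains on several OOD benchmarks with high data efficiency.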

Details Motivation: To examine whether multimodal large language models (MLLMs) genuinely read text in images or merely rely on parametric shortcuts in the text prompt, which would expose a flaw in their visual grounding mechanism. Method: The Visualized-Question (VQ) diagnostic setting renders text queries directly onto images to force visual engagement; building on it, the SimpleOCR training strategy converts training samples into the VQ format with randomized rendering styles, removing text-prompt shortcuts and driving the model to optimize its visual OCR pathway. Result: On Qwen2.5-VL, the VQ setting causes a performance drop of up to 12.7%, confirming modality laziness; on four OOD benchmarks SimpleOCR surpasses the base model by 5.4% and GRPO trained on original images by 2.7%, using only 8.5K samples (30x fewer than recent RL-based methods), and it combines with RL strategies such as NoisyRollout for complementary gains. Conclusion: Current MLLMs over-rely on text prompts instead of reading visually; SimpleOCR uses a structural training constraint to activate and optimize their existing OCR capability and is an efficient, general, plug-and-play method for strengthening visual grounding. Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods.
Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.

[56] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando,Rosario Forte,Antonino Furnari

Main category: cs.CV

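TL;DR: This paper examines the feasibility of running Multimodal Large Language Models (MLLMs) on edge devices for real-time online episodic memory question answering and proposes a two-thread streaming architecture that reaches accuracy close to a cloud-based solution under resource constraints while addressing privacy and latency.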

Details Motivation: Cloud offloading is common but raises privacy and latency concerns, especially for wearable assistants, which motivates an edge implementation. Method: An asynchronous two-thread streaming QA pipeline consists of a Descriptor Thread (continuously converting video into a lightweight textual memory) and a Question Answering Thread (reasoning over the textual memory to answer queries); MLLMs are evaluated under strict resource constraints on the QAEgo4D-Closed benchmark. Result: An end-to-end configuration on a consumer-grade 8GB GPU achieves 51.76% accuracy with a time-to-first-token (TTFT) of 0.41s; a local enterprise-grade server achieves 54.40% accuracy with a TTFT of 0.88s; a cloud-based solution achieves 56.00% accuracy. Conclusion: The edge configurations come close to the cloud solution in accuracy while preserving privacy and low latency, showing potential for privacy-preserving episodic memory retrieval. Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, hence we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of Multimodal Large Language Models (MLLMs) within strict resource boundaries, showing promising results also when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
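
The two-thread layout described above can be sketched with Python's standard `threading` and `queue` modules. Everything model-related here is a placeholder: `describe_clip` stands in for an MLLM captioner, the QA step is a plain substring match instead of MLLM reasoning, and all class and function names are hypothetical.

```python
import queue
import threading


class TextMemory:
    """Thread-safe, append-only list of (timestamp, description) entries."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def append(self, timestamp, text):
        with self._lock:
            self._entries.append((timestamp, text))

    def snapshot(self):
        with self._lock:
            return list(self._entries)


def describe_clip(clip):
    # Placeholder for an MLLM that turns a video clip into a short description.
    return f"saw {clip}"


def descriptor_worker(clips, memory, done):
    # Descriptor thread: continuously converts incoming clips into text memory.
    for timestamp, clip in enumerate(clips):
        memory.append(timestamp, describe_clip(clip))
    done.set()


def qa_worker(questions, memory, answers):
    # QA thread: answers each query from the text memory only (no video access).
    while True:
        question = questions.get()
        if question is None:  # sentinel: shut down
            break
        hits = [text for _, text in memory.snapshot() if question in text]
        answers.put((question, hits))


memory = TextMemory()
done = threading.Event()
questions, answers = queue.Queue(), queue.Queue()
descriptor = threading.Thread(
    target=descriptor_worker, args=(["keys on table", "mug in sink"], memory, done)
)
qa = threading.Thread(target=qa_worker, args=(questions, memory, answers))
descriptor.start()
qa.start()
done.wait()  # only to make this demo deterministic; real queries arrive any time
questions.put("keys")
questions.put(None)
descriptor.join()
qa.join()
result = answers.get()
```

Keeping the memory as text means the QA thread never touches raw video, which is what lets the two threads run asynchronously with a small footprint.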

[57] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov

Main category: cs.CV

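TL;DR: This paper presents MammoWise, a local, modular multi-model pipeline that uses open-source vision-language models (VLMs) for mammography report generation and multi-task classification (such as BI-RADS assessment and breast density classification); it supports zero-shot and few-shot prompting, Chain-of-Thought, and Retrieval Augmented Generation (RAG), and is evaluated on the VinDr-Mammo and DMID datasets.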

Details Motivation: Screening mammography is high volume, time sensitive, and documentation heavy, and existing VLM solutions depend on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. Method: MammoWise is a local multi-model pipeline compatible with any Ollama-hosted VLM and mammography dataset; it supports zero-shot, few-shot, and Chain-of-Thought prompting, integrates a vector database for multimodal RAG, and applies QLoRA for parameter-efficient fine-tuning of MedGemma. Result: Report generation is consistently strong (measured by BERTScore and ROUGE-L) and improves further with few-shot prompting and RAG; classification is feasible but sensitive to model and dataset choice; after QLoRA fine-tuning, MedGemma reaches accuracies of 0.7545 on BI-RADS, 0.8840 on breast density, and 0.9341 on calcification while preserving report quality. Conclusion: MammoWise provides a practical, extensible, and reproducible unified framework for deploying local VLMs for mammography reporting, balancing privacy, flexibility, and clinical utility. Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.

[58] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Niamul Hassan Samin,Md Arifur Rahman,Abdullah Ibne Hanif,Juena Ahmed Noshin,Md Ashikur Rahman

Main category: cs.CV

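TL;DR: This paper proposes Spatial Credit Redistribution (SCR), a training-free inference-time method that redistributes activation credit from high-attention visual patches to their context, mitigating spatial credit collapse in vision-language models and reducing object hallucination at low overhead.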

Details Motivation: Vision-language models (VLMs) often hallucinate objects absent from the image; the authors attribute this to activation credit concentrating on sparse visual patches in early transformer layers (spatial credit collapse), which suppresses contextual evidence and increases reliance on language priors. Method: Spatial Credit Redistribution (SCR) is a training-free inference-time intervention that, guided by low-entropy inputs, redistributes hidden-state activation from high-attention source patches to their neighboring context. Result: On POPE and CHAIR, SCR reduces hallucination by about 4.7-6.0 percentage points on POPE-Adversarial, CHAIR-s by 3.7-5.2 percentage points (42-51% relative), and CHAIR-i by 2.7-4.4 percentage points (44-58% relative), while CIDEr stays within 0.8 percentage points; the overhead is only 43-56 ms, lower than OPERA, VCD, and OVCD, and SCR Pareto-dominates all three on hallucination rate and CIDEr; an ablation shows that attention-guided source selection is essential. Conclusion: Spatial credit collapse is a key mechanism behind VLM hallucination, and SCR mitigates it through a simple, efficient, training-free inference-time intervention that preserves generation quality, making it practical for deployment. Abstract: Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings.
A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.

[59] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Guoyizhe Wei,Yang Jiao,Nan Xi,Zhishen Huang,Jingjing Meng,Rama Chellappa,Yan Gao

Main category: cs.CV

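TL;DR: Pix2Key is a new method for Composed Image Retrieval (CIR) that represents queries and candidate images as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space; its self-supervised pretraining component, V-Dict-AE, improves the dictionary representation using only images and strengthens fine-grained attribute understanding without CIR-specific supervision.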

Details Motivation: Classic fusion pipelines rely on supervised triplets and can lose fine-grained cues, while recent zero-shot methods often caption the reference image and merge the caption with the edit text, which may miss implicit user intent and return repetitive results. Method: The Pix2Key framework models queries and candidate images as open-vocabulary visual dictionaries and introduces V-Dict-AE, a self-supervised pretraining module that refines the dictionary representation using image data only. Result: On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points; adding V-Dict-AE yields a further 2.3-point gain while improving intent consistency and maintaining high list diversity. Conclusion: Through visual dictionary modeling and self-supervised pretraining, Pix2Key improves the accuracy, intent consistency, and result diversity of composed image retrieval without CIR-specific annotation. Abstract: Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
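
For readers unfamiliar with the Recall@10 metric quoted above, the following is a minimal sketch of Recall@K for retrieval. It assumes a single ground-truth target per query, which is a common CIR convention; the exact DFMM-Compose protocol may differ, and the function name is hypothetical.

```python
def recall_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose ground-truth item appears in the top-k results.

    ranked_lists[i] is the ranked list of candidate ids for query i and
    targets[i] is its single ground-truth id (an assumption of this sketch).
    """
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


ranked_lists = [["a", "b", "c"], ["d", "e", "f"]]
targets = ["b", "f"]
score_at_2 = recall_at_k(ranked_lists, targets, k=2)  # "b" is in the top 2, "f" is not
score_at_3 = recall_at_k(ranked_lists, targets, k=3)  # both targets are in the top 3
```

A "3.2-point" gain in this metric means the fraction, expressed as a percentage, rises by 3.2.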

[60] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

Agamdeep S. Chopra,Caitlin Neher,Tianyi Ren,Juampablo E. Heras Rivera,Mehmet Kurt

Main category: cs.CV

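TL;DR: This paper proposes DisQ-HNet, a framework that synthesizes tau-PET images from T1 and FLAIR MRI and uses Partial Information Decomposition (PID) to quantify each modality's contribution, maintaining reconstruction quality while improving performance on AD-related downstream tasks.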

Details Motivation: Tau-PET provides a marker of Alzheimer's disease pathology but is limited by cost and availability, which motivates MRI-based alternatives. Method: DisQ-HNet combines a PID-guided vector-quantized encoder (separating redundant, unique, and complementary information) with a Half-UNet decoder (using pseudo-skip connections driven by structural edge cues instead of directly reusing encoder features). Result: Compared with multiple baselines (VAE, VQ-VAE, UNet), DisQ-HNet maintains high reconstruction fidelity and better preserves disease-relevant signal, improving downstream tasks such as Braak staging, tau localization, and classification; PID-Shapley analysis provides modality-specific attribution. Conclusion: DisQ-HNet offers a feasible route to non-invasive, low-cost tau imaging, and its PID-driven information decomposition and Half-UNet design provide a new approach to multimodal medical image synthesis and interpretable modeling. Abstract: Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer's disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.

[61] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang,Yiming Zeng,Lufan Ma,Zeqing Fu,Chen Bai,Ziyao Lin,Cheng Lu

Main category: cs.CV

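TL;DR: This paper proposes the DrivePTS framework, which uses progressive learning, multi-view hierarchical text descriptions, and a frequency-guided structure loss to address inter-condition dependency, missing semantics, and blurred structure in existing driving scene generation, improving the fidelity, controllability, and generalization of generated scenes.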

Details Motivation: Existing diffusion-based driving scene synthesis methods suffer from implicit dependency between geometric conditions that causes generation failures, overly brief captions that lead to weak background modeling, and a standard denoising loss that neglects foreground structural details and causes blur. Method: DrivePTS (1) adopts a progressive learning strategy with a mutual information constraint to mitigate dependency between geometric conditions; (2) uses a vision-language model to generate multi-view hierarchical descriptions covering six semantic aspects; (3) introduces a frequency-guided structure loss to better model high-frequency structural details. Result: It achieves state-of-the-art fidelity and controllability, and it generates rare scenes where prior methods fail, showing stronger generalization and structural fidelity. Conclusion: DrivePTS addresses key bottlenecks in driving scene generation, improving semantic richness, structural clarity, and condition controllability, and provides high-quality synthetic data for validating the robustness of autonomous driving systems. Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes.
Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

[62] SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

Kang Han,Wei Xiang,Lu Yu,Mathew Wyatt,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

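TL;DR: This paper proposes SwiftNDC, a framework that uses a Neural Depth Correction field to produce cross-view consistent depth maps and combines them with robust reprojection-error filtering to build a high-quality dense point cloud initialization, substantially accelerating 3D Gaussian Splatting (3DGS) for mesh reconstruction and improving novel-view synthesis.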

Details Motivation: Existing depth-guided 3D reconstruction methods suffer from scale drift, multi-view inconsistency, and the need for substantial refinement, so a more efficient and robust geometric initialization is needed. Method: A Neural Depth Correction field produces cross-view consistent depth maps; a dense point cloud is built through depth back-projection and robust reprojection-error filtering; this point cloud serves as the geometric initialization of 3D Gaussian Splatting (3DGS) for mesh reconstruction and novel-view synthesis. Result: Across five datasets (two for mesh reconstruction, three for novel-view synthesis), SwiftNDC reduces running time for mesh reconstruction and improves rendering quality for novel-view synthesis. Conclusion: Combining neural depth refinement with robust geometric initialization delivers both high fidelity and efficiency in 3D reconstruction. Abstract: Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
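
The back-projection and reprojection-filtering steps mentioned above can be illustrated with a pinhole camera model. The sketch below is one common form of cross-view depth consistency check, not necessarily the criterion SwiftNDC uses; the shared intrinsics, the `depth_b_lookup` callback, and the 1% relative threshold are all assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def is_consistent(u, v, depth_a, depth_b_lookup, intrinsics, r_ab, t_ab, max_rel_err=0.01):
    """Keep a view-A point only if view B observes a matching depth where it lands.

    depth_b_lookup(u, v) returns view B's depth at a pixel, or None if the
    pixel falls outside the image (hypothetical callback).
    """
    fx, fy, cx, cy = intrinsics
    point_b = transform(backproject(u, v, depth_a, fx, fy, cx, cy), r_ab, t_ab)
    if point_b[2] <= 0:  # behind camera B
        return False
    u_b, v_b = project(point_b, fx, fy, cx, cy)
    observed = depth_b_lookup(u_b, v_b)
    if observed is None:
        return False
    return abs(observed - point_b[2]) / point_b[2] <= max_rel_err


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (100.0, 100.0, 50.0, 50.0)
point = backproject(60.0, 50.0, 2.0, *intrinsics)
agrees = is_consistent(60.0, 50.0, 2.0, lambda u, v: 2.0, intrinsics, identity, (0.0, 0.0, 0.0))
disagrees = is_consistent(60.0, 50.0, 2.0, lambda u, v: 3.0, intrinsics, identity, (0.0, 0.0, 0.0))
```

Points that fail the check in enough neighboring views would be dropped before the cloud is handed to 3DGS as initialization.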

[63] Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

Peihan Wu,Guanjie Cheng,Yufei Tong,Meng Xi,Shuiguang Deng

Main category: cs.CV

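TL;DR: This paper proposes QARMVC, a quality-aware robust multi-view clustering framework that uses an information bottleneck mechanism to quantify sample-level noise intensity and applies quality-weighted contrastive learning at the feature level and high-quality consensus alignment at the fusion level, improving clustering under heterogeneous noise.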

Details Motivation: Existing noise-robust methods mostly rest on a simple binary assumption (clean versus fully corrupted) and overlook the heterogeneous observation noise common in practice, where contamination intensity varies continuously. Method: QARMVC uses an information bottleneck for view reconstruction, quantifies fine-grained contamination intensity from the reconstruction discrepancy, and derives instance-level quality scores; at the feature level a quality-weighted contrastive objective suppresses noise propagation, and at the fusion level quality-weighted aggregation builds a high-quality global consensus that aligns and rectifies local views through mutual information maximization. Result: Experiments on five benchmark datasets show that QARMVC consistently outperforms state-of-the-art methods, particularly under heterogeneous noise intensities. Conclusion: By modeling continuous noise intensity and introducing a quality-aware hierarchical learning mechanism, QARMVC improves the robustness and accuracy of multi-view clustering under complex real-world noise. Abstract: Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
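
To make the idea of instance-level quality scores and quality-weighted aggregation concrete, here is a small sketch. The exponential mapping from reconstruction error to quality and the `temperature` parameter are assumptions chosen for illustration; the abstract does not state how QARMVC turns reconstruction discrepancy into a score.

```python
import math


def quality_scores(recon_errors, temperature=1.0):
    """Map per-view reconstruction errors to scores in (0, 1]; lower error, higher score.

    Hypothetical mapping: score = exp(-error / temperature).
    """
    return [math.exp(-err / temperature) for err in recon_errors]


def weighted_consensus(view_embeddings, scores):
    """Quality-weighted average of one instance's embeddings across views."""
    total = sum(scores)
    dim = len(view_embeddings[0])
    return [
        sum(s * z[d] for s, z in zip(scores, view_embeddings)) / total
        for d in range(dim)
    ]


# One instance seen in two views: the second view is heavily contaminated.
embeddings = [[0.0, 2.0], [2.0, 0.0]]
clean_scores = quality_scores([0.0, 0.0])
noisy_scores = quality_scores([0.0, 50.0])
balanced = weighted_consensus(embeddings, clean_scores)
skewed = weighted_consensus(embeddings, noisy_scores)
```

With equal quality the consensus is the plain mean; when one view reconstructs poorly, the consensus moves toward the cleaner view instead of discarding the noisy one outright, which is the continuous alternative to the binary clean-or-corrupted assumption.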

[64] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian Xie,Shitong Shao,Lichen Bai,Zikai Zhou,Bojun Cheng,Shuo Yang,Jun Wu,Zeke Xie

Main category: cs.CV

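TL;DR: This paper re-examines guidance methods for diffusion models, reveals that human preference models used in current evaluation are biased toward large guidance scales, and proposes a guidance-aware evaluation framework (GA-Eval) together with TDG, a counterfactual method that exposes the pitfall; it finds that under fair evaluation most new guidance methods do not surpass standard CFG.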

Details Motivation: Evaluation of existing diffusion guidance methods is seriously biased: human preference models strongly favor large guidance scales, so scores rise even when image quality degrades, and a fairer, more reliable evaluation is needed. Method: The guidance-aware evaluation framework (GA-Eval) uses guidance scale calibration to separate effects orthogonal and parallel to CFG; the counterfactual method Transcendent Diffusion Guidance (TDG) is designed to expose the evaluation loophole; eight mainstream guidance methods are evaluated under both the conventional and GA-Eval frameworks. Result: Under conventional evaluation, simply increasing the CFG scale competes with most new methods; under GA-Eval, all methods show a marked drop in winning rate against standard CFG, indicating limited real gains. Conclusion: Progress in diffusion guidance has been overstated by biased evaluation; GA-Eval offers a more robust benchmark, and the community is urged to rethink the evaluation paradigm and research directions. Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method that can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework.
Notably, simply increasing the CFG scales can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
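
Since the argument hinges on the CFG scale, a minimal sketch of how classifier-free guidance combines two noise predictions may help. It uses the widely used convention in which a scale of 1 reproduces the conditional prediction; some papers parameterize the scale differently. The function name and the list-of-floats representation are illustrative assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one.

    Convention assumed here: scale = 0 gives the unconditional prediction,
    scale = 1 gives the conditional prediction, scale > 1 extrapolates beyond it.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
plain = cfg_combine(eps_uncond, eps_cond, scale=1.0)
strong = cfg_combine(eps_uncond, eps_cond, scale=7.5)
```

Raising `scale` pushes the sample harder along the conditional direction, which is the single knob the abstract says can inflate preference scores on its own.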

[65] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Tianyu Chen,Wei Xiang,Kang Han,Yu Lu,Di Wu,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

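TL;DR: GIFSplat is a purely feed-forward iterative refinement framework for reconstructing 3D Gaussian Splatting scenes from sparse unposed views; it uses residual updates and a distilled diffusion prior to achieve efficient, high-quality reconstruction while retaining fast inference and generalization.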

Details Motivation: Existing feed-forward 3D reconstruction methods are constrained by the one-shot prediction paradigm: they are bounded by model capacity, lack inference-time refinement, and are ill-suited to injecting generative priors, with performance and speed degrading on out-of-domain data and once a generative prior is introduced. Method: GIFSplat (1) iteratively refines the 3D Gaussian scene with a small number of forward-only residual updates; (2) distills a frozen diffusion prior into Gaussian-level cues from enhanced novel-view renderings, without gradient backpropagation or view-set expansion. Result: On DL3DV, RealEstate10K, and DTU it outperforms state-of-the-art feed-forward methods, improving PSNR by up to +2.1 dB while keeping second-scale inference and requiring neither camera poses nor test-time gradient optimization. Conclusion: GIFSplat shows that a feed-forward framework can balance efficiency and quality through iterative refinement and prior distillation, moving beyond the one-shot prediction bottleneck toward real-time, robust, and generalizable 3D reconstruction. Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
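
For context on the PSNR figure quoted above, the sketch below computes PSNR from mean squared error for images flattened to lists of values in [0, 1]. The helper name and the flattened-list representation are illustrative; evaluation code for these benchmarks typically works on image tensors.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.0, 0.0, 0.0]
rendered = [0.1, 0.1, 0.1, 0.1]  # uniform error of 0.1, so MSE = 0.01
value_db = psnr(reference, rendered)

# A gain of +2.1 dB corresponds to dividing the MSE by 10 ** 0.21.
mse_ratio_for_gain = 10 ** (2.1 / 10)
```

Because PSNR is logarithmic in the error, a +2.1 dB improvement means the mean squared error falls to roughly 62% of its previous value.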

[66] Causal Motion Diffusion Models for Autoregressive Motion Generation

Qing Yu,Akihisa Watanabe,Kent Fujiwara

Main category: cs.CV

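TL;DR: This paper proposes the Causal Motion Diffusion Model (CMDM), which performs autoregressive motion generation with a causal diffusion transformer in a semantically aligned latent space; combined with a Motion-Language-Aligned Causal VAE (MAC-VAE) and a frame-wise sampling schedule with causal uncertainty, it delivers high-quality, low-latency human motion synthesis that supports streaming and long-horizon generation.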

Details Motivation: Existing motion diffusion models either use bidirectional generation, which breaks temporal causality and limits real-time use, or autoregressive models that are unstable and accumulate errors; a framework that balances causality, stability, and efficiency is needed. Method: The CMDM framework (1) designs MAC-VAE to encode motion sequences into temporally causal latent representations; (2) trains an autoregressive diffusion transformer on top of them with causal diffusion forcing; (3) introduces a frame-wise sampling schedule with causal uncertainty to speed up inference. Result: On HumanML3D and SnapMoGen, CMDM outperforms existing diffusion and autoregressive models in semantic fidelity and temporal smoothness while substantially reducing inference latency, and it supports text-to-motion generation, streaming synthesis, and long-horizon generation at interactive rates. Conclusion: CMDM combines the generation quality of diffusion modeling with the temporal causality of autoregressive modeling, offering a new approach to real-time, controllable, long-horizon human motion synthesis. Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.

[67] Don't let the information slip away

Taozhe Li

Main category: cs.CV

TL;DR: This paper proposes Association DETR, a model that uses background context to improve object detection and reports state-of-the-art results on COCO val2017.

Details Motivation: Mainstream object detectors (such as the YOLO and RT-DETR series) focus heavily on foreground object features and neglect the semantic context provided by the background, even though background information helps localize and recognize objects. Method: Association DETR explicitly models the association between foreground objects and background and fuses contextual information to improve detection. Result: It reports state-of-the-art results on COCO val2017 (above YOLOv12's 55.2 mAP and RT-DETRv2's 53.4 mAP). Conclusion: Modeling background context can improve object detection accuracy, and Association DETR supports the effectiveness of this idea. Abstract: Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads rather than in offices, while wild animals are more likely to be found in forests or remote areas rather than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

[68] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Yuci Han,Charles Toth,John E. Anderson,William J. Shuart,Alper Yilmaz

Main category: cs.CV

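TL;DR: BetterScene is a novel view synthesis method for sparse images built on Stable Video Diffusion (SVD). It adds temporal equivariance regularization and vision-foundation-model-aligned representations to the VAE module and feeds in features rendered by 3D Gaussian Splatting, markedly improving view consistency and detail quality for sparse real-world inputs.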

Details Motivation: Existing diffusion-based novel view synthesis methods fine-tune only the UNet and freeze the other modules, which leads to inconsistent details across views and artifacts, especially with extremely sparse, unconstrained real photos. Method: Starting from the pretrained SVD model, the VAE module is improved in two ways: temporal equivariance regularization strengthens multi-view feature consistency, and a vision foundation model is used to align the latent representation. A feed-forward 3D Gaussian Splatting (3DGS) model renders geometry-aware features that serve as inputs to the SVD enhancer. Result: On the challenging DL3DV-10K dataset, the method outperforms state-of-the-art approaches and generates continuous, artifact-free, view-consistent novel views. Conclusion: BetterScene shows that optimizing the whole diffusion stack (the VAE in particular) together with a geometric representation is more effective than fine-tuning the UNet alone, offering a new approach to high-quality NVS from sparse inputs. Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.

[69] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

Ziqi Zhao,Abhijit Mishra,Shounak Roychowdhury

Main category: cs.CV

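TL;DR: LoR-LUT is a unified low-rank method for generating 3D lookup tables (LUTs). It combines low-rank residual corrections with basis LUTs, keeps the complexity of trilinear interpolation while using far fewer parameters, improves perceptual image quality and interpretability, and ships with an interactive visualization tool, LoR-LUT Viewer.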

Details Motivation: Existing 3D-LUT methods rely on fusing dense tensors, which costs many parameters and is hard to interpret, so compactness and high-quality enhancement are difficult to achieve together. Method: LoR-LUT is a unified low-rank framework that jointly uses low-rank residual corrections and basis LUTs. The accompanying LoR-LUT Viewer is an interactive visualization tool in which sliders control the parameters for intuitive image adjustment. Result: Trained on MIT-Adobe FiveK, the model is under one megabyte, reproduces expert-level retouching, and keeps high perceptual fidelity at the original interpolation complexity. Conclusion: LoR-LUT offers a compact, interpretable, and efficient direction for LUT-based image enhancement and style transfer. Abstract: We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the existing perceptual quality of an image, which is primarily due to the technique's novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity, using a significantly smaller number of network, residual-correction, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image, via a number of slidebars that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
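
The LoR-LUT abstract describes combining basis LUTs with low-rank residual corrections but does not give the exact parameterization. The sketch below is only an illustration under stated assumptions: the residual is taken to be a sum of rank-1 (CP-style) terms, and the names `fuse_lut` and `param_counts` are hypothetical. It shows why a low-rank residual is cheap compared with a dense N x N x N table.

```python
def fuse_lut(bases, weights, factors):
    """Fuse basis LUTs with a low-rank residual for one output channel.

    bases:   list of N x N x N nested lists (dense basis LUTs)
    weights: one scalar weight per basis LUT
    factors: list of (a, b, c) vectors of length N; each triple is an
             ASSUMED rank-1 residual term a[i] * b[j] * c[k]
    """
    n = len(bases[0])
    lut = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                base = sum(w * B[i][j][k] for w, B in zip(weights, bases))
                resid = sum(a[i] * b[j] * c[k] for a, b, c in factors)
                lut[i][j][k] = base + resid
    return lut


def param_counts(n, rank, channels=3):
    """Parameters of a dense residual vs. a rank-`rank` CP-style residual."""
    dense = channels * n ** 3
    low_rank = channels * rank * 3 * n
    return dense, low_rank


# A 33-point grid with a rank-8 residual (both values chosen for illustration).
print(param_counts(33, 8))  # (107811, 2376)
```

Under these assumptions the residual costs 2,376 parameters instead of 107,811, and the fused table is still an ordinary 3D LUT, so trilinear interpolation at lookup time is unchanged.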

[70] Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Minh Kha Do,Wei Xiang,Kang Han,Di Wu,Khoa Phan,Yi-Ping Phoebe Chen,Gaowen Liu,Ramana Rao Kompella

Main category: cs.CV

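TL;DR: SATtxt is a spectrum-aware vision-language foundation model that needs only RGB input. Through spectral representation distillation and spectrally grounded, instruction-augmented alignment, it improves zero-shot classification, retrieval, and linear probing on Earth observation tasks.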

Details Motivation: Vision-language foundation models for satellite imagery are held back because multi-spectral inputs are hard to exploit consistently and CLIP-style text encoders have limited semantic expressiveness. Operational satellite systems also often lack full multi-spectral coverage, so an efficient RGB-only solution is needed. Method: A two-stage framework: (1) Spectral Representation Distillation uses a lightweight projector to transfer spectral priors from a frozen multi-spectral teacher to an RGB student; (2) Spectrally Grounded Alignment with Instruction-Augmented LLMs uses the strong semantic expressiveness of a large language model to align the visual and text spaces at a fine-grained level. Result: Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification by 4.2%, image-text retrieval by 5.9%, and linear probing by 2.7% on average over the baselines. Conclusion: SATtxt offers an efficient vision-language learning path for Earth observation that combines spectral awareness with RGB-only deployment, advancing zero-shot understanding of remote sensing imagery. Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/

[71] Coded-E2LF: Coded Aperture Light Field Imaging from Events

Tomoya Tsuchida,Keita Takahashi,Chihiro Tsutake,Toshiaki Fujii,Hajime Nagahara

Main category: cs.CV

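TL;DR: This paper proposes Coded-E2LF, a purely event-based computational imaging method that reconstructs a 4-D light field using a coded aperture and an event-only camera, and is the first to demonstrate pixel-level-accurate light field reconstruction from events alone.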

Details Motivation: Prior work needs to capture both events and intensity images, which places more restrictions on hardware; this paper aims for a purely event-based light field reconstruction scheme that is lightweight and easy to implement. Method: Coded-E2LF combines a coded aperture with a stationary event camera, clarifies the key role of a black pattern among the aperture coding patterns, and improves the mapping from events to light fields. Result: A high-accuracy 4-D light field is reconstructed on real imaging hardware, confirming that pixel-level-accurate reconstruction is possible from event data alone. Conclusion: Coded-E2LF is the first method to reconstruct a 4-D light field with pixel-level accuracy from the event stream alone, which reduces hardware complexity and makes event cameras more practical for light field imaging. Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.

[72] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

Boyang Dai,Zeng Fan,Zihao Qi,Meng Lou,Yizhou Yu

Main category: cs.CV

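TL;DR: This paper proposes CGSA, the first framework to bring Object-Centric Learning (OCL) into Source-Free Domain Adaptive Object Detection (SF-DAOD). Its Hierarchical Slot Awareness (HSA) and Class-Guided Slot Contrast (CGSC) modules add slot-aware domain adaptation to a DETR-based detector and improve cross-domain detection.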

Details Motivation: Existing SF-DAOD methods mostly tune pseudo-label thresholds or refine the teacher-student framework and overlook object-level structural cues in cross-domain data; privacy-sensitive scenarios additionally require that no source data be used. Method: CGSA embeds a Hierarchical Slot Awareness (HSA) module in a DETR detector to disentangle images into slot representations, and a Class-Guided Slot Contrast (CGSC) module aligns those slots with class semantics, giving source-free, object-centric domain adaptation. Result: The method outperforms previous SF-DAOD approaches on multiple cross-domain datasets; theoretical derivations and ablation experiments support the effectiveness of HSA and CGSC. Conclusion: The object-centric design paradigm (OCL) improves the performance and generalization of source-free domain adaptive object detection and offers a new direction for privacy-preserving model transfer. Abstract: Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at https://github.com/Michael-McQueen/CGSA.

[73] Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji,Chenyang Qi,Qifeng Chen

Main category: cs.CV

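TL;DR: This paper proposes a multi-modal chain-of-thought (CoT) approach that decomposes instruction-based image editing into three stages (planning, editing region reasoning, and editing) to improve editing quality.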

Details Motivation: Existing methods rely on single-modality understanding models, which limits editing quality in complex scenes; a multi-modal model that joins understanding and generation is needed. Method: A multi-modal chain-of-thought prompting framework with three parts: (1) a large language model performs CoT planning to produce sub-prompts suited to the editing network; (2) an instruction-based editing region generation network is trained with a multi-modal large language model; (3) a hint-guided editing network built on a large-scale text-to-image diffusion model performs the edit. Result: The method shows competitive editing ability on complex real-world images. Conclusion: The multi-modal chain-of-thought framework bridges understanding and generation and improves the performance and robustness of instruction-based image editing. Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we decompose the instruction editing task with multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, a hint-guided instruction-based editing network, built on a large-scale text-to-image diffusion model, is proposed to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

[74] CRAG: Can 3D Generative Models Help 3D Assembly?

Zeyu Jiang,Sihang Li,Siqi Tan,Chenyang Xu,Juexiao Zhang,Julia Galway-Witham,Xue Wang,Scott A. Williams,Radu Iovita,Chen Feng,Jing Zhang

Main category: cs.CV

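TL;DR: This paper proposes CRAG, which reformulates 3D assembly as a joint assembly-and-generation task. The two processes reinforce each other, so the model synthesizes missing geometry while predicting part poses, and performance improves on complex real-world objects.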

Details Motivation: Existing 3D assembly methods address only rigid pose estimation, ignoring the way human assembly couples structural reasoning with holistic shape inference, and they cannot handle missing geometry. Method: CRAG jointly optimizes assembly (part pose prediction) and generation (complete shape synthesis): assembly supplies structural priors, and generation supplies holistic shape context that resolves ambiguities in assembly. Result: The method reaches SOTA performance on real-world objects with diverse geometries, varying part counts, and missing pieces. Conclusion: Jointly modeling assembly and generation is closer to how humans reason about assembly and handles robust 3D assembly when geometry is missing. Abstract: Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

[75] QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Daniel Miao,Gilad Lerman,Joe Kileel

Main category: cs.CV

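TL;DR: This paper proposes a new multi-camera synchronization framework based on quadrifocal tensors. Using Tucker decomposition and related tools, it recovers camera poses efficiently and accurately and shows the value of higher-order information in synchronization.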

Details Motivation: Quadrifocal tensors carry more information than their pairwise counterparts (essential matrices) but are often regarded as impractical; this paper challenges that view and examines whether they can be used in practice in structure from motion (SfM). Method: The block quadrifocal tensor is constructed and shown to admit a Tucker decomposition with fixed multilinear rank (4, 4, 4, 4). A synchronization algorithm is then built from Tucker decomposition, ADMM, and iteratively reweighted least squares, together with a method that jointly synchronizes quadrifocal, trifocal, and bifocal tensors. Result: The algorithms perform well on modern datasets, improving the accuracy and robustness of multi-camera synchronization and supporting the usefulness of higher-order tensor information. Conclusion: Quadrifocal tensors are of practical as well as theoretical value, and the proposed framework offers a new way to use higher-order geometric constraints in large-scale SfM. Abstract: In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of $(4, 4, 4, 4)$ independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.

[76] Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Siqi Lu,Wanying Xu,Yongbin Zheng,Wenting Luan,Peng Sun,Jianhang Yao

Main category: cs.CV

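TL;DR: This paper proposes a Multimodal Weight Allocation Module (MWAM) based on frequency-domain analysis. A Frequency Ratio Metric (FRM) quantifies modality preference, and the module dynamically balances the contribution of each modality branch to mitigate the performance collapse caused by missing modalities.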

Details Motivation: Missing modalities severely degrade multimodal models. The root cause is imbalanced learning across modalities during training: the model implicitly favors some modalities and under-optimizes the others. Method: A Frequency Ratio Metric (FRM) quantifies modality preference in the frequency domain. Guided by FRM, a plug-and-play Multimodal Weight Allocation Module (MWAM) dynamically adjusts the weight of each modality branch during training to balance learning. Result: MWAM integrates into a range of backbones, including CNNs and ViTs, gives consistent gains across tasks and modality combinations, and further improves state-of-the-art methods for the missing-modality problem. Conclusion: A frequency-domain view helps expose and correct modality bias in multimodal learning, and MWAM is a simple, general, and efficient way to make models more robust to missing modalities. Abstract: Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.

[77] Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

Woojae Hong,Jong Ha Hwang,Jiyong Chung,Joongyeon Choi,Hyunngun Kim,Yong Hwy Kim

Main category: cs.CV

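TL;DR: Interactive Medical-SAM2 GUI is an open-source, Napari-based desktop tool for semi-automatic annotation of 2D and 3D medical images. It combines SAM2-style propagation with sparse box/point prompts to speed up voxel-level 3D annotation and provides a local, cohort-oriented workflow with quantitative export.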

Details Motivation: Manual annotation of 3D medical images is slow and expensive, and existing tools mostly offer per-slice interaction without a unified, cohort-oriented local workflow covering navigation, propagation, interactive correction, and quantitative export. Method: A local GUI built on Napari treats a 3D volume as a slice sequence and integrates box/point prompts with Medical-SAM2 cross-slice mask propagation. It supports first/last-slice initialization, per-object prompt-based correction, N4 bias-field correction, and geometry-preserving export via SimpleITK. Result: The tool delivers an efficient, unified, local semi-automatic workflow for 3D medical image annotation, with batch handling of DICOM/NIfTI data, case-by-case annotation control, per-object volumetry, 3D rendering, and geometry-preserving export. Conclusion: The tool fills the gap for a lightweight, reproducible, end-to-end local annotation tool for clinical cohort studies; it is intended for research annotation only, and the code is open source. Abstract: Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows.
The code is released on the project page: https://github.com/SKKU-IBE/Medical-SAM2GUI/.

[78] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Bowen Cui,Yuanbin Wang,Huajiang Xu,Biaolong Chen,Aixi Zhang,Hao Jiang,Zhengzheng Jin,Xu Liu,Pipei Huang

Main category: cs.CV

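TL;DR: This paper proposes DPCache, a training-free acceleration framework for diffusion models that casts sampling acceleration as a global path planning problem. It builds a path-aware cost tensor and uses dynamic programming to choose the optimal key timesteps, speeding up inference while maintaining or even improving generation quality.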

Details Motivation: Existing caching-based acceleration methods for diffusion models use fixed or locally adaptive schedules and ignore the global structure of the denoising trajectory, which tends to cause error accumulation and visual artifacts. Method: DPCache builds a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps, then uses dynamic programming to select the sequence of key timesteps with minimal total path cost. At inference, full computation runs only at the key timesteps and the remaining steps are predicted from cached features. Result: Evaluated on DiT, FLUX, and HunyuanVideo: at 4.87× speedup DPCache scores +0.031 ImageReward above prior methods, and at 3.54× speedup it exceeds the full-step baseline by +0.028 ImageReward. Conclusion: Path-aware global scheduling mitigates error accumulation, and DPCache, as a training-free method, achieves a better balance between acceleration and quality. Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features.
Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
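
The key-timestep selection in the DPCache abstract can be illustrated with a small dynamic program. This is a simplified sketch, not the authors' algorithm: it assumes a 2-D cost matrix `cost[i][j]` (the error of computing fully at step `i` and relying on cached features until the next full computation at step `j`), whereas the paper describes a path-aware cost tensor. The function name and the toy cost are hypothetical.

```python
from math import inf


def select_key_timesteps(cost, num_keys):
    """Pick `num_keys` key timesteps that minimize the total path cost.

    cost is a T x (T + 1) table; column T is a sentinel for the end of
    sampling. Step 0 is always a key timestep. Returns (total, keys).
    """
    T = len(cost)
    # best[k][j]: minimal cost with k keys chosen so far, the latest at j.
    best = [[inf] * T for _ in range(num_keys + 1)]
    back = [[-1] * T for _ in range(num_keys + 1)]
    best[1][0] = 0.0
    for k in range(2, num_keys + 1):
        for j in range(1, T):
            for i in range(j):
                cand = best[k - 1][i] + cost[i][j]
                if cand < best[k][j]:
                    best[k][j], back[k][j] = cand, i
    # Close the path from the last key timestep to the end sentinel.
    total, last = inf, -1
    for j in range(T):
        cand = best[num_keys][j] + cost[j][T]
        if cand < total:
            total, last = cand, j
    # Walk the back-pointers to recover the chosen timesteps.
    keys, k, j = [], num_keys, last
    while j != -1:
        keys.append(j)
        j, k = back[k][j], k - 1
    return total, keys[::-1]


# Toy cost (an assumption): skipping s consecutive steps costs s squared.
T = 6
toy_cost = [[float((j - i - 1) ** 2) if j > i else inf
             for j in range(T + 1)] for i in range(T)]
print(select_key_timesteps(toy_cost, num_keys=3))  # (3.0, [0, 2, 4])
```

With a convex skip penalty the optimum spreads the key timesteps evenly; a cost table measured on a calibration set would generally yield a non-uniform schedule.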

[79] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

Renyu Yang,Jian Jin,Lili Meng,Meiqin Liu,Yilin Wang,Balu Adsumilli,Weisi Lin

Main category: cs.CV

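TL;DR: This paper proposes a practical approach to building audio-visual quality assessment (AVQA) datasets, using a crowdsourced subjective experiment framework, a systematic data preparation strategy, and additional multi-dimensional annotations. The result is YT-NTU-AVQ, the largest and most diverse AVQA dataset to date (1,620 user-generated audio-video sequences), released with the platform code.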

Details Motivation: Existing AVQA datasets are small, lack diversity in content and quality, and provide only overall scores, which gives limited support for model development and multimodal perception research. Method: A crowdsourced subjective experiment framework removes the constraints of in-lab settings; a systematic data preparation strategy ensures coverage of quality levels and semantic scenarios; additional multi-dimensional annotations support research on multimodal perception mechanisms. Result: YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, contains 1,620 user-generated audio-video sequences; the data and platform code are open source. Conclusion: The approach overcomes the limitations of existing AVQA datasets and provides a high-quality, diverse, multi-dimensionally annotated data foundation for model training and multimodal perception research. Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA, which breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ

[80] ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

Xuelu Li,Zhaonan Wang,Xiaogang Wang,Lei Wu,Manyi Li,Changhe Tu

Main category: cs.CV

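TL;DR: This paper proposes ArtPro, a self-supervised framework that reconstructs high-fidelity digital twins of articulated objects by adaptively integrating mobility proposals, addressing the sensitivity of existing methods to the initial part segmentation and their tendency to fall into local optima.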

Details Motivation: Existing self-supervised methods based on differentiable rendering (such as 3D Gaussian Splatting) are highly sensitive to the initial part segmentation and depend on heuristic clustering or pre-trained models, so optimization for complex multi-part objects often converges to local minima. Method: ArtPro starts from an over-segmentation guided by geometry features and motion priors, producing part proposals with motion hypotheses. During optimization it dynamically merges proposals whose motion is consistent among spatial neighbors, and a collision-aware motion pruning mechanism prevents erroneous kinematic estimates. Result: Extensive experiments on synthetic and real-world objects show that ArtPro clearly outperforms existing methods in accuracy and stability and robustly reconstructs complex multi-part objects. Conclusion: By merging parts based on motion consistency and pruning motions based on collisions, ArtPro improves the robustness and accuracy of self-supervised articulated object reconstruction and offers a new approach to building digital twins. Abstract: Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.

[81] Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou,Yueru Luo,Han Zhang,Zeyu Jiang,Changhao Chen

Main category: cs.CV

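TL;DR: This paper proposes an open-vocabulary 3D occupancy prediction framework for indoor scenes that uses geometry-only supervision (binary occupancy labels). It combines Language-Embedded Gaussians with a new opacity-aware Poisson-based aggregation and a Progressive Temperature Decay schedule, improving both IoU and mIoU on Occ-ScanNet.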

Details Motivation: Existing open-vocabulary 3D occupancy methods work in outdoor driving scenes but transfer poorly to indoor environments, where geometry is denser, layouts are more intricate, and semantics are finer-grained; under weak supervision, geometry modeling is unstable and semantic alignment suffers from feature mixing. Method: 3D Language-Embedded Gaussians serve as a unified intermediate representation of geometry and semantics. An opacity-aware, Poisson-based Gaussian-to-Occupancy operator stabilizes volumetric aggregation, and a Progressive Temperature Decay schedule gradually sharpens Gaussian opacities during splatting to strengthen the alignment between Gaussians and language features. Result: On Occ-ScanNet, the framework reaches 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Conclusion: Geometry-only supervision combined with a language-aligned Gaussian representation is an effective paradigm for indoor open-vocabulary 3D occupancy, and the proposed aggregation and alignment mechanisms improve robustness and generalization. Abstract: Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.

[82] SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

Guanghao Liao,Zhen Liu,Liyuan Cao,Yonghui Yang,Qi Li

Main category: cs.CV

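TL;DR: This paper proposes SPMamba-YOLO, an underwater object detection network that combines multi-scale feature enhancement with global context modeling. An SPPELAN module, a PSA mechanism, and Mamba-based state space modeling improve detection of small and densely distributed objects.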

Details Motivation: Underwater object detection is hampered by severe light attenuation, color distortion, background clutter, and small target size. Method: SPMamba-YOLO contains an SPPELAN module (stronger multi-scale feature aggregation and a larger receptive field), a PSA mechanism (better feature discrimination), and a Mamba-based state space modeling module (long-range dependencies and global context). Result: On URPC2022, mAP@0.5 improves by more than 4.9% over the YOLOv8n baseline, with the largest gains on small and densely distributed objects, while keeping a good balance between accuracy and computational cost. Conclusion: SPMamba-YOLO improves the robustness and accuracy of detection in complex underwater environments and offers a new direction for underwater vision tasks. Abstract: Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.

[83] ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

Quoc-Khang Tran,Minh-Thien Nguyen,Nguyen-Khang Pham

Main category: cs.CV

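TL;DR: This paper proposes ViCLIP-OT, a vision-language foundation model designed for Vietnamese image-text retrieval. It combines CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss and improves cross-modal retrieval for a low-resource language.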

Details Motivation: Most vision-language models are optimized for high-resource languages and perform poorly in low-resource settings such as Vietnamese, so a purpose-built model is needed. Method: ViCLIP-OT combines CLIP-style contrastive learning with a newly proposed Similarity-Graph Regularized Optimal Transport (SIGROT) loss to strengthen global cross-modal consistency and reduce the modality gap. Result: On three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600), ViCLIP-OT outperforms CLIP and SigLIP in both in-domain and zero-shot settings. On UIT-OpenViIC it reaches an average Recall@K of 67.34%, 5.75 percentage points above CLIP; in zero-shot evaluation on Crossmodal-3600 it gains 11.72 percentage points. Embedding-space analysis confirms better alignment and a smaller modality gap. Conclusion: SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, with practical value for intelligent multimedia retrieval systems in Vietnamese and other underrepresented languages. Abstract: Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
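
The ViCLIP-OT results are reported as average Recall@K. As a reminder of how that retrieval metric is usually computed, here is a minimal sketch. It assumes a square similarity matrix in which the correct match for query `i` is item `i`; the abstract does not state the K values or the number of captions per image, so the function name, toy matrix, and K values below are purely illustrative.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    the ground-truth match for query i is ASSUMED to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix: query 1 ranks the wrong candidate first.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
for k in (1, 2):
    print(k, recall_at_k(toy_sim, k))
```

Averaging the per-K values gives a single "average Recall@K" figure of the kind quoted in the abstract.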

[84] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Zhuohang Jiang,Xu Yuan,Haohao Qu,Shanru Lin,Kanglong Liu,Wenqi Fan,Qing Li

Main category: cs.CV

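TL;DR: This paper introduces SUPERGLASSES, the first VQA benchmark built on real smart glasses data, and SUPERLENS, a dedicated multimodal agent that improves visual question answering in smart glasses scenarios.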

Details Motivation: Vision language models used for smart glasses are trained and evaluated on data that does not match the setting and lacks realism, and they do not address the core challenge that the object of interest must be identified accurately before any knowledge retrieval. Method: SUPERGLASSES is the first VQA benchmark collected entirely with smart glasses (2,422 image-question pairs with search trajectories and reasoning annotations). The SUPERLENS agent combines automatic object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation. Result: Evaluating 26 representative VLMs on SUPERGLASSES reveals significant performance gaps; SUPERLENS surpasses GPT-4o by 2.19 percent and reaches SOTA. Conclusion: Smart glasses VQA calls for task-specific modeling; benchmarks built on real data and dedicated agent designs are key directions. Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.

[85] No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon,Woo Jae Kim,Suhyeon Ha,Sooel Son,Sung-Eui Yoon

Main category: cs.CV

TL;DR: This paper proposes MoFit, a membership inference attack (MIA) framework that needs no ground-truth captions; it builds model-fitted synthetic conditioning inputs to detect whether a diffusion model memorized a given image during training.

Details Motivation: Existing MIAs against diffusion models rely on ground-truth captions, which are often unavailable in practice; pseudo-captions generated by vision-language models (VLMs) perform poorly and cause the attacks to fail. Method: MoFit uses a two-stage strategy: (i) model-fitted surrogate optimization, which applies a learnable perturbation to the query image so that the resulting surrogate lies in a high-density region of the model's unconditional prior (learned from member samples); (ii) surrogate-driven embedding extraction, which derives a model-fitted embedding from the surrogate and uses it as a mismatched condition, amplifying the conditional loss response of member samples to improve separability. Result: Across multiple datasets and diffusion models, MoFit clearly outperforms baselines that use VLM-generated captions and comes close to methods that depend on ground-truth captions. Conclusion: MoFit shows that MIAs against text-to-image diffusion models can be carried out effectively without ground-truth captions, offering a more practical and robust approach to auditing model privacy. Abstract: Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

[86] GFRRN: Explore the Gaps in Single Image Reflection Removal

Yu Chen,Zewei He,Xingyu Liu,Zixuan Chen,Zheming Lu

Main category: cs.CV

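TL;DR: This paper proposes the Gap-Free Reflection Removal Network (GFRRN), which uses parameter-efficient fine-tuning (PEFT), a unified reflection label generator, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB), and Dynamic Agent Attention (DAA) to close the semantic understanding gap and resolve reflection label inconsistency, achieving state-of-the-art performance in single image reflection removal.

Details Motivation: Existing dual-stream methods for single image reflection removal have two main problems: a semantic understanding gap between the pre-trained model and the reflection removal model, and inconsistent reflection labels between synthetic and real data. Method: A PEFT strategy inserts learnable Mona layers to align the training directions; a label generator unifies the reflection labels of synthetic and real data; a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) adaptively learns and fuses frequency priors; Dynamic Agent Attention (DAA) replaces window attention and models significance across and within windows. Result: GFRRN clearly outperforms current state-of-the-art methods on multiple benchmarks, which supports the effectiveness of each module and of the overall architecture. Conclusion: By jointly addressing semantic alignment and label consistency, together with new frequency modeling and attention mechanisms, GFRRN delivers more robust and more general single image reflection removal. Abstract: Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) a semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.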

[87] UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects

Yuankai Chen,Kai Lin,Qihong Wu,Xinxuan Yang,Jiashuo Lai,Ruoen Chen,Haonan Shi,Minfan He,Meihua Wang

Main category: cs.CV

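TL;DR: This paper proposes UFO-DETR, an end-to-end detection framework designed for small-object detection in UAV imagery. It combines an LSKNet backbone, the DAttention and AIFI modules, and a newly proposed DynFreq-C3 module, improving multi-scale and small-object detection while remaining computationally efficient.

Details Motivation: Small-object detection in UAV imagery is hampered by large scale variation, dense targets, a high share of small objects, reliance on hand-designed components, and general-purpose detectors that are not optimized for UAV images, which makes it hard to balance accuracy and complexity. Method: The end-to-end framework UFO-DETR uses LSKNet as the backbone to optimize the receptive field and reduce parameters, combines the DAttention and AIFI modules to model multi-scale spatial relationships, and introduces a new module, DynFreq-C3, which improves small-object detection through cross-space frequency feature enhancement. Result: In experiments, UFO-DETR shows significant advantages over RT-DETR-L in both detection performance and computational efficiency, making it suitable for UAV edge computing. Conclusion: UFO-DETR addresses the key challenges of small-object detection in UAV imagery, combines high accuracy with high efficiency, and offers a feasible option for practical deployment. Abstract: Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.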

[88] SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

Guanting Ye,Qiyan Zhao,Wenhao Yu,Liangyu Yuan,Mingkai Li,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Qing Jiang,Ka-Veng Yuen

Main category: cs.CV

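TL;DR: This paper proposes SoPE, a spherical-coordinate positional embedding that addresses the shortcomings of conventional RoPE in 3D large vision-language models (3D LVLMs) when modeling three-dimensional spatial structure and angular dependencies. SoPE maps point-cloud tokens into a spherical coordinate space, jointly models spatial location and directional angle, and adds a multi-scale frequency mixing strategy to strengthen geometric representation; its effectiveness and generalization are validated on multiple 3D scene benchmarks.

Details Motivation: Conventional RoPE cannot preserve three-dimensional spatial structure in 3D multimodal understanding, and its relative distance computation ignores angular dependencies, which limits its ability to model directional variation. Method: Spherical Coordinate-based Positional Embedding (SoPE) maps point-cloud token indices into a 3D spherical coordinate space and jointly models spatial location and directional angle; a multi-scale frequency mixing strategy fuses features from different frequency domains. Result: Performance improves markedly on multiple 3D scene benchmarks, and real-world deployment experiments indicate strong generalization. Conclusion: SoPE addresses RoPE's geometric modeling shortcomings in 3D understanding, improves the spatial perception and directional expressiveness of 3D LVLMs, and offers a better positional encoding paradigm for 3D multimodal learning. Abstract: 3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.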

[89] IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

Shuoqi Chen,Yujia Wu,Geoffrey P. Luke

Main category: cs.CV

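TL;DR: This paper proposes a diffusion-based method for ultrasound despeckling. It is trained with supervision on simulated paired data, suppresses speckle noise while preserving anatomical edges and contrast, and quantifies prediction uncertainty via cross-model variance, showing that this uncertainty correlates with reconstruction error.

Details Motivation: Speckle noise and related artifacts degrade ultrasound image quality and hinder clinical interpretation, so a robust method is needed that removes noise while preserving key anatomical structures. Method: A diffusion model is built on the Image Restoration Stochastic Differential Equations framework; large-scale simulated paired ultrasound data are generated from speckle-free MRI with the Matlab UltraSound Toolbox to enable supervised training; cross-model variance is used to quantify prediction uncertainty. Result: On the simulated test set the method clearly outperforms classical filters and recent learning-based despeckling methods; higher uncertainty corresponds to larger reconstruction error; the model is sensitive to simulated probe settings, indicating domain shift. Conclusion: The diffusion model combines strong performance with interpretability for ultrasound despeckling, but diversified training and domain adaptation are needed to make it robust in real clinical settings. Abstract: Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.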

[90] HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

Yangguang Lin,Quan Fang,Yufei Li,Jiachen Sun,Junyu Gao,Jitao Sang

Main category: cs.CV

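TL;DR: This paper proposes HulluEdit, a single-forward-pass, reference-model-free intervention framework that decomposes hidden states via orthogonal subspace editing and selectively suppresses hallucinatory patterns without affecting visual grounding, substantially reducing object hallucination in large vision-language models.

Details Motivation: Existing methods struggle to balance efficiency and accuracy: they either need expensive reference models and multiple forward passes, or apply static edits that may suppress genuine visual evidence. Method: The core of the HulluEdit framework is orthogonal subspace editing: the model's hidden states are decomposed into three orthogonal subspaces (visual evidence, conflicting priors, and residual uncertainty), and only the prior subspace is edited, so the visual component is left unaffected. Result: HulluEdit achieves state-of-the-art hallucination reduction on benchmarks such as POPE and CHAIR while preserving general capabilities on MME and keeping inference efficient; it consistently outperforms contrastive decoding and static subspace editing baselines. Conclusion: HulluEdit offers a new path toward more trustworthy large vision-language models, with efficient and precise hallucination correction that does not degrade visual evidence. Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.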

[91] Asymmetric Idiosyncrasies in Multimodal Models

Muzi Tao,Chufan Shi,Huijuan Wang,Shengbang Tong,Xuezhe Ma

Main category: cs.CV

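TL;DR: This paper studies the idiosyncrasies of image captioning models and their downstream effect on text-to-image models, finding that captioning models leave distinctive stylistic signatures in text that almost vanish in the generated images.

Details Motivation: To examine how the idiosyncrasies of captioning models affect the prompt-following ability of text-to-image models, and to quantify this cross-modal discrepancy. Method: A systematic analysis framework: given a generated caption or its corresponding image, a neural network is trained to predict which captioning model produced it; the key textual variations that images fail to preserve are then analyzed. Result: Text classification accuracy reaches 99.70%, whereas image classification accuracy is at most 50%; the generated images fail to preserve key variations in the captions, such as level of detail, emphasis on color and texture, and the distribution of objects. Conclusion: The classification-based framework provides a new way to quantify both the stylistic idiosyncrasies of captioning models and the prompt-following ability of text-to-image systems. Abstract: In this work, we study idiosyncrasies in caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.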

[92] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

Tongfei Chen,Shuo Yang,Yuguang Yang,Linlin Yang,Runtang Guo,Changbai Li,He Long,Chunyu Xie,Dawei Leng,Baochang Zhang

Main category: cs.CV

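TL;DR: This paper proposes Alignment-Aware Masked Learning (AML), a training strategy that improves referring image segmentation (RIS) by explicitly modeling pixel-level image-text alignment and filtering out poorly aligned regions; it achieves state-of-the-art results on the RefCOCO datasets and is more robust to diverse descriptions and scenes.

Details Motivation: Existing RIS methods often neglect explicit modeling of pixel-level vision-language alignment, leaving models vulnerable to noisy descriptions or complex scenes and limiting generalization and robustness. Method: AML has three parts: (1) estimating pixel-level image-text alignment scores; (2) dynamically masking low-confidence regions based on those scores; (3) focusing optimization on well-aligned regions. Result: State-of-the-art results on the three mainstream RIS benchmarks RefCOCO, RefCOCO+, and RefCOCOg; markedly better robustness to ambiguous, redundant, or adversarial descriptions. Conclusion: Explicitly modeling and exploiting pixel-level vision-language alignment improves the accuracy and robustness of RIS models, and AML suggests a new direction for RIS training. Abstract: Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.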

[93] ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

Akihisa Watanabe,Qing Yu,Edgar Simo-Serra,Kent Fujiwara

Main category: cs.CV

TL;DR: This paper proposes ProjFlow, a training-free sampler that satisfies linear spatial constraints exactly and zero-shot while preserving motion realism. Its core is a kinematics-aware metric combined with a time-varying pseudo-observation strategy, which resolves the tension between hard constraints and naturalness in motion generation.

Details Motivation: Existing methods struggle to satisfy exact spatial constraints while keeping motion natural, and often need task-specific training or slow optimization. Method: The ProjFlow sampler is formulated around a linear inverse problem; a kinematics-aware metric encodes skeletal topology; time-varying pseudo-observations handle sparse inputs such as keyframe in-betweening. Result: On tasks such as motion inpainting and 2D-to-3D lifting, ProjFlow satisfies constraints exactly, matches or improves on the realism of zero-shot baselines, and approaches the performance of training-based controllers. Conclusion: ProjFlow offers a new approach to zero-shot, high-fidelity, constraint-controllable human motion generation that balances exactness and naturalness. Abstract: Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, namely motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.

[94] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Fengming Liu,Tat-Jen Cham,Chuanxia Zheng

Main category: cs.CV

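TL;DR: This paper proposes SPATIALALIGN, a framework that uses zeroth-order regularized Direct Preference Optimization (DPO) to improve how text-to-video models depict dynamic spatial relationships (DSR), and introduces the geometric metric DSR-SCORE and an accompanying dataset, markedly improving the spatial accuracy of generated videos.

Details Motivation: Existing text-to-video generators emphasize aesthetic quality but often fail to depict the dynamic spatial relationships (DSR) specified in the text prompt accurately. Method: SPATIALALIGN is a self-improvement framework that fine-tunes T2V models with zeroth-order regularized Direct Preference Optimization (DPO); the geometry-based DSR-SCORE metric quantitatively measures how well a video matches the DSR in its prompt; a dataset of text-video pairs with diverse DSRs is constructed. Result: Experiments show that the fine-tuned model clearly outperforms the baseline at modeling dynamic spatial relationships. Conclusion: SPATIALALIGN improves T2V models' understanding and generation of dynamic spatial relationships, and DSR-SCORE offers a more reliable and interpretable geometric alternative for evaluation. Abstract: Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released.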

[95] Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

Yuan-Chih Chen,Chun-Shien Lu

Main category: cs.CV

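TL;DR: This paper proposes a unified hidden-code recovery framework for factual retrieval and reconstruction of tampered image content; it supports both post-hoc and in-generation watermarking paradigms and is validated on the newly built ImageNet-S benchmark.

Details Motivation: Existing research focuses mainly on deepfake detection and localization and pays little attention to factual retrieval and recovery of tampered content. Method: A unified hidden-code recovery framework that encodes semantic and perceptual information into a compact hidden-code representation via multi-scale vector quantization and strengthens contextual reasoning with conditional Transformer modules. Result: On the newly built ImageNet-S benchmark the method shows strong factual retrieval and image reconstruction performance and is compatible with a variety of watermarking pipelines. Conclusion: The framework lays a foundation for general-purpose image recovery tasks that go beyond detection and localization. Abstract: Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.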

[96] TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng,Jieyu Zhang,Jianing Zhang,Weikai Huang,Ashutosh Kumar,Quan Kong,Oncel Tuzel,Chun-Liang Li,Ranjay Krishna

Main category: cs.CV

TL;DR: This paper proposes TrajTok, an end-to-end, jointly trainable video tokenizer that produces object trajectories directly through implicit spatio-temporal pixel clustering, decouples video duration from token count, improves video understanding while staying efficient, and extends to several downstream tasks.

Details Motivation: Conventional patch-based video tokenization yields redundant tokens and low efficiency; existing trajectory tokenizers depend on complex external segmentation and tracking modules that are slow and task-agnostic. Method: The TrajTok module contains a unified implicit spatio-temporal pixel-clustering segmenter that produces trajectories in a single forward pass; it is trained end-to-end jointly with the downstream task and adapts dynamically to semantic complexity; it prioritizes downstream adaptability over pixel-level segmentation accuracy. Result: TrajViT2, built on TrajTok, reaches state-of-the-art accuracy on classification and retrieval benchmarks with efficiency comparable to the best token-merging methods; the extensions TrajAdapter (a probing head) and TrajVLM (an alignment connector) perform especially well in long-video reasoning. Conclusion: TrajTok is a lightweight, efficient, generalizable, and task-adaptive video tokenization paradigm that breaks the coupling between duration and token count and shows potential as a general-purpose video representation interface. Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

[97] SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Ling Wang,Hao-Xiang Guo,Xinzhou Wang,Fuchun Sun,Kai Sun,Pengkun Liu,Hang Xiao,Zhong Wang,Guangyuan Fu,Eric Li,Yang Liu,Yikai Wang

Main category: cs.CV

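TL;DR: SceneTransporter is an end-to-end framework that introduces an entropy-regularized optimal transport (OT) objective into the DiT denoising loop to generate structured 3D scenes from a single image, addressing the failure of existing methods to organize parts into distinct instances in open-world scenes.

Details Motivation: Existing methods can generate part-level 3D objects but cannot organize the parts into distinct instances in open-world scenes; the authors find that the root cause is a lack of structural constraints in the model's internal assignment mechanism. Method: Structured 3D scene generation is reframed as a global correlation assignment problem; an entropy-regularized OT objective is formulated and solved inside the denoising loop of a compositional DiT model; the transport plan gates cross-attention to enforce one-to-one assignment of image patches to part-level 3D latents, and an edge-based cost encourages similar patches to cluster into coherent objects. Result: The method clearly outperforms existing approaches on open-world scene generation, improving instance-level coherence and geometric fidelity. Conclusion: Structured optimal transport constraints improve instance separation and geometric quality when generating 3D scenes from a single image, which supports the importance of global assignment modeling for structured generation. Abstract: We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.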

[98] Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima,Hiroshi Kera,Kazuhiko Kawamoto

Main category: cs.CV

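TL;DR: This paper proposes a robust trajectory prediction method that incorporates a skeleton representation model pretrained with self-supervised masked autoencoding, to cope with joints missing because of occlusion in real scenes, improving both robustness and accuracy.

Details Motivation: In real environments, skeleton data often have missing joints because of occlusion, which seriously degrades trajectory prediction accuracy and calls for a more robust skeleton representation. Method: A trajectory prediction method that incorporates a skeleton representation model pretrained with self-supervised masked autoencoding, using that model to learn skeleton features that are robust to missing joints. Result: Experiments in occlusion-prone scenarios show that the method improves robustness to missing skeleton data without sacrificing prediction accuracy and consistently outperforms baseline models from clean to moderate levels of missingness. Conclusion: Self-supervised pretrained skeleton representations mitigate the information loss caused by occlusion and suggest a new direction for robust human trajectory prediction. Abstract: Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.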

[99] GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation

Hanliang Du,Zhangji Lu,Zewei Cai,Qijian Tang,Qifeng Yu,Xiaoli Liu

Main category: cs.CV

TL;DR: This paper proposes GSTurb, a framework that combines optical-flow-guided tilt correction with Gaussian splatting to model non-isoplanatic blur, markedly improving the restoration of images degraded by atmospheric turbulence in long-range imaging.

Details Motivation: Atmospheric turbulence causes pixel displacement (tilt) and blur that seriously degrade long-range imaging, and existing methods fall short in modeling non-isoplanatic blur and in jointly optimizing tilt and blur. Method: The GSTurb framework corrects image tilt with optical-flow guidance and models spatially varying, non-isoplanatic blur with Gaussian splatting; tilt and blur are represented by Gaussian parameters and optimized jointly across multiple frames. Result: On the synthetic ATSyn-static dataset, GSTurb reaches a PSNR of 27.67 dB and an SSIM of 0.8735, improving on the state of the art by 1.3 dB PSNR (4.5%) and 0.048 SSIM (5.8%); it also clearly outperforms existing methods on the real TSRWGAN Real-World and CLEAR datasets. Conclusion: Combining optical-flow-guided tilt correction with Gaussian splatting handles image degradation under both synthetic and real atmospheric turbulence and suggests a new direction for long-range image restoration. Abstract: Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at https://github.com/DuhlLiamz/3DGS_turbulence/tree/main.

[100] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Mingde Yao,Zhiyuan You,Tam-King Man,Menglu Wang,Tianfan Xue

Main category: cs.CV

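TL;DR: This paper proposes PhotoAgent, a system that performs autonomous image editing through explicit aesthetic planning. It models the task as a long-horizon decision problem, plans multi-step operations with tree search, and iteratively refines results through closed-loop execution with visual feedback, without step-by-step user instructions; it also builds the UGC-Edit aesthetic evaluation benchmark for evaluation in real-world scenarios.

Details Motivation: Existing instruction-based image editing methods depend heavily on carefully hand-written instructions, leaving task decomposition and step ordering entirely to the user, and lack autonomy. Method: Autonomous image editing is modeled as a long-horizon decision problem; explicit aesthetic planning reasons about the user's aesthetic intent and generates multi-step editing actions via tree search; closed-loop execution with memory and visual feedback refines the result iteratively; the UGC-Edit aesthetic evaluation benchmark and a 1,017-photo test set are constructed. Result: PhotoAgent clearly outperforms baseline methods on multiple metrics, with consistent gains in instruction adherence and visual quality; the UGC-Edit benchmark supports reliable evaluation in real-world scenarios. Conclusion: PhotoAgent achieves autonomous image editing without relying on fine-grained user instructions, making editing more intelligent, robust, and practical. Abstract: With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.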

[101] Face Time Traveller: Travel Through Ages Without Losing Identity

Purbayan Kar,Ayush Ghadiya,Vishal Chudasama,Pankaj Wasnik,C. V. Jawahar

Main category: cs.CV

TL;DR: This paper proposes FaceTT, a diffusion-based face aging method that achieves high-fidelity, identity-consistent age transformation through face-attribute-aware prompt refinement, tuning-free angular inversion, and adaptive attention control.

Details Motivation: Existing face aging methods struggle to preserve identity across wide age transformations, and static attention mechanisms and optimization-heavy inversion limit their adaptability, fine-grained control, and background consistency. Method: The FaceTT framework comprises a Face-Attribute-Aware Prompt Refinement strategy (encoding biological and environmental aging cues), a tuning-free Angular Inversion method (efficiently mapping real faces into the diffusion latent space), and an Adaptive Attention Control mechanism (dynamically balancing cross-attention and self-attention). Result: Extensive experiments on benchmark datasets and an in-the-wild test set show that FaceTT outperforms current state-of-the-art methods in identity retention, background preservation, and aging realism. Conclusion: FaceTT addresses the trade-off between identity consistency and visual realism in face aging, offering a more reliable and controllable approach for applications such as entertainment, forensics, and digital archiving. Abstract: Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite progress in recent face aging models, they struggle with identity preservation in wide age transformations; in addition, static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control, and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. Here, we introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method is proposed that efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism is introduced that dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and an in-the-wild test set demonstrate that FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art (SOTA) methods.

[102] CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation

Tong Wang,Yaolei Qi,Siwen Wang,Imran Razzak,Guanyu Yang,Yutong Xie

Main category: cs.CV

TL;DR: This paper proposes the CMSA-Net framework, which uses a Causal Multi-scale Aggregation (CMA) module and a Dynamic Multi-source Reference (DMR) strategy to improve the accuracy and real-time performance of video polyp segmentation, reaching state-of-the-art performance on the SUN-SEG dataset.

Details Motivation: Video polyp segmentation faces two major challenges: polyps look similar to the surrounding mucosa (weak semantic discrimination), and their position and scale change considerably across frames. Method: CMSA-Net has two components: (1) a Causal Multi-scale Aggregation (CMA) module that uses causal attention to fuse multi-scale semantic information from multiple historical frames; (2) a Dynamic Multi-source Reference (DMR) strategy that adaptively selects reference frames based on semantic separability and prediction confidence. Result: The method reaches state-of-the-art performance on the SUN-SEG dataset while balancing segmentation accuracy and real-time clinical applicability. Conclusion: CMSA-Net improves the robustness and efficiency of video polyp segmentation and offers a practical option for computer-aided colonoscopy. Abstract: Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.

[103] Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification

G. A. S. L Ranasinghe,J. A. S. T. Jayakody,M. C. L. De Silva,G. Thilakarathne,G. M. R. I. Godaliyadda,H. M. V. R. Herath,M. P. B. Ekanayake,S. K. Navaratnarajah

Main category: cs.CV

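TL;DR: This paper proposes a low-cost, portable method for rapid soil texture determination based on multispectral imaging (MSI) and machine learning, which predicts soil particle composition (clay, silt, and sand content) and the twelve USDA texture classes with high accuracy.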

Details Motivation: Conventional laboratory particle-size analysis is slow and labour intensive, while existing sensing techniques are either costly or too coarse for routine field-scale deployment. Method: Designs and builds a low-cost in-house multispectral imaging system with 13 spectral bands spanning 365 nm to 940 nm; regression models predict the percentages of the three fractions, and the 12 USDA texture classes are predicted both by a direct classifier and by indirect classification (mapping the regressed composition through the USDA texture triangle). Result: On mixture soil samples, composition prediction reaches an R^2 of up to 0.99 and texture classification accuracy exceeds 99%. Conclusion: MSI combined with data-driven modeling enables accurate, non-destructive, field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture. Abstract: Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour intensive laboratory particle size tests, while many sensing alternatives are either costly or too coarse to support routine field scale deployment. This paper proposes a robust and field deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost effective in-house MSI device operating from 365 nm to 940 nm to capture thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination R^2 up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field deployable soil texture characterization suitable for geotechnical screening and precision agriculture.
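The indirect classification step (regressed composition mapped to a class through the USDA texture triangle) can be illustrated with a small rule-based lookup. This is a minimal sketch, not the authors' implementation: the function name `usda_texture_class` is hypothetical, and the thresholds are the commonly cited USDA triangle boundaries, which should be checked against the official USDA definition before any real use.

```python
def usda_texture_class(sand: float, silt: float, clay: float) -> str:
    """Map sand/silt/clay percentages to one of the twelve USDA texture classes.

    Thresholds are the commonly cited triangle boundaries (an assumption of
    this sketch); inputs are percentages expected to sum to roughly 100.
    """
    if abs(sand + silt + clay - 100.0) > 1.0:
        raise ValueError("sand + silt + clay must sum to about 100")

    # Rules are checked in order, so later rules can rely on earlier ones failing.
    if silt + 1.5 * clay < 15:
        return "sand"
    if silt + 2 * clay < 30:
        return "loamy sand"
    if (7 <= clay < 20 and sand > 52) or (clay < 7 and silt < 50):
        return "sandy loam"
    if 7 <= clay < 27 and 28 <= silt < 50 and sand <= 52:
        return "loam"
    if (silt >= 50 and 12 <= clay < 27) or (50 <= silt < 80 and clay < 12):
        return "silt loam"
    if silt >= 80 and clay < 12:
        return "silt"
    if 20 <= clay < 35 and silt < 28 and sand > 45:
        return "sandy clay loam"
    if 27 <= clay < 40 and 20 < sand <= 45:
        return "clay loam"
    if 27 <= clay < 40 and sand <= 20:
        return "silty clay loam"
    if clay >= 35 and sand > 45:
        return "sandy clay"
    if clay >= 40 and silt >= 40:
        return "silty clay"
    if clay >= 40 and sand <= 45 and silt < 40:
        return "clay"
    # Noisy regressed values can land outside every rule; flag them explicitly.
    return "unclassified"


if __name__ == "__main__":
    print(usda_texture_class(sand=40, silt=40, clay=20))
```

In a pipeline like the one described, the three inputs would come from the regression models, so small regression errors near a class boundary can flip the label; that is one reason a direct classifier may be evaluated alongside the indirect route.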

[104] A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang,Yabin Zhang,Yunhe Gao,Maya Varma,Clemence Mottez,Faidra Patsatzi,Jiaming Liu,Jin Long,Jean-Benoit Delbrouck,Sergios Gatidis,Akshay S. Chaudhari,Curtis P. Langlotz

Main category: cs.CV

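TL;DR: This paper proposes CheXficient, a model that uses active, principled data curation to match or exceed the performance of full-data training while using only 22.7% of the chest X-ray data and under 27.3% of the compute, and that generalizes markedly better on long-tailed or rare conditions.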

Details Motivation: Pretraining of medical imaging foundation models suffers from large-scale data redundancy, severe class imbalance, and disregard for heterogeneity in data quality, which together cause biased representations and computational inefficiency. Method: Introduces an active, principled data curation mechanism during pretraining that builds a compact, high-quality dataset from informative samples only, and uses it to develop CheXficient, a lightweight and efficient multimodal CXR foundation model. Result: On 20 benchmarks spanning 5 task types, CheXficient matches or surpasses full-data training and other large-scale pretrained models; it systematically prioritizes under-represented samples, improving generalization on long-tailed or rare conditions. Conclusion: Active data curation is a viable and efficient alternative to indiscriminate dataset enlargement and offers practical guidance for efficient pretraining and downstream adaptation of medical vision-language foundation models. Abstract: Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.

[105] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia,Chaoya Jiang,Shikun Zhang,Wei Ye

Main category: cs.CV

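TL;DR: This paper proposes Diagnostic-driven Progressive Evolution (DPE), which diagnoses model weaknesses, dynamically generates targeted multimodal training data, and reinforces the model iteratively, yielding continual gains for Large Multimodal Models under open task distributions.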

Details Motivation: Current LMM training relies on static data and fixed recipes, making it hard to find capability blind spots and apply dynamic, targeted reinforcement. Inspired by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, the authors propose DPE to achieve closed-loop capability evolution. Method: DPE builds a diagnose-generate-reinforce spiral loop: (1) multiple agents collaborate, using tools such as web search and image editing, to annotate and quality-control massive unlabeled multimodal data and produce diverse, realistic samples; (2) failure attribution identifies specific weaknesses, the data mixture is adjusted dynamically, and agents are guided to generate weakness-focused data for targeted reinforcement. Result: Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across 11 benchmarks, supporting DPE as a scalable paradigm for continual LMM training under open task distributions. Conclusion: DPE is a scalable, closed-loop, targeted paradigm for continual LMM training that overcomes the limitations of static training and fills capability gaps on open-ended tasks. Abstract: As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.

[106] SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

Qinfeng Zhu,Yunxi Jiang,Lei Fan

Main category: cs.CV

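TL;DR: This paper proposes SO3UFormer, a rotation-robust architecture for panoramic semantic segmentation. It combines an intrinsic feature formulation, quadrature-consistent spherical attention, and a gauge-aware relative positional mechanism to prevent the performance collapse of spherical Transformers when the camera is not gravity-aligned, and it clearly outperforms existing methods on the Pose35 dataset and under arbitrary SO(3) rotations.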

Details Motivation: Real-world captures often violate the gravity-aligned assumption, so standard spherical Transformers overfit latitude cues and degrade sharply under 3D reorientation. Method: Proposes the SO3UFormer architecture with three geometric components: (1) an intrinsic feature formulation that removes absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; (3) a gauge-aware relative positional mechanism based on tangent-plane projected angles and discrete gauge pooling. Training additionally uses index-based spherical resampling and a logit-level SO(3)-consistency regularizer. Result: Achieves 72.03 mIoU on Pose35 and retains 70.67 mIoU under arbitrary full SO(3) rotations, far above the baseline SphereUFormer (25.26 mIoU). Conclusion: Through geometry-driven design, SO3UFormer is highly robust to perturbations of the spherical coordinate frame and offers a reliable solution for panoramic understanding under unstable camera poses. Abstract: Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
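To make the idea of attention that "accounts for non-uniform sampling densities" concrete, the sketch below weights each key by the solid angle it covers. It is an illustration under stated assumptions, not the paper's method: it assumes an equirectangular grid (the paper's sampling scheme may differ), and the names `equirect_row_solid_angles` and `quadrature_weighted_softmax` are hypothetical.

```python
import math


def equirect_row_solid_angles(height: int, width: int) -> list[float]:
    """Solid angle of one pixel in each row of an equirectangular grid.

    Midpoint rule: a pixel centred at latitude lat covers about
    cos(lat) * d_lat * d_lon steradians, so polar rows get tiny weights.
    """
    d_lat = math.pi / height
    d_lon = 2.0 * math.pi / width
    return [
        math.cos(-math.pi / 2.0 + (row + 0.5) * d_lat) * d_lat * d_lon
        for row in range(height)
    ]


def quadrature_weighted_softmax(logits: list[float], weights: list[float]) -> list[float]:
    """Softmax over keys where each key is scaled by its quadrature weight.

    Equivalent to adding log(weight) to each logit, so densely sampled
    regions (for example near the poles) do not dominate the attention sum.
    """
    peak = max(logits)
    scaled = [w * math.exp(x - peak) for x, w in zip(logits, weights)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    row_weights = equirect_row_solid_angles(height=64, width=128)
    # The per-pixel weights should integrate to the sphere's area, 4*pi.
    print(sum(row_weights) * 128, 4.0 * math.pi)
```

With equal logits the attention mass becomes proportional to solid angle rather than to the number of samples, which is the property a quadrature-consistent formulation is meant to provide.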

[107] Towards Multimodal Domain Generalization with Few Labels

Hongzhao Li,Hao Dong,Hualei Wan,Shupan Li,Mingliang Xu,Muhammad Haris Khan

Main category: cs.CV

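TL;DR: This paper introduces a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), and designs a unified framework combining consistency regularization, disagreement-aware regularization, and cross-modal prototype alignment, which significantly outperforms baselines on the first SSMDG benchmarks.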

Details Motivation: No existing approach handles multimodality, domain generalization, and semi-supervised learning together: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning ignores domain shift, and semi-supervised domain generalization methods are limited to a single modality. Method: Proposes a unified framework with three core components: (1) consensus-driven consistency regularization based on confident fused-unimodal consensus; (2) disagreement-aware regularization for ambiguous non-consensus samples; (3) cross-modal prototype alignment, which uses cross-modal translation to obtain domain- and modality-invariant representations. Result: Establishes the first SSMDG benchmarks and significantly outperforms strong baselines in both standard and missing-modality scenarios. Conclusion: The framework effectively integrates semi-supervised learning, multimodal modeling, and domain generalization, improving robustness and generalization under realistic constraints such as few labels, multiple source domains, and missing modalities. Abstract: Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.

[108] Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

Haofan Wu,Nay Aung,Theodoros N. Arvanitis,Joao A. C. Lima,Steffen E. Petersen,Le Zhang

Main category: cs.CV

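TL;DR: This paper proposes Chain of Flow (COF), an ECG-driven generative framework that reconstructs individualised 4D cardiac structure and motion from a single cardiac cycle. It integrates cine-CMR and 12-lead ECG to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics, moving cardiac digital twins from task-specific predictors toward manipulable, fully generative virtual hearts.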

Details Motivation: Existing Cardiac Digital Twin (CDT) frameworks are limited to task-specific predictors and cannot model a patient-specific, manipulable virtual heart; a clinically actionable CDT must support individualised anatomical and physiological reconstruction, state updates from multimodal signals, and a broad range of downstream simulations. Method: Proposes Chain of Flow (COF), a generative framework that takes ECG as input and produces full 4D cardiac structure and motion; cine-CMR and 12-lead ECG are used jointly during training to learn a unified latent representation of cardiac geometry, electrophysiology, and motion dynamics. Result: Evaluation on multiple cohorts shows that COF accurately recovers cardiac anatomy, chamber-wise function, and dynamic motion patterns; the reconstructed 4D hearts support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. Conclusion: COF enables full 4D cardiac reconstruction directly from ECG, raising cardiac digital twins from narrow predictive models to a generative, manipulable, patient-specific virtual heart framework. Abstract: A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.

[109] OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality

Federico Nesti,Gianluca D'Amico,Mauro Marinoni,Giorgio Buttazzo

Main category: cs.CV

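TL;DR: This paper proposes a multi-modal augmented reality framework that inserts photorealistic virtual objects into real railway video sequences to address the scarcity of high-quality annotated data for railway obstacle detection, and releases a public dataset, OSDaR-AR.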

Details Motivation: Safety-critical railway tasks such as obstacle detection lack high-quality annotated data; existing simulators suffer from a sim-to-real gap, while simple image-masking techniques lack spatio-temporal coherence. Method: Built on Unreal Engine 5, the pipeline fuses LiDAR point clouds with INS/GNSS data to place virtual objects accurately and stably in real railway sequences (OSDaR23); a segmentation-based refinement strategy for INS/GNSS data further improves the realism of the augmented sequences. Result: Produces spatio-temporally coherent, realistic augmented sequences and releases the public dataset OSDaR-AR to support development of next-generation railway perception systems. Conclusion: The multi-modal AR framework narrows the gap between simulation and reality and provides a high-quality, scalable data augmentation approach for training railway perception models. Abstract: Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the "sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: https://syndra.retis.santannapisa.it/osdarar.html

[110] WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

Runwei Guan,Shaofeng Liang,Ningwei Ouyang,Weichen Fei,Shanliang Yao,Wei Dai,Chenhao Ge,Penglei Sun,Xiaohui Zhu,Tao Huang,Ryan Wen Liu,Hui Xiong

Main category: cs.CV

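TL;DR: This paper presents the WaterVideoQA benchmark and NaviMind, a multi-agent neuro-symbolic system, to improve the cognitive reasoning of Autonomous Surface Vessels (ASVs) in complex waterway environments.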

Details Motivation: Existing autonomous navigation systems perform well at passive perception but fall short in knowledge-driven, interactive environmental cognition; in high-stakes maritime navigation in particular, the gap between raw visual perception and complex cognitive reasoning must be bridged. Method: Builds WaterVideoQA, the first large-scale video question answering benchmark for all-waterway environments, and proposes NaviMind, a multi-agent neuro-symbolic system that combines Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification. Result: Experiments show that the proposed framework significantly surpasses existing baselines and enables more intelligent, trustworthy interaction in dynamic maritime environments. Conclusion: The work moves ASVs from superficial pattern matching toward regulation-compliant, interpretable decision-making. Abstract: While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

[111] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

Wenhui Tan,Xiaoyi Yu,Jiaze Li,Yijing Chen,Jianzhong Ju,Zhenbo Luo,Ruihua Song,Jian Luan

Main category: cs.CV

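TL;DR: This paper proposes MSJoE, a framework that jointly optimizes a multimodal large language model (MLLM) and a lightweight key-frame sampler for efficient long-form video understanding. A frozen CLIP model produces a query-frame similarity matrix, the sampler selects the most informative key frames for the MLLM, and the approach yields clear gains on several long-video QA benchmarks.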

Details Motivation: Efficient long-form video understanding remains a fundamental challenge for MLLMs, and existing methods typically process all frames, which is computationally expensive and highly redundant. Method: In MSJoE, the MLLM first generates several queries describing different visual perspectives; a frozen CLIP model computes a similarity matrix between the queries and all frames; a lightweight sampler predicts key-frame sampling weights from this matrix and selects a compact subset of key frames; the MLLM answers from these frames; and the MLLM and sampler are jointly optimized with reinforcement learning. Result: On VideoMME, LongVideoBench, LVBench, and MLVU, MSJoE improves accuracy by 8.0% over the base MLLM and is 1.1% above the strongest baseline; a new long-video QA dataset with 2.8K videos and 7K QA pairs is also collected. Conclusion: MSJoE supports the assumption that a small set of key frames suffices for video question answering; co-evolving the MLLM and sampler improves both computational efficiency and understanding performance. Abstract: Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
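The selection step described above (query-frame similarity matrix, then per-frame weights, then a compact set of key frames) can be sketched as follows. This is only an illustration: MSJoE's sampler is a learned module trained with reinforcement learning, whereas the stand-in here is a fixed heuristic (max over queries, softmax, top-k). The function name `select_key_frames` and the toy similarity values are hypothetical.

```python
import math


def select_key_frames(similarity: list[list[float]], k: int) -> tuple[list[int], list[float]]:
    """Pick k frame indices from a query-frame similarity matrix.

    similarity[q][f] is the similarity between query q and frame f.
    Heuristic stand-in for a learned sampler: score each frame by its best
    matching query, turn scores into weights with a softmax, keep the top k.
    """
    num_frames = len(similarity[0])
    scores = [max(row[f] for row in similarity) for f in range(num_frames)]

    # Numerically stable softmax over frames gives the sampling weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    top = sorted(range(num_frames), key=lambda f: weights[f], reverse=True)[:k]
    return sorted(top), weights  # chronological order for the downstream model


if __name__ == "__main__":
    toy_similarity = [
        [0.1, 0.9, 0.2, 0.3],  # query 0 vs. frames 0..3
        [0.8, 0.1, 0.2, 0.1],  # query 1 vs. frames 0..3
    ]
    frames, weights = select_key_frames(toy_similarity, k=2)
    print(frames)
```

Taking the maximum over queries keeps a frame if any single query matches it strongly, which reflects the idea that different queries cover different visual perspectives; a learned sampler could of course weigh queries quite differently.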

[112] pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

Shentong Mo,Xufang Luo,Dongsheng Li

Main category: cs.CV

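TL;DR: This paper proposes pMoE, a Mixture-of-Experts prompt tuning method that uses expert-specific prompt tokens and a learnable dispatcher to combine knowledge from multiple domain experts, achieving a strong trade-off between performance and efficiency on 47 visual adaptation tasks.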

Details Motivation: Existing parameter-efficient fine-tuning methods usually draw on a single pre-trained model (general or medical), overlooking the potential synergy of combining knowledge from multiple domains. Method: Proposes pMoE, which introduces expert-specific prompt tokens and a learnable dispatcher that dynamically routes tokens at each prompt layer to optimize the contribution of each domain expert during adaptation. Result: On 47 classification and segmentation adaptation tasks in general and medical domains, pMoE significantly improves performance and offers the best balance between computational efficiency and adaptation effectiveness. Conclusion: Combining knowledge from multiple domain experts makes the model more versatile and broadly applicable. Abstract: Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach typically overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and a learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating domain knowledge from diverse experts, the proposed pMoE significantly enhances the model's versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance with a large margin of improvements but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.

[113] Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler,Daniel Matthes,Finn Gerdts,Patrick Frenzel,Torsten Warnke,Matthias Englert,Tina Koevari,Mirco Fuchs

Main category: cs.CV

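TL;DR: This paper presents a fully automated framework for reconstructing stroke rate and velocity from panned and zoomed video recordings, applicable to all canoe sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). It uses YOLOv8 detection, U-Net calibration, optical-flow tracking, and pose- or bounding-box-based feature extraction to deliver accurate kinematic analysis (RRMSE < 0.025) without GPS or on-boat sensors.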

Details Motivation: GPS is the gold standard but is not always available, so automated video analysis that needs neither on-boat sensors nor manual annotation is required to give coaches feedback. Method: YOLOv8 detects buoys and athletes, and the known buoy grid is used to estimate homographies; a U-Net learns a boat-specific athlete offset to locate the boat tip precisely; optical flow provides robust tracking for multi-athlete boat types; stroke rate is extracted from pose estimates or from the athlete bounding boxes. Result: Against GPS ground truth from elite competitions, the velocity RRMSE is 0.020 ± 0.011 (rho = 0.956) and the stroke rate RRMSE is 0.022 ± 0.024 (rho = 0.932). Conclusion: The framework delivers accurate, fully automated, discipline-independent video-based kinematic analysis for coaches without dedicated hardware or manual intervention. Abstract: Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-Net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
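One plausible way to turn a per-frame athlete signal (for example bounding-box height) into a stroke rate is to find its dominant frequency. The sketch below does this with a plain discrete Fourier transform restricted to a frequency band; it is an assumption-laden illustration, since the abstract does not say how the authors extract the rate. The function name `dominant_frequency_hz`, the 0.5 to 3.0 Hz search band, and the synthetic signal are all hypothetical.

```python
import cmath
import math


def dominant_frequency_hz(signal: list[float], fps: float,
                          f_min: float = 0.5, f_max: float = 3.0) -> float:
    """Return the strongest frequency (Hz) of a periodic signal within a band.

    Uses a direct DFT over the bins that fall inside [f_min, f_max]; the band
    is an assumed range of plausible stroke frequencies, not a sourced value.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant offset

    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fps / n
        if not (f_min <= freq <= f_max):
            continue
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq


if __name__ == "__main__":
    fps = 50.0
    # Synthetic bounding-box height oscillating at 1.5 Hz around 100 px.
    heights = [100.0 + 5.0 * math.sin(2.0 * math.pi * 1.5 * t / fps)
               for t in range(500)]
    strokes_per_minute = 60.0 * dominant_frequency_hz(heights, fps)
    print(strokes_per_minute)
```

For a time-varying stroke rate profile, the same estimate would be computed over a sliding window; the window length then trades frequency resolution against temporal resolution.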

[114] Cross-Task Benchmarking of CNN Architectures

Kamal Sherawat,Vikrant Bhati

Main category: cs.CV

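TL;DR: This paper compares five ResNet-18-based CNN variants (a vanilla CNN, a hard-attention CNN, soft-attention CNNs with local and global feature attention, and ODConv) on image classification, segmentation, and time series analysis. The results indicate that attention mechanisms and dynamic convolution improve accuracy, efficiency, and computational performance, with ODConv standing out on morphologically complex images.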

Details Motivation: To explore CNN architectures that improve accuracy, efficiency, and generalization on tasks across multiple data modalities (images, time series, and others). Method: Based on ResNet-18, builds and compares five CNN variants: a vanilla CNN, a hard-attention CNN, soft-attention CNNs with local and global attention, and ODConv; experiments are run on Tiny ImageNet, Pascal VOC, and the UCR time series datasets. Result: Attention mechanisms and dynamic convolution methods outperform conventional CNNs overall; ODConv has a clear advantage on morphologically complex images; dynamic CNNs improve feature representation and cross-task generalization through adaptive kernel modulation. Conclusion: Dynamic CNNs, ODConv in particular, are a stronger architectural choice for multi-modality data analysis and point to promising directions in neural network engineering. Abstract: This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.

[115] ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Jiayu Chen,Ruoyu Lin,Zihao Zheng,Jingxin Li,Maoliang Li,Guojie Luo,Xiang chen

Main category: cs.CV

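TL;DR: This paper proposes ToProVAR, a framework that uses attention entropy to analyze the semantic projections of VAR models, identifies sparsity along the token, layer, and scale dimensions, and accelerates generation while preserving quality.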

Details Motivation: Visual Autoregressive (VAR) models generate high-quality images but hit a severe efficiency bottleneck in later stages; existing methods such as FastVAR and SkipVAR rely on heuristic skipping strategies and struggle to balance efficiency with semantic fidelity. Method: Proposes an attention-entropy-based optimization framework that characterizes parameter dynamics under varying token granularity, semantic scope, and generation scale; on this basis it identifies sparsity patterns along three key dimensions (token, layer, and scale) and designs fine-grained optimization strategies. Result: Achieves up to 3.4x acceleration on Infinity-2B and Infinity-8B while largely preserving semantic fidelity and fine detail, outperforming prior methods. Conclusion: ToProVAR offers a principled rather than heuristic way to optimize VAR models and eases the trade-off between efficiency and quality. Abstract: Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
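Attention entropy, the quantity this analysis relies on, is the Shannon entropy of an attention distribution: low entropy means a query attends to a few keys, high entropy means its attention is spread out. The sketch below only computes the quantity for one attention row; how ToProVAR turns it into token, layer, and scale sparsity decisions is not specified in the abstract, and the function names here are hypothetical.

```python
import math


def attention_entropy(weights: list[float]) -> float:
    """Shannon entropy (in nats) of one attention row that sums to 1."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def normalized_attention_entropy(weights: list[float]) -> float:
    """Entropy divided by its maximum, log(number of keys), giving a value in [0, 1]."""
    if len(weights) < 2:
        return 0.0
    return attention_entropy(weights) / math.log(len(weights))


if __name__ == "__main__":
    focused = [0.97, 0.01, 0.01, 0.01]  # attention concentrated on one key
    diffuse = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly
    print(normalized_attention_entropy(focused))
    print(normalized_attention_entropy(diffuse))
```

Normalizing by log(number of keys) makes rows of different lengths comparable, which matters when entropy is compared across generation scales with different token counts.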

[116] OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Junuk Cha,Jihyeon Kim,Han-Mu Park

Main category: cs.CV

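TL;DR: This paper proposes OpenFS, an open-source approach to fingerspelling recognition and synthesis. Implicit signing-hand detection, a monotonic alignment loss, and a frame-wise letter-conditioned generator address signing-hand ambiguity, the peaky behavior of the CTC loss, and the out-of-vocabulary (OOV) problem, improving recognition performance and enabling a new synthetic benchmark, FSNeo.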

Details Motivation: To address signing-hand ambiguity, the peaky behavior of the CTC loss, and the out-of-vocabulary (OOV) problem in fingerspelling recognition, and thereby help bridge the communication gap between Deaf and hearing communities. Method: Proposes a multi-hand-capable fingerspelling recognizer that performs implicit signing-hand detection through a dual-level positional encoding and a signing-hand focus (SF) loss; introduces a monotonic alignment (MA) loss, independent of CTC, that regularizes cross-attention and enforces temporal consistency; and designs a frame-wise letter-conditioned generator that synthesizes pose sequences for OOV words, which is used to build the new synthetic benchmark FSNeo. Result: Comprehensive experiments show state-of-the-art fingerspelling recognition performance and validate the proposed recognizer and generator; code and data are open source (https://github.com/JunukCha/OpenFS). Conclusion: With implicit signing-hand detection, new loss functions, and a synthesis generator, OpenFS overcomes key challenges of conventional fingerspelling recognition and contributes both data and methods to the field. Abstract: Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Code and data are available at: https://github.com/JunukCha/OpenFS.

[117] MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Feng Guo,Jiaxiang Liu,Yang Li,Qianqian Shi,Mingkun Xu

Main category: cs.CV

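TL;DR: This paper introduces MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, containing 24,726 MRI slices and approximately 200,000 semantically enriched multimodal instructions, together with MM-NeuroOnco-Bench, an evaluation benchmark with a rejection-aware setting. NeuroOnco-GPT, trained on this dataset, gains 27% absolute accuracy on diagnostic questions.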

Details Motivation: Existing public datasets are limited in annotation richness and diagnostic semantics, which hinders brain tumor diagnosis models that combine lesion detection with clinically interpretable reasoning. Method: Builds the MM-NeuroOnco multimodal instruction dataset (with an automated multi-model collaborative annotation and quality control pipeline) and the MM-NeuroOnco-Bench evaluation benchmark (manually annotated, rejection-aware), and proposes NeuroOnco-GPT, obtained by fine-tuning on the dataset. Result: Among ten representative models, the best accuracy on diagnosis-related questions is only 41.88% (Gemini 3 Flash); after fine-tuning, NeuroOnco-GPT achieves a 27% absolute accuracy improvement on diagnostic questions. Conclusion: MM-NeuroOnco advances clinically grounded multimodal diagnostic reasoning for brain tumors and shows that high-quality semantic instruction data is key to improving diagnostic ability. Abstract: Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco

[118] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao,Frederik Hauke,Juliana De Castilhos,Sven Nebelung,Daniel Truhn

Main category: cs.CV

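TL;DR: This paper proposes a multi-agent framework based on contrastive adjudication for distinguishing visually hard-to-separate diseases in medical imaging (e.g., melanoma vs. atypical nevus, pulmonary edema vs. pneumonia) in a zero-shot setting. Accuracy on dermoscopy data improves by 11 percentage points, but overall performance remains insufficient for clinical deployment.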

Details Motivation: Most medical imaging research focuses on automating routine clinical workflows, while distinguishing diseases that look highly similar yet differ substantially in clinical management remains an important and underexplored problem. Method: Builds a multi-agent framework based on contrastive adjudication and benchmarks it on two imaging-only proxy diagnostic tasks: melanoma vs. atypical nevus, and pulmonary edema vs. pneumonia. Result: Diagnostic accuracy on dermoscopy data improves by 11 percentage points and unsupported claims decrease on qualitative samples; overall performance is still insufficient for clinical deployment and is limited by the inherent uncertainty of human annotations and the absence of clinical context. Conclusion: This pilot study offers early insights into how zero-shot agents perform in visually confounded scenarios and highlights the limits of current methods for real-world clinical translation. Abstract: The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

[119] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu,Zixuan Wang,Guangyuan Wang,Li Hu,Zhongyi Zhang,Peng Zhang,Bang Zhang,Song-Hai Zhang

Main category: cs.CV

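TL;DR: This paper proposes UCM, a framework that unifies long-term memory and precise camera control through a time-aware positional encoding warping mechanism, and designs an efficient dual-stream diffusion transformer for high-fidelity video generation, significantly improving long-term scene consistency and camera controllability.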

Details Motivation: Existing video-generation-based world models struggle with long-term content consistency and with precise camera control from user inputs; explicit 3D reconstruction methods lack flexibility, while methods that rely on previously generated frames lack spatial correspondence, which limits controllability and consistency. Method: Proposes UCM, which unifies long-term memory and camera control with a time-aware positional encoding warping mechanism; designs an efficient dual-stream diffusion transformer to reduce computational overhead; introduces a scalable data curation strategy based on point-cloud rendering that supports training on over 500K monocular videos. Result: Experiments on real-world and synthetic benchmarks show that UCM significantly outperforms SOTA methods in long-term scene consistency while achieving precise camera controllability in high-fidelity video. Conclusion: UCM addresses the trade-off between long-term consistency and camera control in world models, offering a more flexible, controllable, and consistent paradigm for interactive environment simulation. Abstract: World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

[120] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Camile Lendering,Erkut Akdag,Egor Bondarev

Main category: cs.CV

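TL;DR: This paper proposes SubspaceAD, a training-free method for industrial visual anomaly detection that extracts patch features of normal images with DINOv2, models the low-dimensional subspace of normal variation with PCA, and uses the reconstruction residual as the anomaly score, reaching SOTA performance in one-shot and few-shot settings.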

Details Motivation: Existing few-shot anomaly detection methods rely on memory banks, auxiliary datasets, or multimodal tuning of vision-language models; the authors question whether such complexity is necessary given the feature representations of vision foundation models. Method: SubspaceAD is training-free: 1) a frozen DINOv2 backbone extracts patch-level features from a few normal images; 2) a PCA model is fit to these features to estimate the low-dimensional subspace of normal variation; at inference, the anomaly score is computed from the reconstruction residual with respect to this subspace. Result: In the one-shot setting, image-level and pixel-level AUROC reach 98.0% and 97.6% on MVTec-AD, and 93.3% and 98.3% on VisA, surpassing prior SOTA methods. Conclusion: SubspaceAD shows that simple PCA modeling of the foundation-model feature subspace is enough for high-performance few-shot anomaly detection, without training, prompt tuning, or memory banks, combining simplicity, interpretability, and statistical grounding. Abstract: Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
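
The two stages described in the abstract (fit PCA to normal patch features, then score by reconstruction residual) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature extractor is omitted, features are assumed to arrive as a plain (n_patches, dim) array, and the function names and the `var_keep` threshold are hypothetical choices.

```python
import numpy as np

def fit_normal_subspace(features, var_keep=0.99):
    """Fit PCA to normal patch features of shape (n_patches, dim).

    var_keep is an illustrative assumption: keep enough principal
    directions to explain this fraction of the variance.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_keep) + 1)
    return mean, vt[:k]

def anomaly_scores(features, mean, components):
    """Squared reconstruction residual of each feature w.r.t. the subspace."""
    centered = features - mean
    projected = centered @ components.T @ components
    residual = centered - projected
    return np.sum(residual ** 2, axis=1)
```

Patches whose features lie close to the fitted subspace receive scores near zero, while patches with a large off-subspace component receive high scores. How patch scores are aggregated into an image-level score is not stated in the abstract and is left out here.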

[121] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

Xinglong Luo,Ao Luo,Zhengning Wang,Yueqi Yang,Chaoyu Feng,Lei Lei,Bing Zeng,Shuaicheng Liu

Main category: cs.CV

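TL;DR: This paper proposes DMAligner, a diffusion-based image alignment framework that uses alignment-oriented view synthesis to overcome the limitations of traditional optical-flow methods under occlusion and illumination changes.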

Details Motivation: Existing optical-flow-based image alignment methods are vulnerable to occlusion and illumination changes, which degrades alignment quality and downstream task accuracy. Method: Proposes a Dynamics-aware Diffusion Training approach and a Dynamics-aware Mask Producing (DMP) module, combined with the self-built DSIA dataset (over 30K image pairs), to perform alignment via conditional image generation. Result: Significantly outperforms existing methods on the DSIA benchmark and on several video datasets, with both qualitative and quantitative results confirming its advantage. Conclusion: DMAligner offers a new generative paradigm for image alignment that avoids the inherent drawbacks of optical-flow methods and shows stronger robustness and practicality. Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.

[122] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang,Leigang Qu,Tianyu Yang,Xiangzhao Hao,Yifan Xu,Haiyun Guo,Jinqiao Wang

Main category: cs.CV

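TL;DR: This paper proposes WISER, a training-free framework that unifies the text-to-image (T2I) and image-to-image (I2I) retrieval paths and performs zero-shot composed image retrieval (ZS-CIR) through a "retrieve-verify-refine" pipeline, significantly improving performance.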

Details Motivation: Existing ZS-CIR methods convert the multimodal query into a single modality, either text or image, which respectively loses visual detail or struggles with complex semantic modifications; a paradigm is needed that adaptively fuses the strengths of both while modeling intent and uncertainty. Method: WISER uses a training-free three-stage pipeline: 1) Wider Search: edited captions and edited images are generated in parallel for dual-path retrieval; 2) Adaptive Fusion: a verifier assesses confidence, dynamically fusing the two paths for high-confidence results and triggering refinement for low-confidence ones; 3) Deeper Thinking: structured self-reflection produces refinement suggestions that guide the next retrieval round. Result: WISER achieves relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods and surpasses many training-dependent methods, showing strong generalization. Conclusion: WISER shows that the T2I and I2I paths can be combined effectively without training; its intent-aware and uncertainty-aware mechanisms offer a new approach to ZS-CIR with practical deployment value. Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods.
Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

[123] Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images

Zhangjian Ji,Huijia Yan,Shaotong Qiao,Kai Feng,Wei Wei

Main category: cs.CV

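TL;DR: This paper proposes a small object detection algorithm based on Spatial Laplacian Pyramid Attention (SLPA) and a Multi-Scale Feature Enhancement Module (MSFEM), with deformable convolutions for feature alignment, significantly improving small object detection in aerial images.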

Details Motivation: Objects in aerial images are small and densely, non-uniformly distributed, which makes conventional detectors inefficient and weak in feature representation. Method: 1) a Spatial Laplacian Pyramid Attention (SLPA) module is integrated after each stage of ResNet-50 to strengthen local-region features of small objects; 2) a Multi-Scale Feature Enhancement Module (MSFEM) is incorporated into the lateral connection of the C5 layer of the FPN; 3) deformable convolutions align features during cross-level fusion in the FPN. Result: On the VisDrone and DOTA benchmarks, the method outperforms the original algorithm on small object detection. Conclusion: SLPA, MSFEM, and deformable convolutions together improve the feature representation and localization of small objects, easing key challenges of small object detection in aerial images. Abstract: Detecting objects in aerial images faces significant challenges, including the small size and the dense, non-uniform distribution of objects over high-resolution images, which make detection inefficient. Thus, in this paper, we propose a small object detection algorithm for aerial images based on Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement. Firstly, in order to improve the feature representation of ResNet-50 on small objects, we present a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Secondly, to enhance the model's semantic understanding and feature representation, we design a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connections of the C5 layer for building the Feature Pyramid Network (FPN). Finally, the feature representation quality of a traditional feature pyramid network suffers because the features are not aligned when the upper and lower layers are fused. To handle this, we use deformable convolutions to align the features when fusing the upper and lower levels of the Feature Pyramid Network, which helps the model detect and recognize small objects. Extensive experimental results on two benchmark datasets, VisDrone and DOTA, demonstrate that our improved model performs better for small object detection in aerial images compared to the original algorithm.

[124] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

Aashish Rai,Angela Xing,Anushka Agarwal,Xiaoyan Cong,Zekun Li,Tao Lu,Aayush Prakash,Srinath Sridhar

Main category: cs.CV

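TL;DR: This paper proposes PackUV, a new 4D Gaussian representation that maps Gaussian attributes into multi-scale UV atlases for compact, image-native storage, and PackUV-GS, a fitting method that optimizes Gaussian parameters directly in the UV domain, with flow-guided Gaussian labeling and video keyframing to improve temporal consistency and robustness under large motion and disocclusion. The output is compatible with standard video codecs (e.g., FFV1) and supports efficient streaming. Experiments on the new large-scale PackUV-2B dataset (2 billion frames, more than 50 cameras) confirm that the method maintains high-quality rendering on sequences up to 30 minutes long.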

Details Motivation: Existing Gaussian-splatting-based volumetric video reconstruction methods degrade on long sequences and under temporal inconsistency, large motion, and disocclusion, and their outputs are incompatible with conventional video coding pipelines, which hinders practical deployment. Method: Proposes PackUV, which maps 4D Gaussian attributes into structured multi-scale UV atlases; designs the PackUV-GS fitting method, which optimizes Gaussian parameters directly in the UV domain; introduces a flow-guided Gaussian labeling and video keyframing module to separate dynamic from static regions and strengthen temporal consistency. Result: The PackUV format is the first to be compatible with standard video codecs (e.g., FFV1) without quality loss; on PackUV-2B, the method maintains consistently high-quality rendering on sequences up to 30 minutes and clearly outperforms existing baselines. Conclusion: PackUV provides a compact, compatible, and scalable 4D Gaussian representation and reconstruction framework for volumetric video, connecting high-quality reconstruction to streaming deployment. Abstract: Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

[125] D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

Argo Saakyan,Dmitry Solntsev

Main category: cs.CV

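TL;DR: D-FINE-seg is an instance segmentation extension of D-FINE that adds a lightweight mask head and several segmentation-aware training strategies. On the TACO dataset it surpasses YOLO26 in F1-score while remaining real-time, and it ships an open-source end-to-end training and inference framework supporting ONNX, TensorRT, and OpenVINO.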

Details Motivation: Transformer-based real-time instance segmentation methods are still uncommon, while D-FINE performs strongly on detection, which motivates extending it to instance segmentation with efficient deployment. Method: Adds a lightweight mask head to D-FINE and introduces segmentation-aware training, including box-cropped BCE loss, Dice mask loss, auxiliary mask supervision, denoising mask supervision, and an adapted Hungarian matching cost; also builds a cross-platform (ONNX/TensorRT/OpenVINO) end-to-end training and optimized inference pipeline. Result: On TACO, D-FINE-seg achieves a higher F1-score than Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmark while keeping competitive inference latency. Conclusion: D-FINE-seg extends a high-performance real-time detector into an efficient instance segmentation model and provides an open-source, cross-platform, production-ready deployment framework. Abstract: Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head and segmentation-aware training, including box-cropped BCE and Dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. A second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.
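
The abstract names box-cropped BCE and Dice mask losses without giving formulas. The sketch below shows one common way such losses are written, under stated assumptions: single-instance 2D masks, integer pixel boxes in (x0, y0, x1, y1) order, and a smoothing constant `eps`. The function names are hypothetical and the authors' exact formulation may differ.

```python
import numpy as np

def box_cropped_bce(logits, target, box):
    """Binary cross-entropy averaged only over pixels inside the box.

    box = (x0, y0, x1, y1) in integer pixel coordinates (assumed layout).
    """
    x0, y0, x1, y1 = box
    z = logits[y0:y1, x0:x1]
    t = target[y0:y1, x0:x1]
    # Numerically stable BCE-with-logits: max(z, 0) - z*t + log(1 + exp(-|z|)).
    loss = np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss: 1 - (2*|P∩T| + eps) / (|P| + |T| + eps)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    inter = np.sum(p * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(target) + eps))
```

Restricting BCE to the box keeps the many easy background pixels outside the object from dominating the average, while Dice measures overlap and is less sensitive to mask size; combining the two is a frequent design choice in mask heads.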

[126] GeoWorld: Geometric World Models

Zeyu Zhang,Danning Li,Ian Reid,Richard Hartley

Main category: cs.CV

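TL;DR: This paper proposes GeoWorld, a geometric world model built on hyperbolic space that preserves geometric and hierarchical structure among states through a Hyperbolic JEPA and combines it with Geometric Reinforcement Learning for stable multi-step planning, significantly improving long-horizon visual planning.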

Details Motivation: Existing energy-based predictive world models learn latent representations in Euclidean space, neglecting geometric and hierarchical structure among states, and degrade rapidly in long-horizon prediction. Method: Proposes GeoWorld, which uses a Hyperbolic JEPA to map latent representations onto hyperbolic manifolds so as to preserve geometric and hierarchical relations, and introduces Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Result: On CrossTask and COIN, success rate (SR) improves by about 3% in 3-step planning and about 2% in 4-step planning over the SOTA V-JEPA 2. Conclusion: By introducing hyperbolic geometric modeling and Geometric Reinforcement Learning, GeoWorld mitigates latent-representation distortion and long-horizon planning degradation, providing a more robust framework for multi-step visual planning. Abstract: Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
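
The abstract does not say which model of hyperbolic space GeoWorld uses. As an illustration of what mapping Euclidean latents onto a hyperbolic manifold can look like, the sketch below assumes the Poincaré ball with curvature -c, the exponential map at the origin, and the standard geodesic distance; all function names are hypothetical and none of this is taken from the paper.

```python
import numpy as np

def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v into the Poincare ball of curvature -c."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    norm = np.maximum(norm, 1e-12)  # avoid division by zero at the origin
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * c * sq_dist / denom) / np.sqrt(c)
```

Distances in this model grow rapidly near the boundary of the ball, which is the property usually cited as making hyperbolic space a natural fit for tree-like, hierarchical relations among states.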

[127] Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Yiding Sun,Jihua Zhu,Haozhe Cheng,Chaoyi Lu,Zhichuan Yang,Lin Chen,Yaonan Wang

Main category: cs.CV

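TL;DR: This paper proposes PointATA, a two-stage "Align then Adapt" paradigm that efficiently transfers 3D pre-trained models to 4D point cloud video understanding, addressing the modality gap and overfitting, and matching or even surpassing full fine-tuning on several 4D tasks.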

Details Motivation: 4D point cloud video data are scarce, which limits the scalability of self-supervised 4D models, while directly transferring 3D pre-trained models to 4D tasks faces two challenges: the modality gap and overfitting. Method: Proposes the "Align then Adapt" (PointATA) paradigm: Stage 1 uses optimal-transport theory to quantify the distributional discrepancy between 3D and 4D data and trains a point align embedder to reduce the modality gap; Stage 2 keeps the 3D backbone frozen and adds a lightweight point-video adapter and a spatial-context encoder to strengthen temporal modeling and suppress overfitting. Result: PointATA reaches 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation, matching or surpassing full fine-tuning with much better parameter efficiency. Conclusion: PointATA shows that parameter-efficient transfer is a viable route to 4D point cloud video understanding and offers a practical solution where large-scale annotated 4D data are lacking. Abstract: Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

[128] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy

Matthew Sutton,Katrin Amunts,Timo Dickscheid,Christian Schiffer

Main category: cs.CV

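TL;DR: This paper proposes a label-mediated method that requires no paired image-text data: it links microscopy images to area descriptions from the literature to describe cytoarchitectonic images in natural language, and achieves high accuracy on brain area identification.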

Details Motivation: In research and clinical settings such as microscopic brain tissue analysis, high-quality paired image-text data are scarce, which makes joint vision-language modeling difficult. Method: Uses brain area labels to automatically mine area descriptions from the literature as synthetic captions, and couples an existing cytoarchitectonic vision foundation model (CytoNet) to a large language model via an image-to-text training objective. Result: Produces plausible area-level descriptions across 57 brain areas; matches the reference label for in-scope samples with 90.6% accuracy; with the label masked, still recovers the area in an 8-way test with 68.6% accuracy; supports open-set recognition. Conclusion: Weakly supervised, label-mediated pairing is sufficient to connect biomedical vision foundation models to language models, offering a practical vision-language integration recipe for domains that lack fine-grained annotations. Abstract: Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy.
These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.

[129] Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

Paul Kielty,Timothy Hanley,Peter Corcoran

Main category: cs.CV

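TL;DR: This paper proposes Locally Adaptive Decay Surfaces (LADS), an event representation that adjusts the temporal decay parameter according to local signal dynamics, resolving the difficulty fixed-parameter representations have in preserving both spatial structure in still scenes and edge sharpness in moving ones. Experiments show that LADS clearly outperforms baselines on face detection and landmark localization, stays accurate even at 240 Hz, and supports lightweight networks with real-time performance.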

Details Motivation: Conventional event camera representations (e.g., histograms or globally decayed time surfaces) use a single temporal parameter for the whole image, which loses spatial structure in still regions and blurs fast-moving regions. Method: Proposes Locally Adaptive Decay Surfaces (LADS), which adapt the temporal decay at each pixel location to local signal dynamics (event rate, LoG response, high-frequency spectral energy); designs three adaptive strategies and integrates them into the event tensorization pipeline. Result: On public data, LADS improves face detection mAP50 and landmark normalized mean error (NME) at 30 Hz; at 240 Hz it substantially mitigates the accuracy drop, reaching 0.966 mAP50 and 2.44% NME and even surpassing prior work operating at 30 Hz; it also supports lighter networks while keeping real-time performance. Conclusion: LADS shows the importance of context-aware temporal integration for neuromorphic vision and provides a new paradigm and benchmark for high-frequency, real-time human-computer interaction systems based on event cameras. Abstract: Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis.
Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
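
To make the idea of a locally modulated decay concrete, here is a minimal sketch of an event-rate-driven variant. It is not the paper's formulation: the exponential time surface, the box-filtered event count as the local rate, the linear mapping from rate to time constant, and the parameter names (`tau_min`, `tau_max`, `radius`) are all assumptions made for illustration.

```python
import numpy as np

def box_filter(img, radius):
    """Mean over a (2*radius+1)^2 neighborhood, with edge padding."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def adaptive_decay_surface(events, shape, t_ref, tau_min=0.005, tau_max=0.1, radius=2):
    """Time surface whose decay constant shrinks where local event rate is high.

    events: iterable of (x, y, t) with integer pixel coordinates and t <= t_ref.
    """
    last_t = np.full(shape, -np.inf)  # pixels with no event decay to 0
    counts = np.zeros(shape)
    for x, y, t in events:
        last_t[y, x] = max(last_t[y, x], t)
        counts[y, x] += 1
    rate = box_filter(counts, radius)
    norm = rate / rate.max() if rate.max() > 0 else rate
    # Busy regions get a short tau (less blur); quiet regions a long tau.
    tau = tau_max - (tau_max - tau_min) * norm
    surface = np.exp(-(t_ref - last_t) / tau)
    return surface, tau
```

With a single global time constant, the same value would have to serve both the quiet and the busy regions; letting it vary per pixel is what allows old but informative events in still areas to persist while dense activity is kept sharp.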

[130] SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation

Fuhao Zhang,Lei Liu,Jialin Zhang,Ya-Nan Zhang,Nan Mu

Main category: cs.CV

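TL;DR: This paper proposes SpectralMamba-UNet, which models structural and textural information separately through frequency-domain disentanglement, capturing both global anatomical structure and boundary detail in medical image segmentation.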

Details Motivation: Existing state space models (e.g., Vision Mamba) weaken local spatial continuity and high-frequency detail representation because of their one-dimensional serialization, which makes it hard to satisfy the dual need for global structure and fine boundaries in medical image segmentation. Method: Proposes the frequency-disentangled framework SpectralMamba-UNet: 1) the SDM module uses the discrete cosine transform to decompose low- and high-frequency features, with a frequency-domain Mamba modeling global context from the low frequencies while the high frequencies preserve boundary detail; 2) the SCR mechanism performs channel-wise frequency-aware reweighting; 3) the SGF module performs adaptive multi-scale fusion in the decoder. Result: Consistent improvements on five public medical image segmentation benchmarks covering diverse modalities and targets, confirming effectiveness and generalizability. Conclusion: Frequency-domain disentanglement is an effective new paradigm for improving medical image segmentation, and SpectralMamba-UNet offers a unified solution for long-range dependency modeling and local detail preservation. Abstract: Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
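
The DCT-based low/high-frequency split that the SDM module is described as performing can be illustrated on a single 2D feature map. This is a sketch only: the orthonormal DCT-II built by hand, the triangular mask `ky + kx < cutoff`, and the function names are illustrative assumptions, and the Mamba branch itself is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)  # scale the DC row so that m @ m.T is the identity
    return m

def split_frequencies(feat, cutoff):
    """Split a 2D map into low- and high-frequency parts in the DCT domain.

    Coefficients with ky + kx < cutoff are treated as low frequency
    (the mask shape is an assumption, not the paper's choice).
    """
    h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feat @ dw.T  # forward 2D DCT
    ky, kx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    low_mask = (ky + kx) < cutoff
    low = dh.T @ (coeffs * low_mask) @ dw    # inverse DCT of the low band
    high = dh.T @ (coeffs * ~low_mask) @ dw  # inverse DCT of the high band
    return low, high
```

Because the transform is orthonormal and the two masks are complementary, `low + high` reconstructs the input exactly (up to floating-point error), so the split loses no information before the two branches process their bands.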

[131] WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

Xudong Yan,Songhe Feng,Jiaxin Wang,Xin Su,Yi Jin

Main category: cs.CV

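TL;DR: This paper proposes WARM-CAT, a new method for compositional zero-shot learning (CZSL) that uses unsupervised data at test time to adaptively update multimodal prototypes and introduces a dynamic priority queue and cross-modal alignment, substantially mitigating distribution shift and reaching SOTA performance on several benchmark datasets.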

Details Motivation: Existing CZSL methods suffer performance drops at test time because unseen attribute-object compositions shift the label-space distribution. Method: Proposes a mechanism that accumulates knowledge in both textual and visual modalities from unsupervised data; designs an adaptive update weight to control the degree of prototype adjustment; introduces a dynamic priority queue that stores high-confidence images to build visual prototypes, initialized with a warm-start strategy; aligns textual and visual prototypes via multimodal collaborative representation learning. Result: Achieves SOTA performance on four benchmark datasets (including the newly proposed C-Fashion and the refined MIT-States) under both closed-world and open-world settings. Conclusion: The method mitigates distribution shift in CZSL and improves generalization, and it is extensible and practical (code and datasets are open-sourced). Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings.
The source code and datasets are available at https://github.com/xud-yan/WARM-CAT.

[132] FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time

David Dirnfeld,Fabien Delattre,Pedro Miraldo,Erik Learned-Miller

Main category: cs.CV

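TL;DR: This paper proposes a new method based on a generalized Hough transform on the sphere for robustly estimating camera heading from monocular video. It discretizes the unit sphere with a Fibonacci lattice and lets great circles generated from correspondences cast votes, significantly improving accuracy and efficiency under noise and outliers.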

Details Motivation: Existing methods lose accuracy or become computationally expensive under high noise and many outliers, and they lack robustness in particular when estimating camera heading under known rotation (e.g., from an IMU or an optimization algorithm). Method: Generalizes the Hough transform to the unit sphere S(2): each great circle generated by a feature correspondence between two frames casts votes over a range of directions on a spherical grid built from a Fibonacci lattice, so that features unaffected by noise or dynamic objects vote consistently for the true motion direction. Result: Lies on the Pareto frontier of accuracy versus efficiency on three datasets; in SLAM experiments, correcting the heading during camera pose initialization reduces RMSE. Conclusion: The spherical Hough approach offers a more robust, efficient, and practical paradigm for monocular heading estimation, especially under the noise and dynamic interference of real scenes. Abstract: Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera's heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere (S(2)) to estimate the camera's heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
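
The voting scheme described above can be sketched in a few lines. The sketch assumes rotation-compensated unit bearing vectors `x1` and `x2` (so only translation remains), uses the epipolar constraint that the heading is perpendicular to `x1 × x2`, and replaces the paper's range-voting rule with a simple fixed-width band around each great circle. The function names, `n_bins`, and `band` are illustrative assumptions.

```python
import numpy as np

def fibonacci_sphere(n):
    """n approximately uniform unit vectors on the sphere (Fibonacci lattice)."""
    i = np.arange(n)
    z = 1.0 - (2.0 * i + 1.0) / n
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))  # golden-angle increments
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def vote_heading(x1, x2, n_bins=5000, band=0.02):
    """Hough-style vote for the translation direction on the unit sphere.

    x1, x2: (n, 3) unit bearing vectors in frames 1 and 2, rotation removed.
    Each pair constrains the heading to the great circle with normal x1 x x2;
    every bin within `band` of that circle (in |dot product|) gets one vote.
    """
    bins = fibonacci_sphere(n_bins)
    normals = np.cross(x1, x2)
    norms = np.linalg.norm(normals, axis=1, keepdims=True)
    keep = norms[:, 0] > 1e-9  # drop degenerate pairs with no parallax
    normals = normals[keep] / norms[keep]
    votes = (np.abs(bins @ normals.T) < band).sum(axis=1)
    return bins[np.argmax(votes)], votes
```

Two caveats apply to this sketch: a direction and its antipode lie on the same great circles, so the peak is only defined up to sign and a separate check (for example, point depths being positive) is needed to resolve it; and the achievable angular resolution is bounded by the lattice spacing, so `n_bins` and `band` should be chosen together.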

[133] Devling into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang,Zhijin Ge,Bohan Liu,Zheng Fang,Fengfan Zhou,Ruixuan Zhang,Shaokang Wang,Yuyang Luo

Main category: cs.CV

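TL;DR: This paper proposes a comprehensive framework for evaluating transfer-based adversarial attacks, aiming to address the current lack of standardized evaluation criteria, and provides a systematic review and categorization of related work.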

Details Motivation: Transfer-based adversarial attacks lack a unified evaluation framework and criteria, which can bias evaluation results and prevent fair comparison between methods. Method: Surveys hundreds of related works, organizes transfer-based attacks into six categories, and builds a comprehensive evaluation benchmark framework; also summarizes common strategies for improving transferability and common evaluation pitfalls. Result: Presents the first systematic, clearly categorized evaluation framework for transfer-based attacks, identifies six attack categories, and summarizes strategies that enhance transferability along with issues that lead to unfair comparisons. Conclusion: The framework helps standardize transfer-attack research and improve its reproducibility, giving a reliable basis for later security evaluation and defense research. Abstract: Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.

[134] TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi,José Oramas

Main category: cs.CV

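TL;DR: TriLite is a single-stage weakly supervised object localization (WSOL) framework that uses a frozen DINOv2-pretrained ViT and adds only a small number of trainable parameters (<800K). Its TriHead module decomposes patch features into foreground, background, and ambiguous regions, improving object coverage and suppressing spurious responses; it reaches state-of-the-art results on several benchmarks while being more efficient and easier to train.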

Details Motivation: Existing WSOL methods rely on multi-stage pipelines or full fine-tuning of large backbones, which makes training costly, and they commonly suffer from partial object coverage. Method: Proposes TriLite, a single-stage framework that freezes a DINOv2 ViT backbone, adds a lightweight TriHead module that splits patch features into foreground, background, and ambiguous regions, and decouples the classification and localization objectives. Result: Sets a new state of the art on CUB-200-2011, ImageNet-1K, and OpenImages with fewer than 800K trainable parameters, while being significantly more parameter-efficient than prior methods. Conclusion: TriLite shows that a frozen self-supervised ViT plus a minimal learnable module is enough for efficient, high-quality weakly supervised localization, offering a lighter, more robust, and more scalable approach to WSOL. Abstract: Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.

[135] From Calibration to Refinement: Seeking Certainty via Probabilistic Evidence Propagation for Noisy-Label Person Re-Identification

Xin Yuan,Zhiyong Zhang,Xin Xu,Zheng Wang,Chia-Wen Lin

Main category: cs.CV

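TL;DR: This paper proposes CARE, a two-stage calibration-to-refinement framework that tackles noisy labels and sparse samples in person re-identification and significantly improves model robustness.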

Details Motivation: Existing noise-robust person Re-ID methods rely on softmax outputs, which leads to over-confident predictions and the discarding of hard positives, so they struggle with the noisy labels and sparse samples found in real-world settings. Method: Proposes the two-stage CAlibration-to-REfinement (CARE) framework. The calibration stage uses probabilistic evidence calibration (PEC) to break softmax translation invariance and mitigate over-confidence; the refinement stage uses evidence propagation refinement (EPR), comprising the composite angular margin (CAM) metric and the certainty-oriented sphere weighting (COSW) strategy, to separate clean from noisy samples more accurately and weight them dynamically. Result: On Market1501, DukeMTMC-ReID, and CUHK03, under both random and patterned noise, CARE achieves competitive performance. Conclusion: By introducing a probabilistic evidence mechanism and a fine-grained sample selection strategy, CARE improves the robustness and discriminative power of person Re-ID under label noise. Abstract: With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples.
Specifically, the EPR contains two steps: Firstly, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; Secondly, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show that CARE achieves competitive performance.
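
The softmax translation invariance mentioned in the abstract can be seen in a few lines. The sketch below is only an illustration of the general phenomenon: the Dirichlet-style `evidential_uncertainty` is a generic evidential-learning construction chosen here as an assumption for contrast, not the paper's PEC module.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def evidential_uncertainty(logits):
    """Generic Dirichlet-style uncertainty (assumed form, for illustration).

    evidence_k = softplus(logit_k), alpha_k = evidence_k + 1,
    uncertainty = K / sum(alpha). More total evidence means less uncertainty.
    """
    alphas = [math.log1p(math.exp(v)) + 1.0 for v in logits]
    return len(logits) / sum(alphas)

weak = [0.1, 0.0, -0.1]                 # low similarity to every class
strong = [v + 50.0 for v in weak]       # same gaps, much larger magnitudes

p_weak, p_strong = softmax(weak), softmax(strong)
u_weak, u_strong = evidential_uncertainty(weak), evidential_uncertainty(strong)
```

Softmax assigns the two cases the same probabilities because only the gaps between logits matter, whereas the evidence-based quantity distinguishes them.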

[136] No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

Tao Liu,Gang Wan,Kan Ren,Shibo Wen

Main category: cs.CV

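TL;DR: This paper proposes a new unsupervised framework for online video stabilization. By combining the classical three-stage pipeline with a multithreaded buffering mechanism, it addresses the limited data, poor controllability, and low hardware efficiency of end-to-end learning. It also introduces UAV-Test, a multimodal dataset targeting UAV nighttime remote sensing; experiments show the method outperforms existing online methods and approaches offline ones.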

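Details Motivation: Existing deep-learning-based video stabilization methods depend on paired data, are hard to control, and run inefficiently on constrained hardware; existing benchmarks are limited to handheld, forward-view, visible-light videos and transfer poorly to new settings such as UAV nighttime remote sensing. Method: Proposes an unsupervised online stabilization framework that reuses the classical three-stage pipeline (motion estimation, smoothing, image resampling) and adds a multithreaded buffering mechanism; also builds UAV-Test, a multimodal UAV aerial video dataset. Result: Consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, with performance close to offline methods. Conclusion: The proposed unsupervised, lightweight, and interpretable online framework is better suited to resource-constrained real-world settings such as UAVs and extends video stabilization to new domains such as nighttime remote sensing. Abstract: We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.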
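
To make the "online" constraint concrete, here is a minimal sketch of the smoothing stage of a classical pipeline under a no-look-ahead rule. It is an assumption-laden toy, not the paper's method: motion is reduced to a single 1D offset per frame, the smoother is a plain causal exponential moving average, and `stabilize_online` and `alpha` are hypothetical names.

```python
def stabilize_online(frame_motions, alpha=0.2):
    """Causal trajectory smoothing (toy 1D version).

    frame_motions: estimated inter-frame motion (stage 1 output).
    Returns per-frame corrections to apply when resampling (stage 3 input).
    Only past and current frames are used, so no look-ahead is needed.
    """
    raw, smooth = 0.0, 0.0
    corrections = []
    for i, motion in enumerate(frame_motions):
        raw += motion                                    # accumulated camera path
        smooth = raw if i == 0 else (1 - alpha) * smooth + alpha * raw
        corrections.append(smooth - raw)                 # warp that removes jitter
    return corrections

def total_variation(path):
    """Sum of absolute frame-to-frame changes, used here as a jitter measure."""
    return sum(abs(b - a) for a, b in zip(path, path[1:]))

motions = [1.0, -1.0] * 10                               # pure back-and-forth shake
raw_path = [sum(motions[: i + 1]) for i in range(len(motions))]
corrections = stabilize_online(motions)
stabilized_path = [r + c for r, c in zip(raw_path, corrections)]
```

A smaller `alpha` gives a steadier path at the cost of larger corrections (and therefore more cropping after resampling), which is the usual trade-off in causal smoothers.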

[137] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Guofeng Mei,Wei Lin,Luigi Riz,Yujiao Wu,Yiming Wang,Fabio Poiesi

Main category: cs.CV

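TL;DR: This paper proposes Fase3D, the first efficient encoder-free, Fourier-based large multimodal model for 3D scenes. Using structured superpoint representations, space-filling curve serialization with an FFT that approximates self-attention, and Fourier-augmented LoRA adapters, it sharply reduces computation and parameter cost while maintaining performance.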

Details Motivation: Existing 3D large multimodal models depend on heavy pre-trained visual encoders, whereas 2D models have begun removing encoders for efficiency and scalability. Extending this paradigm to unordered, large-scale point clouds remains difficult, so a design is needed that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder. Method: Proposes Fase3D with three core innovations: 1) structured superpoints that compactly represent large 3D scenes; 2) space-filling curve serialization of the point cloud combined with the Fast Fourier Transform (FFT) for efficient global context modeling and graph-based token merging; 3) Fourier-augmented LoRA adapters that inject global frequency-aware interactions into the LLM at negligible cost. Result: Fase3D achieves performance comparable to encoder-based baselines on multiple 3D understanding tasks while significantly reducing computation and parameter count. Conclusion: Fase3D realizes an efficient encoder-free 3D large-model architecture and shows that Fourier-based serialization and frequency-domain modeling are a viable way to handle unordered point cloud data. Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
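
The combination of serialization and FFT mixing can be sketched as below. This is a hedged illustration only: it assumes a Morton (Z-order) curve as the space-filling curve and an FNet-style real-part 2D FFT as the attention substitute, neither of which is confirmed as the exact choice in Fase3D. `morton_codes` and `serialize_and_mix` are hypothetical names.

```python
import numpy as np

def morton_codes(points, bits=10):
    """Z-order codes for 3D points: quantize each axis, then interleave bits."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    q = ((points - lo) / (hi - lo + 1e-12) * (2**bits - 1)).astype(np.int64)
    codes = np.zeros(len(points), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((q[:, axis] >> b) & 1) << (3 * b + axis)
    return codes

def serialize_and_mix(points, feats):
    """Order tokens along the curve, then mix them globally with FFTs.

    The sort removes dependence on the input order of the point cloud;
    the FFT over the token axis mixes all tokens in O(N log N).
    """
    order = np.argsort(morton_codes(points), kind="stable")
    tokens = feats[order]
    return np.fft.fft(np.fft.fft(tokens, axis=1), axis=0).real

rng = np.random.default_rng(0)
points = rng.uniform(size=(64, 3))       # toy point cloud
feats = rng.normal(size=(64, 16))        # one feature vector per point
mixed = serialize_and_mix(points, feats)

perm = rng.permutation(64)               # shuffle the input order
mixed_shuffled = serialize_and_mix(points[perm], feats[perm])
```

Because the serialization depends only on point coordinates, shuffling the input leaves the mixed tokens unchanged, which is one simple way to obtain the permutation invariance the abstract refers to.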

[138] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Yichen Peng,Jyun-Ting Song,Siyeol Jung,Ruofan Liu,Haiyang Liu,Xuangeng Chu,Ruicong Liu,Erwin Wu,Hideki Koike,Kris Kitani

Main category: cs.CV

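TL;DR: This paper proposes DyaDiT, a multi-modal diffusion transformer that generates socially appropriate human motion from dyadic conversational audio. By fusing audio from both speakers, introducing a motion prior dictionary, and optionally taking the partner's gestures as input, it markedly improves the naturalness and social appropriateness of the generated motion.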

Details Motivation: Existing methods map a single audio stream to a single person's motion, ignoring the social context and two-person interaction dynamics of conversation, so they struggle to generate natural, socially engaging gestures for digital humans. Method: Proposes DyaDiT, a multi-modal diffusion transformer that takes dyadic audio (optionally with social-context tokens) as input, fuses information from both speakers, encodes motion priors with a motion dictionary, and can use the conversational partner's gestures for more responsive generation; it is trained on the Seamless Interaction Dataset. Result: Outperforms existing methods on standard motion generation metrics and in quantitative user studies, where users strongly prefer its generated motion, confirming its robustness and social appropriateness. Conclusion: DyaDiT effectively models the interaction dynamics of dyadic conversation and offers a new approach to generating more natural, socially intelligent motion for digital humans. Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

[139] AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su,Jincheng Gao,Hangyu Guo,Zhenhua Liu,Lueyang Zhang,Xinyu Geng,Shijue Huang,Peng Xia,Guanyu Jiang,Cheng Wang,Yue Zhang,Yi R. Fung,Junxian He

Main category: cs.CV

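TL;DR: This paper introduces the AgentVista benchmark for evaluating generalist multimodal agents on realistic, long-horizon, cross-modal tool-use tasks, and shows that current state-of-the-art models (e.g., Gemini-3-Pro) still face significant performance bottlenecks on such tasks.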

Details Motivation: Existing benchmarks focus on single-turn visual reasoning or specific tool skills and do not capture the realism, visual subtlety, and long-horizon tool coordination that real-world multimodal agents require. Method: Builds the AgentVista benchmark, covering 25 sub-domains across 7 categories and pairing detail-rich realistic visual scenarios with natural hybrid tool use (web search, image search, page navigation, and code-based operations for image processing and general programming), with an emphasis on long-horizon cross-modal interaction. Result: A comprehensive evaluation of state-of-the-art models shows serious gaps in long-horizon multimodal tool use; the strongest model, Gemini-3-Pro, reaches only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. Conclusion: AgentVista provides a challenging new benchmark for driving the development of more capable and reliable generalist multimodal agents. Abstract: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.

[140] Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration

Xiaole Tang,Xiaoyi He,Jiayi Xu,Xiang Gu,Jian Sun

Main category: cs.CV

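TL;DR: This paper proposes BaryIR, a framework that aligns multisource degraded features via the Wasserstein barycenter, decoupling a degradation-agnostic shared representation space from degradation-specific residual subspaces and thereby improving generalization to unseen degradation types.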

Details Motivation: Existing all-in-one image restoration methods generalize poorly to out-of-distribution degradations. The authors argue that multisource degraded feature distributions arise as different shifts from a single degradation-agnostic base distribution, and that recovering this shared distribution is the key to generalizing across degradations. Method: Proposes the BaryIR framework: 1) aligns multisource degraded features in the Wasserstein barycenter (WB) space to model the degradation-agnostic distribution; 2) introduces residual subspaces orthogonal to the WB embeddings and contrasts their embeddings against one another; 3) explicitly decouples the WB space (encoding degradation-agnostic invariant content) from the residual subspaces (preserving degradation-specific knowledge). Result: BaryIR is competitive with state-of-the-art all-in-one methods on multiple benchmarks, and generalizes and remains robust particularly well on unseen degradation types and levels, with limited training degradation types, and on real-world mixed degradations. Conclusion: Through representation decoupling guided by the Wasserstein barycenter, BaryIR mitigates overfitting to training degradations and achieves adaptive image restoration grounded in degradation-agnostic shared invariance. Abstract: Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods.
Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
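
As a concrete picture of "minimizing the average of Wasserstein distances", the sketch below computes a Wasserstein-2 barycenter in the simplest possible setting: 1D empirical distributions with equal sample counts, where the barycenter is obtained by averaging sorted samples (quantiles). This is a textbook special case used for illustration, not BaryIR's learned continuous barycenter; the function names are hypothetical.

```python
def barycenter_1d(samples_per_source):
    """W2 barycenter of 1D empirical distributions with equal sample counts.

    In 1D the optimal transport plan matches sorted samples, so the
    barycenter is the element-wise mean of the sorted sample lists.
    """
    sorted_sources = [sorted(s) for s in samples_per_source]
    return [sum(vals) / len(vals) for vals in zip(*sorted_sources)]

def w2_squared(a, b):
    """Squared Wasserstein-2 distance between equal-size 1D samples."""
    return sum((x - y) ** 2 for x, y in zip(sorted(a), sorted(b))) / len(a)

def average_cost(candidate, sources):
    """Average squared W2 distance from a candidate to every source."""
    return sum(w2_squared(candidate, s) for s in sources) / len(sources)

# Two "degraded" feature distributions, shifted away from a common base.
sources = [[2.0, 0.0, 1.0], [12.0, 10.0, 11.0]]
bary = barycenter_1d(sources)                    # [5.0, 6.0, 7.0]

cost_at_barycenter = average_cost(bary, sources)     # 25.0
cost_at_source_0 = average_cost(sources[0], sources)  # 50.0
```

The barycenter sits between the sources and has a lower average cost than either source, which mirrors the intuition of a shared distribution from which each degradation is a shift.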

[141] Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz,Rohit Mohan,Thomas Nürnberg,Yakov Miron,Daniele Cattaneo,Abhinav Valada

Main category: cs.CV

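TL;DR: This paper proposes Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS), which fuses multi-view observations into a 3D Gaussian representation and splats its features onto a voxel grid for end-to-end 4D panoptic occupancy tracking, achieving state-of-the-art performance on the Occ3D nuScenes and Waymo datasets.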

Details Motivation: Existing methods address only one side of 4D spatiotemporal scene understanding in dynamic environments: either coarse geometric tracking (e.g., bounding boxes) or detailed 3D structure (e.g., voxel occupancy) that lacks explicit temporal association. Method: Proposes the Latent Gaussian Splatting (LaGS) framework, which combines camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction; it first fuses multi-view observations into a sparse, point-centric 3D Gaussian representation, then splats the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. Result: Achieves state-of-the-art performance for 4D panoptic occupancy tracking on the Occ3D nuScenes and Waymo datasets. Conclusion: By introducing latent Gaussian splatting, LaGS jointly models spatial detail and temporal consistency, improving robots' understanding of 4D spatiotemporal surroundings in dynamic scenes. Abstract: Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.

[142] Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Bin Zeng,Johannes Künzel,Anna Hilsmann,Peter Eisert

Main category: cs.CV

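TL;DR: This paper proposes Phys-3D, a physics-constrained tracking framework for real-time crowd counting on railway platforms from a single moving camera. It combines detection, appearance, and 3D motion reasoning and reduces counting error to 2.97% on the MOT-RPCH dataset.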

Details Motivation: Railway platforms need accurate, real-time crowd counting for safety and capacity management, but dense occlusions, camera motion, and perspective distortion under a moving camera make existing methods (e.g., tracking-by-detection approaches that assume static cameras or ignore physical consistency) unreliable. Method: Proposes a physics-constrained tracking framework that integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, and introduces a physics-constrained Kalman model (Phys-3D) that models 3D motion through pinhole geometry; a virtual counting band with persistence mitigates occlusion. Result: On the authors' MOT-RailwayPlatformCrowdHead (MOT-RPCH) dataset, the counting error is 2.97%, demonstrating robustness in dynamic, occluded scenes. Conclusion: Incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, supporting effective train scheduling and platform safety management. Abstract: Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform while arriving. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.

[143] Uni-Animator: Towards Unified Visual Colorization

Xinyuan Chen,Yao Xu,Shaowen Wang,Pengjie Song,Bowen Deng

Main category: cs.CV

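TL;DR: Uni-Animator is a Diffusion Transformer-based framework for unified image and video sketch colorization. Through visual reference enhancement, physical detail reinforcement, and sketch-based dynamic RoPE encoding, it addresses imprecise color transfer, loss of high-frequency detail, and temporal inconsistency.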

Details Motivation: Existing sketch colorization methods struggle to handle image and video tasks in a unified way, and suffer from imprecise color transfer, poor preservation of high-frequency physical details, and weak temporal coherence in large-motion scenes. Method: Proposes three key techniques: 1) visual reference enhancement via instance patch embedding, for precise color alignment and fusion; 2) physical detail reinforcement using physical features, for better retention of high-frequency textures; 3) sketch-based dynamic RoPE encoding, which adaptively models motion-aware spatial-temporal dependencies. Result: Matches the performance of task-specific methods on both image and video sketch colorization while providing unified cross-domain capability, with high detail fidelity and robust temporal consistency. Conclusion: Uni-Animator unifies image and video sketch colorization in a single model with gains in accuracy, detail, and temporal consistency, and offers a new direction for multimodal generation tasks. Abstract: We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

[144] FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

Thomas Woergaard,Raghavendra Selvan

Main category: cs.CV

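TL;DR: This paper proposes FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode to achieve quantized compression for medical image classification that balances accuracy and algorithmic fairness.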

Details Motivation: Existing neural network quantization methods (such as quantization-aware training and post-training quantization) do not explicitly consider their impact on algorithmic fairness, which matters especially in sensitive settings such as medical image classification. Method: Proposes the FairQuant framework, comprising group-aware importance analysis, a budgeted mixed-precision allocation strategy, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. Result: On Fitzpatrick17k and ISIC2019, FairQuant configurations averaging 4-6 bits recover much of the Uniform 8-bit accuracy and improve worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared bit budgets. Conclusion: FairQuant shows that, under a limited bit budget, mixed-precision quantization can improve accuracy and worst-group performance together, offering a practical route to efficient and fair deployment of medical AI. Abstract: Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
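
To ground the terms "uniform quantization", "mixed precision", and "bit budget", here is a small sketch. It uses a standard symmetric uniform quantizer and a parameter-weighted average bit width; both are generic textbook constructions assumed for illustration, not FairQuant's allocation or BAQ procedure, and the function names are hypothetical.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)                     # nothing to quantize
    scale = peak / levels
    return [round(w / scale) * scale for w in weights]

def max_error(weights, bits):
    """Largest absolute rounding error introduced at a given bit width."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))

def average_bits(layer_sizes, layer_bits):
    """Parameter-weighted average bit width of a mixed-precision assignment."""
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total

weights = [0.9, -0.35, 0.12, -0.77]
err_4bit = max_error(weights, 4)
err_8bit = max_error(weights, 8)

# A toy mixed-precision assignment: a small sensitive layer at 8 bits,
# a large layer at 4 bits, for a 5-bit average budget.
budget = average_bits([1000, 3000], [8, 4])
```

A mixed-precision scheme spends its budget unevenly: layers (or units) judged important, for example for worst-group accuracy, can keep more bits while the average stays within the budget.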

[145] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Junhu Fu,Shuyu Liang,Wutong Li,Chen Ma,Peng Huang,Kehao Wang,Ke Chen,Shengli Lin,Pinghong Zhou,Zeju Li,Yuanyuan Wang,Yi Guo

Main category: cs.CV

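TL;DR: This paper proposes ColoDiff, a diffusion-based framework for generating high-quality, dynamically consistent colonoscopy videos with controllable clinical attributes, aiming to alleviate the scarcity of medical data.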

Details Motivation: Colonoscopy video generation is important for diagnosing intestinal diseases but is challenged by irregular intestinal structures, diverse disease representations, and varied imaging modalities; high-quality generation must balance temporal consistency with precise control over clinical attributes. Method: Proposes ColoDiff, which includes a TimeStream module (cross-frame tokenization to decouple temporal dependency) and a Content-Aware module (noise-injected embeddings plus learnable prototypes for fine-grained control of clinical attributes), and uses a non-Markovian sampling strategy to accelerate generation. Result: Evaluated on three public datasets and one hospital database, the generated videos show smooth transitions and rich dynamics; downstream tasks (disease diagnosis, modality discrimination, bowel preparation scoring, lesion segmentation) are used for evaluation, and the number of sampling steps is cut by over 90%. Conclusion: ColoDiff achieves controllable, efficient colonoscopy video generation and shows the potential of synthetic videos to compensate for limited real data and support clinical analysis. Abstract: Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics.
ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.

[146] Motion-aware Event Suppression for Event Cameras

Roberto Pellerito,Nico Messikommer,Giovanni Cioffi,Marco Cannici,Davide Scaramuzza

Main category: cs.CV

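TL;DR: This paper proposes the first motion-aware event suppression framework, which filters events triggered by independently moving objects (IMOs) and ego-motion in real time. By jointly segmenting IMOs in the current event stream and predicting their future motion, it suppresses dynamic events before they occur. The model is lightweight, running at 173 Hz on consumer-grade GPUs with under 1 GB of memory, and on the EVIMO benchmark it improves segmentation accuracy by 67% at a 53% higher inference rate. Downstream, it accelerates ViT inference by 83% and reduces the Absolute Trajectory Error (ATE) of visual odometry by 13%.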

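Details Motivation: In event camera data, redundant events produced by independently moving objects (IMOs) and ego-motion interfere with downstream tasks, so a real-time, efficient, and accurate event suppression method is needed. Method: Proposes a motion-aware event suppression framework that jointly performs IMO segmentation and future motion prediction to suppress dynamic events in advance, using a lightweight network architecture to support real-time inference. Result: On the EVIMO benchmark, segmentation accuracy improves by 67% at a 53% higher inference rate; ViT inference is accelerated by 83%; the Absolute Trajectory Error (ATE) of event-based visual odometry drops by 13%; inference runs at 173 Hz with under 1 GB of memory. Conclusion: The framework is the first to achieve motion-aware real-time event suppression, improving on previous state-of-the-art methods in accuracy and speed and benefiting downstream vision tasks. Abstract: In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by IMOs and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.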

[147] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Wenjia Wang,Liang Pan,Huaijin Pi,Yuke Lou,Xuqian Ren,Yifan Wu,Zhouyingcheng Liao,Lei Yang,Rishabh Dabral,Christian Theobalt,Taku Komura

Main category: cs.CV

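TL;DR: This paper proposes EmbodMocap, a portable, low-cost method for jointly capturing human motion and scenes with two iPhones. It enables metric-scale, scene-consistent motion capture without markers or static cameras, and supports three embodied AI tasks: monocular human-scene reconstruction, physics-based character animation, and robot motion control.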

Details Motivation: Existing motion capture systems rely on costly studios or wearable devices, making it hard to collect human motion data with scene context at scale in real environments. Method: Proposes the EmbodMocap pipeline: two moving iPhones capture synchronized RGB-D sequences, and joint calibration places both views in a single metric world coordinate frame, so that humans and scenes are reconstructed together. Result: The dual-view setup substantially mitigates depth ambiguity, with reconstruction accuracy better than a single iPhone or monocular models; the collected data supports monocular human-scene reconstruction, physics-based animation, and sim-to-real reinforcement learning control of a humanoid robot. Conclusion: EmbodMocap offers embodied AI a high-quality, easy-to-deploy, scene-coupled way to collect motion data, helping connect human behavior, environment geometry, and agent actions. Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune on feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos.
Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.

[148] Through BrokenEyes: How Eye Disorders Impact Face Detection?

Prottay Kumar Adhikary

Main category: cs.CV

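TL;DR: This study uses the BrokenEyes system to simulate how five common eye disorders affect the neural-like feature representations of deep learning models. It finds that cataract and glaucoma cause marked disruption of feature maps, and quantifies the distortion with activation energy and cosine similarity.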

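Details Motivation: Vision disorders affect millions of people and change how visual information is processed and perceived, so it is important to understand how eye disorders affect neural representations. Method: Uses the BrokenEyes system to simulate five eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy); trains deep learning models on human and non-human datasets under normal and disorder-specific conditions; analyzes changes in feature maps; and quantifies distortion with activation energy and cosine similarity. Result: Cataract and glaucoma produce critical disruptions in the models' feature maps, consistent with the known neural processing challenges of these two conditions; activation energy and cosine similarity quantify how strongly input degradation affects the representations. Conclusion: The computational framework links visual degradation to the internal representations of deep models, offering a new way to study neural computation under pathological vision and to develop assistive diagnostic tools. Abstract: Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.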

[149] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Chenhe Du,Xuanyu Tian,Qing Wu,Muyu Liu,Jingyi Yu,Hongjiang Wei,Yuyao Zhang

Main category: cs.CV

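TL;DR: This paper proposes the Dual-Coupled PnP Diffusion framework, which introduces a dual variable to achieve asymptotic convergence to the exact data manifold, and designs Spectral Homogenization (SH), a frequency-domain adaptation mechanism that turns structured residuals into pseudo-white noise matching the diffusion prior's assumptions. Together they resolve the trade-off between reconstruction bias and hallucination in conventional plug-and-play methods.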

Details Motivation: Existing plug-and-play diffusion prior (PnPDP) solvers (e.g., based on HQS or proximal gradient) are memoryless operators that update estimates from instantaneous gradients only, which leaves a steady-state bias under heavy corruption and prevents strict satisfaction of the physical measurement constraints. Method: Proposes Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, and introduces Spectral Homogenization (SH), which modulates the dual residuals in the frequency domain so that their statistics approximate Additive White Gaussian Noise (AWGN), matching the statistical assumptions of the diffusion denoiser. Result: On CT and MRI reconstruction, the method markedly improves reconstruction fidelity, eases the bias-hallucination trade-off, and converges faster, reaching state-of-the-art performance. Conclusion: By combining a theoretical convergence guarantee with statistical adaptation, Dual-Coupled PnP Diffusion with SH lets plug-and-play diffusion methods strictly satisfy physical constraints while keeping the generative prior effective. Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
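
The role of the dual variable as "integral feedback" can be seen in a scalar toy problem. The sketch below is an assumption for illustration, not the paper's solver: it minimizes a quadratic prior 0.5·x² subject to a·x = y, once with a fixed quadratic penalty (memoryless, leaving a steady-state residual) and once with an augmented-Lagrangian dual update that accumulates past residuals. No diffusion denoiser or Spectral Homogenization is involved.

```python
def penalty_only(a, y, rho):
    """Memoryless update: minimize 0.5*x**2 + (rho/2)*(a*x - y)**2.

    The closed-form minimizer never reaches a*x == y for finite rho.
    """
    return rho * a * y / (1.0 + rho * a * a)

def with_dual(a, y, rho, iters=50):
    """Same problem with a scaled dual variable u (augmented Lagrangian).

    x-step: minimize 0.5*x**2 + (rho/2)*(a*x - y + u)**2
    u-step: u += a*x - y   (accumulates the constraint residual)
    """
    u, x = 0.0, 0.0
    for _ in range(iters):
        x = rho * a * (y - u) / (1.0 + rho * a * a)
        u += a * x - y
    return x

a, y, rho = 1.0, 2.0, 1.0
residual_penalty = a * penalty_only(a, y, rho) - y    # stays at -1.0
residual_dual = a * with_dual(a, y, rho) - y          # shrinks toward 0
```

In this toy case the residual contracts by a factor 1 / (1 + rho·a²) per iteration once the dual update is in place, whereas the penalty-only estimate is stuck at a fixed bias regardless of how many times it is recomputed.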

[150] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

Alaa El Ichi,Khalide Jbilou

Main category: cs.CV

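TL;DR: This paper proposes Multidimensional Task Learning (MTL), a framework built on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors, lifting the limits that matrix-based modeling places on vision tasks. It proves that classification, segmentation, and detection are special cases of this unified framework, which also naturally supports more complex tasks such as spatiotemporal or cross-modal prediction.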

Details Motivation: Current formulations of computer vision tasks are constrained by matrix-based thinking and require structural flattening of the data, which loses information and limits which tasks can be expressed. Method: The paper proposes GE-MLPs, tensor neural networks built on the generalized Einstein product, and uses them to construct a unified mathematical framework, Multidimensional Task Learning (MTL), in which tensor-valued parameters explicitly control which dimensions are preserved or contracted. Result: The paper rigorously proves that classification, segmentation, and detection are special cases of MTL under particular dimensional configurations, proves that the MTL task space is strictly larger than what matrix-based methods can natively express, and supports spatiotemporal or cross-modal prediction without destructive flattening. Conclusion: MTL provides a solid mathematical foundation, grounded in tensor algebra, for understanding, comparing, and designing computer vision tasks. Abstract: This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
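To make "preserved or contracted dimensions" concrete, here is a minimal NumPy sketch of the Einstein product, which contracts the trailing modes of one tensor with the leading modes of another. The helper `einstein_product`, the toy sizes, and the two weight shapes are illustrative assumptions; they are not the paper's GE-MLP definition, only one plausible reading of how a dimensional configuration could distinguish a classification-like from a segmentation-like output.

```python
import numpy as np

def einstein_product(a, b, n_contract):
    """Contract the last n_contract modes of a with the first n_contract modes of b."""
    if a.shape[a.ndim - n_contract:] != b.shape[:n_contract]:
        raise ValueError("contracted modes must have matching sizes")
    return np.tensordot(a, b, axes=n_contract)

rng = np.random.default_rng(0)
H, W, C, K = 4, 5, 3, 2            # toy height, width, channels, classes
x = rng.normal(size=(H, W, C))     # input stays a 3-way tensor, no flattening

# Classification-like configuration: all three input modes are contracted.
w_cls = rng.normal(size=(K, H, W, C))
y_cls = einstein_product(w_cls, x, 3)       # shape (K,)

# Segmentation-like configuration: spatial modes reappear in the output.
w_seg = rng.normal(size=(H, W, K, H, W, C))
y_seg = einstein_product(w_seg, x, 3)       # shape (H, W, K)
```

With all input modes contracted, the result equals an ordinary matrix-vector product on the flattened input; the tensor form differs in that the weight's own mode structure decides which axes survive in the output.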

[151] UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

Mohammad Mahdavian,Gordon Tan,Binbin Xu,Yuan Ren,Dongfeng Bai,Bingbing Liu

Main category: cs.CV

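TL;DR: UniScale is a unified, scale-aware multi-view 3D reconstruction framework designed for robotic applications. It jointly estimates camera intrinsics and extrinsics, scale-invariant depth, point maps, and the metric scale of the scene, and can flexibly incorporate geometric priors.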

Details Motivation: In vision-based robotic navigation, accurately extracting environmental structure from raw image sequences is critical for downstream tasks, yet existing methods often lack a unified, scale-aware solution suited to resource-constrained robots. Method: UniScale uses a single feed-forward network that combines global contextual reasoning with camera-aware feature representations to jointly estimate camera parameters, depth, point maps, and the metric scale of the scene; it accepts optional geometric priors and reuses the world priors of pretrained models, so it does not need training from scratch. Result: Evaluation on multiple benchmarks shows strong generalization and consistent performance across diverse environments, with further gains when camera intrinsics (and additionally extrinsics) are known. Conclusion: UniScale delivers robust, metric-aware 3D reconstruction in a single model, balancing modularity, semantic guidance, and resource efficiency, which makes it suitable for real-world robotic deployment. Abstract: We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.

[152] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Yizhi Li,Xiaohan Chen,Miao Jiang,Wentao Tang,Gaoang Wang

Main category: cs.CV

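TL;DR: MovieTeller proposes a training-free, tool-augmented progressive abstraction framework that uses off-the-shelf models and a face recognition tool to improve the factual accuracy and narrative coherence of long-video synopses.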

Details Motivation: Existing vision-language models suffer from inconsistent character IDs and fractured narratives when automatically summarizing long videos such as movies and TV series. Method: MovieTeller follows a training-free, tool-augmented generation paradigm: it first calls an off-the-shelf face recognition model to obtain character identities and bounding boxes as factual groundings, then injects this information into the prompt to steer the VLM; a progressive abstraction pipeline processes the long video in stages to ease context-length limits. Result: Experiments show significant gains over end-to-end baselines in factual accuracy, character consistency, and overall narrative coherence. Conclusion: Plug-and-play tool augmentation and progressive abstraction, with no fine-tuning, offer an effective new route to better long-video synopses. Abstract: With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.

[153] Large Multimodal Models as General In-Context Classifiers

Marco Garosi,Matteo Farina,Alessandro Conti,Massimiliano Mancini,Elisa Ricci

Main category: cs.CV

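TL;DR: This paper examines the potential of Large Multimodal Models (LMMs) for classification, showing that with in-context learning they can match or even surpass contrastive vision-language models (VLMs) such as CLIP in both closed-world and open-world classification. To handle imperfect context in the open-world setting, it proposes CIRCLE, a training-free method that improves performance by iteratively refining pseudo-labels.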

Details Motivation: Prior work holds that CLIP-like contrastive VLMs are better suited to classification while LMMs are suited only to complex tasks; this paper questions that view, arguing that the in-context learning ability of LMMs has been overlooked and gives them a distinct advantage in open-world classification in particular. Method: State-of-the-art LMMs are benchmarked on multiple datasets for closed-world and open-world classification; to handle imperfect context in the open-world setting, the paper proposes CIRCLE, a training-free method in which the LMM itself assigns pseudo-labels to the in-context examples and iteratively refines them. Result: With a few in-context examples, LMMs match or surpass CLIP with cache-based adapters; CIRCLE markedly improves LMM robustness in open-world classification and surpasses VLM baselines, pointing to LMMs as unified, flexible classifiers. Conclusion: LMMs are not only generative models but also general-purpose classifiers with strong in-context learning; CIRCLE shows that this potential can be unlocked without training, opening a path toward unified multimodal classification. Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.

[154] Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Daniel Bermuth,Alexander Poeppel,Wolfgang Reif

Main category: cs.CV

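TL;DR: This paper shows that more accurate 3D skeletons, obtained by triangulating from multiple camera views, significantly improve existing skeleton-based action recognition models. This indicates that input data quality is the current performance bottleneck, and the authors recommend multi-view setups as the standard configuration for future research.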

Details Motivation: Existing research focuses mostly on improving action recognition algorithms and overlooks the quality of the input skeleton data itself. Method: 3D skeletons are triangulated from multiple camera views to obtain more accurate input data, and the effect on state-of-the-art action recognition models is evaluated. Result: Using multi-view triangulated skeletons significantly improves the performance of current state-of-the-art action recognition models. Conclusion: Input skeleton quality is currently a main limiting factor for performance; multi-view capture is highly cost-effective in most practical scenarios and should become the standard setup for skeleton-based action recognition research. Abstract: Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
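For readers unfamiliar with multi-view triangulation, a minimal sketch of the standard linear (DLT) method for one joint is given below. The abstract does not say which triangulation method the authors use, so this is a textbook technique rather than their pipeline; the camera matrices, the joint position, and the helper names are made up for the example.

```python
import numpy as np

def triangulate_point(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: list of 3x4 camera projection matrices.
    pixels: list of (u, v) observations of the same point, one per camera.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # de-homogenize

def project(P, point):
    """Project a 3D point with projection matrix P and return pixel coordinates."""
    x = P @ np.append(point, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras with shared intrinsics and a 0.5 m horizontal baseline.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

joint = np.array([0.2, -0.1, 3.0])     # made-up 3D joint position in metres
pixels = [project(P1, joint), project(P2, joint)]
estimate = triangulate_point([P1, P2], pixels)
```

With noise-free observations the estimate matches the true joint; with noisy 2D detections, adding views adds rows to the linear system, which is the usual reason more cameras yield more accurate skeletons.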

[155] Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu,Bowen Zhou,Qi Wang,Shuwen Feng,Jingyu Xiao

Main category: cs.CV

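TL;DR: This paper proposes GUIPruner, a training-free compression framework for pure-vision GUI agents. It uses Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP) to address redundancy in historical trajectories and damage to spatial structure, substantially improving efficiency while maintaining high accuracy.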

Details Motivation: Pure-vision GUI agents are severely limited in efficiency by the heavy spatiotemporal redundancy in high-resolution screenshots and historical trajectories, and current compression methods suffer from two problems: a temporal attention mismatch and damage to spatial topology. Method: GUIPruner has two core components: 1) Temporal-Adaptive Resolution (TAR), which dynamically rescales historical frames with a decay-based mechanism to remove temporal redundancy; 2) Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while protecting the integrity of the global layout. Result: GUIPruner reaches state-of-the-art performance on multiple benchmarks; on Qwen2-VL-2B it delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance. Conclusion: GUIPruner is an efficient, training-free, structure-aware visual compression method for GUIs that eases the resource bottleneck of large models in high-fidelity GUI navigation and supports real-time, high-precision operation. Abstract: Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.

[156] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Jiangxin Sun,Feng Xue,Teng Long,Chang Liu,Jian-Fang Hu,Wei-Shi Zheng,Nicu Sebe

Main category: cs.CV

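TL;DR: This paper proposes RaWMPC, an end-to-end autonomous driving framework that needs no expert action supervision. It combines a risk-aware world model with predictive control to improve generalization and safety in both in-distribution and out-of-distribution scenarios.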

Details Motivation: Existing imitation-learning-based end-to-end driving methods depend on expert demonstrations, generalize poorly, and make unsafe decisions in rare or long-tail scenarios; this paper explores a mechanism for reliable autonomous decision-making without any expert action supervision. Method: The Risk-aware World Model Predictive Control (RaWMPC) framework 1) builds a risk-aware world model that learns to predict catastrophic outcomes by being deliberately exposed to hazardous behaviors; 2) introduces a self-evaluation distillation method that distills the world model's risk-avoidance ability into an action proposal network trained without expert demonstrations; 3) at test time, explicitly evaluates the risk of multiple candidate actions and selects low-risk ones. Result: RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios while providing better decision interpretability. Conclusion: RaWMPC demonstrates that end-to-end autonomous driving without expert action supervision is feasible; risk modeling and world-model predictive control ease the generalization bottleneck of imitation learning and improve robustness and safety. Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable.
Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
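The selection step described in the abstract (roll candidate actions through a world model, score the predicted outcomes, keep a low-risk action) can be pictured with a toy example. Everything below is a hypothetical stand-in: `toy_world_model` is hand-written kinematics rather than a learned model, and the `risk` score and tie-breaking rule are assumptions chosen for illustration, not RaWMPC's actual components.

```python
from dataclasses import dataclass

@dataclass
class State:
    gap: float    # distance to a stationary vehicle ahead, in metres
    speed: float  # ego speed, in m/s

def toy_world_model(state, accel, horizon=10, dt=0.2):
    """Hypothetical stand-in for a learned world model: predicts future gaps."""
    gap, speed = state.gap, state.speed
    predicted_gaps = []
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)   # the car cannot reverse
        gap -= speed * dt
        predicted_gaps.append(gap)
    return predicted_gaps

def risk(predicted_gaps, safe_gap=2.0):
    """Explicit risk score: 1.0 for a predicted collision, else a closeness penalty."""
    min_gap = min(predicted_gaps)
    if min_gap <= 0.0:
        return 1.0
    return max(0.0, (safe_gap - min_gap) / safe_gap)

def select_action(state, candidates):
    """Pick the lowest-risk candidate; ties favour the larger acceleration."""
    scored = [(risk(toy_world_model(state, a)), -a, a) for a in candidates]
    return min(scored)[2]

state = State(gap=8.0, speed=6.0)
best = select_action(state, [-4.0, -2.0, 0.0, 2.0])
```

In this toy scene, keeping speed or accelerating leads to a predicted collision and mild braking stops too close, so hard braking is selected. The scoring is fully inspectable, which gives a sense of why explicit risk evaluation can aid interpretability.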

[157] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

Jasmine Bayrooti,Weiwei Kong,Natalia Ponomareva,Carlos Esteves,Ameesh Makadia,Amanda Prorok

Main category: cs.CV

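TL;DR: This paper proposes a differentially private (DP) image generation framework based on the wavelet spectrum. DP protection is applied to the low-frequency components, and high-frequency details are restored with a public super-resolution model, which substantially improves the quality of DP image generation and the privacy-utility balance.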

Details Motivation: Standard DP fine-tuning (e.g., DP-SGD) adds noise indiscriminately, which badly degrades high-frequency image texture; the authors hypothesize that the privacy-sensitive information in an image is concentrated in the low-frequency wavelet components (e.g., facial contours and object shapes), whereas high-frequency components are more generic and public, so the two can be handled separately to improve utility. Method: A two-stage spectral DP image generation framework: (1) DP fine-tune an autoregressive spectral image tokenizer on the low-resolution wavelet coefficients of the sensitive images; (2) upsample to high resolution with a publicly pretrained super-resolution model; the post-processing property of DP preserves the overall privacy guarantee. Result: On MS-COCO and MM-CelebA-HQ, the method generates higher-quality images with better style capture than other leading DP image generation methods, giving a better privacy-utility trade-off. Conclusion: Focusing the DP constraint on the low-frequency wavelet structure of an image and decoupling the generation of high-frequency detail is an effective new paradigm for making DP image generation more practical, and it supports spectrum-aware privacy design in generative modeling. Abstract: Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
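As a rough illustration of the coarse/fine split that the framework relies on, the sketch below computes one level of an orthonormal 2D Haar transform and separates the low-frequency band from the detail bands. The abstract does not name the wavelet family or the number of levels, so Haar and a single level are assumptions, and the function names are hypothetical; no DP training or super-resolution is performed here.

```python
import numpy as np

def haar_dwt2(image):
    """One level of an orthonormal 2D Haar transform (even-sized grayscale input)."""
    a, b = image[0::2, 0::2], image[0::2, 1::2]
    c, d = image[1::2, 0::2], image[1::2, 1::2]
    low = (a + b + c + d) / 2.0                 # coarse, low-frequency band
    detail_cols = (a - b + c - d) / 2.0         # differences across columns
    detail_rows = (a + b - c - d) / 2.0         # differences across rows
    detail_diag = (a - b - c + d) / 2.0         # diagonal differences
    return low, (detail_cols, detail_rows, detail_diag)

def haar_idwt2(low, details):
    """Inverse of haar_dwt2."""
    dc, dr, dd = details
    out = np.empty((low.shape[0] * 2, low.shape[1] * 2))
    out[0::2, 0::2] = (low + dc + dr + dd) / 2.0
    out[0::2, 1::2] = (low - dc + dr - dd) / 2.0
    out[1::2, 0::2] = (low + dc - dr - dd) / 2.0
    out[1::2, 1::2] = (low - dc - dr + dd) / 2.0
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))                      # stand-in for a sensitive image
low, details = haar_dwt2(image)

# Coarse intermediary: keep only the low band and zero the detail bands.
zeros = tuple(np.zeros_like(band) for band in details)
coarse = haar_idwt2(low, zeros)
restored = haar_idwt2(low, details)             # exact reconstruction
```

In the two-stage setup described above, only the low-resolution coefficients (here `low`) would be touched by DP training, while the detail bands would be supplied afterward by a public model, which is permitted as post-processing.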

[158] LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

Zhengyang Wei,Renzhi Jing,Yiyi He,Jenny Suckale

Main category: cs.CV

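TL;DR: This paper proposes LineGraph2Road, a framework that builds a sparse Euclidean graph, converts it into its line graph, and applies a Graph Transformer to predict edge connectivity, substantially improving topological accuracy and the modeling of fine detail in road extraction from remote sensing imagery.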

Details Motivation: Existing road extraction methods struggle to model long-range dependencies and complex topology, and endpoint-embedding fusion is limited on set-isomorphic links, which makes connectivity prediction inaccurate. Method: LineGraph2Road 1) extracts keypoints from segmentation masks to build a sparse Euclidean graph; 2) converts it into its line graph to model relations between edges explicitly; 3) applies a Graph Transformer on the line graph for binary classification of whether each edge is connected; 4) adds an overpass/underpass head and a coupled NMS strategy to handle multi-level crossings and preserve critical connections. Result: The method reaches state-of-the-art results on the TOPO-F1 and APLS metrics across three benchmarks (City-scale, SpaceNet, and Global-scale) and captures fine visual details that matter for real-world deployment. Conclusion: Casting connectivity prediction as graph learning on the line graph is an effective new paradigm for improving road topology extraction, combining structural expressiveness with scalability. Abstract: The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
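The graph construction described above can be sketched in a few lines: connect keypoints that lie within a distance threshold, then build the line graph, in which every candidate road segment becomes a node. The keypoint coordinates, the radius, and the function names are hypothetical, and the Graph Transformer that would classify the line-graph nodes is not shown.

```python
import math
from itertools import combinations

def euclidean_graph(points, radius):
    """Candidate edges between all keypoint pairs closer than the radius."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= radius
    ]

def line_graph(edges):
    """Line graph of an undirected graph.

    Each original edge becomes a node; two such nodes are linked when the
    original edges share an endpoint.
    """
    nodes = [tuple(sorted(e)) for e in edges]
    links = [(e1, e2) for e1, e2 in combinations(nodes, 2) if set(e1) & set(e2)]
    return nodes, links

# Hypothetical keypoints: a T-junction at keypoint 1 with a spur toward keypoint 4.
keypoints = [(0, 0), (1, 0), (2, 0), (1, 1), (1, 2)]
candidate_edges = euclidean_graph(keypoints, radius=1.1)
lg_nodes, lg_links = line_graph(candidate_edges)
```

Classifying line-graph nodes is equivalent to classifying edges of the original graph, but message passing now runs between segments that meet at a keypoint, so each segment carries its own representation instead of one fused from its two endpoint embeddings.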

[159] PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

Fuqiang Chen,Ranran Zhang,Wanming Hu,Deboch Eyob Abera,Yue Peng,Boyun Zheng,Yiwen Sun,Jing Cai,Wenjian Qin

Main category: cs.CV

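TL;DR: This paper proposes PGVMS, a framework that performs virtual multiplex immunohistochemical staining using only uniplex training data. It uses adaptive prompt guidance, protein-aware learning, and prototype-consistent learning to address three challenges: inadequate semantic guidance, inconsistent staining distribution, and spatial misalignment.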

Details Motivation: Small biopsies often provide too little tissue for comprehensive IHC analysis, and existing virtual multiplex staining methods face three problems: inadequate semantic guidance, inconsistent staining distribution, and spatial misalignment across modalities. Method: The PGVMS framework has three innovations: 1) an adaptive prompt guidance mechanism based on a pathological visual language model; 2) a protein-aware learning strategy (PALS) that directly quantifies and constrains protein distributions; 3) a prototype-consistent learning strategy (PCLS) that establishes cross-image semantic interaction to correct spatial misalignment. Result: Using only uniplex training data, the framework significantly improves the accuracy, consistency, and spatial alignment of virtual multiplex IHC staining. Conclusion: PGVMS offers an efficient and reliable new paradigm for multiplex molecular profiling of small-sample pathology images, moving digital pathology toward more precise and less invasive clinical diagnosis. Abstract: Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).

[160] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu,Bing Fan,Jiali Yao,Zhipeng Zhang,Yan Huang,Cheng Han,Heng Fan,Libo Zhang

Main category: cs.CV

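TL;DR: This paper proposes ART-STVG, an autoregressive Transformer architecture for long-form spatio-temporal video grounding (LF-STVG). Streaming processing, memory banks with a selection mechanism, and a cascaded spatio-temporal decoder design substantially improve grounding performance on long videos.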

Details Motivation: Existing spatio-temporal video grounding (STVG) methods mainly target short videos of tens of seconds and struggle with real-world videos lasting minutes or even hours, which limits practical use. Method: The autoregressive Transformer architecture ART-STVG treats the video as a streaming input and processes frames sequentially; it designs spatial and temporal memory banks with memory selection strategies that improve relevance, and adopts a cascaded spatio-temporal decoding structure in which the spatial decoder's output assists the temporal decoder. Result: On the newly extended LF-STVG long-video datasets, the method significantly outperforms existing state-of-the-art approaches while remaining competitive on conventional short-form STVG. Conclusion: ART-STVG addresses the grounding difficulties caused by long temporal spans and abundant irrelevant content in long videos, supporting the effectiveness and generality of streaming modeling and cascaded spatio-temporal decoding for LF-STVG. Abstract: In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos.
Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.

[161] ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Ayush Roy,Wei-Yang Alex Lee,Rudrasis Chakraborty,Vishnu Suresh Lokhande

Main category: cs.CV

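TL;DR: This paper proposes Manifold-Guided Distillation (ManifoldGD), a training-free, diffusion-based dataset distillation method. It adds manifold-consistent guidance at every denoising step: hierarchical clustering yields multi-scale instance prototype centroids (IPCs), and their local tangent spaces constrain the generation trajectory, improving the representativeness, diversity, and image fidelity of the synthetic data.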

Details Motivation: Existing training-free, diffusion-based dataset distillation methods use simple guidance strategies (e.g., steering only toward IPC centroids) and do not model the intrinsic manifold structure of the data, which limits distillation quality. Method: The ManifoldGD framework 1) builds a multi-scale coreset of IPCs through hierarchical divisive clustering in the VAE latent space; 2) estimates the local latent manifold for each denoising step from a neighborhood of the IPCs; 3) projects the mode-alignment vector onto the local tangent space of that manifold, giving manifold-preserving guidance. Result: The method consistently outperforms existing training-free and training-based distillation baselines on FID, l2 distance between real and synthetic dataset embeddings, and classification accuracy. Conclusion: ManifoldGD is the first geometry-aware, training-free dataset distillation framework; explicitly modeling and exploiting the data manifold significantly improves distillation quality. Abstract: In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining.
Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
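The projection step ("project the mode-alignment vector onto the local tangent space") can be illustrated with a common estimator: take the top principal directions of a centred neighbourhood as the tangent basis, then project onto their span. The abstract does not state how the tangent space is estimated, so local PCA via SVD is an assumption, and the toy neighbourhood, the intrinsic dimension, and the function names are made up for the example.

```python
import numpy as np

def tangent_basis(neighbors, dim):
    """Estimate an orthonormal tangent basis by PCA (SVD) of centred neighbours."""
    centred = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:dim]                     # shape (dim, D), rows are orthonormal

def project_to_tangent(vector, basis):
    """Orthogonal projection of a vector onto the span of the basis rows."""
    return basis.T @ (basis @ vector)

# Toy neighbourhood: points scattered on the plane z = 0 inside R^3.
rng = np.random.default_rng(0)
neighbors = np.column_stack(
    [rng.normal(size=20), rng.normal(size=20), np.zeros(20)]
)
basis = tangent_basis(neighbors, dim=2)

guidance = np.array([1.0, 2.0, 3.0])    # stand-in for a mode-alignment vector
projected = project_to_tangent(guidance, basis)
```

Here the off-manifold component (the z direction) is removed and the in-plane part is kept, which is the sense in which the projected guidance stays manifold-faithful.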

[162] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Yiqing Wang,Chunming He,Ming-Chen Lu,Mercy Pawar,Leslie Niziol,Maria Woodward,Sina Farsiu

Main category: cs.CV

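TL;DR: PRIMA is a multimodal pre-training framework for medical images and clinical text that integrates risk information. It builds a risk-disease knowledge base with RAG, learns alignment with a dual encoder, and fuses features with Qwen-3 to improve diagnostic accuracy and robustness.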

Details Motivation: Existing methods treat clinical metadata as isolated tags and fail to exploit the rich semantic knowledge it carries; in particular, they do not model diagnostic priors such as risk-disease associations. Method: The PRIMA framework 1) uses RAG to build an expert corpus of risk-disease correlations and fine-tunes Clinical ModernBERT on it to embed diagnostic priors; 2) adopts a dual-encoder structure with DINOv3 and the refined BERT, trained with four complementary loss functions for multi-granular semantic alignment (including soft labels to handle clinical ambiguity); 3) uses Qwen-3 to fuse the aligned multimodal features for disease classification. Result: PRIMA significantly outperforms state-of-the-art methods and is more robust, without needing massive data collection or heavy compute. Conclusion: Explicitly integrating domain knowledge, especially risk-disease associations, into multimodal representation learning bridges the gap between pixel-level image features and abstract clinical expertise, offering a new paradigm for interpretable, robust AI-assisted diagnosis. Abstract: Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.

[163] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Yiran Guan,Sifan Tu,Dingkang Liang,Linghao Zhu,Jianzhong Ju,Zhenbo Luo,Jian Luan,Yuliang Liu,Xiang Bai

Main category: cs.CV

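TL;DR: This paper proposes ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. Its two components, LRM-as-a-Guide and Stepwise Contrastive Scaling, deliver substantial gains on multiple multimodal reasoning benchmarks.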

Details Motivation: Existing omni-modal large language models (OLLMs) perceive well but lack complex reasoning ability, and strengthening their reasoning through training is hampered by scarce high-quality data, difficult task adaptation, and high computational cost. Method: The ThinkOmni framework has two core components: 1) LRM-as-a-Guide, which uses off-the-shelf large reasoning models (LRMs) to guide OLLM decoding; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Result: The approach is validated on six multimodal reasoning benchmarks, with main results of 70.2 on MathVista and 75.5 on MMAU. Conclusion: ThinkOmni is a flexible, generalizable solution for omni-modal reasoning and offers new insights into transferring and applying reasoning capabilities. Abstract: Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

[164] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis,Vladan Stojnić,Bill Psomas,Nikos Komodakis,Giorgos Tolias

Main category: cs.CV

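TL;DR: This paper proposes a retrieval-augmented test-time adapter that combines text prompts with a small support set of pixel-annotated images to improve open-vocabulary segmentation (OVS), addressing two challenges: the coarse image-level supervision of VLMs and the semantic ambiguity of natural language.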

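Details Motivation: Open-vocabulary segmentation (OVS) is held back by vision-language models (VLMs) being trained with image-level supervision only and by the semantic ambiguity of natural language, so it lags behind fully supervised methods. Method: The paper introduces a few-shot setting that pairs text prompts with pixel-annotated support images, and proposes a retrieval-augmented test-time adapter that learns a lightweight classifier for each image at test time, using learned, per-query fusion of textual and visual features (rather than hand-crafted late fusion) so the modalities work together. Result: The method significantly narrows the gap between zero-shot and fully supervised segmentation while preserving open-vocabulary ability; it supports continually expanding support sets and applies to fine-grained tasks such as personalized segmentation. Conclusion: Dynamically fusing multimodal support information at test time eases the inherent limits of VLMs on pixel-level tasks and provides a more practical, scalable paradigm for open-vocabulary segmentation. Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.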

[165] Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training

Aheli Saha,René Schuster,Didier Stricker

Main category: cs.CV

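TL;DR: This paper analyzes in depth how the intrinsic parameters of event cameras affect object detection models trained on event data, and uses the findings to make downstream models more robust across different sensors.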

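Details Motivation: Because the output signals of event cameras are novel, the available data lacks variability, and the parameters that characterize these signals have not been analyzed in depth. Method: The paper studies how intrinsic event-camera parameters affect the performance of object detection models trained on event data, and uses the results to improve the sensor-agnostic robustness of the downstream model. Result: The work provides an in-depth understanding of how event-camera parameters influence performance and improves the downstream model's ability to adapt to different sensors. Conclusion: Intrinsic event-camera parameters have a significant effect on model performance, and exploiting this knowledge enables more robust, more general event-based object detection. Abstract: Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.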

[166] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal,Rishubh Parihar,Pradhaan Bhat,Ravi Kiran Sarvadevabhatla,R. Venkatesh Babu

Main category: cs.CV

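TL;DR: This paper proposes SeeThrough3D, a model that uses an occlusion-aware 3D scene representation (OSCR) and an occlusion-aware training strategy to generate multiple objects from a 3D layout with high quality, depth consistency, and camera control.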

Details Motivation: Existing 3D layout-guided generation methods struggle to model inter-object occlusions precisely, so their outputs lack depth consistency and realism. Method: Proposes an occlusion-aware 3D scene representation (OSCR) that depicts objects as translucent 3D boxes in a virtual environment; uses the rendered view to provide explicit camera control; encodes the OSCR rendering as visual tokens that condition a pretrained flow-matching text-to-image model; and introduces masked self-attention to bind text to objects. Result: After training on a synthetic occlusion dataset, SeeThrough3D generalizes to unseen object categories and supports precise 3D layout control, realistic occlusion modeling, and generation with consistent camera viewpoints. Conclusion: Occlusion reasoning is a key part of 3D layout-conditioned generation; combining OSCR with a conditioned text-to-image framework significantly improves the geometric plausibility and visual realism of the results. Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
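
The abstract says masked self-attention binds each object box to its own textual description. A minimal sketch of what such a mask could look like is given below; the token layout (text tokens followed by one box token per object), the exact masking rule, and the names `build_binding_mask` and `masked_softmax` are illustrative assumptions, not details taken from the paper.

```python
import math

def build_binding_mask(text_spans, num_text_tokens):
    """Return allowed[q][k] for text tokens followed by one box token per object.

    text_spans[i] = (start, end) is the half-open range of text tokens that
    describe object i. Assumed rule (not from the paper): text tokens see all
    text; box i and its description see each other; a box token never sees
    another object's description or another box.
    """
    num_boxes = len(text_spans)
    n = num_text_tokens + num_boxes
    allowed = [[False] * n for _ in range(n)]
    for q in range(num_text_tokens):
        for k in range(num_text_tokens):
            allowed[q][k] = True
    for i, (start, end) in enumerate(text_spans):
        box = num_text_tokens + i
        allowed[box][box] = True
        for t in range(start, end):
            allowed[box][t] = True
            allowed[t][box] = True
    return allowed

def masked_softmax(scores, allowed_row):
    """Softmax over one row of scores; disallowed keys get exactly zero weight."""
    kept = [s for s, ok in zip(scores, allowed_row) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical prompt with 5 text tokens: object 0 is described by tokens 0..2,
# object 1 by tokens 3..4. Box tokens then sit at positions 5 and 6.
mask = build_binding_mask([(0, 3), (3, 5)], num_text_tokens=5)
weights = masked_softmax([0.0] * 7, mask[5])  # attention row of the first box token
```

With uniform scores, the first box token spreads its attention over its own three description tokens and itself, and gives zero weight to the second object's tokens, which is the kind of separation that would prevent attribute mixing.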

[167] VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein,Ruilong Li,Sérgio Agostinho,Zan Gojcic,Laura Leal-Taixé,Qunjie Zhou,Aljosa Osep

Main category: cs.CV

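TL;DR: This paper proposes VGG-T³, which uses test-time training to distill the variable-length KV space into a fixed-size MLP, achieving 3D reconstruction with linear time complexity that is markedly faster while remaining highly accurate.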

Details Motivation: Addresses a key bottleneck of offline feed-forward 3D reconstruction methods: compute and memory costs grow quadratically with the number of input images. Method: Starting from the variable-length Key-Value (KV) representation of scene geometry, uses test-time training to distill it into a fixed-size multi-layer perceptron (MLP), yielding the VGG-T³ model. Result: Reconstructs a 1k-image collection in just 54 seconds, 11.6× faster than softmax-attention baselines; its point-map reconstruction error improves on other linear-time methods by large margins, and the model can visually localize unseen images. Conclusion: VGG-T³ combines linear scalability with global scene aggregation, delivering significant gains in both efficiency and accuracy. Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction error improves on other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
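
The core idea in the abstract, replacing a variable-length set of key-value pairs with a fixed-size parametric map fitted at test time, can be illustrated with a toy sketch. To stay short and dependency-free, the code below fits a single linear map rather than an MLP, and uses plain per-sample gradient descent on a squared error; both choices, along with the names `distill_pairs` and `matvec` and the synthetic data, are assumptions made for illustration and are not the paper's procedure.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distill_pairs(keys, values, epochs=50, lr=0.1):
    """Fit a fixed-size map W so that W @ k approximates v for every (k, v) pair.

    The size of W depends only on the key and value dimensions, never on how
    many pairs were seen. A linear map stands in for the MLP here.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for _ in range(epochs):
        for k, v in zip(keys, values):
            pred = matvec(W, k)
            for i in range(d_v):
                err = pred[i] - v[i]
                for j in range(d_k):
                    W[i][j] -= lr * err * k[j]  # gradient step on 0.5 * err**2
    return W

# Synthetic "scene": values are an unknown linear function A of the keys.
rng = random.Random(0)
A = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
keys = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
values = [matvec(A, k) for k in keys]

W = distill_pairs(keys, values)

# Querying costs the same no matter how many pairs were distilled.
query = [0.2, -0.4, 0.9]
approx = matvec(W, query)
exact = matvec(A, query)
```

The point of the sketch is the cost profile: fitting touches each pair a constant number of times, and a query afterwards only evaluates the fixed-size map, whereas softmax attention over the raw pairs would compare every query against every stored key.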

[168] MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly,Mohammed Irfan Kurpath,Omair Mohamed,Mohamed Zidan,Fahad Khan,Salman Khan,Rao Anwer,Hisham Cholakkal

Main category: cs.CV

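TL;DR: MediX-R1 is an open-ended reinforcement learning framework for medical multimodal large language models. It improves clinical reasoning through Group Based RL with a composite reward (LLM-judged accuracy, medical semantic, format, and modality rewards), proposes a reference-based LLM-as-judge evaluation method, and achieves significant gains on both text-only and image+text tasks.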

Details Motivation: Existing evaluation of medical large models relies mostly on multiple-choice questions or string matching, which cannot support the open-ended, free-form answers needed in clinical practice; more robust, semantics-aware training and evaluation paradigms are needed. Method: Proposes the MediX-R1 framework: fine-tunes a vision-language backbone with Group Based RL; designs a four-part composite reward consisting of an LLM-judged accuracy reward (YES/NO), a medical-embedding semantic reward (handling paraphrases and terminology variants), a lightweight format reward, and a modality-recognition reward; and builds a unified Reference-based LLM-as-judge evaluation framework to replace traditional overlap metrics. Result: With only about 51K instruction examples, MediX-R1 outperforms strong open-source baselines across standard medical LLM (text-only) and VLM (image + text) benchmarks, with especially large gains on open-ended clinical tasks; this supports the effectiveness of multi-signal RL and LLM-based evaluation for reliable medical reasoning. Conclusion: Open-ended reinforcement learning combined with a multi-dimensional reward design and LLM-driven semantic evaluation is a feasible and efficient path toward trustworthy, interpretable medical multimodal reasoning. Abstract: We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models.
Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
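
The composite reward described in the abstract combines four signals: a YES/NO accuracy judgment, an embedding-based semantic score, and format and modality checks. A minimal sketch of one way to combine them follows. The weights, the weighted-sum form, and the group-normalized advantage step are all assumptions for illustration; the abstract states neither how the signals are weighted nor the exact form of the Group Based RL update, and the function names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract does not say how the four signals are combined.
WEIGHTS = {"accuracy": 0.6, "semantic": 0.2, "format": 0.1, "modality": 0.1}

def composite_reward(judge_says_yes, semantic_similarity, format_ok, modality_ok):
    """Weighted sum of the four reward signals, each mapped into [0, 1]."""
    sim = min(1.0, max(0.0, semantic_similarity))  # clamp the embedding score
    return (WEIGHTS["accuracy"] * float(judge_says_yes)
            + WEIGHTS["semantic"] * sim
            + WEIGHTS["format"] * float(format_ok)
            + WEIGHTS["modality"] * float(modality_ok))

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers (assumed GRPO-style step)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three sampled answers to the same question, scored with made-up signal values.
group = [
    composite_reward(True, 0.9, True, True),
    composite_reward(False, 0.4, True, True),
    composite_reward(False, 0.1, False, True),
]
advantages = group_advantages(group)
```

Under these assumed weights, an answer the judge rejects can still earn partial credit for being semantically close and well formatted, so the answers within a group are ranked rather than all collapsing to zero.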